In a normal dictionary, the words are arranged in alphabetical order. But we can also arrange the words according to their frequency of occurrence. That is, the most used words appear first and less used words follow. Then each word has a rank and a frequency. If we multiply rank and frequency, we get a number. That number is the same (more or less) for all the words, so it is a constant. That is Zipf's law.
[In the above paragraph, 'the' appears five times in 77 words. Hence its frequency percentage is 5 * 100 / 77 = 6.5%.]
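The counting described above can be sketched in a few lines of Python. This is only an illustration: the function name word_frequencies and the sample sentence are made up for this example, and the way words are split (lowercase letters and apostrophes) is an assumption.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return (rank, word, count, percentage) rows, most frequent word first."""
    # Assumption: a "word" is a run of letters or apostrophes, case ignored.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    counts = Counter(words)
    rows = []
    for rank, (word, count) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, count, 100.0 * count / total))
    return rows

# Hypothetical sample text, far too short to show Zipf's law properly.
sample = "the cat sat on the mat and the dog sat on the rug"
for rank, word, count, pct in word_frequencies(sample)[:3]:
    print(rank, word, count, round(pct, 1), "rank * percentage =", round(rank * pct, 1))
```

A short sentence like this only shows the mechanics. The rank times frequency product settles down only for a large body of text.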
Frequency percentage * rank = constant.
The most used word in English is 'the'. It occurs about 7 times in every 100 words, so its frequency percentage is about 7%. A list is given below.
Rank R   Word   Frequency F (%)   Constant C = F * R
1        the    6.8               1 * 6.8 = 6.8
2        of     3.1               2 * 3.1 = 6.2
3        to     2.7               3 * 2.7 = 8.1
4        and    2.6               4 * 2.6 = 10.4
5        in     1.8               5 * 1.8 = 9.0
6        is     1.2               6 * 1.2 = 7.2
7        for    1.0               7 * 1.0 = 7.0
8        that   0.8               8 * 0.8 = 6.4
The constant of proportionality for the English language is about 7.5%, or 0.075.
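As a quick check of the list above, here is a small Python sketch that multiplies each rank by its frequency percentage and averages the results. The variable names are made up for this example. The numbers are the eight rows given in the list.

```python
# (word, frequency percentage) in rank order, copied from the list above.
table = [
    ("the", 6.8), ("of", 3.1), ("to", 2.7), ("and", 2.6),
    ("in", 1.8), ("is", 1.2), ("for", 1.0), ("that", 0.8),
]

# Rank times frequency for every row.
products = [rank * freq for rank, (word, freq) in enumerate(table, start=1)]

# The average of the eight products, roughly 7.6.
mean_c = sum(products) / len(products)

for (word, freq), c in zip(table, products):
    print(word, freq, round(c, 1))
print("average constant:", round(mean_c, 2))
```

The products range from 6.2 to 10.4, so "constant" is meant loosely. The average of these eight rows is close to the 7.5% quoted in the text.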
In a nutshell: we can have an English frequency dictionary. The frequency of a word is inversely proportional to its rank. The constant of proportionality is about 0.075.
Coming to our title question: let us find out how much of a typical English text is made up of the top 1000 words, that is, the 1000 most commonly used words.
We have to note down the frequency percentage of each word (ranked 1 to 1000) from the frequency dictionary and add up all the percentages. That gives the total percentage of English text that is written using the top 1000 words.
There is a mathematical short-cut. If you are allergic to math, you can skip this portion.
Frequency * rank = constant
frequency = constant / rank
So the combined frequency of the first 1000 words
= 0.075/1 + 0.075/2 + ... + 0.075/1000
= 0.075 (1/1 + 1/2 + 1/3 + ... + 1/1000)
= 0.075 (ln(1000) + 0.58)   [math formula: 1/1 + 1/2 + ... + 1/n is approximately ln(n) + 0.58, where ln is the natural logarithm]
= 0.075 * 7.49
= 0.56
= 56%
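For readers who would rather check the short-cut than trust it, the sum can simply be added up term by term. The sketch below assumes the same constant 0.075 and the natural logarithm. The function names are made up for this example.

```python
import math

C = 0.075  # constant of proportionality taken from the text

def coverage_exact(n, c=C):
    """Add up c/1 + c/2 + ... + c/n directly."""
    return c * sum(1.0 / rank for rank in range(1, n + 1))

def coverage_shortcut(n, c=C):
    """Short-cut: 1/1 + 1/2 + ... + 1/n is about ln(n) + 0.5772."""
    return c * (math.log(n) + 0.5772)

exact = coverage_exact(1000)
shortcut = coverage_shortcut(1000)
print(round(exact, 2), round(shortcut, 2))  # both about 0.56
```

The two values agree to three decimal places, so the short-cut is safe to use here.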
That is, more than half of the English text that we write or read is made up of only 1000 words. A layman or a learned man may mostly use about 3000 words. Even Shakespeare is said to have known only around 30 thousand words. But the mighty English language has about 300 thousand words. How little we know!
Zipf's law also holds good when ordering:
1. Companies by staff.
2. Universities by number of students.
3. Languages by number of speakers.
4. Websites by hits.
5. Cities by population.
6. Countries by area.
and so on.
FOOT NOTE:
Amazon e-books can be ranked according to daily sales, and Zipf's law can be applied to them. Even though there are millions of e-books, only the top few hundred books make up more than 50% of daily sales.
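The same kind of sum gives a rough feel for this. The Python sketch below assumes that sales follow Zipf's law exactly (sales proportional to 1/rank) and uses a made-up catalogue size of one million books. Neither assumption comes from real sales data, and the function names are invented for this example.

```python
def harmonic(n):
    """Return 1/1 + 1/2 + ... + 1/n."""
    return sum(1.0 / rank for rank in range(1, n + 1))

def top_share(k, n):
    """Share of the total taken by the top k items out of n, under exact 1/rank sales."""
    return harmonic(k) / harmonic(n)

# Hypothetical catalogue of one million books.
catalogue = 1_000_000
for k in (100, 500, 1000):
    print(k, round(top_share(k, catalogue), 2))
```

Under these assumptions, a tiny fraction of the catalogue accounts for around half of the total, which is the point the foot note makes.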