hermitdave / frequencywords

Repository for Frequency Word List Generator and processed files

License: MIT License

C# 97.43% JavaScript 2.57%

frequencywords's Introduction

FrequencyWords


In the early days, I hosted the generated files on OneDrive, with my blog https://invokeit.wordpress.com/frequency-word-lists/ linking to them. Going forward, the code and the generated outputs are on GitHub.

OpenSubtitle tokenized source

The data used to generate the 2016 lists can be found at http://opus.lingfil.uu.se/OpenSubtitles2016.php
The data used to generate the 2018 lists can be found at http://opus.nlpl.eu/OpenSubtitles2018.php

Format

Frequency lists use the format {word}{space}{number_of_occurrences_in_corpus}. For example, in the file en_50k.txt:

you 22484400
i 19975318
the 17594291
to 13200962
...
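As a minimal sketch of reading this format, the Python snippet below parses `{word} {count}` lines into a dictionary and derives relative frequencies. The function name `parse_frequency_lines` is made up for illustration and is not part of this repository; the sample lines are the four shown above.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_frequency_lines(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (word, count) pairs from '{word} {count}' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last space so the count is always the final field.
        word, _, count = line.rpartition(" ")
        if not word or not count.isdecimal():
            continue  # skip lines that do not match the format
        yield word, int(count)


# The four sample lines quoted from en_50k.txt above.
sample = ["you 22484400", "i 19975318", "the 17594291", "to 13200962"]
frequencies: Dict[str, int] = dict(parse_frequency_lines(sample))

# Relative frequency within this (tiny) sample, not within the whole corpus.
total = sum(frequencies.values())
relative = {word: count / total for word, count in frequencies.items()}
```

For a real file, the same function can be fed an open file handle, for example `open("en_50k.txt", encoding="utf-8")`.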

Usages

These data are reused by various widely used open-source projects, including Wikipedia, input methods, and autocomplete keyboards.

License

MIT License for code.
CC-by-sa-4.0 for content.


frequencywords's Issues

Typo: Indonesian Top 202 "THE"

I believe "THE" listed on the top 202 should be "TEH", meaning "tea".
https://raw.githubusercontent.com/hermitdave/FrequencyWords/master/content/2018/id/id_full.txt
You'll find "THE" between "boleh" and "masara".

The most prominent Indonesian dictionary is KBBI issued by the government. Even KBBI doesn't register "THE" as an Indonesian word.
https://kbbi.kemdikbud.go.id/entri/the

When I work in Microsoft Excel, for example, "TEH" (tea) is always auto-corrected to "THE".

Wordlists without 50k words

The following language files are named with "_50k" appended to the filename, but do not contain 50k words:

10282 content/2016/af/af_50k.txt
 2350 content/2016/bn/bn_50k.txt
 7131 content/2016/br/br_50k.txt
31631 content/2016/eo/eo_50k.txt
 2968 content/2016/hi/hi_50k.txt
 1972 content/2016/hy/hy_50k.txt
 3402 content/2016/kk/kk_50k.txt
 2604 content/2016/ml/ml_50k.txt
 7504 content/2016/si/si_50k.txt
  927 content/2016/ta/ta_50k.txt
 1402 content/2016/te/te_50k.txt
 6033 content/2016/tl/tl_50k.txt
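A check like the one behind this table could be sketched in Python as follows. This is an illustration only: the function name `short_50k_lists` and the assumption that every non-empty line is one entry are mine, not something the repository states.

```python
from pathlib import Path
from typing import Dict

EXPECTED_ENTRIES = 50_000


def short_50k_lists(content_root: Path) -> Dict[str, int]:
    """Map each *_50k.txt file with fewer than 50,000 entries to its entry count."""
    short: Dict[str, int] = {}
    for path in sorted(content_root.rglob("*_50k.txt")):
        with path.open(encoding="utf-8") as handle:
            # Assumption: one entry per non-empty line.
            entries = sum(1 for line in handle if line.strip())
        if entries < EXPECTED_ENTRIES:
            short[path.relative_to(content_root).as_posix()] = entries
    return short


# Example call, assuming a local checkout with a "content" directory:
# for name, entries in short_50k_lists(Path("content")).items():
#     print(f"{entries:>6} {name}")
```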

reference

Hi,
I used your frequency list for a master thesis in psychology.
How can I reference you? For now, I have the following:
"Now, the words of the LibreOffice English dictionary were matched with their respective frequencies extracted from a frequency list developed by Hermit (2016)."

Best regards,
Koen

Punctuation marks are not ignored in Urdu

Disclaimer

⚠️ I don't speak Urdu at all, so please don't take what I say at face value; ask a real Urdu speaker ⚠️

Issue

There seem to be some characters that should be ignored in Urdu:
https://github.com/hermitdave/FrequencyWords/blob/master/content/2018/ur/ur_full.txt

See lines 3 and 5

As someone who doesn't speak Urdu, what makes me think these are punctuation rather than words is the characters' names:
ARABIC COMMA and ARABIC FULL STOP

Potential fix

If you can confirm that this is a real issue and not just my lack of knowledge of the language, my fix would be to add

،

and

۔

in the ignored characters list
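As a rough Python sketch of an alternative to maintaining the list by hand, entries made up entirely of Unicode punctuation can be dropped by checking each character's general category. This is not the generator's actual code, the helper name `is_punctuation_only` is hypothetical, and the counts in the sample are invented for illustration.

```python
import unicodedata
from typing import List, Tuple


def is_punctuation_only(token: str) -> bool:
    """True when every character is in a Unicode punctuation category (P*)."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


# Illustrative entries; the counts are made up.
entries: List[Tuple[str, int]] = [
    ("کیا", 10),
    ("\u060c", 9),  # ARABIC COMMA
    ("ہے", 8),
    ("\u06d4", 7),  # ARABIC FULL STOP
]

filtered = [(word, count) for word, count in entries if not is_punctuation_only(word)]
```

Both characters named in this issue fall in a punctuation category, so a category check would also catch similar marks in other scripts without listing each one.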

Map words to word type

Currently the entries are just "WORD OCCURRENCECOUNT".

I think it would be highly useful for many people to have "WORD OCCURRENCECOUNT TYPE", where TYPE specifies the word type. The word type should follow the tag convention used in natural language processing: NN = noun, VB = verb, JJ = adjective, ...

I am in the process of doing this; the Stanford tagger in combination with the NLTK module seems to be the most usable option. I am having installation troubles at the moment.
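The proposed output format could be sketched in Python like this. The tagger is passed in as a plain function so that any real tagger can be plugged in later; `toy_tagger`, its lookup table, the `UNK` placeholder, and the counts are all hypothetical and only stand in for a real part-of-speech tagger.

```python
from typing import Callable, Iterable, List, Tuple


def format_tagged(
    entries: Iterable[Tuple[str, int]], tag_word: Callable[[str], str]
) -> List[str]:
    """Render entries as 'WORD OCCURRENCECOUNT TYPE' lines."""
    return [f"{word} {count} {tag_word(word)}" for word, count in entries]


# Hypothetical stand-in for a real tagger: a tiny lookup table.
TOY_TAGS = {"house": "NN", "run": "VB", "green": "JJ"}


def toy_tagger(word: str) -> str:
    # "UNK" is a made-up placeholder for words the table does not know.
    return TOY_TAGS.get(word, "UNK")


# Counts are invented for illustration.
lines = format_tagged(
    [("house", 120), ("run", 95), ("green", 40), ("xyzzy", 1)], toy_tagger
)
```

Note that tagging isolated words is inherently ambiguous ("run" can be a noun or a verb), so a single TYPE per word is a simplification.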

German words

For German words, it would be really beneficial if they were written properly, i.e. with nouns capitalized.
So not "freund" but "Freund".

This would allow this list to be used for spellchecking.

Russian words in Ukrainian files

The Ukrainian files contain Russian-only words, i.e. words that do not exist in the Ukrainian language.

The simplest first-order filter is to ignore words containing the letters ё, ъ, ы, э.
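The letter-based filter suggested here could look like the following Python sketch. The function name `looks_russian_only` and the sample words are mine, chosen for illustration; as the issue says, this is only a first-order heuristic and will miss Russian words that happen to use none of these letters.

```python
import unicodedata
from typing import List

# Letters named in this issue as absent from the Ukrainian alphabet.
RUSSIAN_ONLY_LETTERS = set("ёъыэ")


def looks_russian_only(word: str) -> bool:
    """True when the word contains a letter not used in Ukrainian."""
    # NFC makes sure "ё" is a single code point rather than "е" plus a diacritic.
    normalized = unicodedata.normalize("NFC", word).lower()
    return any(ch in RUSSIAN_ONLY_LETTERS for ch in normalized)


words: List[str] = ["привіт", "это", "мы", "їжа", "объект"]
kept = [word for word in words if not looks_russian_only(word)]
```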

Apostrophe was treated as a word boundary in "en"

Looking at the "en" list, you see words like don and 't.
The issue presents a bit differently in 2016 and 2018 but it exists in both of them.
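To illustrate the effect, the Python sketch below contrasts a pattern that treats the apostrophe as a boundary with one that keeps it inside a word. Neither pattern is taken from the generator or from the OpenSubtitles tokenizer, and the exact fragments a real tokenizer produces may differ from the ones here.

```python
import re

# Treats anything that is not a-z as a boundary, including the apostrophe.
SPLIT_ON_APOSTROPHE = re.compile(r"[a-z]+")

# Allows apostrophes between runs of letters, so contractions stay whole.
KEEP_APOSTROPHE = re.compile(r"[a-z]+(?:'[a-z]+)*")

text = "i don't know what he's doing"

split_tokens = SPLIT_ON_APOSTROPHE.findall(text)
kept_tokens = KEEP_APOSTROPHE.findall(text)
```

With the second pattern, `don't` and `he's` are counted as single words instead of contributing to the counts of their fragments.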

CC-by-sa-4.0 appears to be inadequate

All of the terms of the CC licence revolve around "copy and redistribute the material"
There is nothing that grants anyone the ability to simply use the "material" (i.e. the word data).
For example, store a word list in a database in order to make decisions based on the popularity of words entered by a user.
In that case the data is not actually shared. Some aspect of a display might change, but the "material" is not shared.

Missing "'" sign

In Ukrainian (and Russian, Bulgarian) there are plenty of words with a "'" sign in them. I believe it is a completely different character than the Latin "'". It's not like English, where you can drop the "'" and the word will still make sense ("he's" becomes "he"). For example, the Ukrainian word for "computer" is "комп'ютер", and "комп" does not mean anything on its own. There are hundreds of words like that.

https://en.wikipedia.org/wiki/Ukrainian_alphabet#Letter_names_and_pronunciation

Can anyone change that and rerun these words calculations for Ukrainian?
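One possible way to keep such words intact, sketched in Python under my own assumptions: before tokenizing, replace an apostrophe that sits between two letters with U+02BC (MODIFIER LETTER APOSTROPHE), which Unicode classifies as a letter, so a generic `\w+` tokenizer no longer splits there. The assumption that the source text uses U+0027 or U+2019 is not verified against the corpus, and `protect_apostrophes` is a hypothetical helper.

```python
import re
from typing import List

# An apostrophe (U+0027 or U+2019) with a letter on both sides.
APOSTROPHE_BETWEEN_LETTERS = re.compile(r"(?<=[^\W\d_])['\u2019](?=[^\W\d_])")


def protect_apostrophes(text: str) -> str:
    """Replace in-word apostrophes with U+02BC, which counts as a letter."""
    return APOSTROPHE_BETWEEN_LETTERS.sub("\u02bc", text)


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text)


sentence = "це комп'ютер"
naive_tokens = tokenize(sentence)  # splits the word at the apostrophe
protected_tokens = tokenize(protect_apostrophes(sentence))  # keeps it whole
```

Whether the output lists should keep U+02BC or map it back to another apostrophe form is a separate choice for the maintainers.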

Add "Install" and "Run" sections in README.md

Use a Wikipedia-compatible open license for the data

MIT license permits reuse [...] provided that all copies of the licensed software include a copy of the MIT License terms.

MIT license is a "non copyleft permissive license" which is adapted for software but not for data.

The data would be better off under a more suitable license...
Preferably Public domain.
CC-by-sa would do as well : )

Is the list compiled from the parallel subset of the corpus?

Hi @hermitdave ,
first, thanks for putting this list together! I just wanted to ask whether these words were collected from the whole database in OpenSubtitle, or only for the parallel subset. Concretely, I am working with the English and the Hebrew corpora, and it would be useful for me to know whether they were collected from the same movies.
Thanks!

Raquel

Generate Dataset for OpenSubtitles 2018

Upgrade license to CC-BY-SA 4.0

Hi @hermitdave,

Thanks for putting together these frequency lists and making them available on GitHub! We'd like to use the frequency words lists on your site for our software project. Specifically, we'd like to use them as stop words in our text analysis.

Our software project is an open-source project licensed under GPL. You can find it here: https://github.com/Yoast/javascript.

Unfortunately, the CC-BY-SA 3.0 license used for the data in your project isn't compatible with GPL. Would you perhaps consider upgrading the license of your data to a CC-BY-SA 4.0? That license is compatible with GPL, see https://creativecommons.org/share-your-work/licensing-considerations/compatible-licenses/. Alternatively, a GPL or LGPL license would also work for us.

Changing the license to CC-BY-SA 4.0 would open up the data for wider usage in open source software development, so I'm sure many others would benefit from it as well.

Best,
Manuel