frequencywords's Introduction

FrequencyWords

Repository for Frequency Word List Generator and processed files

In the early days, I hosted the generated files on OneDrive, with my blog https://invokeit.wordpress.com/frequency-word-lists/ linking to them. Going forward, the code and the generated outputs are hosted on GitHub.

OpenSubtitle tokenized source

The data used to generate the 2016 lists can be found at http://opus.lingfil.uu.se/OpenSubtitles2016.php
The data used to generate the 2018 lists can be found at http://opus.nlpl.eu/OpenSubtitles2018.php

Format

Frequency lists use the format {word}{space}{number_of_occurrences_in_corpus}. For example, in the file en_50k.txt:

you 22484400
i 19975318
the 17594291
to 13200962
...
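A minimal sketch of reading this format in Python, assuming one entry per line with the count as the last space-separated field. The function name `parse_frequency_lines` is made up for illustration and is not part of the repository's generator.

```python
from typing import Iterable, Iterator, Tuple


def parse_frequency_lines(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (word, count) pairs from lines shaped like 'word 12345'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last space so the count is always the final field.
        word, _, count = line.rpartition(" ")
        if not word or not count.isdecimal():
            continue  # skip lines that do not match the expected format
        yield word, int(count)


# The four entries quoted above from en_50k.txt.
sample = ["you 22484400", "i 19975318", "the 17594291", "to 13200962"]
entries = list(parse_frequency_lines(sample))

# Relative share of each word within this four-line sample only,
# not within the whole corpus.
total = sum(count for _, count in entries)
shares = {word: count / total for word, count in entries}
```

To read a real file, the same function can be fed an open file handle, for example `parse_frequency_lines(open("en_50k.txt", encoding="utf-8"))`.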

Usage

This data is reused by various widely used open-source projects, including Wikipedia, input methods, and autocomplete keyboards.

License

MIT License for code.
CC-by-sa-4.0 for content.

frequencywords's People

Contributors

dandv, hermitdave, hugolpz, jd-imi, ziaenezhad

frequencywords's Issues

Typo: Indonesian Top 202 "THE"

I believe "THE" listed on the top 202 should be "TEH", meaning "tea".
https://raw.githubusercontent.com/hermitdave/FrequencyWords/master/content/2018/id/id_full.txt
You'll find "THE" between "boleh" and "masara".

The most prominent Indonesian dictionary is KBBI issued by the government. Even KBBI doesn't register "THE" as an Indonesian word.
https://kbbi.kemdikbud.go.id/entri/the

When I work on Microsoft Excel, for example, "TEH" (tea) is always auto-corrected as "THE".

Wordlists without 50k words

The following language files are named with "_50k" appended to the filename, but do not contain 50k words:

10282 content/2016/af/af_50k.txt
 2350 content/2016/bn/bn_50k.txt
 7131 content/2016/br/br_50k.txt
31631 content/2016/eo/eo_50k.txt
 2968 content/2016/hi/hi_50k.txt
 1972 content/2016/hy/hy_50k.txt
 3402 content/2016/kk/kk_50k.txt
 2604 content/2016/ml/ml_50k.txt
 7504 content/2016/si/si_50k.txt
  927 content/2016/ta/ta_50k.txt
 1402 content/2016/te/te_50k.txt
 6033 content/2016/tl/tl_50k.txt
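A listing like the one above could be reproduced with a short script. This is only a sketch: it assumes a local checkout, counts non-empty lines as words, and the function name `short_wordlists` is hypothetical.

```python
from pathlib import Path
from typing import Dict


def short_wordlists(root: Path, expected: int = 50_000) -> Dict[str, int]:
    """Map each *_50k.txt file under root that has fewer than `expected`
    non-empty lines to its line count (paths are relative to root)."""
    result: Dict[str, int] = {}
    for path in sorted(root.rglob("*_50k.txt")):
        with path.open(encoding="utf-8") as handle:
            count = sum(1 for line in handle if line.strip())
        if count < expected:
            result[path.relative_to(root).as_posix()] = count
    return result
```

For example, `short_wordlists(Path("content/2016"))` would scan the 2016 lists of a local clone.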

reference

Hi,
I used your frequency list for a master thesis in psychology.
How can I reference you? For now, I have the following:
"Now, the words of the LibreOffice English dictionary were matched with their respective frequencies extracted from a frequency list developed by Hermit (2016)."

Best regards,
Koen

Punctuation marks are not ignored in Urdu

Disclaimer

⚠️ I don't speak Urdu at all, so please don't take what I say at face value and ask a real Urdu speaker ⚠️

Issue

There seem to be some characters that should be ignored in Urdu:
https://github.com/hermitdave/FrequencyWords/blob/master/content/2018/ur/ur_full.txt

See lines 3 and 5

As someone who doesn't speak Urdu, what makes me think these are punctuation rather than words is the characters' names:
ARABIC COMMA and ARABIC FULL STOP

Potential fix

If you can confirm that this is a real issue and not just my lack of knowledge of the language, my fix would be to add

،

and

۔

to the list of ignored characters
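As an alternative to listing characters one by one, tokens made up only of punctuation could be dropped by Unicode category. The sketch below is not the generator's code; the helper name `is_punctuation_only` is made up, and it relies on Python's standard `unicodedata` module, where both characters named above fall in a punctuation category.

```python
import unicodedata


def is_punctuation_only(token: str) -> bool:
    """True if every character of the token is in a Unicode punctuation
    category (any category starting with 'P')."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


# U+060C is ARABIC COMMA and U+06D4 is ARABIC FULL STOP.
tokens = ["\u060c", "\u06d4", "hello", "..."]
kept = [t for t in tokens if not is_punctuation_only(t)]
```

A category check also catches punctuation from other scripts, though a native speaker should still review what gets removed.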

Map words to word type

Currently, each entry is just "WORD OCCURRENCECOUNT".

I think it would be highly useful for many people to have "WORD OCCURRENCECOUNT TYPE", where TYPE specifies the word type (part of speech). The type should follow the tag convention used in natural language processing: NN = noun, VB = verb, JJ = adjective, ...

I am in the process of doing this; the Stanford tagger in combination with the NLTK module seems to be the most usable option. I am having installation troubles at the moment.
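The proposed three-column format could be produced along these lines. This is a sketch only: `add_word_type` and `toy_tagger` are hypothetical names, and the toy lookup table stands in for a real tagger such as the Stanford tagger mentioned above, which is not called here.

```python
from typing import Callable, Iterable, Iterator


def add_word_type(
    lines: Iterable[str], tagger: Callable[[str], str]
) -> Iterator[str]:
    """Turn 'WORD COUNT' lines into 'WORD COUNT TYPE' lines."""
    for line in lines:
        word, _, count = line.strip().rpartition(" ")
        if not word:
            continue  # skip blank or malformed lines
        yield f"{word} {count} {tagger(word)}"


# Hypothetical stand-in for a real part-of-speech tagger.
# PRP, DT and TO are Penn Treebank tags; "UNK" is a placeholder
# invented here for words missing from the toy table.
TOY_TAGS = {"you": "PRP", "the": "DT", "to": "TO"}


def toy_tagger(word: str) -> str:
    return TOY_TAGS.get(word, "UNK")


tagged = list(add_word_type(["you 22484400", "the 17594291"], toy_tagger))
```

One caveat: taggers normally use sentence context, so tagging isolated words from a frequency list may give less reliable types than tagging the corpus before counting.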

German words

For German words, it would be really beneficial if they were written properly, i.e. with nouns capitalized.
So not "freund" but "Freund".

This would allow this list to be used for spellchecking.

Russian words in Ukrainian files

Ukrainian files contain Russian-only words (i.e. words that do not exist in the Ukrainian language).

The simplest first-order filter is to ignore words with letters ё, ъ, ы, э
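That first-order filter could look like the sketch below. The names are made up for illustration, and the sample words are only there to exercise the check.

```python
import unicodedata

# Letters named in the issue as absent from the Ukrainian alphabet.
RUSSIAN_ONLY_LETTERS = set("ёъыэ")


def has_russian_only_letter(word: str) -> bool:
    """True if the word contains any of the letters ё, ъ, ы, э."""
    # NFC normalization makes a decomposed 'е' + combining diaeresis
    # compare equal to the single code point 'ё'.
    normalized = unicodedata.normalize("NFC", word).lower()
    return any(ch in RUSSIAN_ONLY_LETTERS for ch in normalized)


words = ["привіт", "это", "был", "комп'ютер", "объект"]
kept = [w for w in words if not has_russian_only_letter(w)]
```

This only removes words that contain one of the four letters; Russian words spelled entirely with letters shared by both alphabets would still pass.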

CC-by-sa-4.0 appears to be inadequate

All of the terms of the CC licence revolve around "copy and redistribute the material".
There is nothing that grants anyone the ability to simply use the "material" (i.e. the word data).
For example, someone might store a word list in a database in order to make decisions based on the popularity of words entered by a user.
In that case the data is not actually shared. Some aspect of a display might change, but the "material" is not shared.

Missing "'" sign

In Ukrainian (and Russian, Bulgarian) there are plenty of words containing the "'" sign. I believe it is a completely different character than the Latin "'". It's not like in English, where you can drop the "'" and the word still makes sense ("he's" becomes "he"). For example, the Ukrainian word for "computer" is "комп'ютер", and "комп" does not mean anything on its own. There are hundreds of words like that.

https://en.wikipedia.org/wiki/Ukrainian_alphabet#Letter_names_and_pronunciation

Can anyone change that and rerun these words calculations for Ukrainian?
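One way to keep such words whole is to allow an apostrophe between letters during tokenization. This is a sketch, not the generator's actual tokenizer: the names are hypothetical, and mapping U+2019 and U+02BC to the ASCII apostrophe is an arbitrary choice made here so that spelling variants are counted together.

```python
import re

# Map the right single quotation mark (U+2019) and the modifier letter
# apostrophe (U+02BC) to the ASCII apostrophe (assumed normalization).
APOSTROPHE_MAP = str.maketrans({"\u2019": "'", "\u02bc": "'"})

# A run of letters, optionally continued by apostrophe + letters.
# [^\W\d_] matches any Unicode letter in Python's re module.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def tokenize(text: str) -> list:
    """Split text into lowercase words, keeping word-internal apostrophes."""
    normalized = text.translate(APOSTROPHE_MAP)
    return [m.group(0).lower() for m in WORD_RE.finditer(normalized)]


tokens = tokenize("Комп\u2019ютер і he's")
```

Because the apostrophe must be followed by letters, a trailing one (as in "dogs'") is dropped rather than kept as part of the word.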

Use a Wikipedia-compatible open license for the data

MIT license permits reuse [...] provided that all copies of the licensed software include a copy of the MIT License terms.

The MIT license is a "non-copyleft permissive license" that is suited to software but not to data.

The data would be better off under a more convenient license...
Preferably Public domain.
CC-by-sa would do as well : )

Is the list compiled from the parallel subset of the corpus?

Hi @hermitdave ,
First, thanks for putting this list together! I just wanted to ask whether these words were collected from the whole OpenSubtitles database or only from the parallel subset. Concretely, I am working with the English and Hebrew corpora, and it would be useful for me to know whether they were collected from the same movies.
Thanks!

Raquel

Upgrade license to CC-BY-SA 4.0

Hi @hermitdave,

Thanks for putting together these frequency lists and making them available on GitHub! We'd like to use the frequency words lists on your site for our software project. Specifically, we'd like to use them as stop words in our text analysis.

Our software project is an open-source project licensed under GPL. You can find it here: https://github.com/Yoast/javascript.

Unfortunately, the CC-BY-SA 3.0 license used for the data in your project isn't compatible with GPL. Would you perhaps consider upgrading the license of your data to a CC-BY-SA 4.0? That license is compatible with GPL, see https://creativecommons.org/share-your-work/licensing-considerations/compatible-licenses/. Alternatively, a GPL or LGPL license would also work for us.

Changing the license to CC-BY-SA 4.0 would open up the data for wider usage in open source software development, so I'm sure many others would benefit from it as well.

Best,
Manuel
