Twitterphrases

Generate optimal phrases for tapping the Twitter Streaming API.

Requirements

Language identification currently relies on the 176-class fastText language identification model (and therefore on the fastText package for Python 3), which can be found here.
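
As a rough illustration of this dependency, the sketch below wraps a fastText model for single-tweet language identification. The helper names (label_to_iso, identify_language, load_model) and the local path lid.176.bin are assumptions made for this example and are not taken from this repository. fastText returns labels prefixed with __label__, and predict handles one line at a time, so newlines are replaced first.

```python
from typing import Tuple


def label_to_iso(label: str) -> str:
    """Strip fastText's '__label__' prefix, e.g. '__label__nl' -> 'nl'."""
    prefix = "__label__"
    return label[len(prefix):] if label.startswith(prefix) else label


def identify_language(model, text: str) -> Tuple[str, float]:
    """Return (language code, probability) for the most likely language."""
    # predict() processes a single line, so newlines must be removed.
    single_line = text.replace("\n", " ")
    labels, probabilities = model.predict(single_line, k=1)
    return label_to_iso(labels[0]), float(probabilities[0])


def load_model(path: str = "lid.176.bin"):
    """Load the language identification model (path is an assumption)."""
    import fasttext  # imported lazily so the helpers above work without it

    return fasttext.load_model(path)
```

With the model file downloaded, identify_language(load_model(), "Dit is een test") would return a language code and its probability.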

Input and output data are assumed to be in JSON Lines format: each line holds one JSON object, so records can be loaded lazily at the steps that need them.
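
A minimal sketch of lazy JSON Lines reading and writing, using only the Python standard library. The function names read_jsonl and write_jsonl are hypothetical and may not match the helpers used in this repository.

```python
import json
from typing import Iterable, Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line, without loading the whole file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def write_jsonl(path: str, records: Iterable[dict]) -> None:
    """Write each record as a single JSON object on its own line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Because read_jsonl is a generator, a large tweet collection can be processed one record at a time.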

Example

A basic example of generating optimal-precision phrases for Dutch is included in main.py. The FastText blog post lists the ISO language codes that can be used to generate twitterphrases for other languages.

Cite

The optimal key phrase lists for the 50 most common languages on Twitter can also be downloaded directly via the links in the performance table below. The table also lists the expected performance when using the list for a particular language. If you use the lists, please cite the following paper:

Kreutz, T. and Daelemans, W. (2019). How to Optimize your Twitter Collection. Computational Linguistics in the Netherlands Journal, 9:55–66.

Results

Language ISO (link) Precision Bound Recall F-score
English en 40.21% 1.81% 3.46%
Japanese ja 65.82% 2.96% 5.66%
Spanish es 24.40% 2.18% 4.01%
Arabic ar 80.03% 6.07% 11.28%
Portuguese pt 89.36% 8.80% 16.03%
Korean ko 97.73% 10.95% 19.70%
Thai th 86.80% 11.20% 19.83%
Turkish tr 94.64% 20.13% 33.19%
French fr 95.65% 22.28% 36.15%
Chinese zh 29.98% 3.64% 6.50%
German de 91.44% 34.05% 49.62%
Indonesian id 94.51% 39.04% 55.25%
Russian ru 99.26% 56.17% 71.74%
Italian it 93.75% 48.48% 63.91%
Tagalog tl 96.84% 81.02% 88.23%
Catalan ca 97.74% 68.35% 80.44%
Hindi hi 99.63% 97.86% 98.74%
Polish pl 98.87% 59.60% 74.37%
Dutch nl 98.25% 66.12% 79.04%
Persian fa 99.36% 59.14% 74.15%
Malaysian ms 93.45% 58.05% 71.62%
Egyptian Ar. arz 99.78% 54.77% 70.73%
Urdu ur 99.54% 87.52% 93.15%
Greek el 99.69% 82.69% 90.39%
Esperanto eo 81.03% 8.47% 15.33%
Finnish fi 92.08% 27.70% 42.59%
Swedish sv 97.42% 63.76% 77.07%
Bulgarian bg 94.47% 72.51% 82.04%
Tamil ta 99.80% 79.79% 88.68%
Ukrainian uk 94.62% 44.33% 60.38%
Hungarian hu 88.78% 25.06% 39.09%
Serbian sr 93.14% 58.11% 71.57%
Galician gl 49.28% 8.67% 14.75%
Cebuano ceb 89.63% 57.10% 69.76%
Czech cs 98.06% 43.64% 60.40%
Vietnamese vi 96.06% 76.45% 85.14%
Kurdish ckb 99.51% 36.72% 53.64%
Norwegian no 96.05% 51.92% 67.41%
Danish da 97.14% 56.03% 71.07%
Romanian ro 95.59% 52.53% 67.80%
Hebrew he 99.95% 77.91% 87.56%
Nepali ne 99.32% 88.09% 93.37%
Bengali bn 99.94% 69.82% 82.21%
Macedonian mk 99.01% 62.42% 76.57%
Mongolian mn 99.83% 81.35% 89.65%
Azerbaijani az 96.97% 33.98% 50.32%
Marathi mr 97.87% 68.31% 80.46%
Gujarati gu 99.60% 80.15% 88.82%
Albanian sq 98.18% 64.01% 77.50%
Kannada kn 98.72% 60.61% 75.11%
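
The F-score column appears to be the harmonic mean of the precision and bound recall columns; this is an inference from the numbers, not something the README states. The sketch below recomputes it for a few rows copied from the table, and the results match the listed values to two decimals for the rows checked. The function name f_score and the rows dictionary are introduced here for illustration only.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, bound recall) in percent, copied from the table above.
rows = {
    "en": (40.21, 1.81),
    "nl": (98.25, 66.12),
    "hi": (99.63, 97.86),
}

for iso, (precision, recall) in rows.items():
    print(f"{iso}: {f_score(precision, recall):.2f}%")
```

The English row shows why both columns matter: precision is moderate, but recall is so low that the combined score stays below 4%.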
