Generate optimal phrases for tapping the Twitter Streaming API.
Language identification currently relies on the 176-class fastText language identification model (and therefore on the fastText package for Python 3), which can be found here.
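
As an illustration of this step, the sketch below labels a single tweet with a fastText-style model. It is not taken from this repository: the helper names `load_lid_model` and `identify_language` are hypothetical, and the model path `lid.176.bin` is an assumption about where the downloaded model was saved.

```python
def load_lid_model(path="lid.176.bin"):
    """Load a fastText language identification model (path is an assumption)."""
    import fasttext  # imported lazily so the helper below works without it
    return fasttext.load_model(path)


def identify_language(model, text):
    """Return (iso_code, probability) for the most likely language of `text`."""
    # fastText's predict() rejects newlines, so collapse all whitespace first.
    clean = " ".join(text.split())
    labels, probs = model.predict(clean, k=1)
    # Labels come back in the form "__label__nl"; strip the prefix.
    return labels[0].replace("__label__", ""), float(probs[0])
```

With the model file in place, `identify_language(load_lid_model(), "Dit is een Nederlandse tweet")` would return a language code and its probability.
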
Input and output data are assumed to be in JSON Lines format: each line holds one JSON object, so records can be loaded lazily at the steps that need them.
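
A minimal sketch of lazy JSON Lines handling, using only the standard library. The function names `read_jsonl` and `write_jsonl` are hypothetical and are not part of this package; the `text` field in the usage note is likewise only an assumed field name.

```python
import json


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line, without loading the whole file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def write_jsonl(path, records):
    """Write an iterable of JSON-serialisable objects, one per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Because `read_jsonl` is a generator, a loop such as `for tweet in read_jsonl("tweets.jsonl"): print(tweet["text"])` keeps only one record in memory at a time.
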
A basic example of generating optimal-precision phrases for Dutch is included in main.py. The fastText blog post lists the ISO language codes that can be used to generate phrases for other languages.
The optimal key phrase lists for the 50 most common languages on Twitter can also be downloaded directly via the links in the performance table below. The table also lists the performance to expect when using the list for a particular language. Please cite the following paper:
Language | ISO (link) | Precision | Bound Recall | F-score |
---|---|---|---|---|
English | en | 40.21% | 1.81% | 3.46% |
Japanese | ja | 65.82% | 2.96% | 5.66% |
Spanish | es | 24.40% | 2.18% | 4.01% |
Arabic | ar | 80.03% | 6.07% | 11.28% |
Portuguese | pt | 89.36% | 8.80% | 16.03% |
Korean | ko | 97.73% | 10.95% | 19.70% |
Thai | th | 86.80% | 11.20% | 19.83% |
Turkish | tr | 94.64% | 20.13% | 33.19% |
French | fr | 95.65% | 22.28% | 36.15% |
Chinese | zh | 29.98% | 3.64% | 6.50% |
German | de | 91.44% | 34.05% | 49.62% |
Indonesian | id | 94.51% | 39.04% | 55.25% |
Russian | ru | 99.26% | 56.17% | 71.74% |
Italian | it | 93.75% | 48.48% | 63.91% |
Tagalog | tl | 96.84% | 81.02% | 88.23% |
Catalan | ca | 97.74% | 68.35% | 80.44% |
Hindi | hi | 99.63% | 97.86% | 98.74% |
Polish | pl | 98.87% | 59.60% | 74.37% |
Dutch | nl | 98.25% | 66.12% | 79.04% |
Persian | fa | 99.36% | 59.14% | 74.15% |
Malay | ms | 93.45% | 58.05% | 71.62% |
Egyptian Ar. | arz | 99.78% | 54.77% | 70.73% |
Urdu | ur | 99.54% | 87.52% | 93.15% |
Greek | el | 99.69% | 82.69% | 90.39% |
Esperanto | eo | 81.03% | 8.47% | 15.33% |
Finnish | fi | 92.08% | 27.70% | 42.59% |
Swedish | sv | 97.42% | 63.76% | 77.07% |
Bulgarian | bg | 94.47% | 72.51% | 82.04% |
Tamil | ta | 99.80% | 79.79% | 88.68% |
Ukrainian | uk | 94.62% | 44.33% | 60.38% |
Hungarian | hu | 88.78% | 25.06% | 39.09% |
Serbian | sr | 93.14% | 58.11% | 71.57% |
Galician | gl | 49.28% | 8.67% | 14.75% |
Cebuano | ceb | 89.63% | 57.10% | 69.76% |
Czech | cs | 98.06% | 43.64% | 60.40% |
Vietnamese | vi | 96.06% | 76.45% | 85.14% |
Kurdish | ckb | 99.51% | 36.72% | 53.64% |
Norwegian | no | 96.05% | 51.92% | 67.41% |
Danish | da | 97.14% | 56.03% | 71.07% |
Romanian | ro | 95.59% | 52.53% | 67.80% |
Hebrew | he | 99.95% | 77.91% | 87.56% |
Nepali | ne | 99.32% | 88.09% | 93.37% |
Bengali | bn | 99.94% | 69.82% | 82.21% |
Macedonian | mk | 99.01% | 62.42% | 76.57% |
Mongolian | mn | 99.83% | 81.35% | 89.65% |
Azerbaijani | az | 96.97% | 33.98% | 50.32% |
Marathi | mr | 97.87% | 68.31% | 80.46% |
Gujarati | gu | 99.60% | 80.15% | 88.82% |
Albanian | sq | 98.18% | 64.01% | 77.50% |
Kannada | kn | 98.72% | 60.61% | 75.11% |
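
The F-score column is consistent with the harmonic mean of precision and bound recall (the usual F1 measure). The sketch below recomputes it from the other two columns; the function name `f_score` is hypothetical, and the F1 interpretation is inferred from the table values rather than stated by the authors.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall, both given in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# English row: precision 40.21%, bound recall 1.81%
print(round(f_score(40.21, 1.81), 2))  # → 3.46
```
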