hiddenlights / emoji-counter

This project forked from minikanyas/emoji-counter

How to handle emoji in Python + a quick Python script to count emoji in Tweets as an example. (python 2.7)

Python 71.65% Jupyter Notebook 28.35%

emoji-counter's Introduction

This is a quick Python script to count emoji in Tweets. The IPython notebook is a long explanation of how character encoding works.

Credit to http://apps.timwhitlock.info/emoji/tables/unicode and, of course, unicode.org.

There is also a Stack Overflow post that is pretty helpful for understanding character encodings: http://stackoverflow.com/questions/700187/unicode-utf-ascii-ansi-format-differences

And webpage: http://csharpindepth.com/Articles/General/Unicode.aspx

I took the emoji data from http://www.unicode.org/Public/emoji/2.0/emoji-data.txt (emoji_data.txt), parsed the modifiers to add skin tone variants, and added the regional indicator letters (to detect flags). If that page is updated, you should be able to copy the new version and update the dict.
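To make the two additions above concrete, here is a minimal sketch in Python 3 (the original scripts target Python 2.7). It builds skin tone variants and flag sequences directly from code points. The ranges come from the Unicode standard (skin tone modifiers U+1F3FB..U+1F3FF, regional indicator symbols U+1F1E6..U+1F1FF). The helper names `with_skin_tones` and `flag_for` are hypothetical and are not taken from this repository.

```python
# Skin tone modifiers: U+1F3FB..U+1F3FF (five code points).
SKIN_TONE_MODIFIERS = [chr(cp) for cp in range(0x1F3FB, 0x1F3FF + 1)]

# Regional indicator symbols A..Z start at U+1F1E6.
REGIONAL_INDICATOR_A = 0x1F1E6


def with_skin_tones(base):
    """Return the base emoji followed by each of the five skin tone modifiers."""
    return [base + modifier for modifier in SKIN_TONE_MODIFIERS]


def flag_for(country_code):
    """Map a two-letter ASCII country code (e.g. "US") to its flag sequence."""
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A"))
        for letter in country_code.upper()
    )


waving_hand = "\U0001F44B"
variants = with_skin_tones(waving_hand)  # five two-code-point sequences
us_flag = flag_for("US")                 # U+1F1FA followed by U+1F1F8
```

A flag is therefore always a pair of code points, and a skin tone variant is a base emoji plus one modifier, so a counter has to match these sequences before it matches single characters.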

Run parse_unicode_tables.py to create the emoji dictionary that is used to count emoji. Edit the encoding at the top of the file to create either a UTF-8 or a UTF-32 encoded dictionary (right now it creates a UTF-8 dictionary). parse_unicode_tables.py assigns a unique ID to each emoji in the dict, so that you can choose which characters to count (uncomment the print statements at the end to see a table of all of the emoji being recorded and their modifiers). The dictionaries are saved as pickled files emoji_dict_utf-{8,32}.pkl. Running for UTF-8 will also save unicode_markers_utf-8.pkl (a dictionary of marker bytes in UTF-8).
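As a rough illustration of what an encoded, ID-keyed, pickled dictionary can look like, here is a Python 3 sketch. It is not the contents of parse_unicode_tables.py: the `ENCODING` constant, the `build_emoji_dict` helper, and the three-item sample list are assumptions made for the example, and the file is written to a temporary directory rather than next to the scripts.

```python
import os
import pickle
import tempfile

# Assumption: a module-level switch, similar in spirit to the encoding setting
# the README says sits at the top of parse_unicode_tables.py.
ENCODING = "utf-8"  # or "utf-32-be"

# Tiny illustrative sample; the real script reads emoji_data.txt.
SAMPLE_EMOJI = ["\U0001F600", "\u2764", "\U0001F44B\U0001F3FB"]


def build_emoji_dict(emoji, encoding):
    """Map each encoded emoji (bytes) to a unique integer ID."""
    return {char.encode(encoding): emoji_id for emoji_id, char in enumerate(emoji)}


emoji_dict = build_emoji_dict(SAMPLE_EMOJI, ENCODING)

# Round-trip the dictionary through pickle in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "emoji_dict_{}.pkl".format(ENCODING))
    with open(path, "wb") as handle:
        pickle.dump(emoji_dict, handle)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)
```

With `ENCODING = "utf-8"` the keys have variable length (3 bytes for U+2764, 4 bytes for U+1F600); with a UTF-32 encoding every code point takes exactly 4 bytes.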

Then run: cat tweet_file.json | python parse_utf8.py (if you created a UTF-8 dict)

Or: cat tweet_file.json | python parse_utf32.py (if you created the UTF-32 dict)

Both scripts do exactly the same thing; they just use different inputs and parse the strings differently. There is no huge speed difference. I've included both as two examples of solving the problem.
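The difference between the two parsing styles can be sketched as follows. This is an illustration of the general idea in Python 3, not a description of how parse_utf8.py or parse_utf32.py are actually written; `split_utf8` and `split_utf32` are hypothetical names. In UTF-8 the lead byte says how many bytes the code point occupies, while in UTF-32 every code point is a fixed 4 bytes.

```python
def split_utf8(data):
    """Split UTF-8 bytes into per-code-point chunks using the lead byte."""
    chunks = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            length = 1  # 0xxxxxxx: ASCII
        elif lead >> 5 == 0b110:
            length = 2  # 110xxxxx
        elif lead >> 4 == 0b1110:
            length = 3  # 1110xxxx
        elif lead >> 3 == 0b11110:
            length = 4  # 11110xxx: code points above U+FFFF, including most emoji
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        chunks.append(data[i:i + length])
        i += length
    return chunks


def split_utf32(data):
    """Split big-endian UTF-32 bytes (no BOM) into fixed 4-byte chunks."""
    return [data[i:i + 4] for i in range(0, len(data), 4)]


text = "hi \U0001F600"
utf8_chunks = split_utf8(text.encode("utf-8"))
utf32_chunks = split_utf32(text.encode("utf-32-be"))
```

Either list of chunks can then be looked up in a dictionary keyed by the matching encoding.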

Old version (don't use, this is a very silly way to solve the problem): find_emoji.py, emoji_dict.py

All scripts expect a JSON payload, one record per line, and count emoji in the "body" field.
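The input contract described above (one JSON record per line, emoji counted in "body") can be sketched like this in Python 3. The function `count_emoji` and its greedy longest-match loop are assumptions made for illustration and work on decoded strings; the repository's scripts use their pickled byte dictionaries instead.

```python
import json
from collections import Counter


def count_emoji(lines, emoji):
    """Count occurrences of each known emoji in the "body" field of JSON lines."""
    # Try longer sequences (flags, skin tone variants) before single code points.
    by_length = sorted(emoji, key=len, reverse=True)
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip malformed records
        body = record.get("body") if isinstance(record, dict) else None
        if not isinstance(body, str):
            continue  # skip records without a text body
        i = 0
        while i < len(body):
            for candidate in by_length:
                if body.startswith(candidate, i):
                    counts[candidate] += 1
                    i += len(candidate)
                    break
            else:
                i += 1  # no emoji starts here; move to the next character
    return counts


sample = [
    json.dumps({"body": "hello \U0001F600\U0001F600"}),
    json.dumps({"body": "\U0001F1FA\U0001F1F8 and \U0001F600"}),
    "not json",
]
counts = count_emoji(sample, {"\U0001F600", "\U0001F1FA\U0001F1F8"})
```

To read from a pipe, pass `sys.stdin` as `lines`; any iterable of text lines works.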

example_tweet_ids.txt is a file of IDs for Tweets containing emoji. Most of the emoji in emoji_dict are covered here. Try using twurl to get these Tweets from the public Twitter API.

Feel free to use/modify. No guarantees of anything.

emoji-counter's People

Contributors

fionapigott
