
Comments (4)

shaochengcheng commented on August 31, 2024

Description

The purpose of tweet parsing is to split a tweet's JSON data into several associated data structures, e.g., tweet, url, twitter_user, and hashtag. If we want to save these data structures into a database, the relationships between tables must also be taken into account. For one-to-one relationships (e.g., table tweet vs. table ass_tweet) and one-to-many relationships (e.g., table twitter_user vs. table tweet), we must first insert into the tables that have no foreign-key dependencies (e.g., table twitter_user), because the foreign keys are not yet known. After that insertion the foreign keys become known, and we can use them to insert into the dependent tables (e.g., table tweet). For a many-to-many relationship (e.g., table tweet vs. table url), things are a little more complicated: an intermediate table is needed (e.g., table ass_tweet_url), which has a many-to-one relationship to each of the two tables it associates. To save the URLs of a tweet, we first need to finish inserting into the tweet and url tables, fetch the inserted primary ids, and then insert the corresponding rows into ass_tweet_url, as sketched below.
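For illustration, here is a minimal SQLAlchemy sketch of that insertion order. The table names follow the description above, but the columns and schema are simplified assumptions, not the actual hoaxy-backend schema:

    # Sketch only: simplified tables, not the real hoaxy-backend schema.
    from sqlalchemy import (create_engine, MetaData, Table, Column,
                            Integer, String, ForeignKey, insert)

    metadata = MetaData()

    twitter_user = Table('twitter_user', metadata,
                         Column('id', Integer, primary_key=True),
                         Column('raw_id', String, unique=True))

    tweet = Table('tweet', metadata,
                  Column('id', Integer, primary_key=True),
                  Column('raw_id', String, unique=True),
                  Column('user_id', Integer, ForeignKey('twitter_user.id')))

    url = Table('url', metadata,
                Column('id', Integer, primary_key=True),
                Column('raw', String, unique=True))

    # intermediate table for the many-to-many tweet <-> url relationship
    ass_tweet_url = Table('ass_tweet_url', metadata,
                          Column('tweet_id', Integer, ForeignKey('tweet.id')),
                          Column('url_id', Integer, ForeignKey('url.id')))

    engine = create_engine('sqlite://')
    metadata.create_all(engine)

    with engine.begin() as conn:
        # 1. insert the table with no foreign-key dependencies first
        user_id = conn.execute(
            insert(twitter_user).values(raw_id='12345')).inserted_primary_key[0]
        # 2. the foreign key is now known, so the tweet row can be inserted
        tweet_id = conn.execute(
            insert(tweet).values(raw_id='67890', user_id=user_id)).inserted_primary_key[0]
        # 3. insert the url, then wire up the association table
        url_id = conn.execute(
            insert(url).values(raw='https://example.com')).inserted_primary_key[0]
        conn.execute(insert(ass_tweet_url).values(tweet_id=tweet_id, url_id=url_id))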

One-at-a-Time Implementation

In Twitter streaming, tweets arrive one at a time, so the straightforward approach is to parse and save each tweet as it arrives. In this implementation, the parsing and saving operations are interleaved. For example, when we parse the data needed for twitter_user, we immediately insert it and obtain its primary id from the database; subsequent parsing and saving steps then use that primary id wherever it is required, as in the sketch below.
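A rough sketch of this flow (the helper functions and per-call query counts are illustrative assumptions, not the actual hoaxy-backend code):

    # One-at-a-time: parsing and saving are interleaved,
    # and each tweet costs several round trips to the database.
    def save_one_tweet(conn, jd):
        """Parse a single tweet JSON dict `jd` and save it immediately."""
        # get-or-create the user so its primary id is known right away
        user_id = get_or_create_user(conn, jd['user'])           # 1-2 queries
        tweet_id = insert_tweet(conn, jd, user_id)               # 1 query
        for u in jd.get('entities', {}).get('urls', []):
            url_id = get_or_create_url(conn, u['expanded_url'])  # 1-2 queries each
            insert_ass_tweet_url(conn, tweet_id, url_id)         # 1 query each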

Bulk Implementation

From the above, we can see that saving the objects parsed from one tweet's JSON data requires many database queries. The performance of the one-at-a-time implementation is therefore limited: it cannot consume a large number of tweets in a short time. In Twitter streaming, tweets must be consumed fast enough to keep the streaming connection alive. In the current implementation we work around this by using a queue to cache incoming tweets before parsing them. Even so, the one-at-a-time implementation generates so many queries that it may overload the shared database server. Moreover, when we want to re-parse tweets (e.g., after adding new tables or fixing a bug), the parser's performance becomes the bottleneck.

Therefore, we propose this bulk implementation, in which the parsing and saving operations are separated. The parsing operation splits a tweet into different objects and has no interaction with the database, so we can parse a large block of tweets and merge the parsed objects of the same kind together. For each kind of parsed object, the saving operation then takes the whole block and saves it into the database with a single query. Note that the saving operation must still handle tables with foreign keys; a simplified sketch follows.
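A simplified sketch of the two phases, assuming hypothetical bulk helpers that return maps from raw ids to the primary ids assigned by the database:

    # Phase 1: parse a block of tweets with no database interaction.
    def parse_block(jds):
        users, tweets, urls, tweet_urls = {}, [], {}, []
        for jd in jds:
            users[jd['user']['id_str']] = parse_user(jd['user'])
            tweets.append(parse_tweet(jd))
            for u in jd.get('entities', {}).get('urls', []):
                urls[u['expanded_url']] = parse_url(u)
                tweet_urls.append((jd['id_str'], u['expanded_url']))
        return users, tweets, urls, tweet_urls

    # Phase 2: one bulk statement per table, parents before dependents.
    def save_block(conn, users, tweets, urls, tweet_urls):
        user_ids = bulk_upsert_users(conn, users.values())   # {raw_id: primary_id}
        url_ids = bulk_upsert_urls(conn, urls.values())      # {raw url: primary_id}
        # fill in the now-known foreign keys, then bulk insert the tweets
        for t in tweets:
            t['user_id'] = user_ids[t.pop('user_raw_id')]
        tweet_ids = bulk_insert_tweets(conn, tweets)          # {raw_id: primary_id}
        # finally the many-to-many association rows
        rows = [dict(tweet_id=tweet_ids[tid], url_id=url_ids[u])
                for tid, u in tweet_urls]
        bulk_insert_ass_tweet_url(conn, rows)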

Work in progress

  • The parsing operation
  • The saving operation
  • Tests


filmenczer commented on August 31, 2024

@shaochengcheng will change the requirements: pandas, networkx and newspaper3k will stay '==' and the others will change to '>='.
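For illustration, an excerpt of what requirements.txt might look like under this convention (the version numbers below are placeholders, not the actual pins):

    # hypothetical excerpt, versions are placeholders
    pandas==0.23.4        # stays pinned with '=='
    networkx==1.11        # stays pinned with '=='
    newspaper3k==0.2.8    # stays pinned with '=='
    SQLAlchemy>=1.2       # relaxed to '>='
    requests>=2.18        # relaxed to '>='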


filmenczer commented on August 31, 2024

@chathuriw will update the server after this update is pushed to master.


filmenczer commented on August 31, 2024

It seems everything is working fine; no errors. Closing. @shaochengcheng let us know if there are any remaining tasks related to this issue.

