
Comments (4)

shaochengcheng commented on August 31, 2024

Description

The purpose of tweet parsing is to split a tweet's JSON data into several associated data structures, e.g., tweet, url, twitter_user, and hashtag. If we want to save these data structures into a database, the relationships between tables must also be taken into account. For one-to-one relationships (e.g., table tweet vs. table ass_tweet) and one-to-many relationships (e.g., table twitter_user vs. table tweet), we must first insert into the tables that have no foreign-key dependencies (e.g., table twitter_user), because the foreign keys are not yet known. After that insertion the foreign keys become known, and we can use them to insert into the dependent tables (e.g., table tweet). For a many-to-many relationship (e.g., table tweet vs. table url), things are a little more complicated: an intermediate table is needed (e.g., table ass_tweet_url), which has a many-to-one relationship to each of the two tables it associates. To save the URLs of a tweet, we first need to finish inserting into the tweet and url tables, fetch the inserted primary ids, and then insert the corresponding rows into ass_tweet_url, as sketched below.
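For illustration, here is a minimal SQLAlchemy sketch of that insertion order. The table names follow the description above, but the columns and schema are simplified assumptions, not the actual hoaxy-backend schema:

    # Sketch only: simplified tables, not the real hoaxy-backend schema.
    from sqlalchemy import (create_engine, MetaData, Table, Column,
                            Integer, String, ForeignKey, insert)

    metadata = MetaData()

    twitter_user = Table('twitter_user', metadata,
                         Column('id', Integer, primary_key=True),
                         Column('raw_id', String, unique=True))

    tweet = Table('tweet', metadata,
                  Column('id', Integer, primary_key=True),
                  Column('raw_id', String, unique=True),
                  Column('user_id', Integer, ForeignKey('twitter_user.id')))

    url = Table('url', metadata,
                Column('id', Integer, primary_key=True),
                Column('raw', String, unique=True))

    # intermediate table for the many-to-many tweet <-> url relationship
    ass_tweet_url = Table('ass_tweet_url', metadata,
                          Column('tweet_id', Integer, ForeignKey('tweet.id')),
                          Column('url_id', Integer, ForeignKey('url.id')))

    engine = create_engine('sqlite://')
    metadata.create_all(engine)

    with engine.begin() as conn:
        # 1. insert the table with no foreign-key dependencies first
        user_id = conn.execute(
            insert(twitter_user).values(raw_id='12345')).inserted_primary_key[0]
        # 2. the foreign key is now known, so the tweet row can be inserted
        tweet_id = conn.execute(
            insert(tweet).values(raw_id='67890', user_id=user_id)).inserted_primary_key[0]
        # 3. insert the url, then wire up the association table
        url_id = conn.execute(
            insert(url).values(raw='https://example.com')).inserted_primary_key[0]
        conn.execute(insert(ass_tweet_url).values(tweet_id=tweet_id, url_id=url_id))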

One-at-a-Time Implementation

In Twitter streaming, tweets arrive one at a time, so the straightforward approach is to parse and save each tweet as it arrives. In this implementation, the parsing and saving operations are interleaved. For example, when we parse the data needed for twitter_user, we immediately insert it and obtain its primary id from the database; subsequent parsing and saving steps then use that primary id wherever it is required, as in the sketch below.
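A rough sketch of this flow (the helper functions and per-call query counts are illustrative assumptions, not the actual hoaxy-backend code):

    # One-at-a-time: parsing and saving are interleaved,
    # and each tweet costs several round trips to the database.
    def save_one_tweet(conn, jd):
        """Parse a single tweet JSON dict `jd` and save it immediately."""
        # get-or-create the user so its primary id is known right away
        user_id = get_or_create_user(conn, jd['user'])           # 1-2 queries
        tweet_id = insert_tweet(conn, jd, user_id)               # 1 query
        for u in jd.get('entities', {}).get('urls', []):
            url_id = get_or_create_url(conn, u['expanded_url'])  # 1-2 queries each
            insert_ass_tweet_url(conn, tweet_id, url_id)         # 1 query each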

Bulk Implementation

From the above, we can see that saving the objects parsed from one tweet's JSON data requires many database queries. The performance of the one-at-a-time implementation is therefore limited: it cannot consume a large number of tweets in a short time. In Twitter streaming, tweets must be consumed fast enough to keep the streaming connection alive. In the current implementation we work around this by using a queue to cache incoming tweets before parsing them. Even so, the one-at-a-time implementation generates so many queries that it may overload the shared database server. Moreover, when we want to re-parse tweets (e.g., after adding new tables or fixing a bug), the parser's performance becomes the bottleneck.

Therefore, we propose this bulk implementation, in which the parsing and saving operations are separated. The parsing operation splits a tweet into different objects and has no interaction with the database, so we can parse a large block of tweets and merge the parsed objects of the same kind together. For each kind of parsed object, the saving operation then takes the whole block and saves it into the database with a single query. Note that the saving operation must still handle tables with foreign keys; a simplified sketch follows.
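A simplified sketch of the two phases, assuming hypothetical bulk helpers that return maps from raw ids to the primary ids assigned by the database:

    # Phase 1: parse a block of tweets with no database interaction.
    def parse_block(jds):
        users, tweets, urls, tweet_urls = {}, [], {}, []
        for jd in jds:
            users[jd['user']['id_str']] = parse_user(jd['user'])
            tweets.append(parse_tweet(jd))
            for u in jd.get('entities', {}).get('urls', []):
                urls[u['expanded_url']] = parse_url(u)
                tweet_urls.append((jd['id_str'], u['expanded_url']))
        return users, tweets, urls, tweet_urls

    # Phase 2: one bulk statement per table, parents before dependents.
    def save_block(conn, users, tweets, urls, tweet_urls):
        user_ids = bulk_upsert_users(conn, users.values())   # {raw_id: primary_id}
        url_ids = bulk_upsert_urls(conn, urls.values())      # {raw url: primary_id}
        # fill in the now-known foreign keys, then bulk insert the tweets
        for t in tweets:
            t['user_id'] = user_ids[t.pop('user_raw_id')]
        tweet_ids = bulk_insert_tweets(conn, tweets)          # {raw_id: primary_id}
        # finally the many-to-many association rows
        rows = [dict(tweet_id=tweet_ids[tid], url_id=url_ids[u])
                for tid, u in tweet_urls]
        bulk_insert_ass_tweet_url(conn, rows)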

Work in progress

  • The parsing operation
  • The saving operation
  • Tests


filmenczer commented on August 31, 2024

@shaochengcheng will change the requirements: pandas, networkx and newspaper3k will stay '==' and the others will change to '>='.
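For illustration, an excerpt of what requirements.txt might look like under this convention (the version numbers below are placeholders, not the actual pins):

    # hypothetical excerpt, versions are placeholders
    pandas==0.23.4        # stays pinned with '=='
    networkx==1.11        # stays pinned with '=='
    newspaper3k==0.2.8    # stays pinned with '=='
    SQLAlchemy>=1.2       # relaxed to '>='
    requests>=2.18        # relaxed to '>='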


filmenczer commented on August 31, 2024

@chathuriw will update the server after this update is pushed to master.


filmenczer commented on August 31, 2024

It seems everything is working fine; no errors. Closing. @shaochengcheng let us know if there are any remaining tasks related to this issue.

