Comments (9)

shaochengcheng commented on August 31, 2024

Hi Fil,

I tested the API flow with random words selected from article titles. Here is the time elapsed in each stage of the API flow (in seconds, 100 rounds):

Stage                    Description                                Time (s)
t0_lucene_query          query articles (Lucene)                    0.554636
t1_article_filtering     filter out disabled sites                  0.004863
t2_article_sharing       query Twitter sharing of the articles     11.400047
t3_network_building_old  build network with the old API            21.456027
t4_network_buiding_new   build network with the new API            15.631910

Please note that this test ran directly on the server, without the Mashape middleware, and covers only the back-end data flow, not the front-end.
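For anyone who wants to repeat this kind of measurement, here is a minimal sketch of a per-stage timing loop. It is not the script used for the numbers above. The helper `time_stages` and the two placeholder stage functions are hypothetical, and it assumes the reported figures are per-round averages, which the results do not state.

```python
import time
from collections import defaultdict


def time_stages(stages, rounds=100):
    """Run every stage `rounds` times and return the mean seconds per stage."""
    totals = defaultdict(float)
    for _ in range(rounds):
        for name, func in stages:
            start = time.perf_counter()
            func()
            totals[name] += time.perf_counter() - start
    return {name: total / rounds for name, total in totals.items()}


# Hypothetical stand-ins for the real stages (Lucene query, database query).
stages = [
    ("t0_lucene_query", lambda: sum(range(1000))),
    ("t2_article_sharing", lambda: time.sleep(0.001)),
]

means = time_stages(stages, rounds=5)
for name, mean in means.items():
    print(f"{name:<24}{mean:.6f}")
```

Replacing the placeholder lambdas with the real calls gives one averaged figure per stage, in the same layout as the results above.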

The sum of the first three items corresponds to the first step in the front-end, and the last two items (network building with the old and the new API) correspond to the second step. As you can see, Lucene itself is really fast. The problem is that the database queries take tens of seconds. You can also see that the new network API performs better than the old one.

A possible solution could be indexing and partitioning the database. However, I am not a database expert, so unfortunately I cannot make much more progress on the performance.

Thanks
Chengcheng

from hoaxy-backend.

filmenczer commented on August 31, 2024

Thank you @shaochengcheng, that explains it clearly. I had attributed the delay of the first phase to Lucene, when in fact it comes from the retrieval of the tweets.

I wonder if we could speed up tweet retrieval by better indexing. @glciampaglia can we discuss this?

Also I understand that the network API is faster now. Thank you for that too! I expected a larger speedup because I thought that the network API now uses the edge table (per issue #4)? Is that a separate issue still being worked on?

glciampaglia commented on August 31, 2024

Thank you @shaochengcheng for running this analysis. This explains the bottleneck perfectly. I think that adding indexing to the article_sharing query could speed up things significantly, like what happened with the Botometer database. I can work on it. Could you please point me to the source code of the article_sharing query? What about the new API? Is it also an SQL query, or are you still parsing things in Python? Perhaps we could add indexes there too.

@filmenczer let's talk about this on Monday, if you are around.

filmenczer commented on August 31, 2024

@shaochengcheng: the table ass_tweet_url had an index on (tweet_id, url_id). When querying by URL, that index was not being used, so the query was slow. Giovanni and I created a new index on (url_id, tweet_id). Now a query by URL is executed as an index scan and is MUCH faster!!!

Please update the code that creates the table to add this new index, then you can close this issue. Thanks!

shaochengcheng commented on August 31, 2024

Hi @filmenczer and @glciampaglia

I am not sure whether we need an extra index on table ass_tweet_url, because a unique constraint is already defined on it when the table is created. Let us look at the table info:

hoaxy=> \d+ ass_tweet_url
                                      Table "public.ass_tweet_url"
  Column  |  Type   |                         Modifiers                          | Storage | Description
----------+---------+------------------------------------------------------------+---------+-------------
 id       | integer | not null default nextval('ass_tweet_url_id_seq'::regclass) | plain   |
 tweet_id | integer |                                                            | plain   |
 url_id   | integer |                                                            | plain   |
Indexes:
    "ass_tweet_url_pkey" PRIMARY KEY, btree (id)
    "tweet_url_uq" UNIQUE, btree (tweet_id, url_id)
    "url_tweet" btree (url_id, tweet_id)
Foreign-key constraints:
    "ass_tweet_url_tweet_id_fkey" FOREIGN KEY (tweet_id) REFERENCES tweet(id) ON UPDATE CASCADE ON DELETE CASCADE
    "ass_tweet_url_url_id_fkey" FOREIGN KEY (url_id) REFERENCES url(id) ON UPDATE CASCADE ON DELETE CASCADE
Has OIDs: no

As you can see, an index is already there: "tweet_url_uq" UNIQUE, btree (tweet_id, url_id). And according to the PostgreSQL docs:

One should, however, be aware that there's no need to manually create indexes on unique columns; doing so would just duplicate the automatically-created index.

Thus I think table ass_tweet_url does not need a manually created index.

Am I right?

Thanks
Chengcheng

filmenczer commented on August 31, 2024

Giovanni will answer more definitely, but as I recall:

  • Before we added the index, the query was not using an index; it was scanning the whole table. And it took several seconds (consistent with your measurements).

  • After we added the index, the query ran scanning only the index, and it was super fast (less than a second). The difference is very noticeable in the live demo.

So I think that the index was needed.

glciampaglia commented on August 31, 2024

The index is composite so when you look up a row by URL ID you are doing a partial lookup. However, with b-tree indexes (like the one in that table) this only works if you are using the leftmost part of the index. In other words, the index was being used when the reference was the tweet-ID, but not the other way round. Adding another index (URL ID, Tweet ID), does the trick.
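The leftmost-prefix behavior can be reproduced in a small, self-contained sketch. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, which is an assumption made purely for illustration: SQLite's planner is not PostgreSQL's, and the exact plan text differs between versions. The table and index names are copied from the \d+ output above, the foreign keys are omitted, and the helper `query_plan` is hypothetical.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the planner's description of how `sql` would be executed."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]  # the fourth column holds the detail text


conn = sqlite3.connect(":memory:")
# Simplified copy of ass_tweet_url (foreign keys omitted for brevity).
conn.execute(
    """
    CREATE TABLE ass_tweet_url (
        id INTEGER PRIMARY KEY,
        tweet_id INTEGER,
        url_id INTEGER,
        CONSTRAINT tweet_url_uq UNIQUE (tweet_id, url_id)
    )
    """
)

by_tweet = "SELECT url_id FROM ass_tweet_url WHERE tweet_id = 42"
by_url = "SELECT tweet_id FROM ass_tweet_url WHERE url_id = 42"

# tweet_id is the leftmost column of the unique index, so a lookup is possible.
plan_tweet = query_plan(conn, by_tweet)
# url_id is not the leftmost column, so every entry has to be scanned.
plan_url_before = query_plan(conn, by_url)

# Add the reversed composite index, as was done on the live database.
conn.execute("CREATE INDEX url_tweet ON ass_tweet_url (url_id, tweet_id)")
plan_url_after = query_plan(conn, by_url)

print("by tweet_id:           ", plan_tweet[0])
print("by url_id, before:     ", plan_url_before[0])
print("by url_id, after index:", plan_url_after[0])
```

In SQLite's plan text, SCAN means every entry is visited, while SEARCH means the index is used to jump directly to the matching rows. On PostgreSQL the equivalent check would be running EXPLAIN on the article_sharing query before and after creating the index.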

glciampaglia commented on August 31, 2024

Btw Clayton pointed out that hash indexes would be even faster than b-tree indexes. I am not sure we need the extra speed at the moment though.

glciampaglia commented on August 31, 2024

Closed via 3dea321
