We noticed that the search from the article search engine (Lucene) is often very slow.

Thank you <a class="user-mention notranslate" data-hovercard-type="user" data-hovercar

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

Closed via <a class="commit-link" data-hovercard-type="commit" data-hovercard-url="htt

Speed of query retrieval about hoaxy-backend HOT 9 CLOSED

iunetsci commented on August 31, 2024

Speed of query retrieval

from hoaxy-backend.

Comments (9)

shaochengcheng commented on August 31, 2024

Hi Fil,

I tested the API flow with random words selected from the title of articles. Here are the time escaped in the API flow (in seconds, 100 rounds):

t0_lucene_query query from article 0.554636
t1_article_filtering filtering disabled site 0.004863
t2_article_sharing query twitter sharing of the article. 11.400047
t3_network_building_old build network by old api 21.456027
t4_network_buiding_new build network by new api . 15.631910

Please note that this testing is running on the server directly without the mashape middleware and only background data flow without front-end part.

The sum of first three items is about our first step in the front-end. And third and four item is about the second step in the front-end. As you can see, Lucene itself is really fast. The problem is that the query of the database does take tens seconds. As you can see that the new network API did have better performance.

The possible solution could be indexing and partition the database. However, I am not a database expert. Unfortunately, I cannot make much progress on the performance.

Thanks
Chengcheng

from hoaxy-backend.

filmenczer commented on August 31, 2024

Thank you @shaochengcheng that explains clearly --- I attributed the delay of the first phase to Lucene when in fact it is the retrieval of the tweets.

I wonder if we could speed up tweet retrieval by better indexing. @glciampaglia can we discuss this?

Also I understand that the network API is faster now. Thank you for that too! I expected a larger speedup because I thought that the network API now uses the edge table (per issue #4)? Is that a separate issue still being worked on?

from hoaxy-backend.

glciampaglia commented on August 31, 2024

Thank you @shaochengcheng for running this analysis. This explains the bottleneck perfectly. I think that adding indexing to the article_sharing query could speed up things significantly, like what happened with the Botometer database. I can work on it. Could you please point me to the source code of the article_sharing query? What about the new API? Is it also an SQL query, or are you still parsing things in Python? Perhaps we could add indexes there too.

@filmenczer let's talk about this on Monday, if you are around.

from hoaxy-backend.

filmenczer commented on August 31, 2024

@shaochengcheng -- the table ass_tweet_url had an index (tweet_id, url_id). Therefore when querying by url, it was not using the index, therefore it was slow. Giovanni and I created a new index (url_id, tweet_id). In this way, when querying by url, this is executed as an index scan and is MUCH faster!!!

Please update the code that creates the table to add this new index, than you can close this issue. Thanks!

from hoaxy-backend.

shaochengcheng commented on August 31, 2024

Hi @filmenczer and @glciampaglia

I am not sure whether we need an extra index on table ass_tweet_url, because there is a unique constraint on it when creating. Let us look at the table info:

hoaxy=> \d+ ass_tweet_url
                                      Table "public.ass_tweet_url"
  Column  |  Type   |                         Modifiers                          | Storage | Description
----------+---------+------------------------------------------------------------+---------+-------------
 id       | integer | not null default nextval('ass_tweet_url_id_seq'::regclass) | plain   |
 tweet_id | integer |                                                            | plain   |
 url_id   | integer |                                                            | plain   |
Indexes:
    "ass_tweet_url_pkey" PRIMARY KEY, btree (id)
    "tweet_url_uq" UNIQUE, btree (tweet_id, url_id)
    "url_tweet" btree (url_id, tweet_id)
Foreign-key constraints:
    "ass_tweet_url_tweet_id_fkey" FOREIGN KEY (tweet_id) REFERENCES tweet(id) ON UPDATE CASCADE ON DELETE CASCADE
    "ass_tweet_url_url_id_fkey" FOREIGN KEY (url_id) REFERENCES url(id) ON UPDATE CASCADE ON DELETE CASCADE
Has OIDs: no

As you can see, the index is already there, "tweet_url_uq" UNIQUE, btree (tweet_id, url_id). And according to PostgreSQL docs

One should, however, be aware that there's no need to manually create indexes on unique columns; doing so would just duplicate the automatically-created index.

Thus I think, table ass_tweet_url does not need a manual index.

Am I right?

Thanks
Chengcheng

from hoaxy-backend.

filmenczer commented on August 31, 2024

Giovanni will answer more definitely, but as I recall:

Before we added the index, the query was not running on the index, it was scanning the database. And it took several seconds (consistent with your measurements).
After we added the index, the query ran scanning only the index, and it was super fast (less than a second). The difference is very noticeable in the live demo.

So I think that the index was needed.

from hoaxy-backend.

glciampaglia commented on August 31, 2024

The index is composite so when you look up a row by URL ID you are doing a partial lookup. However, with b-tree indexes (like the one in that table) this only works if you are using the leftmost part of the index. In other words, the index was being used when the reference was the tweet-ID, but not the other way round. Adding another index (URL ID, Tweet ID), does the trick.

from hoaxy-backend.

glciampaglia commented on August 31, 2024

Btw Clayton pointed out that hash indexes would be even faster than b-tree indexes. I am not sure we need the extra speed at the moment though.

from hoaxy-backend.

glciampaglia commented on August 31, 2024

Closed via 3dea321

from hoaxy-backend.

Speed of query retrieval about hoaxy-backend HOT 9 CLOSED

Comments (9)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs