One of the recent direction runs took 90 minutes to complete, which seems excessive.


Increase the speed optimizations for computing direction on covidcast signals about delphi-epidata HOT 3 CLOSED

cmu-delphi commented on August 15, 2024

Increase the speed optimizations for computing direction on covidcast signals

from delphi-epidata.

Comments (3)

krivard commented on August 15, 2024

@AlSaeed proposed the following:

1. The primary id of each row should also be retrieved when retrieving the time series.
2. 4 lists of row ids should be computed (across timeseries). These are the rows that will change their direction value to [-1,0,1,NULL].
3. 4 batched queries to update the direction field only, based on the ids of the lists above.
4. When retrieving the stale timeseries we should store them in a temporary table.
5. Updating timestamp2 can be done in 1 query, using the temporary table from [4.].
6. For the retrieval part, if we expect the data of all timeseries to fit in memory simultaneously, we can use the temporary table from [4.] to retrieve the entire data and use pandas to separate them into independent series.

This will work, with some modifications to support the upcoming shift to include issue dates. I've consulted with @jacobbien, and he suggests that because direction is relatively complex to compute (it's the slope of a line fit to all values of the previous 7 days, thresholded based on the variance of historical data), it's really a computed product, not raw data. Therefore, the direction column should be updated in place, without creating a new issue for the affected timepoint.
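To make the "slope of a line fit, then thresholded" description concrete, here is a minimal sketch in plain Python. It is not the actual updater code. The function names `slope` and `direction` are hypothetical, the sketch assumes one value per day with no gaps, and the `threshold` argument stands in for whatever variance-based cutoff the real computation derives from historical data.

```python
def slope(values):
    """Least-squares slope of values against their index 0..n-1.

    Returns None when fewer than two points are available.
    """
    n = len(values)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    return sxy / sxx


def direction(values, threshold):
    """Classify the trend of the last 7 values as -1, 0, 1, or None.

    `threshold` is an assumed, caller-supplied cutoff; how it is derived
    from historical variance is not specified here.
    """
    s = slope(values[-7:])
    if s is None:
        return None
    if s > threshold:
        return 1
    if s < -threshold:
        return -1
    return 0
```

Because the fit uses a whole window of values, a single new value can change the direction of several earlier rows, which is why the update touches rows that were not themselves re-reported.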

As an example, consider the following input to the direction updater:

geo value	value	time value	issue	direction	direction timestamp
ca	4.1	20200601	20200601	0	stale
ca	4.0	20200602	20200602	0	stale
ca	4.1	20200603	20200603	0	stale
ca	4.2	20200604	20200604	0	stale
ca	5.0	20200605	20200605	null	stale

The proposed solution is the following:

geo value	value	time value	issue	direction	direction timestamp
ca	4.1	20200601	20200601	1	fresh
ca	4.0	20200602	20200602	1	fresh
ca	4.1	20200603	20200603	1	fresh
ca	4.2	20200604	20200604	1	fresh
ca	5.0	20200605	20200605	1	fresh

The alternative is the following:

geo value	value	time value	issue	direction	direction timestamp
ca	4.1	20200601	20200601	0	stale
ca	4.1	20200601	20200605	1	fresh
ca	4.0	20200602	20200602	0	stale
ca	4.0	20200602	20200605	1	fresh
ca	4.1	20200603	20200603	0	stale
ca	4.1	20200603	20200605	1	fresh
ca	4.2	20200604	20200604	0	stale
ca	4.2	20200604	20200605	1	fresh
ca	5.0	20200605	20200605	1	fresh
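The in-place variant combined with the batched queries from @AlSaeed's proposal could look roughly like the sketch below. It uses Python's built-in `sqlite3` as a stand-in for the real database, and the table name `covidcast`, the `id` column, the temporary table `stale_rows`, and both function names are assumptions for illustration. Only `direction` and `timestamp2` are named in this thread.

```python
import sqlite3
from collections import defaultdict


def apply_direction_updates(conn, new_directions, batch_size=500):
    """Update only the direction column, batching row ids per direction value.

    `new_directions` maps a row's primary id to its new direction
    (-1, 0, 1, or None for NULL). Returns the number of rows updated.
    """
    ids_by_direction = defaultdict(list)
    for row_id, new_direction in new_directions.items():
        ids_by_direction[new_direction].append(row_id)

    updated = 0
    for new_direction, ids in ids_by_direction.items():
        # Split long id lists so each statement stays within parameter limits.
        for start in range(0, len(ids), batch_size):
            batch = ids[start:start + batch_size]
            placeholders = ",".join("?" * len(batch))
            cursor = conn.execute(
                f"UPDATE covidcast SET direction = ? WHERE id IN ({placeholders})",
                [new_direction, *batch],
            )
            updated += cursor.rowcount
    return updated


def refresh_timestamps(conn, stale_ids, now):
    """Mark all stale rows fresh with one UPDATE, driven by a temporary table."""
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS stale_rows (id INTEGER PRIMARY KEY)")
    conn.execute("DELETE FROM stale_rows")
    conn.executemany("INSERT INTO stale_rows (id) VALUES (?)", [(i,) for i in stale_ids])
    conn.execute(
        "UPDATE covidcast SET timestamp2 = ? WHERE id IN (SELECT id FROM stale_rows)",
        (now,),
    )
```

With this shape, the number of write statements depends on the number of distinct direction values and the batch size rather than on the number of timeseries, and no new issue rows are created.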


krivard commented on August 15, 2024

@melange396, barring unforeseen complications we're expecting a PR for query optimizations on this tomorrow -- if database calls turn out to be the top contributor in your profiling efforts, ignore them for now.


krivard commented on August 15, 2024

First pass of implementation is in #133

