Comments (22)
Devil's advocate question: I assume the streaming may take a while and will consume a database connection. How do we prevent denial-of-service attacks?
from delphi-epidata.
Sure, sounds reasonable to me. Plan:
- run a local containerized instance of the epidata api
- add 25M rows of data for a hypothetical JHU-like source
- remove the API's row limit
- client requests all the data
- server sends all the data
- make sure client doesn't crash due to OOM, and is able to iterate all 25M rows
I'm very confident that this won't fail, but attempting now to be extra safe.
Assuming this works, next step would be to implement streaming response in the server and then repeat the same test. Assuming that goes well, should be a short path to a PR.
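The fake-data step of the plan above could be scripted; a minimal sketch (the column names are hypothetical, loosely modeled on the covidcast schema, not the real one):

```python
import csv
import io
import itertools

def fake_rows(n):
    """Yield n fake JHU-like rows (hypothetical columns, not the real schema)."""
    for i in range(n):
        yield {
            "source": "jhu-fake",
            "signal": "confirmed_incidence_num",
            "geo_value": f"{i % 3000:05d}",          # fake county FIPS code
            "time_value": 20200101 + (i // 3000),    # fake date key
            "value": float(i % 100),
        }

def write_csv(rows, fp):
    """Write rows as CSV, suitable for bulk load (e.g. LOAD DATA INFILE)."""
    writer = csv.DictWriter(
        fp, fieldnames=["source", "signal", "geo_value", "time_value", "value"]
    )
    writer.writeheader()
    writer.writerows(rows)

# Small demo: write the first 5 of what would be 25M rows.
buf = io.StringIO()
write_csv(itertools.islice(fake_rows(25_000_000), 5), buf)
print(buf.getvalue().splitlines()[0])  # source,signal,geo_value,time_value,value
```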
from delphi-epidata.
> How do we prevent denial of service attacks?
Fair question. In a streaming approach, the database connection is kept alive until all the data has been sent. By contrast, in other approaches, like a row limit or pagination, we quickly consume the data into memory, close the database connection, and finally write serialized JSON to a socket. So it seems like streaming introduces a new DoS vulnerability.
However, I don't think that the other approaches are immune either. The current API with its tiny row limit is vulnerable to slowloris and even basic SYN flooding. Even without API streaming, a flood of requests could quickly saturate all database connections.
I'm not sure what the best defense is, but it's probably a combination of things; there's no silver bullet.
- Apache has a number of mods which are intended to harden against these kinds of denial-of-service attacks. For example, `mod_qos` lets you set a minimum bandwidth for upload/download, and closes connections that don't meet that minimum. So if you request a huge dataset and then don't read the response, the server would close the connection rather than keeping it alive indefinitely, which prevents someone from indefinitely holding open a database connection.
- We can (and do) replicate the database and webserver. Of course that doesn't solve the problem, but it makes it harder to exhaust all the resources.
- We could dump the database response into a temp file (maybe in memory), close the database connection, and then stream that temp file to the client. That would ensure that the database connection is never kept alive longer than needed, but it would consume either memory or disk, and also a file handle.
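The temp-file idea can be sketched with Python's `tempfile.SpooledTemporaryFile`, which keeps the buffer in memory until it exceeds a threshold and then spills to disk; the `rows` iterator here is a stand-in for the database cursor:

```python
import tempfile

def buffer_then_stream(rows, chunk_size=64 * 1024, spool_limit=16 * 1024 * 1024):
    """Drain the 'database' into a spooled temp file, then stream it out.

    The DB connection (here, the `rows` iterator) is fully consumed up front,
    so it could be closed before a slow client starts reading.
    """
    spool = tempfile.SpooledTemporaryFile(max_size=spool_limit, mode="w+b")
    for row in rows:                      # consume the cursor quickly
        spool.write(row)
    spool.seek(0)
    while True:                           # now stream at the client's pace
        chunk = spool.read(chunk_size)
        if not chunk:
            break
        yield chunk

# Demo with three fake serialized rows.
chunks = b"".join(
    buffer_then_stream([b'{"value":1}\n', b'{"value":2}\n', b'{"value":3}\n'])
)
print(chunks.count(b"\n"))  # 3
```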
IMO the best approach is to investigate and deploy something generic like `mod_qos`, and if/when we get hit with a real attack, decide from there the most effective defense.
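For reference, the relevant mod_qos directive is `QS_SrvMinDataRate`; a hedged example (the numbers are illustrative, not tuned for this server):

```apache
# Require clients to sustain at least 120 bytes/s; the enforced minimum
# scales up toward 1200 bytes/s as the server approaches its connection limit.
# Connections falling below the rate are closed, so a client that requests a
# huge dataset and then stops reading can't pin a database connection open.
QS_SrvMinDataRate 120 1200
```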
from delphi-epidata.
Some preliminary results from the streaming test:
- inserted 25,000,000 fake rows into the `covidcast` table
- to prevent the server from eating all the RAM before the client gets the response, I went ahead and implemented streaming:
  - use the unbuffered version of `mysqli_query` in `execute_query` (prior to this, I had bumped `memory_limit` to 8G, but that's not needed anymore)
  - write out a JSON-encoded version of each row individually
  - bump the global `$MAX_RESULTS` to 999,999,999
  - add request param `format=stream` to trigger streaming mode
  - bump the script execution time limit from 30s to 5m, otherwise the request gets killed
- request all data with `curl`, piped to a file: 03:52 download time, 3.4 GiB file size
- load the response in the Python client: requires ~4 GiB RAM per 5M rows, so ~20 GiB RAM for the whole response
So it worked, but with more workarounds than I'd like. The client can handle the response technically, but oof that's a lot of RAM. The client could read the data incrementally from the stream, but does that help things? Presumably the caller would still need to load and work with the data anyway.
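On the incremental-read question: if the rows were framed as newline-delimited JSON rather than one giant JSON array, the client could decode and aggregate (or discard) one row at a time, keeping peak memory flat. A minimal sketch, with an in-memory stream standing in for the HTTP response body (the framing is an assumption, not what the server currently sends):

```python
import io
import json

def iter_rows(stream):
    """Decode one JSON row per line, so peak memory is one row, not 25M."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# Stand-in for a streamed response body (hypothetical line-delimited framing).
response = io.StringIO('{"value": 1.0}\n{"value": 2.5}\n{"value": 3.5}\n')

total = sum(row["value"] for row in iter_rows(response))
print(total)  # 7.0
```

This only helps callers who can process rows as they arrive (running sums, filtering, writing to disk); a caller who needs the whole dataset in memory at once is back to the 20 GiB problem.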
from delphi-epidata.
(noting for posterity that the `format` param is already in use, with possible values `csv`, `json`, and empty/other)
Is pandas doing something inefficient, or is that about as much memory as we would expect?
The two use cases I'm aware of for getting rid of truncation are:
- Doing analysis on county time series data (which would require that much RAM for the analysis anyhow)
- Batch downloading of the dataset (which is just going to get serialized to disk, so folks should probably use csv instead of trying to load a proper data frame)
from delphi-epidata.
> Is pandas doing something inefficient, or is that about as much memory as we would expect?
That was just the memory used by the object returned by `json.decode` — no pandas were harmed in this experiment.
from delphi-epidata.
There are different options for how to implement that:
- pages: e.g., `?page=2&page_size=100` will return rows 101..200
- offset: e.g., `?offset=100&limit=100` will also return rows 101..200

The question is whether we should also return the total number of rows/pages, or at least a `has_more` flag.
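Client-side, the offset variant could be consumed with a simple loop; a sketch against a fake in-memory "server" (the `offset`/`limit`/`has_more` names follow the proposal above, and the 250-row dataset is made up):

```python
def fake_server(offset, limit):
    """Stand-in for the API endpoint: 250 fake rows, paged by offset/limit."""
    data = list(range(250))
    page = data[offset:offset + limit]
    return {"rows": page, "has_more": offset + limit < len(data)}

def fetch_all(limit=100):
    """Page through the endpoint until has_more is False."""
    rows, offset = [], 0
    while True:
        resp = fake_server(offset, limit)
        rows.extend(resp["rows"])
        if not resp["has_more"]:
            break
        offset += limit
    return rows

print(len(fetch_all()))  # 250
```

A `has_more` flag is cheap for the server (peek one row past the limit), whereas a total row count would require a second `COUNT(*)`-style query.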
from delphi-epidata.
Go with `offset` and `has_more`.
from delphi-epidata.
To summarize the problem: we don't want technical hurdles to prevent users from accessing large datasets.
Current handling of large responses is problematic from a usability standpoint. After 3650* rows, the server calls it a day and stops sending data. (On the bright side, the server at least tells you when this happens, through a response code and message.) To get the full dataset, a caller would then have to issue a number of smaller queries, e.g. by single date or location. This workaround probably isn't obvious, and it can become extremely inefficient as the number of requests grows and the size of responses shrinks (worst case: requesting each row individually).
[* Why 3650? That allows for a ~10-year time series at daily resolution, which, sort of like 640K of memory, "Ought to be Enough for Anyone", and if you requested more rows than that, you're doing something wrong. Well, as it turns out for both, this is not quite true.]
One short-term band-aid we can apply is to increase the row limit by a factor of, say, 10 or 100. But this doesn't fix the problem. So what does? There are a couple of options (see also ideas from Alex):
- sharding: have clients transparently shard requests (e.g. by location), and reassemble the responses into a single payload. con: this requires clients to be "smart" about making requests, with knowledge of geography and expected response sizes, and this logic has to live across all of our clients, in different languages.
- pagination: implement pagination (i.e. the feature request in this issue) on the server so that clients can effectively shard requests without specific knowledge of e.g. geography. con: it still requires that clients be able to reassemble multiple responses into a coherent dataset. it also introduces complexity on the server, as we'll need either a page counter or a row identifier, both of which require a total ordering to be defined in keyspace, which is problematic in multiple dimensions for dynamic datasets.
- streaming: implement streaming such that rows are delivered from the server to the client on the fly rather than buffering up a giant response object. con: json decode isn't compatible with streaming, so clients would still need to buffer a large response. if that's a problem (i don't think it will be), then rows would need to be sent in a different format than a single large json object.
I strongly recommend streaming over the alternatives, as I think that streaming is the best approach from a design perspective. I'm also not opposed to increasing the row limit as a stop-gap.
from delphi-epidata.
Agree; streaming is optimal. Would it be reasonable to use a prototype to determine whether buffering a large response will be a problem without having to do the full workup in the epidata codebase?
from delphi-epidata.
That is indeed important to consider, especially given some of the politically-charged survey signals waiting in the wings (cc @capnrefsmmat). I think for now I'm happy with "if we get a denial of service attack, the site goes down; we turn the row limit back on before bringing everything back up again" and revisit the question if/when it happens.
Leaving the option open to reactivate the row limit will require some cleverer handling in the covidcast clients, but I think the additional complexity is worth it.
from delphi-epidata.
@korlaxxalrok to get your perspective on hardening
from delphi-epidata.
@dfarrow0 Your Apache suggestions sound good, as do the others for implementing streaming. The test results are interesting. Lots-o-RAM, yes :)
We are in AWS, at least for now, so for additional hardening we could look at WAF. Would need to look into it a bit more, but could be helpful to head off certain attacks earlier in the path.
from delphi-epidata.
Cool. @dfarrow0 since you're needed on Hotspots for November, we should either backburner productionizing the prototype or have you deliver it to someone else for that work. Preference?
from delphi-epidata.
No preference here, just give me a heads-up if someone else takes this because otherwise I might poke at it a bit more in my spare(?) time.
from delphi-epidata.
@dfarrow0 atm I'm assigned to this issue, but as far as I understand, your streaming API approach is the current one to use. So, am I still assigned, and if so, what is missing to complete this issue?
from delphi-epidata.
Streaming appears to be viable from my testing, but I didn't get far enough along to make a proper PR. We still need to implement streaming in the server and then update the clients to request/handle streaming responses. I don't have the time to work on it this month, so if you're on board with taking the task then I'm happy to help review PRs etc.
from delphi-epidata.
(and if you'd rather not pick it up lmk and I can drop you as assignee)
from delphi-epidata.
@dfarrow0 what is the name of the branch where you created your prototype, so that I can continue from it?
from delphi-epidata.
@sgratzl it's the `streaming` branch on my fork here: https://github.com/dfarrow0/delphi-epidata/tree/dfarrow/streaming
the fork has a single commit, so a more direct link is this: dfarrow0@e010b84
pls ignore the python file there (`test_oom.py`) as that was just a test and shouldn't be committed. also, the line in `api.php` like `ini_set('memory_limit', '8G');` can be omitted, as that was an intermediate test and isn't needed for streaming.
finally, as you duly noted, streaming consumes a connection for as long as the client is accepting data. 30 seconds (which i think is the default php script timeout on the server?) was too short, so i added `set_time_limit(300);` as a quick solution. feel free to adjust as you see fit :)
from delphi-epidata.
Merged prematurely. Once the server is running PHP7, revert #376 to close.
from delphi-epidata.
closing with new flask server
from delphi-epidata.