
Comments (22)

sgratzl avatar sgratzl commented on August 15, 2024 2

devil's advocate question: I assume streaming may take a while and will consume a database connection. How do we prevent denial of service attacks?


dfarrow0 avatar dfarrow0 commented on August 15, 2024 1

Sure, sounds reasonable to me. Plan:

  • run a local containerized instance of the epidata api
  • add 25M rows of data for a hypothetical JHU-like source
  • remove the API's row limit
  • client requests all the data
  • server sends all the data
  • make sure client doesn't crash due to OOM, and is able to iterate all 25M rows

I'm very confident that this won't fail, but I'm testing it now to be extra safe.

Assuming this works, next step would be to implement streaming response in the server and then repeat the same test. Assuming that goes well, should be a short path to a PR.


dfarrow0 avatar dfarrow0 commented on August 15, 2024 1

How do we prevent denial of service attacks?

Fair question. In a streaming approach, the database connection is kept alive until all the data has been sent. By contrast, in other approaches, like a row limit or pagination, we quickly read the data into memory, close the database connection, and finally write serialized JSON to a socket. So it seems like streaming introduces a new DoS vulnerability.

However, I don't think the other approaches are immune either. The current API, with its tiny row limit, is vulnerable to Slowloris and even basic SYN flooding. Even without API streaming, a flood of requests could quickly saturate all database connections.

I'm not sure what the best defense is, but it's probably a combination of things; there's no silver bullet.

  1. Apache has a number of mods intended to harden against these kinds of denial of service attacks. For example, mod_qos lets you set a minimum bandwidth for uploads/downloads and closes connections that don't meet that minimum. So if someone requests a huge dataset and then doesn't read the response, the server closes the connection instead of keeping it open indefinitely, which also frees the underlying database connection.
  2. We can (and do) replicate the database and webserver. This doesn't solve the problem, of course, but it makes it harder to exhaust all the resources.
  3. We could dump the database response into a temp file (possibly in memory), close the database connection, and then stream that temp file to the client. That would ensure the database connection is never held open longer than needed, but it consumes either memory or disk, plus a file handle.
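
A rough sketch of option 3, assuming a mysqli handle $dbh and a query string $sql (illustrative names, not actual epidata code):

    <?php
    // Sketch: buffer the query result to a temp stream, release the DB
    // connection, then stream the buffered rows to the client.
    // php://temp keeps data in memory and spills to disk past the limit.
    $tmp = fopen('php://temp/maxmemory:' . (64 * 1024 * 1024), 'w+');

    $result = mysqli_query($dbh, $sql, MYSQLI_USE_RESULT); // unbuffered read
    while ($row = mysqli_fetch_assoc($result)) {
      fwrite($tmp, json_encode($row) . "\n");
    }
    mysqli_free_result($result);
    mysqli_close($dbh); // DB connection held only for the duration of the query

    rewind($tmp);
    fpassthru($tmp); // send the buffered rows at whatever pace the client reads
    fclose($tmp);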

IMO the best approach is to investigate and deploy something generic like mod_qos, and if/when we get hit with a real attack, decide from there what the most effective defense is.
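
For reference, a hypothetical httpd.conf snippet along those lines; the directives are real mod_qos directives, but the values are placeholders and untested here:

    # Load mod_qos (module path varies by distribution)
    LoadModule qos_module modules/mod_qos.so

    # Require connections to sustain a minimum transfer rate once the server
    # gets busy; clients that request a huge dataset and then stop reading
    # get disconnected instead of holding a database connection open.
    QS_SrvMinDataRate 120 1200

    # Cap concurrent connections per client IP.
    QS_SrvMaxConnPerIP 30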


dfarrow0 avatar dfarrow0 commented on August 15, 2024 1

Some preliminary results from the streaming test:

  • inserted 25,000,000 fake rows into the covidcast table
  • to prevent the server from eating all the RAM before the client gets the response, I went ahead and implemented streaming (roughly as sketched below)
    • use the unbuffered version of mysqli_query in execute_query (prior to this I had bumped memory_limit to 8G, but that's no longer needed)
    • write out the JSON-encoded version of each row individually
    • bump the global $MAX_RESULTS to 999,999,999
    • add request param format=stream to trigger streaming mode
    • bump the script execution time limit from 30s to 5m, otherwise the request gets killed
  • request all data with curl, pipe to file
    • 03:52 download time, 3.4 GiB file size
  • load response in python client
    • requires ~4 GiB RAM per 5M rows, so 20 GiB RAM for the whole response

So it worked, but with more workarounds than I'd like. The client can technically handle the response, but oof, that's a lot of RAM. The client could read the data incrementally from the stream, but would that help? Presumably the caller still needs to load and work with the full dataset anyway.
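
For concreteness, the core of the streaming change looks roughly like this; it's a sketch with illustrative names ($dbh, $sql) and an assumed result/epidata envelope, not the exact prototype code:

    <?php
    // Sketch of format=stream handling: unbuffered query, one row at a time.
    set_time_limit(300); // the default 30s is too short for large responses

    // MYSQLI_USE_RESULT = unbuffered: rows are pulled from MySQL as we
    // iterate instead of being loaded into PHP memory all at once.
    $result = mysqli_query($dbh, $sql, MYSQLI_USE_RESULT);

    echo '{"result":1,"epidata":[';
    $first = true;
    while ($row = mysqli_fetch_assoc($result)) {
      if (!$first) { echo ','; }
      echo json_encode($row);
      $first = false;
      flush(); // push each row to the client as it is produced
    }
    echo ']}';
    mysqli_free_result($result);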


krivard avatar krivard commented on August 15, 2024 1

(noting for posterity that the format param is already in use, with possible values csv, json, and empty/other)

Is pandas doing something inefficient, or is that about as much memory as we would expect?

The two use cases I'm aware of for getting rid of truncation are:

  • Doing analysis on county time series data (which would require that much RAM for the analysis anyhow)
  • Batch downloading of the dataset (which is just going to get serialized to disk, so folks should probably use csv instead of trying to load a proper data frame)


dfarrow0 avatar dfarrow0 commented on August 15, 2024 1

Is pandas doing something inefficient, or is that about as much memory as we would expect?

That was just the memory used by the object returned by json.decode; no pandas were harmed in this experiment.


sgratzl avatar sgratzl commented on August 15, 2024

there are different options for how to implement that:

  • pages: e.g., ?page=2&page_size=100 -> will return rows 101..200
  • offset: e.g., ?offset=100&limit=100 -> will also return rows 101..200

the question is whether we should also return the total number of rows / pages, or at least a has_more flag
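
A minimal sketch of the offset variant with a has_more flag; fetching limit+1 rows avoids a separate COUNT query (table, column, and variable names are illustrative):

    <?php
    // Sketch: offset/limit pagination with a has_more flag.
    $offset = isset($_GET['offset']) ? max(0, intval($_GET['offset'])) : 0;
    $limit  = isset($_GET['limit'])  ? min(intval($_GET['limit']), 10000) : 1000;

    // Fetch one extra row; if it comes back, there is at least one more page.
    $sql = sprintf('SELECT * FROM covidcast ORDER BY id LIMIT %d OFFSET %d',
                   $limit + 1, $offset);
    $rows = array();
    $result = mysqli_query($dbh, $sql);
    while ($row = mysqli_fetch_assoc($result)) {
      $rows[] = $row;
    }
    $has_more = count($rows) > $limit;
    echo json_encode(array(
      'epidata'  => array_slice($rows, 0, $limit),
      'has_more' => $has_more,
    ));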


krivard avatar krivard commented on August 15, 2024

Go with offset and has_more.


dfarrow0 avatar dfarrow0 commented on August 15, 2024

To summarize the problem: we don't want technical hurdles to prevent users from accessing large datasets.

Current handling of large responses is problematic from a usability standpoint. After 3650* rows, the server calls it a day and stops sending data. (On the bright side, the server at least indicates when this happens through a response code and message.) To get the full dataset, a caller would then have to issue a number of smaller queries, e.g. by single date or location. This workaround probably isn't obvious, and it can become extremely inefficient as the number of requests grows and the size of each response shrinks (worst case: requesting each row individually).

[* Why 3650? That allows for a ~10 year time series at daily resolution, which, sort of like 640K of memory, "Ought to be Enough for Anyone"; if you request more rows than that, you're doing something wrong. Well, as it turns out for both, this is not quite true.]

One short-term band-aid we can apply is to increase the row limit by a factor of, say, 10 or 100. But this doesn't fix the problem. So what does fix the problem? There are a couple of options (see also ideas from Alex):

  • sharding: have clients transparently shard requests (e.g. by location) and reassemble the responses into a single payload. con: this requires clients to be "smart" about making requests, which means knowledge of geography and expected response sizes, and that logic has to live across all of our clients, in different languages.
  • pagination: implement pagination (i.e. the feature request in this issue) on the server so that clients can effectively shard requests without specific knowledge of e.g. geography. con: it still requires that clients be able to reassemble multiple responses into a coherent dataset. it also introduces complexity on the server, as we'll need either a page counter or a row identifier, both of which require a total ordering over the keyspace, which is problematic in multiple dimensions for dynamic datasets.
  • streaming: implement streaming such that rows are delivered from the server to the client on the fly rather than buffering up a giant response object. con: json decode isn't compatible with streaming, so clients would still need to buffer a large response. if that's a problem (i don't think it will be), then rows would need to be sent in a different format than a single large json object (a line-delimited alternative is sketched at the end of this comment).

I strongly recommend streaming over the alternatives; I think it's the best approach from a design perspective. I'm also not opposed to increasing the row limit as a stop-gap.
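
If buffering one big JSON object does turn out to be a problem for clients, rows could be emitted as newline-delimited JSON, which a client can parse incrementally; a rough server-side sketch with illustrative names:

    <?php
    // Sketch: newline-delimited JSON, so a client can decode one row per
    // line instead of json-decoding the entire response at once.
    header('Content-Type: application/x-ndjson');
    $result = mysqli_query($dbh, $sql, MYSQLI_USE_RESULT);
    while ($row = mysqli_fetch_assoc($result)) {
      echo json_encode($row), "\n";
      flush();
    }
    mysqli_free_result($result);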


krivard avatar krivard commented on August 15, 2024

Agree; streaming is optimal. Would it be reasonable to use a prototype to determine whether buffering a large response will be a problem without having to do the full workup in the epidata codebase?


krivard avatar krivard commented on August 15, 2024

That is indeed important to consider, especially given some of the politically charged survey signals waiting in the wings (cc @capnrefsmmat). I think for now I'm happy with "if we get a denial of service attack, the site goes down; we turn the row limit back on before bringing everything back up again", and we can revisit the question if/when it happens.

Leaving the option open to reactivate the row limit will require some cleverer handling in the covidcast clients, but I think the additional complexity is worth it.


dfarrow0 avatar dfarrow0 commented on August 15, 2024

@korlaxxalrok to get your perspective on hardening


korlaxxalrok avatar korlaxxalrok commented on August 15, 2024

@dfarrow0 Your Apache suggestions sound good, as do the others for implementing streaming. The test results are interesting. Lots-o-RAM, yes :)

We are in AWS, at least for now, so for additional hardening we could look at AWS WAF. We'd need to look into it a bit more, but it could be helpful to head off certain attacks earlier in the path.


krivard avatar krivard commented on August 15, 2024

Cool. @dfarrow0 since you're needed on Hotspots for November, we should either backburner productionizing the prototype or have you deliver it to someone else for that work. Preference?


dfarrow0 avatar dfarrow0 commented on August 15, 2024

No preference here; just give me a heads-up if someone else takes this, because otherwise I might poke at it a bit more in my spare(?) time.


sgratzl avatar sgratzl commented on August 15, 2024

@dfarrow0 atm I'm assigned to this issue, but as far as I understand, your streaming API approach is the one we're going with. So, am I still assigned, and if so, what is missing to complete this issue?


dfarrow0 avatar dfarrow0 commented on August 15, 2024

Streaming appears to be viable from my testing, but I didn't get far enough along to make a proper PR. We still need to implement streaming in the server and then update the clients to request/handle streaming responses. I don't have the time to work on it this month, so if you're on board with taking the task then I'm happy to help review PRs etc.


krivard avatar krivard commented on August 15, 2024

(and if you'd rather not pick it up lmk and I can drop you as assignee)


sgratzl avatar sgratzl commented on August 15, 2024

@dfarrow0 what is the name of the branch where you created your prototype, so that I can continue from it?


dfarrow0 avatar dfarrow0 commented on August 15, 2024

@sgratzl it's the streaming branch on my fork here: https://github.com/dfarrow0/delphi-epidata/tree/dfarrow/streaming

the fork has a single commit, so a more direct link is this: dfarrow0@e010b84

pls ignore the python file there (test_oom.py) as that was just a test and shouldn't have been committed. also, the line in api.php like ini_set('memory_limit', '8G'); can be omitted, as that was an intermediate test and isn't needed for streaming.

finally, as you duly noted, streaming consumes a connection for as long as the client is accepting data. 30 seconds (which i think is the default php script timeout on the server?) was too short, so i added set_time_limit(300); as a quick solution. feel free to adjust as you see fit :)


krivard avatar krivard commented on August 15, 2024

Merged prematurely. Once the server is running PHP7, revert #376 to close.


sgratzl avatar sgratzl commented on August 15, 2024

closing with the new Flask server

