Comments (11)

bartaelterman commented on July 29, 2024

I was wondering whether we could use a hosted backend such as CartoDB: put the data in a table there and use their API. I'm not sure whether that would be feasible given the amount of data (what do you think, @peterdesmet?). In that case, we would only need a place to run the scraper and have it send the metric data to CartoDB (using the API).

If that is not an option (or anything similar), we'll need to develop and host our own backend, but that won't be free (although a VPS does indeed look cheap). In that case I agree with Nico. I personally think Django contains a lot of stuff that we don't need, so maybe Flask would be a better choice, but I have little experience with both, so it's best to stick to known technology.

Furthermore, for calculating the metrics we could use pandas. I'm starting to get the hang of it, and its data frames can save us some iterating over the dataset.
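
To make that concrete, here's a minimal sketch of what a metric calculation could look like with pandas, assuming the scraper hands us a CSV with (hypothetical) datasetKey and basisOfRecord columns:

```python
import pandas as pd

# Hypothetical input: one row per occurrence record, produced by the scraper.
occurrences = pd.read_csv("occurrences.csv")

# Example metric: number of records per basisOfRecord for each dataset,
# computed without an explicit loop over the records.
metric = (
    occurrences.groupby(["datasetKey", "basisOfRecord"])
    .size()
    .reset_index(name="record_count")
)

print(metric.head())
```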

Considering the workload, I see myself mostly working on the calculation of the metrics (so that needs a tight integration with the scraper), generally on other backend stuff where needed, and on the frontend visualizations.

I think Peter got off to a good start by adding a POC for the downloads bar chart (looking good, by the way!). I'll try some stuff out while this discussion is ongoing.

peterdesmet commented on July 29, 2024

Thanks @niconoe, nice summary. I agree with Bart that for our metrics store, it would be best to choose the solution that requires the least time investment. CartoDB seems like a good candidate.

Regarding the crawler: we won't be able to use it for the occurrence store. We ideally want to crawl all GBIF data, but the occurrence API has a hard limit of 1 million records per query. We would have to use a download instead (estimated compressed size: 75 GB).

Note: for some metrics, we can use: http://www.gbif.org/developer/occurrence#metrics
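
For illustration, a hedged sketch of fetching such a precomputed count with requests, assuming the v1 occurrence count endpoint accepts a datasetKey filter (the key below is a placeholder):

```python
import requests

# Placeholder dataset key, for illustration only.
DATASET_KEY = "00000000-0000-0000-0000-000000000000"

# Assumes the occurrence count endpoint from the metrics documentation
# accepts a datasetKey filter and returns a plain number.
response = requests.get(
    "http://api.gbif.org/v1/occurrence/count",
    params={"datasetKey": DATASET_KEY},
)
response.raise_for_status()

print("Occurrences in dataset:", response.json())
```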

niconoe commented on July 29, 2024

@bartaelterman, yep, I also agree with what you said. Maybe we should only finalize the tool choices once we have properly defined the interfaces and the workload separation. I'd still be faster/more at ease with my solution if I had to implement it very quickly myself, but I'm pretty sure yours would work great too. Do you have any experience with accessing CartoDB from Python (tool stability, API [SQL or something higher level?], ...)?

@peterdesmet: Alright, I was imagining using the API with a dataset filter for the first prototype, but your proposal is a bit better. I should probably use my python-dwca-reader. There may be difficulties in opening a 75 GB compressed file, but nothing impossible.
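
For reference, a minimal sketch of reading a (much smaller) archive with python-dwca-reader, iterating the core rows one by one; whether that scales to a 75 GB compressed file is exactly what needs testing:

```python
from dwca.read import DwCAReader

# Assumed path to a small test download, not the full 75 GB one.
with DwCAReader("gbif-test-download.zip") as dwca:
    # Iterate over the core rows one at a time instead of loading them all.
    for i, row in enumerate(dwca):
        print(row.data)  # dict of Darwin Core term URI -> value
        if i >= 4:  # just peek at the first few rows
            break
```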

bartaelterman commented on July 29, 2024

@niconoe 👍 for those last three words! 🤘

Most of our experience is with reading from the CartoDB SQL API, but we're very happy with it so far. In the end we should also remember that, if this tool is really going to be used, our backend will probably be replaced by something on GBIF's side, because it makes much more sense to compute the metrics there. So let's indeed keep our backend to the bare minimum. Peter estimated the metric data to be small enough to fit in a free CartoDB table, so we're good to go. @peterdesmet, can you create a table on CartoDB that we can get access to?

The SQL API is documented here, and that page also links to a Python client for the API. I haven't worked with it yet, though, and I'm not sure whether it can do much more than what requests does.
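
For comparison, here's a minimal sketch of hitting the SQL API with plain requests (the account and table name below are placeholders), assuming the standard /api/v2/sql endpoint:

```python
import requests

ACCOUNT = "youraccount"  # placeholder CartoDB account name
SQL_URL = "https://{0}.cartodb.com/api/v2/sql".format(ACCOUNT)

# Reading from a public table needs no API key.
response = requests.get(
    SQL_URL,
    params={"q": "SELECT * FROM metrics LIMIT 10"},  # hypothetical table
)
response.raise_for_status()

for row in response.json()["rows"]:
    print(row)
```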

Concerning the 75 GB file: well, if python-dwca-reader can work with that, that would be... well... I'll just pop in the same icon: 🤘 Can you give that a try, @niconoe?

peterdesmet commented on July 29, 2024

The CartoDB SQL API is pretty awesome; I wish all APIs were like that. The Python client might help. I created a table at https://peterdesmet.cartodb.com/tables/gbif_dataset_metrics/table. The API key for writes is 4ef58d8caa05b8ee1b02a01929de1e035963eb78. It doesn't have any structure yet; let me know what you need.
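
A hedged sketch of how the scraper could then write a metric into that table over the same SQL API (the columns are hypothetical, since the structure isn't defined yet; the key goes in the api_key parameter for authenticated writes):

```python
import requests

SQL_URL = "https://peterdesmet.cartodb.com/api/v2/sql"
API_KEY = "..."  # the write key shared above

# Hypothetical columns; the table structure still has to be defined.
query = (
    "INSERT INTO gbif_dataset_metrics (dataset_key, metric_value) "
    "VALUES ('{0}', {1})".format("some-dataset-key", 42)
)

response = requests.post(SQL_URL, data={"q": query, "api_key": API_KEY})
response.raise_for_status()
print(response.json())
```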

I just created a full GBIF download: it's 82.7 GB (compressed; not downloaded yet). Good luck with that. :goberserk:

bartaelterman commented on July 29, 2024

I downloaded 1 GB of zipped GBIF data. Unzipped, there's 10 GB of data in there: 6 GB of occurrences and 4 GB in a file called verbatim.txt (not sure what that is). So unzipping an 80 GB file of occurrences could require 800 GB of free space on your hard drive. I don't have that. :-)

So if we're running up against the limitations of our machines to process this, we'll need to come up with another solution.
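
One possible way around that (a sketch, untested) is to stream occurrence.txt straight out of the zip with Python's zipfile module, so the archive never has to be fully extracted:

```python
import io
import zipfile

ARCHIVE_PATH = "gbif-download.zip"  # assumed local path to the download

# Read occurrence.txt line by line without extracting anything to disk.
with zipfile.ZipFile(ARCHIVE_PATH) as archive:
    with archive.open("occurrence.txt") as raw:
        lines = io.TextIOWrapper(raw, encoding="utf-8")
        header = next(lines).rstrip("\n").split("\t")
        for line in lines:
            record = dict(zip(header, line.rstrip("\n").split("\t")))
            # ...update metric counters here instead of keeping the record...
```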

peterdesmet commented on July 29, 2024

GBIF downloads are quite bloated. The occurrence file contains over 200 columns, and most of the data is repeated in verbatim.txt (which contains the data as originally provided by the provider). In addition, there are the rights statements, etc.

The occurrence API would give much cleaner and more useful results, but it is limited to 300 records per page and 1 million records total per query. Also, we need to verify that all fields we want to use for our POC use cases are included. If so, I would propose reducing the scope of this POC to datasets with fewer than 1 million records and using the API to retrieve the data. For a user of the plugin, that seems like a reasonable (and easily detectable) limit.

Looking at the data we downloaded on October 27, 2014 for the GBIF data licenses, there are 13,802 datasets, 1,269 of which have 0 occurrences (9%, probably all taxonomic checklists) and 60 of which have more than 1 million occurrences (0.4%). That means there are 12,473 datasets we could query, good for 145,000,000 occurrences. If we do this dataset by dataset, at 300 records a page, we need to make 491,000 API calls. Seems doable?
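
A hedged sketch of what that dataset-by-dataset crawl could look like, assuming the v1 /occurrence/search endpoint with datasetKey, limit and offset parameters:

```python
import requests

SEARCH_URL = "http://api.gbif.org/v1/occurrence/search"
PAGE_SIZE = 300  # maximum page size mentioned above


def iter_dataset_occurrences(dataset_key):
    """Yield every occurrence record of one dataset, 300 per API call."""
    offset = 0
    while True:
        response = requests.get(
            SEARCH_URL,
            params={"datasetKey": dataset_key,
                    "limit": PAGE_SIZE, "offset": offset},
        )
        response.raise_for_status()
        page = response.json()
        for record in page["results"]:
            yield record
        if page.get("endOfRecords", True):
            break
        offset += PAGE_SIZE
```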

niconoe commented on July 29, 2024

Peter,

I plan to investigate both solutions soon, and I feel a bit uncomfortable taking a strong decision before more concrete testing (500,000 API calls are not trivial either). I'm still confident that we'll be able to find a solution to process a substantial number of records by the beginning of March. I'll let you know!

niconoe commented on July 29, 2024

Peter,

Something that we may want to think about early is the web service interface (between the backend and your JS code). Once we have confirmed the 5 use cases with Bart, could you start imagining what the exchanges would look like for each (in terms of parameters and returned values)?
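
Purely as a strawman (the endpoint, parameters and metric names below are all hypothetical, nothing is decided), one such exchange could look like this from the plugin's point of view:

```python
import requests

# Hypothetical endpoint: the plugin asks the backend for the metrics of one dataset.
BACKEND_URL = "https://example.org/api/metrics"  # placeholder, not decided

response = requests.get(BACKEND_URL, params={"dataset_key": "some-dataset-key"})
response.raise_for_status()

# Hypothetical returned structure: one entry per use case.
metrics = response.json()
print(metrics.get("downloads_per_month"))
print(metrics.get("occurrences_per_basis_of_record"))
```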

niconoe commented on July 29, 2024

Quick intermediate report on my progress:

  • I'm working with the downloads so far, since that would be the best solution if possible.
  • I've improved python-dwca-reader and can now comfortably open archives of around 10 GB uncompressed.
  • The whole archive, uncompressed and with the verbatim records cut out (keeping interpreted data + metadata), is around 350 GB. I'm in the process of splitting it into multiple manageable 10 GB archives.
  • So far, these processes are very slow (on my laptop with a USB external HDD...), but everything works.
  • I'm therefore confident that we'll eventually be able to parse the whole download on a rather standard machine. Each indexing run will probably take a few days.

Here is my proposal:

  • Go with the downloads to populate the backend (instead of the API).
  • Do our development with a subset of the data (for example 10 manually selected datasets), so we can change/test things quickly during development (see the sketch below).
  • In parallel, continue to improve the process to index the whole archive, so it gets easier and faster over time.
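
As a sketch of what such a development subset could look like (the dataset keys and term URIs below are illustrative assumptions), the relevant rows could be filtered out of one of the smaller archives with python-dwca-reader:

```python
import csv
from dwca.read import DwCAReader

# Hypothetical, manually selected dataset keys to develop against.
SELECTED_DATASETS = {"dataset-key-1", "dataset-key-2"}  # ...up to ~10

# Assumed term URIs for the columns we care about in the download.
DATASET_KEY = "http://rs.gbif.org/terms/1.0/datasetKey"
BASIS_OF_RECORD = "http://rs.tdwg.org/dwc/terms/basisOfRecord"

with DwCAReader("gbif-10gb-part.zip") as dwca, \
        open("dev_subset.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["datasetKey", "basisOfRecord"])
    for row in dwca:
        if row.data.get(DATASET_KEY) in SELECTED_DATASETS:
            writer.writerow([row.data.get(DATASET_KEY),
                             row.data.get(BASIS_OF_RECORD)])
```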

I'll continue investigating all of this and will keep you informed!

bartaelterman commented on July 29, 2024

Looks great, Nico. I agree with your proposal. Anything I can help you with? Maybe I can start with the subset and write code to calculate the metrics while you chew further on the indexing?
