High-level architecture
- Crawler: on a regular basis, retrieve data from GBIF, run analyses and fill a database with configurable metrics.
- Webservices: expose the content of the database through a friendly API.
- Client: consume the webservices and present the results in a beautiful way.
- Packaging: embed the client so GBIF pages are enriched in place (instead of on a separate website).
Technology proposal
These are the choices I'd make if I had to implement all of this myself. They were selected for two main reasons: familiarity and fitness for use. I'm open to all criticism, since 1) many other tools would work well too, and 2) familiarity is important and depends on the person in charge of each module.
I'd also propose to adhere to the KISS principle and avoid jumping on every cool kid's tool before it's clear that its technical benefits outweigh the cost of use (learning curve / hidden complexity / maintenance cost).
As a first step, I think the best solution is to implement it transversally (a minimal working prototype of each component), then iterate on each in parallel. That gives maximum flexibility, with plenty of opportunities to refine and fine-tune the technological choices and the interfaces between modules.
Backend: Crawler + webservices
Django (exposing JSON) + PostgreSQL:
- A well-proven solution.
- Using Django for the first two modules will provide good facilities and an integrated solution: for example, using the ORM and helpers from both the crawler (Django commands run by cron; see the sketch after this list) and the webservices.
- Plenty of available extensions for Django on every topic, including REST/webservices (django-tastypie, django-rest-framework, ...), though I'm not sure they will be needed at all.
- Super easy to add an admin interface and additional web pages if necessary.
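To make the "Django commands run by cron" idea concrete, here is a minimal sketch of what the crawler entry point could look like. Everything named here is hypothetical (the metrics app, the Metric model sketched under the data model TODO below); the file would live in metrics/management/commands/update_metrics.py and cron would run it with "python manage.py update_metrics".

```python
# Minimal sketch of the crawler as a Django management command (hypothetical
# "metrics" app; run by cron with: python manage.py update_metrics).
from django.core.management.base import BaseCommand

from metrics.models import Metric  # hypothetical model, sketched below


class Command(BaseCommand):
    help = "Fetch data from GBIF, run the analyses and refresh the metrics."

    def handle(self, *args, **options):
        for name, value in self.compute_metrics():
            # Store/refresh each configurable metric in the database.
            Metric.objects.update_or_create(name=name, defaults={"value": value})
        self.stdout.write("Metrics updated.")

    def compute_metrics(self):
        # Stub: the real version would call the GBIF API or read a DwCA.
        return [("occurrence_count", 0)]
```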
Alternative solution: lighter tools glued together (Flask + external crawler scripts + Postgres + ...)
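For comparison, the "lighter tools" alternative could start as small as this (a sketch only, with the database access stubbed out):

```python
# Minimal Flask webservice exposing one metric as JSON (database access
# stubbed out; the real version would query Postgres).
from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/metrics/<name>")
def metric(name):
    return jsonify({"name": name, "value": 0})  # stub value


if __name__ == "__main__":
    app.run()
```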
TODO: design the basis of the data model.
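To kick off that discussion, a deliberately minimal starting point (all names are hypothetical; the real model will follow from the metrics we pick):

```python
# Hypothetical first draft of the data model: one row per configurable metric.
from django.db import models


class Metric(models.Model):
    name = models.CharField(max_length=100)
    value = models.FloatField()
    computed_at = models.DateTimeField(auto_now=True)  # refreshed on each save

    def __str__(self):
        return f"{self.name} = {self.value}"
```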
TODO: major question: do we get the data from the API or from Darwin Core Archives?
Note: if the crawler consumes GBIF web services, I recently developed a (currently tiny, quick-and-dirty) package to use them. It is currently embedded in another project. To avoid reinventing the wheel, I'd like to take time to extract it into a proper (documented, tested and PyPI-distributed) Python package. Opinions?
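For reference, consuming the GBIF API directly is not much code either; a minimal sketch against the public occurrence search endpoint (the taxon key here is just an illustration):

```python
# Minimal sketch of querying the public GBIF API (v1) with requests.
import requests

GBIF_API = "https://api.gbif.org/v1"


def occurrence_count(taxon_key):
    """Return the number of occurrences GBIF knows for a given taxon key."""
    response = requests.get(
        f"{GBIF_API}/occurrence/search",
        params={"taxonKey": taxon_key, "limit": 0},  # limit=0: only the count
    )
    response.raise_for_status()
    return response.json()["count"]
```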
Note: if we consume DwCA: python-dwca-reader.
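Reading an archive with python-dwca-reader would look roughly like this (the archive filename is hypothetical):

```python
# Minimal sketch of reading a Darwin Core Archive with python-dwca-reader.
from dwca.read import DwCAReader

with DwCAReader("gbif_download.zip") as dwca:
    for row in dwca:
        # row.data maps Darwin Core term URIs (e.g.
        # "http://rs.tdwg.org/dwc/terms/scientificName") to values.
        print(row.data)
```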
Frontend: Client
D3.js + jQuery + optional client MVC framework
Frontend: Packaging
Chrome extension? Greasemonkey? Both? Additional web pages?
Workload division
- A critical and urgent question, IMHO.
- May have an impact on the technological choices.
- I think we can basically divide the work into 4 packages that follow the architecture/modules. We will need a good rough idea of the interfaces between these 4 modules. We may also have to add one or two "utility" work packages: sysadmin/deployment, project management, ...
- I (Nico) am primarily interested in module 1) Crawler and, if time allows, 2) Webservices.
- I (Nico) am willing to soon create a rough prototype of 1) and 2), which will allow us (after also creating quick prototypes of 3 and 4) to validate the whole dataflow/architecture/technology choices.
Hosting
- Also a decision that could impact the technological choices.
- Options: using a server we already have access to / VPS / Cloud-based solution
- At first look, I prefer the VPS solution: it's cheap, we are totally independent and we have full flexibility (root access). See for example https://www.ovh.com/fr/vps/vps-classic.xml. I generally love working with Heroku, but I've recently been surprised by all the hidden costs that appear once you need a few options (background processes, static file hosting, redis-queue, mail sending service, ...)
Next steps
- Discuss all of the above
- Agree on the workload division
- Brainstorm on the basic interfaces (top importance: between webservices and client, but also the database that acts as an interface between crawler and webservices, and the consistency of the whole dataflow).
- Code!