edgi-govdata-archiving / archivers-harvesting-tools
ARCHIVED--Collection of scripts and code snippets for data harvesting after generating the zip starter
License: GNU General Public License v3.0
The current toolchain for data harvesting has no hooks for working with AWS, for instance running this tooling on an EC2 instance and pushing the harvested data into S3 automatically.
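As a rough illustration of what such a hook might look like, here is a minimal sketch that walks a harvest directory and uploads each file to S3. It assumes the `boto3` library's S3 client (`upload_file(Filename, Bucket, Key)`); the function name `upload_tree`, the bucket name, and the key prefix are hypothetical and not part of the existing tools.

```python
import os


def upload_tree(client, root, bucket, prefix=""):
    """Upload every file under `root` to `bucket` and return the keys written.

    `client` is assumed to behave like a boto3 S3 client, i.e. to expose
    upload_file(Filename, Bucket, Key).
    """
    keys = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            local_path = os.path.join(dirpath, name)
            # Build the object key from the path relative to `root`,
            # always using "/" as the separator.
            rel = os.path.relpath(local_path, root).replace(os.sep, "/")
            key = f"{prefix}{rel}"
            client.upload_file(local_path, bucket, key)
            keys.append(key)
    return keys
```

On an EC2 instance with credentials available, this could be called as `upload_tree(boto3.client("s3"), "data", "example-harvest-bucket", "harvest/")`, where the bucket name is a placeholder.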
Do we have a procedure for pulling down whole ESRI REST endpoints? E.g., https://map11.epa.gov/arcgis/rest/services.
OpenAddresses uses this useful scraper to get individual datasets from mapservers, imageservers, and featureservers: https://github.com/openaddresses/esri-dump
cc @louh
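One possible first step toward pulling down a whole endpoint is to enumerate its services. The sketch below assumes the ArcGIS REST services directory returns JSON with `folders` and `services` keys when queried with `f=json`, and that services inside a folder are reported with names of the form `Folder/Name`. The function names `fetch_json` and `list_services` are hypothetical, and only one level of folders is followed.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Fetch an ArcGIS REST directory page as JSON (via the f=json parameter)."""
    with urlopen(url + "?f=json") as resp:
        return json.load(resp)


def list_services(base_url, fetch=fetch_json):
    """Return the URL of every service under `base_url`, one folder level deep."""
    root = fetch(base_url)
    listings = [root]
    for folder in root.get("folders", []):
        listings.append(fetch(f"{base_url}/{folder}"))

    urls = []
    for listing in listings:
        for svc in listing.get("services", []):
            # Assumption: names inside a folder already include the folder,
            # e.g. "Folder/Name", so they are appended to base_url directly.
            urls.append(f"{base_url}/{svc['name']}/{svc['type']}")
    return urls
```

The resulting service URLs could serve as a starting point for finding the individual layers that a per-dataset scraper such as esri-dump works on.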
Just noticing that the way we are documenting stuff is a little out of sync with all the wonderful contributions. We should do a pass for consistency across tools :)
Do we want to require paths.txt and urls.txt in these tools? That would mean standardizing them all a bit better.
Feedback from the @rpattcorner event:
a prebuilt toolkit with all major tools installed, as a Docker image
I have someone who wrote a short script to harvest data from a specific page (at DOE, I think). The script is pretty specific to the particular page & purpose, so it (probably) doesn't merit a whole subdirectory + readme file and so on. At the same time, for the purposes of good reproducibility and documentation, it seems like such things ought to be kept somewhere, and it might serve as an example for others.
Perhaps there could be a subdirectory for such things? Something like "single-purpose-scripts" or "site-specific-scripts" or something like that? Each script could be placed as a single file there, and there could be a single readme file that contains a paragraph about each file in the directory.
The current FTP tool is single-threaded and has no option to download files concurrently.
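A minimal sketch of one way to add concurrency, using only the Python standard library (`ftplib` and `concurrent.futures`). It is not the existing tool's code: the names `fetch_one` and `download_all` are hypothetical, anonymous login is assumed, and each worker opens its own connection because a single `FTP` object should not be shared between threads.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from ftplib import FTP


def fetch_one(host, remote_path, dest_dir):
    """Download one file over its own FTP connection and return the local path."""
    local_path = os.path.join(dest_dir, os.path.basename(remote_path))
    with FTP(host) as ftp, open(local_path, "wb") as out:
        ftp.login()  # anonymous login; an assumption about the server
        ftp.retrbinary(f"RETR {remote_path}", out.write)
    return local_path


def download_all(host, remote_paths, dest_dir, workers=4, fetch=fetch_one):
    """Download `remote_paths` concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: fetch(host, p, dest_dir), remote_paths))
```

The `workers` value should stay small, since many FTP servers limit the number of simultaneous connections per client.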
check-ia.py is currently orphaned in the workflow repo (once we move these tools here, no other code will live in the repo). What should we do with it?
At the very least we should indicate that it is a tool for checking seeds, not for harvesting.
Reflect the structure of other repos