Comments (7)
The @unitedstates set of repositories seems to be attempting to accomplish many of dat's goals using only git and a selection of scraper scripts. The congress-legislators repository is a particularly good example, because it contains a list of scrapers/parsers that contribute to a YAML dataset that can be further refined by hand (or by other scripts) before being committed.
I'm not a huge YAML evangelist, but it works exceptionally well in this case because it's line-delimited and produces readable diffs.
from dat.
I'm super interested in the GitHub one. There could even be separate tools for extracting data from the GitHub JSON API (paginating results and transforming them into more tabular structures), like: $ ditty commits shawnbot/aight | dat update
(P.S. I call dibs on the name "ditty").
The use case that I'm most interested in, though, is Crimespotting. There are two slightly different processes for Oakland and San Francisco, both of which run daily because that's as often as the data changes:
Oakland
- Read a single sheet from an Excel file at a URL
- Fix broken dates and discard unfixable ones
- Map Oakland's report types to Crimespotting's (FBI-designated) types
- Geocode addresses (caching lookups as you go)
- Update the database
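The geocoding step above lends itself to a tiny caching wrapper, so that repeated addresses (common in daily crime reports) only hit the geocoder once. A minimal sketch, where the geocoder itself is a hypothetical stand-in:

```javascript
// Cache geocoding lookups by address string; `geocode` is whatever
// real geocoder call you plug in.
const cache = new Map();

function geocodeOnce(address, geocode) {
  if (!cache.has(address)) {
    cache.set(address, geocode(address));
  }
  return cache.get(address);
}

// hypothetical geocoder that counts how often it is actually invoked
let calls = 0;
const fakeGeocode = (addr) => { calls++; return { lat: 37.8, lon: -122.27, addr }; };

geocodeOnce("1 Frank Ogawa Plaza, Oakland", fakeGeocode);
geocodeOnce("1 Frank Ogawa Plaza, Oakland", fakeGeocode);
console.log(calls); // → 1
```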
San Francisco
- Read in one of many available report listings from a URL:
- rolling 90-day Shapefile
- 1-day delta in GeoJSON
- all reports since January 1, 2008 packaged as a zip of yearly CSVs
- Map SF's report types to Crimespotting's (FBI-designated) types
- Update the database
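The report-type mapping step in both pipelines is essentially a lookup table plus a policy for unknown categories. A sketch, with illustrative category names rather than the real mapping:

```javascript
// Map a city's raw report categories onto Crimespotting's
// (FBI-designated) types; the entries here are made up for illustration.
const FBI_TYPES = {
  "VEHICLE THEFT": "vehicle_theft",
  "LARCENY/THEFT": "theft",
  "ASSAULT": "aggravated_assault"
};

function mapReportType(report) {
  const mapped = FBI_TYPES[report.category];
  if (!mapped) return null; // unknown types are dropped (or logged), never guessed
  return { ...report, type: mapped };
}

console.log(mapReportType({ id: 1, category: "ASSAULT" }).type); // → "aggravated_assault"
```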
Updating the database is the trickiest part right now, for both cities. When @migurski originally built Oakland Crimespotting, the process was to bundle up reports by day and replace per-day chunks in the database (MySQL). We ended up using the same system for San Francisco, but it doesn't scale well to backfilling the database with reports from 2008 to the present day, which requires roughly 2,000 POST requests.
My wish is that dat can figure out the diff when you send it updates, and generate a database-agnostic "patch" (which I also mentioned here) that notes which records need to be inserted or updated. These could easily be translated into INSERT and UPDATE queries and piped directly to the database, or into collections to be POSTed and PUT to API endpoints.
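That patch idea can be sketched as a plain function: compare incoming records against the current set, keyed by id, and emit inserts and updates separately. The `{ inserts, updates }` shape here is an assumption, not dat's actual diff format:

```javascript
// Database-agnostic diff: records with an unseen id become inserts,
// records whose content changed become updates, everything else is ignored.
function diff(current, incoming) {
  const byId = new Map(current.map((r) => [r.id, r]));
  const inserts = [];
  const updates = [];
  for (const rec of incoming) {
    const old = byId.get(rec.id);
    if (!old) inserts.push(rec);
    else if (JSON.stringify(old) !== JSON.stringify(rec)) updates.push(rec);
  }
  return { inserts, updates };
}

const result = diff(
  [{ id: 1, type: "theft" }],
  [{ id: 1, type: "robbery" }, { id: 2, type: "theft" }]
);
console.log(result.inserts.length, result.updates.length); // → 1 1
```

Either half of the result translates mechanically into SQL statements or request bodies for an API.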
Here's how I would love to be able to update San Francisco:
# San Francisco: download the 90-day rolling reports shapefile
$ curl http://apps.sfgov.org/.../CrimeIncidentShape_90-All.zip > reports.shp.zip
# an imaginary `shp2csv` utility would convert a shapefile to CSV,
# `update-report-types` would convert the report types into the ones we know about,
# then `dat update` would read CSV on stdin and produce a diff as JSON
$ shp2csv reports.shp.zip \
| update-report-types \
| dat update --input-format csv --diff diff.json
# if our diff is an object with an array of "updates" and "inserts",
# we can grab both of those using jq and send them to the server
# (ideally with at least HTTP Basic auth)
# a small shell function so we don't repeat the curl flags
$ upload() { curl --upload-file - -H "Content-Type: application/json" "$@"; }
$ jq -M ".updates" diff.json | upload --request POST "http://sf.crime.org/reports"
$ jq -M ".inserts" diff.json | upload --request PUT "http://sf.crime.org/reports"
There are lots of academic research data repositories out there. This is one open-access service with quite a bit of data: http://www.datadryad.org/ hosts approximately 10,579 data files.
There are more, especially lots of small repositories. Here are some lists:
- general: http://databib.org/
- general: http://oad.simmons.edu/oadwiki/Data_repositories
- health: http://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html
- various other (social site): http://figshare.com/
I'll start a separate issue about WikiData...
We are going to start a significant data gathering process very soon for communicable diseases circulating in the USA (first starting with NYC). Code will live here.
Unfortunately, the data is currently really dispersed and mostly lives in PDFs and scans. Health agencies and the CDC typically report communicable diseases on a weekly or monthly basis. After each update, a lot of analyses have to be re-run, so a tool like dat would help.
Our plan is to convert this dispersed data into a database. We are going to have to implement some transformation modules along the way, so it would be great to share our effort with dat. We will work with node.js and mongoDB.
In essence, we will have a primary collection containing atomic JSON documents (at the row level) for each data entry, and we will implement SLEEP as another history collection tracking the transaction log of changes.
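That two-collection design can be sketched with plain structures standing in for the MongoDB collections: a primary map of current row-level documents, plus an append-only history log recording every change (the field names here are assumptions, not SLEEP's actual wire format):

```javascript
// Primary collection: current state, keyed by document id.
// History collection: append-only transaction log of every change.
const primary = new Map();
const history = [];

function upsert(doc) {
  const prev = primary.get(doc.id);
  primary.set(doc.id, doc);
  history.push({
    seq: history.length + 1,          // monotonically increasing sequence number
    op: prev ? "update" : "insert",
    id: doc.id,
    doc,
    at: Date.now()
  });
}

upsert({ id: "nyc-measles-2013-w01", cases: 3 });
upsert({ id: "nyc-measles-2013-w01", cases: 5 }); // weekly revision
console.log(primary.get("nyc-measles-2013-w01").cases, history.length); // → 5 2
```

Replaying `history` in `seq` order reconstructs the primary collection at any point, which is what makes re-running downstream analyses after each weekly update cheap.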
There's a lot of data published and/or indexed by JPL that could benefit from dat. For instance, there's a portal for CO2 data. In this case, the data is coming in from multiple sources, and the end result of a search is just a download link.
There's also a really interesting visual browser with maps and scatterplots, but you can't currently download the subset of data you find with that tool.
I may be working with this group to create the next iteration of the data portal, so I'll probably be able to learn more about it and suggest dat or dat-like approaches.
Traditional ETL Replacement
For several years, we have been using traditional ETL tools like Pentaho to take large (4-10 GB) delimited data sets and transform them into other formats: JSON, database inserts, RESTful web service calls, and imports into big-data infrastructures like Hadoop.
Our use case is unique because we may take a large CSV file, transform the data, and then load it into multiple repositories: for example, a direct database insert on one end and an HTTP POST on the other. It will be important for us to have a mechanism to determine whether any records in a batch have failed, and which ones.
Node.js streams, pipes, and the fact that it's JavaScript make it a very intriguing replacement for a traditional ETL tool.
This issue was moved to dat-ecosystem-archive/datproject-discussions#41