
pilot dataset/use cases · dat · 7 comments · CLOSED

dat-ecosystem commented on July 17, 2024
pilot dataset/use cases


Comments (7)

schmod commented on July 17, 2024

The @unitedstates set of repositories seems to be attempting to accomplish many of dat's goals using only git and a selection of scraper scripts. The congress-legislators repository is a particularly good example, because it contains a list of scrapers/parsers that contribute to a YAML dataset that can be further refined by hand (or by other scripts) before being committed.

I'm not a huge YAML evangelist, but it works exceptionally well in this case, because it's line-delimited, and produces readable diffs.
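
To make the "readable diffs" point concrete, this is the kind of change you end up reviewing after a scraper run or a hand edit; the file name matches congress-legislators, but the field and values shown are illustrative, not the actual schema:

# an illustrative diff on line-delimited YAML (field and values are made up)
$ git diff legislators-current.yaml
-    phone: 202-555-0100
+    phone: 202-555-0199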


shawnbot commented on July 17, 2024

I'm super interested in the github one. There could even be separate tools for extracting data from the GitHub JSON API (paginating results and transforming them into more tabular structures), like: $ ditty commits shawnbot/aight | dat update (P.S. I call dibs on the name "ditty").
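
Even without "ditty", a rough version of that extraction is possible today with curl and jq: page through the commits API and flatten each commit into a CSV row. The three pages below are arbitrary, and the chosen columns are just an example:

# page through the GitHub commits API for shawnbot/aight and emit CSV rows
# (3 pages of 100 is arbitrary; unauthenticated requests are rate-limited)
$ for page in 1 2 3; do
    curl -s "https://api.github.com/repos/shawnbot/aight/commits?per_page=100&page=$page" \
      | jq -r '.[] | [.sha, .commit.author.date, (.commit.message | split("\n") | .[0])] | @csv'
  done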

The use case that I'm most interested in, though, is Crimespotting. There are two slightly different processes for Oakland and San Francisco, both of which run daily because that's as often as the data changes:

Oakland

  1. Read a single sheet from an Excel file at a URL
  2. Fix broken dates and discard unfixable ones
  3. Map Oakland's report types to Crimespotting's (FBI-designated) types
  4. Geocode addresses (caching lookups as you go)
  5. Update the database (see the pipeline sketch after the San Francisco steps below)

San Francisco

  1. Read in one of many available report listings from a URL:
    • rolling 90-day Shapefile
    • 1-day delta in GeoJSON
    • all reports since January 1, 2008 packaged as a zip of yearly CSVs
  2. Map SF's report types to Crimespotting's (FBI-designated) types
  3. Update the database
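
For comparison, the Oakland steps above could be sketched as one pipeline in the same spirit as the San Francisco example further down. Here in2csv comes from csvkit; fix-dates, map-report-types, and geocode-addresses are imaginary filters, and the URL and sheet name are placeholders:

# Oakland: pull the Excel file, convert one sheet to CSV, clean, geocode, update
# (in2csv is csvkit; the other filters, the URL, and the sheet name are imaginary)
$ curl -s http://oakland.example/reports.xls -o reports.xls
$ in2csv --sheet "Incidents" reports.xls \
  | fix-dates \
  | map-report-types \
  | geocode-addresses --cache geocodes.db \
  | dat update --input-format csv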

Updating the database is the trickiest part right now, for both cities. When @migurski originally built Oakland Crimespotting, the process was to bundle up reports by day and replace per-day chunks in the database (MySQL). We ended up using the same system for San Francisco, but it doesn't scale well to backfilling the database with reports from 2008 to the present day, which requires roughly 2,000 POST requests.

My wish is that dat can figure out the diff when you send it updates, and generate a database-agnostic "patch" (which I also mentioned here) that notes which records need to be inserted or updated. These could easily be translated into INSERT and UPDATE queries and piped directly to the database, or into collections to be POSTed and PUT to API endpoints.
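
To make that concrete, here is one way such a patch could be consumed on the database side, assuming diff.json has top-level "inserts" and "updates" arrays of report objects and the target table is reports(id, type, date); the patch shape, table, and column names are all illustrative, not anything dat produces today:

# translate the hypothetical diff.json into SQL and pipe it straight to MySQL
# (patch shape, table, and column names are assumptions)
$ jq -r ".inserts[] | \"INSERT INTO reports (id, type, date) VALUES (\(.id), '\(.type)', '\(.date)');\"" diff.json \
  | mysql crimespotting
$ jq -r ".updates[] | \"UPDATE reports SET type = '\(.type)' WHERE id = \(.id);\"" diff.json \
  | mysql crimespotting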

Here's how I would love to be able to update San Francisco:

# San Francisco: download the 90-day rolling reports shapefile
$ curl http://apps.sfgov.org/.../CrimeIncidentShape_90-All.zip > reports.shp.zip
# an imaginary `shp2csv` utility would convert a shapefile to CSV,
# `update-report-types` would convert the report types into the ones we know about,
# then `dat update` would read CSV on stdin and produce a diff as JSON
$ shp2csv reports.shp.zip \
  | update-report-types \
  | dat update --input-format csv --diff diff.json
# if our diff is an object with an array of "updates" and "inserts",
# we can grab both of those using jq and send them to the server
# (ideally with at least HTTP Basic auth)
# a small shell function wraps the upload so we can reuse it for both requests
$ upload() { curl --upload-file - -H "Content-Type: application/json" "$@"; }
$ jq -M ".updates" diff.json | upload --request POST "http://sf.crime.org/reports"
$ jq -M ".inserts" diff.json | upload --request PUT "http://sf.crime.org/reports"

:trollface:


msenateatplos commented on July 17, 2024

There are lots of academic research data repositories out there. One open-access service with quite a bit of data is http://www.datadryad.org/, with approximately 10,579 data files.

There are more, especially lots of small repositories. Here are some lists:

  • general: http://databib.org/
  • general: http://oad.simmons.edu/oadwiki/Data_repositories
  • health: http://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html
  • various other (social site): http://figshare.com/

I'll start a separate issue about WikiData...


sballesteros commented on July 17, 2024

We are going to start a significant data gathering process very soon for communicable diseases circulating in the USA (first starting with NYC). Code will live here.

Unfortunately, the data is currently really dispersed and mostly lives in PDFs and scans. Health agencies or the CDC typically report communicable diseases on a weekly or monthly basis, and after each update a lot of analyses have to be re-run, so a tool like dat would help.

Our plan is to convert this dispersed data into a database. We are going to have to implement some transformation modules along the way, so it would be great to share our effort with dat. We will work with node.js and mongoDB.

In essence we will have a primary collection containing atomic JSON documents (at the row level) for each data entry; and we will implement SLEEP as another history collection tracking the transaction log of changes.
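
As a rough sketch of what those two collections might hold (the database, collection, and field names below are placeholders, not our eventual schema):

# one illustrative row document and its matching entry in the history collection
# (database, collection, and field names are placeholders)
$ echo '{"_id": "nyc-2014-w02-measles", "disease": "measles", "week": "2014-W02", "region": "NYC", "count": 3}' \
  | mongoimport --db diseases --collection rows
$ echo '{"seq": 1, "row": "nyc-2014-w02-measles", "change": "insert", "at": "2014-01-17T00:00:00Z"}' \
  | mongoimport --db diseases --collection history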


jkriss commented on July 17, 2024

There's a lot of data published and/or indexed by JPL that could benefit from dat. For instance, there's a portal for CO2 data. In this case, the data is coming in from multiple sources, and the end result of a search is just a download link.

There's also a really interesting visual browser with maps and scatterplots, but you can't currently download the subset of data you find with that tool.

I may be working with this group to create the next iteration of the data portal, so I'll probably be able to learn more about it and suggest dat or dat-like approaches.


IronBridge commented on July 17, 2024

Traditional ETL Replacement
For several years, we have been using traditional ETL tools like Pentaho to take large (4-10 GB) delimited data sets and transform them into other formats and destinations: JSON, database inserts, RESTful web service calls, and imports into big-data infrastructures like Hadoop.

Our use case is unusual because we may take a large CSV file, transform the data, and then load it into multiple repositories, for example a direct database insert on one end and an HTTP POST on the other. It will be important for us to have a mechanism to determine whether any records in a batch have failed, and which ones.
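
As a bash sketch of that fan-out: one transformed stream feeds both a database loader and an HTTP endpoint via tee and process substitution. transform-records and to-sql are imaginary filters, the endpoint and database names are placeholders, and per-record failure reporting (the hard part) is not addressed here:

# transform once, then load into two repositories at the same time
# (transform-records and to-sql are imaginary; endpoint and database are placeholders)
$ transform-records < large-input.csv \
  | tee >(to-sql | mysql warehouse) \
        >(curl -s -H "Content-Type: application/json" --data-binary @- "http://warehouse.example/records") \
  > /dev/null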

Node.js streams, pipes, and the fact that it's JavaScript make it a very intriguing replacement for a traditional ETL tool.


joehand commented on July 17, 2024

This issue was moved to dat-ecosystem-archive/datproject-discussions#41

