
pilot dataset/use cases · dat · 7 comments · CLOSED

dat-ecosystem commented on July 17, 2024
pilot dataset/use cases


Comments (7)

schmod commented on July 17, 2024

The @unitedstates set of repositories seems to be attempting to accomplish many of dat's goals using only git and a selection of scraper scripts. The congress-legislators repository is a particularly good example, because it contains a list of scrapers/parsers that contribute to a YAML dataset that can be further refined by hand (or by other scripts) before being committed.

I'm not a huge YAML evangelist, but it works exceptionally well in this case, because it's line-delimited, and produces readable diffs.
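
To make the "readable diffs" point concrete, this is the kind of change you end up reviewing after a scraper run or a hand edit; the file name matches congress-legislators, but the field and values shown are illustrative, not the actual schema:

# an illustrative diff on line-delimited YAML (field and values are made up)
$ git diff legislators-current.yaml
-    phone: 202-555-0100
+    phone: 202-555-0199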


shawnbot commented on July 17, 2024

I'm super interested in the github one. There could even be separate tools for extracting data from the GitHub JSON API (paginating results and transforming them into more tabular structures), like: $ ditty commits shawnbot/aight | dat update (P.S. I call dibs on the name "ditty").
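
Even without "ditty", a rough version of that extraction is possible today with curl and jq: page through the commits API and flatten each commit into a CSV row. The three pages below are arbitrary, and the chosen columns are just an example:

# page through the GitHub commits API for shawnbot/aight and emit CSV rows
# (3 pages of 100 is arbitrary; unauthenticated requests are rate-limited)
$ for page in 1 2 3; do
    curl -s "https://api.github.com/repos/shawnbot/aight/commits?per_page=100&page=$page" \
      | jq -r '.[] | [.sha, .commit.author.date, (.commit.message | split("\n") | .[0])] | @csv'
  done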

The use case that I'm most interested in, though, is Crimespotting. There are two slightly different processes for Oakland and San Francisco, both of which run daily because that's as often as the data changes:

Oakland

  1. Read a single sheet from an Excel file at a URL
  2. Fix broken dates and discard unfixable ones
  3. Map Oakland's report types to Crimespotting's (FBI-designated) types
  4. Geocode addresses (caching lookups as you go)
  5. Update the database (see the pipeline sketch after the San Francisco steps below)

San Francisco

  1. Read in one of many available report listings from a URL:
    • rolling 90-day Shapefile
    • 1-day delta in GeoJSON
    • all reports since January 1, 2008 packaged as a zip of yearly CSVs
  2. Map SF's report types to Crimespotting's (FBI-designated) types
  3. Update the database
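
For comparison, the Oakland steps above could be sketched as one pipeline in the same spirit as the San Francisco example further down. Here in2csv comes from csvkit; fix-dates, map-report-types, and geocode-addresses are imaginary filters, and the URL and sheet name are placeholders:

# Oakland: pull the Excel file, convert one sheet to CSV, clean, geocode, update
# (in2csv is csvkit; the other filters, the URL, and the sheet name are imaginary)
$ curl -s http://oakland.example/reports.xls -o reports.xls
$ in2csv --sheet "Incidents" reports.xls \
  | fix-dates \
  | map-report-types \
  | geocode-addresses --cache geocodes.db \
  | dat update --input-format csv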

Updating the database is the trickiest part right now, for both cities. When @migurski originally built Oakland Crimespotting, the process was to bundle up reports by day and replace per-day chunks in the database (MySQL). We ended up using the same system for San Francisco, but it doesn't scale well to backfilling the database with reports from 2008 to the present day, which requires roughly 2,000 POST requests.

My wish is that dat can figure out the diff when you send it updates, and generate a database-agnostic "patch" (which I also mentioned here) that notes which records need to be inserted or updated. These could easily be translated into INSERT and UPDATE queries and piped directly to the database, or into collections to be POSTed and PUT to API endpoints.
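
To make that concrete, here is one way such a patch could be consumed on the database side, assuming diff.json has top-level "inserts" and "updates" arrays of report objects and the target table is reports(id, type, date); the patch shape, table, and column names are all illustrative, not anything dat produces today:

# translate the hypothetical diff.json into SQL and pipe it straight to MySQL
# (patch shape, table, and column names are assumptions)
$ jq -r ".inserts[] | \"INSERT INTO reports (id, type, date) VALUES (\(.id), '\(.type)', '\(.date)');\"" diff.json \
  | mysql crimespotting
$ jq -r ".updates[] | \"UPDATE reports SET type = '\(.type)' WHERE id = \(.id);\"" diff.json \
  | mysql crimespotting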

Here's how I would love to be able to update San Francisco:

# San Francisco: download the 90-day rolling reports shapefile
$ curl http://apps.sfgov.org/.../CrimeIncidentShape_90-All.zip > reports.shp.zip
# an imaginary `shp2csv` utility would convert a shapefile to CSV,
# `update-report-types` would convert the report types into the ones we know about,
# then `dat update` would read CSV on stdin and produce a diff as JSON
$ shp2csv reports.shp.zip \
  | update-report-types \
  | dat update --input-format csv --diff diff.json
# if our diff is an object with an array of "updates" and "inserts",
# we can grab both of those using jq and send them to the server
# (ideally with at least HTTP Basic auth)
# a small shell function wraps the upload so we can reuse it for both requests
$ upload() { curl --upload-file - -H "Content-Type: application/json" "$@"; }
$ jq -M ".updates" diff.json | upload --request POST "http://sf.crime.org/reports"
$ jq -M ".inserts" diff.json | upload --request PUT "http://sf.crime.org/reports"

:trollface:


msenateatplos commented on July 17, 2024

There are lots of academic research data repositories out there. One open-access service with quite a bit of data is http://www.datadryad.org/, with approximately 10,579 data files.

There are more, especially lots of small repositories. Here are some lists:

  • general: http://databib.org/
  • general: http://oad.simmons.edu/oadwiki/Data_repositories
  • health: http://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html
  • various other (social site): http://figshare.com/

I'll start a separate issue about WikiData...


sballesteros commented on July 17, 2024

We are going to start a significant data gathering process very soon for communicable diseases circulating in the USA (first starting with NYC). Code will live here.

Unfortunately, the data is currently really dispersed and mostly lives in PDFs and scans. Health agencies or the CDC typically report communicable diseases on a weekly or monthly basis, and after each update a lot of analyses have to be re-run, so a tool like dat would help.

Our plan is to convert this dispersed data into a database. We are going to have to implement some transformation modules along the way, so it would be great to share our effort with dat. We will work with node.js and mongoDB.

In essence we will have a primary collection containing atomic JSON documents (at the row level) for each data entry; and we will implement SLEEP as another history collection tracking the transaction log of changes.
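
As a rough sketch of what those two collections might hold (the database, collection, and field names below are placeholders, not our eventual schema):

# one illustrative row document and its matching entry in the history collection
# (database, collection, and field names are placeholders)
$ echo '{"_id": "nyc-2014-w02-measles", "disease": "measles", "week": "2014-W02", "region": "NYC", "count": 3}' \
  | mongoimport --db diseases --collection rows
$ echo '{"seq": 1, "row": "nyc-2014-w02-measles", "change": "insert", "at": "2014-01-17T00:00:00Z"}' \
  | mongoimport --db diseases --collection history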


jkriss commented on July 17, 2024

There's a lot of data published and/or indexed by JPL that could benefit from dat. For instance, there's a portal for CO2 data. In this case, the data is coming in from multiple sources, and the end result of a search is just a download link.

There's also a really interesting visual browser with maps and scatterplots, but you can't currently download the subset of data you find with that tool.

I may be working with this group to create the next iteration of the data portal, so I'll probably be able to learn more about it and suggest dat or dat-like approaches.


IronBridge commented on July 17, 2024

Traditional ETL Replacement
For several years, we have been using traditional ETL tools like Pentaho to take large (4-10 GB) delimited data sets and transform them into other formats and destinations: JSON, database inserts, RESTful web service calls, and imports into big-data infrastructures like Hadoop.

Our use case is unusual because we may take a large CSV file, transform the data, and then load it into multiple repositories, for example a direct database insert on one end and an HTTP POST on the other. It will be important for us to have a mechanism to determine whether any records in a batch have failed, and which ones.
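
As a bash sketch of that fan-out: one transformed stream feeds both a database loader and an HTTP endpoint via tee and process substitution. transform-records and to-sql are imaginary filters, the endpoint and database names are placeholders, and per-record failure reporting (the hard part) is not addressed here:

# transform once, then load into two repositories at the same time
# (transform-records and to-sql are imaginary; endpoint and database are placeholders)
$ transform-records < large-input.csv \
  | tee >(to-sql | mysql warehouse) \
        >(curl -s -H "Content-Type: application/json" --data-binary @- "http://warehouse.example/records") \
  > /dev/null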

Node.js streams, pipes, and the fact that it's JavaScript make it a very intriguing replacement for a traditional ETL tool.


joehand commented on July 17, 2024

This issue was moved to dat-ecosystem-archive/datproject-discussions#41

