socrata / datasync

Desktop / Console application for updating Socrata datasets automatically.

Home Page: http://socrata.github.io/datasync/

License: MIT License

Languages: Java 99.96%, Shell 0.04%
Topics: core-platform, engineering

datasync's Introduction

Socrata QGIS Plugin

Created By Peter Moore

Requirements

Due to TLS security issues with older versions, the plugin requires QGIS3 or above.

Usage

The Socrata plugin is added to the 'Web' menu.

Enter the domain. Username and Password are optional fields.

You can use the "Get Maps" button to get a list of all map assets on the domain, or enter the dataset ID directly if you already have it.

Click OK to add the map to your project.

NOTES:

All maps on Socrata are in WGS84 and are retrieved through the GeoJSON endpoint.

datasync's People

Contributors

aescobarcruz, alaurenz, bhwilliamson, catstavi, charlottewest, chitang, courtneyspurgeon, dependabot[bot], gregorrichardson, louisfettet, malindac, michaelb990, peteraustinmoore, rjmac, spaceballone, urmilan


datasync's Issues

Migrate to Maven

Use Maven for package management. Move mail, soda-java, and org.json packages to be imported from Maven. Also update the README to reflect this change.

Restructure Publish methods to be less confusing

Although it is not clear from the existing DataSync UI, the 'upsert' and 'append' methods behave in exactly the same way. When an upsert or append is performed and no Row ID is set for the dataset being published to, all rows are appended. If a Row ID is set, then whether you use 'upsert' or 'append', uploaded rows that match existing rows will be updated.

Proposed change: when the user selects append as the publish method, DataSync should check whether a Row ID is set on the dataset. If one is set, show a message that warns the user about it. Also add a help bubble next to the "Publish method" field in the UI explaining this behavior.

Also need to update knowledge base articles.
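
As a rough sketch of the proposed check (assuming soda-java's SodaDdl.newDdl, loadDatasetInfo, and Dataset.getRowIdentifierColumnId behave as named; the domain and dataset ID below are placeholders):

```java
// Sketch: warn before an 'append' when the target dataset has a Row ID set,
// since matching rows would be updated rather than appended. The soda-java
// method names used here are assumptions, not confirmed DataSync code.
import com.socrata.api.SodaDdl;
import com.socrata.model.importer.Dataset;

public class AppendRowIdCheck {
    public static void main(String[] args) throws Exception {
        SodaDdl ddl = SodaDdl.newDdl("https://data.example.com", "user", "password", "appToken");
        Dataset dataset = (Dataset) ddl.loadDatasetInfo("abcd-1234"); // hypothetical dataset ID
        if (dataset.getRowIdentifierColumnId() != null) {
            System.out.println("Warning: this dataset has a Row ID set; "
                + "'append' will update rows whose Row ID matches existing rows.");
        }
    }
}
```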

Optimize DataSync upserting API calls and chunking strategy

To optimize the upsert API calls we should investigate using the upsert CSV method directly (but this may only work for files that contain headers). This may be challenging to support with chunking.

It makes a lot more sense to have chunk size be based on file size rather than number of rows.
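
A minimal sketch of size-based chunking in plain Java; the byte limit is whatever the caller passes, not a DataSync setting:

```java
// Sketch: split a CSV into chunks bounded by byte size rather than row count,
// keeping whole lines together.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ByteSizedChunker {
    public static List<List<String>> chunkByBytes(String path, long maxChunkBytes) throws IOException {
        List<List<String>> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                long lineBytes = line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for newline
                if (currentBytes + lineBytes > maxChunkBytes && !current.isEmpty()) {
                    chunks.add(current); // close out the current chunk before it overflows
                    current = new ArrayList<>();
                    currentBytes = 0;
                }
                current.add(line);
                currentBytes += lineBytes;
            }
        }
        if (!current.isEmpty()) {
            chunks.add(current);
        }
        return chunks;
    }
}
```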

Silent failure when running a job

I ran a job (interactively) and it failed silently -- meaning no alert, no apparent change to the dataset, and no entry in the Log Dataset. The only signal is that the Run Job Now button becomes clickable again.

It is a Replace job and the source CSV is 76 MB. I have the chunking threshold at 64 MB. I initially had the chunk size at the default of 25,000 rows but lowered it as low as 10,000, with no success.

What other diagnostic information can I provide?

Thank you.

Trim leading and trailing whitespace in CSV header column

If the header row of a CSV file has leading or trailing whitespace around any of the column names, the upload will fail for those columns. This can be fixed by simply trimming the whitespace around each header column name automatically before publishing the CSV.
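
A minimal sketch of the proposed fix; the comma split is a naive stand-in for a real CSV parser, which should handle quoted fields:

```java
// Sketch: trim leading/trailing whitespace from each header cell before publishing.
public class HeaderTrimmer {
    public static String[] trimHeaders(String headerRow) {
        String[] columns = headerRow.split(","); // naive split; real parsing must respect quotes
        for (int i = 0; i < columns.length; i++) {
            columns[i] = columns[i].trim();
        }
        return columns;
    }
}
```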

Enable user to (optionally) specify a :deleted column marking rows to be deleted

This issue is related to another issue: #14

There should be an upload method called append/upsert/delete which allows specifying a column in the CSV with the header :deleted; any value of true in that column triggers a delete of that row. Be sure to notify the user if there is no Row ID set on the dataset. Include a Help icon explaining how this works.
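
A rough sketch of how rows could be partitioned on a :deleted column; the column handling is simplified (no quoted-field support) and the index is supplied by the caller:

```java
// Sketch: partition CSV rows into deletes and upserts based on a ":deleted" column.
import java.util.List;

public class DeletedColumnSplitter {
    public static void partition(List<String[]> rows, int deletedColumnIndex,
                                 List<String[]> toDelete, List<String[]> toUpsert) {
        for (String[] row : rows) {
            // any value of "true" (case-insensitive) in the :deleted column marks a delete
            if ("true".equalsIgnoreCase(row[deletedColumnIndex].trim())) {
                toDelete.add(row);
            } else {
                toUpsert.add(row);
            }
        }
    }
}
```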

Implement SmartUpdate feature for efficient publishing of data from very large files

Enable normal DataSync jobs to be 'smart' about the data they publish by only publishing the rows that were updated or added in the CSV (File to publish). DataSync will maintain a record of the changes made to the CSV file since the last publish operation (i.e., the last time the job was run) in order to identify which rows were updated or added since the last run. When the job runs, it will then publish only those rows, rather than all rows in the CSV as it does now. This feature will only work for append/upsert (not replace).
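
A rough sketch of one way to detect changed rows, comparing row hashes against a snapshot from the previous run; the snapshot file and hashing scheme here are illustrative assumptions, not DataSync's actual mechanism:

```java
// Sketch: return rows whose hash was not seen on the previous run, i.e. rows
// that were added or changed since then. Writing the new snapshot back after a
// successful publish is elided here.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ChangedRowFinder {
    public static List<String> changedRows(Path csv, Path lastRunHashes) throws IOException {
        Set<String> previous = Files.exists(lastRunHashes)
            ? new HashSet<>(Files.readAllLines(lastRunHashes))
            : new HashSet<>();
        return Files.readAllLines(csv).stream()
            .filter(row -> !previous.contains(Integer.toHexString(row.hashCode())))
            .collect(Collectors.toList());
    }
}
```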

Enable file chunking support for 'replace' method

Currently chunking (which enables uploading very large files) is only supported by 'append' and 'upsert'.

This should be implemented by creating a working copy of the dataset via the .copySchema method and then pushing rows to the resulting working copy in chunks. This prevents the dataset from being left in a bad/inconsistent state if a job fails part-way through.
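
A sketch of the proposed flow; the method names here (createWorkingCopy, upsertCsv, publish) are assumptions about the soda-java workflow API, written to match the issue's description rather than taken from DataSync's code:

```java
// Sketch: copy the schema to a working copy, push chunks to the copy, then
// publish so the live dataset is swapped atomically. Treat every soda-java
// method name below as an assumption.
import com.socrata.api.Soda2Producer;
import com.socrata.api.SodaImporter;
import com.socrata.model.importer.DatasetInfo;
import java.io.File;
import java.util.List;

public class ChunkedReplaceSketch {
    public static void replaceInChunks(SodaImporter importer, Soda2Producer producer,
                                       String datasetId, List<File> chunks) throws Exception {
        DatasetInfo workingCopy = importer.createWorkingCopy(datasetId); // per issue: via copySchema
        for (File chunk : chunks) {
            producer.upsertCsv(workingCopy.getId(), chunk); // push each chunk to the copy
        }
        importer.publish(workingCopy.getId()); // only now does the live dataset change
    }
}
```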

Allow saving settings on a machine that is not able to run the GUI interface

With the GUI, one can set certain settings and preferences once and have them applied to all jobs (unless specifically overridden by a command-line parameter, I assume). They are saved in the Windows registry or another OS-specific location.

However, this only works for a machine capable of running the GUI. In some cases, such as a Linux server accessed through terminal emulator, that is not possible. It would be great if there were a way to save these settings through the command line. Maybe some parameter that effectively said "Save all these other parameters as if entered through the GUI"? Would that allow for substantial reuse of the code the GUI uses to save these settings?

I realize this does not add a lot of security vs. using command-line parameters or a configuration JSON file, but it adds a little (even if only security through obscurity) and adds convenience vs. the command line.

Thank you.
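
A minimal sketch of the kind of headless settings save being requested, assuming (as the OS-specific storage described above suggests) that settings live in java.util.prefs.Preferences; the node path and key names are hypothetical:

```java
// Sketch: write settings from the command line without launching the GUI.
import java.util.prefs.Preferences;

public class HeadlessSettingsSaver {
    public static void main(String[] args) {
        // hypothetical preferences node; DataSync's actual node path may differ
        Preferences prefs = Preferences.userRoot().node("com/socrata/datasync");
        prefs.put("domain", "https://data.example.com");
        prefs.put("adminEmail", "admin@example.com");
        System.out.println("Settings saved without launching the GUI.");
    }
}
```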

Enable uploading CSV files that do not contain a column header row

DataSync should give the user the option of uploading a CSV file without column headers. Add a checkbox the user can check to tell DataSync their CSV does not contain a column header row. If this box is checked, DataSync should rely on the order of the columns in the CSV.

It might also be useful to give the user feedback in the UI if the column headers of the CSV file (or lack thereof) do not match those of the dataset.
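
A minimal sketch of handling a headerless CSV by synthesizing a header row from the dataset's column order; the checkbox flag and column list are illustrative:

```java
// Sketch: when the CSV has no header row, prepend headers in dataset column
// order so downstream publishing can proceed unchanged.
import java.util.ArrayList;
import java.util.List;

public class HeaderlessCsvAdapter {
    public static List<String> withHeaders(List<String> csvLines, List<String> datasetColumns,
                                           boolean hasHeaderRow) {
        if (hasHeaderRow) {
            return csvLines; // nothing to do; file already carries its own headers
        }
        List<String> result = new ArrayList<>();
        result.add(String.join(",", datasetColumns)); // synthetic header in dataset column order
        result.addAll(csvLines);
        return result;
    }
}
```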

Trying to populate a URL Type field with an invalid URL format produces an error

A dataset will allow (although grey out) an invalid URL, but DataSync produces an error message. My preference would be that if the portal itself allows the value, DataSync allow it as well. Dropping invalid values might be OK if necessary.

I am not in a position to give the specific example to reproduce the error since it is a private dataset under development (happy to provide the link to Adrian or someone else at Socrata but cannot post it quite this publicly) and I am changing the field to Plain Text in order to be able to proceed for now. However, I suspect it would be easy to reproduce with a test dataset. Leaving off the top level domain and using a comma instead of a period both create this error, although they are not the only conditions that produce it.
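
A sketch of the lenient validation being requested: accept anything that parses as a URI at all (a comma or a missing top-level domain still parses), and only drop values that cannot be parsed:

```java
// Sketch: lenient URL handling -- syntactic parse only, no TLD or
// reachability check, so values the portal accepts pass through.
import java.net.URI;
import java.net.URISyntaxException;

public class LenientUrlCheck {
    public static String normalizeOrDrop(String value) {
        try {
            new URI(value.trim()); // throws only on genuinely unparseable input
            return value;
        } catch (URISyntaxException e) {
            return null; // drop the value instead of failing the whole job
        }
    }
}
```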

Integrate DataPort into DataSync

DataPort is a command-line tool that allows you to transfer data from one Socrata dataset to another or duplicate an existing Socrata dataset. DataSync will gain a new job type called a Port job, where you can configure a job to do the things DataPort supports, as part of DataSync.

Implement "SmartUpdate" capability for highly efficient replace operations

SmartUpdate will allow customers to efficiently perform replace operations in DataSync on very large datasets (1 million+ rows). SmartUpdate works by having DataSync send a zipped CSV file with all data over FTP; the Socrata platform then determines which rows have been added/updated and publishes only those to the dataset. This leads to dramatic gains in efficiency/performance, so customers' uploads complete faster, and data will also be geocoded and indexed for searching much more quickly. SmartUpdate will enable using the "replace" method for datasets with upwards of 1 million rows, which should remove the need for customers to determine which rows have been added/updated/deleted and use upsert. Instead, it is as simple as dumping all the data into a CSV and uploading it through DataSync's "SmartUpdate".
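
A minimal sketch of the compression step the design describes, using java.util.zip; the FTP transfer itself is out of scope here, and the entry name is arbitrary:

```java
// Sketch: compress a CSV into a zip archive before transfer.
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class CsvZipper {
    public static void zipCsv(String csvPath, String zipPath) throws IOException {
        try (FileInputStream in = new FileInputStream(csvPath);
             ZipOutputStream out = new ZipOutputStream(new FileOutputStream(zipPath))) {
            out.putNextEntry(new ZipEntry("data.csv")); // arbitrary entry name
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            out.closeEntry();
        }
    }
}
```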

UnrecognizedPropertyException occurs when publishing via 'upsert' on some datasets

When running jobs to publish data to certain datasets (with specific sharing configurations) this error occurs when using 'upsert' in DataSync:

com.sun.jersey.api.client.ClientHandlerException: org.codehaus.jackson.map.exc.UnrecognizedPropertyException: Unrecognized field "userEmail" (Class com.socrata.model.importer.Grant), not marked as ignorable

This is a bug in the soda-java library which was fixed in version 0.9.4.
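
For older soda-java versions, a workaround along these lines (configuring the Codehaus Jackson 1.x ObjectMapper to ignore unknown fields such as "userEmail") avoids the failure; upgrading to soda-java 0.9.4+ is the real fix:

```java
// Sketch: build a Jackson 1.x ObjectMapper that tolerates unrecognized
// properties instead of throwing UnrecognizedPropertyException.
import org.codehaus.jackson.map.DeserializationConfig;
import org.codehaus.jackson.map.ObjectMapper;

public class LenientMapperFactory {
    public static ObjectMapper create() {
        ObjectMapper mapper = new ObjectMapper();
        mapper.configure(DeserializationConfig.Feature.FAIL_ON_UNKNOWN_PROPERTIES, false);
        return mapper;
    }
}
```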

Add the ability to specify a job (.sij) file to open when launching DataSync from the command line or a shortcut

This would allow for creating shortcuts that not only launch DataSync but also open a specified job file, similar to what you can do with many applications that open/edit files. I know you can specify a job file when running a job, but there could be situations where someone wants to open it before running it interactively.

It is possible this can already be done and I am just not figuring out how to do it.

Thanks.
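
A minimal sketch of the requested launch behavior; openJobInGui is a hypothetical hook, not DataSync's actual API:

```java
// Sketch: accept a .sij path as the first program argument and open it in the
// GUI instead of running it immediately.
import java.io.File;

public class LaunchWithJobFile {
    public static void main(String[] args) {
        if (args.length > 0 && args[0].endsWith(".sij")) {
            File jobFile = new File(args[0]);
            if (jobFile.exists()) {
                openJobInGui(jobFile); // hypothetical: load the saved job into the GUI
                return;
            }
        }
        // otherwise fall through to the normal GUI launch
    }

    private static void openJobInGui(File jobFile) {
        System.out.println("Would open job: " + jobFile.getAbsolutePath());
    }
}
```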
