pipeline-generator's People

Contributors

adyork, cschloer

Forkers

ashepherd adyork

pipeline-generator's Issues

Concatenate from list

We need a processor that takes a file list and creates all the steps needed to concatenate the files into one resource.

  • The default is to use the column names from the first file, with an option to add columns for all those present in any file in the list. Some files may have extra columns that others do not.
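The column handling described above could be sketched as follows. This is only an illustration, not an existing processor: the function names, the `res0`, `res1`, ... resource names, and the use of a `load` step per file are assumptions, and the `concatenate` parameters (`sources`, `target`, `fields`) should be checked against the datapackage-pipelines version in use.

```python
import csv


def read_header(path):
    """Return the first row of a CSV file as a list of column names."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle), [])


def merge_columns(headers, include_all=False):
    """Pick the target columns from a list of per-file headers.

    By default only the first file's columns are used. With include_all=True,
    columns found in later files are appended in the order they first appear.
    """
    columns = list(headers[0]) if headers else []
    if include_all:
        for header in headers[1:]:
            for column in header:
                if column not in columns:
                    columns.append(column)
    return columns


def build_steps(paths, columns, target_name="concatenated"):
    """Build hypothetical pipeline steps: one load per file, then a concatenate."""
    names = [f"res{index}" for index in range(len(paths))]
    steps = [
        {"run": "load", "parameters": {"from": path, "name": name}}
        for path, name in zip(paths, names)
    ]
    steps.append({
        "run": "concatenate",
        "parameters": {
            "sources": names,
            "target": {"name": target_name},
            # An empty list means no alternative source names for that column.
            "fields": {column: [] for column in columns},
        },
    })
    return steps
```

In use, `merge_columns([read_header(p) for p in paths], include_all=True)` would give the union of columns, and `build_steps(paths, columns)` the step list to write into the pipeline spec.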

Test the find/replace in dataflows to see if it also has the "None" replacement issue

If we decide to integrate dataflows into our pipelines, there would be three different ways to find and replace.

The approach we currently use, the pipeline's built-in processor, has an issue: blank cells that do not match the pattern are filled in with the string "None". Currently the only workaround is to run a second find and replace with the find pattern "None" (without quotes) and leave the replace-with field blank.

  1. Pipeline processor: has the "None" fill issue.
  2. dataflows "find_replace": unknown whether it has the same issue as the pipeline find and replace.
  3. Custom flow: if all else fails, we can write a custom find and replace (Python re). This seems silly to do, but it would work.
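For the third option, a custom find and replace could look roughly like the sketch below. It is a minimal illustration using only Python's `re` module; the function name and its parameters are hypothetical, and wiring it into a dataflows flow as a row step is left as an assumption to verify. The key point is that blank and `None` cells are skipped rather than converted to a string.

```python
import re


def find_replace_row(row, fields, pattern, replacement):
    """Return a copy of row with pattern replaced in the given fields.

    Values that are not strings (including None) and empty strings are left
    untouched, so blank cells never turn into the string "None".
    """
    compiled = re.compile(pattern)
    result = dict(row)
    for field in fields:
        value = result.get(field)
        if not isinstance(value, str) or value == "":
            continue
        result[field] = compiled.sub(replacement, value)
    return result
```

The same test (a row containing a blank cell and a non-matching cell) could be run against the dataflows "find_replace" step to answer the question in item 2.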

BCO-DMO specific metadata as an inline data resource

To fully leverage the "profile" property, let's make BCO-DMO specific metadata an inline data resource.
https://frictionlessdata.io/specs/data-resource/

Proposal

{
  ...
  "resources": [
  {
    "id": "http://datadocs.bco-dmo.org/submissiox/yz123",
     ...the description...
  },
  {
    "profile": "http://schema.bco-dmo.org/odo.json",
    "format": "json",
    "data": {
      "@context": {
        "odo": "http://ocean-data.org/schema/"
      },
      "@graph": [
        {
          "@type": "odo:Dataset",
          "@id": "http://datadocs.bco-dmo.org/submissiox/yz123"
        }
      ]
    }
  }]
  ...
}

The profile JSON file http://schema.bco-dmo.org/odo.json would help us validate the information required for ingest. For example: does the Dataset have a name, and are all resources in the data package described?
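The two example checks could be sketched in plain Python as below. This is not the profile itself, only an illustration of what it might enforce; the `name_key` default is a hypothetical property name, and matching a resource's "id" against the "@id" values in "@graph" is an assumption based on the proposal above.

```python
PROFILE = "http://schema.bco-dmo.org/odo.json"


def check_metadata(package, name_key="name"):
    """Return a list of problems found in a data package descriptor (a dict)."""
    resources = package.get("resources", [])
    metadata = [r for r in resources if r.get("profile") == PROFILE]
    if not metadata:
        return ["no inline metadata resource with the BCO-DMO profile"]

    graph = metadata[0].get("data", {}).get("@graph", [])
    described = {node.get("@id") for node in graph}
    problems = []

    # Check 1: every Dataset node has a name (name_key is hypothetical).
    for node in graph:
        if node.get("@type") == "odo:Dataset" and not node.get(name_key):
            problems.append(f"Dataset {node.get('@id')} has no {name_key}")

    # Check 2: every other resource is described by some node in the graph.
    for resource in resources:
        if resource.get("profile") == PROFILE:
            continue
        if resource.get("id") not in described:
            problems.append(f"resource {resource.get('id')} is not described")
    return problems
```

Run against the proposal as written, this would report only that the Dataset has no name, since the one data resource is described in the graph.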

Investigate usefulness of existing pipeline extensions

https://github.com/frictionlessdata?utf8=%E2%9C%93&q=datapackage-pipelines&type=&language=

Is there something in the pattern here that we should pivot towards?
Installation of custom processors into a Docker container:
https://github.com/frictionlessdata/datapackage-pipelines/blob/master/Dockerfile#L9

Example: datapackage-pipelines-aws provides aws.dump.to_s3
https://github.com/frictionlessdata/datapackage-pipelines-aws

QUESTION: how do we package our bcodmo_pipeline so that we can also include our custom processors in a Dockerfile, something like:

FROM frictionlessdata/datapackage-pipelines:1.7.1

COPY bcodmo_pipeline /bcodmo_pipeline
ENV DPP_PROCESSOR_PATH=/bcodmo_pipeline

????

@akariv says, "you can pip install [custom processors] in case these are datapackage-pipelines extension packages (such as datapackage-pipelines-aws).
If these are processors of your own, you should add them in the container (using Docker's ADD or COPY commands) and then set the DPP_PROCESSOR_PATH environment variable so that they become discoverable by dpp"

QUESTION: Do we make datapackage-pipelines-bcodmo_pipeline?
https://github.com/frictionlessdata/datapackage-pipelines#plugins-and-source-descriptors

Write flow for importing seabird data

What needs to be done in the flow

  • Read in the file(s).
  • Capture the comments (often containing station info or notes about casts) into add_metadata.
  • Maybe capture some of the calibration info?
  • If there is no header line, parse the Seabird way of including column info in the XML.
  • Skip the rest of the XML section.
  • Read the fixed-width tabular dataset that is left. Columns are fixed at 11 characters each.
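The steps above can be sketched as a small parser. This is a rough illustration under several assumptions about the file layout that are not stated in this issue: user comments start with `**`, column names appear in header lines of the form `# name 0 = prDM: ...`, and the header ends with a `*END*` line. Only the 11-character column width comes from the list above. The function name is hypothetical, and calibration info is not captured here.

```python
import re

# Assumed header line form: "# name <index> = <short name>: <description>"
NAME_RE = re.compile(r"^#\s*name\s+(\d+)\s*=\s*([^:]+)")
COLUMN_WIDTH = 11


def parse_seabird(lines, width=COLUMN_WIDTH):
    """Split Seabird-style lines into (comments, column names, data rows)."""
    comments, names, rows = [], {}, []
    in_header = True
    for raw in lines:
        line = raw.rstrip("\r\n")
        if in_header:
            if line.strip() == "*END*":
                in_header = False  # everything after this is tabular data
                continue
            match = NAME_RE.match(line)
            if match:
                names[int(match.group(1))] = match.group(2).strip()
            elif line.startswith("**"):
                comments.append(line[2:].strip())
            # All other header lines, including the XML section, are skipped.
            continue
        if not line.strip():
            continue
        # Slice the data line into fixed-width cells and trim the padding.
        rows.append([line[i:i + width].strip() for i in range(0, len(line), width)])
    columns = [names[index] for index in sorted(names)]
    return comments, columns, rows
```

The returned comments could then be passed to add_metadata, and the columns and rows turned into a resource by whichever flow step ends up loading the data.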
