bcodmo / pipeline-generator

Generates a pipeline .yml file

Python 100.00%

frictionlessdata data-management

pipeline-generator's People

ashepherd adyork

pipeline-generator's Issues

test unpivot/pivot from dataflows

Make a dataflow .py for pivot/unpivot and see if it will work for our identified use-cases.

If this works out we can add flows directly into the pipeline yamls.

Example custom flow called from a pipeline:
https://github.com/frictionlessdata/datapackage-pipelines#dataflows-integration
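
To make the use-case concrete before testing the dataflows processors, here is a minimal plain-Python sketch of what an unpivot step does. It does not call dataflows; the function name unpivot_rows, its parameters, and the sample column names are hypothetical and chosen only for illustration.

```python
def unpivot_rows(rows, unpivot_fields, key_name, value_name):
    """Turn wide rows into long rows: one output row per (input row, unpivoted field)."""
    for row in rows:
        # Columns that are not unpivoted are repeated on every output row.
        kept = {k: v for k, v in row.items() if k not in unpivot_fields}
        for field in unpivot_fields:
            if field in row:
                yield {**kept, key_name: field, value_name: row[field]}


# Hypothetical wide table: one column per depth.
wide = [
    {"station": "A", "depth_10m": 1.5, "depth_20m": 1.2},
    {"station": "B", "depth_10m": 2.1, "depth_20m": 1.9},
]
long_rows = list(unpivot_rows(wide, ["depth_10m", "depth_20m"], "depth", "value"))
```

Each input row becomes two output rows here, one per unpivoted column. A pivot is the reverse grouping and would need a key to group the long rows by.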

Add tests for v0.0.2

rename_fields
reorder_fields
fixed_width with load
https://github.com/BCODMO/pipeline-generator/releases/tag/v0.0.2

Using custom python processing with pipelines

Re: @mbiddle-bcodmo's request to include custom Python.

Investigate Dataflows and how they are incorporated into FDPs
https://github.com/frictionlessdata/datapackage-pipelines#dataflows-integration

Concatenate from list

We need a processor that takes a file list and creates all the steps needed to concatenate the files into one resource.

The default is to use the column names from the first file, with an option to add columns for all those present in any file in the list. Some files may have extra columns that others do not.
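
The two column behaviors described above could be sketched as follows, using only the standard library rather than the pipeline processors. The function name concatenate, the all_columns flag, and the sample data are assumptions made for illustration, not part of any existing processor.

```python
import csv
import io


def concatenate(sources, all_columns=False):
    """Concatenate CSV sources into (fieldnames, rows).

    sources: iterable of open text file objects, each with a header row.
    all_columns: False keeps only the first file's columns; True keeps the union.
    """
    fieldnames = []
    rows = []
    for index, src in enumerate(sources):
        reader = csv.DictReader(src)
        file_rows = list(reader)
        for name in reader.fieldnames or []:
            # Later files only contribute new columns when all_columns is set.
            if name not in fieldnames and (index == 0 or all_columns):
                fieldnames.append(name)
        rows.extend(file_rows)
    # Missing columns are filled with None; columns not kept are dropped.
    return fieldnames, [{name: row.get(name) for name in fieldnames} for row in rows]


def sample_files():
    # Hypothetical inputs: the second file has an extra "salinity" column.
    return [
        io.StringIO("station,temp\nA,10.1\n"),
        io.StringIO("station,temp,salinity\nB,9.8,35.2\n"),
    ]


default_names, default_rows = concatenate(sample_files())
union_names, union_rows = concatenate(sample_files(), all_columns=True)
```

With the default, the extra salinity column is dropped; with all_columns=True it is kept, and rows from files that lack it get an empty (None) value.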

Test the find/replace in dataflows to see if it also has the "None" replacement issue

If we decide to integrate dataflows into our pipelines there could potentially be three different ways to find and replace.

The approach we currently use, the pipeline's built-in processor, has an issue: blank cells that do not match the pattern are filled with the string "None". Currently the only workaround is to do a second find and replace with the find pattern "None" (without quotes) and the replace-with field left blank.

pipeline processor - has the "None" fill issue.
dataflows "find_replace" - unknown whether it has the same issue as the pipeline find and replace.
custom flow - if all else fails, we can write a custom find and replace (Python re), which seems silly to do, but if it works then ....
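
The third option could look roughly like the sketch below. It assumes the "None" fill comes from an empty cell (Python None) being converted to a string before substitution, which this issue does not confirm; the function name find_replace and the sample rows are hypothetical.

```python
import re


def find_replace(rows, field, pattern, replacement):
    """Apply a regex substitution to one field, leaving empty cells empty."""
    compiled = re.compile(pattern)
    for row in rows:
        value = row.get(field)
        # Only substitute on real strings; None stays None instead of
        # being stringified to "None".
        if isinstance(value, str):
            row = {**row, field: compiled.sub(replacement, value)}
        yield row


# Hypothetical rows: a no-data marker, a blank cell, and a real value.
rows = [{"lat": "nd"}, {"lat": None}, {"lat": "41.5"}]
cleaned = list(find_replace(rows, "lat", r"^nd$", ""))
```

The blank cell passes through unchanged, so no second find and replace for "None" is needed.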

BCO-DMO specific metadata as an inline data resource

To fully leverage the "profile" property, let's make BCO-DMO specific metadata an inline data resource.
https://frictionlessdata.io/specs/data-resource/

Proposal

{
  ...
  "resources": [
  {
    "id": "http://datadocs.bco-dmo.org/submissiox/yz123",
     ...the description...
  },
  {
    "profile": "http://schema.bco-dmo.org/odo.json",
    "format": "json",
    "data": {
      "@context": {
        "odo": "http://ocean-data.org/schema/"
      },
      "@graph": [
        {
          "@type": "odo:Dataset",
          "@id": "http://datadocs.bco-dmo.org/submissiox/yz123"
        }
      ]
    }
  }]
  ...
}

The profile JSON file http://schema.bco-dmo.org/odo.json would help us validate the information required for ingest. For example: does the Dataset have a name, and are all resources in the data package described?
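
As an illustration of the two example checks, here is a small plain-Python sketch. The real profile would presumably express these rules in the profile file itself; the function name check_descriptor and the "name" key on the Dataset node are assumptions, since the proposal does not say where the name lives.

```python
def check_descriptor(descriptor):
    """Return a list of problems found in a data package descriptor (a dict)."""
    problems = []
    resources = descriptor.get("resources", [])
    # Collect the @graph nodes from every inline JSON resource.
    graph = []
    for resource in resources:
        data = resource.get("data")
        if isinstance(data, dict):
            graph.extend(data.get("@graph", []))
    datasets = [node for node in graph if node.get("@type") == "odo:Dataset"]
    if not datasets:
        problems.append("no odo:Dataset node found")
    for node in datasets:
        # Assumption: the Dataset name is stored under a "name" key.
        if not node.get("name"):
            problems.append("Dataset %s has no name" % node.get("@id"))
    # Every resource with an id should be described by some node in the graph.
    described = {node.get("@id") for node in graph}
    for resource in resources:
        if "id" in resource and resource["id"] not in described:
            problems.append("resource %s is not described" % resource["id"])
    return problems


descriptor = {
    "resources": [
        {"id": "http://datadocs.bco-dmo.org/submissiox/yz123"},
        {
            "profile": "http://schema.bco-dmo.org/odo.json",
            "format": "json",
            "data": {
                "@context": {"odo": "http://ocean-data.org/schema/"},
                "@graph": [
                    {
                        "@type": "odo:Dataset",
                        "@id": "http://datadocs.bco-dmo.org/submissiox/yz123",
                    }
                ],
            },
        },
    ]
}
problems = check_descriptor(descriptor)
```

For the descriptor from the proposal, the only problem reported is the missing Dataset name; the resource itself is described because its id matches the @id of a graph node.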

Investigate usefulness of existing pipeline extensions

https://github.com/frictionlessdata?utf8=%E2%9C%93&q=datapackage-pipelines&type=&language=

Is there something in the pattern here that we should pivot towards?
Installation of custom processors into a Docker container:
https://github.com/frictionlessdata/datapackage-pipelines/blob/master/Dockerfile#L9

Example: datapackage-pipelines-aws provides aws.dump.to_s3
https://github.com/frictionlessdata/datapackage-pipelines-aws

QUESTION: how do we package our bcodmo_pipeline so that we can also include our custom processors in a Dockerfile, something like:

FROM frictionlessdata/datapackage-pipelines:1.7.1

COPY bcodmo_pipeline /bcodmo_pipeline
ENV DPP_PROCESSOR_PATH=/bcodmo_pipeline

????

@akariv says, "you can pip install [custom processors] in case these are datapackage-pipelines extension packages (such as datapackage-pipelines-aws).
If these are processors of your own, you should add them in the container (using Docker's ADD or COPY commands) and then set the DPP_PROCESSOR_PATH environment variable so that they become discoverable by dpp"

QUESTION: Do we make datapackage-pipelines-bcodmo_pipeline ?
https://github.com/frictionlessdata/datapackage-pipelines#plugins-and-source-descriptors

Write flow for importing seabird data

What needs to be done in the flow

Read in the file(s).
Capture the comments (often containing station info or notes about casts) into add_metadata.
Maybe capture some of the calibration info?
If there is no header line, parse the column info that Seabird includes in the XML.
Skip the rest of the XML section.
Read the fixed-width tabular dataset that is left. Columns are fixed at 11 characters each.
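
The steps above can be sketched in plain Python as follows. This is a rough sketch under assumptions about the file layout that should be checked against real files: user comments start with "**", column names appear on "# name N = short: long" lines, and the header ends at a "*END*" line. The function name parse_seabird and the sample lines are hypothetical; only the 11-character column width comes from this issue.

```python
def parse_seabird(lines, width=11):
    """Split a Seabird-style file into comments, column names, and data rows."""
    comments, names, rows = [], [], []
    in_header = True
    for line in lines:
        line = line.rstrip("\n")
        if in_header:
            if line.strip() == "*END*":
                in_header = False
            elif line.startswith("# name "):
                # "# name 0 = prDM: Pressure [db]" -> "prDM"
                names.append(line.split("=", 1)[1].split(":", 1)[0].strip())
            elif line.startswith("**"):
                comments.append(line[2:].strip())
            # All other header lines, including the XML block, are skipped.
            continue
        if not line.strip():
            continue
        # Slice the data line into fixed-width cells.
        cells = [line[i:i + width].strip() for i in range(0, len(line), width)]
        rows.append(dict(zip(names, cells)))
    return {"comments": comments, "names": names, "rows": rows}


sample = [
    "* Sea-Bird SBE 9 Data File:",
    "** Station: 12",
    "# name 0 = prDM: Pressure [db]",
    "# name 1 = t090C: Temperature [deg C]",
    '# <Sensors count="2">',
    "# </Sensors>",
    "*END*",
    "      1.000    12.3456",
    "      2.000    12.3001",
]
parsed = parse_seabird(sample)
```

The comments list is what would be passed to add_metadata; calibration info is not captured in this sketch and values are left as strings.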
