Intake Pattern Catalog

Available on PyPI.

intake-pattern-catalog is a plugin for Intake that lets you specify a single file-path pattern which expands into a number of different catalog entries, one for each file that matches the pattern.

Note that this differs from the patterns you can write with the csv driver, which are turned into a single entry.

Installation instructions

pip install intake-pattern-catalog
# or
conda install intake-pattern-catalog

Usage

Set driver: pattern_cat on a source to use this plugin in your catalogs.

Consider the following list of files in an S3 bucket:

  • bucket-name/folder/a_1.csv
  • bucket-name/folder/b_1.csv
  • bucket-name/folder/c_1.csv
  • bucket-name/folder/a_2.csv
  • bucket-name/folder/b_2.csv

And the following catalog definition YAML file:

---
metadata:
  version: 1
sources:
  stuff:
    description: Stuff and things
    driver: pattern_cat
    args:
      urlpath: "s3://bucket-name/folder/{foo}_{bar}.csv"
      driver: csv
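
To illustrate how a pattern such as the urlpath above can map file paths to entry kwargs, here is a minimal sketch in plain Python. It is not the plugin's implementation, which may parse patterns differently; the helper name pattern_to_regex and the rule that a field matches any run of characters other than "/" are assumptions made for this example.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn '{name}' placeholders into named capture groups (hypothetical helper)."""
    # re.split with one capture group alternates literal text and field names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)  # literal text between placeholders
        else:
            regex += f"(?P<{part}>[^/]+)"  # assumed: a field never spans a "/"
    return re.compile(f"^{regex}$")


pattern = "s3://bucket-name/folder/{foo}_{bar}.csv"
files = [
    "s3://bucket-name/folder/a_1.csv",
    "s3://bucket-name/folder/b_1.csv",
    "s3://bucket-name/folder/c_1.csv",
    "s3://bucket-name/folder/a_2.csv",
    "s3://bucket-name/folder/b_2.csv",
]

matcher = pattern_to_regex(pattern)
kwarg_sets = []
for path in files:
    match = matcher.match(path)
    if match is not None:  # paths that do not fit the pattern are skipped
        kwarg_sets.append(match.groupdict())

print(kwarg_sets)
```

Each matching file yields one dictionary of string values, for example foo "a" and bar "1" for a_1.csv.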

Derived datasets

If you would like to create a derived dataset based on a pattern_cat dataset, you can use driver: pattern_cat_transform, which applies a transformation function to each entry returned by get_entry. For example, you can add the following source to the YAML file above:

  stuff_transformed:
    description: Everything in stuff, doubled
    driver: pattern_cat_transform
    args:
      targets:
        - stuff
      transform: "path.to.doubling_function"
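
The function named by transform has to be importable from the dotted path given in the catalog. As a rough sketch, path.to.doubling_function could look like the following. The signature is an assumption: this example supposes the function receives the loaded data for one entry (for instance a DataFrame or any object that supports multiplication by a number) and returns the transformed data. Check the plugin's source for the exact calling convention.

```python
def doubling_function(data):
    """Return the input with every value doubled.

    Assumed signature: one positional argument holding the loaded data.
    Works for any object that supports `* 2` element-wise, such as a
    numeric DataFrame or array.
    """
    return data * 2
```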

Catalog API

Access entry by kwargs:

> catalog.stuff.get_entry(foo='a', bar=1)
sources:
  foo_a_bar_1:
    args:
      storage_options:
        use_listings_cache: false
      urlpath: s3://bucket-name/folder/a_1.csv
    description: ''
    driver: intake.source.csv.CSVSource
    metadata:
      catalog_dir: ...

Note that this entry could also be accessed as catalog.stuff.foo_a_bar_1.

See all valid kwarg combinations:

> catalog.stuff.get_entry_kwarg_sets()
[
    {"foo": "a", "bar": "1"},
    {"foo": "b", "bar": "1"},
    {"foo": "c", "bar": "1"},
    {"foo": "a", "bar": "2"},
    {"foo": "b", "bar": "2"},
]
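
The kwarg sets are plain dictionaries, so they can be filtered or turned into entry names with ordinary Python. The sketch below works on the literal list shown above rather than a live catalog. The naming rule in entry_name is inferred from the single foo_a_bar_1 example and is an assumption, not a documented guarantee.

```python
kwarg_sets = [
    {"foo": "a", "bar": "1"},
    {"foo": "b", "bar": "1"},
    {"foo": "c", "bar": "1"},
    {"foo": "a", "bar": "2"},
    {"foo": "b", "bar": "2"},
]


def entry_name(kwargs: dict) -> str:
    """Join key/value pairs with underscores (assumed naming rule)."""
    return "_".join(f"{key}_{value}" for key, value in kwargs.items())


# Names for every valid combination, in the order listed above.
names = [entry_name(kwargs) for kwargs in kwarg_sets]

# Only the combinations where bar is "1" (values are strings here).
bar_one = [kwargs for kwargs in kwarg_sets if kwargs["bar"] == "1"]

print(names)
print(bar_one)
```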

Caching

The default way to control caching with a pattern catalog is a ttl (in seconds). This is an optional value under args that specifies how long the catalog waits, after fetching the list of files that match the pattern, before it fetches the list again. The default ttl is 60 seconds. To force the catalog to always fetch the latest list of available entries, set the ttl to 0.
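
The ttl behaviour described above can be sketched as a small time-based cache. This illustrates the general technique only and is not the plugin's code. The class name TTLListing, the list_files callable, and the injectable clock are all hypothetical names introduced for this example.

```python
import time
from typing import Callable, List, Optional


class TTLListing:
    """Cache a file listing and refresh it once the ttl has elapsed."""

    def __init__(
        self,
        list_files: Callable[[], List[str]],
        ttl: float = 60,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._list_files = list_files
        self._ttl = ttl
        self._clock = clock
        self._files: List[str] = []
        self._fetched_at: Optional[float] = None

    def get(self) -> List[str]:
        now = self._clock()
        # With ttl == 0 the elapsed time is always >= ttl, so every call refreshes.
        expired = self._fetched_at is None or (now - self._fetched_at) >= self._ttl
        if expired:
            self._files = self._list_files()
            self._fetched_at = now
        return self._files


# Demonstration with a fake clock so no real waiting is needed.
calls = []
fake_now = [0.0]


def fake_list_files() -> List[str]:
    calls.append(fake_now[0])  # record when a real listing happened
    return ["a_1.csv", "b_1.csv"]


listing = TTLListing(fake_list_files, ttl=60, clock=lambda: fake_now[0])
listing.get()       # first call fetches the listing
listing.get()       # still within the ttl, served from the cache
fake_now[0] = 61.0
listing.get()       # ttl elapsed, listing is fetched again
print(len(calls))
```

Passing the clock in as a parameter keeps the sketch testable. A real implementation could simply call time.monotonic() directly.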
