
gluish's Introduction

Gluish

Note: v0.2.X cleans up some cruft from v0.1.X. v0.2.X still passes the same tests as v0.1.X, but removes a lot of functionality unrelated to luigi. Please check what you depend on before you upgrade.

Luigi 2.0 compatibility: gluish 0.2.3 or higher.

Note that luigi dropped Python 2 support, and so does this package, starting with 0.3.0.


Project Status: Active. The project has reached a stable, usable state and is being actively developed.

Some glue around luigi.

Provides a base class that autogenerates its output filenames based on:

  • some base path,
  • a tag,
  • the task id (the classname and the significant parameters)

Additionally, this package provides a few smaller utilities, like a TSV format, a benchmarking decorator and some task templates.

This project has been developed for Project finc at Leipzig University Library.

A basic task that knows its place

gluish.task.BaseTask is intended to be used as a supertask.

from gluish.task import BaseTask
import datetime
import luigi
import tempfile

class DefaultTask(BaseTask):
    """ Some default abstract task for your tasks. BASE and TAG determine
    the paths, where the artefacts will be stored. """
    BASE = tempfile.gettempdir()
    TAG = 'just-a-test'

class RealTask(DefaultTask):
    """ Note that this task has a `self.path()`, that figures out the full
    path for this class' output. """
    date = luigi.DateParameter(default=datetime.date(1970, 1, 1))
    def run(self):
        with self.output().open('w') as output:
            output.write('Hello World!')

    def output(self):
        return luigi.LocalTarget(path=self.path())

When a RealTask is instantiated, it is automatically assigned a structured output path consisting of BASE, TAG, the task name and a slugified version of the significant parameters.

task = RealTask()
task.output().path
# would be something like this on OS X:
# /var/folders/jy/g_b2kpwx0850/T/just-a-test/RealTask/date-1970-01-01.tsv
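
How such a path might be assembled can be sketched in plain Python. This is only an illustration under assumptions: `build_path` and its arguments are hypothetical names, the `.tsv` extension is taken from the example output above, and the real logic lives in `BaseTask.path()`.

```python
import datetime
import os


def build_path(base, tag, name, params, ext='tsv'):
    """Join base, tag and task name with a slug built from the parameters.

    Hypothetical helper for illustration; not part of gluish.
    """
    # Sort the parameters so the same task always maps to the same file name.
    slug = '-'.join('{0}-{1}'.format(k, v) for k, v in sorted(params.items()))
    return os.path.join(base, tag, name, '{0}.{1}'.format(slug, ext))


path = build_path('/tmp', 'just-a-test', 'RealTask',
                  {'date': datetime.date(1970, 1, 1)})
print(path)
```

Because the path is derived only from the class and its parameters, two instances with the same parameters point at the same file, which is what lets luigi decide whether a task is already complete.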

A TSV format

This started as a discussion on the mailing list. Continuing the example from above, let's create a task named TabularSource that generates TSV files.

from gluish.format import TSV

class TabularSource(DefaultTask):
    date = luigi.DateParameter(default=datetime.date(1970, 1, 1))
    def run(self):
        with self.output().open('w') as output:
            for i in range(10):
                output.write_tsv(i, 'Hello', 'World')

    def output(self):
        return luigi.LocalTarget(path=self.path(), format=TSV)

Another class, TabularConsumer, can use iter_tsv on the handle obtained by opening the file. Each row is a tuple or, if cols is specified, a collections.namedtuple.

class TabularConsumer(DefaultTask):
    date = luigi.DateParameter(default=datetime.date(1970, 1, 1))
    def requires(self):
        return TabularSource()

    def run(self):
        with self.input().open() as handle:
            for row in handle.iter_tsv(cols=('id', 'greeting', 'greetee')):
                print('{0} {1}!'.format(row.greeting, row.greetee))

    def complete(self):
        return False
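
The row handling described above can be sketched without luigi. This is a stand-alone approximation, not gluish's implementation: here `iter_tsv` is a hypothetical free function over an iterable of text lines, whereas in gluish it is a method on the opened handle.

```python
import collections


def iter_tsv(lines, cols=None):
    """Yield one row per line: a plain tuple, or a namedtuple if cols is given."""
    record = collections.namedtuple('Record', cols) if cols else None
    for line in lines:
        # Drop the trailing newline, then split on tabs.
        values = line.rstrip('\n').split('\t')
        yield record._make(values) if record else tuple(values)


lines = ['0\tHello\tWorld\n', '1\tHello\tWorld\n']
rows = list(iter_tsv(lines, cols=('id', 'greeting', 'greetee')))
print('{0} {1}!'.format(rows[0].greeting, rows[0].greetee))
```

Note that all values come back as strings; any type conversion is left to the consumer.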

Easy shell calls

Leverage command line tools with gluish.utils.shellout. shellout will take a string argument and will format it according to the keyword arguments. The {output} placeholder is special, since it will be automatically assigned a path to a temporary file, if it is not specified as a keyword argument.

The return value of shellout is the path to the {output} file.

Spaces in the given string are normalized, unless preserve_whitespace=True is passed. A literal curly brace can be inserted by {{ and }} respectively.

An exception is raised whenever the command exits with a non-zero return value.

Note: If you want to make sure an executable is available on your system before the task runs, you can use a gluish.common.Executable task as a requirement.

from gluish.common import Executable
from gluish.utils import shellout
import luigi

class GIFScreencast(DefaultTask):
    """ Given a path to a screencast .mov, generate a GIF
        which is funnier by definition. """
    filename = luigi.Parameter(description='Path to a .mov screencast')
    delay = luigi.IntParameter(default=3)

    def requires(self):
        return [Executable(name='ffmpeg'),
                Executable(name='gifsicle', message='http://www.lcdf.org/gifsicle/')]

    def run(self):
        output = shellout("""ffmpeg -i {infile} -s 600x400
                                    -pix_fmt rgb24 -r 10 -f gif - |
                             gifsicle --optimize=3 --delay={delay} > {output} """,
                             infile=self.filename, delay=self.delay)
        luigi.LocalTarget(output).move(self.output().path)

    def output(self):
        return luigi.LocalTarget(path=self.path())
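
The behaviour described in this section (keyword formatting, the automatic {output} file, whitespace normalization, failing on a non-zero exit) can be approximated with the standard library. The sketch below is an assumption about how such a helper could work, not gluish's source; `render_command` and `shellout_sketch` are hypothetical names.

```python
import os
import re
import subprocess
import tempfile


def render_command(template, preserve_whitespace=False, **kwargs):
    """Fill in the template; allocate a temporary file for {output} if needed."""
    if 'output' not in kwargs:
        handle, path = tempfile.mkstemp(prefix='shellout-sketch-')
        os.close(handle)
        kwargs['output'] = path
    command = template.format(**kwargs)
    if not preserve_whitespace:
        # Collapse newlines and runs of spaces into single spaces.
        command = re.sub(r'\s+', ' ', command).strip()
    return command, kwargs['output']


def shellout_sketch(template, preserve_whitespace=False, **kwargs):
    """Run the rendered command in a shell and return the output path."""
    command, output = render_command(template, preserve_whitespace, **kwargs)
    # check_call raises CalledProcessError on a non-zero exit status.
    subprocess.check_call(command, shell=True)
    return output


command, output = render_command("""awk '{{print $1}}' {infile}
                                     > {output}""", infile='in.tsv', output='out.tsv')
print(command)
```

Splitting rendering from execution keeps the string handling easy to inspect before anything is run.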

Dynamic date parameter

Sometimes the effective date for a task needs to be determined dynamically.

Consider for example a workflow involving an FTP server.

A data source is fetched from FTP, but it is not known, when updates are supplied. So the FTP server needs to be checked in regular intervals. Dependent tasks do not need to be updated as long as there is nothing new on the FTP server.

To map an arbitrary date to the closest date in the past on which an update occurred, you can use a gluish.parameter.ClosestDateParameter. It is an ordinary DateParameter, but it invokes task.closest() behind the scenes to figure out the effective date.

from gluish.parameter import ClosestDateParameter
import datetime
import luigi

class SimpleTask(DefaultTask):
    """ Reuse DefaultTask from above """
    date = ClosestDateParameter(default=datetime.date.today())

    def closest(self):
        # invoke dynamic checks here ...
        # for simplicity, map this task to the last monday
        return self.date - datetime.timedelta(days=self.date.weekday())

    def run(self):
        with self.output().open('w') as output:
            output.write("It's just another manic Monday!")

    def output(self):
        return luigi.LocalTarget(path=self.path())

A short, self-contained example can be found in this gist.
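
In practice a closest() method often boils down to a lookup like the following sketch. It assumes a hypothetical sorted list of known update dates (for instance, gathered from an FTP listing); `closest_update` is an illustrative helper and not part of gluish.

```python
import bisect
import datetime


def closest_update(date, update_dates):
    """Return the latest date in update_dates that is not after date.

    update_dates must be sorted in ascending order.
    """
    index = bisect.bisect_right(update_dates, date)
    if index == 0:
        raise ValueError('no update on or before {0}'.format(date))
    return update_dates[index - 1]


updates = [datetime.date(2015, 1, 5), datetime.date(2015, 2, 2)]
print(closest_update(datetime.date(2015, 1, 20), updates))
```

Every date between two updates maps to the same effective date, so dependent tasks resolve to the same output path and are not rerun until something new arrives.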

gluish's People

Contributors

kutschkem, miku, zazi


gluish's Issues

shellout broken

I am on Windows, and just went through a long session of debugging a shellout call, which failed because of "File not found" problems. The problem, I think, is that the command is given to subprocess.call as one big blob, which I think gets interpreted as the command file name. This problem appears on Windows, at the very least with cygwin.

list parameters are not slugified

Gluish is a great addition to luigi! I have noticed that list parameters are not properly slugified. For example, a task that takes a list like ['cd', 'statesd', 'dma', 'county'] resulted in the following filename: geo-["'cd'", "'statesd'", "'dma'", "'county'"]-state-nd.h5.

I think this might be rectified by adding a delist function:

def delist(x):
    if type(x) is list:
        return '-'.join(sorted(x))
    else:
        return x

parts = ('{k}-{v}'.format(k=k, v=delist(v))
                     for k, v in task_params.iteritems())

or something like this:

parts = []
for k, v in task_params.iteritems():
    if type(v) == list:
        v = '-'.join(sorted(v))
    parts.append('{k}-{v}'.format(k=k, v=v))
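
For Python 3, where `iteritems()` no longer exists, the same idea might look like the sketch below. `task_params` here is a made-up dictionary standing in for the task's parameters, and this has not been tested against gluish itself.

```python
def delist(value):
    """Turn a list or tuple into a dash-joined string; pass other values through."""
    if isinstance(value, (list, tuple)):
        # str() keeps the join working for non-string items such as ints.
        return '-'.join(sorted(str(item) for item in value))
    return value


task_params = {'geo': ['cd', 'statesd', 'dma', 'county'], 'state': 'nd'}
parts = ['{k}-{v}'.format(k=k, v=delist(v))
         for k, v in sorted(task_params.items())]
print('-'.join(parts))
```

Sorting the list items makes the file name independent of the order in which the values were passed, which may or may not be what a given task wants.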

Migration from Goodtables to Frictionless Repository

Hi,

Goodtables.io is going to be deprecated in 2022, we, therefore, recommend migrating to the new Frictionless Repository (https://repository.frictionlessdata.io/) continuous data validation system provided by Frictionless Data. The core difference between the two projects is that Frictionless Repository doesn't rely on any hosted infrastructure except for Github Actions which makes this project more sustainable. Also, it uses a newer Frictionless Framework under the hood that brought many improvements over the old goodtables-py library in terms of validation quality and performance.

If you have any doubts or questions, please come and ask in our Discord chat or in the GitHub Discussion.

Encapsulate __init__ imports in try/catch (individually)

On Windows, it is an incredible pain to install sqlitebck. Because of that, I opted for installing gluish without dependencies, and just use the parts that I needed.

However, this cherry-picking approach does not work because you import all the sub-modules in the __init__.py. This makes the import fail no matter which parts I actually want to use, because all the dependencies have to be present.

I have two alternative solutions:

Solution 1: Don't import in __init__.py like you do.
Solution 2: Make the imports conditional by enclosing them in try/catch like here: http://stackoverflow.com/a/3496790/1319284. This means convenient imports if the dependencies are there, no convenient imports otherwise.
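
Solution 2 could look roughly like this sketch. The helper `optional_import` is hypothetical, and the module names are placeholders rather than gluish's actual sub-modules.

```python
import importlib


def optional_import(name):
    """Return the imported module, or None if it (or a dependency) is missing."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# A module that exists is returned; a missing one yields None instead of
# breaking the import of the whole package.
available = optional_import('json')
missing = optional_import('surely_not_installed_module_xyz')
print(available is not None, missing is None)
```

Each import gets its own try block, so one missing dependency only disables the part of the package that needs it.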

shellout broken when using json-strings and .format() style because of internal double-format

e.g.
cmd="esbulk -server {server} -purge -mapping '{"mappings":{"{type}":{"properties":{"location":{"type":"geo_point"}}}}}' -index {index} -type {type} -w {workers} -id id -verbose {file}.ldj".format(**self.config)
output=shellout(cmd)

leads to:

Traceback (most recent call last):
File "/home/metadata/.local/lib/python3.5/site-packages/luigi/worker.py", line 194, in run
new_deps = self._run_get_new_deps()
File "/home/metadata/.local/lib/python3.5/site-packages/luigi/worker.py", line 131, in _run_get_new_deps
task_gen = self.task.run()
File "/home/metadata/git/efre-lod-elasticsearch-tools/luigi/update_gn.py", line 91, in run
cmd="esbulk -server {server} -purge -mapping '{"mappings":{"{type}":{"properties":{"location":{"type":"geo_point"}}}}}' -index {index} -type {type} -w {workers} -id id -verbose {file}.ldj".format(**self.config)
KeyError: '"mappings"'

escaping the braces doesn't work either
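
The KeyError comes from str.format treating the JSON braces as replacement fields. The sketch below shows two ways around that in plain Python: pass the JSON as a value instead of writing it into the template, and, assuming the command is formatted a second time downstream (as the issue title suggests), double the braces in the value first. The index name, the mapping type and `escape_braces` are made up for illustration; whether this resolves the problem with shellout itself is untested.

```python
import json


def escape_braces(value):
    """Double every brace so the value survives one extra str.format pass."""
    return value.replace('{', '{{').replace('}', '}}')


mapping = json.dumps(
    {'mappings': {'doc': {'properties': {'location': {'type': 'geo_point'}}}}})
template = "esbulk -mapping '{mapping}' -index {index}"

# First pass: the JSON is inserted as a value, so its braces are not parsed.
once = template.format(mapping=mapping, index='gn')

# Second pass over the escaped variant turns '{{' and '}}' back into braces.
twice = template.format(mapping=escape_braces(mapping), index='gn').format()
print(twice == once)
```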

0.2 ideas / cleanup

gluish 0.1.X contains both luigi-related and non-luigi-related code. For 0.2 only luigi-related code should stay in the package:

Keep:

  • task
  • format (TSV)
  • intervals
  • database helpers (seems generic enough)

Cleanup:

  • parameters (keep ClosestDateParameter)
  • utils (keep shellout)

Remove:

  • benchmark (timing decorators are simple)
  • colors
  • common
  • configuration
  • esindex (it's in luigi.contrib)
  • oai
  • path

Python3 TSV unicode seems broken

I get an error when trying to use iter_tsv() from python3:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/luigi/worker.py", line 194, in run
    new_deps = self._run_get_new_deps()
  File "/usr/local/lib/python3.5/dist-packages/luigi/worker.py", line 131, in _run_get_new_deps
    task_gen = self.task.run()
  File "/home/bnewbold/DEBUG/luigi_utf8/small.py", line 21, in run
    cols=('col1', 'col2', 'col3')):
  File "/usr/local/lib/python3.5/dist-packages/gluish-0.2.8-py3.5.egg/gluish/format.py", line 86, in iter_tsv
    yield Record._make(str(line).rstrip('\n').split('\t'))
  File "<string>", line 21, in _make
TypeError: Expected 3 arguments, got 1

Running the gluish tests (particularly format_test.FormatTest) fails with the same error under python3 (nosetests3). Is Python3 (in my particular case, python3.5) intended to be supported? I can see there was a recent commit touching these lines, maybe I'm doing something wrong.

Minimal test case, and a patch that fixes things for python3 (but presumably breaks 2.7): https://gist.github.com/bnewbold/8919a20b1f01532b0da4f1c5594c6a05
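
The "got 1" in the traceback is consistent with the line arriving as bytes: under Python 3, str() on a bytes object returns its repr, which contains no real tab characters. A small sketch, assuming that is what happens here (the sample line is made up):

```python
line = b'1\tHello\tWorld\n'

# str() on bytes gives the repr, e.g. "b'1\\tHello...'", so there is no
# actual tab to split on and the whole line ends up in a single field.
broken = str(line).rstrip('\n').split('\t')

# Decoding first yields real text, which splits into the expected columns.
fixed = line.decode('utf-8').rstrip('\n').split('\t')

print(len(broken), len(fixed))
```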
