lisad / phaser
library for batch-oriented complex data integration pipelines
License: MIT License
Not sure whether it's the Pipeline or the Context that should solve this, but client code that runs a pipeline ought to be able to gather all the warnings and other events at the end and access all their location and message info.
I started working on saving extra_outputs from the context. Initially I tried to do it from 'save' in Phase, but that got muddled, because Phase doesn't know what the pipeline's working directory is. Also if we put more stuff in Phase.save, then it gets harder to override.
I moved the command for a context to Pipeline, which is more satisfying, because the Pipeline can tell the Context which directory to save outputs to, and tell it to do so between Phases. However, now when a phase is run on its own, its extra outputs are not saved.
In general, the responsibilities for the following between Pipeline, Phase and Context are getting a bit muddled...
James recommended twine, and also possibly https://hatch.pypa.io/latest/
(setup.py is not preferred these days)
# LMDTODO: Add a flag which allows the user to say if line returns or tabs should be stripped from INSIDE
# values of this column. It's a frequent data error when data is edited or provided in Excel, to have
# a multiword value like "United\nKingdom" or "United \nKingdom". Yet we don't want to strip line returns
# from long text values. Also move this explanation to spec/docs
Right now we use pandas.read_csv, which supports "on_bad_lines='warn'", so we could use that to report more errors before stopping.
If we implement our own CSV reader, should do the same.
Would a money column have two values accessible - currency type and currency amount? Would it be able to take in values like "$150,000" and output two columns like "USD" and "150000"?
A simple MoneyColumn (to start with) would:
In the future we should probably have a CurrencyMoneyColumn that knows BOTH currency and value.
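A hypothetical sketch of the parsing half; the parse_money name and the symbol-to-code map are made up for illustration, and a real map would be far more complete:

```python
import re
from decimal import Decimal

# Hypothetical symbol-to-ISO-code map, just enough for the example.
CURRENCY_SYMBOLS = {'$': 'USD', '£': 'GBP', '€': 'EUR'}

def parse_money(raw):
    """Split a value like '$150,000' into ('USD', Decimal('150000'))."""
    match = re.match(r'^([^\d\s]*)\s*([\d,]+(?:\.\d+)?)$', raw.strip())
    if not match:
        raise ValueError(f"Cannot parse money value: {raw!r}")
    symbol, amount = match.groups()
    currency = CURRENCY_SYMBOLS.get(symbol) if symbol else None
    return currency, Decimal(amount.replace(',', ''))
```

A simple MoneyColumn could return only the amount; the CurrencyMoneyColumn would keep both parts.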
Data sets can have records that are not valuable for import. We need a way to be able to select out the records that are useful.
My first attempt was to make a @row_step that only returned rows matching a predicate, but with the assertion in the _row_step_wrapper that the row returned by step_function be a dict, that was not working. Returning an empty dict results in a file that has all of the rows, but with lots of them filled with NULL. I'd prefer the rows not exist at all.
Here is the @row_step function I wrote to evaluate the predicate:
def select(fn):
    @row_step
    def _select(phase, row):
        if fn(row) == True:
            return row
        return None
    return _select
If doing conversion with pandas read_csv:
'thousands' parameter default is None
'decimal' separator is '.'
In the following csv file, the manager_id for the second row gets passed into the cast function of IntColumn as the string "NULL", which causes a conversion error.
employeeNumber,firstName,lastName,payType,paidPer,payRate,bonusAmount,Status,department,manager_id
1,Benjamin,Sisko,"salary","Year","188625","30000",Active,Marketing,4
2,Kira,Nerys,"salary","Year","118625","20000",Active,Finance
,None,Garak,"salary","Year", 100000,,Inactive,Finance,
4,Rasma,Son,"salary","Year",230000,24000,Active,Marketing,
5,Aldina,Sharrow,"salary","Year",140000,18000,Active,Finance,2
6,Viktor,Matic,"salary","Year",180000,25000,Active,Finance,2
This bug shows up on main when clevercsv is used for reading and Pandas is used for writing. And it only shows up when passing data between phases, because Pandas is asked to write out "NULL" for empty columns.
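One way to avoid the conversion error, sketched here as a standalone tolerant cast (the null_values defaults are a guess at sensible behavior, not what IntColumn currently does):

```python
def cast_int(value, null_values=('', 'NULL', 'None', 'N/A')):
    """Tolerant integer cast: treat common null spellings as None
    instead of raising. Illustrative only."""
    if value is None or str(value).strip() in null_values:
        return None
    return int(str(value).strip())
```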
A MultiValueColumn needs a bit of configuration
During load, parse the list into its multiple values:
e.g. df['languages'] = df['languages'].str.split(',')
But we should also strip spaces off of values after the split, which this doesn't do
During save, rejoin with commas and enclose in double-quotes (only double-quotes work)
Use case: lat,long is a list of values in one column - they should both be validated to be floats
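The load/save halves could be sketched roughly like this (helper names are hypothetical; the lat,long use case is just a float-validation pass over the split values):

```python
def split_multi_value(raw, sep=','):
    """Parse 'en, fr , de' into ['en', 'fr', 'de'], stripping stray spaces."""
    return [part.strip() for part in raw.split(sep)]

def join_multi_value(values, sep=','):
    """Rejoin for saving; the CSV writer must then double-quote the field."""
    return sep.join(str(v) for v in values)

def validate_floats(values):
    """For the lat,long use case: every element must parse as a float."""
    return [float(v) for v in values]
```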
Conventions to use or research:
input - to do interactive querying of the user along the way
Command-line app features (not to be implemented all at once with this feature):
What features should columns have independent of data type?
"to_save" -- defaults to True, but if False, this column won't be saved by the Phase and won't go on to the next phase. This control can be used to figure out which columns to put in the to_csv 'columns' param
"input_name" vs "output_name" (or call these something different) - often column names stay consistent through a phase, but when they don't, it's easiest to say here how the output name should differ
Probably there should be flags to override the automatic cleanup
This test demonstrates the problem:
@pytest.mark.skip("This is annoying. It's easy for CSVs to have quotes and spaces, but the combination is bad.")
def test_load_quoted_values(tmpdir):
    source = tmpdir / 'quoted_values.csv'
    with open(source, 'w') as f:
        # Importantly, Active is quoted AND has a space before it.
        f.write("""id,name,status\n1,"Jean-Luc Picard", "Active"\n""")
    phase = Phase()
    phase.load(source)
    # If values are not stripped of space and converted, status = ' "Active"'
    assert phase.row_data[0]['status'] == "Active"
If this isn't fixed, it's all too easy to have a bug in the pipeline. I encountered this in the employees.csv and employees pipeline work, where paidPer = ' "Year"' instead of 'Year', so the code failed to calculate salary correctly.
Currently, we call phase() to instantiate one in setup_phases, but the constructor of a Phase requires a positional argument for its name. Make that happen.
We'll need some tests for this.
If the pipeline errors, the exception should say which phase the error occurred in.
(Coming out of the phase, the exception should already say which step or part of the phase the error occurred in.)
Right now process_exception just accepts DropRowException.
Also add a test that this can't be done.
So far we have errors like
ERROR row: 0, message: 'KeyError raised ('payRate')'
This is an error in the logic of the code that implements the steps, or perhaps the user should have made the pay rate column required. Anyway, it would sure help if exceptions raised like this had the file/line info.
Current approach, temporarily used for simplicity, is to convert the row_data object (a list of dicts) to a pandas DataFrame in the save method. This makes it harder for folks to override the save method.
To take it out of the 'save' method, here are some alternative approaches.
Data model options
How to save the data as attributes of the instance between steps
CANONICAL - always save as row_data. DataFrame format can be asked for which is what's done inside save(). DataFrame data must be converted back to row_data format at the end of or after the step.
BOTH versions always saved and correct. Slower.
LAZY: Either version could be saved - whichever one is not None is the correct one. This could be done inside a dict structure, for example, so setting self.data = {'row_data': [{'x': 1},{'x':2}] } both sets the row_data value and unsets the old obsolete dataframe value.
Syntax options
Some of which are usable together:
PARAM Pass the needed format option into 'get_data' and 'save_data' functions on Phase.
GETTER Have getters for both get_row_data and get_dataframe_data that know the data model choice
SETTER: Have setters for both set_row_data and set_dataframe_data
OBJECT: For extra information hiding, a DataManager object inside Phase hides its methods from the Phase object so that it's harder for people to mess with the data model choices and shoot themselves in the foot. This may allow for future extensibility; for example, there may be things we can do with a data manager object for low-memory or faster operation than normal, or the data manager object might be instrumentable for good debugging (imagine being able to diff between each step). Or is this YAGNI?
These are combinable with better or worse results.
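For concreteness, here is a minimal sketch of the LAZY model behind GETTER/SETTER methods. Class and method names are hypothetical, and it assumes pandas is available:

```python
import pandas as pd

class PhaseData:
    """Keep one canonical copy lazily; convert only when the other format
    is asked for. Hypothetical sketch of the LAZY + GETTER/SETTER options."""

    def __init__(self, row_data=None):
        self._row_data = row_data
        self._dataframe = None

    def set_row_data(self, rows):
        self._row_data = rows
        self._dataframe = None   # invalidate the now-stale format

    def set_dataframe(self, df):
        self._dataframe = df
        self._row_data = None

    def get_row_data(self):
        if self._row_data is None:
            self._row_data = self._dataframe.to_dict('records')
        return self._row_data

    def get_dataframe(self):
        if self._dataframe is None:
            self._dataframe = pd.DataFrame(self._row_data)
        return self._dataframe
```

Whichever copy is not None is the correct one, so each setter unsets the other format.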
When client code sends error and warning messages, how do we collect and give information about where that happened?
One option is to have the client code declare location fully transparently:
def add_error(message, phase=None, step=None, row_num=None, row_data=None)
In a step:
context.add_error("Manager ID not found in data", phase="Transformer", step='check-manager-in-batch', row_num=118, row_data=row)
The more the client code provides, the better the information in the logs. However, the client code often does not know what the current phase is! A step like 'check-manager-in-batch' could be used in more than one phase, in more than one pipeline; it should be able to report an error from wherever it is. It also may not know the row number.
In the other direction, reporting errors etc could be done fully magically:
def add_error(message)
In a step:
context.add_error("Manager ID not found in data")
For this to work, the context needs to keep track of current phase, current step and current row number. If this goes wrong, it's hard to see how the magic failed. Probably the client code overrode the Phase code that runs steps, or the Pipeline code that runs phases.
Of course we can also do a combination - like have the Phase be known by the context, but the step isn't?
Currently I'm leaning towards magic, but with putting all the responsibility for updating location information in the context in the hands of PhaseBase as much as possible[1]. It might require some coding to test out if this really works well. We may be able to put guard-rails in so that when the Phase is finishing up and cleaning up, it detects if in an earlier step it didn't set its own context stuff correctly. Or we may be able to guide and limit client coders to exactly which parts of Phase can safely be overridden without losing the specific error-location information as maintained in Context.
Footnote [1]: it is tempting to have the Pipeline tell the context what the next Phase is, but the Pipeline is looking more and more like what folks might over-ride in order to have phaser operate in a completely different context. The Pipeline should do I/O and error reporting which is also I/O dependent, and if a Pipeline is designed for use in a DAG where each Phase is a DAG node, then the Pipeline will also have custom phase running that tells the DAG environment what to run next.
By building on top of the core Python CSV stuff, we can write an opinionated CSV reader/writer that guides folks towards a safer, data-cleaning oriented way of working with CSV input/output
By NOT using pandas read_csv, we aren't locked into reading an entire file at once, and we can do somewhat more aggressive cleaning of cases like
row1,1, "value with space in front", more stuff
(In the above example, pandas includes the space and quotes around "value with space in front")
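The stdlib csv module already handles this case via skipinitialspace, which suggests the core-Python route can work; a quick demonstration:

```python
import csv
import io

raw = 'row1,1, "value with space in front", more stuff\n'

# csv's default keeps the leading space, so the quotes become literal chars.
default = next(csv.reader(io.StringIO(raw)))
# skipinitialspace=True drops the space, so the quoted field parses cleanly.
cleaned = next(csv.reader(io.StringIO(raw), skipinitialspace=True))
```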
Ideally:
Alternatives known so far
So far 3 looks pretty good
If we don't do the work to keep row numbers consistent, then we'll end up with dropped rows and errors and warnings like this:
Validator messages
Row 14 dropped: not an active employee
...
Transformer messages
Row 14 error: employee's salary can't be parsed
It can even happen within a phase, because batch_step renumbers rows I think...
Yes, some actions (Reshape phase) are going to wipe the row number context completely, but that doesn't mean we shouldn't try to provide consistent row numbers for cases where no reshape phase was run.
There are a couple challenges to fixing this even with that limitation.
Saving and loading data (or passing data between phases). Either we could save the row numbers in the save files, which seems fine - it might be useful for debugging to be able to open the intermediary results and see row 14 missing; it reinforces the information that row 14 was dropped! Or, we could save "dropped rows" as a separate table with row numbers, and when the next phase is loaded, it uses that information when renumbering. E.g. the first row gets #1, then on to #13, then #14 is skipped because it's in the dropped-row list, and so on.
Running a batch step or dataframe step when they might drop rows. Well, we could optionally have the implementor of a batch step specify whether the data has a natural index at this point. So if employee_id is stated as a hypothetically valid index, here's how the Phase would run a batch step that says "use employee_id to identify dropped rows":
The above approach is compatible with an approach where, if the batch step doesn't tell us otherwise, we do the direct thing and add the row number to the data, so that when the data is returned we still have it in the fields of each row.
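The natural-index idea above could be sketched like this (the '__row_num__' field name and the helper are hypothetical, standing in for wherever phaser keeps the row number):

```python
def run_batch_step_with_index(rows, batch_step, natural_index):
    """Re-attach original row numbers after a batch step that may drop rows.

    natural_index (e.g. 'employee_id') is a field the step author has
    declared unique at this point in the pipeline.
    """
    # Remember each row's number, keyed by the natural index.
    numbers = {row[natural_index]: row['__row_num__'] for row in rows}
    result = batch_step(rows)
    # Surviving rows get their original numbers back; dropped keys vanish.
    for row in result:
        row['__row_num__'] = numbers[row[natural_index]]
    return result
```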
Add a DateColumn object to the library, and use it to read in incoming columns and parse as dates.
Note: we'll need to think about how read_csv's parse_dates interacts with the plan to canonicalize header names. E.g. if we are going to accept both "Birthdate" and "birthdate" as the same column by lowercasing, then we can't just use read_csv once to both see what the column names are and also to do conversions. Some options:
Use read_csv twice - once to detect column names, once to use the detected column names to convert dates. Disadvantage: this makes it harder for folks to overload Phase.load(), right?
Use read_csv with no conversions, and do conversions later with to_datetime. If we do type casting as a separate step, does this mean that people can have more control over type casting?
Side convo: how are people going to have extra control over type casting? I don't think it makes sense to have a Phase method like "cast_to_types" that is overridden, because then the user has to handle every column in that method. It would make more sense to have the column types allow this. E.g. "DateColumn" handles the casting to date with a default behavior, but the user could override the "cast" method on "DateColumn" to specify an unusual approach.
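A rough sketch of that column-owned cast, with an example user override. Class names and the default format are illustrative, not the library's actual API:

```python
from datetime import date, datetime

class DateColumn:
    """Illustrative only: the column owns its cast, so users can subclass."""

    def __init__(self, name, fmt='%Y-%m-%d'):
        self.name = name
        self.fmt = fmt

    def cast(self, value):
        if isinstance(value, date):
            return value          # already converted upstream
        return datetime.strptime(value.strip(), self.fmt).date()

class EuroDateColumn(DateColumn):
    # A user override for an unusual source format.
    def cast(self, value):
        return datetime.strptime(value.strip(), '%d.%m.%Y').date()
```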
I feel like there may be no "right" thing here, but I'd like to index rows starting with 1. When we report rows in dropped rows, errors, and warnings, indexing from 0 means the numbers are 2 off from what you'd see in a file: the header row is line 1 in the file and the first data row is line 2, so "row number 1" is the 3rd line.
I suppose there's even an argument for reporting errors and warnings by file line number, hehheh. But I prefer that row 1 be the first data row, so that the last data row's number is the same as the total number of rows.
WDYT?
In Phase#check_headers_consistent, since context.current_row is not updated when the warnings are logged, all of the warnings are associated with the last row number, even though the full row (along with its row number) is passed into the add_warning function call.
check_headers_consistent should log which row the error occurred on.
When a pipeline is used manually users can just look at what files it created
When the pipeline is used in automation, it might help for the automation to be able to run the pipeline then ask it for all its files so it can do something with them
In testing, it's nice to ask the pipeline for its outputs rather than hard-code what names we expect
This is important to have along with "null=False" - if we don't also have "blank=False", then empty values can sneak into the data:
employee_id,name
1,Fred
"",Velma
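A sketch of the blank check, kept separate from the null check (the function name is hypothetical):

```python
def check_not_blank(row, column_name):
    """Reject empty or whitespace-only strings that a null-only check misses."""
    value = row.get(column_name)
    if value is None:
        raise ValueError(f"{column_name} is null")
    if isinstance(value, str) and value.strip() == '':
        raise ValueError(f"{column_name} is blank")
    return value
```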
Jeff wrote this for the example repo:
@row_step
def echo_step(row, **kwargs):
    print(row)
    return row
This should be a feature in some way.
@row_step(debug_diff=True)
def my_real_step(row, **kwargs):
    # etc
    return row
https://www.rfc-editor.org/rfc/rfc4180
It's pretty permissive
Writing a phase in employees.py, I'm encountering the error I put in that checks that fields weren't added. We could just do away with the error; or we could define in the 'columns' list the things we plan to create; or we could define a list of out columns separately from in.
ISTM that even a moderately simple phase might have some "incoming column requirements" (which we already have) but also "outgoing column requirements".
E.g.
class Transformer(Phase):
    in_columns = [
        Column("ID", required=True, on_error='drop_row'),
        FloatColumn("Salary"),
        FloatColumn("Bonus amount")
    ]
    steps = [
        calculate_bonus_percent
    ]
    out_columns = [
        Column("ID", empty=False),
        FloatColumn("Bonus", required=True, null=False, zero=False)
    ]
There can be requirements on the outgoing columns that confirm that what was intended in the steps really does happen; in the example, it would be that the bonus column is always given a value and is never null or zero...
If an exception is thrown during a phase only the message is captured. This can make it hard to know exactly where the error occurred, as the location information is lost.
It would be nice to have more information being captured with the errors, such as their stack trace.
Look in PhaseBase.process_exception for where the exceptions are captured.
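Using the stdlib traceback module, process_exception could keep the formatted stack alongside the message. A sketch (the helper name and dict shape are illustrative):

```python
import traceback

def capture_exception_info(exc):
    """Keep the formatted stack trace alongside the message."""
    return {
        'message': str(exc),
        'traceback': ''.join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }
```

The traceback string includes the file and line where the exception was raised, which is exactly the missing location information.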
Along with not needing most of pandas read_csv features, we also need to error if columns are duplicated on import (at least by default). Generally we want a little more control and we can build that up from the native python read CSV code, rather than try to hack it out of the pandas function.
Also it will make it easier for the library to ship without a pandas dependency.
In working on #60 and #62 it has become clear that developers will be changing csv records into something useful for addressing in memory and the other way around.
We should make it much easier for them to do so rather than having to write code like this over and over again:
def counts_to_output(counts):
    return [
        {'parent_id': key, 'sibling_count': value}
        for key, value in counts.items()
    ]

def source_to_counts(rows):
    return {
        row['parent_id']: row['sibling_count'] for row in rows
    }
Our design doc had the notion of adding to the row_step annotation the ability to specify lookups from sources and index them.
I went back to create a DataFramePhase and got surprised by stuff we'd done just a few weeks ago... that we have to subclass DataFramePhase to override 'df_transform'
We should have a way to init a DataFramePhase with a method name passed to the instantiation that gets run in df_transform
def explode_language_list(df, context=None):
    df['languages'] = df['languages'].str.split(',')
    df = df.explode('languages')
    return df.rename(columns={'languages': 'language'})

my_phase = DataFramePhase(step=explode_language_list)
... I'm also bending somewhat on the idea that only one step is allowed; I'm pretty sure people will put multiple logical steps in one if we only allow one step. Given that somebody will have a list of steps they want to run on DataFrames (especially if they are migrating from a pandas-oriented pipeline, or trying to automate a Jupyter notebook's worth of work), allowing only one step will encourage them to cram all the existing logic into that one step.
On discussion we agree we should go back in this direction. Not only because dataframe work might have multiple steps and be passed into the constructor, but also just to allow more declarative coding rather than subclassing...
Particularly currency characters...
#62 adds additional sources. The CLI needs to be able to support the user specifying them when they run a pipeline.
The CLI should be able to tell the user what sources are expected and, in interactive mode, ask for the necessary sources.
Notes:
If still using read_csv:
We probably want 'quoting' parameter to be either QUOTE_ALL or QUOTE_NONNUMERIC.
I understand how this quoting control works with to_csv, but how does it apply to read_csv? What is different between those two choices when loading data in? Does read_csv assume that a column containing quoted numeric values is actually supposed to be a string if "QUOTE_NONNUMERIC" is chosen, whereas it will convert the same column to numbers if "QUOTE_ALL" is chosen?
While considering these choices, "doublequote" and "escapechar" are also interesting read_csv options. "doublequote" defaults to True, which seems to work OK. Is that consistent with how to_csv defaults to handling strings containing quotation marks?
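For the stdlib csv module (whose quoting constants pandas reuses), QUOTE_NONNUMERIC on the reading side does exactly that: unquoted fields are converted to floats, quoted fields stay strings. Whether pandas read_csv matches this exactly would need checking, but here's the stdlib behavior:

```python
import csv
import io

# One quoted field, one unquoted; no header row (the reader would try to
# float() an unquoted header like 'id' and fail).
rows = list(csv.reader(io.StringIO('"1",2\n'), quoting=csv.QUOTE_NONNUMERIC))
# Quoted '1' stays a string; unquoted 2 is converted to the float 2.0.
```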
# LMDTODO: Add a check in Phase.__init__ to check whether any of the column renames collide.
# For example you can't rename 'div' to both "Division" and "Divisor"
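A sketch of the collision check; the (canonical_name, [source_names]) representation is assumed for illustration, not phaser's actual rename structure:

```python
from collections import Counter

def find_rename_collisions(columns):
    """Return source names claimed by more than one column.

    Each column spec here is a (canonical_name, [source_names]) pair;
    Phase.__init__ could raise if this returns anything.
    """
    sources = [src for _name, renames in columns for src in renames]
    return [name for name, count in Counter(sources).items() if count > 1]
```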
When the phase calls each step it can pass a copy of the row, or a copy of the batch data.
If the step modifies the copy and returns it, that is no problem.
This would simplify how steps are declared and called: a step could be defined without worrying about context at all, or with context:
@row_step
def simple_step(row, **kwargs):
    ...
    return row

@row_step
def check_context_step(row, context):
    if context[thing_to_check]:
        ...
    return row
ISSUE: Does the kwargs part really need to be there? Can the decorator detect whether the step defined by the user expects a context or not?
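It can: inspect.signature makes the detection straightforward. A sketch that stands in for (and is not) phaser's real row_step decorator:

```python
import functools
import inspect

def row_step(func):
    """Sketch: inspect the step's signature and only pass context
    if the function declares a 'context' parameter."""
    wants_context = 'context' in inspect.signature(func).parameters

    @functools.wraps(func)
    def wrapper(row, context=None):
        return func(row, context=context) if wants_context else func(row)
    return wrapper

@row_step
def simple_step(row):             # no **kwargs needed
    return row

@row_step
def check_context_step(row, context):
    row['checked'] = context['thing_to_check']
    return row
```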