lisad / phaser
library for batch-oriented complex data integration pipelines
License: MIT License
Not sure whether it's the Pipeline or the Context that should solve this, but client code that runs a pipeline ought to be able to gather all the warnings and other events at the end and access all their location and message info.
I started working on saving extra_outputs from the context. Initially I tried to do it from 'save' in Phase, but that got muddled, because Phase doesn't know what the pipeline's working directory is. Also if we put more stuff in Phase.save, then it gets harder to override.
I moved the command for a context to Pipeline, which is more satisfying, because the Pipeline can tell the Context which directory to save outputs to, and tell it to do so between Phases. However, now when a phase is run on its own, its extra outputs are not saved.
In general, the responsibilities for the following between Pipeline, Phase and Context are getting a bit muddled...
James recommended twine, and also possibly https://hatch.pypa.io/latest/
(setup.py is not preferred these days)
# LMDTODO: Add a flag which allows the user to say if line returns or tabs should be stripped from INSIDE
# values of this column. It's a frequent data error when data is edited or provided in Excel, to have
# a multiword value like "United\nKingdom" or "United \nKingdom". Yet we don't want to strip line returns
# from long text values. Also move this explanation to spec/docs
Right now we use pandas.read_csv, which supports "on_bad_lines='warn'", so we could use that to report more errors before stopping.
If we implement our own CSV reader, should do the same.
Would a money column have two values accessible - currency type and currency amount? Would it be able to take in values like "$150,000" and output two columns like "USD" and "150000"?
A simple MoneyColumn (to start with) would:
In the future we should probably have a CurrencyMoneyColumn that knows BOTH currency and value.
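A hypothetical sketch of the parsing half; the parse_money name and the symbol-to-code map are made up for illustration, and a real map would be far more complete:

```python
import re
from decimal import Decimal

# Hypothetical symbol-to-ISO-code map, just enough for the example.
CURRENCY_SYMBOLS = {'$': 'USD', '£': 'GBP', '€': 'EUR'}

def parse_money(raw):
    """Split a value like '$150,000' into ('USD', Decimal('150000'))."""
    match = re.match(r'^([^\d\s]*)\s*([\d,]+(?:\.\d+)?)$', raw.strip())
    if not match:
        raise ValueError(f"Cannot parse money value: {raw!r}")
    symbol, amount = match.groups()
    currency = CURRENCY_SYMBOLS.get(symbol) if symbol else None
    return currency, Decimal(amount.replace(',', ''))
```

A simple MoneyColumn could return only the amount; the CurrencyMoneyColumn would keep both parts.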
Data sets can have records that are not valuable for import. We need a way to be able to select out the records that are useful.
My first attempt was to make a @row_step that only returned rows matching a predicate, but with the assertion in the _row_step_wrapper that the row returned by step_function be a dict, that was not working. Returning an empty dict results in a file that has all of the rows, but with lots of them filled with NULL. I'd prefer the rows not exist at all.
Here is the @row_step function I wrote to evaluate the predicate:
def select(fn):
    @row_step
    def _select(phase, row):
        if fn(row) == True:
            return row
        return None
    return _select
If doing conversion with pandas read_csv:
'thousands' parameter default is None
'decimal' separator is '.'
In the following csv file, the manager_id for the second row gets passed into the cast function of IntColumn as the string "NULL", which causes a conversion error.
employeeNumber,firstName,lastName,payType,paidPer,payRate,bonusAmount,Status,department,manager_id
1,Benjamin,Sisko,"salary","Year","188625","30000",Active,Marketing,4
2,Kira,Nerys,"salary","Year","118625","20000",Active,Finance
,None,Garak,"salary","Year", 100000,,Inactive,Finance,
4,Rasma,Son,"salary","Year",230000,24000,Active,Marketing,
5,Aldina,Sharrow,"salary","Year",140000,18000,Active,Finance,2
6,Viktor,Matic,"salary","Year",180000,25000,Active,Finance,2
This bug shows up on main when clevercsv is used for reading and Pandas is used for writing. And it only shows up when passing data between phases, because Pandas is asked to write out "NULL" for empty columns.
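One way to avoid the conversion error, sketched here as a standalone tolerant cast (the null_values defaults are a guess at sensible behavior, not what IntColumn currently does):

```python
def cast_int(value, null_values=('', 'NULL', 'None', 'N/A')):
    """Tolerant integer cast: treat common null spellings as None
    instead of raising. Illustrative only."""
    if value is None or str(value).strip() in null_values:
        return None
    return int(str(value).strip())
```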
A MultiValueColumn needs a bit of configuration
During load, parse the list into its multiple values:
e.g. df['languages'] = df['languages'].str.split(',')
But we should also strip spaces off of values after the split, which this doesn't do
During save, rejoin with commas and enclose in double-quotes (only double-quotes work)
Use case: lat,long is a list of values in one column - they should both be validated to be floats
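The load/save halves could be sketched roughly like this (helper names are hypothetical; the lat,long use case is just a float-validation pass over the split values):

```python
def split_multi_value(raw, sep=','):
    """Parse 'en, fr , de' into ['en', 'fr', 'de'], stripping stray spaces."""
    return [part.strip() for part in raw.split(sep)]

def join_multi_value(values, sep=','):
    """Rejoin for saving; the CSV writer must then double-quote the field."""
    return sep.join(str(v) for v in values)

def validate_floats(values):
    """For the lat,long use case: every element must parse as a float."""
    return [float(v) for v in values]
```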
Conventions to use or research:
input - to do interactive querying of the user along the way
Command-line app features (not to be implemented all at once with this feature):
What features should columns have independent of data type?
"to_save" -- defaults to True, but if False, this column won't be saved by the Phase and won't go on to the next phase. This control can be used to figure out which columns to put in the to_csv 'columns' param
"input_name" vs "output_name" (or call these something different) - often column names stay consistent through a phase, but when they don't, it's easiest to say here how the output name should differ
Probably there should be flags to override the automatic cleanup
This test demonstrates the problem:
@pytest.mark.skip("This is annoying. It's easy for CSVs to have quotes and spaces, but the combination is bad.")
def test_load_quoted_values(tmpdir):
    source = tmpdir / 'quoted_values.csv'
    with open(source, 'w') as f:
        # Importantly, Active is quoted AND has a space before it.
        f.write("""id,name,status\n1,"Jean-Luc Picard", "Active"\n""")
    phase = Phase()
    phase.load(source)
    # If values are not stripped of space and converted, status = ' "Active"'
    assert phase.row_data[0]['status'] == "Active"
If this isn't fixed, it's all too easy to have a bug in the pipeline. I encountered this in the employees.csv and employees pipeline work, where paidPer = ' "Year"' instead of 'Year', so the code failed to calculate salary correctly.
Currently, we call phase() to instantiate one in setup_phases, but the constructor of a Phase requires a positional argument for its name. Make that happen.
We'll need some tests for this.
If the pipeline errors, the exception should say which phase the error occurred in.
(Coming out of the phase, the exception should already say which step or part of the phase the error occurred in.)
Right now process_exception just accepts DropRowException.
Also add a test that this can't be done.
So far we have errors like
ERROR row: 0, message: 'KeyError raised ('payRate')'
This is an error in the logic of the code that implements the steps, or perhaps the user should have made the pay rate column required. Anyway, it would sure help if exceptions raised like this had the file/line info.
Current approach, temporarily used for simplicity, is to convert the row_data object (a list of dicts) to a pandas DataFrame in the save method. This makes it harder for folks to override the save method.
To take it out of the 'save' method, here are some alternative approaches.
Data model options
How to save the data as attributes of the instance between steps
CANONICAL - always save as row_data. DataFrame format can be asked for which is what's done inside save(). DataFrame data must be converted back to row_data format at the end of or after the step.
BOTH versions always saved and correct. Slower.
LAZY: Either version could be saved - whichever one is not None is the correct one. This could be done inside a dict structure, for example, so setting self.data = {'row_data': [{'x': 1},{'x':2}] } both sets the row_data value and unsets the old obsolete dataframe value.
Syntax options
Some of which are usable together:
PARAM Pass the needed format option into 'get_data' and 'save_data' functions on Phase.
GETTER Have getters for both get_row_data and get_dataframe_data that know the data model choice
SETTER: Have setters for both set_row_data and set_dataframe_data
OBJECT: For extra information hiding, a DataManager object inside Phase hides its methods from the Phase object so that it's harder for people to mess with the data model choices and shoot themselves in the foot. This may allow for future extensibility; for example, there may be things we can do with a data manager object for low-memory or faster operation than normal, or the data manager object might be instrumentable for good debugging (imagine being able to diff between each step). Or is this YAGNI?
These are combinable with better or worse results.
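For concreteness, here is a minimal sketch of the LAZY model behind GETTER/SETTER methods. Class and method names are hypothetical, and it assumes pandas is available:

```python
import pandas as pd

class PhaseData:
    """Keep one canonical copy lazily; convert only when the other format
    is asked for. Hypothetical sketch of the LAZY + GETTER/SETTER options."""

    def __init__(self, row_data=None):
        self._row_data = row_data
        self._dataframe = None

    def set_row_data(self, rows):
        self._row_data = rows
        self._dataframe = None   # invalidate the now-stale format

    def set_dataframe(self, df):
        self._dataframe = df
        self._row_data = None

    def get_row_data(self):
        if self._row_data is None:
            self._row_data = self._dataframe.to_dict('records')
        return self._row_data

    def get_dataframe(self):
        if self._dataframe is None:
            self._dataframe = pd.DataFrame(self._row_data)
        return self._dataframe
```

Whichever copy is not None is the correct one, so each setter unsets the other format.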
When client code sends error and warning messages, how do we collect and give information about where that happened?
One option is to have the client code declare location fully transparently:
def add_error(message, phase=None, step=None, row_num=None, row_data=None)
In a step:
context.add_error("Manager ID not found in data", phase="Transformer", step='check-manager-in-batch', row_num=118, row_data=row)
The more the client code provides, the better the information in the logs. However, the client code often does not know what the current phase is! A step like 'check-manager-in-batch' could be used in more than one phase, in more than one pipeline; it should be able to report an error from wherever it is. It also may not know the row number.
In the other direction, reporting errors etc could be done fully magically:
def add_error(message)
In a step:
context.add_error("Manager ID not found in data")
For this to work, the context needs to keep track of current phase, current step and current row number. If this goes wrong, it's hard to see how the magic failed. Probably the client code overrode the Phase code that runs steps, or the Pipeline code that runs phases.
Of course we can also do a combination - like have the Phase be known by the context, but the step isn't?
Currently I'm leaning towards magic, but with putting all the responsibility for updating location information in the context in the hands of PhaseBase as much as possible[1]. It might require some coding to test out if this really works well. We may be able to put guard-rails in so that when the Phase is finishing up and cleaning up, it detects if in an earlier step it didn't set its own context stuff correctly. Or we may be able to guide and limit client coders to exactly which parts of Phase can safely be overridden without losing the specific error-location information as maintained in Context.
Footnote [1]: it is tempting to have the Pipeline tell the context what the next Phase is, but the Pipeline is looking more and more like what folks might over-ride in order to have phaser operate in a completely different context. The Pipeline should do I/O and error reporting which is also I/O dependent, and if a Pipeline is designed for use in a DAG where each Phase is a DAG node, then the Pipeline will also have custom phase running that tells the DAG environment what to run next.
By building on top of the core Python CSV stuff, we can write an opinionated CSV reader/writer that guides folks towards a safer, data-cleaning oriented way of working with CSV input/output
By NOT using pandas read_csv, we aren't locked into reading an entire file at once, and we can do somewhat more aggressive cleaning of cases like
row1,1, "value with space in front", more stuff
(In the above example, pandas includes the space and quotes around "value with space in front")
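The stdlib csv module already handles this case via skipinitialspace, which suggests the core-Python route can work; a quick demonstration:

```python
import csv
import io

raw = 'row1,1, "value with space in front", more stuff\n'

# csv's default keeps the leading space, so the quotes become literal chars.
default = next(csv.reader(io.StringIO(raw)))
# skipinitialspace=True drops the space, so the quoted field parses cleanly.
cleaned = next(csv.reader(io.StringIO(raw), skipinitialspace=True))
```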
Ideally:
Alternatives known so far
So far 3 looks pretty good
If we don't do the work to keep row numbers consistent, then we'll end up with dropped rows and errors and warnings like this:
Validator messages
Row 14 dropped: not an active employee
...
Transformer messages
Row 14 error: employee's salary can't be parsed
It can even happen within a phase, because batch_step renumbers rows I think...
Yes, some actions (Reshape phase) are going to wipe the row number context completely, but that doesn't mean we shouldn't try to provide consistent row numbers for cases where no reshape phase was run.
There are a couple challenges to fixing this even with that limitation.
Saving and loading data (or passing data between phases). Either we could save the row numbers in the save files, which seems fine - it might be useful for debugging to be able to open the intermediary results and see row 14 missing; it reinforces the information that row 14 was dropped! Or, we could save "dropped rows" as a separate table with row numbers, and when the next phase is loaded, it uses that information when renumbering. E.g. the first row gets #1, then on to #13, then #14 is skipped because it's in the dropped-row list, and so on.
Running a batch step or dataframe step when they might drop rows. Well, we could optionally have the implementor of a batch step specify whether the data has a natural index at this point. So if employee_id is stated as a hypothetically valid index, here's how the Phase would run a batch step that says "use employee_id to identify dropped rows":
The above approach is compatible with an approach where, if the batch step doesn't tell us otherwise, we do the direct thing and add the row number to the data, so that when the data is returned we still have it in the fields of each row.
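The natural-index idea above could be sketched like this (the '__row_num__' field name and the helper are hypothetical, standing in for wherever phaser keeps the row number):

```python
def run_batch_step_with_index(rows, batch_step, natural_index):
    """Re-attach original row numbers after a batch step that may drop rows.

    natural_index (e.g. 'employee_id') is a field the step author has
    declared unique at this point in the pipeline.
    """
    # Remember each row's number, keyed by the natural index.
    numbers = {row[natural_index]: row['__row_num__'] for row in rows}
    result = batch_step(rows)
    # Surviving rows get their original numbers back; dropped keys vanish.
    for row in result:
        row['__row_num__'] = numbers[row[natural_index]]
    return result
```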
Add a DateColumn object to the library, and use it to read in incoming columns and parse as dates.
Note: we'll need to think about how read_csv's parse_dates interacts with the plan to canonicalize header names. E.g. if we are going to accept both "Birthdate" and "birthdate" as the same column by lowercasing, then we can't just use read_csv once to both see what the column names are and also to do conversions. Some options:
Use read_csv twice - once to detect column names, once to use the detected column names to convert dates. Disadvantage: this makes it harder for folks to overload Phase.load(), right?
Use read_csv with no conversions, and do conversions later with to_datetime. If we do type casting as a separate step, does this mean that people can have more control over type casting?
Side convo: how are people going to have extra control over type casting? I don't think it makes sense to have a Phase method like "cast_to_types" that is overridden, because then the user has to handle every column in that method. It would make more sense to have the column types allow this. E.g. "DateColumn" handles the casting to date with a default behavior, but the user could override the "cast" method on "DateColumn" to specify an unusual approach.
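A rough sketch of that column-owned cast, with an example user override. Class names and the default format are illustrative, not the library's actual API:

```python
from datetime import date, datetime

class DateColumn:
    """Illustrative only: the column owns its cast, so users can subclass."""

    def __init__(self, name, fmt='%Y-%m-%d'):
        self.name = name
        self.fmt = fmt

    def cast(self, value):
        if isinstance(value, date):
            return value          # already converted upstream
        return datetime.strptime(value.strip(), self.fmt).date()

class EuroDateColumn(DateColumn):
    # A user override for an unusual source format.
    def cast(self, value):
        return datetime.strptime(value.strip(), '%d.%m.%Y').date()
```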
I feel like there may be no "right" thing here, but I'd like to index rows starting with 1. When we report rows in dropped rows, errors, and warnings, indexing from 0 means the numbers are 2 off from what you'd see in a file: the header row is line 1 in the file and the first data row is line 2, so "row number 1" is the 3rd line.
I suppose there's even an argument for reporting errors and warnings by file line number, hehheh. But I prefer that row 1 be the first data row, so that the last data row's number is the same as the total number of rows.
WDYT?
In Phase#check_headers_consistent, since context.current_row is not updated when the warnings are logged, all of the warnings are associated with the last row number, even though the full row (along with its row number) is passed into the add_warning function call.
check_headers_consistent should log which row the error occurred on.
When a pipeline is used manually users can just look at what files it created
When the pipeline is used in automation, it might help for the automation to be able to run the pipeline then ask it for all its files so it can do something with them
In testing, it's nice to ask the pipeline for its outputs rather than hard-code what names we expect
This is important to have along with "null=False" - if we don't also have "blank=False", then empty values can sneak into the data:
employee_id,name
1,Fred
"",Velma
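A sketch of the blank check, kept separate from the null check (the function name is hypothetical):

```python
def check_not_blank(row, column_name):
    """Reject empty or whitespace-only strings that a null-only check misses."""
    value = row.get(column_name)
    if value is None:
        raise ValueError(f"{column_name} is null")
    if isinstance(value, str) and value.strip() == '':
        raise ValueError(f"{column_name} is blank")
    return value
```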
Jeff wrote this for the example repo:
@row_step
def echo_step(row, **kwargs):
    print(row)
    return row
This should be a feature in some way.
@row_step(debug_diff=True)
def my_real_step(row, **kwargs):
    # etc
    return row
https://www.rfc-editor.org/rfc/rfc4180
It's pretty permissive
Writing a phase in employees.py, I'm encountering the error I put in that checks that fields weren't added. We could just do away with the error; or we could define in the 'columns' list the things we plan to create; or we could define a list of out columns separately from in.
ISTM that even a moderately simple phase might have some "incoming column requirements" (which we already have) but also "outgoing column requirements".
E.g.
class Transformer(Phase):
    in_columns = [
        Column("ID", required=True, on_error='drop_row'),
        FloatColumn("Salary"),
        FloatColumn("Bonus amount")
    ]
    steps = [
        calculate_bonus_percent
    ]
    out_columns = [
        Column("ID", empty=False),
        FloatColumn("Bonus", required=True, null=False, zero=False)
    ]
There can be requirements on the outgoing columns that confirm that what was intended in the steps really does happen; in the example, it would be that the bonus column is always given a value and is never null or zero...
If an exception is thrown during a phase only the message is captured. This can make it hard to know exactly where the error occurred, as the location information is lost.
It would be nice to have more information being captured with the errors, such as their stack trace.
Look in PhaseBase.process_exception for where the exceptions are captured.
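Using the stdlib traceback module, process_exception could keep the formatted stack alongside the message. A sketch (the helper name and dict shape are illustrative):

```python
import traceback

def capture_exception_info(exc):
    """Keep the formatted stack trace alongside the message."""
    return {
        'message': str(exc),
        'traceback': ''.join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }
```

The traceback string includes the file and line where the exception was raised, which is exactly the missing location information.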
Along with not needing most of pandas read_csv features, we also need to error if columns are duplicated on import (at least by default). Generally we want a little more control and we can build that up from the native python read CSV code, rather than try to hack it out of the pandas function.
Also it will make it easier for the library to ship without a pandas dependency.
In working on #60 and #62 it has become clear that developers will be changing csv records into something useful for addressing in memory and the other way around.
We should make it much easier for them to do so rather than having to write code like this over and over again:
def counts_to_output(counts):
    return [
        {'parent_id': key, 'sibling_count': value}
        for key, value in counts.items()
    ]

def source_to_counts(rows):
    return {
        row['parent_id']: row['sibling_count'] for row in rows
    }
Our design doc had the notion of adding to the row_step annotation the ability to specify lookups from sources and index them.
I went back to create a DataFramePhase and got surprised by stuff we'd done just a few weeks ago... that we have to subclass DataFramePhase to override 'df_transform'
We should have a way to init a DataFramePhase with a method name passed to the instantiation that gets run in df_transform
def explode_language_list(df, context=None):
    df['languages'] = df['languages'].str.split(',')
    df = df.explode('languages')
    return df.rename(columns={'languages': 'language'})

my_phase = DataFramePhase(step=explode_language_list)
... I'm also bending somewhat on the idea that only one step is allowed; I'm pretty sure people will put multiple logical steps in one if we only allow one step. Given that somebody will have a list of steps they want to run on DataFrames (especially if they are migrating from a pandas-oriented pipeline, or trying to automate a Jupyter notebook's worth of work), allowing only one step will encourage them to cram all the existing logic into that one step.
On discussion we agree we should go back in this direction. Not only because dataframe work might have multiple steps and be passed into the constructor, but also just to allow more declarative coding rather than subclassing...
Particularly currency characters...
#62 adds additional sources. The CLI needs to be able to support the user specifying them when they run a pipeline.
The CLI should be able to tell the user what sources are expected and, in interactive mode, ask for the necessary sources.
Notes:
If still using read_csv:
We probably want 'quoting' parameter to be either QUOTE_ALL or QUOTE_NONNUMERIC.
I understand how this quoting control works with to_csv, but how does it apply to read_csv? What is different between those two choices when loading data in? Does read_csv assume that a column containing quoted numeric values is actually supposed to be a string if "QUOTE_NONNUMERIC" is chosen, whereas it will convert the same column to numbers if "QUOTE_ALL" is chosen?
While considering these choices, "doublequote" and "escapechar" are also interesting read_csv options. "doublequote" defaults to True, which seems to work OK. Is that consistent with how to_csv defaults to handling strings containing quotation marks?
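For the stdlib csv module (whose quoting constants pandas reuses), QUOTE_NONNUMERIC on the reading side does exactly that: unquoted fields are converted to floats, quoted fields stay strings. Whether pandas read_csv matches this exactly would need checking, but here's the stdlib behavior:

```python
import csv
import io

# One quoted field, one unquoted; no header row (the reader would try to
# float() an unquoted header like 'id' and fail).
rows = list(csv.reader(io.StringIO('"1",2\n'), quoting=csv.QUOTE_NONNUMERIC))
# Quoted '1' stays a string; unquoted 2 is converted to the float 2.0.
```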
# LMDTODO: Add a check in Phase.__init__ to check whether any of the column renames collide.
# For example you can't rename 'div' to both "Division" and "Divisor"
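A sketch of the collision check; the (canonical_name, [source_names]) representation is assumed for illustration, not phaser's actual rename structure:

```python
from collections import Counter

def find_rename_collisions(columns):
    """Return source names claimed by more than one column.

    Each column spec here is a (canonical_name, [source_names]) pair;
    Phase.__init__ could raise if this returns anything.
    """
    sources = [src for _name, renames in columns for src in renames]
    return [name for name, count in Counter(sources).items() if count > 1]
```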
When the phase calls each step it can pass a copy of the row, or a copy of the batch data.
If the step modifies the copy and returns it, that is no problem.
This would simplify how steps are declared and called: a step could be defined without worrying about context at all, or with context:
@row_step
def simple_step(row, **kwargs):
    ...
    return row

@row_step
def check_context_step(row, context):
    if context[thing_to_check]:
        ...
    return row
ISSUE: Does the kwargs part really need to be there? Can the decorator detect whether the step defined by the user expects a context or not?
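It can: inspect.signature makes the detection straightforward. A sketch that stands in for (and is not) phaser's real row_step decorator:

```python
import functools
import inspect

def row_step(func):
    """Sketch: inspect the step's signature and only pass context
    if the function declares a 'context' parameter."""
    wants_context = 'context' in inspect.signature(func).parameters

    @functools.wraps(func)
    def wrapper(row, context=None):
        return func(row, context=context) if wants_context else func(row)
    return wrapper

@row_step
def simple_step(row):             # no **kwargs needed
    return row

@row_step
def check_context_step(row, context):
    row['checked'] = context['thing_to_check']
    return row
```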