ericpan64 / pydian

This project is forked from canvas-medical/pydian

Python framework for developer-friendly data interchange

License: MIT License

pydian's Introduction

Pydian - pythonic data interchange

Pydian is a pure Python library for readable and repeatable data mappings. Pydian reduces boilerplate for data manipulation and provides a framework for expressive data wrangling.

Using Pydian, developers can collaboratively and incrementally write data mappings that are expressive, safe, and reusable. Similar to how libraries like React were able to streamline UI components for frontend development, Pydian aims to streamline data transformations for backend development.

get specific data, then do stuff

The key idea behind Pydian is the following: get data from an object, and if that succeeded, do stuff to it.

from pydian import get

# Some arbitrary source dict
payload = {
    'some': {
        'deeply': {
            'nested': [{
                'value': 'here!'
            }]
        }
    },
    'list_of_objects': [
        {'val': 1},
        {'val': 2},
        {'val': 3}
    ]
}

# Conveniently get values and chain operations
assert get(payload, 'some.deeply.nested[0].value', apply=str.upper) == 'HERE!'

# Unwrap list structures with [*]
assert get(payload, 'list_of_objects[*].val') == [1,2,3]

# Safely specify your logic with built-in null checking (handle `None` instead of a stack trace!)
assert get(payload, 'some.deeply.nested[100].value', apply=str.upper) is None

That's it! Additional constructs are added for more complex mapping operations (Mapper).

What makes this different from regular operations? Pydian is designed with readability and reusability in mind:

  1. By default, get returns None on failure. This offers a more flexible alternative to direct indexing (e.g. array[0]) -- see the sketch after this list.
  2. For a specific field, you can concisely fit all of your functional logic into one line of Python. This improves readability and maintainability.
  3. All functions are "pure" and can be effectively reused and imported without side effects. This encapsulates behavior and promotes reusability.
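As a quick illustration of point 1, a minimal sketch (reusing the payload dict and get import from the snippet above):

# Direct indexing raises when the data isn't there...
try:
    payload['some']['deeply']['nested'][100]['value']
except IndexError:
    pass

# ...while `get` simply returns None for the same path
assert get(payload, 'some.deeply.nested[100].value') is None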

Developer-friendly API

If you are working with dicts, you can use:

  • A get function with JMESPath key syntax. Chain operations on success, else continue with None
  • A Mapper class that performs post-processing cleanup on "empty" values. For nuanced edge cases, conditionally DROP fields or KEEP specific values (see the sketch after this list)
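A minimal Mapper sketch, assuming Mapper wraps a dict -> dict mapping function and the instance is called directly on a source dict (see the Mapper tests linked below for the exact API, including DROP/KEEP and empty-value cleanup):

from pydian import Mapper, get

def mapping(source: dict) -> dict:
    # Each value is built with `get`; failed lookups map to None
    return {
        'staticValue': 'some static data',
        'upper': get(source, 'some.deeply.nested[0].value', apply=str.upper),
        'vals': get(source, 'list_of_objects[*].val'),
    }

mapper = Mapper(mapping)
result = mapper(payload)  # `payload` as in the snippet above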

(Experimental) If you're tired of writing one-off lambda functions, consider using:

  • The pydian.partials module, which provides (possibly) common 1-input, 1-output functions (import pydian.partials as p). A generic p.do wrapper creates a partial function that fills in parameters starting from the second parameter, whereas functools.partial starts from the first -- see the sketch below.
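A minimal sketch of that distinction (assuming p.do takes the trailing arguments positionally):

from functools import partial
import pydian.partials as p

# functools.partial fixes arguments starting from the first parameter...
assert partial(pow, 2)(3) == 8  # pow(2, 3)

# ...while p.do fixes them starting from the second parameter
assert p.do(pow, 2)(3) == 9     # pow(3, 2)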

(Experimental) If you are working with pl.DataFrames, you can use:

  • A select function with simple SQL-like syntax (,-delimited, ~ for conditionals, * to get all) -- see the sketch below the install note
  • Some functions for creating new dataframes (left_join, inner_join, insert for rows, alter for cols)

Note: the DataFrame module is not included by default. To install, use: pip install "pydian[dataframes]"
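A rough sketch of the select syntax described above -- the import path is assumed here, and the exact conditional syntax is documented in the select tests linked below:

import polars as pl
from pydian.dataframes import select  # import path assumed

df = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# ,-delimited column selection; `*` selects all columns
assert select(df, 'a').columns == ['a']
assert select(df, '*').columns == ['a', 'b']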

Examples

dicts: See get tests and Mapper tests

(Experimental) pl.DataFrames: See select tests

(Experimental) pydian.partials: See pydian.partials tests or snippet below:

from pydian import get
import pydian.partials as p

# Arbitrary example
source = {
    'some_values': [
        250,
        350,
        450
    ]
}

# Standardize how the partial functions are written for simpler management
assert p.equals(1)(1) == True
assert p.equivalent(False)(False) == True
assert get(source, 'some_values', apply=p.index(0), only_if=p.contains(350)) == 250
assert get(source, 'some_values', apply=p.index(0), only_if=p.contains(9000)) is None
assert get(source, 'some_values', apply=p.index(1)) == 350
assert get(source, 'some_values', apply=p.keep(2)) == [250, 350]

Future Work

After 1.0, Pydian will be considered done (barring other community contributions 😃)

There may be further language support in the future (e.g. JS, Rust, Go, Julia, etc.), which could make this pattern even more useful (though still very much TBD!)

Contact

Please submit a GitHub Issue for any bugs + feature requests 🙏

pydian's People

Contributors

ericpan64, csande

pydian's Issues

Look into using Pydantic under the hood

Pydantic already solves many of the existing problems with #20 -- the main goal of the Pydian validation would be to make it easier to express some of the different rules and error cases (e.g. make it easier to slightly modify an existing model / set of rules)

Could we get the best of both -- e.g. use Pydantic under the hood and make data models interoperable? Worth investigating further!

Integrate pola.rs into dataframe functionality

Requested feature

Make interface compatible with pola.rs, especially memory management and "lazy" loading

This is also a good time to:

  • review ways to optimize the existing dataframes, e.g. leverage memory management for dtypes and other things as part of the framework (see what libraries exist)
  • look into feature flags (e.g. if library dependencies get too large, do something like pydian[mm] for memory management)

Alternatives considered

Additional context

Depends on #6

For each file in ... operator

Problem

Requested feature

For each file in a folder matching a regex, do ...
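A rough sketch of what the requested helper could look like (purely illustrative -- for_each_file is not part of pydian):

from pathlib import Path
import re

# Hypothetical helper: yield files in `folder` whose names match `pattern`
def for_each_file(folder: str, pattern: str):
    regex = re.compile(pattern)
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and regex.search(path.name):
            yield path

# e.g. apply a mapping to every matching file
# for path in for_each_file('data/', r'\.json$'):
#     ...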

Alternatives considered

Additional context

Overload `with` for temporary data grabs

Problem

Often when fetching data, it's only used temporarily. Why not use a with context manager?

Requested feature

with load_some_data(...) as source:
   res = get(source, ...)

# Here, `source` gets dropped automatically. Saves memory

Alternatives considered

Additional context

Rework `custom_dsl_fn`

Describe the bug

The custom_dsl_fn feature should be reworked since other parts of the repo make assumptions on the keypaths (e.g. Mapper cleanup, enforcing strict, etc.)

  • Can make this more abstract -- e.g. defining a keypath function that maps to a "regular" index, though at that point it might be a bit moot
  • Can also skip the None check for custom DSLs

Additional context
Opened from #15

Add database connection support

Requested feature

For common database operations, support connecting + converting a dataframe -> database table

E.g. something like: create_table(df, 'namespace', conn), update_table(...), etc. Writing SQL code from scratch each time is currently necessary but cumbersome

Alternatives considered

... Should look into this more!

Additional context

Depends on #6

Add dataframe support

Use pola.rs as base foundation! Use available functions when possible

Initial thoughts on feature set (a small polars sketch follows the list):

  • Dictionary-based data mapping, from DataFrame -> DataFrame or DataFrame -> JSON
    • Map by row
    • SELECT, GROUP BY, ORDER BY equivalents
  • JOIN dataframes with logical links ("lazy" eval)
  • Have state managed by the Mapper (over the lifecycle of the Mapper)
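To ground those bullets, a small sketch of the polars building blocks involved (a GROUP BY / ORDER BY equivalent plus "lazy" evaluation); this is plain polars, not a pydian API:

import polars as pl

df = pl.DataFrame({'k': [1, 1, 2], 'v': [10, 20, 30]})

# Lazy query: nothing is evaluated until .collect()
result = (
    df.lazy()
      .group_by('k')            # GROUP BY equivalent
      .agg(pl.col('v').sum())
      .sort('k')                # ORDER BY equivalent
      .collect()
)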

Add a "strict" / debug mode

Problem: Getting to "correct" code is difficult to debug, since if the keypath string is typed incorrectly, the code just returns None instead of erroring out. This is the desired behavior once the setup is "correct", though it's difficult to incrementally get there

Suggested solution: Add a strict bool, where strict=True means it'll throw a stack trace on error. Then people can set an env variable or something. Consider also strict: bool | Callable to allow for custom behavior (though make it clearly documented; the typing isn't very intuitive)
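A hypothetical usage sketch of the suggested flag (strict is the proposed parameter, not an existing one):

from pydian import Mapper, get

def mapping(m: dict) -> dict:
    return {'val': get(m, 'does.not.exist')}

# Hypothetical: with strict=True, this would raise with a stack trace
# pointing at the failed `get`, instead of silently mapping the value to None
mapper = Mapper(mapping, strict=True)
mapper({})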

Implement "nested gets" with dataframe `select`

Requested feature

If I have a dataframe with nested json, I want to be able to:

  • Unnest it
  • Expand it to multiple columns

E.g.

  • select(df, "a.firstKey[0].secondKey")
  • select(df, "a +> {b: a.firstKey[0].secondKey, c: a.firstKey}")
    • ->: create new cols and "consume" the input col
    • +>: create new cols, keep the source col

Alternatives considered

Additional context

Depends on #6

Look into Ibis for dataframe consolidation

Problem

Ibis is an existing and popular framework for consolidating many different data "backends". Consider building off of this (or just suggesting it if the problem is sufficiently solved).

Similar in spirit to validation and pydantic: add a niche when applicable, and provide interop for user experience

Requested feature

Alternatives considered

Additional context

Add framework for loading data into a database as an ETL script

Is your feature request related to a problem? Please describe.

Loading data to something like Postgres or SQLite requires writing a lot of boilerplate code, and then doing a lot of data wrangling to get that in a consistent form

Describe the solution you'd like

Standard (database-agnostic) API for loading SQL data

Describe alternatives you've considered

Additional context

Add support for serializing objects

Problem

Loading to/from Protobuf using a dictionary is too hard. It'd be nice to have a way to do this more directly.

Requested feature

Something like

import your.protoclass.here as exp

d = {'a': 'b'}
d_proto = as_proto(d, cls=exp.SomeClass)  # Raise an error if this fails
save_to_disk(d_proto)

Alternatives considered

  • There are existing JSON wrapper functions that Google provides -- look into this more and see if it solves the problem (and/or if wrapper functions like above would help)

Additional context

Have DataFrame module return error instead of None

When working with DataFrames (usually in something like a Jupyter notebook), columns are generally not optional (though values can be).

Thus, returning None could cause an issue. For dict handling it makes sense, since fields can be optional (1-1 with values), and a strict mode is available. Here, however, it's better to return an error -- wrappers that prefer None could be added later (e.g. in Piper)

Reorganize `partials` module

As stuff gets a lot more domain-specific, refactor to a folder. E.g.

partials
|-- pandas.py
|-- __init__.py
|-- core.py
`-- ... etc.

Then consider:

import pydian.partials as p

p.pd.iloc(...)
p.do(...)
... etc.

Key benefit: organized developer experience

Add custom validators

Requested feature

Pydantic-compatible validators to match data schemas with specific structure

Alternatives considered

  • Just using pydantic -- not sure there's a way to enforce counts (e.g. require len range 1..10) as part of it

Additional context

Basically: can we get the FHIR spec but in code?

Implement a data pipeline feature

Requested feature

Consider a data pipeline abstraction (let's call it Piper) that allows specifying a set of operations, and then organizes the results accordingly. E.g.:

from pydian import Piper

import pydian.partials as p

pipe = Piper(steps=[str.upper, p.keep(3), ...])

assert pipe.run("some_str") == Ok("SOM")
assert pipe.run("s") == Err({"step": 0, "context": ...})

... etc.

It's worth looking into "Result"-like thinking (like OkErr). The goal would be programmatic reproduction of experiments -- e.g. if we get a specific kind of error, maybe we can rerun it or retry with a different, known approach (in code!)

Alternatives considered

... Need to look into this more!

Additional context

Some previous ideas:

Data Pipeline wrappers

  • try_in_order: a nested try/except without throwing an explicit error
  • Piper: a wrapper class that simplifies tracking:
    • "Failed" operations (None). Let stack traces play their course!
    • "Successful" operations (not None)

Implement concept of "Group" of tables / objects

Problem

The current approach to tables (0.3.0) might not fit the mental model of most people working with relational dbs

Requested feature

Idea: use relational algebra operators and make this more like SQL across a relational table. E.g.:

df1 = ...
df2 = ...
p = Grouper({'first': df1, 'second': df2})
p.select('a,b FROM first, second') # this can be regular SQL syntax, or a more formal syntax

Could maybe also do this with JSON objects, making it easy to quickly make something queryable in a relational way

Alternatives considered

Additional context

Add `strict` at the `get` level

Requested feature

Be able to specify strict for a specific get. This provides flexibility when developing

Alternatives considered

  • Use strict mapper, and then isolate specific behavior e.g. via debugger
  • Set up asserts manually (though can't assert at the inner level)
  • Add a specific enum (e.g. HALT, STOP) that causes the mapper to raise an error

Additional context

Find the "right" way of applying Polars Expressions

Problem

.apply works, however it is often slow. Sometimes there are faster ways to do the same thing (e.g. df['col'] // 2 is probably going to be ~100x faster than .apply(lambda r: r // 2))
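For example, a minimal sketch of the two approaches (using map_elements, the newer name for the Series-level apply):

import polars as pl

df = pl.DataFrame({'col': [2, 4, 6]})

# Row-wise apply: flexible, but runs a Python-level loop
slow = df['col'].map_elements(lambda r: r // 2, return_dtype=pl.Int64)

# Native expression: vectorized, typically orders of magnitude faster
fast = df.select(pl.col('col') // 2)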

Requested feature

Alternatives considered

Additional context

Add features for interfacing with vectors / tensors

Problem

Formatting things outside of 2d for ML can be very confusing since math and programming intermingle (reshaping, applying specific transforms)

Requested feature

A cleaner interface for working with vectors and ML things (PyTorch, TensorFlow, JAX, Dask, etc.)

Not exactly sure how yet, though want to jot the idea down

Alternatives considered

... look into this deeply!

Additional context

Add features for interfacing with databases

Problem

It's too hard to interface with databases. I want to just grab data and go -- not manage cursors and other things (which are important in many other contexts, but not here!)

Requested feature

A series of helper functions to move from object -> collection -> _database_. The database layer could be Postgres, MySQL, SQLite, etc.

E.g. possibly:

from pydian.databases import extract, transform, load

# `from` is a reserved keyword, so something like `source=` would be needed
df_from_db = extract(source="...")

df = pd.DataFrame({...})
load(df, to="...")

transform(db="...", ...)

Also worth considering adding interfaces for a cache (e.g. Redis)

We have Mapper, maybe Piper, ... then how about Baser? For standardizing connections, e.g. Baser.extract(...)

Alternatives considered

... look into this! Maybe this is just what dbt is doing (dev practices in SQL land)?

Additional context
