Comments (8)
I think we should instead have a separate parameter called `coalesce`, which is a bool.
In joining, the only time any of this matters is a left/right/outer join, in which the key on one side has no corresponding key on the other. This could all be captured by a single boolean column that indicates whether a join was successful or not (or an enum that says 'Left'/'Right'/'Both' to indicate which frame the value came from). But barring that, the only reason to distinguish between whether the value came from the left or right frame is when one is null and the other isn't.
To me, the non-coalesce situation feels very redundant. If you have many columns, they're going to look like this (say we have a left join):
Key | A_left | A_right | B_left | B_right | C_left | C_right | comment |
---|---|---|---|---|---|---|---|
key1 | a1 | null | b1 | null | c1 | null | no match |
key2 | a2 | a2 | b2 | b2 | c2 | c2 | match |
key3 | a3 | a3 | b3 | b3 | c3 | c3 | match |
In other words, all of our `_right` columns are redundant with a single column declaring whether there was a successful match or not.
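To make the redundancy claim concrete, here is a minimal plain-Python sketch (not polars code, and not how polars implements joins). The helper `left_join_non_coalesced` and the `matched` column are hypothetical names used only for illustration; the sketch assumes unique, non-null keys on the right side.

```python
def left_join_non_coalesced(left, right, keys):
    """Left-join two lists of dicts on `keys` without coalescing the keys.

    Emits `<key>_left` and `<key>_right` for every key column, plus a
    hypothetical `matched` flag. Assumes right-side keys are unique and non-null.
    """
    right_by_key = {tuple(r[k] for k in keys): r for r in right}
    out = []
    for lrow in left:
        rrow = right_by_key.get(tuple(lrow[k] for k in keys))
        row = {}
        for k in keys:
            row[f"{k}_left"] = lrow[k]
            # The right key is either identical to the left key or null.
            row[f"{k}_right"] = rrow[k] if rrow is not None else None
        row["matched"] = rrow is not None
        out.append(row)
    return out


left = [
    {"A": "a1", "B": "b1"},
    {"A": "a2", "B": "b2"},
    {"A": "a3", "B": "b3"},
]
right = [
    {"A": "a2", "B": "b2"},
    {"A": "a3", "B": "b3"},
]
rows = left_join_non_coalesced(left, right, ["A", "B"])

# Every `_right` key column can be rebuilt from `_left` and `matched` alone.
redundant = all(
    row[f"{k}_right"] == (row[f"{k}_left"] if row["matched"] else None)
    for row in rows
    for k in ("A", "B")
)
```

Under these assumptions, `redundant` holds for every row, which is the sense in which the `_right` key columns carry no information beyond the match flag.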
from polars.
I am opposed to a `coalesce` boolean flag, as it is not valid for all join types. I do, however, think we must change the left join and create a "left_coalesce" join.
from polars.
This is overlapping with these 3:
from polars.
@mcrumiller A couple of counterexamples:
- For `join_asof`, the comment column wouldn't make sense.
- It also wouldn't make sense if we're doing `left_on='a', right_on=pl.col('b').expression_that_makes_b_look_like_a()` and we want to retain what `b` started as (something like truncating a datetime, for example).
from polars.
`join_asof` only works on a single join column, but yes, I agree that it would be nice to retain the `asof` column there, and that the above example does not apply in that case.
For your second example (I assume we're no longer talking about an asof-join), I think my example still does apply. Column `b` would be returned as if it were a non-key column. The expression defining the right join key column would end up just being `B_right` above: either exactly equal to `a` or null, with no exceptions.
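A plain-Python sketch of that second case may help, under the assumption that the right key is derived by truncating a datetime `b` to a date. The helper `left_join_on_derived_key` and the output column `b_key` are hypothetical names, and this is not polars code; it also assumes the derived right keys are unique.

```python
from datetime import date, datetime


def left_join_on_derived_key(left, right):
    """Left-join where the right key is `b` truncated to a date.

    The derived key (`b_key`) ends up equal to `a` or null, while the
    original `b` is carried along like any other non-key column.
    """
    right_by_key = {r["b"].date(): r for r in right}
    out = []
    for lrow in left:
        rrow = right_by_key.get(lrow["a"])
        matched = rrow is not None
        out.append(
            {
                "a": lrow["a"],
                "b_key": rrow["b"].date() if matched else None,
                "b": rrow["b"] if matched else None,
            }
        )
    return out


left = [{"a": date(2024, 1, 1)}, {"a": date(2024, 1, 2)}]
right = [{"b": datetime(2024, 1, 1, 13, 45)}]
rows = left_join_on_derived_key(left, right)
```

Here the untruncated `b` survives the join as an ordinary column, so dropping the derived key loses nothing as long as `b` itself is kept.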
from polars.
Is the plan to do this in some `0.20.x` release or in `1.0`?
In my projects that use polars at work, I'm pinning specific versions instead of just pinning `polars < 1.0`, because I'm worried about when this change is going to happen.
from polars.
@s-banach it's currently milestoned for 1.0
from polars.
We want to pick this up for 1.0. The current blocker is that we do not support left outer non-coalesce join in the streaming engine.
from polars.