Comments (6)

alexander-beedie commented on June 27, 2024

This is actually the expected behaviour as the expression-method .exclude(..) is not itself a selector, and does not return a selector. Calling a method on a selector will always result in an expression, and that expression will be broadcast over the selected columns.

To instead perform set operations with selectors (intersection, union, difference, complement) you can use the &, |, -, and ~ operators, as illustrated in the selectors docs (the operands must themselves be selectors).

In your example what you actually need is the following:

cs.expand_selector(df, cs.starts_with("a") - cs.by_name("ab"))
# ('a',)

This keeps things consistent - you don't really want expression methods to change their return type based on what they were called on, but if you want to do set operations on selectors, you can absolutely do that 👍

from polars.

machow commented on June 27, 2024

you don't really want expression methods to change their return type based on what they were called on

Can you say a bit more about this? To my understanding, this is fairly common (e.g. made easy to hint through the Self type).

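from typing import Self  # available in Python 3.11+

class Expr:
    def exclude(self, *columns: str) -> Self:
        # Build the result from the caller's own class, so a subclass gets its own type back.
        return self.__class__()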

edit: a good example is the suggested selection, where the return type of __sub__ depends on what it's called on, and the other object:

  • selector: cs.starts_with("a") - cs.by_name("ab")
  • expression: cs.starts_with("a") - 1, cs.starts_with("a") - pl.col("b")
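
The two bullets above can be modeled with a toy sketch. The Expr and Selector classes below are stand-ins written for this illustration, not the polars classes, and they only mimic the dispatch being discussed:

```python
from __future__ import annotations


class Expr:
    """Toy stand-in for an expression; not the polars class."""

    def __init__(self, label: str) -> None:
        self.label = label

    def __sub__(self, other: object) -> Expr:
        # Arithmetic subtraction always yields a plain expression.
        return Expr(f"({self.label} - {_label(other)})")


class Selector(Expr):
    """Toy stand-in for a selector: an expression that also supports set operations."""

    def __sub__(self, other: object) -> Expr:
        if isinstance(other, Selector):
            # selector - selector: set difference, the result stays a selector.
            return Selector(f"({self.label} minus {other.label})")
        # selector - anything else: fall back to arithmetic, the result is an expression.
        return Expr.__sub__(self, other)


def _label(value: object) -> str:
    """Readable label for either a toy expression or a literal."""
    return value.label if isinstance(value, Expr) else repr(value)


sel = Selector("starts_with('a')") - Selector("by_name('ab')")
expr = Selector("starts_with('a')") - 1
print(type(sel).__name__, type(expr).__name__)  # Selector Expr
```

So the return type of the operator already depends on both the receiver and the other operand, which is the point of the comparison with .exclude().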

The bigger picture for Great Tables

I think the challenge right now in Great Tables is that we want to do these things:

  • Accept some set of polars operations that perform simple selection
    • Let's say simple selection is something that specifies columns of data to return, without specifying any transformations/operations on them (either their data or names).
  • Return the names of the columns that would be selected.

I think we are running into three issues:

  • cognitive: there are operations I can do as a user that perform simple selection, but that will be rejected by Great Tables. (e.g. cs.starts_with("a").exclude("ab")).
  • cognitive: if someone uses the DataFrame.select() interface, the suggested method, cs.starts_with("a") - cs.by_name("ab") is one way to specify the selection, but might not be what they gravitate towards.
  • computer: Great Tables is using expand_selector(), which is maybe the wrong way to model this situation?

We just need to know they are not transforming data, and the column names they'd select. So if .exclude() pushes us out of being a selector, but there's a better way to still recognize it as a simple selection, we can def go that way!

Thanks for coming on this funky selection adventure

I realize a lot of this is a funkier use of the polars API. Thanks for being so quick to help with this stuff :)

alexander-beedie commented on June 27, 2024

Can you say a bit more about this? To my understanding, this is fairly common

In this case it's about API consistency more than anything technical - it's cleanest if expression methods on selectors consistently broadcast as expressions (a selector is an expression of course, but it's also a little special so... arguable).

I'm very amenable to finding ways to make this work for your use-case though; exclude is in a slight grey area as it doesn't really broadcast, it modifies the preceding multiple-output col expression (it existed before selectors, which is why it probably looks like it should work, but doesn't).

  • Accept some set of polars operations that perform simple selection
  • Let's say simple selection is something that specifies columns of data to return, without specifying any transformations/operations on them (either their data or names).

So... there are definitely no renaming operations happening at this point?
e.g. no pl.col("xyz").alias("abc") or cs.starts_with("a").name.prefix("foo_")?

I realize a lot of this is a funkier use of the polars API. Thanks for being so quick to help with this stuff :)

😎👍

machow commented on June 27, 2024

Simple selector rules of thumb

it's cleanest if expression methods on selectors consistently broadcast as expressions

Are the rules of thumb for selectors something like this?

  1. every top-level function in cs that can be used for selection returns a selector (e.g. cs.starts_with())
  2. in order to keep things as selectors, you need to use infix operators (e.g. cs.starts_with(...) - cs.by_name(...))
  3. infix operators are only guaranteed to return selectors if all operands are selectors
    • e.g. selector: cs.starts_with(...) - cs.by_name(...)
    • e.g. expression: cs.starts_with("...") - "a"
  4. method calls off selectors return expressions

Rule of thumb (1) explains why cs.exclude(...) is a selector, even though pl.exclude(...) isn't. Rules of thumb (3) and (4) explain why cs.starts_with(...).exclude(...) isn't a selector.

(I'm less worried about the model than about people being able to guess what will work and what won't. I want to hedge against the risk that they don't know how to rewrite what they're used to passing to DataFrame.select() as the less common code that results in a selector.)

Simple selection vs renaming

So... there are definitely no renaming operations happening at this point?

Yeah, exactly:

  1. no renaming operations in simple selector cases (and we'd be okay raising an error for renames)
  2. for the one renaming situation in Great Tables, we can just use pl.DataFrame.rename() as the contract ✨

Example

Simple selection is really helpful for cases like the coffee data table.

import polars as pl
import polars.selectors as cs
from great_tables import GT, loc, style

coffee_sales = pl.read_json("data/coffee-sales.json")

sel_rev = cs.starts_with("revenue")
sel_prof = cs.starts_with("profit")


coffee_table = (
    GT(coffee_sales)
    .tab_header("Sales of Coffee Equipment")

    # Case 1: simple selection ----
    # create spanner columns "Revenue" and "Profit" (super common activity)

    .tab_spanner(label="Revenue", columns=sel_rev)
    .tab_spanner(label="Profit", columns=sel_prof)

    # Case 2: renaming ----
    # renaming basically only happens here
    .cols_label(
        revenue_dollars="Amount",
        profit_dollars="Amount",
        revenue_pct="Percent",
        profit_pct="Percent",
        icon="",
        product="Product",
    )
)

alexander-beedie commented on June 27, 2024

You have nailed the rules of thumb - enough so that I am genuinely tempted to crib that breakdown for a small section in the selectors documentation to make it more widely available 🤣 Anyway, with the information above in mind, I think I have a solution, and have made a PR. Take a look and see if you think it'll cover everything?

alexander-beedie commented on June 27, 2024

Available in the new release: 0.20.30 ✌️
