ncclementi commented on August 16, 2024

Notes and Questions asked at PyData NYC 2023 tutorial

TODOs:

  • Get post-it notes
  • In the intro slides, move the PyData translation slide to come after the Deferred Execution slide
  • Consider adding an Ibis-lingo slide, or explaining in the notebooks what an expression is versus an operation (?)
  • Consider removing the note in the Getting started notebook about the lockfile, because we pass read_only=True in the connect call. (We could move this note to the parquet loading section of the memtables notebook, where we create a plain DuckDB connection.)
  • Consider using to_sql() instead of show_sql(), because to_sql() has syntax highlighting.

Questions:

  • 24:47 What is the SQL we (Ibis) are generating?
    Answer: The SQL we generate is dialect-specific. In the notebook example, because the expression is built on a DuckDB table, the SQL is DuckDB SQL.
    You can pass dialect="snowflake" to ibis.show_sql() to get Snowflake SQL.

  • 27:55 Can you [select] any of the columns without specifying the actual string? Is it possible to do that by reference, say "the last column" or "the first column"?
    Answer: Yes, with selectors.

  • 37:53 The first (actually second) cell of the Getting started notebook took a long time, what's happening?
    Note: The person is in a local env, so it's on us; we need to troubleshoot. We should check whether that's still happening.

  • 39:30 Is there some way to control the log level on Ibis?
    Answer: There isn't

  • 40:00 I noticed I could put all the arguments into the agg function itself instead of breaking it into separate pieces? (like passing by, how, in directly)
    Answer: Yes. Unless we say something is deprecated, that's an approved way of doing it (maybe we can show an example of doing it this way too). Having it in separate parts might be easier to debug than one big chain of commands.

  • 40:46 (https://youtu.be/TyopbrmlZx8?t=2446) Do you support categorical data dtypes ...(unclear last part of question)?
    Answer: If the dataset has categorical dtypes, we treat them as strings, but it will still use whatever the DB does for efficient use of categories ...(unclear audio in last part of answer)

  • 44:06 Does every backend support memtables?
    Answer: Most of them do

  • 47:50 When do you switch from Ibis (or DuckDB) to pandas?
    Answer: It is up to you; one potential answer is never. It depends on your workload. Ibis can be a replacement for pandas. (Maybe show what happens when not in interactive mode, and show to_pandas().)

  • 48:56 If pandas is the backend, when things are being compiled, is the result that it generates pandas operations?
    Answer: This applies to Spark, Polars, DataFusion, and Dask. In those cases we map the Ibis expressions (e.g. order by) to some pandas/Polars/Spark operation; it's more of a straight translation.
    Followup: Is it possible to extract that result, to see what the operations are?
    Answer: Yes, it's possible.

  • 51:30 You run a big expression, it gets executed, and then you realize you have to change something. Does it rerun everything?
    Answer: The short answer is yes. But you can cache expressions, which will create a view of the results at that point, and then you can continue from there without repeating execution.

  • 53:15 If we are passing all the SQL to Snowflake, especially if it's hairy SQL, and we are executing it over and over again, are you going to shoot yourself in the foot with cost?
    Answer: It's definitely possible. We don't think the hairiness of the generated SQL is going to make a difference there: Snowflake wants to execute it in an optimized fashion, so it gets optimized down to something much cleaner. But yes, if you run a big query many times, the cost can definitely go up quickly.

  • 1:03:28 How do you know the size of the data?
    Answer: For the size in megabytes on disk, I would use du on Linux; for the row count, you can use .count().
    Followup: Is there a .info() or something so you can know the size? Like how much space it is going to take up in memory.
    Answer: No, but it doesn't matter that much. This isn't really being loaded into memory as a thing. DuckDB is an out-of-core database. For DuckDB, the amount of data you can load is limited by your hard drive size, not by how much RAM you have.

  • 1:05:29 With a table object, is it not trivial to do ... with the column names? (Not sure what the question is here; the audio was not clear. I think it's asking about autocomplete?)
    Answer: It should be available, it is trivial.
    Followup: Is it non-trivial for the deferred underscore operator as well?
    Answer: Highly non-trivial to add it. We don't know how to do that.

  • 1:14:08 Is there any built-in way to say: run it on Snowflake, and I want a local sample cat, and then I can work with that and then eventually switch back to the real thing?
    Answer: You can do that. Connect to Snowflake, get the table expression, call the table sample method (do we have this now?), then call to_pyarrow. Now you have a PyArrow table that you can read directly in memory from DuckDB. You can then start working with it locally, and you could dump it to Parquet and save it for later.

  • 1:14:55 Is there some kind of extension API? For example, we are running everything on Snowflake and we decide to migrate to some new technology, and I need to implement just part of Ibis for that platform before you announce a production release. Is there some way for me to implement those kinds of methods beforehand?
    Answer: There is no extension API or similar for that. But each expression is decomposed into a series of operations that are backend-agnostic, so you can implement a full backend that has very limited functionality to begin with.

That was all the questions/answers

from ibis-tutorial.
