
refactor: things to mention in intro about ibis-tutorial HOT 1 CLOSED

ibis-project commented on August 16, 2024

refactor: things to mention in intro


Comments (1)

ncclementi commented on August 16, 2024

Notes and Questions asked at PyData NYC 2023 tutorial

TODOs:

Get post-it notes
In the intro slides, move the PyData translation slide so it comes after the Deferred Execution slide.
Consider adding an Ibis-lingo slide, or explaining in the notebooks what an expression is versus an operation (?)
Consider removing the note about the lockfile in the Getting started notebook, because we pass read_only=True in the connect call. (We could move this note to the parquet loading section of the memtables notebook, where we create a plain DuckDB connection.)
Consider using to_sql() instead of show_sql(), because to_sql() has syntax highlighting.

Questions:

24:47 What is the SQL we (Ibis) are generating?
Answer: The SQL we generate is dialect-specific. In the notebook example, because the expression is built on a DuckDB table, the SQL is DuckDB SQL.
You can pass dialect="snowflake" to ibis.show_sql() to get Snowflake SQL.
27:55 Can you select any of the columns without specifying the actual string? Is it possible to do that by reference, say "the last column" or "the first column"?
Answer: Yes, with selectors.
37:53 The first (actually second) cell of the Getting started notebook took a long time, what's happening?
Note: The person was in a local env, so it's on us; we need to troubleshoot. We should check if that's still happening.
39:30 Is there some way to control the log level on Ibis?
Answer: There isn't
40:00 I noticed I could put all the arguments into the agg function itself instead of breaking it into separate pieces (like passing by, how, in directly)?
Answer: Yes. Unless we say something is deprecated, that's an approved way of doing it (maybe we can show an example of doing it this way too). Having it in separate parts might be easier to debug than having a big chain of commands.
[40:46](https://youtu.be/TyopbrmlZx8?t=2446) Do you support categorical data dtypes ...(unclear last part of question)?
Answer: If the dataset has categorical dtypes, we treat them as strings, but it will still use whatever the DB does for efficient use of categories ...(unclear audio in last part of answer)
44:06 Does every backend support memtables?
Answer: Most of them do
47:50 When do you switch from ibis (or DuckDB) to pandas?
Answer: It is up to you; one potential answer is never. It depends on your workload. Ibis can be a replacement for pandas. (Maybe show what happens when not in interactive mode and show to_pandas().)
48:56 If pandas is the backend when things are being compiled, is the result that it generates pandas operations?
Answer: This applies to Spark, Polars, DataFusion, and Dask. In those cases we map the Ibis expressions (e.g. order by) to some pandas/Polars/Spark operation; it's more of a straight translation.
Followup: Is it possible to extract that result, to see what the operations are?
Answer: Yes, it's possible.
51:30 You run a big expression, it gets executed, and then you realize you have to change something. Does it rerun everything?
Answer: The short answer is yes. But you can cache expressions, which will create a view of the results at that point, and then you can continue from there without repeating execution.
53:15 If we are passing all the SQL to Snowflake, especially if it's hairy SQL, and we are executing it over and over again, are you going to shoot yourself in the foot with cost?
Answer: It's definitely possible. We don't think the hairiness of the generated SQL is going to make a difference there: Snowflake wants to execute it in an optimized fashion, so it gets optimized down to something much cleaner. But yes, if you run a big query many times, the cost can definitely go up quickly.
1:03:28 How do you know the size of the data?
Answer: In terms of the size in megabytes on disk, I would use du on Linux for that. In terms of the row count, you can do .count().
Followup: Is there a .info() or something so you can know the size? Like how much space it is going to take up in memory.
Answer: No, but it doesn't matter that much. This isn't really being loaded into memory as a thing. DuckDB is an out-of-core database. For DuckDB, the amount of data you can load is limited by your hard drive size, not by how much RAM you have.
1:05:29 With a table object, is it not trivial to do ... with the column names (not sure what the question is here, the audio was not clear; I think it's asking about autocomplete?)
Answer: It should be available, it is trivial.
Followup: Is it non-trivial for the deferred underscore operator as well?
Answer: Highly non-trivial to add it. We don't know how to do that.
1:14:08 Is there any built-in way to say run it on Snowflake and I want a local sample cat, and then I can work with that and then eventually switch back to the real thing.
Answer: You can do that. Connect to Snowflake and get the table expression. There is a table sample method (do we have this now?). Call to_pyarrow, and now you have a PyArrow table that you can read directly in memory from DuckDB. Then you can start working with it locally, and you could dump it to parquet and save it for later.
1:14:55 Is there some kind of extension API? For example, we are running everything on Snowflake and we decide to migrate to some new technology, and I need to implement just part of Ibis for that platform before you announce a production release. Is there some way for me to implement those kinds of methods beforehand?
Answer: There is no extension API or similar for that. But each expression is decomposed into a series of operations that are backend-agnostic, so you can implement a full backend that has very limited functionality to begin with.

That was all the questions/answers
