Notes and Questions asked at PyData NYC 2023 tutorial
TODOs:
- Get post-it notes
- In the intro slides, move the PyData translation slide so it comes after Deferred Execution
- Consider adding an Ibis-lingo slide, or a note in the notebooks, on what an expression is vs. an operation (?)
- Consider removing the note about the lockfile in the Getting started notebook, because we have `read_only=True` in the connect call. (We could move this note to the parquet loading section in the memtables notebook, where we create a plain duckdb connection.)
- Consider using `to_sql()` instead of `show_sql()`, because `to_sql` has syntax highlighting.
Questions:
-
24:47 What is the SQL we (Ibis) are generating?
Answer: The SQL we generate is dialect-specific. In the notebook example, because the expression is built on a DuckDB table, the SQL is DuckDB SQL. You can pass `dialect="snowflake"` to `ibis.show_sql()` to get Snowflake SQL.
-
27:55 Can you select any of the columns without specifying the actual string? Is it possible to do that by reference, say "the last column" or "the first column"?
Answer: Yes, with selectors.
-
37:53 The first (actually second) cell of the Getting started notebook took a long time. What's happening?
Note: The person was running in a local environment, so this is on us and we need to troubleshoot. We should check whether this is still happening.
-
39:30 Is there some way to control the log level on Ibis?
Answer: There isn't.
-
40:00 I noticed I could put all the arguments into the `agg` function itself instead of breaking it into separate pieces? (like passing by, how, in directly)
Answer: Yes. Unless we say something is deprecated, that's an approved way of doing it (maybe we can show an example of doing it this way too). Having it in separate parts might be easier to debug than one big chain of commands.
-
[40:46](https://youtu.be/TyopbrmlZx8?t=2446) Do you support categorical data dtypes ... (unclear last part of question)?
Answer: If the dataset has categorical dtypes we will treat them as strings, but it will still use whatever the DB does for efficient use of categories ... (unclear audio in last part of answer)
-
44:06 Does every backend support memtables?
Answer: Most of them do.
-
47:50 When do you switch from ibis (or DuckDB) to pandas?
Answer: It is up to you; one potential answer is never. It depends on your workload. Ibis can be a replacement for pandas. (Maybe show what happens when not in interactive mode and show `to_pandas()`.)
-
48:56 If pandas is the backend when things are being compiled, is the result that it generates pandas operations?
Answer: This applies to Spark, Polars, DataFusion, Dask. In those cases we are mapping the Ibis expressions (e.g. order by) to some pandas/Polars/Spark operation; it's more of a straight translation.
Followup: Is it possible to extract that result, to see what the operations are?
Answer: Yes, it's possible.
-
51:30 You run a big expression, it gets executed, and then you think "I have to change something". Does it rerun everything?
Answer: The short answer is yes. But you can cache expressions, which will create a view of the results at that point, and then you can continue from there without repeating execution.
-
53:15 If we are passing all the SQL to Snowflake, especially if it's hairy SQL, and we are executing it over and over again, are you going to shoot yourself in the foot with cost?
Answer: It's definitely possible. We don't think the hairiness of the generated SQL is going to make a difference: Snowflake wants to execute it in an optimized fashion, so it gets optimized down to something much cleaner. But yes, if you run a big query many times, the cost can go up quickly.
-
1:03:28 How do you know the size of the data?
Answer: In terms of the size in megabytes on disk, I would use `du` in Linux for that; in terms of the row count, you can do `.count()`.
Followup: Is there a `.info()` or something so you can know the size? Like how much space it is going to take up in memory.
Answer: No, but it doesn't matter that much. This isn't really being loaded into memory as a thing. DuckDB is an out-of-core database. For DuckDB, the amount of data you can load is limited by your hard drive size, not by how much RAM you have.
-
1:05:29 With a table object, is it not trivial to do ... with the column names of them? (Not sure what the question is here, the audio was not clear. I think it's asking about autocomplete?)
Answer: It should be available, it is trivial.
Followup: Is it non-trivial for the deferred underscore operator as well?
Answer: Highly non-trivial to add it. We don't know how to do that.
-
1:14:08 Is there any built-in way to say: run it on Snowflake, and I want a local sample `cat`, and then I can work with that and eventually switch back to the real thing?
Answer: You can do that. Connect to Snowflake, get the table expression, call the table sample method (do we have this now?), do a `to_pyarrow`, and now you have a pyarrow table that you can read in memory directly from DuckDB. Then you can start working with it locally, and you could dump it to parquet and save it for later.
-
1:14:55 Is there some kind of extension API? For example, we are running everything on Snowflake and we decide to migrate to some new technology, and I need to implement just part of Ibis for that platform before you announce a production release. Is there some way for me to implement those kinds of methods beforehand?
Answer: There is no extension API or similar for that. But each expression is decomposed into a series of operations that are backend agnostic, so you can implement a full backend that has very limited functionality to begin with.
That was all the questions/answers
from ibis-tutorial.