Comments (10)
Hi @ritchie46 ,
I've now provided a complete example and memory consumption flamegraphs for the example above.
Would be great if you could take a look at this, as it breaks the promise of support for "larger than memory files" with scan_csv/sink_parquet.
Btw, in case this is related, a similar memory consumption problem is there when reading parquet files with scan_parquet. I can provide an example for that once this bug is fixed (since I'd need to be able to dynamically create a parquet file without consuming much memory first).
from polars.
Hard to reproduce like this. Can you tell me more about your schema? How does memory increase over time? Why do you need to glob? Does that have any influence?
from polars.
Just tried setting up a more complete minimal example with random dummy csv data, but can't reproduce it there.
I'll continue looking into this and report back.
from polars.
Yes, maybe compile with debug symbols and get a heaptrack report?
from polars.
I managed to generate some sample data with similar behavior: 22 GB peak memory usage for a 23 GB CSV file.
The resulting flame graph from memray is attached:
memray-flamegraph.py.15.zip
The code looks like this:
import csv
import logging
import os
import random
from pathlib import Path

import polars as pl

logger = logging.getLogger("etl")
pl.show_versions()

local_folder = Path("downloads/polars_example")
local_file_csv = Path(local_folder / "example").with_suffix(".csv")
local_file_parquet = Path(local_folder / "example").with_suffix(".parquet")
os.makedirs(local_folder, exist_ok=True)

logger.info("Generating dummy CSV")
with open(local_file_csv, "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    for _ in range(200_000_000):
        writer.writerow(
            [
                "3c436c803d23be8" + str(random.random()),  # noqa
                "lEmeKiDdvHkkvpPlnvWPBAQhfG3DpFjDDDEA6ndhLX-dQXeyWvSCY" + str(random.random()),  # noqa
                "2023-01-01",
            ]
        )

logger.info("Generating Parquet")
lf = pl.scan_csv(local_file_csv, low_memory=True)
lf.sink_parquet(local_file_parquet)
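For quicker iteration, the generator above can be parameterised by row count. The following is a minimal sketch, not the original poster's code: the function name `write_dummy_csv` and its `rows` and `seed` parameters are assumptions introduced here for illustration, while the three-column row layout mirrors the example above.

```python
import csv
import random
from pathlib import Path


def write_dummy_csv(path: Path, rows: int, seed: int = 0) -> None:
    """Write `rows` lines of dummy data shaped like the example above."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same file
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        for _ in range(rows):
            writer.writerow(
                [
                    "3c436c803d23be8" + str(rng.random()),
                    "lEmeKiDdvHkkvpPlnvWPBAQhfG3DpFjDDDEA6ndhLX-dQXeyWvSCY" + str(rng.random()),
                    "2023-01-01",
                ]
            )
```

Generating the file at a few different sizes (for example 1 million, 10 million, and 100 million rows) and comparing peak memory for each run would show whether memory grows with the input size or stays flat.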
from polars.
Oh, and this happens in both Kubernetes (Debian bullseye based image) and WSL (Debian bookworm based image).
from polars.
I've now run the exact code above (the previous flame graph contained some additional code) on my macOS computer with the same results regarding memory usage.
Please find the flame graph attached (I ran memray in native mode this time, so it contains more details on the rust side).
memray-flamegraph-main.py.76054.html.zip
The version info is the following:
--------Version info---------
Polars: 0.20.3
Index type: UInt32
Platform: macOS-14.2.1-arm64-arm-64bit
Python: 3.12.1 (main, Jan 5 2024, 19:05:58) [Clang 15.0.0 (clang-1500.1.0.2.5)]
----Optional dependencies----
adbc_driver_manager: <not installed>
cloudpickle: <not installed>
connectorx: <not installed>
deltalake: <not installed>
fsspec: <not installed>
gevent: <not installed>
hvplot: <not installed>
matplotlib: <not installed>
numpy: <not installed>
openpyxl: <not installed>
pandas: <not installed>
pyarrow: <not installed>
pydantic: <not installed>
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: <not installed>
xlsx2csv: <not installed>
xlsxwriter: <not installed>
from polars.
I see the same behavior. Memory seems to continually grow while streaming from csv to parquet. I’ve tried to turn off all optimizations as well as compression, and adjusted row_group_size.
I’ll see if I can profile further as well.
from polars.
I was able to reproduce this in Rust, creating a LazyCsvReader and then calling .sink_parquet() on the resulting LazyFrame.
from polars.
I'm seeing this memory behavior too using polars v1.2.1 to read a 12GB CSV file:
lf = pl.scan_csv('test.csv')
lf.sink_parquet('test.parquet')
Watching RSS memory usage with ps while that runs shows that RSS memory peaks at 21 GB.
Experimenting with reading with a schema, and writing using compression and row_group_size, had no effect. It also doesn't appear related to writing the Parquet file, since sink_csv() behaves the same way.
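As an alternative to polling `ps` by hand, the peak RSS of the conversion can be read once the process has exited. This is a minimal sketch using only the Python standard library on Unix-like systems; the helper name `peak_rss_of_child_mb` is an assumption introduced here, and it expects the conversion to live in its own script that is run as a child process.

```python
import resource
import subprocess
import sys


def peak_rss_of_child_mb(cmd: list[str]) -> float:
    """Run `cmd` to completion and return the peak RSS of child processes in MiB."""
    subprocess.run(cmd, check=True)
    # RUSAGE_CHILDREN covers children that have terminated and been waited for;
    # ru_maxrss is the largest peak among them, so use a fresh interpreter
    # per measurement to avoid mixing results from earlier children.
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor
```

For example, `peak_rss_of_child_mb([sys.executable, "convert.py"])` would report the peak for a script named `convert.py` (a placeholder name) containing the two `scan_csv` / `sink_parquet` lines. The number is a single peak value, so it does not show how memory grows over time the way a memray flame graph does.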
I don't know if this is a regression, but it does seem like this behavior is not described correctly in the documentation?
If it's helpful to have the CSV file, email me at [email protected]. I don't want to link it here. There are some columns with really large values (as much as 8 MB of string data).
from polars.