Comments (19)
Thanks for reporting the issue. Can you provide some specifics on what was observed for both 0.2.4 and 0.3.0 so I can attempt to reproduce?
Yes, sure:
- create a big CSV file in Azure Blob Storage
- use dask.distributed/dask_kubernetes to create a cluster
- dd.read_csv(path, storage_options).persist() that file (see the sketch below)
- looking at the dask.distributed dashboard you should be able to see how much longer it takes to read the file; going through the dask.distributed Profile you can see that an unusual amount of time is being spent on an operation that is part of python.core ssl.
If you want me to explain how I created a dask_kubernetes cluster in Azure, I can expand on it.
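A minimal sketch of those steps; the container name, file path and credential values are placeholders, not taken from the original report:

import dask.dataframe as dd

# assumes a dask.distributed cluster is already running (e.g. via dask_kubernetes)
storage_options = {
    "account_name": "<storage-account>",
    "account_key": "<account-key>",
}

# read a large CSV straight from Azure Blob Storage via adlfs and keep it in
# cluster memory; watch the dashboard/Profile tab while this runs
df = dd.read_csv("abfs://<container>/big_file.csv", storage_options=storage_options)
df = df.persist()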
Seems like it is related to this issue:
Azure/azure-sdk-for-python#9596
Closing for now; I stopped having the issue using the same code and storage as before, so either something changed internally in Azure or in the library. Will reopen if it reappears. Thank you.
The issue is still there; I was just getting lucky with one specific file.
@hayesgb the biggest issue I experience with the latest version of adlfs while reading/writing CSV is the high memory usage: writing a 200 MB dataframe, with multiple partitions, can take upwards of 20 GB of RAM.
Not sure why azure-storage-blob >=12 causes memory usage to increase so much.
Any idea?
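For concreteness, a minimal sketch of that write path; the dummy dataframe, container name and credentials below are placeholders, not values from the report:

import dask

# multi-partition dummy dataframe standing in for the ~200 MB one described above
df = dask.datasets.timeseries()

storage_options = {
    "account_name": "<storage-account>",
    "account_key": "<account-key>",
}

# writing the partitions as CSV files to Azure Blob Storage through adlfs is
# where the unexpectedly high memory usage was observed
df.to_csv("abfs://<container>/my_files/*.csv", storage_options=storage_options)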
@hayesgb what can I do to help you debug this issue?
It still exists with the latest version of adlfs: with adlfs>0.3.3, writing to an Azure Blob container with adlfs incurs much higher memory usage (50-100% more).
@mgsnuno I have seen a similar problem to yours (in my case it was only a read operation, and of a .parquet file - the performance regression when going to >=0.3 was 10x+ in runtime).

> what can I do to help you debug this issue?

In my case, I got the same performance also for adlfs>=0.3 (and then also using recent azure-storage-blob versions) by simply changing the length argument in this line https://github.com/dask/adlfs/blob/2bfdb02d13d14c0787e769c6686fecd2e3861a4b/adlfs/spec.py#L1785 from end to (end - start). See also #247.
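To illustrate the fix described there, a hedged sketch (not the actual adlfs code) of a ranged blob read with the azure-storage-blob 12.x API, where length must be a byte count rather than an end offset:

from azure.storage.blob import BlobClient

def read_range(blob_client: BlobClient, start: int, end: int) -> bytes:
    # download_blob takes length as the number of bytes to read; passing
    # length=end here would fetch far more data than the requested range,
    # which matches the slowdown/memory blow-up discussed in this thread
    stream = blob_client.download_blob(offset=start, length=end - start)
    return stream.readall()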
@mgsnuno -- #247 has been merged into master and is included in release 0.7.7. Would be great to get your feedback on whether it fixes this issue.
@hayesgb I tested 0.7.7 and it gives me the same issue. With something like the code below, adlfs>=0.3.3 uses 2x-3x more memory, although it is much faster (it looks like it now runs in parallel, where before it did not). Maybe the parallelism explains the high memory usage.
I'm not being overly picky about the memory; the issue I have is that on a 32 GB RAM VM, where I run dask to read, parse and write multiple parquet files to abfs in parallel, I previously had no problems (15 GB would be more than enough), but now I get killed-worker errors as memory quickly reaches the 32 GB limit.
Is there a way to limit the parallelism of adlfs with asyncio, to make it slower but less memory-hungry?
import dask
from adlfs import AzureBlobFileSystem
from pipelines.core import base as corebase  # reporter's internal package (not shown here)

# create dummy dataset
df = dask.datasets.timeseries(end="2001-01-31", partition_freq="7d")
df.to_csv("../dataset/")

# upload data
credentials = get_credentials()  # internal helper returning the storage credentials
storage_options = {
    "account_name": credentials["abfs-name"],
    "account_key": credentials["abfs-key"],
}
abfs = AzureBlobFileSystem(**storage_options)
abfs.mkdir("testadlfs/")
abfs.put("../dataset/", "testadlfs/", recursive=True)
@hayesgb any pointers you can share? Thank you
@mgsnuno -- I'm looking at your example now. I see the high memory usage with the put method, but not when writing with df.to_csv("az://testadlfs/my_files/*.csv", storage_options=storage_options).
Is this consistent with your issue?
@hayesgb very good point, and yes, that is consistent with what I see: put has high memory usage, to_csv(remote_folder...) does not.
But if you do to_csv(local_folder...), it takes time for dask to write those text files; could that time be masking the issue that is highlighted with just the put method?
When I monitor memory usage with the .to_csv() method, either locally or remotely, I don't see high memory usage. However, when I monitor memory usage with the .put() method, I do see memory usage rising significantly.
Just want to be sure I'm working on the correct issue. The put() method opens a BufferedReader object and streams it to Azure. This currently happens async, and may be the cause of the high memory usage.
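A hedged sketch of that upload pattern (not adlfs's actual implementation), showing why starting every async upload at once can make memory scale with the number of files in flight:

import asyncio
from azure.storage.blob.aio import ContainerClient

async def put_all(container: ContainerClient, paths):
    async def upload_one(path):
        # each upload holds an open BufferedReader and its in-flight buffers
        with open(path, "rb") as f:
            await container.upload_blob(name=path, data=f, overwrite=True)

    # gather() launches every upload concurrently, so peak memory grows with
    # the number of files being streamed at the same time, not one at a time
    await asyncio.gather(*(upload_one(p) for p in paths))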
@hayesgb I have tested with our business tables/dataframes, and even with to_parquet("abfs://...") the memory usage is much higher with the latest adlfs/fsspec than with adlfs==0.3.3/fsspec==0.7.4; I get KilledWorker errors that I never had with those pinned versions.
Unfortunately I cannot share those tables, but maybe you can run some tests with dummy data you usually use and replicate it.
I went back to your previous comment where you mention df.to_csv("az://testadlfs/my_files/*.csv", storage_options=storage_options) -- is az the same as abfs?
@hayesgb did you manage to reproduce the high memory usage with to_parquet("abfs://...") as well?
No, I did not. We actually use to_parquet regularly in our workloads without issue.
All good now with:
- adlfs 2022.2.0
- fsspec 2022.3.0
- dask 2022.4.0
- azure-storage-blob 12.11.0