
Comments (19)

hayesgb commented on July 24, 2024

Thanks for reporting the issue. Can you provide some specifics on what was observed for both 0.2.4 and 0.3.0 so I can attempt to reproduce?

mgsnuno commented on July 24, 2024

Yes, sure:

  • create a big CSV file in Azure Blob Storage
  • use dask.distributed/dask_kubernetes to create a cluster
  • dd.read_csv(path, storage_options=...).persist() that file (see the sketch below)

Looking at the dask.distributed dashboard you should be able to see how much longer it takes to read the file. Going through the dask.distributed Profile you can see that an unusual amount of time is spent in an operation belonging to Python's core ssl module.

If you want me to describe how I created the dask_kubernetes cluster in Azure, I can expand on it.
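For concreteness, a minimal sketch of the read step described above, assuming adlfs is installed and a dask.distributed cluster is running; the scheduler address, container, file name, and credentials are placeholders:

import dask.dataframe as dd
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address

storage_options = {
    "account_name": "<account-name>",  # placeholder
    "account_key": "<account-key>",    # placeholder
}

# read the large CSV from Azure Blob Storage via adlfs and persist it on the cluster
df = dd.read_csv("abfs://container/big-file.csv", storage_options=storage_options)
df = df.persist()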

andersbogsnes commented on July 24, 2024

Seems like it is related to this issue:
Azure/azure-sdk-for-python#9596

mgsnuno commented on July 24, 2024

Closing for now; the issue stopped occurring with the same code and storage as before, so either something changed internally in Azure or in the library. Will reopen if it reappears. Thank you.

mgsnuno commented on July 24, 2024

The issue is still there; I was just getting lucky with one specific file.

mgsnuno commented on July 24, 2024

@hayesgb the biggest issue I experience with the latest version of adlfs while reading/writing CSV is the high memory usage; high meaning that writing a 200 MB dataframe, with multiple partitions, can take upwards of 20 GB of RAM.
Not sure why azure-storage-blob>=12 causes the memory usage to increase so much.
Any idea?
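For illustration, a minimal sketch of the kind of write described above; the container name and credentials are placeholders and the data is synthetic rather than the reporter's actual dataframe:

import numpy as np
import pandas as pd
import dask.dataframe as dd

# roughly 200 MB of synthetic data split across multiple partitions
pdf = pd.DataFrame(np.random.random((2_500_000, 10)))
df = dd.from_pandas(pdf, npartitions=16)

storage_options = {"account_name": "<account-name>", "account_key": "<account-key>"}  # placeholders
df.to_csv("abfs://container/memtest/*.csv", storage_options=storage_options)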

mgsnuno commented on July 24, 2024

@hayesgb what can I do to help you debug this issue?
It still exists with the latest version of adlfs: with adlfs>0.3.3, writing to an Azure Blob container with adlfs incurs much higher memory usage (50-100% more).

anders-kiaer commented on July 24, 2024

@mgsnuno I have seen a similar problem to yours (although in my case it was only a read operation, and of a .parquet file; the performance regression when going to >=0.3 was 10x+ in runtime).

"what can I do to help you debug this issue?"

In my case, I got the same performance also for adlfs>=0.3 (and thus also with recent azure-storage-blob versions) by simply changing the length argument in this line https://github.com/dask/adlfs/blob/2bfdb02d13d14c0787e769c6686fecd2e3861a4b/adlfs/spec.py#L1785 from end to (end - start). See also #247.
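In spirit, the fix amounts to passing a byte count rather than an absolute end offset when requesting a range. A hedged sketch of that idea against azure-storage-blob directly (not the actual adlfs source; connection string, container, blob name, and range are made up):

from azure.storage.blob import BlobClient

# download_blob takes an offset plus a *length* (byte count), not an end offset
blob_client = BlobClient.from_connection_string(
    "<connection-string>", container_name="container", blob_name="big-file.csv"
)
start, end = 0, 5 * 2**20  # fetch the first 5 MiB
stream = blob_client.download_blob(offset=start, length=end - start)  # the bug was passing length=end
data = stream.readall()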

hayesgb commented on July 24, 2024

@mgsnuno -- #247 has been merged into master and is included in release 0.7.7. Would be great to get your feedback on whether it fixes this issue.

mgsnuno commented on July 24, 2024

@hayesgb I tested 0.7.7 and it gives me the same issue with something like the code below. With adlfs>=0.3.3 it uses 2x-3x more memory, but it is much faster (it looks like it now runs in parallel where before it did not). Maybe the parallelism also explains the high memory usage.
I'm not being overly picky about the memory. The issue I have is that on a 32 GB RAM VM, where I run dask to read, parse and write multiple parquet files to abfs in parallel, I previously had no problems and 15 GB would be more than enough; now I get KilledWorker errors as the memory quickly reaches the 32 GB limit.
Is there a way to limit the parallelism of adlfs with asyncio to make it slower but less memory hungry?

import dask
from adlfs import AzureBlobFileSystem
from pipelines.core import base as corebase  # reporter's internal package, not on PyPI

# create a dummy dataset and write it locally as multiple CSV partitions
df = dask.datasets.timeseries(end="2001-01-31", partition_freq="7d")
df.to_csv("../dataset/")

# upload the local CSV files to a blob container with put()
credentials = get_credentials()  # internal helper (presumably from the package above) returning the account name/key
storage_options = {
    "account_name": credentials["abfs-name"],
    "account_key": credentials["abfs-key"],
}
abfs = AzureBlobFileSystem(**storage_options)
abfs.mkdir("testadlfs/")
abfs.put("../dataset/", "testadlfs/", recursive=True)
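Continuing the snippet above, one simple way to quantify the spike would be to sample the process RSS around the put call; this is an assumption of mine and not part of the original report, and it presumes psutil is installed:

import os
import psutil

proc = psutil.Process(os.getpid())
print(f"RSS before put: {proc.memory_info().rss / 1e6:.0f} MB")
abfs.put("../dataset/", "testadlfs/", recursive=True)
print(f"RSS after put:  {proc.memory_info().rss / 1e6:.0f} MB")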

mgsnuno commented on July 24, 2024

@hayesgb any pointers you can share? Thank you

hayesgb commented on July 24, 2024

@mgsnuno -- I'm looking at your example now. I see the high memory usage with the put method, but not when writing the dataframe directly with df.to_csv("az://testadlfs/my_files/*.csv", storage_options=storage_options).

Is this consistent with your issue?

mgsnuno commented on July 24, 2024

@hayesgb very good point, and yes, that is consistent with what I see: put has high memory usage, to_csv(remote_folder...) does not.
But if you do to_csv(local_folder...), it takes time for dask to write those text files; could that time be masking the issue that shows up with just the put method?

hayesgb commented on July 24, 2024

When I monitor memory usage with the .to_csv() method, either locally or remotely, I don't see high memory usage. However, when I monitor memory usage with the .put() method, I do see memory usage rising significantly.

Just want to be sure I'm working on the correct issue. The put() method opens a BufferedReader object and streams it to Azure. This currently happens asynchronously, and may be the cause of the high memory usage.
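As a generic illustration of the idea raised earlier about limiting parallelism (this is not adlfs's actual internals or API), concurrent async uploads can be bounded with a semaphore so that only a few file buffers are held in memory at once:

import asyncio

async def fake_upload(data: bytes) -> None:
    # stand-in for a real async upload call to Azure
    await asyncio.sleep(0)

async def upload_one(path: str, semaphore: asyncio.Semaphore) -> None:
    async with semaphore:                 # at most max_concurrency uploads in flight
        with open(path, "rb") as f:
            data = f.read()               # the buffer lives only while this slot is held
        await fake_upload(data)

async def upload_all(paths: list[str], max_concurrency: int = 4) -> None:
    semaphore = asyncio.Semaphore(max_concurrency)
    await asyncio.gather(*(upload_one(p, semaphore) for p in paths))

# asyncio.run(upload_all(["file1.csv", "file2.csv"]))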

mgsnuno commented on July 24, 2024

@hayesgb I have tested with our business tables/dataframes, and even with to_parquet("abfs://...") the memory usage is much higher with the latest adlfs/fsspec than with adlfs==0.3.3/fsspec==0.7.4; I get KilledWorker errors that I never had with those pinned versions.

Unfortunately I cannot share those tables, but maybe you can run some tests with dummy data you usually use and replicate it.

I went back to your previous comment where you mention df.to_csv("az://testadlfs/my_files/*.csv", storage_options=storage_options); is az the same as abfs?
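For reference, one way to check how the two protocol prefixes resolve is via fsspec's registry (assuming fsspec and adlfs are installed; the expected output in the comments is my assumption, not taken from the thread):

import fsspec

# both prefixes should map to the same filesystem implementation from adlfs
print(fsspec.get_filesystem_class("az"))    # expected: <class 'adlfs.spec.AzureBlobFileSystem'>
print(fsspec.get_filesystem_class("abfs"))  # expected: <class 'adlfs.spec.AzureBlobFileSystem'>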

hayesgb commented on July 24, 2024

mgsnuno commented on July 24, 2024

@hayesgb did you manage to reproduce the high memory usage with to_parquet("abfs://...") as well?
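For concreteness, the kind of call in question would look something like this (the container name, credentials, and the pyarrow engine are assumptions):

import dask.datasets

storage_options = {"account_name": "<account-name>", "account_key": "<account-key>"}  # placeholders
df = dask.datasets.timeseries(end="2001-01-31", partition_freq="7d")
df.to_parquet("abfs://container/timeseries/", storage_options=storage_options, engine="pyarrow")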

hayesgb commented on July 24, 2024

No, I did not. We actually use to_parquet regularly in our workloads without issue.

mgsnuno commented on July 24, 2024

all good now with:

  • adlfs 2022.2.0
  • fsspec 2022.3.0
  • dask 2022.4.0
  • azure-storage-blob 12.11.0
