Comments (19)
Thanks for reporting the issue. Can you provide some specifics on what was observed for both 0.2.4 and 0.3.0 so I can attempt to reproduce?
Yes, sure:
- create a big CSV file in Azure Blob Storage
- use dask.distributed/dask_kubernetes to create a cluster
- dd.read_csv(path, storage_options).persist() that file (see the sketch below)
- looking at the dask.distributed dashboard you should be able to see how much longer it takes to read the file; going through the dask.distributed Profile you can see that an unusual amount of time is being spent on an operation that is part of python.core ssl.
If you want me to explain how I created a dask_kubernetes cluster in Azure, I can expand on it.
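A minimal sketch of those steps; the container name, file path and credential values are placeholders, not taken from the original report:

import dask.dataframe as dd

# assumes a dask.distributed cluster is already running (e.g. via dask_kubernetes)
storage_options = {
    "account_name": "<storage-account>",
    "account_key": "<account-key>",
}

# read a large CSV straight from Azure Blob Storage via adlfs and keep it in
# cluster memory; watch the dashboard/Profile tab while this runs
df = dd.read_csv("abfs://<container>/big_file.csv", storage_options=storage_options)
df = df.persist()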
Seems like it is related to this issue:
Azure/azure-sdk-for-python#9596
Closing for now; I stopped having the issue using the same code and storage as before, so either something changed internally in Azure or in the library. Will reopen if it reappears. Thank you.
The issue is still there; I was just getting lucky with one specific file.
@hayesgb the biggest issue I experience with the latest version of adlfs while reading/writing CSV is the high memory usage: writing a 200 MB dataframe, with multiple partitions, can take upwards of 20 GB of RAM.
Not sure why azure-storage-blob >=12 causes memory usage to increase so much.
Any idea?
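For concreteness, a minimal sketch of that write path; the dummy dataframe, container name and credentials below are placeholders, not values from the report:

import dask

# multi-partition dummy dataframe standing in for the ~200 MB one described above
df = dask.datasets.timeseries()

storage_options = {
    "account_name": "<storage-account>",
    "account_key": "<account-key>",
}

# writing the partitions as CSV files to Azure Blob Storage through adlfs is
# where the unexpectedly high memory usage was observed
df.to_csv("abfs://<container>/my_files/*.csv", storage_options=storage_options)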
@hayesgb what can I do to help you debug this issue?
It still exists with the latest version of adlfs: with adlfs>0.3.3, writing to an Azure Blob container with adlfs incurs much higher memory usage (50-100% more).
@mgsnuno I have seen a similar problem to yours (in my case it was only a read operation, and of a .parquet file - the performance regression when going to >=0.3 was 10x+ in runtime).

> what can I do to help you debug this issue?

In my case, I got the same performance also for adlfs>=0.3 (and then also using recent azure-storage-blob versions) by simply changing the length argument in this line https://github.com/dask/adlfs/blob/2bfdb02d13d14c0787e769c6686fecd2e3861a4b/adlfs/spec.py#L1785 from end to (end - start). See also #247.
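To illustrate the fix described there, a hedged sketch (not the actual adlfs code) of a ranged blob read with the azure-storage-blob 12.x API, where length must be a byte count rather than an end offset:

from azure.storage.blob import BlobClient

def read_range(blob_client: BlobClient, start: int, end: int) -> bytes:
    # download_blob takes length as the number of bytes to read; passing
    # length=end here would fetch far more data than the requested range,
    # which matches the slowdown/memory blow-up discussed in this thread
    stream = blob_client.download_blob(offset=start, length=end - start)
    return stream.readall()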
@mgsnuno -- #247 has been merged into master and is included in release 0.7.7. Would be great to get your feedback on whether it fixes this issue.
@hayesgb I tested 0.7.7 and it gives me the same issue. With something like the code below, adlfs>=0.3.3 uses 2x-3x more memory, although it is much faster (it looks like it now runs in parallel, where before it did not). Maybe the parallelism explains the high memory usage.
I'm not being overly picky about the memory; the issue I have is that on a 32 GB RAM VM, where I run dask to read, parse and write multiple parquet files to abfs in parallel, I previously had no problems (15 GB would be more than enough), but now I get killed-worker errors as memory quickly reaches the 32 GB limit.
Is there a way to limit the parallelism of adlfs with asyncio, to make it slower but less memory-hungry?
import dask
from adlfs import AzureBlobFileSystem
from pipelines.core import base as corebase  # reporter's internal package (not shown here)

# create dummy dataset
df = dask.datasets.timeseries(end="2001-01-31", partition_freq="7d")
df.to_csv("../dataset/")

# upload data
credentials = get_credentials()  # internal helper returning the storage credentials
storage_options = {
    "account_name": credentials["abfs-name"],
    "account_key": credentials["abfs-key"],
}
abfs = AzureBlobFileSystem(**storage_options)
abfs.mkdir("testadlfs/")
abfs.put("../dataset/", "testadlfs/", recursive=True)
@hayesgb any pointers you can share? Thank you
@mgsnuno -- I'm looking at your example now. I see the high memory usage with the put method, but not when writing with df.to_csv("az://testadlfs/my_files/*.csv", storage_options=storage_options).
Is this consistent with your issue?
@hayesgb very good point, and yes, that is consistent with what I see: put has high memory usage, to_csv(remote_folder...) does not.
But if you do to_csv(local_folder...), it takes time for dask to write those text files; could that time be masking the issue that is highlighted with just the put method?
When I monitor memory usage with the .to_csv() method, either locally or remotely, I don't see high memory usage. However, when I monitor memory usage with the .put() method, I do see memory usage rising significantly.
Just want to be sure I'm working on the correct issue. The put() method opens a BufferedReader object and streams it to Azure. This currently happens async, and may be the cause of the high memory usage.
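A hedged sketch of that upload pattern (not adlfs's actual implementation), showing why starting every async upload at once can make memory scale with the number of files in flight:

import asyncio
from azure.storage.blob.aio import ContainerClient

async def put_all(container: ContainerClient, paths):
    async def upload_one(path):
        # each upload holds an open BufferedReader and its in-flight buffers
        with open(path, "rb") as f:
            await container.upload_blob(name=path, data=f, overwrite=True)

    # gather() launches every upload concurrently, so peak memory grows with
    # the number of files being streamed at the same time, not one at a time
    await asyncio.gather(*(upload_one(p) for p in paths))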
@hayesgb I have tested with our business tables/dataframes, and even with to_parquet("abfs://...") the memory usage is much higher with the latest adlfs/fsspec than with adlfs==0.3.3/fsspec==0.7.4; I get KilledWorker errors that I never had with those pinned versions.
Unfortunately I cannot share those tables, but maybe you can run some tests with dummy data you usually use and replicate it.
I went back to your previous comment where you mention df.to_csv("az://testadlfs/my_files/*.csv", storage_options=storage_options) -- is az the same as abfs?
@hayesgb did you manage to reproduce the high memory usage with to_parquet("abfs://...") as well?
No, I did not. We actually use to_parquet regularly in our workloads without issue.
All good now with:
- adlfs 2022.2.0
- fsspec 2022.3.0
- dask 2022.4.0
- azure-storage-blob 12.11.0