Comments (3)
I see a similar error when trying to read with dask.dataframe.read_csv:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-15-99412ae0de4e> in <module>
----> 1 df = dask.delayed(dd.read_csv)(f'abfs://testing/data/isd*/*.csv', engine='pyarrow', storage_options=STORAGE_OPTIONS).compute()
2 get_ipython().run_line_magic('time', 'df.head()')
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
163 dask.base.compute
164 """
--> 165 (result,) = compute(self, traverse=False, **kwargs)
166 return result
167
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
434 keys = [x.__dask_keys__() for x in collections]
435 postcomputes = [x.__dask_postcompute__() for x in collections]
--> 436 results = schedule(dsk, keys, **kwargs)
437 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
438
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
2574 should_rejoin = False
2575 try:
-> 2576 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
2577 finally:
2578 for f in futures.values():
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
1872 direct=direct,
1873 local_worker=local_worker,
-> 1874 asynchronous=asynchronous,
1875 )
1876
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
767 else:
768 return sync(
--> 769 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
770 )
771
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
333 if error[0]:
334 typ, exc, tb = error[0]
--> 335 raise exc.with_traceback(tb)
336 else:
337 return result[0]
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/distributed/utils.py in f()
317 if callback_timeout is not None:
318 future = gen.with_timeout(timedelta(seconds=callback_timeout), future)
--> 319 result[0] = yield future
320 except Exception as exc:
321 error[0] = sys.exc_info()
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/tornado/gen.py in run(self)
733
734 try:
--> 735 value = future.result()
736 except Exception:
737 exc_info = sys.exc_info()
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
1728 exc = CancelledError(key)
1729 else:
-> 1730 raise exception.with_traceback(traceback)
1731 raise exc
1732 if errors == "skip":
/azureml-envs/azureml_792b2d64d1e52c01b5b979a6de29c506/lib/python3.6/site-packages/dask/utils.py in apply()
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/dask/dataframe/io/csv.py in read()
576 storage_options=storage_options,
577 include_path_column=include_path_column,
--> 578 **kwargs
579 )
580
/azureml-envs/azureml_792b2d64d1e52c01b5b979a6de29c506/lib/python3.6/site-packages/dask/dataframe/io/csv.py in read_pandas()
/azureml-envs/azureml_792b2d64d1e52c01b5b979a6de29c506/lib/python3.6/site-packages/dask/bytes/core.py in read_bytes()
/azureml-envs/azureml_792b2d64d1e52c01b5b979a6de29c506/lib/python3.6/site-packages/fsspec/core.py in get_fs_token_paths()
/azureml-envs/azureml_792b2d64d1e52c01b5b979a6de29c506/lib/python3.6/site-packages/fsspec/spec.py in glob()
/azureml-envs/azureml_792b2d64d1e52c01b5b979a6de29c506/lib/python3.6/site-packages/fsspec/spec.py in find()
/azureml-envs/azureml_792b2d64d1e52c01b5b979a6de29c506/lib/python3.6/site-packages/adlfs/core.py in walk()
/azureml-envs/azureml_792b2d64d1e52c01b5b979a6de29c506/lib/python3.6/site-packages/adlfs/core.py in walk()
KeyError: 'type'
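The KeyError comes from adlfs walking the blob listing. A minimal sketch of the failure mode (this is an illustration of indexing a listing entry that lacks a 'type' key, not adlfs's actual code):

```python
# Illustration of the failure mode (not adlfs's actual code): walk()
# inspects each listing entry's 'type', and an entry missing that key
# raises KeyError: 'type'.
listing = [
    {'name': 'data/isd/', 'container_name': 'testing'},  # no 'type' key
]
try:
    kinds = [entry['type'] for entry in listing]
except KeyError as err:
    print(f'KeyError: {err}')  # → KeyError: 'type'
```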
I attempted to put together a minimal example based on your observation, starting with 10 CSV files in a container named "testfolder":
['alzhiemers.csv/0.csv',
'alzhiemers.csv/1.csv',
'alzhiemers.csv/2.csv',
'alzhiemers.csv/3.csv',
'alzhiemers.csv/4.csv',
'alzhiemers.csv/5.csv',
'alzhiemers.csv/6.csv',
'alzhiemers.csv/7.csv',
'alzhiemers.csv/8.csv',
'alzhiemers.csv/9.csv']
the following behaves as expected for me:
from adlfs import AzureBlobFileSystem

storage_options = {
    'account_name': 'account_name',
    'account_key': 'account_key',
    'container_name': 'testfolder',
}

fs = AzureBlobFileSystem(
    account_name='account_name',
    account_key='account_key',
    container_name='testfolder',
)
The same works if I follow your convention. Can you confirm that you're using the latest version of adlfs? There was a bug in 0.1.4.
Also, to instantiate the filesystem as you do above, you should include container_name in the **STORAGE_OPTIONS dictionary. Finally, when reading CSV files, the engine declaration is not needed.
I have experienced this as well (adlfs==0.1.5, azure-storage-blob==2.1.0).
I uploaded a large partitioned parquet file to a container, and for some reason no type information was returned for at least one of the folders in the parquet "file".
My guess would be that the following piece of code could handle it.
EDIT:
It seemed like a temporary issue, but it wasn't.
An example of the locals at the original stack trace:
{
    'self': <adlfs.core.AzureBlobFileSystem object at 0x7fc274b11780>,
    'path': 'raw_financials.parquet/CompanyId=879661',
    'maxdepth': 1,
    'kwargs': {},
    'full_dirs': [],
    'dirs': [],
    'files': [],
    'listing': [
        {'name': 'raw_financials.parquet/CompanyId=879661/', 'container_name': 'corporate20190912'},
        {'name': 'raw_financials.parquet/CompanyId=8796617/', 'container_name': 'corporate20190912', 'type': 'directory', 'size': 0}
    ],
    'info': {'name': 'raw_financials.parquet/CompanyId=879661/', 'container_name': 'corporate20190912'},
    'name': 'raw_financials.parquet/CompanyId=879661'
}
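The snippet the comment above refers to wasn't preserved; here is a minimal sketch of one possible defensive fallback, assuming entries without a 'type' key are directory stubs (entry_type is a hypothetical helper, not adlfs's actual fix):

```python
# Hypothetical defensive check for listing entries that lack a 'type' key.
# Azure directory stubs end with a trailing slash, so default those to
# 'directory' and everything else to 'file'.
def entry_type(entry):
    if 'type' in entry:
        return entry['type']
    return 'directory' if entry['name'].endswith('/') else 'file'

# The two listing entries from the locals dump above
listing = [
    {'name': 'raw_financials.parquet/CompanyId=879661/',
     'container_name': 'corporate20190912'},
    {'name': 'raw_financials.parquet/CompanyId=8796617/',
     'container_name': 'corporate20190912', 'type': 'directory', 'size': 0},
]
print([entry_type(e) for e in listing])  # → ['directory', 'directory']
```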