Comments (3)
I see a similar error when trying to read with dask.dataframe.read_csv:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-15-99412ae0de4e> in <module>
----> 1 df = dask.delayed(dd.read_csv)(f'abfs://testing/data/isd*/*.csv', engine='pyarrow', storage_options=STORAGE_OPTIONS).compute()
2 get_ipython().run_line_magic('time', 'df.head()')
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
163 dask.base.compute
164 """
--> 165 (result,) = compute(self, traverse=False, **kwargs)
166 return result
167
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
434 keys = [x.__dask_keys__() for x in collections]
435 postcomputes = [x.__dask_postcompute__() for x in collections]
--> 436 results = schedule(dsk, keys, **kwargs)
437 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
438
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
2574 should_rejoin = False
2575 try:
-> 2576 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
2577 finally:
2578 for f in futures.values():
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
1872 direct=direct,
1873 local_worker=local_worker,
-> 1874 asynchronous=asynchronous,
1875 )
1876
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
767 else:
768 return sync(
--> 769 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
770 )
771
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
333 if error[0]:
334 typ, exc, tb = error[0]
--> 335 raise exc.with_traceback(tb)
336 else:
337 return result[0]
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/distributed/utils.py in f()
317 if callback_timeout is not None:
318 future = gen.with_timeout(timedelta(seconds=callback_timeout), future)
--> 319 result[0] = yield future
320 except Exception as exc:
321 error[0] = sys.exc_info()
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/tornado/gen.py in run(self)
733
734 try:
--> 735 value = future.result()
736 except Exception:
737 exc_info = sys.exc_info()
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
1728 exc = CancelledError(key)
1729 else:
-> 1730 raise exception.with_traceback(traceback)
1731 raise exc
1732 if errors == "skip":
/azureml-envs/azureml_792b2d64d1e52c01b5b979a6de29c506/lib/python3.6/site-packages/dask/utils.py in apply()
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/dask/dataframe/io/csv.py in read()
576 storage_options=storage_options,
577 include_path_column=include_path_column,
--> 578 **kwargs
579 )
580
/azureml-envs/azureml_792b2d64d1e52c01b5b979a6de29c506/lib/python3.6/site-packages/dask/dataframe/io/csv.py in read_pandas()
/azureml-envs/azureml_792b2d64d1e52c01b5b979a6de29c506/lib/python3.6/site-packages/dask/bytes/core.py in read_bytes()
/azureml-envs/azureml_792b2d64d1e52c01b5b979a6de29c506/lib/python3.6/site-packages/fsspec/core.py in get_fs_token_paths()
/azureml-envs/azureml_792b2d64d1e52c01b5b979a6de29c506/lib/python3.6/site-packages/fsspec/spec.py in glob()
/azureml-envs/azureml_792b2d64d1e52c01b5b979a6de29c506/lib/python3.6/site-packages/fsspec/spec.py in find()
/azureml-envs/azureml_792b2d64d1e52c01b5b979a6de29c506/lib/python3.6/site-packages/adlfs/core.py in walk()
/azureml-envs/azureml_792b2d64d1e52c01b5b979a6de29c506/lib/python3.6/site-packages/adlfs/core.py in walk()
KeyError: 'type'
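The KeyError comes from adlfs walking the blob listing. A minimal sketch of the failure mode (this is an illustration of indexing a listing entry that lacks a 'type' key, not adlfs's actual code):

```python
# Illustration of the failure mode (not adlfs's actual code): walk()
# inspects each listing entry's 'type', and an entry missing that key
# raises KeyError: 'type'.
listing = [
    {'name': 'data/isd/', 'container_name': 'testing'},  # no 'type' key
]
try:
    kinds = [entry['type'] for entry in listing]
except KeyError as err:
    print(f'KeyError: {err}')  # → KeyError: 'type'
```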
I attempted to put together a minimal example based on your observation, starting with 10 CSV files in a container named "testfolder":
['alzhiemers.csv/0.csv',
'alzhiemers.csv/1.csv',
'alzhiemers.csv/2.csv',
'alzhiemers.csv/3.csv',
'alzhiemers.csv/4.csv',
'alzhiemers.csv/5.csv',
'alzhiemers.csv/6.csv',
'alzhiemers.csv/7.csv',
'alzhiemers.csv/8.csv',
'alzhiemers.csv/9.csv']
the following behaves as expected for me:
from adlfs import AzureBlobFileSystem

storage_options = {
    'account_name': 'account_name',
    'account_key': 'account_key',
    'container_name': 'testfolder',
}

fs = AzureBlobFileSystem(
    account_name='account_name',
    account_key='account_key',
    container_name='testfolder',
)
The same works if I follow your convention. Can you confirm that you're using the latest version of adlfs? There was a bug in 0.1.4.
Also, to instantiate the filesystem as you do above, you should include container_name in the **STORAGE_OPTIONS dictionary. Finally, when reading CSV files, the engine declaration is not needed.
I have experienced this as well (adlfs==0.1.5, azure-storage-blob==2.1.0).
I uploaded a large partitioned parquet file to a container, and for some reason no type information was returned for at least one of the folders in the parquet "file".
My guess would be that the following piece of code could handle it.
EDIT:
It seemed like a temporary issue, but it wasn't.
An example of the locals at the original stack trace:
{
    'self': <adlfs.core.AzureBlobFileSystem object at 0x7fc274b11780>,
    'path': 'raw_financials.parquet/CompanyId=879661',
    'maxdepth': 1,
    'kwargs': {},
    'full_dirs': [],
    'dirs': [],
    'files': [],
    'listing': [
        {'name': 'raw_financials.parquet/CompanyId=879661/', 'container_name': 'corporate20190912'},
        {'name': 'raw_financials.parquet/CompanyId=8796617/', 'container_name': 'corporate20190912', 'type': 'directory', 'size': 0}
    ],
    'info': {'name': 'raw_financials.parquet/CompanyId=879661/', 'container_name': 'corporate20190912'},
    'name': 'raw_financials.parquet/CompanyId=879661'
}
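The snippet the comment above refers to wasn't preserved; here is a minimal sketch of one possible defensive fallback, assuming entries without a 'type' key are directory stubs (entry_type is a hypothetical helper, not adlfs's actual fix):

```python
# Hypothetical defensive check for listing entries that lack a 'type' key.
# Azure directory stubs end with a trailing slash, so default those to
# 'directory' and everything else to 'file'.
def entry_type(entry):
    if 'type' in entry:
        return entry['type']
    return 'directory' if entry['name'].endswith('/') else 'file'

# The two listing entries from the locals dump above
listing = [
    {'name': 'raw_financials.parquet/CompanyId=879661/',
     'container_name': 'corporate20190912'},
    {'name': 'raw_financials.parquet/CompanyId=8796617/',
     'container_name': 'corporate20190912', 'type': 'directory', 'size': 0},
]
print([entry_type(e) for e in listing])  # → ['directory', 'directory']
```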