
KeyError: 'type' using fs.glob (adlfs issue, closed)

fsspec commented on July 24, 2024
KeyError: 'type' using fs.glob

from adlfs.

Comments (3)

lostmygithubaccount commented on July 24, 2024

Similar error trying to read using dask.dataframe.read_csv:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-15-99412ae0de4e> in <module>
----> 1 df = dask.delayed(dd.read_csv)(f'abfs://testing/data/isd*/*.csv', engine='pyarrow', storage_options=STORAGE_OPTIONS).compute()
      2 get_ipython().run_line_magic('time', 'df.head()')

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
    163         dask.base.compute
    164         """
--> 165         (result,) = compute(self, traverse=False, **kwargs)
    166         return result
    167 

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
    434     keys = [x.__dask_keys__() for x in collections]
    435     postcomputes = [x.__dask_postcompute__() for x in collections]
--> 436     results = schedule(dsk, keys, **kwargs)
    437     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    438 

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   2574                     should_rejoin = False
   2575             try:
-> 2576                 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   2577             finally:
   2578                 for f in futures.values():

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/distributed/client.py in gather(self, futures, errors, direct, asynchronous)
   1872                 direct=direct,
   1873                 local_worker=local_worker,
-> 1874                 asynchronous=asynchronous,
   1875             )
   1876 

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/distributed/client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    767         else:
    768             return sync(
--> 769                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    770             )
    771 

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    333     if error[0]:
    334         typ, exc, tb = error[0]
--> 335         raise exc.with_traceback(tb)
    336     else:
    337         return result[0]

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/distributed/utils.py in f()
    317             if callback_timeout is not None:
    318                 future = gen.with_timeout(timedelta(seconds=callback_timeout), future)
--> 319             result[0] = yield future
    320         except Exception as exc:
    321             error[0] = sys.exc_info()

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/tornado/gen.py in run(self)
    733 
    734                     try:
--> 735                         value = future.result()
    736                     except Exception:
    737                         exc_info = sys.exc_info()

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1728                             exc = CancelledError(key)
   1729                         else:
-> 1730                             raise exception.with_traceback(traceback)
   1731                         raise exc
   1732                     if errors == "skip":

/azureml-envs/azureml_792b2d64d1e52c01b5b979a6de29c506/lib/python3.6/site-packages/dask/utils.py in apply()

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/dask/dataframe/io/csv.py in read()
    576             storage_options=storage_options,
    577             include_path_column=include_path_column,
--> 578             **kwargs
    579         )
    580 

/azureml-envs/azureml_792b2d64d1e52c01b5b979a6de29c506/lib/python3.6/site-packages/dask/dataframe/io/csv.py in read_pandas()

/azureml-envs/azureml_792b2d64d1e52c01b5b979a6de29c506/lib/python3.6/site-packages/dask/bytes/core.py in read_bytes()

/azureml-envs/azureml_792b2d64d1e52c01b5b979a6de29c506/lib/python3.6/site-packages/fsspec/core.py in get_fs_token_paths()

/azureml-envs/azureml_792b2d64d1e52c01b5b979a6de29c506/lib/python3.6/site-packages/fsspec/spec.py in glob()

/azureml-envs/azureml_792b2d64d1e52c01b5b979a6de29c506/lib/python3.6/site-packages/fsspec/spec.py in find()

/azureml-envs/azureml_792b2d64d1e52c01b5b979a6de29c506/lib/python3.6/site-packages/adlfs/core.py in walk()

/azureml-envs/azureml_792b2d64d1e52c01b5b979a6de29c506/lib/python3.6/site-packages/adlfs/core.py in walk()

KeyError: 'type'


hayesgb commented on July 24, 2024

I tried to put together a minimal example based on your report. Starting with 10 CSV files in a container named "testfolder":

['alzhiemers.csv/0.csv',
 'alzhiemers.csv/1.csv',
 'alzhiemers.csv/2.csv',
 'alzhiemers.csv/3.csv',
 'alzhiemers.csv/4.csv',
 'alzhiemers.csv/5.csv',
 'alzhiemers.csv/6.csv',
 'alzhiemers.csv/7.csv',
 'alzhiemers.csv/8.csv',
 'alzhiemers.csv/9.csv']

the following behaves as expected for me:

from adlfs import AzureBlobFileSystem
storage_options={
    'account_name': 'account_name',
    'account_key': 'account_key',
    'container_name': "testfolder"
}
account_name='account_name'
account_key='account_key'
container='testfolder'

fs = AzureBlobFileSystem(account_name=account_name,
                        account_key=account_key,
                        container_name=container)

[image]

The same works if I follow your convention. Can you confirm that you're using the latest version of adlfs? There was a bug in 0.1.4.

Also, to instantiate the filesystem as you do above, you should include container_name in the **STORAGE_OPTIONS dictionary. Finally, the engine declaration is not needed when reading CSV files.
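
For anyone checking which version they are on, a minimal sketch using only the standard library (Python 3.8+) is below. The helper names `installed_version` and `version_tuple` are illustrative, not part of adlfs, and the tuple comparison assumes plain dotted numeric versions such as 0.1.4 or 0.1.5.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn a plain dotted version such as '0.1.5' into (0, 1, 5).

    Assumption: purely numeric components; suffixes like '.dev0' are not handled.
    """
    return tuple(int(part) for part in version.split("."))


if __name__ == "__main__":
    # Report the packages that appear in the traceback above.
    for name in ("adlfs", "fsspec", "dask"):
        print(name, installed_version(name))
```

Comparing `version_tuple(installed_version("adlfs"))` against `version_tuple("0.1.4")` shows whether the installed release is newer than the one mentioned above. With a distributed cluster, the same check would also need to run on the workers, since the traceback shows a separate worker environment.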


lassevalentini commented on July 24, 2024

I have experienced this as well. (adlfs==0.1.5, azure-storage-blob==2.1.0)

I uploaded a large partitioned parquet file to a container, and for some reason no type information was returned for at least one of the folders in the parquet "file".

My guess would be that the following piece of code could handle it.

EDIT:
It seemed like a temporary issue, but it wasn't.

An example of the locals at the original stack trace:

{
    'self': <adlfs.core.AzureBlobFileSystem object at 0x7fc274b11780>, 
    'path': 'raw_financials.parquet/CompanyId=879661', 
    'maxdepth': 1,  
    'kwargs': {},  
    'full_dirs': [],  
    'dirs': [],  
    'files': [],  
    'listing': [ 
        {'name': 'raw_financials.parquet/CompanyId=879661/', 'container_name': 'corporate20190912'},  
        {'name': 'raw_financials.parquet/CompanyId=8796617/', 'container_name': 'corporate20190912', 'type': 'directory', 'size': 0} 
    ],  
    'info': {'name': 'raw_financials.parquet/CompanyId=879661/', 'container_name': 'corporate20190912'},  
    'name': 'raw_financials.parquet/CompanyId=879661'
}
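
To make the failure concrete, here is a small sketch that reuses the two `listing` entries from the locals above. It shows that indexing `info['type']` directly raises the KeyError, and one possible way to tolerate the missing key. The helper `entry_type` is hypothetical and is not adlfs code; treating a name that ends in "/" as a directory is an assumption made for illustration.

```python
def entry_type(info):
    """Return the entry's type, tolerating a missing 'type' key."""
    if "type" in info:
        return info["type"]
    # Assumption: a name ending in '/' is a directory placeholder.
    return "directory" if info["name"].endswith("/") else "file"


# The two entries from the 'listing' value in the locals dump.
listing = [
    {"name": "raw_financials.parquet/CompanyId=879661/",
     "container_name": "corporate20190912"},
    {"name": "raw_financials.parquet/CompanyId=8796617/",
     "container_name": "corporate20190912", "type": "directory", "size": 0},
]

# Direct indexing fails on the first entry, which has no 'type' key.
try:
    listing[0]["type"]
except KeyError as exc:
    print("KeyError:", exc)

# The tolerant lookup classifies both entries without raising.
types = [entry_type(info) for info in listing]
print(types)
```

Whether inferring the type from the trailing slash is the right fix for adlfs depends on how the blob service reports such entries, so this is only a way to reason about the symptom.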

