Comments (14)
I've spent some time on this today. I can replicate your issue on my Windows machine, but it works as expected on Ubuntu and my Mac. I've found one compatibility issue with the 0.7.2 release of fsspec, which I will work on fixing tomorrow. Currently comparing package dependencies between Windows and Linux.
Thanks @hayesgb! I was able to read in the csv file on my Windows machine. Going to move the parquet file read to a separate issue.
This is the first report I've gotten of this error. Noted mdurant's suggestion, which seems the most likely explanation. Can I assume your filepath is formatted as:
"abfs://{filesystem_name}/file.parquet"? Also, which Azure region are you in?
I was actually using abfs://{filesystem_name}/file, but I updated my code (and the SO questions) to abfs://{filesystem_name}/file.parquet. However, I still get the AzureHttpError.
I'm in East US 2.
For reference, I'm working in East US 2 daily without issue, so I would assume it's not an availability problem. Can you answer a few other questions?
- What package versions are you running? (adlfs, fsspec, dask, and azure-storage-blob).
- Are you running Dask locally or distributed? If distributed, what version?
- Is this parquet file one that was written to abfs with Dask? If no, does a simple read-write operation with another file work, and how was the existing parquet file created? If yes, does a read-write operation to-from CSV work successfully?
- Have you recreated the problem with a minimal working example (small example dummy dataframe)? If so can you share that example so I can try to re-create your issue?
Thanks for the prompt for an MCVE.
- What package versions are you running? (adlfs, fsspec, dask, and azure-storage-blob).
Windows 10
adlfs==0.2.0, fsspec==0.6.2, dask==2.10.1, azure-storage-blob==2.1.0
Further details below:
> conda list:
_anaconda_depends 2019.03 py37_0
_ipyw_jlab_nb_ext_conf 0.1.0 py37_0
adal 1.2.2 pypi_0 pypi
adlfs 0.2.0 pypi_0 pypi
alabaster 0.7.12 py37_0
alembic 1.4.0 py_0 conda-forge
anaconda custom py37_1
anaconda-client 1.7.2 py37_0
anaconda-navigator 1.9.7 py37_0
anaconda-project 0.8.4 py_0
appdirs 1.4.3 pypi_0 pypi
argh 0.26.2 py37_0
arrow-cpp 0.13.0 py37h49ee12d_0
asn1crypto 1.3.0 py37_0
astroid 2.3.3 py37_0
astropy 4.0 py37he774522_0
atomicwrites 1.3.0 py37_1
attrs 19.3.0 py_0
autopep8 1.4.4 py_0
azure-common 1.1.25 pypi_0 pypi
azure-core 1.2.2 pypi_0 pypi
azure-datalake-store 0.0.48 pypi_0 pypi
azure-storage-blob 2.1.0 pypi_0 pypi
azure-storage-common 2.1.0 pypi_0 pypi
babel 2.8.0 py_0
backcall 0.1.0 py37_0
backports 1.0 py_2
backports.functools_lru_cache 1.6.1 py_0
backports.os 0.1.1 py37_0
backports.shutil_get_terminal_size 1.0.0 py37_2
backports.tempfile 1.0 py_1
backports.weakref 1.0.post1 py_1
bcrypt 3.1.7 py37he774522_0
beautifulsoup4 4.8.2 py37_0
bitarray 1.2.1 py37he774522_0
bkcharts 0.2 py37_0
black 19.10b0 pypi_0 pypi
blackcellmagic 0.0.2 pypi_0 pypi
blas 1.0 mkl
bleach 3.1.0 py37_0
blosc 1.16.3 h7bd577a_0
bokeh 1.4.0 py37_0
boost-cpp 1.67.0 hfa6e2cd_4
boto 2.49.0 py37_0
bottleneck 1.3.1 py37h8c2d366_0
brotli 1.0.7 h33f27b4_0
bzip2 1.0.8 he774522_0
ca-certificates 2020.4.5.1 hecc5488_0 conda-forge
certifi 2020.4.5.1 py37hc8dfbb8_0 conda-forge
cffi 1.14.0 py37h7a1dbc1_0
chardet 3.0.4 py37_1003
click 7.0 py37_0
cloudpickle 1.3.0 py_0
clyent 1.2.2 py37_1
colorama 0.4.3 py_0
colorcet 2.0.2 py_0
comtypes 1.1.7 py37_0
conda 4.8.3 py37hc8dfbb8_1 conda-forge
conda-build 3.18.11 py37_0
conda-env 2.6.0 1
conda-package-handling 1.6.0 py37h62dcd97_0
conda-verify 3.4.2 py_1
configparser 3.7.3 py37_1 conda-forge
console_shortcut 0.1.1 3
contextlib2 0.6.0.post1 py_0
cryptography 2.8 py37h7a1dbc1_0
cudatoolkit 10.1.243 h74a9793_0
curl 7.68.0 h2a8f88b_0
cx-oracle 7.3.0 pypi_0 pypi
cycler 0.10.0 py37_0
cymem 2.0.2 py37h74a9793_0
cython 0.29.15 py37ha925a31_0
cython-blis 0.2.4 py37hfa6e2cd_1 fastai
cytoolz 0.10.1 py37he774522_0
dask 2.10.1 py_0
dask-core 2.10.1 py_0
databricks-cli 0.9.1 py_0 conda-forge
dataclasses 0.6 py_0 fastai
decorator 4.4.1 py_0
defusedxml 0.6.0 py_0
diff-match-patch 20181111 py_0
distributed 2.10.0 py_0
doc8 0.8.0 pypi_0 pypi
docker-py 4.1.0 py37_0 conda-forge
docker-pycreds 0.4.0 py_0 conda-forge
docutils 0.16 py37_0
double-conversion 3.1.5 ha925a31_1
entrypoints 0.3 py37_0
et_xmlfile 1.0.1 py37_0
fastai 1.0.60 1 fastai
fastcache 1.1.0 py37he774522_0
fastparquet 0.3.3 py37hc8d92b1_0 conda-forge
fastprogress 0.2.2 py_0 fastai
filelock 3.0.12 py_0
flake8 3.7.9 py37_0
flask 1.1.1 py_0
freetype 2.9.1 ha9979f8_1
fsspec 0.6.2 py_0
future 0.18.2 py37_0
get_terminal_size 1.0.0 h38e98db_0
gevent 1.4.0 py37he774522_0
gflags 2.2.2 ha925a31_0
gitdb2 3.0.2 py_0 conda-forge
gitpython 3.0.5 py_0 conda-forge
glob2 0.7 py_0
glog 0.4.0 h33f27b4_0
gorilla 0.3.0 py_0 conda-forge
greenlet 0.4.15 py37hfa6e2cd_0
h5py 2.10.0 py37h5e291fa_0
hdf5 1.10.4 h7ebc959_0
heapdict 1.0.1 py_0
holoviews 1.12.7 py_0
html5lib 1.0.1 py37_0
hvplot 0.5.2 py_0 conda-forge
hypothesis 5.4.1 py_0
icc_rt 2019.0.0 h0cc432a_1
icu 58.2 ha66f8fd_1
idna 2.8 py37_0
imageio 2.6.1 py37_0
imagesize 1.2.0 py_0
importlib_metadata 1.5.0 py37_0
intel-openmp 2020.0 166
intervaltree 3.0.2 py_0
ipykernel 5.1.4 py37h39e3cac_0
ipython 7.12.0 py37h5ca1d4c_0
ipython_genutils 0.2.0 py37_0
ipywidgets 7.5.1 py_0
isodate 0.6.0 pypi_0 pypi
isort 4.3.21 py37_0
itsdangerous 1.1.0 py37_0
jdcal 1.4.1 py_0
jedi 0.14.1 py37_0
jinja2 2.11.1 py_0
joblib 0.14.1 py_0
jpeg 9b hb83a4c4_2
json5 0.9.1 py_0
jsonschema 3.2.0 py37_0
jupyter 1.0.0 py37_7
jupyter_client 5.3.4 py37_0
jupyter_console 6.1.0 py_0
jupyter_core 4.6.1 py37_0
jupyterlab 1.2.6 pyhf63ae98_0
jupyterlab_server 1.0.6 py_0
keyring 21.1.0 py37_0
kiwisolver 1.1.0 py37ha925a31_0
krb5 1.17.1 hc04afaa_0
lazy-object-proxy 1.4.3 py37he774522_0
libarchive 3.3.3 h0643e63_5
libboost 1.67.0 hfd51bdf_4
libcurl 7.68.0 h2a8f88b_0
libiconv 1.15 h1df5818_7
liblief 0.9.0 ha925a31_2
libpng 1.6.37 h2a8f88b_0
libprotobuf 3.6.0 h1a1b453_0
libsodium 1.0.16 h9d3ae62_0
libspatialindex 1.9.3 h33f27b4_0
libssh2 1.8.2 h7a1dbc1_0
libtiff 4.1.0 h56a325e_0
libxml2 2.9.9 h464c3ec_0
libxslt 1.1.33 h579f668_0
llvmlite 0.31.0 py37ha925a31_0
locket 0.2.0 py37_1
lxml 4.5.0 py37h1350720_0
lz4-c 1.8.1.2 h2fa13f4_0
lzo 2.10 h6df0209_2
m2w64-gcc-libgfortran 5.3.0 6
m2w64-gcc-libs 5.3.0 7
m2w64-gcc-libs-core 5.3.0 7
m2w64-gmp 6.1.0 2
m2w64-libwinpthread-git 5.0.0.4634.697f757 2
mako 1.1.0 py_0 conda-forge
markupsafe 1.1.1 py37he774522_0
matplotlib 3.1.3 py37_0
matplotlib-base 3.1.3 py37h64f37c6_0
mccabe 0.6.1 py37_1
menuinst 1.4.16 py37he774522_0
mistune 0.8.4 py37he774522_0
mkl 2020.0 166
mkl-service 2.3.0 py37hb782905_0
mkl_fft 1.0.15 py37h14836fe_0
mkl_random 1.1.0 py37h675688f_0
mlflow 1.6.0 pypi_0 pypi
mock 4.0.1 py_0
more-itertools 8.2.0 py_0
mpmath 1.1.0 py37_0
msgpack-python 0.6.1 py37h74a9793_1
msrest 0.6.11 pypi_0 pypi
msys2-conda-epoch 20160418 1
multipledispatch 0.6.0 py37_0
murmurhash 1.0.2 py37h33f27b4_0
navigator-updater 0.2.1 py37_0
nbconvert 5.6.1 py37_0
nbformat 5.0.4 py_0
networkx 2.4 py_0
ninja 1.9.0 py37h74a9793_0
nltk 3.4.5 py37_0
nose 1.3.7 py37_2
notebook 6.0.3 py37_0
numba 0.48.0 py37h47e9c7a_0
numexpr 2.7.1 py37h25d0782_0
numpy 1.18.1 py37h93ca92e_0
numpy-base 1.18.1 py37hc3f5095_1
numpydoc 0.9.2 py_0
nvidia-ml-py3 7.352.0 py_0 fastai
oauthlib 3.1.0 pypi_0 pypi
olefile 0.46 py37_0
openpyxl 3.0.3 py_0
openssl 1.1.1f hfa6e2cd_0 conda-forge
packaging 20.1 py_0
pandas 1.0.1 py37h47e9c7a_0
pandoc 2.2.3.2 0
pandocfilters 1.4.2 py37_1
param 1.9.3 py_0
paramiko 2.6.0 py37_0
parso 0.5.2 py_0
partd 1.1.0 py_0
path 13.1.0 py37_0
path.py 12.4.0 0
pathlib2 2.3.5 py37_0
pathspec 0.7.0 pypi_0 pypi
pathtools 0.1.2 py_1
patsy 0.5.1 py37_0
pbr 5.4.4 pypi_0 pypi
pep8 1.7.1 py37_0
pexpect 4.8.0 py37_0
pickleshare 0.7.5 py37_0
pillow 7.0.0 py37hcc1f983_0
pip 20.0.2 py37_1
pkginfo 1.5.0.1 py37_0
plac 0.9.6 py37_0
pluggy 0.13.1 py37_0
ply 3.11 py37_0
powershell_shortcut 0.0.1 2
preshed 2.0.1 py37h33f27b4_0
prometheus_client 0.7.1 py_0
prometheus_flask_exporter 0.12.2 py_0 conda-forge
prompt_toolkit 3.0.3 py_0
properscoring 0.1 py_0 conda-forge
protobuf 3.6.0 py37he025d50_1 conda-forge
psutil 5.6.7 py37he774522_0
py 1.8.1 py_0
py-lief 0.9.0 py37ha925a31_2
pyarrow 0.13.0 py37ha925a31_0
pycodestyle 2.5.0 py37_0
pycosat 0.6.3 py37he774522_0
pycparser 2.19 py37_0
pycrypto 2.6.1 py37hfa6e2cd_9
pyct 0.4.6 py37_0
pycurl 7.43.0.5 py37h7a1dbc1_0
pydocstyle 4.0.1 py_0
pyflakes 2.1.1 py37_0
pygments 2.5.2 py_0
pyjwt 1.7.1 pypi_0 pypi
pylint 2.4.4 py37_0
pynacl 1.3.0 py37h62dcd97_0
pyodbc 4.0.30 py37ha925a31_0
pyopenssl 19.1.0 py37_0
pyparsing 2.4.6 py_0
pypiwin32 223 pypi_0 pypi
pyqt 5.9.2 py37h6538335_2
pyreadline 2.1 py37_1
pyrsistent 0.15.7 py37he774522_0
pysocks 1.7.1 py37_0
pytables 3.6.1 py37h1da0976_0
pytest 5.3.5 py37_0
pytest-arraydiff 0.3 py37h39e3cac_0
pytest-astropy 0.8.0 py_0
pytest-astropy-header 0.1.2 py_0
pytest-doctestplus 0.5.0 py_0
pytest-openfiles 0.4.0 py_0
pytest-remotedata 0.3.2 py37_0
python 3.7.6 h60c2a47_2
python-dateutil 2.8.1 py_0
python-editor 1.0.4 py_0 conda-forge
python-jsonrpc-server 0.3.4 py_0
python-language-server 0.31.7 py37_0
python-libarchive-c 2.8 py37_13
python-snappy 0.5.4 py37hd25c944_1 conda-forge
python_abi 3.7 1_cp37m conda-forge
pytorch 1.4.0 py3.7_cuda101_cudnn7_0 pytorch
pytz 2019.3 py_0
pyviz_comms 0.7.3 py_0
pywavelets 1.1.1 py37he774522_0
pywin32 227 py37he774522_1
pywin32-ctypes 0.2.0 py37_1000
pywinpty 0.5.7 py37_0
pyyaml 5.3 py37he774522_0
pyzmq 18.1.1 py37ha925a31_0
qdarkstyle 2.8 py_0
qt 5.9.7 vc14h73c81de_0
qtawesome 0.6.1 py_0
qtconsole 4.6.0 py_1
qtpy 1.9.0 py_0
querystring_parser 1.2.4 py_0 conda-forge
re2 2019.08.01 vc14ha925a31_0
regex 2020.1.8 pypi_0 pypi
requests 2.22.0 py37_1
requests-oauthlib 1.3.0 pypi_0 pypi
restructuredtext-lint 1.3.0 pypi_0 pypi
rope 0.16.0 py_0
rtree 0.9.3 py37h21ff451_0
ruamel_yaml 0.15.87 py37he774522_0
scikit-image 0.16.2 py37h47e9c7a_0
scikit-learn 0.22.1 py37h6288b17_0
scipy 1.4.1 py37h9439919_0
seaborn 0.10.0 py_0
send2trash 1.5.0 py37_0
setuptools 45.2.0 py37_0
simplegeneric 0.8.1 py37_2
simplejson 3.17.0 py37hfa6e2cd_0 conda-forge
singledispatch 3.4.0.3 py37_0
sip 4.19.8 py37h6538335_0
six 1.14.0 py37_0
smmap2 2.0.5 py_0 conda-forge
snappy 1.1.7 h777316e_3
snowballstemmer 2.0.0 py_0
sortedcollections 1.1.2 py37_0
sortedcontainers 2.1.0 py37_0
soupsieve 1.9.5 py37_0
spacy 2.1.8 py37he980bc4_0 fastai
sphinx 2.4.0 py_0
sphinxcontrib 1.0 py37_1
sphinxcontrib-applehelp 1.0.1 py_0
sphinxcontrib-devhelp 1.0.1 py_0
sphinxcontrib-htmlhelp 1.0.2 py_0
sphinxcontrib-jsmath 1.0.1 py_0
sphinxcontrib-qthelp 1.0.2 py_0
sphinxcontrib-serializinghtml 1.1.3 py_0
sphinxcontrib-websupport 1.2.0 py_0
spyder 4.0.1 py37_0
spyder-kernels 1.8.1 py37_0
sqlalchemy 1.3.13 py37he774522_0
sqlite 3.31.1 he774522_0
sqlparse 0.3.0 py_0 conda-forge
srsly 0.1.0 py37h6538335_0 fastai
statsmodels 0.11.0 py37he774522_0
stevedore 1.32.0 pypi_0 pypi
sympy 1.5.1 py37_0
tabulate 0.8.6 py_0 conda-forge
tbb 2020.0 h74a9793_0
tblib 1.6.0 py_0
terminado 0.8.3 py37_0
testpath 0.4.4 py_0
thinc 7.0.8 py37he980bc4_0 fastai
thrift 0.11.0 py37h6538335_1001 conda-forge
thrift-cpp 0.11.0 h1ebf3fd_3
tk 8.6.8 hfa6e2cd_0
toml 0.10.0 pypi_0 pypi
toolz 0.10.0 py_0
torchvision 0.5.0 py37_cu101 pytorch
tornado 6.0.3 py37he774522_3
tqdm 4.42.1 py_0
traitlets 4.3.3 py37_0
typed-ast 1.4.1 pypi_0 pypi
ujson 1.35 py37hfa6e2cd_0
unicodecsv 0.14.1 py37_0
urllib3 1.25.8 py37_0
vc 14.1 h0510ff6_4
vs2015_runtime 14.16.27012 hf0eaf9b_1
waitress 1.4.3 py_0 conda-forge
wasabi 0.2.2 py_0 fastai
watchdog 0.10.2 py37_0
wcwidth 0.1.8 py_0
webencodings 0.5.1 py37_1
websocket-client 0.57.0 py37_0 conda-forge
werkzeug 1.0.0 py_0
wheel 0.34.2 py37_0
widgetsnbextension 3.5.1 py37_0
win_inet_pton 1.1.0 py37_0
win_unicode_console 0.5 py37_0
wincertstore 0.2 py37_0
winpty 0.4.3 4
wrapt 1.11.2 py37he774522_0
xarray 0.15.0 py_0 conda-forge
xlrd 1.2.0 py37_0
xlsxwriter 1.2.7 py_0
xlwings 0.17.1 py37_0
xlwt 1.3.0 py37_0
xskillscore 0.0.15 py_0 conda-forge
xz 5.2.4 h2fa13f4_4
yaml 0.1.7 hc54c509_2
yapf 0.28.0 py_0
zeromq 4.3.1 h33f27b4_3
zict 1.0.0 py_0
zipp 2.2.0 py_0
zlib 1.2.11 h62dcd97_3
zstd 1.3.7 h508b16e_0
- Are you running Dask locally or distributed? If distributed, what version?
distributed (2.10.1), using a LocalCluster (Client() with no arguments starts one):
from dask.distributed import Client
client = Client()
- Is this parquet file one that was written to abfs with Dask? If no, does a simple read-write operation with another file work, and how was the existing parquet file created? If yes, does a read-write operation to-from CSV work successfully?
- Have you recreated the problem with a minimal working example (small example dummy dataframe)? If so can you share that example so I can try to re-create your issue?
Good questions. I tackle them both in the MCVE code below.
I get EmptyDataError: No columns to parse from file with the csv files, and AzureHttpError: Server encountered an internal error with the parquet file.
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client
client = Client()
d = {'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]}
df = pd.DataFrame(data=d)
ddf = dd.from_pandas(df, npartitions=2)
STORAGE_OPTIONS={'account_name': 'ACCOUNT_NAME',
'account_key': 'ACCOUNT_KEY'}
# This works fine and I see the files in Microsoft Azure Storage Explorer
dd.to_csv(df=ddf,
          filename='abfs://tmp/tmp2/*.csv',
          storage_options=STORAGE_OPTIONS)
df = dd.read_csv('abfs://tmp/tmp2/*.csv', storage_options=STORAGE_OPTIONS)
---------------------------------------------------------------------------
EmptyDataError Traceback (most recent call last)
<ipython-input-33-4ef0af5e9369> in <module>
----> 1 df = dd.read_csv('abfs://tmp/tmp2/*.csv', storage_options=STORAGE_OPTIONS)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\io\csv.py in read(urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
576 storage_options=storage_options,
577 include_path_column=include_path_column,
--> 578 **kwargs
579 )
580
~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\io\csv.py in read_pandas(reader, urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
442
443 # Use sample to infer dtypes and check for presence of include_path_column
--> 444 head = reader(BytesIO(b_sample), **kwargs)
445 if include_path_column and (include_path_column in head.columns):
446 raise ValueError(
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
674 )
675
--> 676 return _read(filepath_or_buffer, kwds)
677
678 parser_f.__name__ = name
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
446
447 # Create the parser.
--> 448 parser = TextFileReader(fp_or_buf, **kwds)
449
450 if chunksize or iterator:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds)
878 self.options["has_index_names"] = kwds["has_index_names"]
879
--> 880 self._make_engine(self.engine)
881
882 def close(self):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in _make_engine(self, engine)
1112 def _make_engine(self, engine="c"):
1113 if engine == "c":
-> 1114 self._engine = CParserWrapper(self.f, **self.options)
1115 else:
1116 if engine == "python":
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, src, **kwds)
1889 kwds["usecols"] = self.usecols
1890
-> 1891 self._reader = parsers.TextReader(src, **kwds)
1892 self.unnamed_cols = self._reader.unnamed_cols
1893
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()
EmptyDataError: No columns to parse from file
# This works and I see it in Microsoft Azure Storage Explorer
dd.to_parquet(df=ddf,
              path='abfs://tmp/tmp.parquet',
              storage_options=STORAGE_OPTIONS)
df = dd.read_parquet('abfs://tmp/tmp.parquet',
                     storage_options=STORAGE_OPTIONS)
ERROR:azure.storage.common.storageclient:Client-Request-ID=fe8a8c36-8120-11ea-a33c-a0afbd853445 Retry policy did not allow for a retry: Server-Timestamp=Sat, 18 Apr 2020 03:03:08 GMT, Server-Request-ID=a5160140-d01e-006b-642d-1518c8000000, HTTP status code=500, Exception=Server encountered an internal error. Please try again after some time. ErrorCode: InternalError<?xml version="1.0" encoding="utf-8"?><Error><Code>InternalError</Code><Message>Server encountered an internal error. Please try again after some time.RequestId:a5160140-d01e-006b-642d-1518c8000000Time:2020-04-18T03:03:09.2047334Z</Message></Error>.
AzureHttpError Traceback (most recent call last)
<ipython-input-35-0b3e24138208> in <module>
1 df = dd.read_parquet('abfs://tmp/tmp.parquet',
----> 2 storage_options=STORAGE_OPTIONS)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\io\parquet\core.py in read_parquet(path, columns, filters, categories, index, storage_options, engine, gather_statistics, split_row_groups, chunksize, **kwargs)
231 filters=filters,
232 split_row_groups=split_row_groups,
--> 233 **kwargs
234 )
235 if meta.index.name is not None:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\io\parquet\fastparquet.py in read_metadata(fs, paths, categories, index, gather_statistics, filters, **kwargs)
176 # correspond to a row group (populated below).
177 parts, pf, gather_statistics, fast_metadata = _determine_pf_parts(
--> 178 fs, paths, gather_statistics, **kwargs
179 )
180
~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\io\parquet\fastparquet.py in _determine_pf_parts(fs, paths, gather_statistics, **kwargs)
127 open_with=fs.open,
128 sep=fs.sep,
--> 129 **kwargs.get("file", {})
130 )
131 if gather_statistics is None:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\fastparquet\api.py in __init__(self, fn, verify, open_with, root, sep)
109 fn2 = join_path(fn, '_metadata')
110 self.fn = fn2
--> 111 with open_with(fn2, 'rb') as f:
112 self._parse_header(f, verify)
113 fn = fn2
~\AppData\Local\Continuum\anaconda3\lib\site-packages\fsspec\spec.py in open(self, path, mode, block_size, cache_options, **kwargs)
722 autocommit=ac,
723 cache_options=cache_options,
--> 724 **kwargs
725 )
726 if not ac:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\adlfs\core.py in _open(self, path, mode, block_size, autocommit, cache_options, **kwargs)
552 autocommit=autocommit,
553 cache_options=cache_options,
--> 554 **kwargs,
555 )
556
~\AppData\Local\Continuum\anaconda3\lib\site-packages\adlfs\core.py in __init__(self, fs, path, mode, block_size, autocommit, cache_type, cache_options, **kwargs)
582 cache_type=cache_type,
583 cache_options=cache_options,
--> 584 **kwargs,
585 )
586
~\AppData\Local\Continuum\anaconda3\lib\site-packages\fsspec\spec.py in __init__(self, fs, path, mode, block_size, autocommit, cache_type, cache_options, **kwargs)
954 if mode == "rb":
955 if not hasattr(self, "details"):
--> 956 self.details = fs.info(path)
957 self.size = self.details["size"]
958 self.cache = caches[cache_type](
~\AppData\Local\Continuum\anaconda3\lib\site-packages\fsspec\spec.py in info(self, path, **kwargs)
499 if out:
500 return out[0]
--> 501 out = self.ls(path, detail=True, **kwargs)
502 path = path.rstrip("/")
503 out1 = [o for o in out if o["name"].rstrip("/") == path]
~\AppData\Local\Continuum\anaconda3\lib\site-packages\adlfs\core.py in ls(self, path, detail, invalidate_cache, delimiter, **kwargs)
446 # then return the contents
447 elif self._matches(
--> 448 container_name, path, as_directory=True, delimiter=delimiter
449 ):
450 logging.debug(f"{path} appears to be a directory")
~\AppData\Local\Continuum\anaconda3\lib\site-packages\adlfs\core.py in _matches(self, container_name, path, as_directory, delimiter)
386 prefix=path,
387 delimiter=delimiter,
--> 388 num_results=None,
389 )
390
~\AppData\Local\Continuum\anaconda3\lib\site-packages\azure\storage\blob\baseblobservice.py in list_blob_names(self, container_name, prefix, num_results, include, delimiter, marker, timeout)
1360 '_context': operation_context,
1361 '_converter': _convert_xml_to_blob_name_list}
-> 1362 resp = self._list_blobs(*args, **kwargs)
1363
1364 return ListGenerator(resp, self._list_blobs, args, kwargs)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\azure\storage\blob\baseblobservice.py in _list_blobs(self, container_name, prefix, marker, max_results, include, delimiter, timeout, _context, _converter)
1435 }
1436
-> 1437 return self._perform_request(request, _converter, operation_context=_context)
1438
1439 def get_blob_account_information(self, container_name=None, blob_name=None, timeout=None):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\azure\storage\common\storageclient.py in _perform_request(self, request, parser, parser_args, operation_context, expected_errors)
444 status_code,
445 exception_str_in_one_line)
--> 446 raise ex
447 finally:
448 # If this is a location locked operation and the location is not set,
~\AppData\Local\Continuum\anaconda3\lib\site-packages\azure\storage\common\storageclient.py in _perform_request(self, request, parser, parser_args, operation_context, expected_errors)
372 except AzureException as ex:
373 retry_context.exception = ex
--> 374 raise ex
375 except Exception as ex:
376 retry_context.exception = ex
~\AppData\Local\Continuum\anaconda3\lib\site-packages\azure\storage\common\storageclient.py in _perform_request(self, request, parser, parser_args, operation_context, expected_errors)
358 # and raised as an azure http exception
359 _http_error_handler(
--> 360 HTTPError(response.status, response.message, response.headers, response.body))
361
362 # Parse the response
~\AppData\Local\Continuum\anaconda3\lib\site-packages\azure\storage\common\_error.py in _http_error_handler(http_error)
113 ex.error_code = error_code
114
--> 115 raise ex
116
117
AzureHttpError: Server encountered an internal error. Please try again after some time. ErrorCode: InternalError
<?xml version="1.0" encoding="utf-8"?><Error><Code>InternalError</Code><Message>Server encountered an internal error. Please try again after some time.
RequestId:a5160140-d01e-006b-642d-1518c8000000
Time:2020-04-18T03:03:09.2047334Z</Message></Error>
I've just attempted to reproduce your example, but it worked on my end. Below is my code and results:
import pandas as pd
import dask.dataframe as dd
from distributed import Client
client = Client()
storage_options = <DEFINED>
d = {'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]}
df = pd.DataFrame(data=d)
ddf = dd.from_pandas(df, npartitions=2)
dd.to_csv(df=ddf,
          filename='abfs://datascience-dev/test_csvfile/*.csv',
          storage_options=storage_options)
df2 = dd.read_csv("abfs://datascience-dev/test_csvfile/*.csv", storage_options=storage_options)
df2.head() <returns successfully in Jupyter Notebook>
dd.to_parquet(ddf,
'abfs://datascience-dev/testfile.parquet',
storage_options=storage_options)
df3 = dd.read_parquet("abfs://datascience-dev/testfile.parquet",
storage_options=storage_options)
df3.head() <returns successfully in Jupyter Notebook>
This was run on Linux with Anaconda Python v3.6.7, and confirmed working on my Windows 10 machine as well.
Versions: adlfs and fsspec as above, with azure-storage-blob==2.1.0, azure-common==1.1.24, and azure-datalake-store==0.0.48. I see that you have azure-core installed, which I do not have, and which is not a dependency; you may want to try removing it. Looking through other packages that are logical suspects, I also have requests 2.23 rather than 2.22.
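To remove azure-core, assuming pip installed it in your environment:
> pip uninstall azure-core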
I will investigate further later today.
Thanks a lot for running that. Regarding the packages, I'll try a new env.
Here's a new environment. The error message is slightly different, but it looks like the same thing, along the lines of file(s) not found?
Create new env:
> conda create -n adlfs python=3.8
> conda activate adlfs
> pip install adlfs
> conda install -c conda-forge dask fastparquet ipython
Check packages:
> conda list:
adal 1.2.2 pypi_0 pypi
adlfs 0.2.0 pypi_0 pypi
azure-common 1.1.25 pypi_0 pypi
azure-datalake-store 0.0.48 pypi_0 pypi
azure-storage-blob 2.1.0 pypi_0 pypi
azure-storage-common 2.1.0 pypi_0 pypi
backcall 0.1.0 py_0 conda-forge
bokeh 2.0.1 py38h32f6830_0 conda-forge
ca-certificates 2020.4.5.1 hecc5488_0 conda-forge
certifi 2020.4.5.1 py38h32f6830_0 conda-forge
cffi 1.14.0 pypi_0 pypi
chardet 3.0.4 pypi_0 pypi
click 7.1.1 pyh8c360ce_0 conda-forge
cloudpickle 1.3.0 py_0 conda-forge
colorama 0.4.3 py_0 conda-forge
cryptography 2.9 pypi_0 pypi
cytoolz 0.10.1 py38hfa6e2cd_0 conda-forge
dask 2.14.0 py_0 conda-forge
dask-core 2.14.0 py_0 conda-forge
decorator 4.4.2 py_0 conda-forge
distributed 2.14.0 py38h32f6830_0 conda-forge
fastparquet 0.3.3 py38hc8d92b1_0 conda-forge
freetype 2.10.1 ha9979f8_0 conda-forge
fsspec 0.7.2 py_0 conda-forge
heapdict 1.0.1 py_0 conda-forge
idna 2.9 pypi_0 pypi
intel-openmp 2020.0 166
ipython 7.13.0 py38h32f6830_2 conda-forge
ipython_genutils 0.2.0 py_1 conda-forge
jedi 0.17.0 py38h32f6830_0 conda-forge
jinja2 2.11.2 pyh9f0ad1d_0 conda-forge
jpeg 9c hfa6e2cd_1001 conda-forge
libblas 3.8.0 15_mkl conda-forge
libcblas 3.8.0 15_mkl conda-forge
liblapack 3.8.0 15_mkl conda-forge
libpng 1.6.37 hfe6a214_1 conda-forge
libtiff 4.1.0 h885aae3_6 conda-forge
llvmlite 0.31.0 py38h32f6830_1 conda-forge
locket 0.2.0 py_2 conda-forge
lz4-c 1.9.2 h33f27b4_0 conda-forge
markupsafe 1.1.1 py38h9de7a3e_1 conda-forge
mkl 2020.0 166
msgpack-python 1.0.0 py38heaebd3c_1 conda-forge
numba 0.48.0 py38he350917_0 conda-forge
numpy 1.18.1 py38ha749109_1 conda-forge
olefile 0.46 py_0 conda-forge
openssl 1.1.1f hfa6e2cd_0 conda-forge
packaging 20.1 py_0 conda-forge
pandas 1.0.3 py38he6e81aa_1 conda-forge
parso 0.7.0 pyh9f0ad1d_0 conda-forge
partd 1.1.0 py_0 conda-forge
pickleshare 0.7.5 py38h32f6830_1001 conda-forge
pillow 7.1.1 py38h8103267_0 conda-forge
pip 20.0.2 py38_1
prompt-toolkit 3.0.5 py_0 conda-forge
psutil 5.7.0 py38h9de7a3e_1 conda-forge
pycparser 2.20 pypi_0 pypi
pygments 2.6.1 py_0 conda-forge
pyjwt 1.7.1 pypi_0 pypi
pyparsing 2.4.7 pyh9f0ad1d_0 conda-forge
python 3.8.2 h5fd99cc_11
python-dateutil 2.8.1 py_0 conda-forge
python_abi 3.8 1_cp38 conda-forge
pytz 2019.3 py_0 conda-forge
pyyaml 5.3.1 py38h9de7a3e_0 conda-forge
requests 2.23.0 pypi_0 pypi
setuptools 46.1.3 py38_0
six 1.14.0 py_1 conda-forge
sortedcontainers 2.1.0 py_0 conda-forge
sqlite 3.31.1 he774522_0
tblib 1.6.0 py_0 conda-forge
thrift 0.11.0 py38h6538335_1001 conda-forge
tk 8.6.10 hfa6e2cd_0 conda-forge
toolz 0.10.0 py_0 conda-forge
tornado 6.0.4 py38hfa6e2cd_0 conda-forge
traitlets 4.3.3 py38h32f6830_1 conda-forge
typing_extensions 3.7.4.1 py38h32f6830_3 conda-forge
urllib3 1.25.9 pypi_0 pypi
vc 14.1 h0510ff6_4
vs2015_runtime 14.16.27012 hf0eaf9b_1
wcwidth 0.1.9 pyh9f0ad1d_0 conda-forge
wheel 0.34.2 py38_0
wincertstore 0.2 py38_0
xz 5.2.5 h2fa13f4_0 conda-forge
yaml 0.2.3 he774522_0 conda-forge
zict 2.0.0 py_0 conda-forge
zlib 1.2.11 h2fa13f4_1006 conda-forge
zstd 1.4.4 h9f78265_3 conda-forge
Setup code:
import pandas as pd
import dask.dataframe as dd
from distributed import Client
client = Client()
storage_options = <DEFINED>
d = {'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]}
df = pd.DataFrame(data=d)
ddf = dd.from_pandas(df, npartitions=2)
csv example:
dd.to_csv(df=ddf,
filename='abfs://<container>/test_csvfile/*.csv',
storage_options=storage_options)
df2 = dd.read_csv('abfs://<container>/test_csvfile/*.csv',
storage_options=storage_options)
Error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\131416\AppData\Local\Continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\csv.py", line 566, in read
return read_pandas(
File "C:\Users\131416\AppData\Local\Continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\csv.py", line 398, in read_pandas
b_out = read_bytes(
File "C:\Users\131416\AppData\Local\Continuum\anaconda3\envs\adlfs\lib\site-packages\dask\bytes\core.py", line 96, in read_bytes
raise IOError("%s resolved to no files" % urlpath)
OSError: abfs://<container>/test_csvfile/*.csv resolved to no files
Print a few things using %debug:
ipdb> urlpath
'abfs://tmp/test_csvfile/*.csv'
ipdb> paths
[]
ipdb> b_lineterminator
b'\n'
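One way to narrow this down might be to call the filesystem directly, bypassing Dask (a sketch, using the same credential placeholders as the MCVE; if ls() sees the uploaded part files but glob() returns [], the problem is in glob/path handling rather than in the upload itself):
from adlfs import AzureBlobFileSystem

# Same credentials as passed via storage_options above
fs = AzureBlobFileSystem(account_name='ACCOUNT_NAME', account_key='ACCOUNT_KEY')
print(fs.ls('tmp/test_csvfile'))          # does a plain listing see the csv parts?
print(fs.glob('tmp/test_csvfile/*.csv'))  # does globbing find them?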
parquet example:
dd.to_parquet(ddf,
'abfs://<container>/testfile.parquet',
storage_options=storage_options)
df3 = dd.read_parquet("abfs://<container>/testfile.parquet",
storage_options=storage_options)
Error message:
>>> df3 = dd.read_parquet("abfs://<container>/testfile.parquet", storage_options=storage_options)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\131416\AppData\Local\Continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\parquet\core.py", line 225, in read_parquet
meta, statistics, parts = engine.read_metadata(
File "C:\Users\131416\AppData\Local\Continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\parquet\fastparquet.py", line 202, in read_metadata
parts, pf, gather_statistics, fast_metadata = _determine_pf_parts(
File "C:\Users\131416\AppData\Local\Continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\parquet\fastparquet.py", line 147, in _determine_pf_parts
base, fns = _analyze_paths(paths, fs)
File "C:\Users\131416\AppData\Local\Continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\parquet\utils.py", line 405, in _analyze_paths
basepath = path_parts_list[0][:-1]
IndexError: list index out of range
Print a few things using %debug:
ipdb> path_parts_list
[]
ipdb> file_list
[]
ipdb> paths
[]
ipdb> fs
<adlfs.core.AzureBlobFileSystem object at 0x0000019872422C70>
> c:\users\131416\appdata\local\continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\parquet\core.py(225)read_parquet()
223 index = [index]
224
--> 225 meta, statistics, parts = engine.read_metadata(
226 fs,
227 paths,
ipdb> fs
<adlfs.core.AzureBlobFileSystem object at 0x0000019872422C70>
ipdb> paths
['tmp/testfile.parquet']
ipdb> gather_statistics
ipdb>
> c:\users\131416\appdata\local\continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\parquet\fastparquet.py(147)_determine_pf_parts()
145 # This is a directory, check for _metadata, then _common_metadata
146 paths = fs.glob(paths[0] + fs.sep + "*")
--> 147 base, fns = _analyze_paths(paths, fs)
148 if "_metadata" in fns:
149 # Using _metadata file (best-case scenario)
ipdb> paths
[]
ipdb> u
> c:\users\131416\appdata\local\continuum\anaconda3\envs\adlfs\lib\site-packages\dask\dataframe\io\parquet\fastparquet.py(202)read_metadata()
200 # then each part will correspond to a file. Otherwise, each part will
201 # correspond to a row group (populated below).
--> 202 parts, pf, gather_statistics, fast_metadata = _determine_pf_parts(
203 fs, paths, gather_statistics, **kwargs
204 )
ipdb> paths
['tmp/testfile.parquet']
ipdb> paths[0]
'tmp/testfile.parquet'
It seems paths moves from ['tmp/testfile.parquet'] to [] at some point, I think around https://github.com/dask/dask/blob/master/dask/dataframe/io/parquet/fastparquet.py#L146, where fs.glob(paths[0] + fs.sep + "*") comes back empty.
I'll try pyarrow.
Create new env:
> conda create -n adlfs-pa python=3.8
> conda activate adlfs-pa
> pip install adlfs
> conda install -c conda-forge dask pyarrow ipython
Check packages:
> conda list:
abseil-cpp 20200225.1 he025d50_2 conda-forge
adal 1.2.2 pypi_0 pypi
adlfs 0.2.0 pypi_0 pypi
arrow-cpp 0.16.0 py38hd3bb158_3 conda-forge
aws-sdk-cpp 1.7.164 vc14h867dc94_1 [vc14] conda-forge
azure-common 1.1.25 pypi_0 pypi
azure-datalake-store 0.0.48 pypi_0 pypi
azure-storage-blob 2.1.0 pypi_0 pypi
azure-storage-common 2.1.0 pypi_0 pypi
backcall 0.1.0 py_0 conda-forge
bokeh 2.0.1 py38h32f6830_0 conda-forge
boost-cpp 1.72.0 h0caebb8_0 conda-forge
brotli 1.0.7 he025d50_1001 conda-forge
bzip2 1.0.8 hfa6e2cd_2 conda-forge
c-ares 1.15.0 h2fa13f4_1001 conda-forge
ca-certificates 2020.4.5.1 hecc5488_0 conda-forge
certifi 2020.4.5.1 py38h32f6830_0 conda-forge
cffi 1.14.0 pypi_0 pypi
chardet 3.0.4 pypi_0 pypi
click 7.1.1 pyh8c360ce_0 conda-forge
cloudpickle 1.3.0 py_0 conda-forge
colorama 0.4.3 py_0 conda-forge
cryptography 2.9 pypi_0 pypi
curl 7.69.1 h1dcc11c_0 conda-forge
cytoolz 0.10.1 py38hfa6e2cd_0 conda-forge
dask 2.14.0 py_0 conda-forge
dask-core 2.14.0 py_0 conda-forge
decorator 4.4.2 py_0 conda-forge
distributed 2.14.0 py38h32f6830_0 conda-forge
freetype 2.10.1 ha9979f8_0 conda-forge
fsspec 0.7.2 py_0 conda-forge
gflags 2.2.2 he025d50_1002 conda-forge
glog 0.4.0 h0174b99_3 conda-forge
grpc-cpp 1.28.1 hb1a2610_1 conda-forge
heapdict 1.0.1 py_0 conda-forge
idna 2.9 pypi_0 pypi
intel-openmp 2020.0 166
ipython 7.13.0 py38h32f6830_2 conda-forge
ipython_genutils 0.2.0 py_1 conda-forge
jedi 0.17.0 py38h32f6830_0 conda-forge
jinja2 2.11.2 pyh9f0ad1d_0 conda-forge
jpeg 9c hfa6e2cd_1001 conda-forge
krb5 1.17.1 hdd46e55_0 conda-forge
libblas 3.8.0 15_mkl conda-forge
libcblas 3.8.0 15_mkl conda-forge
libcurl 7.69.1 h1dcc11c_0 conda-forge
liblapack 3.8.0 15_mkl conda-forge
libpng 1.6.37 hfe6a214_1 conda-forge
libprotobuf 3.11.4 h1a1b453_0 conda-forge
libssh2 1.8.2 h642c060_2 conda-forge
libtiff 4.1.0 h885aae3_6 conda-forge
locket 0.2.0 py_2 conda-forge
lz4-c 1.9.2 h33f27b4_0 conda-forge
markupsafe 1.1.1 py38h9de7a3e_1 conda-forge
mkl 2020.0 166
msgpack-python 1.0.0 py38heaebd3c_1 conda-forge
numpy 1.18.1 py38ha749109_1 conda-forge
olefile 0.46 py_0 conda-forge
openssl 1.1.1f hfa6e2cd_0 conda-forge
packaging 20.1 py_0 conda-forge
pandas 1.0.3 py38he6e81aa_1 conda-forge
parquet-cpp 1.5.1 2 conda-forge
parso 0.7.0 pyh9f0ad1d_0 conda-forge
partd 1.1.0 py_0 conda-forge
pickleshare 0.7.5 py38h32f6830_1001 conda-forge
pillow 7.1.1 py38h8103267_0 conda-forge
pip 20.0.2 py38_1
prompt-toolkit 3.0.5 py_0 conda-forge
psutil 5.7.0 py38h9de7a3e_1 conda-forge
pyarrow 0.16.0 py38h57df961_2 conda-forge
pycparser 2.20 pypi_0 pypi
pygments 2.6.1 py_0 conda-forge
pyjwt 1.7.1 pypi_0 pypi
pyparsing 2.4.7 pyh9f0ad1d_0 conda-forge
python 3.8.2 h5fd99cc_11
python-dateutil 2.8.1 py_0 conda-forge
python_abi 3.8 1_cp38 conda-forge
pytz 2019.3 py_0 conda-forge
pyyaml 5.3.1 py38h9de7a3e_0 conda-forge
re2 2020.04.01 vc14h6538335_0 [vc14] conda-forge
requests 2.23.0 pypi_0 pypi
setuptools 46.1.3 py38_0
six 1.14.0 py_1 conda-forge
snappy 1.1.8 he025d50_1 conda-forge
sortedcontainers 2.1.0 py_0 conda-forge
sqlite 3.31.1 he774522_0
tblib 1.6.0 py_0 conda-forge
thrift-cpp 0.13.0 h1907cbf_2 conda-forge
tk 8.6.10 hfa6e2cd_0 conda-forge
toolz 0.10.0 py_0 conda-forge
tornado 6.0.4 py38hfa6e2cd_0 conda-forge
traitlets 4.3.3 py38h32f6830_1 conda-forge
typing_extensions 3.7.4.1 py38h32f6830_3 conda-forge
urllib3 1.25.9 pypi_0 pypi
vc 14.1 h0510ff6_4
vs2015_runtime 14.16.27012 hf0eaf9b_1
wcwidth 0.1.9 pyh9f0ad1d_0 conda-forge
wheel 0.34.2 py38_0
wincertstore 0.2 py38_0
xz 5.2.5 h2fa13f4_0 conda-forge
yaml 0.2.3 he774522_0 conda-forge
zict 2.0.0 py_0 conda-forge
zlib 1.2.11 h2fa13f4_1006 conda-forge
zstd 1.4.4 h9f78265_3 conda-forge
Setup code:
import pandas as pd
import dask.dataframe as dd
from distributed import Client
client = Client()
storage_options = <DEFINED>
d = {'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]}
df = pd.DataFrame(data=d)
ddf = dd.from_pandas(df, npartitions=2)
csv example:
dd.to_csv(df=ddf,
filename='abfs://tmp/test_csvfile/*.csv',
storage_options=storage_options)
df2 = dd.read_csv('abfs://tmp/test_csvfile/*.csv',
storage_options=storage_options)
Same error as above
parquet example:
dd.to_parquet(ddf,
'abfs://tmp/testfile.parquet',
storage_options=storage_options)
df3 = dd.read_parquet("abfs://tmp/testfile.parquet",
storage_options=storage_options)
Same error as above
Some output of %debug:
> c:\users\131416\appdata\local\continuum\anaconda3\envs\adlfs-pa\lib\site-packages\dask\dataframe\io\parquet\utils.py(405)_analyze_paths()
403 path_parts_list = [_join_path(fn).split("/") for fn in file_list]
404 if root is False:
--> 405 basepath = path_parts_list[0][:-1]
406 for i, path_parts in enumerate(path_parts_list):
407 j = len(path_parts) - 1
ipdb> path_parts_list
[]
> c:\users\131416\appdata\local\continuum\anaconda3\envs\adlfs-pa\lib\site-packages\dask\dataframe\io\parquet\arrow.py(129)_determine_dataset_parts()
127 # This is a directory, check for _metadata, then _common_metadata
128 allpaths = fs.glob(paths[0] + fs.sep + "*")
--> 129 base, fns = _analyze_paths(allpaths, fs)
130 if "_metadata" in fns and "validate_schema" not in dataset_kwargs:
131 dataset_kwargs["validate_schema"] = False
ipdb> allpaths
[]
> c:\users\131416\appdata\local\continuum\anaconda3\envs\adlfs-pa\lib\site-packages\dask\dataframe\io\parquet\arrow.py(220)read_metadata()
218 # then each part will correspond to a file. Otherwise, each part will
219 # correspond to a row group (populated below)
--> 220 parts, dataset = _determine_dataset_parts(
221 fs, paths, gather_statistics, filters, kwargs.get("dataset", {})
222 )
ipdb> paths
['tmp/testfile.parquet']
ipdb> parts
*** NameError: name 'parts' is not defined
ipdb> dataset
*** NameError: name 'dataset' is not defined
ipdb> fs
<adlfs.core.AzureBlobFileSystem object at 0x0000020136448D60>
ipdb> gather_statistics
ipdb> filters
ipdb>
Switching from fastparquet to pyarrow doesn't seem to matter.
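(For reference, the engine can also be forced explicitly rather than via a separate env, e.g.:
df3 = dd.read_parquet('abfs://tmp/testfile.parquet', engine='pyarrow', storage_options=storage_options)
though the result here is the same either way.)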
Just tested reading the csv file, and it worked on my Linux machine, although I still got the AzureHttpError for the parquet file. I was also curious about path:
> /home/ray/local/bin/anaconda3/envs/adlfs/lib/python3.8/site-packages/fsspec/spec.py(542)info()
540 if out:
541 return out[0]
--> 542 out = self.ls(path, detail=True, **kwargs)
543 path = path.rstrip("/")
544 out1 = [o for o in out if o["name"].rstrip("/") == path]
ipdb> path
'tmp/testfile.parquet/_metadata/_metadata'
> /home/ray/local/bin/anaconda3/envs/adlfs/lib/python3.8/site-packages/adlfs/core.py(576)__init__()
574 self.blob = blob
575
--> 576 super().__init__(
577 fs=fs,
578 path=path,
ipdb> fs
<adlfs.core.AzureBlobFileSystem object at 0x7efdfca6fe80>
ipdb> path
'tmp/testfile.parquet/_metadata/_metadata'
ipdb>
> /home/ray/local/bin/anaconda3/envs/adlfs/lib/python3.8/site-packages/dask/dataframe/io/parquet/fastparquet.py(202)read_metadata()
200 # then each part will correspond to a file. Otherwise, each part will
201 # correspond to a row group (populated below).
--> 202 parts, pf, gather_statistics, fast_metadata = _determine_pf_parts(
203 fs, paths, gather_statistics, **kwargs
204 )
ipdb> paths
['tmp/testfile.parquet']
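The doubled _metadata looks consistent with the fastparquet traceback in the earlier comment (api.py line 109: fn2 = join_path(fn, '_metadata')): ParquetFile first tries opening fn joined with '_metadata', so a path that already ends in _metadata gets the component twice. A toy illustration (join_path below is a hypothetical stand-in, not fastparquet's actual implementation):
# Hypothetical stand-in for fastparquet's join_path, just to show the doubling
def join_path(*parts):
    return '/'.join(p.strip('/') for p in parts)

fn = 'tmp/testfile.parquet/_metadata'  # a path that already points at the metadata file
print(join_path(fn, '_metadata'))      # -> 'tmp/testfile.parquet/_metadata/_metadata'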
Reading the csv file worked fine on my Mac as well, but I get the same AzureHttpError on the parquet file.
I see there are two things here:
- Try and read the csv file on my Windows machine, as I'm able to on my Linux and Mac machines
- Try and read the parquet file
I just uploaded v0.2.2. Give it a shot and let me know if it solves your issue. It seems there was an issue with parsing container names on Windows, which should be fixed. I also found a change in fsspec v0.6.3 that is causing adlfs to fail one of its unit tests; I need to verify everything is OK before I allow fsspec >= 0.6.3, so fsspec is currently pinned to >= 0.6.0, <= 0.6.2.
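To pick it up in the test env, something like this should do it (assuming pip):
> pip install --upgrade adlfs==0.2.2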
Thanks. I'll try it tomorrow.