fsspec / filesystem_spec
A specification that python filesystems should adhere to.
License: BSD 3-Clause "New" or "Revised" License
I'm working on an Azure Datalake Gen2 package using fsspec and encountered strange behavior with BytesCache: reading a file from the datalake and calling compute() on the cached dataframe yields an exact duplicate of the dataframe's rows.
It appears to be caused by the _fetch method in BytesCache when reading a file into Dask. During what should be the first read when calling dd.read_csv(), the value of start (when None) is passed as zero, but when calling the .compute() method on the dataframe, the file gets read again, leading to the duplication of rows. Explicitly setting start, end = 0, 0 like:
def _fetch(self, start, end):
    # TODO: only set start/end after fetch, in case it fails?
    # is this where retry logic might go?
    if self.start is None and self.end is None:
        # First read
        start, end = 0, 0
        self.cache = self.fetcher(start, end + self.blocksize)
        self.start = start
appears to fix it for me. Have you observed any issues with BytesCache elsewhere?
Currently copy is just implemented with shutil.copyfile(), i.e.
def copy(self, path1, path2, **kwargs):
    """ Copy within two locations in the filesystem"""
    shutil.copyfile(path1, path2)

get = copy
put = copy
which only works for single files.
with of as f:
    f.exists(path)
That's fine for local files. However, I ultimately want to use this in underlying code that intake will use, and path being https, s3 or c:\ can all be valid possibilities.
HTTPFile doesn't have an exists(); what is the recommended way to do this?
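One pattern that works across backends (a sketch based on the OpenFile attributes, as also used later in this thread) is to ask the owning filesystem rather than the file object:

import fsspec

of = fsspec.open("https://example.com/data.csv")  # hypothetical URL
# an OpenFile carries both the filesystem instance and the normalized path
if of.fs.exists(of.path):
    with of as f:
        data = f.read()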
I originally filed this under s3fs (here), but adding it here as related:
Consider for example:
fs.glob('mybucket/*/????/*.zip')
returns
['mybucket/folder/2018/file.zip', 'mybucket/folder/2019/file.zip']
whereas
fs.glob('mybucket/folder/????/*.zip')
returns
[]
If the first * comes after the ?, the ? characters end up in the root path that gets walked, so the path is never found. Instead of searching for * only, this bit should consider the other glob special characters too:
https://github.com/intake/filesystem_spec/blob/c0c4f2bfa69127f5d6daf81b98b8584669041468/fsspec/spec.py#L417
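A minimal sketch of the suggested change (hypothetical, not the merged fix): compute the root to walk from the first occurrence of any glob special character, not just the first "*":

# "*?[" are the glob special characters; stop the literal prefix at the
# first of them, then trim back to the last complete path segment
path = "mybucket/folder/????/*.zip"
ind = min(
    (path.index(c) for c in "*?[" if c in path),
    default=len(path),
)
root = path[:ind].rsplit("/", 1)[0]  # -> "mybucket/folder"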
When doing

import pandas as pd
import s3fs
import dask.dataframe as dd

ddf = dd.read_csv("s3://bucket/data/path.csv")

I get the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-34-64a49ed7781b> in <module>
----> 1 df = dd.read_csv("s3://amra-s3-0066-00-external/promotion-data/transactions/Kay_trans_data_201806_202005.csv", storage_options={'token': fs})
/opt/anaconda/envs/promo/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read(urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
580 storage_options=storage_options,
581 include_path_column=include_path_column,
--> 582 **kwargs
583 )
584
/opt/anaconda/envs/promo/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read_pandas(reader, urlpath, blocksize, collection, lineterminator, compression, sample, enforce, assume_missing, storage_options, include_path_column, **kwargs)
407 compression=compression,
408 include_path=include_path_column,
--> 409 **(storage_options or {})
410 )
411
/opt/anaconda/envs/promo/lib/python3.7/site-packages/dask/bytes/core.py in read_bytes(urlpath, delimiter, not_zero, blocksize, sample, compression, include_path, **kwargs)
93
94 """
---> 95 fs, fs_token, paths = get_fs_token_paths(urlpath, mode="rb", storage_options=kwargs)
96
97 if len(paths) == 0:
/opt/anaconda/envs/promo/lib/python3.7/site-packages/fsspec/core.py in get_fs_token_paths(urlpath, mode, num, name_function, storage_options, protocol)
313 cls = get_filesystem_class(protocol)
314
--> 315 options = cls._get_kwargs_from_urls(urlpath)
316 path = cls._strip_protocol(urlpath)
317 update_storage_options(options, storage_options)
AttributeError: type object 'S3FileSystem' has no attribute '_get_kwargs_from_urls'
urlsmall = 'https://six-library.s3.amazonaws.com/sicd_example_RMA_RGZERO_RE32F_IM32F_cropped_multiple_image_segments.nitf'
#test = cf_sicd.Reader(urlsmall)
test = fsspec.open(urlsmall)
test.fs.exists(test.path)
True
with test as fid:
    fid.read(9)
AttributeError Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 with test as fid:
2 fid.read(9)
C:\Apps\Anaconda3\envs\pyviz_dev\lib\site-packages\fsspec-0.2.0+28.g9b388d4-py3.7.egg\fsspec\core.py in __enter__(self)
57 mode = self.mode.replace('t', '').replace('b', '') + 'b'
58
---> 59 f = self.fs.open(self.path, mode=mode)
60
61 fobjects = [f]
C:\Apps\Anaconda3\envs\pyviz_dev\lib\site-packages\fsspec-0.2.0+28.g9b388d4-py3.7.egg\fsspec\spec.py in open(self, path, mode, block_size, **kwargs)
584 ac = kwargs.pop('autocommit', not self._intrans)
585 f = self._open(path, mode=mode, block_size=block_size,
--> 586 autocommit=ac, **kwargs)
587 if not ac:
588 self.transaction.files.append(f)
C:\Apps\Anaconda3\envs\pyviz_dev\lib\site-packages\fsspec-0.2.0+28.g9b388d4-py3.7.egg\fsspec\implementations\http.py in _open(self, url, mode, block_size, **kwargs)
128 kw.pop('autocommit', None)
129 if block_size:
--> 130 return HTTPFile(self, url, self.session, block_size, **kw)
131 else:
132 kw['stream'] = True
C:\Apps\Anaconda3\envs\pyviz_dev\lib\site-packages\fsspec-0.2.0+28.g9b388d4-py3.7.egg\fsspec\implementations\http.py in __init__(self, fs, url, session, block_size, mode, **kwargs)
175 try:
176 size = file_size(url, self.session, allow_redirects=True,
--> 177 **self.kwargs)
178 except (ValueError, requests.HTTPError):
179 # No size information - only allow read() and no seek()
AttributeError: 'HTTPFile' object has no attribute 'kwargs'
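The traceback suggests HTTPFile.__init__ reads self.kwargs before assigning it. A minimal sketch of the suspected ordering bug (a hypothetical reconstruction, not the actual fsspec source):

class HTTPFileSketch:
    def __init__(self, url, **kwargs):
        self.url = url
        # reading self.kwargs before this assignment would raise the
        # AttributeError seen above, e.g. file_size(url, **self.kwargs)
        self.kwargs = kwargs
        self.size = len(url)  # stand-in for the file_size() call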
Several methods should be able to automatically support paths containing "*" (or lists of paths?), e.g., get/put, delete, copy and move. For methods with a target, the target would be interpreted as a directory.
There has been a bit of development since the 0.2.0 release. I'm wondering when the next release is planned. I am about to release a version of xmitgcm that needs the feature added in #46, so I need to decide what to do about the dependency. I would love to just be able to depend on 0.2.1, rather than github master. But I don't want to rush a release unnecessarily.
h5py now supports file-like objects. This works with the built-in open(), but I get an error when I try this with fsspec:
import h5py
import fsspec

with fsspec.open_files('test.h5', 'wb')[0] as f:
    with h5py.File(f, 'w') as h5file:
        h5file.attrs['foo'] = 'bar'
---------------------------------------------------------------------------
NotADirectoryError Traceback (most recent call last)
<ipython-input-15-bc0b4a6a665e> in <module>()
2 import fsspec
3
----> 4 with fsspec.open_files('test.h5', 'wb')[0] as f:
5 with h5py.File(f, 'w') as h5file:
6 h5file.attrs['foo'] = 'bar'
/usr/local/lib/python3.6/dist-packages/fsspec/core.py in __enter__(self)
57 mode = self.mode.replace('t', '').replace('b', '') + 'b'
58
---> 59 f = self.fs.open(self.path, mode=mode)
60
61 fobjects = [f]
/usr/local/lib/python3.6/dist-packages/fsspec/spec.py in open(self, path, mode, block_size, **kwargs)
562 ac = kwargs.pop('autocommit', not self._intrans)
563 f = self._open(path, mode=mode, block_size=block_size,
--> 564 autocommit=ac, **kwargs)
565 if not ac:
566 self.transaction.files.append(f)
/usr/local/lib/python3.6/dist-packages/fsspec/implementations/local.py in _open(self, path, mode, block_size, **kwargs)
70
71 def _open(self, path, mode='rb', block_size=None, **kwargs):
---> 72 return LocalFileOpener(path, mode, **kwargs)
73
74 def touch(self, path, **kwargs):
/usr/local/lib/python3.6/dist-packages/fsspec/implementations/local.py in __init__(self, path, mode, autocommit)
85 self.autocommit = autocommit
86 if autocommit or 'w' not in mode:
---> 87 self.f = open(path, mode=mode)
88 else:
89 # TODO: check if path is writable?
NotADirectoryError: [Errno 20] Not a directory: 'test.h5/0.part'
Working example with open():

with open('test.h5', 'wb') as f:
    with h5py.File(f, 'w') as h5file:
        h5file.attrs['foo'] = 'bar'
As discussed previously, all file-systems should support full glob. Also, recursive should be allowed in various methods (see fsspec/gcsfs#139). This is somewhat similar to #17.
From intake/intake#367
>>> x = dd.read_csv("https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv").compute()
>>> y = pd.read_csv("https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv")
>>> len(x), len(y)
(519, 1458)
@martindurant thinks this may be because GitHub's HTTP server isn't respecting the 'Accept-Encoding: identity' request header, and is sending back the compressed content length instead.
How should we handle this situation?
The description mentions pyfilesystem, but doesn't explain how this is different. What are the differences and why is a new filesystem API needed? It sounds like the other one might already have a community so working within that API (if possible) could make sense.
Am I missing something?
$ python3 -V
Python 3.6.7
$ pip3 install fsspec
Collecting fsspec
Using cached https://files.pythonhosted.org/packages/00/cd/f3636189937a08830ebb488c6acd1ccb9c4f7f54e55c62d0106f675a4887/fsspec-0.1.2.tar.gz
Collecting python>=3.6 (from fsspec)
Could not find a version that satisfies the requirement python>=3.6 (from fsspec) (from versions: )
No matching distribution found for python>=3.6 (from fsspec)
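This looks like the sdist declaring python>=3.6 as a package dependency rather than an interpreter constraint. A sketch of the likely correction in setup.py (assuming setuptools; the version number is illustrative):

from setuptools import setup

setup(
    name="fsspec",
    version="0.1.3",
    # the interpreter constraint belongs here...
    python_requires=">=3.6",
    # ...not in install_requires, where pip looks for a distribution
    # literally named "python" and fails as shown above
    install_requires=[],
)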
Many file system libraries support glob strings (s3fs, gcsfs, and some of the libraries in dask.bytes); however, some, like HTTPFileSystem, do not. In this case it is difficult (if not impossible), since HTTP servers don't always list directories.
@pitrou wrote the following on [email protected]
Here are some comments:
- API naming: you seem to favour re-using Unix command-line monickers in some places, while using more regular verbs or names in other places. I think it should be consistent. Since the Unix command-line doesn't exactly cover the exposed functionality, and since Unix tends to favour short cryptic names, I think it's better to use Python-like naming (which is also more familiar to non-Unix users). For example "move" or "rename" or "replace" instead of "mv", etc.
- **kwargs parameters: a couple APIs (mkdir, put...) allow passing arbitrary parameters, which I assume are intended to be backend-specific. It makes it difficult to add other optional parameters to those APIs in the future. So I'd make the backend-specific directives a single (optional) dict parameter rather than a **kwargs.
- invalidate_cache doesn't state whether it invalidates recursively or not (recursively sounds better intuitively?). Also, I think it would be more flexible to take a list of paths rather than a single path.
- du: the effect of the deep parameter isn't obvious to me. I don't know what it would mean not to recurse here: what is the size of a directory if you don't recurse into it?
- glob may need a formal definition (are trailing slashes significant for directory or symlink resolution? this kind of thing), though you may want to keep edge cases backend-specific.
- are head and tail at all useful? They can be easily recreated using a generic open facility.
- read_block tries to do too much in a single API IMHO, and using open directly is more flexible anyway.
- if touch is intended to emulate the Unix API of the same name, the docstring should state "Create empty file or update last modification timestamp".
- the information dicts returned by several APIs (ls, info...) need standardizing, at least for non backend-specific fields.
- if the backend is a networked filesystem with non-trivial latency, perhaps the operations would deserve being batched (operate on several paths at once), though I will happily defer to your expertise on the topic.
I observe an attribute error when the python program/session ends after using the PyArrowHDFS implementation. This does not impact the program in any way during execution (as far as I have observed).
Reproducer:
from fsspec.implementations.hdfs import PyArrowHDFS
fs = PyArrowHDFS()
After exit:
AttributeError: 'HadoopFileSystem' object has no attribute 'close'
Exception ignored in: 'pyarrow.lib.HadoopFileSystem.__dealloc__'
AttributeError: 'HadoopFileSystem' object has no attribute 'close'
Currently local file open() supports only mode. It would be useful to add at least encoding (since it's implemented e.g. for S3FileSystem), if not all builtin open parameters (buffering, errors, etc.).
With the cached FS and dask worker FS, not to mention zip, we have file-systems which take as input other file-systems or URLs. It would be convenient, but potentially confusing, to be able to address them with a single url:
"zip://s3://bucket/file.zip"
In practice, it may be tricky to determine which parts of the path belong to which FS. This would probably need more syntax, and might lead to more confusion than it clears up.
pip and conda
We should specify here that this output format has the same structure as os.walk.
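For reference, a sketch of that structure (assuming os.walk semantics of (root, dirs, files) triples; the directory path is hypothetical):

import fsspec

fs = fsspec.filesystem("file")
for root, dirs, files in fs.walk("/tmp/example"):
    # each iteration yields the current directory path, the names of its
    # subdirectories, and the names of its files, exactly like os.walk
    print(root, dirs, files)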
The caching file-system downloads from other file systems to a local store, and uses MMap to only fetch the chunks that are accessed, to a sparse file. An alternative mode of the same class would simply download whole files on first access, potentially with on-the-fly decompression, and then provide the local copy. This could then replace most/all of the functionality in Intake's cache layer (although who takes responsibility for metadata for checkpointing remains to be decided).
I'm looking for more details about using a local cache with fsspec, specifically:
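For orientation, a minimal sketch of wiring up the block-caching filesystem described above (the s3 target and paths are example assumptions; parameter names follow fsspec.implementations.cached):

from fsspec.implementations.cached import CachingFileSystem

fs = CachingFileSystem(
    target_protocol="s3",          # the remote filesystem being wrapped
    cache_storage="/tmp/fscache",  # local directory holding the sparse files
)
with fs.open("bucket/path/file.bin", "rb") as f:
    chunk = f.read(1024)  # only the blocks actually read are downloaded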
Currently, fuse, dask and hdfs are not being tested on Travis, although tests exist, leading to apparently low code coverage.
Each of these has had working fixtures in the past, and it would be reasonable to make the effort to add them again. HDFS may not be tractable, since it must be run on an edge node - we can indeed run it (dask nightlies is doing this), but the coverage will not show up.
>>> import os, fsspec
>>> fsspec.__version__
'0.3.6'
>>> fs = fsspec.filesystem("file")
>>> fs.touch(f"{os.getcwd()}/testfile")
>>> fs.rm(f"{os.getcwd()}/testfile", recursive=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/kfranz/ae/repo/env/lib/python3.7/site-packages/fsspec/implementations/local.py", line 90, in rm
shutil.rmtree(path)
File "/Users/kfranz/ae/repo/env/lib/python3.7/shutil.py", line 491, in rmtree
_rmtree_safe_fd(fd, path, onerror)
File "/Users/kfranz/ae/repo/env/lib/python3.7/shutil.py", line 410, in _rmtree_safe_fd
onerror(os.scandir, path, sys.exc_info())
File "/Users/kfranz/ae/repo/env/lib/python3.7/shutil.py", line 406, in _rmtree_safe_fd
with os.scandir(topfd) as scandir_it:
NotADirectoryError: [Errno 20] Not a directory: '/Users/kfranz/ae/repo/testfile'
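A sketch of the guard one would expect in LocalFileSystem.rm (hypothetical, mirroring os/shutil semantics):

import os
import shutil

def rm(path, recursive=False):
    if recursive and os.path.isdir(path):
        shutil.rmtree(path)  # only recurse into actual directories
    else:
        # plain files are removed directly, even when recursive=True
        os.remove(path)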
Question:
I have a final hurdle in getting https files read: metadata gets read correctly, but np.fromfile doesn't like the fid. It's not clear to me whether readbytes is a close replacement. We currently end up looping, using np.fromfile and fid.seek, and I suspect there's a cleaner way now, using fsspec to subsample (seek to where we need). What's the best np.fromfile replacement to use with fsspec?
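A common substitute (an assumption about the intended usage, not from the original question) is to seek and read raw bytes through fsspec and decode them with np.frombuffer:

import numpy as np
import fsspec

url = "https://example.com/data.bin"  # hypothetical URL and layout
offset, count = 1024, 100             # assumed byte offset and item count
with fsspec.open(url, "rb") as f:
    f.seek(offset)                         # fid.seek, as in the question
    raw = f.read(count * 4)                # count big-endian float32 values
    arr = np.frombuffer(raw, dtype=">f4")  # np.fromfile equivalent on bytes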
With fsspec master, the following dask test fails
$ pytest dask/dataframe/io/tests/test_orc.py::test_orc_with_backend
The pyarrow error message isn't really helpful. I've traced it down to AbstractBufferedFile.__fspath__. Removing that, the test passes.
diff --git a/fsspec/spec.py b/fsspec/spec.py
index 0b66f99..6567ad9 100644
--- a/fsspec/spec.py
+++ b/fsspec/spec.py
@@ -1162,10 +1162,6 @@ class AbstractBufferedFile(io.IOBase):
     def __str__(self):
         return "<File-like object %s, %s>" % (type(self.fs).__name__, self.path)
-
-    def __fspath__(self):
-        return self.fs.protocol + "://" + self.path
-
     __repr__ = __str__

     def __enter__(self):
Does __fspath__ make sense on this object? Is it actually present on the filesystem, or just in memory?
xref dask/dask#5267
I am trying to use fsspec.implementations.http.HTTPFileSystem to read a byte range from a remote url. Here's what I'm doing
from fsspec.implementations.http import HTTPFileSystem
path = 'https://data.nas.nasa.gov/ecco/download_data.php?file=/eccodata/llc_4320/compressed/0000010368/Theta.0000010368.data.shrunk'
fs = HTTPFileSystem()
f_obj = fs.open(path)
# ... later would come the reading commands
This hangs forever. I have turned on urllib3 debugging to see what's happening. I get the following log output
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (2): data.nas.nasa.gov:443
If I interrupt the running command, I get this stack trace
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~/anaconda/envs/pangeo/lib/python3.6/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
376 try: # Python 2.7, use buffering of HTTP responses
--> 377 httplib_response = conn.getresponse(buffering=True)
378 except TypeError: # Python 3
TypeError: getresponse() got an unexpected keyword argument 'buffering'
During handling of the above exception, another exception occurred:
KeyboardInterrupt Traceback (most recent call last)
<ipython-input-27-556e6da93e62> in <module>
----> 1 f_obj = fs.open(path)
2 f_obj
~/anaconda/envs/pangeo/lib/python3.6/site-packages/fsspec/spec.py in open(self, path, mode, block_size, **kwargs)
562 ac = kwargs.pop('autocommit', not self._intrans)
563 f = self._open(path, mode=mode, block_size=block_size,
--> 564 autocommit=ac, **kwargs)
565 if not ac:
566 self.transaction.files.append(f)
~/anaconda/envs/pangeo/lib/python3.6/site-packages/fsspec/implementations/http.py in _open(self, url, mode, block_size, **kwargs)
117 kw.pop('autocommit', None)
118 if block_size:
--> 119 return HTTPFile(url, self.session, block_size, **kw)
120 else:
121 kw['stream'] = True
~/anaconda/envs/pangeo/lib/python3.6/site-packages/fsspec/implementations/http.py in __init__(self, url, session, block_size, **kwargs)
165 try:
166 self.size = file_size(url, self.session, allow_redirects=True,
--> 167 **self.kwargs)
168 except (ValueError, requests.HTTPError):
169 # No size information - only allow read() and no seek()
~/anaconda/envs/pangeo/lib/python3.6/site-packages/fsspec/implementations/http.py in file_size(url, session, **kwargs)
370 head = kwargs.get('headers', {})
371 head['Accept-Encoding'] = 'identity'
--> 372 r = session.head(url, allow_redirects=ar, **kwargs)
373 r.raise_for_status()
374 if 'Content-Length' in r.headers:
~/anaconda/envs/pangeo/lib/python3.6/site-packages/requests/sessions.py in head(self, url, **kwargs)
566
567 kwargs.setdefault('allow_redirects', False)
--> 568 return self.request('HEAD', url, **kwargs)
569
570 def post(self, url, data=None, json=None, **kwargs):
~/anaconda/envs/pangeo/lib/python3.6/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
531 }
532 send_kwargs.update(settings)
--> 533 resp = self.send(prep, **send_kwargs)
534
535 return resp
~/anaconda/envs/pangeo/lib/python3.6/site-packages/requests/sessions.py in send(self, request, **kwargs)
644
645 # Send the request
--> 646 r = adapter.send(request, **kwargs)
647
648 # Total elapsed time of the request (approximately)
~/anaconda/envs/pangeo/lib/python3.6/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
447 decode_content=False,
448 retries=self.max_retries,
--> 449 timeout=timeout
450 )
451
~/anaconda/envs/pangeo/lib/python3.6/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
598 timeout=timeout_obj,
599 body=body, headers=headers,
--> 600 chunked=chunked)
601
602 # If we're going to release the connection in ``finally:``, then
~/anaconda/envs/pangeo/lib/python3.6/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
378 except TypeError: # Python 3
379 try:
--> 380 httplib_response = conn.getresponse()
381 except Exception as e:
382 # Remove the TypeError from the exception chain in Python 3;
~/anaconda/envs/pangeo/lib/python3.6/http/client.py in getresponse(self)
1329 try:
1330 try:
-> 1331 response.begin()
1332 except ConnectionError:
1333 self.close()
~/anaconda/envs/pangeo/lib/python3.6/http/client.py in begin(self)
295 # read until we get a non-100 response
296 while True:
--> 297 version, status, reason = self._read_status()
298 if status != CONTINUE:
299 break
~/anaconda/envs/pangeo/lib/python3.6/http/client.py in _read_status(self)
256
257 def _read_status(self):
--> 258 line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
259 if len(line) > _MAXLINE:
260 raise LineTooLong("status line")
~/anaconda/envs/pangeo/lib/python3.6/socket.py in readinto(self, b)
584 while True:
585 try:
--> 586 return self._sock.recv_into(b)
587 except timeout:
588 self._timeout_occurred = True
~/anaconda/envs/pangeo/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py in recv_into(self, *args, **kwargs)
295 def recv_into(self, *args, **kwargs):
296 try:
--> 297 return self.connection.recv_into(*args, **kwargs)
298 except OpenSSL.SSL.SysCallError as e:
299 if self.suppress_ragged_eofs and e.args == (-1, 'Unexpected EOF'):
~/anaconda/envs/pangeo/lib/python3.6/site-packages/OpenSSL/SSL.py in recv_into(self, buffer, nbytes, flags)
1819 result = _lib.SSL_peek(self._ssl, buf, nbytes)
1820 else:
-> 1821 result = _lib.SSL_read(self._ssl, buf, nbytes)
1822 self._raise_ssl_error(self._ssl, result)
1823
KeyboardInterrupt:
Everything works fine if I use requests.
import requests
r = requests.get(path, headers={"Range": "bytes=0-100"})
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): data.nas.nasa.gov:443
DEBUG:urllib3.connectionpool:https://data.nas.nasa.gov:443 "GET /ecco/download_data.php?file=/eccodata/llc_4320/compressed/0000010368/Theta.0000010368.data.shrunk HTTP/1.1" 206 101
example:
from fsspec.implementations.cached import CachingFileSystem
fs = CachingFileSystem(target_protocol='s3')
fs.open('s3://bucket/empty')
{'fn': '01c8b40c487e3c79a7549715f6fff04a15e6b0017bcb83cb98d9278b0b4ed4ec', 'blocks': set()}
Traceback (most recent call last):
File "/fsspec_test.py", line 3, in <module>
fs.open('s3://bucket/empty')
File "/usr/local/miniconda3/envs//lib/python3.7/site-packages/fsspec/implementations/cached.py", line 157, in <lambda>
self, *args, **kw
File "/usr/local/miniconda3/envs//lib/python3.7/site-packages/fsspec/spec.py", line 669, in open
autocommit=ac, **kwargs)
File "/usr/local/miniconda3/envs//lib/python3.7/site-packages/fsspec/implementations/cached.py", line 157, in <lambda>
self, *args, **kw
File "/usr/local/miniconda3/envs//lib/python3.7/site-packages/fsspec/implementations/cached.py", line 132, in _open
fn, blocks)
File "/usr/local/miniconda3/envs//lib/python3.7/site-packages/fsspec/core.py", line 397, in __init__
self.cache = self._makefile()
File "/usr/local/miniconda3/envs//lib/python3.7/site-packages/fsspec/core.py", line 409, in _makefile
fd.seek(self.size - 1)
OSError: [Errno 22] Invalid argument
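The failure is consistent with _makefile seeking to size - 1 to create the local sparse file, which is invalid when the remote file is empty. A hypothetical reconstruction of the failing step:

import tempfile

size = 0  # an empty remote file
with tempfile.NamedTemporaryFile() as fd:
    if size > 0:  # guard needed: seek(-1) on an empty file raises OSError
        fd.seek(size - 1)
        fd.write(b"\0")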
According to this: https://github.com/intake/filesystem_spec/blob/master/fsspec/spec.py#L420, glob should pass extra kwargs to the ls function.
I tried using the detail=True flag, but got an exception:
import fsspec
fs = fsspec.get_filesystem_class('s3')()
fs.glob('s3://bucket/*', detail=True)
File "fsspec_test.py", line 3, in <module>
fs.glob('s3://bucket/*', detail=True)
File "./lib/python3.7/site-packages/fsspec/spec.py", line 444, in glob
allpaths = self.find(root, maxdepth=depth, **kwargs)
File "./lib/python3.7/site-packages/fsspec/spec.py", line 373, in find
for path, _, files in self.walk(path, maxdepth, **kwargs):
TypeError: walk() got an unexpected keyword argument 'detail'
In #46, @martindurant added a size_policy keyword to HTTPFileSystem. In fsspec 0.2.1, it appears to be broken.
from fsspec.implementations.http import HTTPFileSystem
fs = HTTPFileSystem()
path = 'https://data.nas.nasa.gov/ecco/download_data.php?file=/eccodata/llc_2160/compressed/0000092160/Theta.0000092160.data.shrunk'
f = fs.open(path, size_policy='get')
raises
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-15-857d04d27e7c> in <module>()
2 fs = HTTPFileSystem()
3 path = 'https://data.nas.nasa.gov/ecco/download_data.php?file=/eccodata/llc_2160/compressed/0000092160/Theta.0000092160.data.shrunk'
----> 4 f = fs.open(path, size_policy='get')
~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/fsspec/spec.py in open(self, path, mode, block_size, **kwargs)
615 ac = kwargs.pop('autocommit', not self._intrans)
616 f = self._open(path, mode=mode, block_size=block_size,
--> 617 autocommit=ac, **kwargs)
618 if not ac:
619 self.transaction.files.append(f)
~/miniconda3/envs/geo_scipy/lib/python3.6/site-packages/fsspec/implementations/http.py in _open(self, url, mode, block_size, **kwargs)
131 if block_size:
132 return HTTPFile(self, url, self.session, block_size,
--> 133 size_policy=self.size_policy, **kw)
134 else:
135 kw['stream'] = True
TypeError: type object got multiple values for keyword argument 'size_policy'
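The traceback shows size_policy arriving both in **kw and as an explicit keyword. A sketch of the likely shape of the fix inside HTTPFileSystem._open (an assumption, not the merged patch):

# method fragment of HTTPFileSystem, sketch only
def _open(self, url, mode="rb", block_size=None, **kwargs):
    kw = self.kwargs.copy()
    kw.update(kwargs)
    kw.pop("autocommit", None)
    # pop size_policy out of the merged kwargs so it is passed only once
    size_policy = kw.pop("size_policy", self.size_policy)
    if block_size:
        return HTTPFile(self, url, self.session, block_size,
                        size_policy=size_policy, **kw)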
Like open/open_files, which find the file-system of interest and use it with parameters, we could implement a generic top-level file-system which finds the correct instance and uses it, depending on the protocol implicit in the given path. This would become the primary user-facing API.
This could also be the place to implement globs or recursive operations where multiple files make sense.
Should allow copy/move/(sync?) between file-systems.
We have an obscure package called xmitgcm which provides read-only access to the binary files generated by the mitgcm ocean model in xarray format. The raw data is usually all stored in a single directory, and is a combination of binary data files (possibly large) and text metadata files (always very small).
Currently, the package works only with files stored on disk. Interaction with the filesystem is basically limited to the following operations:
- os.listdir, to get the files in a particular directory, plus pattern matching (we currently use glob, but it's not necessary)
- open, to read and parse metadata stored in text files
- os.path.getsize, to examine the size of binary objects
- file.seek, to point to a specific byte range within the file
- np.fromfile, to load binary data (optionally wrapped in a dask delayed call to make reading lazy); sometimes invoked with the count argument to read only a specific byte range

We would like to refactor this package to use a more generic idea of filesystems, with the goal of being able to upload the data "as is" directly into cloud storage buckets and have everything work the same.
Based on my limited understanding of cloud storage, this should be doable. The directory listing and text file reading is trivial. The reading of contiguous byte ranges from binary files should also be possible with range requests.
So the only question is, how do we implement this? We could create our own filesystem abstraction and then implement an on-disk class, an s3 class, a gcs class, etc. But I feel that this would be a waste of time, since these things are already done by s3fs, gcsfs, etc.
So I am here seeking a recommendation about the best way to approach this problem.
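For what it's worth, each operation listed above has a close fsspec analogue; a sketch under the assumption that the data lives at a hypothetical s3://bucket/run prefix:

import numpy as np
import fsspec

fs = fsspec.filesystem("s3")              # or "file", "gcs", ...
metas = fs.glob("bucket/run/*.meta")      # os.listdir plus pattern matching
with fs.open(metas[0], "r") as f:         # open, for the text metadata
    meta = f.read()
data = metas[0].replace(".meta", ".data")
size = fs.size(data)                      # os.path.getsize
with fs.open(data, "rb") as f:
    f.seek(1024)                             # file.seek to a byte range
    arr = np.frombuffer(f.read(400), ">f4")  # np.fromfile on raw bytes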
It would be nice to have an alias that only opens a single file, e.g., fsspec.open_file or even fsspec.open. Right now, idiomatic usage of fsspec for a single file looks pretty awkward, with the need for indexing like [0], e.g.,
with fsspec.open_files(path)[0] as f:
    f.write(...)
Currently the default implementation of glob is handwritten for this package, and relies on find and some custom regexes (which have given problems in the past/currently). An alternative implementation may be similar to what we used to have in dask, which is based on the implementation in CPython, passes all their tests, and relies on a small set of standard filesystem methods.
Is there a reason to prefer the implementation in this repo over the one generalized from CPython we used to have in dask? Would it be worth investigating using the other version?
FSMap.__init__ should probably strip the protocol / prefix of the root, so that S3Map('s3://mybucket/file') is treated the same as S3Map('mybucket/file').
In [15]: s3 = s3fs.S3FileSystem()
In [16]: map1 = s3fs.S3Map('s3://a/b', s3)
In [17]: map2 = s3fs.S3Map('a/b', s3)
In [18]: assert map1.root == map2.root
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-18-f02e96dd7165> in <module>
----> 1 assert map1.root == map2.root
AssertionError:
This should also address what to do about trailing slashes (/).
This would prevent issues like dask/dask#5298, where a user had gcsfs 0.2.3, while 0.3.0 is required to work with fsspec.
@martindurant This is a trial-balloon issue for this idea, mostly to see if fsspec is the "right" place for this feature and/or if I've missed an existing implementation. If so, I've a working proof-of-concept built off an older version of dask that could serve as a starting-point PR.
fsspec is a great module for new development or projects built on the existing dask ecosystem, and enables an amazing S3-is-my-filesystem paradigm. However, the vast majority of projects make use of Python's built-in file operations, and are tightly coupled to a local filesystem. This causes major development friction when integrating existing third-party libraries into a project, as one almost invariably needs to work out an integration-specific flow between local files and remote storage.
One can work around this problem with FUSE-based mounts, but this complicates deployment and containerization. Alternatively, any of several filesystem abstractions (fsspec, smart_open, pyfilesystem2, et al.) could be used, but updating a third-party component to an alternative filesystem interface is a painful and risky development lift. One either needs to maintain a private fork or open a massive and risky PR.
I've found that a solid majority of filesystem use cases are covered by a relatively small set of operations, all of which are already covered by fsspec. By providing strictly-compatible shims for a small set of the stdlib (e.g. open, os.path.exists, os.remove, glob.glob, shutil.rmtree, et al.) and then swapping these via import-level changes, one can quickly teach most libraries to seamlessly interact with all the file systems supported by fsspec.
This shim layer would mandate strict adherence to standard library semantics for local file operations, likely by directly forwarding all local paths into the standard library and forwarding non-local paths through fsspec-based implementations. The explicit goal would be to enable a majority of basic use cases, deferring to fsspec interfaces for more robust integration and/or specialized use cases. This would turn fsspec into a massively useful layer for updating existing systems to cloud-compatible storage, as updating a library to support s3 and gcsfs would be as simple as:
try:
    # hypothetical shim modules, per the proposal above
    from fsspec.stdlib import open
    import fsspec.stdlib.os.path as ospath
    import fsspec.stdlib.shutil as shutil
except ImportError:
    import os.path as ospath
    import shutil
Would like some general method by which files can be written to a temporary location, probably on the same file-system, and then all committed together later, when all writing has completed; or aborting the whole process and leaving the target location unchanged. This makes the writing process (almost) atomic.
Would ideally cover not just single files, but groups such as "all files below path/" or "all writes until a commit is requested". Of course, the default would still be to auto-commit, i.e., to write to the real final location as soon as possible.
In a FS like HDFS, this would be done by mv-ing the files from the temporary location. In a FS like S3, any upload larger than the smallest block size uses a blocked upload, which needs finalising anyway, and can be aborted (whereas there is no move, only copy-and-delete). Exactly how an implementation achieves temporary writes is not to be defined here.
Importantly, the concept should be portable across instances of the file-system in different processes, so that a batch of files can be written in parallel, e.g., with Dask, and committed together. Perhaps each instance notes only the files that it writes, but then all must still be around at commit time. I do not have a good idea of how that might be achieved, suggestions welcome.
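A sketch of what the user-facing side might look like (hypothetical API, in the spirit of the autocommit flag mentioned above):

import fsspec

fs = fsspec.filesystem("memory")  # stand-in for any FS with deferred commit
with fs.transaction:  # assumption: a filesystem-level transaction context
    with fs.open("/staging/part-0.csv", "wb") as f:
        f.write(b"a,b\n1,2\n")
    with fs.open("/staging/part-1.csv", "wb") as f:
        f.write(b"a,b\n3,4\n")
# both files become visible together on a clean exit;
# an exception inside the block would discard them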
Aside from the sibling projects (s3fs, gcsfs...) that this project is meant to unite, there are many other file-system-like things that would be easy to implement. Those things could live directly within this repo, as local, http and memory do, but would make setting up testing increasingly complicated.
Some are already implemented in pyfilesystem2 - can we wrap those or otherwise reuse the code?
Hi!
I am one of the maintainers of stor, a filesystem library that attempts to implement a cross-compatible API for POSIX, OpenStack Swift, S3, and DNAnexus paths. @kyleabeauchamp pointed out your spec and it seemed pretty interesting :). I am considering providing an alternate version of stor.Path that implements the filesystem_spec API. I know I'm combining a couple of things here, but I didn't see a mailing list, so I figured I'd ask 'em together.
Is subclassing AbstractFileSystem the intended way to implement the spec? And what does maxdepth mean in the context of S3 or OpenStack Swift? Is it referring to directories delimited by /? Is there a particular ordering that's required? (For example, is it necessary to guarantee that you fully list each "directory" before emitting files in another directory?)
Aside - as a general comment, I wonder if the API would be better served by having all methods work as iterators and/or have all methods support some kind of limit argument. We originally implemented stor to have most file-listing methods return lists; in hindsight, this was a mistake - it's much too easy to accidentally write code that tries to list millions of objects, and thus gets really slow or unintentionally expensive.
Interested to test out the system more! :)
Jeff
Reuse https://github.com/dask/dask/pull/3335/files#diff-a8ff31fbc7d5c0e42fc6bafd7f7b9197R271 in functions that take paths to also allow Path-like objects and fspath protocol things. This would be in functions like open_files.
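A minimal sketch of the kind of helper being referenced (assuming os.fspath semantics; the name stringify_path follows the linked dask PR):

import os

def stringify_path(filepath):
    """Return a plain string for str, pathlib.Path, or any object
    implementing the PEP-519 __fspath__ protocol."""
    if isinstance(filepath, str):
        return filepath
    if hasattr(filepath, "__fspath__"):
        return os.fspath(filepath)
    return filepath  # pass anything else through unchanged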
We could create a file-system type which makes local copies of any file that is opened for reading on a given other file-system: download-on-first-access. That would allow local, and therefore fast, access to some parts of a much bigger data-set. This is probably not much work, but we would need to decide how and where to store the files.
The MMap file-cache also demonstrates only downloading those blocks of a file which are accessed into a sparse file. That could also be implemented for absolute minimal storage.
@rabernat, this is the kind of thing you had in mind; it is an interesting idea, and probably not hard to implement.
I'm trying to debug gcsfs not working in Dask, tracing it back to fsspec via known_implementations. It says there is an error and that gcsfs needs to be installed (which it is):
>>> import gcsfs
>>> from fsspec.registry import known_implementations
>>> known_implementations
{'file': {'class': 'fsspec.implementations.local.LocalFileSystem'}, 'memory': {'class': 'fsspec.implementations.memory.MemoryFileSystem'}, 'http': {'class': 'fsspec.implementations.http.HTTPFileSystem', 'err': 'HTTPFileSystem requires "requests" to be installed'}, 'https': {'class': 'fsspec.implementations.http.HTTPFileSystem', 'err': 'HTTPFileSystem requires "requests" to be installed'}, 'zip': {'class': 'fsspec.implementations.zip.ZipFileSystem'}, 'gcs': {'class': 'gcsfs.GCSFileSystem', 'err': 'Please install gcsfs to access Google Storage'}, 'gs': {'class': 'gcsfs.GCSFileSystem', 'err': 'Please install gcsfs to access Google Storage'}, 'sftp': {'class': 'fsspec.implementations.sftp.SFTPFileSystem', 'err': 'SFTPFileSystem requires "paramiko" to be installed'}, 'ssh': {'class': 'fsspec.implementations.sftp.SFTPFileSystem', 'err': 'SFTPFileSystem requires "paramiko" to be installed'}, 'ftp': {'class': 'fsspec.implementations.ftp.FTPFileSystem'}, 'hdfs': {'class': 'fsspec.implementations.hdfs.PyArrowHDFS', 'err': 'pyarrow and local java libraries required for HDFS'}, 'webhdfs': {'class': 'fsspec.implementations.webhdfs.WebHDFS', 'err': 'webHDFS access requires "requests" to be installed'}, 's3': {'class': 's3fs.S3FileSystem', 'err': 'Install s3fs to access S3'}, 'cached': {'class': 'fsspec.implementations.cached.CachingFileSystem'}, 'dask': {'class': 'fsspec.implementations.dask.DaskWorkerFileSystem', 'err': 'Install dask distributed to access worker file system'}}
>>>
This is fsspec 0.4.1 and gcsfs 0.3.0. Should known_implementations be listing it without an err key if it is working correctly?
PEP-519 added path-like support, letting arbitrary objects be treated as paths. We should support that and return full URLs, including protocols.
Most of us university folks have essentially unlimited free storage in google drive. It would be great to be able to work with this storage space in a cloud-native way.
The google drive SDK includes a python API: https://developers.google.com/drive/api/v3/about-sdk
In principle, it should be straightforward to create a google-drive implementation for fsspec.
I believe this would open up a whole world of cool possibilities. In particular, it would lower the bar for putting data on the "cloud." More people are comfortable with google drive than they are with S3 or GCS.
xref dask/dask#5039
Could you please tag 0.3.6 in the git repo?
How about pushing a new release to PyPI and/or conda-forge with the fixes in the hdfs branch?