liormizr / s3path

s3path is a pathlib extension for AWS S3 Service

License: Apache License 2.0

Python 99.75% Makefile 0.25%
amazon-s3 amazon-web-services aws-s3 boto3 python python3 s3-sdk

s3path's People

Contributors

alexander-serov, chnpenny, codycrossley, darkheir, dependabot[bot], derphysiker, evetion, fhoecker, gabrieldemarmiesse, isidentical, liormizr, maresb, nlangellier, ronysh, sbrandtb, schneidemar, szabgab, techalchemy

s3path's Issues

AccessDenied when closing a file handle created by S3Path open('w') with server-side encryption enforced on the bucket

I have server-side encryption enabled for an S3 bucket. I followed the instructions and registered it in the configuration using the following:

register_configuration_parameter(
    S3Path(f"/{bucket_name}"),
    parameters={
        'ServerSideEncryption': 'AES256'
    })

However, when I try to write to and close a file handle opened for an object, like:

s3obj = S3Path('/BUCKET-EXAMPLE', 'test')
s3_file_h = s3obj.open('w')
s3_file_h.write('test')
s3_file_h.close()

When I close the file, I immediately get an error like the following, taken from botocore's response after smart_open tries put_object:

{'Error': {'Code': 'AccessDenied', 'Message': 'Access Denied'},
 'ResponseMetadata': {'RequestId': '2S21JF4JBMW4QH29',
                      'HostId': 'RVPqYCunTaf4Aq1k8cAjePRQgvC4pgHSd+kqBKJPpU+C3Y8P7Gv5tTo36Qmu10CMA4r9eWch8Mw=',
                      'HTTPStatusCode': 403,
                      'HTTPHeaders': {'x-amz-request-id': '2S21JF4JBMW4QH29',
                                      'x-amz-id-2': 'RVPqYCunTaf4Aq1k8cAjePRQgvC4pgHSd+kqBKJPpU+C3Y8P7Gv5tTo36Qmu10CMA4r9eWch8Mw=',
                                      'content-type': 'application/xml',
                                      'transfer-encoding': 'chunked',
                                      'date': 'Thu, 03 Feb 2022 01:19:02 GMT',
                                      'server': 'AmazonS3',
                                      'connection': 'close'},
                      'RetryAttempts': 0}}

This is the bucket policy for reference:

{
    "Version": "2012-10-17",
    "Id": "PutObjPolicy",
    "Statement": [
        {
            "Sid": "DenyIncorrectEncryptionHeader",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::BUCKET-EXAMPLE/*",
            "Condition": {
                "StringNotEquals": {
                    "s3:x-amz-server-side-encryption": "AES256"
                }
            }
        },
        {
            "Sid": "DenyUnEncryptedObjectUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::BUCKET-EXAMPLE/*",
            "Condition": {
                "Null": {
                    "s3:x-amz-server-side-encryption": "true"
                }
            }
        }
    ]
}

Very odd, as this code worked a few days earlier, but perhaps I'm missing something. No data is written to the file before the close, which may explain the issue. I can probably come up with a simple workaround; I'm just realizing this now as I write this out.

iterdir returns an S3Path for empty directories

When calling iterdir() on an empty S3Path object, a single path is still returned. This differs from the pathlib.Path behavior where iterdir() on an empty Path is an empty generator.

Example:

from s3path import S3Path
from pathlib import Path

s3p = S3Path("/path/to/empty/s3dir/")
p = Path("/path/to/empty/dir")

for child in s3p.iterdir():
    print(child)  # prints /path/to/empty/s3dir/

for child in p.iterdir():
    print(child)  # never prints

How to copy a large file from local or S3 path to local or S3 path

I recently wanted a function which can copy a large file from a local or S3 path to a local or S3 path. I could find no code for it. Here is what I implemented, in case anyone finds it useful or wants to improve upon it:

import pathlib
import shutil
from typing import Union

import boto3
import s3path

AnyPath = Union[pathlib.Path, s3path.S3Path]


def cp_file(input_path: AnyPath, output_path: AnyPath) -> None:
    """Copy file from local or S3 path to local or S3 path."""
    if (not isinstance(input_path, s3path.S3Path)) and (not isinstance(output_path, s3path.S3Path)):
        shutil.copy(input_path, output_path)
    elif isinstance(input_path, s3path.S3Path) and isinstance(output_path, s3path.S3Path):
        # Note: boto3.client('s3').copy_object works only for objects up to 5G in size.
        boto3.resource("s3").meta.client.copy(
            CopySource={"Bucket": input_path.bucket.name, "Key": str(input_path.key)},
            Bucket=output_path.bucket.name,
            Key=str(output_path.key),
        )
    elif (not isinstance(input_path, s3path.S3Path)) and isinstance(output_path, s3path.S3Path):
        boto3.resource("s3").meta.client.upload_file(
            Filename=str(input_path),
            Bucket=output_path.bucket.name,
            Key=str(output_path.key),
        )
    elif isinstance(input_path, s3path.S3Path) and (not isinstance(output_path, s3path.S3Path)):
        boto3.resource("s3").meta.client.download_file(
            Bucket=input_path.bucket.name,
            Key=str(input_path.key),
            Filename=str(output_path),
        )
    else:
        assert False

Note that the upload and download code in comparison.rst doesn't really help me because I never want the entire file to be read into memory all at once.

Glob-related error appeared on 0.3.3

0.3.2 was working fine until I upgraded to 0.3.3, and then the following error started to happen. I don't know why or where the problem is, but going back to 0.3.2 fixes it.

  File "/home/sagemaker-user/Kroft/my_avm/avm/avm.py", line 1063, in load
    available=[f for f in list(resolved_path.glob(filename)) if '.dvc' not in str(f)]
  File "/home/sagemaker-user/.local/lib/python3.9/site-packages/s3path.py", line 706, in glob
    yield from super().glob(pattern)
  File "/home/sagemaker-user/.pyenv/versions/3.9.9/lib/python3.9/pathlib.py", line 1177, in glob
    for p in selector.select_from(self):
  File "/home/sagemaker-user/.pyenv/versions/3.9.9/lib/python3.9/pathlib.py", line 523, in select_from
    if not is_dir(parent_path):
  File "/home/sagemaker-user/.local/lib/python3.9/site-packages/s3path.py", line 678, in is_dir
    return self._accessor.is_dir(self)
  File "/home/sagemaker-user/.local/lib/python3.9/site-packages/s3path.py", line 169, in is_dir
    resource, _ = self.configuration_map.get_configuration(path)
  File "/home/sagemaker-user/.local/lib/python3.9/site-packages/s3path.py", line 90, in get_configuration
    if resources is None and path in self.resources:
TypeError: argument of type 'NoneType' is not iterable

The filename variable contains something like *.pkl*. I don't think this is relevant, but this code is a static method of a class and it was running twice, in two parallel threads, where the first thread succeeded and the second failed with the above error.

Add unlink support

S3 / boto support unlink, so the API here is a bit out of alignment given that it lacks this functionality.
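
For reference, here is a minimal user-side sketch of what an unlink-style operation could look like today with plain boto3 (the helper name and paths are made up; this is a workaround, not the library's API):

import boto3
from s3path import S3Path

def s3_unlink(path: S3Path) -> None:
    # Hypothetical helper: delete the object behind an S3Path,
    # mirroring pathlib.Path.unlink() semantics.
    boto3.resource('s3').meta.client.delete_object(
        Bucket=path.bucket.name,  # bucket name as a plain string
        Key=str(path.key),        # key as a plain string
    )

s3_unlink(S3Path('/bucket-name/some/key.txt'))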

glob is broken

The following code doesn't work:

import s3path

s3=s3path.S3Path.from_uri('s3://bucket-name/folder1/folder2')

list(s3.glob('*pkl'))

Last part of error:

~/.local/lib/python3.10/site-packages/s3path.py in is_dir(self)
    664         if self.bucket and not self.key:
    665             return True
--> 666         return self._accessor.is_dir(self)
    667 
    668     def is_file(self):

AttributeError: '_NormalAccessor' object has no attribute 'is_dir'

Tested in s3path 0.3.2. I can't use 0.3.3 because it has other glob-thread-related bugs.

Streaming from file doesn't work as expected

I've been running into some issues when trying to work with file objects from the .open() method. I can only reproduce it with certain files/objects, like a pickled numpy array:

from s3path import S3Path
import pickle
import numpy as np

x = np.random.rand(3, 3)

path = S3Path("...")

with path.open("wb") as f:
    pickle.dump(x, f)
    
with path.open("rb") as f:
    y = pickle.load(f)
    
assert (x == y).all()

This fails with EOFError: Ran out of input when trying to load the array.

The script works fine if you replace: y = pickle.load(f) with y = pickle.loads(f.read()), but this isn't always practical if you want to stream a large file that won't fit into memory.

`Path.rename(S3Path)` doesn't work

Crossing type boundaries to easily "upload" a file doesn't seem to work. Is there a better way to do this?

pathlib's Path just runs its _NormalAccessor, which uses os.rename, it seems; it doesn't really check the target type in Path.rename(target).
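
For comparison, this is the explicit workaround I ended up using instead of crossing types with rename() (a sketch assuming S3Path.write_bytes and Path.read_bytes, which appear in other issues here; the paths are examples):

from pathlib import Path
from s3path import S3Path

local = Path('/tmp/report.csv')             # hypothetical local file
remote = S3Path('/bucket-name/report.csv')  # hypothetical destination

# Explicit copy instead of local.rename(remote), which falls through to os.rename.
remote.write_bytes(local.read_bytes())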

Support `endpoint_url` for localstack

We test AWS locally using LocalStack.

This means we instantiate the boto3 resource with endpoint_url set to 'http://localhost:4566', e.g.

boto3.resource("s3", region_name=region, endpoint_url="http://localhost:4566")

boto3.setup_default_session does not support endpoint_url.

The workaround we are using is:

    # pylint: disable=protected-access
    s3path._s3_accessor = s3path._S3Accessor(endpoint_url="http://localhost:4566")

This seems to work. It would be nice for it to be officially supported.

Writing large file stuck in retry

>>> p = S3Path("/foo/bar")

This works

>>> with p.open('w') as fp:
...   json.dump([None]*10,fp)


This gets stuck for a long time

>>> with p.open('w') as fp:
...   json.dump([None]*5000,fp)


This works well also

>>> p.write_text(json.dumps([None]*5000))

When it is stuck, I've seen a lot of log messages like this.

DEBUG:urllib3.util.retry:Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
DEBUG:botocore.awsrequest:Waiting for 100 Continue response.
DEBUG:botocore.awsrequest:100 Continue response seen, now sending request body.

This looks like an infinite loop, but after a long time it completes successfully.

glob is inefficient as it iterates over a directory that was already scanned

I tried the glob method and found it is too slow when there are millions of files in the directory.

It turns out that glob first calls the list_objects_v2 API to get all objects (every single key, including folders and files), checks each one to see whether it is a folder, and then scans the folders.

The algorithm is correct on a traditional filesystem but inefficient on S3: list_objects_v2 already returns every object, so iterating over subfolders is unnecessary.

Is it possible to fix this in s3path, or can it only be fixed in pathlib?
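
For comparison, here is a rough sketch of the flat, single-listing approach I have in mind, using plain boto3 plus fnmatch (the bucket, prefix, and pattern below are examples):

import fnmatch
import boto3

def flat_glob(bucket: str, prefix: str, pattern: str):
    # One paginated list_objects_v2 pass; match keys directly instead of
    # recursing into pseudo-folders that S3 has already enumerated.
    paginator = boto3.client('s3').get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            if fnmatch.fnmatch(obj['Key'], pattern):
                yield obj['Key']

for key in flat_glob('my-bucket', 'folder1/', 'folder1/*.pkl'):
    print(key)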

Overflow on read, even for small chunk size

While trying to read a 3GB file with s3path I get an OverflowError, even when I read it in small chunks. Please see the accompanying PR for a solution.

~/myscript.py in compute_hash(file_handle=<s3path.S3KeyReadableFileObject object>, file_size=3030219641, chunk_size=131072)
    223         disable=DISABLE_TQDM,
    224     ) as pbar:
--> 225         while chunk := file_handle.read(size=chunk_size):
        chunk = undefined
        file_handle.read = <bound method S3KeyReadableFileObject.read of <s3path.S3KeyReadableFileObject object at 0x7ff84da51460>>
        global size = undefined
        chunk_size = 131072
    226             file_hash.update(chunk)
    227             pbar.update(chunk_size)

/opt/conda/lib/python3.8/site-packages/s3path.py in wrapper(self=<s3path.S3KeyReadableFileObject object>, *args=(), **kwargs={'size': 131072})
    858             if not self.readable():
    859                 raise UnsupportedOperation('not readable')
--> 860             return method(self, *args, **kwargs)
        global method = undefined
        self = <s3path.S3KeyReadableFileObject object at 0x7ff84da51460>
        args = ()
        kwargs = {'size': 131072}
    861         return wrapper
    862 

/opt/conda/lib/python3.8/site-packages/s3path.py in read(self=<s3path.S3KeyReadableFileObject object>, *args=(), **kwargs={'size': 131072})
    874     @readable_check
    875     def read(self, *args, **kwargs):
--> 876         return self._string_parser(self._streaming_body.read())
        self._string_parser = functools.partial(<function _string_parser at 0x7ff84e260f70>, mode='rb', encoding=None)
        self._streaming_body.read = <bound method StreamingBody.read of <botocore.response.StreamingBody object at 0x7ff84c567c70>>
    877 
    878     @readable_check

/opt/conda/lib/python3.8/site-packages/botocore/response.py in read(self=<botocore.response.StreamingBody object>, amt=None)
     75         """
     76         try:
---> 77             chunk = self._raw_stream.read(amt)
        chunk = undefined
        self._raw_stream.read = <bound method HTTPResponse.read of <urllib3.response.HTTPResponse object at 0x7ff84c567910>>
        amt = None
     78         except URLLib3ReadTimeoutError as e:
     79             # TODO: the url will be None as urllib3 isn't setting it yet

/opt/conda/lib/python3.8/site-packages/urllib3/response.py in read(self=<urllib3.response.HTTPResponse object>, amt=None, decode_content=False, cache_content=False)
    512             if amt is None:
    513                 # cStringIO doesn't like amt=None
--> 514                 data = self._fp.read() if not fp_closed else b""
        data = undefined
        self._fp.read = <bound method HTTPResponse.read of <http.client.HTTPResponse object at 0x7ff84da50400>>
        fp_closed = False
    515                 flush_decoder = True
    516             else:

/opt/conda/lib/python3.8/http/client.py in read(self=<http.client.HTTPResponse object>, amt=None)
    469             else:
    470                 try:
--> 471                     s = self._safe_read(self.length)
        s = undefined
        self._safe_read = <bound method HTTPResponse._safe_read of <http.client.HTTPResponse object at 0x7ff84da50400>>
        self.length = 3030219641
    472                 except IncompleteRead:
    473                     self._close_conn()

/opt/conda/lib/python3.8/http/client.py in _safe_read(self=<http.client.HTTPResponse object>, amt=3030219641)
    610         IncompleteRead exception can be used to detect the problem.
    611         """
--> 612         data = self.fp.read(amt)
        data = undefined
        self.fp.read = undefined
        amt = 3030219641
    613         if len(data) < amt:
    614             raise IncompleteRead(data, amt-len(data))

/opt/conda/lib/python3.8/socket.py in readinto(self=<socket.SocketIO object>, b=<memory>)
    667         while True:
    668             try:
--> 669                 return self._sock.recv_into(b)
        self._sock.recv_into = undefined
        b = <memory at 0x7ff85070cdc0>
    670             except timeout:
    671                 self._timeout_occurred = True

/opt/conda/lib/python3.8/ssl.py in recv_into(self=<ssl.SSLSocket [closed] fd=-1, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6>, buffer=<memory>, nbytes=3030212608, flags=0)
   1239                   "non-zero flags not allowed in calls to recv_into() on %s" %
   1240                   self.__class__)
-> 1241             return self.read(nbytes, buffer)
        self.read = <bound method SSLSocket.read of <ssl.SSLSocket [closed] fd=-1, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6>>
        nbytes = 3030212608
        buffer = <memory at 0x7ff85070cdc0>
   1242         else:
   1243             return super().recv_into(buffer, nbytes, flags)

/opt/conda/lib/python3.8/ssl.py in read(self=<ssl.SSLSocket [closed] fd=-1, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6>, len=3030212608, buffer=<memory>)
   1097         try:
   1098             if buffer is not None:
-> 1099                 return self._sslobj.read(len, buffer)
        self._sslobj.read = undefined
        len = 3030212608
        buffer = <memory at 0x7ff85070cdc0>
   1100             else:
   1101                 return self._sslobj.read(len)

OverflowError: signed integer is greater than maximum

S3Path should keep newline characters in readline and readlines

As written in the comments of PR #50, the readline and readlines methods should return the newline character, if present, to be compliant with the file interface:

f.readline() reads a single line from the file; a newline character (\n) is left at the end of the string, and is only omitted on the last line of the file if the file doesn’t end in a newline.

See https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects

This was a limitation of the botocore iter_lines method in the StreamingBody class, but a PR allowing the method to return the newline characters has been merged into that project: boto/botocore#2235

It should now be possible to respect the file interface on this project.
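
As a tiny illustration of the expected contract, using the built-in open for reference (the file name is just an example):

from pathlib import Path

Path('/tmp/example.txt').write_text('first\nsecond\n')

with open('/tmp/example.txt') as f:
    assert f.readline() == 'first\n'      # newline is kept
    assert f.readlines() == ['second\n']  # newline is kept on each remaining line

# S3Path('/bucket/example.txt').open('r') should behave the same way.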

Doesn't work with Moto s3 mocking library

Love the idea of this, and it works great, except that it isn't working with the moto s3 test framework: I get 403 Access Denied errors.

I can use a tool like smart_open to seamlessly open mocked s3 URLs, e.g.:

from moto import mock_s3

@pytest.fixture(scope='session', autouse=True)
def aws_credentials():
    """Mocked AWS Credentials for moto."""
    os.environ['AWS_ACCESS_KEY_ID'] = 'testing'
    os.environ['AWS_SECRET_ACCESS_KEY'] = 'testing'
    os.environ['AWS_SECURITY_TOKEN'] = 'testing'
    os.environ['AWS_SESSION_TOKEN'] = 'testing'


@pytest.fixture(scope='session', autouse=True)
def s3(aws_credentials):
    with mock_s3():
        yield boto3.client('s3', region_name='us-east-1')


DEFAULT_BUCKET = "gfb-test"


@pytest.fixture(scope="module")
def s3_bucket(s3):
    s3.create_bucket(Bucket=DEFAULT_BUCKET)
    yield DEFAULT_BUCKET

@pytest.fixture(scope='module')
def s3_fille(s3_bucket):
    from smart_open import open
    in_path = Path('/tmp/foo')
    in_path.open('w').write('data')
    dest_path = S3Path(f"/{s3_bucket}") / str(in_path.name)
    with open(dest_path.as_uri(), 'wb') as dest:
        dest.write(in_path.read_bytes())
    return dest_path.as_uri()

but if I try to use any S3Path methods I get a 403:

@pytest.fixture(scope='module')
def s3_fille(s3_bucket):
    from smart_open import open
    in_path = Path('/tmp/foo')
    in_path.open('w').write('data')
    dest_path = S3Path(f"/{s3_bucket}") / str(in_path.name)
    dest_path.write_bytes(in_path.read_bytes())  # error here
    return dest_path.as_uri()

or even

dest_path.stat()

For smart_open, or when just using boto3, the botocore.BaseClient._make_api_request request_dict looks like this:

{
'url_path': '/gfb-test/BioWordVec_PubMed_MIMICIII_d200-500.vec.bin?uploads',
'query_string': {},
'method': 'POST', 'headers': {'User-Agent': 'Boto3/1.15.3 Python/3.8.2 Darwin/19.6.0 Botocore/1.18.3 Resource'},
'body': b'', 
'url': 'https://s3.amazonaws.com/gfb-test/BioWordVec_PubMed_MIMICIII_d200-500.vec.bin?uploads',
'context': {'client_region': 'us-east-1', 'client_config': <botocore.config.Config object at 0x119d99a60>, 'has_streaming_input': False, 'auth_type': None, 'signing': {'bucket': 'gfb-test'}, 'timestamp': '20201016T151222Z'}
}

whereas with S3Path the request_dict looks like this:

{'url_path': '/gfb-test/BioWordVec_PubMed_MIMICIII_d200-500.vec.bin', 'query_string': {'uploadId': 'cQIaRS6ArGme8K0aEt92Ec72OfNczjfxKmoqm3yVf6l31U558WPtrbYpQ', 'partNumber': 1}, 
'method': 'PUT', 
'headers': {'User-Agent': 'Boto3/1.15.3 Python/3.8.2 Darwin/19.6.0 Botocore/1.18.3 Resource', 'Content-MD5': 'GgMsXJfWEPiCUayMTflCIw==', 'Expect': '100-continue'}, 'body': <_io.BytesIO object at 0x11a4846d0>, 
'url': 'https://s3.amazonaws.com/gfb-test/BioWordVec_PubMed_MIMICIII_d200-500.vec.bin?uploadId=cQIaRS6ArGme8K0aEt92Ec72OfNczjfxKmoqm3yVf6l31U558WPtrbYpQ&partNumber=1', 
'context': {'client_region': 'us-east-1', 'client_config': <botocore.config.Config object at 0x119d99a60>, 'has_streaming_input': True, 'auth_type': None, 'signing': {'bucket': 'gfb-test'}, 'timestamp': '20201016T151832Z'}
}

I thought there might be a smoking gun but I don't really see it /shrug.

conda-forge recipe requirements

Hi. I recently ran into NotImplementedError: cannot instantiate 'S3Path' on your system when installing from conda-forge. The docs indicate that this may be because boto3 isn't installed, but I had it, so I struggled a bit; I think it's actually due to the new smart-open requirement.

I see that both boto3 and smart-open are in setup.py so I think it'd be nice if they were included in the conda-forge recipe as well. I can make a PR there if you like.

"compression" parameter is passed to smart_open.open() with incorrect version check

S3Path passes two different sets of kwargs to smart_open.open(), depending on whether the smart_open version is >= 5.0.0. This throws an error when smart_open 5.0.0 is installed, because the compression parameter was only introduced in smart_open 5.1.0.

TypeError: open() got an unexpected keyword argument 'compression'

https://github.com/RaRe-Technologies/smart_open/releases/tag/v5.1.0

Manually installing smart_open>=5.1.0 fixes this.

Issue in S3KeyReadableFileObject line iteration

A few issues regarding line handling in S3KeyReadableFileObject:

  • Every time readline is called, a new generator is created from _streaming_body.iter_lines
  • The __next__ method calls readline, and it never raises StopIteration
  • readlines iterates over readline, which never raises StopIteration (see the sketch below)
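
Here is a minimal sketch of the compliant behaviour (generic Python, not tied to StreamingBody): the generator is created once, readline returns '' at EOF, and __next__ raises StopIteration:

class LineReader:
    """Sketch of the file-object line protocol."""

    def __init__(self, lines):
        self._iter = iter(lines)  # created once, not on every readline call

    def readline(self):
        return next(self._iter, '')  # '' signals EOF, like built-in files

    def __iter__(self):
        return self

    def __next__(self):
        line = self.readline()
        if not line:
            raise StopIteration
        return line

    def readlines(self):
        return list(self)

reader = LineReader(['first\n', 'second\n'])
assert reader.readlines() == ['first\n', 'second\n']
assert reader.readline() == ''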

Config arguments such as ACL not passed to copy

The rename() method calls Bucket.copy(), which does not accept configuration arguments as normal kwargs. Instead, configuration arguments should be passed in as ExtraArgs (reference).

As a result, if I have ACL in my config, a rename() operation fails to apply the correct ACL to the renamed file.
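
For illustration, this is roughly how the underlying copy would need to receive the ACL with plain boto3 (the bucket/key names and the ACL value are examples):

import boto3

boto3.resource('s3').meta.client.copy(
    CopySource={'Bucket': 'src-bucket', 'Key': 'old/key.txt'},
    Bucket='dst-bucket',
    Key='new/key.txt',
    # Configuration arguments go through ExtraArgs, not as top-level kwargs.
    ExtraArgs={'ACL': 'bucket-owner-full-control'},
)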

Metadata

Is it possible to configure metadata when uploading / creating a file this way?
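
If the per-bucket parameters are forwarded to uploads the same way as in the ServerSideEncryption example earlier in these issues, something like the following might work (an untested sketch; the bucket name and metadata values are examples):

from s3path import S3Path, register_configuration_parameter

register_configuration_parameter(
    S3Path('/my-bucket'),
    parameters={
        # Passed through to the underlying S3 upload call, which accepts
        # a Metadata mapping of user-defined key/value pairs.
        'Metadata': {'project': 'demo', 'owner': 'data-team'},
    },
)

S3Path('/my-bucket/example.txt').write_text('hello')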

copy files between S3 locations using S3Path

Hi @liormizr, perhaps I have just missed it, but is there any way to copy a file from one S3Path location to another? I'm aware of S3Path.rename(), but that is not really what I want. As a workaround, I could think of:

src = S3Path('/src/dir')
dst = S3Path('/dst/dir')

for each in src.rglob("*"):
    if each.is_file():
        with each.open("rb") as fp:
            (dst / each.relative_to(src)).write_bytes(fp.read())

But that seems like overkill. I could use boto3 directly, but I wondered if there is direct support.

Thanks a lot!

writelines method is not compliant with the same method on the base file interface

To be compliant with the base writelines method, the newline character should not be added.

From the documentation https://docs.python.org/2/library/stdtypes.html#file.writelines :

The name is intended to match readlines(); writelines() does not add line separators.

There is also an issue where the double call to _string_parser (once in writelines and then in write) raises an error:

TypeError: a bytes-like object is required, not 'str'
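
For reference, the expected behaviour from the built-in file interface (the caller supplies the separators; the file name is just an example):

from pathlib import Path

with open('/tmp/writelines_demo.txt', 'w') as f:
    # writelines does not add line separators; the caller includes them.
    f.writelines(['first\n', 'second\n'])

assert Path('/tmp/writelines_demo.txt').read_text() == 'first\nsecond\n'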

Pandas compatibility

Hi. Thank you for making this package; it has helped me a lot, and I use it almost interchangeably with pathlib.Path. The only friction I've found is when using an S3Path object as a file path for pandas.read_*. I believe the relevant portion is here, where __fspath__ returns __str__, which strips the s3:/ prefix, so pandas tries to read from /bucket in the local file system. I tried overriding it, but the two are related and I get recursion errors. Do you have any suggestions for implementing this? I'd be happy to make a PR.

Improve docs with Sphinx

Currently the docs are just simple .rst files.
We need to improve them with auto-generated docs using Sphinx,
and use sphinx automodule and doctest.

Version 0.3.0 issue with custom configuration with smart-open - open method

Raised by @LeoQuote:

The original code won't work with a custom s3 endpoint. Please consider adding my proposed code.

Suggestion:

        s3path_session = boto3.Session()
        s3path_client = s3path_session.client(
            's3',
            endpoint_url=resource_kwargs['endpoint_url'],
            aws_access_key_id=resource_kwargs['aws_access_key_id'],
            aws_secret_access_key=resource_kwargs['aws_secret_access_key'],
            aws_session_token=resource_kwargs['aws_session_token']
        )
        file_object = smart_open.open(
            uri=path.as_uri(),
            mode=mode,
            buffering=buffering,
            encoding=encoding,
            errors=errors,
            newline=newline,
            ignore_ext=True,
            transport_params={
                'session': s3path_session,
                'client': s3path_client,
                'resource_kwargs': resource_kwargs,
            },
        )

S3Path is awkward to use with boto and other libraries expecting bucket and key

I was very excited about S3Path, but after trying to use it in another project I have run into some issues which make it rather cumbersome. If you have a codebase which uses boto, or need to interface with other libraries that use (bucket, key) pairs, then things aren't straightforward.

The first issue is creating an S3Path from bucket and key strings. I was expecting to be able to simply do S3Path(bucket, key) or perhaps S3Path(bucket=b, key=k) but this creates a relative S3Path with no bucket.

I can use S3Path('/', bucket, key) which I guess is kind of OK until key starts with a slash in which case I just end up with S3Path(/path1/path2/file), i.e. the first element of key becomes the bucket name which obviously makes no sense.

This behaviour might be for consistency with pathlib, but I think there should be a straightforward way to make an S3Path from a bucket and key pair. If you don't care about Python 2 support, perhaps an __init__(*, bucket, key) could be added which forces named arguments? Or a from_bk(bucket, key) class method like the existing from_uri?

The second issue is using S3Path with an api that expects bucket and key strings. The obvious thing to do, with for example boto, is s3.upload_file('localfile', s3path.bucket, s3path.key). This doesn't work because boto expects str not S3Path. Fair enough, lets try s3.upload_file('localfile', str(s3path.bucket), str(s3path.key)) ugly but perhaps reasonable, except that it doesn't work either because s3path.bucket starts with a slash. So to make it work you end up with s3.upload_file('localfile', s3path.bucket.name, str(s3path.key)) which is not intuitive at all.

Is there a rationale for why S3Path.bucket and S3Path.key are S3Path objects instead of strings? Intuitively they should give the same (string) values expected by boto and other AWS libraries. This would also be in line with how pathlib returns .name, .suffix, .parts, etc. as strings.

I'm not really familiar with the internals of pathlib but purely based on the documentation it seems like conceptually the bucket name is something that should perhaps be treated more like PurePath.drive instead of just the first part of a posix path?
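
To make the two directions concrete, these are the small user-side helpers I ended up wanting, built only on the constructor form discussed above (the helper names are made up):

from typing import Tuple

from s3path import S3Path

def s3path_from_bucket_key(bucket: str, key: str) -> S3Path:
    # Build an absolute S3Path, guarding against keys with a leading slash.
    return S3Path('/', bucket, key.lstrip('/'))

def bucket_key_from_s3path(path: S3Path) -> Tuple[str, str]:
    # Plain strings suitable for boto3 calls such as upload_file.
    return path.bucket.name, str(path.key)

p = s3path_from_bucket_key('my-bucket', '/path1/path2/file')
assert str(p) == '/my-bucket/path1/path2/file'
assert bucket_key_from_s3path(p) == ('my-bucket', 'path1/path2/file')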

Make S3Path compatible with pandas

A suggestion to make S3Path more compatible with pandas.

If you try something like:

path = S3Path('/bucket/sample.csv')
pandas.read_csv(path.open('r'), ...)

This provides a file-like object as the first argument and works as expected.

However, if you try:

pandas.read_csv(path, ...)

This provides a path-like object as the first argument, but it does not work.

This is because pandas will call the path's __fspath__() to get an actual path, which in this case is '/bucket/sample.csv' and points to a local path.

If __fspath__() instead returned 's3://bucket/sample.csv', then pandas could load the data directly and respect the path semantics.

Please note that pandas internally uses s3fs and fsspec to open the s3 url, unlike S3Path, which I believe uses smart_open. In any case, this would make the usage of S3Path as a path in the pandas context much more intuitive!

This could easily be achieved by adding a proper __fspath__() implementation to S3Path. Hope this makes sense.

Thank you.
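
A minimal sketch of the idea, assuming as_uri() returns the s3:// form as shown in other issues here, and that s3fs/fsspec are installed for pandas (this is a user-side subclass, not an existing library feature):

import pandas as pd
from s3path import S3Path

class PandasFriendlyS3Path(S3Path):
    # Hypothetical subclass: expose the s3:// URI through the os.PathLike
    # protocol so pandas (via s3fs/fsspec) can open the object directly.
    def __fspath__(self) -> str:
        return self.as_uri()

path = PandasFriendlyS3Path('/bucket/sample.csv')
df = pd.read_csv(path)  # pandas receives 's3://bucket/sample.csv'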

Is there a changelog?

Going forward, I think it'll help to maintain a changelog. Without a changelog, it is in general more difficult to understand the notable changes from one released version to another.

Using the library with S3 Compatible Storage

Hi,
We are dealing with S3-compatible storage at work. To work with it, we use boto and create a client or a resource with the endpoint_url specified.

I'd love to be able to use s3path in my project, but from what I've seen I'm only able to use it with a "default session"; the problem is that there is no such thing as a default session for a custom endpoint.

Would it be possible to use your library with custom endpoints with a little bit of tweaking?
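
If a newer s3path version with register_configuration_parameter is an option (it appears in other issues here), a sketch of wiring in a custom-endpoint resource might look like this (the endpoint URL and bucket name are examples):

import boto3
from s3path import S3Path, register_configuration_parameter

# A resource pointed at the S3-compatible endpoint instead of AWS.
resource = boto3.resource('s3', endpoint_url='https://storage.example.com')

# Apply it to every S3Path, or scope it to a single bucket, e.g. S3Path('/my-bucket').
register_configuration_parameter(S3Path('/'), resource=resource)

print(list(S3Path('/my-bucket').iterdir()))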

Thanks,

Cannot configure default session

docs/advance.rst says to call boto3.setup_default_session to override configuration options.

However, that only works if that call to boto3.setup_default_session occurs before s3path is imported, because the _S3Accessor object is created at module import time, and that's when it creates the session using default parameters.

The (very hacky) workaround I came up with is to call S3Path._accessor.__init__() after I reconfigure the default session, which seems to work but is not pretty.

Not sure what the ideal solution is to this—perhaps a module-level "set session" function?
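
In code, the hacky workaround looks roughly like this (it relies on a private attribute, so it may break between versions; the credentials and bucket are placeholders):

import boto3
from s3path import S3Path

boto3.setup_default_session(
    region_name='eu-central-1',
    aws_access_key_id='...',
    aws_secret_access_key='...',
)

# Re-create the accessor's session/resource after reconfiguring boto3.
S3Path._accessor.__init__()

print(list(S3Path('/my-bucket').iterdir()))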

URL encoding breaks the S3 key

When uploading a file to S3 like this:

>>> from s3path import S3Path
>>> s=S3Path('/my-bucket/tmp/foo=bar/objname')
>>> s.write_text("foobar")
/path/to/my/venv/lib64/python3.8/site-packages/smart_open/s3.py:220: UserWarning: ignoring the following deprecated transport parameters: ['multipart_upload_kwargs', 'object_kwargs', 'resource_kwargs', 'session']. See <https://github.com/RaRe-Technologies/smart_open/blob/develop/MIGRATING_FROM_OLDER_VERSIONS.rst> for details
  warnings.warn(message, UserWarning)
6

The location is transformed and the = is URL-encoded by an as_uri() call. In S3 I now have a tmp/foo%3Dbar/ prefix, but I expected a tmp/foo=bar/ prefix.

The as_uri() call in s3path.py leads to unexpected behaviour and breaks the behaviour of s3path <= 0.3.

Configuration example not working: NoCredentialsError

Problem

I tried to set the credentials using boto3.setup_default_session as described in the s3path docs. However, it seems that s3path ignores that setting. I get a NoCredentialsError, while using boto3 directly works.

Code

import boto3
from s3path import S3Path

AWS_ACCESS_KEY_ID = "..."
AWS_SECRET_ACCESS_KEY = "..."

boto3.setup_default_session(
    region_name='eu-central-1',    
    aws_access_key_id=AWS_ACCESS_KEY_ID,    
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY
)

# works fine
s3 = boto3.client('s3')
for obj in s3.list_objects_v2(Bucket="my-bucket", Prefix="2021/")['Contents']:
    print(obj['Key'])
    break
    
# NoCredentialsError
bucket_path = S3Path('/my-bucket/2021/')
for p in bucket_path.iterdir():
    print(p)
    break

Workaround

Set credentials via environment variables before importing boto3:

import os

os.environ["AWS_ACCESS_KEY_ID"] = "..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."

import boto3
from s3path import S3Path

glob and rglob broke with Python 3.8.3

This used to work fine with Python 3.8.2, but doesn't in 3.8.3. The Python changelog does show relevant changes for .glob and .scandir for Python 3.8.3, one of which is https://bugs.python.org/issue39916. This Python issue also affects 3.7 and 3.9.

>>> import sys
>>> sys.version
'3.8.3 (default, May 19 2020, 13:54:14) \n[Clang 10.0.0 ]'

>>> import s3path
>>> p = s3path.S3Path.from_uri('s3://noaa-gefs-pds')

>>> next(p.iterdir())
S3Path('/noaa-gefs-pds/gefs.20170101')

>>> next(p.glob('*'))
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/removed/lib/python3.8/site-packages/s3path.py", line 502, in glob
    yield from super().glob(pattern)
  File "/removed/lib/python3.8/pathlib.py", line 1136, in glob
    for p in selector.select_from(self):
  File "/removed/lib/python3.8/pathlib.py", line 530, in _select_from
    with scandir(parent_path) as scandir_it:
AttributeError: __enter__

>>> next(p.rglob('*'))
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/removed/lib/python3.8/site-packages/s3path.py", line 508, in rglob
    yield from super().rglob(pattern)
  File "/removed/lib/python3.8/pathlib.py", line 1148, in rglob
    for p in selector.select_from(self):
  File "/removed/lib/python3.8/pathlib.py", line 583, in _select_from
    for p in successor_select(starting_point, is_dir, exists, scandir):
  File "/removed/lib/python3.8/pathlib.py", line 530, in _select_from
    with scandir(parent_path) as scandir_it:
AttributeError: __enter__

Note that glob and rglob of course continue to work fine for pathlib.Path:

>>> import pathlib

>>> next(pathlib.Path('/tmp').glob('*'))
PosixPath('/tmp/com.apple.launchd.6n6VKSqeWo')

>>> next(pathlib.Path('/tmp').rglob('*'))
PosixPath('/tmp/com.apple.launchd.6n6VKSqeWo')

The tricky point could be to support both Python 3.8.2 and 3.8.3.

StatResult uses wrong attribute names

@liormizr Why does StatResult use different attribute names than the ones used by pathlib and os.stat_result? As a result, I cannot transparently use this package. I specifically need the st_size attribute, but it's unavailable. This idiosyncrasy makes this package less than fully pathlib-compatible. This concern is not limited to st_size.
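
The workaround I'm using in the meantime, based on the StatResult fields visible in other issues here (size and last_modified; the path is an example):

from s3path import S3Path

stat = S3Path('/my-bucket/some/key.bin').stat()

size_in_bytes = stat.size      # instead of stat.st_size
modified = stat.last_modified  # a datetime, instead of stat.st_mtime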

Support non-default session

As far as I can tell, s3path is hard-wired to use boto3's default session. But I need to use a non-default session.

Actually, after further analysis of the source code, it looks like I could do

from s3path import S3Path, register_configuration_parameter
register_configuration_parameter(S3Path("/"), resource=boto3.resource("s3"))

where above one can easily substitute a resource based on a user-defined session. But it took me a while of reading the code in order to figure this out. (I didn't find any corresponding documentation.) Am I doing it right?

Thanks for the great library!

TypeError: 'NoneType' exception when calling owner() method on a valid path

Calling bp.owner() I get this unexpected error

In [16]: bp.stat()
Out[16]: StatResult(size=4454, last_modified=datetime.datetime(2019, 2, 28, 19, 50, 20, tzinfo=tzutc()))

In [17]: bp.owner()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-17-c9b9b77d3c96> in <module>()
----> 1 bp.owner()

/usr/local/lib/python3.6/dist-packages/s3path.py in owner(self)
    411         if not self.is_file():
    412             return KeyError('file not found')
--> 413         return self._accessor.owner(self)
    414 
    415     def rename(self, target):

/usr/local/lib/python3.6/dist-packages/s3path.py in owner(self, path)
    142         key_name = str(path.key)
    143         object_summery = self.s3.ObjectSummary(bucket_name, key_name)
--> 144         return object_summery.owner['DisplayName']
    145 
    146     def rename(self, path, target):

TypeError: 'NoneType' object is not subscriptable

Add support for symbolic links

It is possible to support symbolic linking on S3 by relying on the native metadata field website_redirect_location. This is intended to be used with website buckets as a redirect key, so S3 won't respect it in non-website buckets, but because it is provided natively we can rely on it and use it to store references similar to how symbolic links work.

In other similar distributed filesystems, this works with an empty, 0-length binary string as the file contents, accompanied by the relevant header stored in metadata (in this case, x-amz-website-redirect-location). It would then be up to the implementation to check whether a path refers to a symbolic link during relevant operations, e.g. read() or directory traversals.
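
A rough sketch of the convention with plain boto3 (the bucket and key names are examples); how s3path would surface this is the open question:

import boto3

s3 = boto3.client('s3')

# "Create" a symlink: an empty object whose redirect header stores the target.
s3.put_object(
    Bucket='my-bucket',
    Key='links/latest',
    Body=b'',
    WebsiteRedirectLocation='/data/2022-02-03/report.csv',
)

# "Read" the symlink: head_object exposes the stored target.
head = s3.head_object(Bucket='my-bucket', Key='links/latest')
print(head.get('WebsiteRedirectLocation'))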

Reliably publish to conda-forge as well

I see that PyPI currently has v0.1.93 but conda-forge has v0.1.91. Can the version on conda-forge be updated in tandem with the one on PyPI, or very close to it? Without this I cannot easily use the latest version in my conda environment. Thanks.

S3Path.write_bytes does not work

Python 3.8
s3path: 0.1.8

Calling write_bytes on an S3Path fails as the pathlib write_bytes method wraps the data in a memoryview. S3Path's _string_parser then fails to detect and treat it as bytes / binary and therefore fails when calling encode on the data.
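
The kind of guard that addresses this, shown as a standalone sketch of the parsing step (not the library's actual code):

def to_bytes(data) -> bytes:
    # pathlib.Path.write_bytes passes a memoryview; normalise it (and
    # bytearray) to bytes before any str/bytes dispatch happens.
    if isinstance(data, (memoryview, bytearray)):
        return bytes(data)
    if isinstance(data, str):
        return data.encode()
    return data

assert to_bytes(memoryview(b'abc')) == b'abc'
assert to_bytes('abc') == b'abc'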

Can't write to new object and use encoding

my_s3_path.open("w", encoding="utf-8", newline="\n")

This code will produce the following error:

  File "C:\python\lib\site-packages\s3path.py", line 607, in open
    return self._accessor.open(
  File "C:\python\lib\site-packages\s3path.py", line 213, in open
    return file_object(
  File "C:\python\lib\site-packages\s3path.py", line 769, in __init__
    self._cache = NamedTemporaryFile(
  File "C:\python\lib\tempfile.py", line 542, in NamedTemporaryFile
    file = _io.open(fd, mode, buffering=buffering,
ValueError: binary mode doesn't take an encoding argument

It appears that the logic for the initialization of S3KeyWritableFileObject is the source of this error. Specifically, this update should resolve it:

binary = 'b' if (('b' in self.mode) or (self.encoding is None)) else ''
self._cache = NamedTemporaryFile(
    mode=binary + self.mode.strip('b') + '+',
    buffering=self.buffering,
    encoding=self.encoding,
    newline=self.newline)

register_configuration_parameter is broken

from s3path import S3Path, register_configuration_parameter

bucket = S3Path('/bucket/')
register_configuration_parameter(bucket, parameters={'ContentType': 'text/html'})
bucket.joinpath('bar.html').write_text('hello')

Got this error

ValueError: too many values to unpack (expected 2)

Warning from smart_open

I'm seeing the following warning. At the moment I don't have time to track it down, but maybe this is already enough to understand the problem?

/opt/conda/lib/python3.9/site-packages/smart_open/s3.py:220: UserWarning: ignoring the following deprecated transport parameters: ['multipart_upload_kwargs', 'object_kwargs', 'resource_kwargs', 'session']. See <https://github.com/RaRe-Technologies/smart_open/blob/develop/MIGRATING_FROM_OLDER_VERSIONS.rst> for details
  warnings.warn(message, UserWarning)
