
s3-concat's People

Contributors

christophlingg, vinceatbluelabs, vivekjhauipath, xtream1101


s3-concat's Issues

Random Retry failed batch error

I don't know how this can happen, since I only have small files and there is only one thread, but somehow batches fail. Here is the code I use:

from os import path
from s3_concat import S3Concat

job = S3Concat(s3_bucket, merged_csv, min_file_size=None)
job.add_files(path.join(prefix, "data"))
job.concat()

It doesn't happen often, but it occasionally raises the error mentioned in the title. Here are the logs: https://pastebin.com/raw/svSZ2sgu

Read Timeout Error

I am getting a lot of ReadTimeoutErrors. Not sure why the endpoint URL is "None".

Is there a setting I can change to limit or eliminate the occurrence of this error?

CRITICAL:s3_concat.multipart_upload_job:Read timeout on endpoint URL: "None": When getting my/s3/prefix/my-file.json from the bucket my-s3-bucket
ERROR:s3_concat.utils:Retry failed batch of: [5, (5, [('my/s3/prefix/my-file-01.json', 343575), ('my/s3/prefix/my-file-02.json', 531245), ... ])]
Traceback (most recent call last):
  File "/opt/env/lib/python3.7/site-packages/urllib3/response.py", line 443, in _error_catcher
    yield
  File "/opt/env/lib/python3.7/site-packages/urllib3/response.py", line 566, in read
    data = self._fp_read(amt) if not fp_closed else b""
  File "/opt/env/lib/python3.7/site-packages/urllib3/response.py", line 532, in _fp_read
    return self._fp.read(amt) if amt is not None else self._fp.read()
  File "/usr/local/lib/python3.7/http/client.py", line 478, in read
    s = self._safe_read(self.length)
  File "/usr/local/lib/python3.7/http/client.py", line 628, in _safe_read
    chunk = self.fp.read(min(amt, MAXAMOUNT))
  File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/lib/python3.7/ssl.py", line 1071, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/lib/python3.7/ssl.py", line 929, in read
    return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/env/lib/python3.7/site-packages/botocore/response.py", line 99, in read
    chunk = self._raw_stream.read(amt)
  File "/opt/env/lib/python3.7/site-packages/urllib3/response.py", line 592, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/opt/env/lib/python3.7/site-packages/urllib3/response.py", line 448, in _error_catcher
    raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: AWSHTTPSConnectionPool(host='ucc-filings-results-prod.s3.amazonaws.com', port=443): Read timed out.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/env/lib/python3.7/site-packages/s3_concat/utils.py", line 19, in _thread_run
    response = callback(item)
  File "/opt/env/lib/python3.7/site-packages/s3_concat/multipart_upload_job.py", line 108, in get_small_parts
    )['Body'].read()
  File "/opt/env/lib/python3.7/site-packages/botocore/response.py", line 102, in read
    raise ReadTimeoutError(endpoint_url=e.url, error=e)
botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: "None"
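
Regarding the question about settings: the timeout being hit is a botocore client option rather than anything shown in the s3-concat calls above. A minimal sketch of raising it on a boto3 S3 client follows; whether s3-concat can be handed such a client (or an equivalent session/config) is not shown in this issue, so treat that part as an assumption to verify:

import boto3
from botocore.config import Config

# Standard botocore options (not s3-concat options): a longer read timeout
# and more retry attempts for transient failures.
s3_client = boto3.client(
    's3',
    config=Config(read_timeout=300, retries={'max_attempts': 10, 'mode': 'standard'}),
)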

Could we introduce threads to make concat faster?

Before this test, I excluded files of 5 MB or less so that the step of downloading them locally was skipped.
When the concat process was run on a group of files averaging 20 MB, the result was about 1 file per minute; by that calculation, 200 files are combined in 1 minute, which means processing 100 GB takes 100 minutes.

I have a total of 100 GB of files per hour, and I want to finish the merge within 20 minutes.
To achieve that, I think the processing needs to be done in parallel using threads.

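For what it's worth, one way to get parallelism today without changes to the library is to run several independent S3Concat jobs at once, one per prefix, from the caller. A rough sketch, where the bucket, prefixes, worker count, and output naming are all placeholder assumptions and each prefix is merged into its own output file:

from concurrent.futures import ThreadPoolExecutor
from s3_concat import S3Concat

BUCKET = 'my-bucket'                                            # placeholder
PREFIXES = ['data/2021/01/', 'data/2021/02/', 'data/2021/03/']  # placeholder

def concat_prefix(prefix):
    # One independent concat job per prefix; each produces its own merged file.
    out_key = 'merged/' + prefix.strip('/').replace('/', '_') + '.json'
    job = S3Concat(BUCKET, out_key, '20MB')
    job.add_files(prefix)
    job.concat()

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(concat_prefix, PREFIXES))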

UnicodeDecodeError (utf-8)

Hi guys, I've tried using this lib and got an error:

(screenshot of the error, dated 2020-09-23)

I found the code that generates it: the get_small_parts method.

Looking more closely at this error, I found the root cause: files compressed with gzip (or any other compression) will raise an exception in this code. In my case my files are gzipped, and the workaround was to use the gzip library to read them.
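
A minimal sketch of that workaround, assuming the objects are gzip-compressed and are being read directly with boto3 (bucket and key are placeholders):

import gzip
import boto3

s3 = boto3.client('s3')

# Read the compressed object and decompress it before decoding as text
body = s3.get_object(Bucket='my-bucket', Key='path/to/file.json.gz')['Body'].read()
text = gzip.decompress(body).decode('utf-8')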

s3-concat not merging and no error

from s3_concat import S3Concat

bucket = 's3-data-buck'
path_to_concat = 'customer'  # all files under this prefix in the above bucket
concatenated_file = 'concat.csv'
min_file_size = '5MB'

job = S3Concat(bucket, concatenated_file, min_file_size, content_type='application/json')
job.add_files(path_to_concat)

When I run the above code, no error occurs and no merge happens. Is there any way to debug this?
Is anything wrong with the above call?
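
One way to get more visibility is to turn on Python logging for the library's loggers (the logger names s3_concat.multipart_upload_job and s3_concat.utils appear in the Read Timeout issue above); a minimal sketch:

import logging

# Show everything the s3_concat loggers emit while adding files and concatenating
logging.basicConfig(level=logging.DEBUG)
logging.getLogger('s3_concat').setLevel(logging.DEBUG)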

Add concat order optimization

Hi, I found this tool and it's great.

In some cases, developers prefer performance over file order.
It seems the current add_files() with a prefix just adds files in alphabetical order.

Could you make the following improvements, which could significantly improve performance?

  1. Sort the results obtained by job.add_files(path_to_concat) by size
  2. Perform the join efficiently, starting from the largest files
  3. Download only the small files under 5 MB locally, then merge and upload them

I guess it would be better to add an option to add_files() for this, such as join_from_large_files=True.

Example of sorting by size:

>>> objects_list = []
>>> objects_list.extend([('file_A', 1000),('file_B', 1500),('file_C', 500)])
>>> objects_list
[('file_A', 1000), ('file_B', 1500), ('file_C', 500)]
>>> objects_list.sort(key=lambda x: x[1])
>>> objects_list
[('file_C', 500), ('file_A', 1000), ('file_B', 1500)]
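
Since point 2 above joins from the largest file first, the same list can also be sorted in descending order:

>>> objects_list.sort(key=lambda x: x[1], reverse=True)
>>> objects_list
[('file_B', 1500), ('file_A', 1000), ('file_C', 500)]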

feature request: add support for gzipped content

Hi,

I manage millions of small files that need to be concatenated, and your tool works great for that.
In the case of gzipped source files, I needed to decompress each file before doing the concatenation, based on the ContentEncoding header in the S3 response.

If you're interested, I can submit a PR with the changes to handle gzipped content automatically (it's only 6 lines of code).
Maybe an arg could be added to MultipartUploadJob if you want to enable it manually.
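
For illustration only (not the actual PR), a minimal sketch of a ContentEncoding-based check on the reading side, with placeholder bucket and key names:

import gzip
import boto3

s3 = boto3.client('s3')

def read_part(bucket, key):
    # Decompress the body only when S3 reports it as gzip-encoded
    resp = s3.get_object(Bucket=bucket, Key=key)
    body = resp['Body'].read()
    if resp.get('ContentEncoding') == 'gzip':
        body = gzip.decompress(body)
    return body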

Thanks for this lib.

Combining files < 5mb + using regex

Hi, great work here! Thanks for putting this together :)
I have a few questions:

  • Is it possible to use regex in the bucket/path string? I'm trying to combine (as an example) ALL the JSON files in the paths below
    (let's say I'm trying to combine the following 3 files):
1. s3://fluentd/events/012345esdf/2017/12/a.json
2. s3://fluentd/events/9872absdd/2017/12/b.json
3. s3://fluentd/events/333333dd/2017/12/c.json

What should my BUCKET string be?
Also, I didn't understand the difference between the out_path argument and the path_to_concat argument.
FYI my files are < 5 MB, so I'm using small_parts_threads=4.
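
add_files() takes a plain prefix rather than a regex (see the other issues above), so one possible workaround is to call it once per matching prefix before a single concat(). A rough sketch, assuming repeated add_files() calls accumulate onto the same job, using the three example paths above and a placeholder output key:

from s3_concat import S3Concat

# Output key and min_file_size are placeholders; the bucket and prefixes
# come from the three example paths above.
job = S3Concat('fluentd', 'combined/2017/12/combined.json', None)
for prefix in ['events/012345esdf/2017/12/',
               'events/9872absdd/2017/12/',
               'events/333333dd/2017/12/']:
    job.add_files(prefix)
job.concat()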

feature request: add support for NDJSON, log message to pushed s3 key & compression

Hi @xtream1101

Please add support for:

  1. NDJSON format (basically adding a new-line separator), e.g.

import json
import os

def construct_jsonl_str(items) -> str:
    json_str = ''
    for item in items:
        # workaround: join the items with a line separator to produce NDJSON
        json_str += json.dumps(item) + os.linesep
    return json_str

  2. A log message with the S3 key being put (probably somewhere here)

  3. Compression of the concatenated file

P.S. I can contribute the logic requested above myself, if possible.

change add_file definition to use head_object instead of get_object

Hi @xtream1101 !

Thanks for your work on this. It's really useful.

I see that the add_file definition needs only ContentLength, which is already available from the object metadata. We could simply make a head_object call rather than get_object, which is more expensive.

def add_file(self, key):
    resp = self.s3.get_object(Bucket=self.bucket, Key=key)
    self.all_files.append((key, resp['ContentLength']))

Is there a reason to use get_object instead of head_object?
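
For reference, the proposed version would look roughly like this; head_object returns the same ContentLength field without downloading the body:

def add_file(self, key):
    # Drop-in replacement for the method above: a HEAD request fetches the
    # metadata (including ContentLength) without transferring the object body
    resp = self.s3.head_object(Bucket=self.bucket, Key=key)
    self.all_files.append((key, resp['ContentLength']))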

Memory Leak - Threads are leaking

Hi @xtream1101, thanks a bunch for putting this together! We are using this for our project and detected a memory leak in the library: threads are created every time we invoke a concat operation and are never cleaned up. This is exactly where the bug is: https://github.com/xtream1101/s3-concat/blob/master/s3_concat/utils.py#L36. We are currently using a ThreadPool to deal with the issue, but it would be great if you could fix it here so that we can keep using all the great features of the library coming in the future. Here's a quick glance at our fix.

import logging
from multiprocessing.pool import ThreadPool as Pool

logger = logging.getLogger(__name__)


def _thread_run(item, callback):
    # Retry up to 3 times before giving up
    for _ in range(3):
        try:
            response = callback(item)
            return response
        except Exception:
            logger.exception("Retry failed batch of: {}".format(item))


def _threads(num_threads, data, callback):
    # Pool workers are cleaned up by close()/join() once the batch is done
    pool = Pool(num_threads)
    results = pool.starmap(_thread_run, [(item, callback) for item in data])
    pool.close()
    pool.join()
    return results

Please let us know if you need further details on this.
