
s3-concat's People

Contributors

christophlingg, vinceatbluelabs, vivekjhauipath, xtream1101


s3-concat's Issues

Random Retry failed batch error

I don't know how this can happen, since I only have small files and there is only one thread, but somehow batches fail. Here is the code I use:

from os import path
from s3_concat import S3Concat

job = S3Concat(s3_bucket, merged_csv, min_file_size=None)
job.add_files(path.join(prefix, "data"))
job.concat()

It doesn't happen often, but it occasionally raises the error mentioned in the title. Here are the logs: https://pastebin.com/raw/svSZ2sgu

Read Timeout Error

I am getting a lot of ReadTimeoutErrors. Not sure why the endpoint URL is "None".

Is there a setting I can change to limit or eliminate the occurrence of this error?

CRITICAL:s3_concat.multipart_upload_job:Read timeout on endpoint URL: "None": When getting my/s3/prefix/my-file.json from the bucket my-s3-bucket
ERROR:s3_concat.utils:Retry failed batch of: [5, (5, [('my/s3/prefix/my-file-01.json', 343575), ('my/s3/prefix/my-file-02.json', 531245), ... ])]
Traceback (most recent call last):
  File "/opt/env/lib/python3.7/site-packages/urllib3/response.py", line 443, in _error_catcher
    yield
  File "/opt/env/lib/python3.7/site-packages/urllib3/response.py", line 566, in read
    data = self._fp_read(amt) if not fp_closed else b""
  File "/opt/env/lib/python3.7/site-packages/urllib3/response.py", line 532, in _fp_read
    return self._fp.read(amt) if amt is not None else self._fp.read()
  File "/usr/local/lib/python3.7/http/client.py", line 478, in read
    s = self._safe_read(self.length)
  File "/usr/local/lib/python3.7/http/client.py", line 628, in _safe_read
    chunk = self.fp.read(min(amt, MAXAMOUNT))
  File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/lib/python3.7/ssl.py", line 1071, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/lib/python3.7/ssl.py", line 929, in read
    return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/env/lib/python3.7/site-packages/botocore/response.py", line 99, in read
    chunk = self._raw_stream.read(amt)
  File "/opt/env/lib/python3.7/site-packages/urllib3/response.py", line 592, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/opt/env/lib/python3.7/site-packages/urllib3/response.py", line 448, in _error_catcher
    raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: AWSHTTPSConnectionPool(host='ucc-filings-results-prod.s3.amazonaws.com', port=443): Read timed out.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/env/lib/python3.7/site-packages/s3_concat/utils.py", line 19, in _thread_run
    response = callback(item)
  File "/opt/env/lib/python3.7/site-packages/s3_concat/multipart_upload_job.py", line 108, in get_small_parts
    )['Body'].read()
  File "/opt/env/lib/python3.7/site-packages/botocore/response.py", line 102, in read
    raise ReadTimeoutError(endpoint_url=e.url, error=e)
botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: "None"
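
Regarding the question about settings: the timeout being hit is a botocore client option rather than anything shown in the s3-concat calls above. A minimal sketch of raising it on a boto3 S3 client follows; whether s3-concat can be handed such a client (or an equivalent session/config) is not shown in this issue, so treat that part as an assumption to verify:

import boto3
from botocore.config import Config

# Standard botocore options (not s3-concat options): a longer read timeout
# and more retry attempts for transient failures.
s3_client = boto3.client(
    's3',
    config=Config(read_timeout=300, retries={'max_attempts': 10, 'mode': 'standard'}),
)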

Could we introduce threads to make concat faster?

Before this test, I excluded files of 5 MB or less so that the step of downloading them locally was skipped.
When the concat process was run on a group of files averaging 20 MB, the result was about 1 file per minute; by that calculation, 200 files are combined in 1 minute, which means processing 100 GB takes 100 minutes.

I have a total of 100 GB of files per hour, and I want to finish the merge within 20 minutes.
To achieve that, I think the processing needs to be done in parallel using threads.

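For what it's worth, one way to get parallelism today without changes to the library is to run several independent S3Concat jobs at once, one per prefix, from the caller. A rough sketch, where the bucket, prefixes, worker count, and output naming are all placeholder assumptions and each prefix is merged into its own output file:

from concurrent.futures import ThreadPoolExecutor
from s3_concat import S3Concat

BUCKET = 'my-bucket'                                            # placeholder
PREFIXES = ['data/2021/01/', 'data/2021/02/', 'data/2021/03/']  # placeholder

def concat_prefix(prefix):
    # One independent concat job per prefix; each produces its own merged file.
    out_key = 'merged/' + prefix.strip('/').replace('/', '_') + '.json'
    job = S3Concat(BUCKET, out_key, '20MB')
    job.add_files(prefix)
    job.concat()

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(concat_prefix, PREFIXES))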

UnicodeDecodeError (utf-8)

Hi guys, I've tried using this lib and got an error:

(screenshot of the error, dated 2020-09-23)

I found the code that generates it: the get_small_parts method.

Looking more closely at this error, I found the root cause: files compressed with gzip (or any other compression) will raise an exception in this code. In my case my files are gzipped, and the workaround was to use the gzip library to read them.
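
A minimal sketch of that workaround, assuming the objects are gzip-compressed and are being read directly with boto3 (bucket and key are placeholders):

import gzip
import boto3

s3 = boto3.client('s3')

# Read the compressed object and decompress it before decoding as text
body = s3.get_object(Bucket='my-bucket', Key='path/to/file.json.gz')['Body'].read()
text = gzip.decompress(body).decode('utf-8')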

s3-concat not merging and no error

from s3_concat import S3Concat

bucket = 's3-data-buck'
path_to_concat = 'customer'  # all files under this prefix in the above bucket
concatenated_file = 'concat.csv'
min_file_size = '5MB'

job = S3Concat(bucket, concatenated_file, min_file_size, content_type='application/json')
job.add_files(path_to_concat)

When I run the above code, no error occurs and no merge happens. Is there any way to debug this?
Is anything wrong with the above call?
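
One way to get more visibility is to turn on Python logging for the library's loggers (the logger names s3_concat.multipart_upload_job and s3_concat.utils appear in the Read Timeout issue above); a minimal sketch:

import logging

# Show everything the s3_concat loggers emit while adding files and concatenating
logging.basicConfig(level=logging.DEBUG)
logging.getLogger('s3_concat').setLevel(logging.DEBUG)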

Add concat order optimization

Hi, I found this tool and it's great.

In some cases, developers prefer performance over file order.
It seems the current add_files() with a prefix just adds files in alphabetical order.

Could you make the following improvements, which could significantly improve performance?

  1. Sort the results obtained by job.add_files(path_to_concat) by size
  2. Perform the join efficiently, starting from the largest files
  3. Download only the small files under 5 MB locally, then merge and upload them

I guess it would be better to add an option to add_files() for this, such as join_from_large_files=True.

Example of sorting by size:

>>> objects_list = []
>>> objects_list.extend([('file_A', 1000),('file_B', 1500),('file_C', 500)])
>>> objects_list
[('file_A', 1000), ('file_B', 1500), ('file_C', 500)]
>>> objects_list.sort(key=lambda x: x[1])
>>> objects_list
[('file_C', 500), ('file_A', 1000), ('file_B', 1500)]
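
Since point 2 above joins from the largest file first, the same list can also be sorted in descending order:

>>> objects_list.sort(key=lambda x: x[1], reverse=True)
>>> objects_list
[('file_B', 1500), ('file_A', 1000), ('file_C', 500)]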

feature request: add support for gzipped content

Hi,

I manage millions of small files that need to be concatenated, and your tool works great for that.
In the case of gzipped source files, I needed to decompress each file before doing the concatenation, based on the ContentEncoding header in the S3 response.

If you're interested, I can submit a PR with the changes to handle gzipped content automatically (it's only 6 lines of code).
Maybe an arg could be added to MultipartUploadJob if you want to enable it manually.
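
For illustration only (not the actual PR), a minimal sketch of a ContentEncoding-based check on the reading side, with placeholder bucket and key names:

import gzip
import boto3

s3 = boto3.client('s3')

def read_part(bucket, key):
    # Decompress the body only when S3 reports it as gzip-encoded
    resp = s3.get_object(Bucket=bucket, Key=key)
    body = resp['Body'].read()
    if resp.get('ContentEncoding') == 'gzip':
        body = gzip.decompress(body)
    return body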

Thanks for this lib.

Combining files < 5mb + using regex

Hi, great work here! Thanks for putting this together :)
I have a few questions:

  • Is it possible to use regex in the bucket/path string? I'm trying to combine (as an example) ALL the JSON files in the paths below
    (let's say I'm trying to combine the following 3 files):
1. s3://fluentd/events/012345esdf/2017/12/a.json
2. s3://fluentd/events/9872absdd/2017/12/b.json
3. s3://fluentd/events/333333dd/2017/12/c.json

What should my BUCKET string be?
Also, I didn't understand the difference between the out_path argument and the path_to_concat argument.
FYI my files are < 5 MB, so I'm using small_parts_threads=4.
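
add_files() takes a plain prefix rather than a regex (see the other issues above), so one possible workaround is to call it once per matching prefix before a single concat(). A rough sketch, assuming repeated add_files() calls accumulate onto the same job, using the three example paths above and a placeholder output key:

from s3_concat import S3Concat

# Output key and min_file_size are placeholders; the bucket and prefixes
# come from the three example paths above.
job = S3Concat('fluentd', 'combined/2017/12/combined.json', None)
for prefix in ['events/012345esdf/2017/12/',
               'events/9872absdd/2017/12/',
               'events/333333dd/2017/12/']:
    job.add_files(prefix)
job.concat()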

feature request: add support for NDJSON, log message to pushed s3 key & compression

Hi @xtream1101

Please add support for:

  1. NDJSON format (basically adding a new-line separator), e.g.

import json
import os

def construct_jsonl_str(items) -> str:
    json_str = ''
    for item in items:
        # workaround: join the items with a line separator to produce NDJSON
        json_str += json.dumps(item) + os.linesep
    return json_str

  2. A log message with the S3 key being put (probably somewhere here)

  3. Compression of the concatenated file

P.S. I can contribute the logic requested above myself, if possible.

change add_file definition to use head_object instead of get_object

Hi @xtream1101 !

Thanks for your work on this. It's really useful.

I see that the add_file definition needs only ContentLength, which is already available from the object metadata. We could simply make a head_object call rather than get_object, which is more expensive.

def add_file(self, key):
    resp = self.s3.get_object(Bucket=self.bucket, Key=key)
    self.all_files.append((key, resp['ContentLength']))

Is there a reason to use get_object instead of head_object?
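
For reference, the proposed version would look roughly like this; head_object returns the same ContentLength field without downloading the body:

def add_file(self, key):
    # Drop-in replacement for the method above: a HEAD request fetches the
    # metadata (including ContentLength) without transferring the object body
    resp = self.s3.head_object(Bucket=self.bucket, Key=key)
    self.all_files.append((key, resp['ContentLength']))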

Memory Leak - Threads are leaking

Hi @xtream1101, thanks a bunch for putting this together! We are using this for our project and detected a memory leak in the library: threads are created every time we invoke a concat operation and are never cleaned up. This is exactly where the bug is: https://github.com/xtream1101/s3-concat/blob/master/s3_concat/utils.py#L36. We are currently using a ThreadPool to deal with the issue, but it would be great if you could fix it here so that we can keep using all the great features of the library coming in the future. Here's a quick glance at our fix.

import logging
from multiprocessing.pool import ThreadPool as Pool

logger = logging.getLogger(__name__)


def _thread_run(item, callback):
    # Retry up to 3 times before giving up
    for _ in range(3):
        try:
            response = callback(item)
            return response
        except Exception:
            logger.exception("Retry failed batch of: {}".format(item))


def _threads(num_threads, data, callback):
    # Pool workers are cleaned up by close()/join() once the batch is done
    pool = Pool(num_threads)
    results = pool.starmap(_thread_run, [(item, callback) for item in data])
    pool.close()
    pool.join()
    return results

Please let us know if you need further details on this.
