xtream1101 / s3-concat
Concat multiple files in s3
License: MIT License
I don't know how this can happen, as I only have small files and there is only one thread, but somehow batches fail. Here is the code I use:
from os import path
from s3_concat import S3Concat

job = S3Concat(s3_bucket, merged_csv, min_file_size=None)
job.add_files(path.join(prefix, "data"))
job.concat()
Not often, but it happens to raise the error mentioned in the title. Here are the logs: https://pastebin.com/raw/svSZ2sgu.
I am getting a lot of ReadTimeoutErrors. Not sure why the endpoint URL is "None".
Is there a setting I can change to limit or eliminate the occurrence of this error?
CRITICAL:s3_concat.multipart_upload_job:Read timeout on endpoint URL: "None": When getting my/s3/prefix/my-file.json from the bucket my-s3-bucket
ERROR:s3_concat.utils:Retry failed batch of: [5, (5, [('my/s3/prefix/my-file-01.json', 343575), ('my/s3/prefix/my-file-02.json', 531245), ... ])]
Traceback (most recent call last):
File "/opt/env/lib/python3.7/site-packages/urllib3/response.py", line 443, in _error_catcher
yield
File "/opt/env/lib/python3.7/site-packages/urllib3/response.py", line 566, in read
data = self._fp_read(amt) if not fp_closed else b""
File "/opt/env/lib/python3.7/site-packages/urllib3/response.py", line 532, in _fp_read
return self._fp.read(amt) if amt is not None else self._fp.read()
File "/usr/local/lib/python3.7/http/client.py", line 478, in read
s = self._safe_read(self.length)
File "/usr/local/lib/python3.7/http/client.py", line 628, in _safe_read
chunk = self.fp.read(min(amt, MAXAMOUNT))
File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
File "/usr/local/lib/python3.7/ssl.py", line 1071, in recv_into
return self.read(nbytes, buffer)
File "/usr/local/lib/python3.7/ssl.py", line 929, in read
return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/env/lib/python3.7/site-packages/botocore/response.py", line 99, in read
chunk = self._raw_stream.read(amt)
File "/opt/env/lib/python3.7/site-packages/urllib3/response.py", line 592, in read
raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "/opt/env/lib/python3.7/site-packages/urllib3/response.py", line 448, in _error_catcher
raise ReadTimeoutError(self._pool, None, "Read timed out.")
urllib3.exceptions.ReadTimeoutError: AWSHTTPSConnectionPool(host='ucc-filings-results-prod.s3.amazonaws.com', port=443): Read timed out.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/env/lib/python3.7/site-packages/s3_concat/utils.py", line 19, in _thread_run
response = callback(item)
File "/opt/env/lib/python3.7/site-packages/s3_concat/multipart_upload_job.py", line 108, in get_small_parts
)['Body'].read()
File "/opt/env/lib/python3.7/site-packages/botocore/response.py", line 102, in read
raise ReadTimeoutError(endpoint_url=e.url, error=e)
botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: "None"
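For reference, a minimal sketch of raising the client-side read timeout and retry count with botocore's Config; whether s3-concat can be handed such a pre-configured client or session is an assumption that would need checking against its constructor.

import boto3
from botocore.config import Config

# Sketch: an S3 client with a longer read timeout and more retries for flaky networks.
s3_client = boto3.client(
    's3',
    config=Config(
        read_timeout=120,              # botocore's default is 60 seconds
        retries={'max_attempts': 10},  # retry transient failures more aggressively
    ),
)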
Before this test, I excluded files of 5MB or less, so the step of downloading small parts locally would be skipped.
When I ran the concat on a group of files averaging about 20MB each, throughput came out to roughly 1GB per minute (around 50 files per minute), which means processing 100GB takes about 100 minutes.
I have a total of 100GB of files per hour, and I want to finish the merge within 20 minutes.
For that, I thought it would be necessary to do the processing in parallel using threads.
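Not part of the library itself, but one rough sketch of parallelizing: run one S3Concat job per prefix in a thread pool. The prefix names, output keys, and thread counts below are made up; the S3Concat/add_files/concat(small_parts_threads=...) calls follow the README's usage.

from concurrent.futures import ThreadPoolExecutor
from s3_concat import S3Concat

# Hypothetical prefixes, e.g. one per partition of the hourly 100GB input.
prefixes = ['data/part-00/', 'data/part-01/', 'data/part-02/', 'data/part-03/']

def concat_prefix(prefix):
    # One merged output file per prefix (names are illustrative).
    job = S3Concat('my-s3-bucket', f'merged/{prefix.strip("/")}.json', '100MB')
    job.add_files(prefix)
    job.concat(small_parts_threads=4)

# Run the independent concat jobs in parallel threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(concat_prefix, prefixes))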
Hi guys, I've tried using this lib and got an error.
I saw the code that generates it: the get_small_parts method.
Looking closer at this error, I found the root cause: files compressed with gzip (or another format) will raise an exception in this code. In my case my files are gzipped, and my workaround was to use the gzip lib to read the files.
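A minimal sketch of that workaround, assuming plain boto3 access to the source objects; the bucket and key names are placeholders.

import gzip
import boto3

s3 = boto3.client('s3')

# Read a gzipped object and decompress it before handing the bytes on for concatenation.
body = s3.get_object(Bucket='my-bucket', Key='path/file.json.gz')['Body'].read()
data = gzip.decompress(body)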
If SSE (KMS) is enabled on S3, there is no option to provide a KMS key to this utility, and it of course fails with an "Access Denied" error.
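For context, a hedged sketch of the SSE-KMS arguments that boto3's multipart-upload call accepts; threading these through s3-concat is exactly the missing feature, so the snippet only shows what the underlying API expects. The bucket, key, and KMS key ARN are placeholders.

import boto3

s3 = boto3.client('s3')

# boto3 accepts SSE-KMS parameters when starting a multipart upload.
resp = s3.create_multipart_upload(
    Bucket='my-sse-bucket',
    Key='merged/output.json',
    ServerSideEncryption='aws:kms',
    SSEKMSKeyId='arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID',
)
upload_id = resp['UploadId']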
from s3_concat import S3Concat

bucket = 's3-data-buck'
path_to_concat = 'customer'  ## all files under this prefix in above bucket
concatenated_file = 'concat.csv'
min_file_size = '5MB'

job = S3Concat(bucket, concatenated_file, min_file_size, content_type='application/json')
job.add_files(path_to_concat)
When I ran the above code, no error occurred and no merge happened. Is there a way to debug this?
Is anything wrong with the above call?
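Not a maintainer answer, but for reference the README's usage example ends with a concat() call, which is the step that actually performs the merge; the snippet above stops after add_files(). A sketch of the full sequence, reusing the values above:

from s3_concat import S3Concat

job = S3Concat('s3-data-buck', 'concat.csv', '5MB', content_type='application/json')
job.add_files('customer')
job.concat()  # per the README, this final call performs the multipart upload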
Hi, I found this great tool.
In some cases, developers prefer performance over file order.
It seems that add_files() with a prefix currently just adds files in alphabetical order.
Could you make the following improvement, which can significantly improve performance?
I think it would be good to add an option to the add_files() function, such as a join_from_large_files=true option.
Example of sorting by size:
>>> objects_list = []
>>> objects_list.extend([('file_A', 1000),('file_B', 1500),('file_C', 500)])
>>> objects_list
[('file_A', 1000), ('file_B', 1500), ('file_C', 500)]
>>> objects_list.sort(key=lambda x: x[1])
>>> objects_list
[('file_C', 500), ('file_A', 1000), ('file_B', 1500)]
Hi,
I manage millions of small files that need to be concatenated, and your tool works great for that.
In the case of gzipped source files, I needed to decompress each file before doing the concatenation, based on the ContentEncoding header in the S3 response.
If you're interested, I can submit a PR with the changes to automatically handle gzipped content (it's only 6 lines of code).
Maybe an arg can be added to MultipartUploadJob if you want to enable it manually.
Thanks for this lib.
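Not the author's implementation, just a sketch of what ContentEncoding-based handling could look like when a small part is downloaded; the response fields are standard boto3 get_object output.

import gzip

def read_part(s3, bucket, key):
    # Return the object bytes, decompressing when S3 reports gzip content encoding.
    resp = s3.get_object(Bucket=bucket, Key=key)
    body = resp['Body'].read()
    if resp.get('ContentEncoding') == 'gzip':
        body = gzip.decompress(body)
    return body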
Hi, great work here! Thanks for putting this together :)
I have a few questions:
1. s3://fluentd/events/012345esdf/2017/12/a.json
2. s3://fluentd/events/9872absdd/2017/12/b.json
3. s3://fluentd/events/333333dd/2017/12/c.json
What should my BUCKET string be?
Also, I didn't understand the difference between the out_path argument and the path_to_concat argument.
FYI, my files are < 5MB, so I am using small_parts_threads=4.
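Not an official answer, but by S3 URI convention the bucket is the first segment after s3:// and the rest of the path is the object key. A sketch using the files above, assuming (per the README's argument order) that out_path is the destination key for the merged file and path_to_concat is the source prefix passed to add_files:

from s3_concat import S3Concat

# s3://fluentd/events/012345esdf/2017/12/a.json
#   -> bucket = 'fluentd', key = 'events/012345esdf/2017/12/a.json'
bucket = 'fluentd'
path_to_concat = 'events/'        # prefix of the source files to concatenate
out_path = 'merged/2017-12.json'  # destination key for the merged file (illustrative)

job = S3Concat(bucket, out_path, '5MB', content_type='application/json')
job.add_files(path_to_concat)
job.concat(small_parts_threads=4)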
Hi @xtream1101
Please add support for:
import json
import os

def construct_jsonl_str(items) -> str:
    json_str = ''
    for item in items:
        json_str += json.dumps(item) + os.linesep  # workaround to convert to NDJSON
    return json_str
Also, please add a log message with the S3 key being put (probably somewhere here).
Please add compression for the concatenated file.
P.S. I can contribute the logic requested above myself, if possible.
Hi @xtream1101 !
Thanks for your work on this. It's really useful.
I see that the add_file definition needs only ContentLength, which is already present in the object metadata. We could simply make a head_object call rather than get_object, which is expensive.
def add_file(self, key):
    resp = self.s3.get_object(Bucket=self.bucket, Key=key)
    self.all_files.append((key, resp['ContentLength']))
Is there a reason to use get_object instead of head_object?
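A sketch of the suggested change; ContentLength is part of the standard boto3 head_object response, so no body download is needed.

def add_file(self, key):
    # head_object returns metadata (including ContentLength) without fetching the body
    resp = self.s3.head_object(Bucket=self.bucket, Key=key)
    self.all_files.append((key, resp['ContentLength']))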
Hi @xtream1101 , Thanks a bunch for putting this together! We are using this for our project. We detected a memory leak in this library. The threads are being created every time we invoke a concat operation and they are not being cleaned up. This is exactly where the bug is https://github.com/xtream1101/s3-concat/blob/master/s3_concat/utils.py#L36. We are currently using ThreadPool to deal with this issue but it'd be great if you can fix it here so that we can tag along and use all great features of the library that will be coming in the future. Here's a quick glance at our fix.
import logging
from multiprocessing.pool import ThreadPool as Pool

logger = logging.getLogger(__name__)


def _thread_run(item, callback):
    # Retry 3 times before giving up
    for _ in range(3):
        try:
            response = callback(item)
            return response
        except Exception:
            logger.exception("Retry failed batch of: {}".format(item))


def _threads(num_threads, data, callback):
    pool = Pool(num_threads)
    results = pool.starmap(_thread_run, [(item, callback) for item in data])
    pool.close()
    pool.join()
    return results
Please let us know if you need further details on this.
Hello,
Are you planning to add an option to delete the source files after a successful concat operation?
Thanks
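Until such an option exists, a hedged workaround sketch that deletes the source objects with plain boto3 once the concat has succeeded; the bucket and prefix are placeholders, and delete_objects accepts at most 1000 keys per call (each list_objects_v2 page stays under that limit).

import boto3

s3 = boto3.client('s3')
bucket, prefix = 'my-s3-bucket', 'data/'

# After job.concat() has completed successfully, delete the source objects.
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    keys = [{'Key': obj['Key']} for obj in page.get('Contents', [])]
    if keys:
        s3.delete_objects(Bucket=bucket, Delete={'Objects': keys})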