S3 file storage support for Invenio.
The package offers integration with any object storage compatible with the S3 REST API.
Further documentation is available at https://invenio-s3.readthedocs.io/
Home Page: https://invenio-s3.readthedocs.io
License: MIT License
The name of the bucket from the S3 configuration somehow gets repeated as part of the path, so that files are written to s3://bucket_name/bucket_name/files.
When uploading big files, the total number of parts exceeds the maximum (1000) and makes the upload fail. This probably happens because s3fs doesn't use the file size to calculate the chunk size; it uses the default value (5 MB) if no other value is provided.
Maybe we can assume that the file size is passed as a parameter (somehow making it mandatory) and, when instantiating the s3fs object, pass the chunk size default_block_size, which should be the max of 5 MB and file_size/1000 (5 MB is the smallest we can go). This works in my head, but I haven't tried it yet, and there might be some issues.
Other alternatives are welcome.
Package version (if known): master
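The proposal above can be sketched as a small helper. The 5 MB floor and the 1000-part limit are the figures quoted in this issue (assumptions, not values read from the invenio-s3 code), and the helper name is hypothetical:

```python
# Sketch of the proposed fix: derive s3fs's `default_block_size`
# from the known file size so the multipart upload never needs
# more than MAX_PARTS parts. Values below mirror the issue text.

MIN_PART_SIZE = 5 * 1024 * 1024  # 5 MB, the smallest part size we can go
MAX_PARTS = 1000                 # maximum number of parts quoted above


def block_size_for(file_size):
    """Return a part size that keeps the part count within MAX_PARTS."""
    # Ceiling division so the part size is never undershot.
    needed = -(-file_size // MAX_PARTS)
    return max(MIN_PART_SIZE, needed)
```

The result would then be passed as `default_block_size` when instantiating the s3fs object, which requires the file size to be known up front.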
When calculating the part size we use the default integer (floor) division:
size // current_app.config['S3_MAXIMUM_NUMBER_OF_PARTS']
https://github.com/inveniosoftware/invenio-s3/blob/master/invenio_s3/storage.py#L26
Because the part size is rounded down whenever the division has a remainder (3.1 becomes 3 rather than 4), this can (and will) result in uploading more parts than the maximum allowed number (max + 1).
The number of parts should never exceed S3_MAXIMUM_NUMBER_OF_PARTS
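A minimal sketch of the fix, using ceiling instead of floor division; the function name and the illustrative limit are assumptions, not the actual invenio-s3 code:

```python
import math


def part_size(size, max_parts):
    """Smallest part size whose resulting part count stays within max_parts."""
    # Ceiling instead of floor division: rounding the part size *up*
    # rounds the resulting part count *down*.
    return math.ceil(size / max_parts)
```

For example, with size=31 and max_parts=10, floor division gives a part size of 3 and therefore 11 parts (over the limit), while ceiling division gives 4 and 8 parts.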
Right now, when asked for the checksum of a file, we digest the file and calculate it application-side. It would be nice to return the value the storage server gives us directly, similar to https://github.com/inveniosoftware/invenio-xrootd/blob/master/invenio_xrootd/storage.py#L60
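One hedged caveat for this: S3 returns an ETag with each object, which for single-part uploads is the MD5 of the content, but for multipart uploads it is not a plain MD5 (it contains a '-'), so a fallback to the application-side digest would still be needed. The helper below (a hypothetical name, not invenio-s3 API) only parses the ETag string; fetching it from the store, e.g. via s3fs, is left out:

```python
def checksum_from_etag(etag):
    """Map an S3 ETag to a checksum string, or None if it is not a plain MD5."""
    etag = etag.strip('"')  # S3 returns the ETag wrapped in quotes
    if etag and '-' not in etag:  # multipart-upload ETags contain a '-'
        return 'md5:{}'.format(etag)
    return None  # caller falls back to the application-side digest
```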
Hello.
Is there anything that we can do to increase the upload speed to an S3 service via invenio-s3?
I compared the upload speed obtained in our app versus a direct upload to S3 with boto3 (from the same machine that serves our app), and I am getting very different results. For a 1 GB file uploaded through our app, we first see 150-200 Mbps of data transfer from the browser for about 1 minute, with gunicorn sitting at 99% CPU; then for about 2 minutes we see no upload from the browser, while gunicorn sits at 10-15% CPU, until the browser finally receives a 200 response (3 minutes in total). With a direct upload to S3 via boto3, instead, it takes about 13 seconds in total.
To simplify testing, I'm using a simple Flask view, in which the following lines do the job:
from flask import request
from invenio_s3 import S3FSFileStorage

f = request.files['file']
s3fs = S3FSFileStorage('s3://test_s3/test-file-2')
s3fs.initialize(size=0, acl='private')
s3fs.update(f.stream, acl='private')
In the real app, we actually create a record with invenio_deposit.api.Deposit.create(), then attach the file to the record, but we see the same speed as in this simple test.
Our setup is: Apache2 acting as front-line server, with a reverse proxy to gunicorn on the same machine. Whether or not DEBUG=True is set in config.py does not seem to make a difference here.
We are actually using our own fork of invenio-s3, with some changes that we needed to make it work (I opened PR #8 in case you find them useful), but I don't think they are relevant to the issue.
I also found some code to profile requests to gunicorn: I'll paste below the result, but I'm not quite sure how to interpret it.
Thanks a lot in advance for the help!
[POST] URI /s3/upload
130142856 function calls (130133607 primitive calls) in 226.399 seconds
Ordered by: internal time, cumulative time
List reduced from 1503 to 30 due to restriction <30>
ncalls tottime percall cumtime percall filename:lineno(function)
413 96.123 0.233 96.123 0.233 {method 'poll' of 'select.poll' objects}
269586 15.689 0.000 15.689 0.000 {method 'read' of '_ssl._SSLSocket' objects}
8372516 14.946 0.000 42.725 0.000 /home/ubuntu/.virtualenvs/archive/lib/python3.5/site-packages/werkzeug/wsgi.py:733(_iter_basic_lines)
16745019 13.118 0.000 28.627 0.000 /home/ubuntu/.virtualenvs/archive/lib/python3.5/tempfile.py:903(write)
2 12.813 6.406 99.145 49.573 /home/ubuntu/.virtualenvs/archive/lib/python3.5/site-packages/werkzeug/formparser.py:531(parse_parts)
16737045 12.508 0.000 12.508 0.000 {method 'write' of '_io.BufferedRandom' objects}
16745022 11.310 0.000 57.705 0.000 /home/ubuntu/.virtualenvs/archive/lib/python3.5/site-packages/werkzeug/formparser.py:427(parse_lines)
2078 7.606 0.004 7.606 0.004 {method 'update' of '_hashlib.HASH' objects}
1048577 5.407 0.000 19.415 0.000 /home/ubuntu/.virtualenvs/archive/lib/python3.5/site-packages/gunicorn/http/body.py:112(read)
8372516 3.666 0.000 46.394 0.000 /home/ubuntu/.virtualenvs/archive/lib/python3.5/site-packages/werkzeug/wsgi.py:687(make_line_iter)
131285 3.255 0.000 3.255 0.000 {method 'write' of '_ssl._SSLSocket' objects}
206 3.100 0.015 3.100 0.015 {method 'read' of '_io.BufferedRandom' objects}
16745019 2.995 0.000 2.997 0.000 /home/ubuntu/.virtualenvs/archive/lib/python3.5/tempfile.py:792(_check)
3434517 2.814 0.000 2.814 0.000 {method 'write' of '_io.BytesIO' objects}
1312794 2.354 0.000 10.290 0.000 /home/ubuntu/.virtualenvs/archive/lib/python3.5/site-packages/gunicorn/http/unreader.py:21(read)
132917 2.079 0.000 2.079 0.000 {method 'read' of '_io.BytesIO' objects}
16386 1.822 0.000 22.031 0.001 /home/ubuntu/.virtualenvs/archive/lib/python3.5/site-packages/gunicorn/http/body.py:199(read)
16385 1.733 0.000 1.733 0.000 {method 'splitlines' of 'bytes' objects}
8425799 1.617 0.000 1.617 0.000 {method 'append' of 'list' objects}
8601967 1.136 0.000 1.136 0.000 {built-in method builtins.len}
8391318 1.126 0.000 1.185 0.000 {method 'join' of 'bytes' objects}
1048577 1.022 0.000 1.832 0.000 /home/ubuntu/.virtualenvs/archive/lib/python3.5/site-packages/gunicorn/http/unreader.py:53(unread)
2362416 0.629 0.000 0.629 0.000 {method 'seek' of '_io.BytesIO' objects}
131285 0.438 0.000 4.185 0.000 /usr/lib/python3.5/ssl.py:881(sendall)
3716346 0.427 0.000 0.427 0.000 {method 'tell' of '_io.BytesIO' objects}
1065394 0.409 0.000 0.409 0.000 {built-in method builtins.min}
2113508 0.404 0.000 0.404 0.000 {method 'getvalue' of '_io.BytesIO' objects}
269586 0.379 0.000 16.350 0.000 /usr/lib/python3.5/ssl.py:783(read)
264250 0.362 0.000 6.925 0.000 /usr/lib/python3.5/ssl.py:907(recv)
11 0.332 0.030 0.332 0.030 /usr/lib/python3.5/json/decoder.py:345(raw_decode)
Currently only one bucket and one endpoint are supported. Some use cases require multiple buckets/endpoint URLs; the config should be changed to allow that.
If the region name is not configured, it will automatically be mapped to 'us-east-1'. This might be a problem for AWS users outside the US or for users of other S3 implementations (Ceph S3 seems to work fine with 'us-east-1').
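Pinning the region explicitly in the application config avoids the 'us-east-1' fallback. A sketch of what that might look like, where the exact key names (notably S3_REGION_NAME) are assumptions that should be checked against your invenio-s3 version:

```python
# Illustrative invenio-s3 configuration values; key names and
# endpoint are assumptions, replace with your deployment's values.
S3_ENDPOINT_URL = 'https://s3.eu-west-1.amazonaws.com'
S3_REGION_NAME = 'eu-west-1'
S3_ACCESS_KEY_ID = 'CHANGE_ME'
S3_SECRET_ACCESS_KEY = 'CHANGE_ME'
```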
invenio-s3/invenio_s3/__init__.py
Line 21 in bb03233
This should not include the domain URL, just the bucket name.