
Comments (11)

laurilehmijoki commented on June 3, 2024

Thanks for paying attention to the number of requests. There's indeed a lot of room for optimising the HTTP traffic that s3_website generates.

The HTTP HEAD requests you are seeing are related to the "calculate new or changed files" feature of s3_website. Because of that feature, s3_website checks whether a file has changed before uploading it.
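
Roughly, that per-file check works like this (an illustrative Ruby sketch, not the actual s3_website code; it assumes the modern aws-sdk-s3 gem rather than the SDK version s3_website currently uses):

    require 'digest'
    require 'aws-sdk-s3'  # assumption: the v3 SDK, not the AWS gem s3_website actually depends on

    # One HEAD request per local file: this is why the push scales with the
    # total number of files rather than with the number of changed files.
    def changed_files(s3, bucket, local_paths)
      local_paths.select do |path|
        local_md5 = Digest::MD5.file(path).hexdigest
        begin
          remote = s3.head_object(bucket: bucket, key: path)  # 1 HTTP request per file
          remote.etag.delete('"') != local_md5                # changed => upload
        rescue Aws::S3::Errors::NotFound
          true                                                # new file => upload
        end
      end
    end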

Regarding your idea

Your idea of capturing the ETag values from the GET Bucket response sounds good.

The gzip info does not seem to be available in the GET Bucket response. We could work around that shortcoming by first assuming that the file is not gzipped. If none of the ETags in the GET Bucket response matched the MD5 of the file being uploaded, we could then assume that the file is gzipped on the S3 bucket and try to match the MD5 again. However, this would leave room for some nasty MD5 hash collision bugs that would be very difficult to understand.
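
For illustration, the fallback could look something like this (a rough Ruby sketch with made-up names, assuming remote_etags already holds the key => ETag pairs collected from the GET Bucket response):

    require 'digest'
    require 'zlib'
    require 'stringio'

    # Pass 1: compare against the MD5 of the raw content. Pass 2: assume the S3
    # object is gzipped and compare against the MD5 of the locally gzipped content.
    # Note: the gzip header embeds an mtime, so both sides have to gzip
    # deterministically for the second comparison to be reliable.
    def already_uploaded?(remote_etags, key, local_path)
      plain_md5 = Digest::MD5.file(local_path).hexdigest
      return true if remote_etags[key] == plain_md5

      buffer = StringIO.new
      gz = Zlib::GzipWriter.new(buffer)
      gz.write(File.binread(local_path))
      gz.finish  # write the gzip footer without closing the StringIO
      remote_etags[key] == Digest::MD5.hexdigest(buffer.string)
    end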

What do you think of the above implementation plan?

Other stuff

In issue #16, @bkaid proposes a --force option that would skip the diff calculation altogether. The force option could be beneficial for you. No-one has implemented it yet, though.

In addition, we could speed up s3_website by switching to EventMachine. The AWS SDK Ruby gem that s3_website currently uses has a connection throttling mechanism which sets an upper bound on s3_website's level of concurrency.


spudbean commented on June 3, 2024

Thanks for your detailed comments, @laurilehmijoki!

Yes, the complication is that we have the remote MD5 hash, and we know if the file should be gzipped according to the local s3_website.yml, but we don't know if the remote file is gzipped. Therefore, we don't know if the remote MD5 is the hash of the raw content, or the gzipped content.

Maybe we can assume a good default most of the time, but use command-line toggles to let users resolve the ambiguity:

  • By default, s3_website assumes that if a local file would be gzipped according to s3_website.yml, then the remote file is gzipped too. Therefore, we use the MD5 found in the GET Bucket response and do not do a HEAD per object.
  • --check-compression tells s3_website to do a HEAD for each file, to check the remote gzip encoding. Users should use this option each time they change the gzip parameters in s3_website.yml.
  • --force tells s3_website to avoid doing a "diff" at all, and just re-upload every file.

Alternatively, we could assume that users do not change their gzip parameters very often, not bother with --check-compression, and simply tell such users to use --force?


laurilehmijoki commented on June 3, 2024

The --check-compression switch sounds like something that might be difficult to explain.

What do you think of my proposal above? It should be fine to first assume that the file is not gzipped on S3. If the hash in the GET Bucket response matches the hash of the local file, we can conclude that the file does not need to be uploaded. If the hash of the S3 object doesn't match the local file, we gzip the local file and see if the hash of the gzipped result matches.

Do you spot any problems in the above algorithm?


laurilehmijoki commented on June 3, 2024

I just changed the title of this issue to "Improve performance of the push operation".


edwardball commented on June 3, 2024

Similar thought here. If I need to correct a typo on a single page, I've only actually changed one file, but the diff takes a long time because it has to go through every file in the site.

If you do decide to implement a --force option, I would find it really useful to be able to specify just a single file to upload.


pjanik commented on June 3, 2024

We have a website which consists of ~15,000 files, so we are currently working on this issue (the diff operation is so slow that it's impossible to use s3_website in its current form). When we have things working well and tested, I will try to prepare a pull request.

The solution should be reasonably easy (the GET Bucket requests mentioned above), but I'm wondering about gzipped files. At the moment s3_website always compares the MD5 of the raw content, and the ideas presented above try to keep this behaviour. Actually, I'm wondering why? It means that when you change the gzip settings in the config, s3_website will ignore files that should now be gzipped and re-uploaded. An easy solution is to manually delete the remote files that should now be gzipped, but that feels like a workaround to me.

I'm wondering: why can't we just gzip the local files first (according to the configuration) and then compare the MD5 of the gzipped and remote files, completely ignoring whether the remote file is really gzipped or not?

Possible cases:

  • If the remote file is gzipped and the content hasn't changed, the MD5s will match => no upload ✅
  • If the remote file is gzipped but the content has changed, the MD5s won't match => upload ✅
  • If the remote file isn't gzipped, the MD5s won't match regardless of whether the content has changed => upload. This seems ✅ to me, as I expect the gzipped version on the server, but it's different from the current behaviour.

There are some cases that may be problematic in theory, e.g.:

  • The MD5 of the gzipped file with the old content matches the MD5 of the non-gzipped file with the new content.

But that seems to be about as likely as any other MD5 collision.
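
In other words (an illustrative Ruby sketch with made-up names, not the actual filey-diff code): hash each local file exactly as it would be uploaded, and compare that hash with the ETag from the GET Bucket response.

    require 'digest'
    require 'zlib'
    require 'stringio'

    # remote_etags: key => MD5 taken from the GET Bucket response.
    # gzip_this_file reflects the gzip settings in s3_website.yml.
    def needs_upload?(remote_etags, key, local_path, gzip_this_file)
      content = File.binread(local_path)
      if gzip_this_file
        buffer = StringIO.new
        gz = Zlib::GzipWriter.new(buffer)
        gz.write(content)
        gz.finish
        content = buffer.string
      end
      remote_etags[key] != Digest::MD5.hexdigest(content)
    end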

Am I missing something here? What do you think? Any help or tips would be greatly appreciated.


laurilehmijoki commented on June 3, 2024

I'm trying to wrap my head around this performance problem by rewriting the whole s3_website push command in Scala. If I can find the time to come up with something working, I will let you know.

@pjanik, you are probably on the right track. I cannot say for sure, as I have not put enough thought into this subject.


laurilehmijoki commented on June 3, 2024

Here is more info on my plans to rewrite the push operation in Scala: #86.

Any comments on the topic are welcome.


pjanik commented on June 3, 2024

I feel that the key things are:

  1. Use GET Bucket.
  2. Avoid all HEAD and GET requests in the filey-diff S3 data source.
  3. Optimise some slow methods (e.g. select_in_outer_array in filey-diff is O(n^2), which doesn't work well for 15,000 files; see the sketch after this list).
  4. Solve issues with multithreading support (we have to use disable_parallel_processing=true, as otherwise we randomly receive EOF errors).
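
To illustrate point 3 (made-up data, not the actual filey-diff code): a nested scan compares every local file against the whole remote list, whereas indexing the remote files by key once makes the lookup roughly O(n).

    FileEntry = Struct.new(:key, :md5)

    local_files  = [FileEntry.new('index.html', 'aaa'), FileEntry.new('css/site.css', 'bbb')]
    remote_files = [FileEntry.new('index.html', 'aaa')]

    # O(n^2): for every local file, scan the entire remote list.
    changed_slow = local_files.select do |local|
      remote_files.none? { |remote| remote.key == local.key && remote.md5 == local.md5 }
    end

    # O(n): build a key => file index once, then do constant-time lookups.
    remote_by_key = remote_files.each_with_object({}) { |remote, index| index[remote.key] = remote }
    changed_fast = local_files.select do |local|
      remote = remote_by_key[local.key]
      remote.nil? || remote.md5 != local.md5
    end

    # Both versions report only css/site.css as changed, but the second one scales to 15,000 files.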

Of course, the approach with Scala and fancier techniques also sounds great, but it seems to be a lot of work as well. I feel that points 1-4 would bring a completely new level of quality and should be reasonably easy to achieve even with the current implementation.

I've implemented a quick proof of concept focusing on 1-3:
https://github.com/concord-consortium/s3_website/tree/WIP-fast-diff
https://github.com/concord-consortium/filey-diff/tree/WIP-fast-diff

Main changes:

  • GET Bucket is used.
  • ignore_on_server rules are taken into account during the calculation.
  • Files are gzipped locally before the MD5 is calculated (according to the config settings, of course).
  • The ETag is always used as the source of the MD5 for remote files; HEAD and GET requests are not sent anymore (see the sketch after this list).
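
For example, the key => MD5 map can be built from the GET Bucket (ListObjects) responses alone. A minimal sketch, assuming the modern aws-sdk-s3 gem rather than the SDK version the gem currently uses (names are illustrative):

    require 'aws-sdk-s3'  # assumption: the v3 SDK

    # Each ListObjects page covers up to 1000 keys, so a 15,000-file site needs
    # around 15 requests instead of one HEAD request per file.
    def remote_etags(bucket_name)
      s3 = Aws::S3::Client.new
      etags = {}
      s3.list_objects_v2(bucket: bucket_name).each do |page|  # paginates automatically
        Array(page.contents).each { |object| etags[object.key] = object.etag.delete('"') }
      end
      etags
    end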

It seems to work well enough for our needs. We can now deploy 15,000 files, whereas previously the diff operation couldn't finish at all. Now the diff is pretty fast and the whole upload takes about 12 minutes. If we also had multithreading support fixed, it would be great.

These changes also affect the behaviour of the push operation after an update to the gzip configuration (as I described in the previous post). Push now always updates files that should be gzipped / non-gzipped automatically, which I find beneficial.

Issues:

  • I didn't update the tests, so they are failing at the moment. I'm planning to take a look at that.
  • When multipart upload is used, the ETag doesn't represent the MD5 of the file content. In that case the push operation always detects that the file has changed and re-uploads it. This can be solved fairly easily, as such ETags have a special form (suffix: -<no_of_parts>), so we can detect them and e.g. read custom metadata with the correct MD5 using a single HEAD request (see the sketch after this list). However, I guess that the current implementation also doesn't handle this well for non-gzipped files (in that case only the -<no_of_parts> suffix seems to be removed, but the rest of the ETag isn't a correct MD5 anyway).
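
Detecting such ETags is straightforward, since a multipart ETag is an MD5 of the part MD5s plus a part count. A small illustrative sketch (not the actual implementation):

    # Multipart ETags look like "9b2cf535f27731c974343645a3985328-5":
    # 32 hex characters followed by "-<no_of_parts>". They are not the MD5 of the
    # whole object, so they need a fallback (e.g. a single HEAD request for metadata).
    MULTIPART_ETAG = /\A[0-9a-f]{32}-\d+\z/

    def multipart_etag?(etag)
      !(etag.delete('"') =~ MULTIPART_ETAG).nil?
    end

    multipart_etag?('"9b2cf535f27731c974343645a3985328-5"')  # => true
    multipart_etag?('9b2cf535f27731c974343645a3985328')      # => false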


laurilehmijoki commented on June 3, 2024

I took a look at the commits on your forks. They look great. I'd like to merge them immediately, because there are other people in addition to you who would benefit from the performance improvement.

The integration tests are written with Cucumber and VCR. This means that if the HTTP interactions change in any significant way, we have to record the VCR cassettes again. This is quite a lot of work.

I've been quite pedantic with the tests, because I don't feel good about releasing uncovered lines of Ruby code. The dynamic nature of the language calls for extensive testing. But let's face it: the performance issue is so important to solve that I'd like to reconsider the value of the Cucumber+VCR tests – maybe they should be removed.

However, I don't feel comfortable being the maintainer of a tool that has a multitude of features which are not tested.

To recap, it seems that we are facing the following options here:

  • Remove the Cucumber tests that are too expensive to maintain
  • OR: re-record the Cucumber tests
  • OR: discard the valuable performance improvement that you have written
  • OR: can you come up with other options?

Regarding the first two points: I doubt I can support a tool that is missing those tests. Moreover, I don't have the time or motivation to rewrite them at the moment.

Regarding the third point: many users would benefit remarkably from the performance improvement, and we should strive to get it released.


laurilehmijoki commented on June 3, 2024

Thanks to pull request #88 sent by @pjanik, version 1.7.5 of this gem is now much faster!

