sync / recursive copy method? (aws.s3, 21 comments, closed)

cboettig commented on August 19, 2024
sync / recursive copy method?

from aws.s3.

Comments (21)

leeper commented on August 19, 2024

You can pass a prefix header when running GET on a bucket. This will allow you to identify "subfolders" of objects or subsets of objects. So, here's a TODO:

  • Check that these headers are getting passed to the request as we're expecting
  • Add a test of the marker and prefix headers for getobject
  • Add a higher-level function like sync to complement our other higher-level functions (s3save and s3load)
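A minimal sketch of what such a prefixed listing might look like (bucket name and keys here are hypothetical, and the argument names are taken from the current get_bucket interface, which may differ from the version discussed in this thread):

```r
library("aws.s3")

# List only the objects under a "subfolder" prefix, optionally resuming
# after a given key. S3 returns at most 1000 keys per request.
objs <- get_bucket(
  bucket = "my-bucket",            # hypothetical bucket name
  prefix = "logs/2015/",           # restrict the listing to this "subfolder"
  marker = "logs/2015/0500.txt",   # start listing after this key
  max    = 1000
)
```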

leeper commented on August 19, 2024

So, looking through things again, you definitely do not want to get/put all of the objects. Instead you want to do a PUT Object copy for every object. We haven't implemented this yet (though we have a placeholder function called copyobject already).

I've created a new function called copybucket with the idea that it will identify all objects in a bucket and then call copyobject on all of them.

cboettig commented on August 19, 2024

@leeper looks like PUT Object Copy is for stuff already elsewhere on S3, right? In the cases above I'm considering uploading and downloading (sync'ing) to a local disk somewhere; not another S3 bucket.

Still stuck though since as far as I can tell, my marker just gets ignored and I can only download the first 1000 files. Have you had a chance to try and replicate this error? Could just be me doing something stupid. Uploading is fine since it doesn't need a marker, just loops over the local files on disk (and I'm not often uploading > 1000 files anyhow, though I do need to download more than that).
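For reference, once the marker is actually honored by the service, downloading more than 1000 keys would need a pagination loop along these lines (hypothetical bucket name; get_bucket arguments assumed as in the current API):

```r
library("aws.s3")

# Page through a bucket listing by passing the last key of each page
# as the next request's marker, until a partial page is returned.
all_keys <- character()
marker <- NULL
repeat {
  page <- get_bucket("my-bucket", marker = marker, max = 1000)
  keys <- vapply(page, function(x) x$Key, character(1))
  all_keys <- c(all_keys, keys)
  if (length(keys) < 1000) break   # last (partial) page
  marker <- keys[length(keys)]     # resume after the last key seen
}
```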

leeper commented on August 19, 2024

@cboettig Oh, my bad, I misunderstood. Hmm... Have you tried running the code in a httr::with_verbose() environment to make sure that headers are getting passed as expected?

cboettig commented on August 19, 2024

yup, the marker is listed in the headers, e.g.

b$request
<request>
GET https://s3-us-west-2.amazonaws.com/packages.ropensci.org
Output: write_memory
Options:
* useragent: libcurl/7.43.0 r-curl/0.9.1 httr/1.0.0
* customrequest: GET
Headers:
* Accept: application/json, text/xml, application/xml, */*
* marker: logs/2015-06-26-23-39-10-DDF97AC1C0687956s i
* x-amz-content-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
* x-amz-date: 20150804T162618Z
* Authorization: ...

but the returned content is identical to when I omit the marker; just the alphabetically first 1000 entries each time.

leeper commented on August 19, 2024

Okay, I think the issue is here. The headers are passed to the request, but they're not being included in the signature for the request (so I guess AWS then ignores the unsigned headers?). I don't have time to test this today, but I think it may be as simple as doing:

canonical_headers = c(headers, 
                      list(host = p$hostname, 
                           `x-amz-date` = d_timestamp)),

in s3HTTP.
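For context, under AWS Signature Version 4 only the headers listed in the canonical request are covered by the signature. A simplified, self-contained sketch of how that canonical header block is assembled (this is not the actual s3HTTP code):

```r
# SigV4-style canonical header block: lowercase names, sorted, one
# "name:value\n" line each, plus a ";"-joined list of signed header
# names. Headers left out of this block are not protected by the
# signature.
canonicalize_headers <- function(headers) {
  nm <- tolower(names(headers))
  ord <- order(nm)
  vals <- trimws(unlist(headers))
  list(
    canonical = paste0(nm[ord], ":", vals[ord], "\n", collapse = ""),
    signed    = paste(nm[ord], collapse = ";")
  )
}

canonicalize_headers(list(Host = "s3.amazonaws.com",
                          `x-amz-date` = "20150804T162618Z"))
```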

cboettig commented on August 19, 2024

Okay, good catch. Does this mean s3HTTP should just be including everything it gets in the header argument as part of the signature then?

leeper commented on August 19, 2024

It may. I was under the impression that we didn't need to include all of the headers in the signature, but I don't know why I thought that.

cboettig commented on August 19, 2024

Hmm, I tried just adding everything we pass in the headers argument to the canonical headers in the signature, and that made a lot of unit tests very sad: https://travis-ci.org/cloudyr/aws.s3/builds/74316941

leeper commented on August 19, 2024

😞 I guess it would be helpful if our tests were a bit more verbose on failure so that we could actually debug them. Do these appear to be signature errors? (I mean, I guess they have to be, but I'm wondering what the actual error message from AWS is.)

cboettig commented on August 19, 2024

@leeper right, it's the standard signature error, not super informative:

$Code
[1] "SignatureDoesNotMatch"

$Message
[1] "The request signature we calculated does not match the signature you provided. Check your key and signing method."

cboettig commented on August 19, 2024

@leeper ah ha! When all else fails, read the documentation.

It looks to me like we're passing stuff as headers that ought to be passed as parameters, e.g. GET /?prefix=N&marker=Ned&max-keys=40. That's why they are being ignored, and that's why they cause trouble when we try to include them in the signature.

I think we need to add support for passing URL query parameters in s3HTTP.
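With httr, that looks roughly like this (hypothetical bucket; the request is shown unsigned purely to illustrate where the parameters go):

```r
library("httr")

# List-style parameters go in `query`, not `headers`; httr encodes them
# into the URL so the request line becomes
#   GET /?prefix=N&marker=Ned&max-keys=40
r <- GET("https://s3-us-west-2.amazonaws.com/my-bucket",
         query = list(prefix = "N", marker = "Ned", `max-keys` = 40))
```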

leeper commented on August 19, 2024

Ah, I hope that's the case. We had this before when we had the url argument, but we've since lost that. Let's add a query argument and make sure it's all handled correctly relative to what's currently here: https://github.com/cloudyr/aws.s3/blob/master/R/http.r#L50.

cboettig commented on August 19, 2024

Right, I don't think it belongs in the url argument; a query should be passed to an httr::GET request via the query parameter, e.g. GET(url, query = query), not baked into the URL for parse_url to extract and then add back.

leeper commented on August 19, 2024

Yes, exactly. It's much cleaner and follows httr style more closely.

leeper commented on August 19, 2024

@cboettig I'm coming back to this but not really sure what actions have to be taken to address this. Can you let me know if any changes still need to be made?

cboettig commented on August 19, 2024

Basically just matching the sync function in the AWS CLI for S3, which is a higher-level loop over the get and put methods.

leeper commented on August 19, 2024

Okay, great. Thanks, @cboettig.

For my reference, here's the documentation page for the sync command from the CLI: http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html

leeper commented on August 19, 2024

I've just pushed a really rough mockup of this. Some notes so far:

  1. it doesn't handle the case where there are >1000 objects in a bucket
  2. the response value should probably be more useful
  3. it will probably fail ungracefully in lots of cases

leeper commented on August 19, 2024

This should now be done. I've added a verbose argument so it should print everything it's doing. If you encounter issues, please open a new issue.
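For anyone landing here, usage is roughly as follows (the bucket name is hypothetical, and the argument names may have changed in later releases; check ?s3sync in your installed version):

```r
library("aws.s3")

# Sync a local directory against a bucket, printing each action taken.
s3sync(path = "local/dir", bucket = "my-bucket", verbose = TRUE)
```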

cboettig commented on August 19, 2024

@leeper Looks like sync is a two-way sync? That seems a bit odd; the more classical convention used by the aws CLI, sync some/source/path s3://some/destination, seems more intuitive. (It's also nice there that either the source or the destination, or both, can be S3 paths.)

More generally, I'm curious about your thoughts on aligning the R package interface with the aws CLI options.
