
cloudyr / aws.s3

Amazon Simple Storage Service (S3) API Client

Home Page: https://cloud.r-project.org/package=aws.s3

R 99.04% HTML 0.12% Shell 0.44% Makefile 0.40%
amazon aws aws-s3 cloudyr r r-package s3 s3-storage

aws.s3's People

Contributors

acolum, almartin82, an2deg, bart1, blavoie, brendan-r, cboettig, clesiemo3, da505819, davidpitkin, jackstat, jmorten, jon-mago, kaneplusplus, leeper, michaelchirico, nirmalpatel, nqbao, patrick-miller, reinholdsson, rujudk, s-u, sakins-turner, schuemie, serenthia, sheffe, thierryo, washcycle, zhixunwang


aws.s3's Issues

Error handling tests

All of our current tests cover successful operations. We also need to test error conditions to make sure they return what is expected; a sketch of the idea follows below.
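
A minimal sketch of the kind of test intended, assuming testthat and that a request against a nonexistent bucket surfaces an error (whether the package stops or only warns may vary by version):

library(testthat)
test_that("requests against a nonexistent bucket fail loudly", {
  expect_error(get_bucket("this-bucket-should-not-exist-awss3-tests"))
})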

Uploading Public Files

Is it possible to upload files with a public policy?
Currently, after uploading the file, I have to go into the AWS console to make it public.
I tried: aws.s3::putobject("myFile", "myBucketPath", headers = c(ACL = "public-read"))
Without success. The file is uploaded, but it remains private.
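
For what it's worth, the S3 API expects the canned ACL in an x-amz-acl request header rather than an ACL header; a hedged sketch (bucket and object names are illustrative, and newer versions of put_object() also expose explicit acl/headers arguments):

put_object(file = "myFile", object = "myFile", bucket = "myBucket",
           headers = c("x-amz-acl" = "public-read"))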

Provide a way to access S3 via AWS IAM roles

I'd like a way to access S3 without a key and secret, relying on instance roles instead.

Approach:

$ curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
myrole
$ curl http://169.254.169.254/latest/meta-data/iam/security-credentials/myrole
{
  "Code" : "Success",
  "LastUpdated" : "2016-08-22T23:42:03Z",
  "Type" : "AWS-HMAC",
  "AccessKeyId" : "",
  "SecretAccessKey" : "",
  "Token" : "",
  "Expiration" : "2016-08-23T06:06:28Z"
}

The token is definitely required here -- you can't just use the secret and key. Since the tokens expire, you tend to query for a new token some time before the expiration.
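
A hedged sketch of what client-side support might look like, using the standard EC2 instance-metadata endpoint shown above (the AWS_SESSION_TOKEN variable name is an assumption about where the package would read the token from):

library(httr)
library(jsonlite)

# Discover the role attached to the instance, then fetch its temporary credentials
base <- "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
role <- content(GET(base), as = "text", encoding = "UTF-8")
creds <- fromJSON(content(GET(paste0(base, role)), as = "text", encoding = "UTF-8"))

# Expose the temporary credentials where an S3 client could pick them up
Sys.setenv(
  AWS_ACCESS_KEY_ID = creds$AccessKeyId,
  AWS_SECRET_ACCESS_KEY = creds$SecretAccessKey,
  AWS_SESSION_TOKEN = creds$Token
)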

Read csv or excel file

Hi, I'm trying to read a .csv or .xlsx file from S3 directly using this package.

I get the file, but in a format I can't make sense of.

Do I need to do something different with the get_object function to read .csv or .xlsx files?
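
get_object() returns the body as a raw vector, so the usual route is to convert it to text (for CSV) or write it to a temporary file (for xlsx). A hedged sketch, with illustrative object and bucket names:

# CSV: convert the raw body to text and parse it
obj <- get_object("myfile.csv", bucket = "mybucket")
df <- read.csv(text = rawToChar(obj))

# xlsx: write the raw body to a temporary file and read it with a spreadsheet package
tmp <- tempfile(fileext = ".xlsx")
writeBin(get_object("myfile.xlsx", bucket = "mybucket"), tmp)
df2 <- readxl::read_excel(tmp)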

Write an s3source() function

This will be a shallow wrapper around save_object() and source():

s3source <- function(object, bucket, ...) {
    tmp <- tempfile() # or maybe a raw connection
    save_object(object, bucket, file = tmp)
    source(file = tmp, ...)
}

This will be really useful for quickly executing code on EC2 based upon code (and other files) uploaded from a local machine.

More useful out-of-region errors

Apparently errors for out-of-region bucket requests currently look a little cryptic (see #45). It would be good to clearly communicate what is going on.

Change README install instructions

This should use a drat-centered installation workflow rather than devtools::install_github(). drat will be much more reliable because only stable releases can be installed.
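
A hedged sketch of what the README could recommend instead (the cloudyr drat repository URL is the one the cloudyr project advertises; treat it as an assumption here):

# Install the latest stable release from the cloudyr drat repository
install.packages("aws.s3", repos = c(cloudyr = "https://cloudyr.github.io/drat", getOption("repos")))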

Creating a new folder

@leeper

Thanks for the excellent package. Just wondering if there is a method to create a new folder in a
bucket using aws.s3?

Thanks.
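
For what it's worth, S3 has no real folders; clients typically create a zero-byte object whose key ends in a slash. A hedged sketch (assumes put_object() accepts a raw vector for file, which only newer versions do; an empty temporary file works the same way otherwise):

put_object(file = raw(0), object = "newfolder/", bucket = "mybucket")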

Error in UseMethod("xmlSApply")

I've installed this wonderful-looking package and tried to retrieve a list of files in one of my S3 buckets, but I seem to be getting the following error:

> aws.s3::getbucket(bucket = "kaggle.ml.data")
No encoding supplied: defaulting to UTF-8.
Error in UseMethod("xmlSApply") :
  no applicable method for 'xmlSApply' applied to an object of class "c('xml_document', 'xml_node')"

Platform:
Mac OS X 10.11.3 (El Capitan)

R version:
R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin13.4.0 (64-bit)

Signature validation error on keys with certain unsafe characters

We were seeing a signature validation error being returned from the save_object() and put_object() calls (and likely others we didn't test), but not on the get_bucket() call. Eventually I tracked it down to the fact that we have '=' in our keys (storing partitioned Hive tables on S3, so part of the path is '/product_id=1/date=2016-01-01/' for example). get_bucket() puts the prefix in the query parameter, which ends up somewhere in the body of the request; whereas save_object() and put_object() put the key in the path parameter, which then gets appended to the url of the request itself. Manually URL encoding the '=' symbol with '%3D' fixed the issue on the previously failing calls, but for API cleanliness it might be something worth having the library itself handle.
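
Until the library escapes keys itself, a hedged workaround sketch (key and bucket names are illustrative):

key <- "tables/product_id=1/date=2016-01-01/part-00000"
save_object(gsub("=", "%3D", key, fixed = TRUE), bucket = "mybucket")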

sync / recursive copy method?

Is there a natural way to go about syncing / recursively copying all the files from a bucket (or subdirectory thereof) to the local machine, or vice versa?

On the upload side, I found looping over putobject to work perfectly well (awesome work on getting the Put method sorted out!).

On the download side, I've had considerably more trouble. The first problem is not knowing what files to download. I've tried looping over getbucket calls to get a list of all files. My problem is made a bit worse by working with a bucket that has ~4K files (these are actually just the small log files that the S3 bucket writes automatically when you turn on logging; my plan is to parse them to get download / traffic measurements on publicly exposed S3 buckets, which may be of more general interest).

Anyway, since the API cuts off after 1000, I try some looping, passing new values of marker as the key from the last item in the content list of the previous call. I'm not having much luck with this, as I seem to get the same 1000 returned values regardless of the value of marker in getbucket. Not sure if that's a bug or me doing something wrong. Has anyone tested that marker is not just being ignored when it's added to the header of the requests made by getbucket?

My errors aside, I wonder if there's a more natural way to do this than to loop over getbucket and then loop over getobject for all 4K objects. Maybe that's fine but could just use a friendly wrapper (see the sketch below). The aws cli provides both a --recursive flag to its cp command and the more concise sync command. Any thoughts?
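
A hedged sketch of a download-sync loop over get_bucket()/save_object(), assuming marker-based paging works as documented (argument and function names follow the newer snake_case API):

s3_download_all <- function(bucket, dest = ".", ...) {
  marker <- NULL
  repeat {
    b <- get_bucket(bucket, marker = marker, max = 1000, ...)
    if (length(b) == 0) break
    keys <- vapply(b, `[[`, character(1), "Key")
    for (k in keys) {
      local_path <- file.path(dest, k)
      dir.create(dirname(local_path), recursive = TRUE, showWarnings = FALSE)
      save_object(k, bucket = bucket, file = local_path, ...)
    }
    if (length(keys) < 1000) break
    # Resume listing after the last key returned in this page
    marker <- keys[length(keys)]
  }
  invisible(dest)
}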

order of arguments in object methods

Okay, this seems a bit nit-picky, but there's some inconsistency in how the various object methods order their arguments; in particular, whether they take a file name first or a bucket name first. I'd argue that they should always take the file name first, as the bucket name is more of a parameter, and this makes it more concise to lapply over a list of files or whatnot.

But maybe it's not worth breaking the package API just to make that change. (and maybe the current format is more consistent with other wrappers like boto?) Anyway, just thought I'd mention it.

object methods shouldn't set parsed_response (or other s3HTTP defaults) internally

Some of the object wrappers like getobject set values of arguments to s3HTTP() internally in a way that isn't exposed in their own arguments. E.g., getobject explicitly sets parsed_response = FALSE in its call to s3HTTP, meaning that this argument can no longer be passed through the ... of the getobject() call as one might expect from the documentation.

As a result, there is no way to get an unparsed response from these functions, as they work by getting an unparsed response from s3HTTP() and then parsing it themselves.

I understand that the default parsing strategy of s3HTTP may not be appropriate for all the object methods. If we cannot modify the s3HTTP parsing strategy to do the right thing in each case and want custom parsing for each object method, then maybe we should set the s3HTTP parse method to be FALSE by default, such that the object methods could also respect the parsed_response = FALSE argument and return an unparsed http response object when asked to do so?

Code coverage service

Is coveralls not showing numbers because we haven't authorized them on cloudyr? That would be my impression. They want full read-write access to repos, so I'm a little hesitant to grant that authorization. The alternative would be to switch to codecov.io, which doesn't expect as much repo access. But, I don't have a lot of experience with either service.

Either way, let's get a test coverage shield in the README.

Expand examples a bit

I am trying to figure this package out and I am struggling. Specifically, I was hoping to find some example code for the s3save and s3load functions.

For s3save do you just set up a bucket like so

ex <- getbucket(
  bucket = 'test-bucket',
  key = "AWS-KEY",
  secret = "AWS-SECRET",
  region = 'us-west-2'
  )

and then use that bucket in the s3save function?

s3save(mtcars, 
  bucket = ex
  , object = "mtcars"
  , opts = list(quote = FALSE))

It seems like I have to put something in opts or it errors out so I put in quote = FALSE so it could pass that to the do.call function.

When I run that I get
No encoding supplied: defaulting to UTF-8.
but nothing shows up in the bucket.

Probably a Stack Overflow question, but I didn't see any other aws.s3-tagged questions there, so I thought this would be a better spot for now.
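
For reference, a hedged sketch of the intended usage: s3save() takes the bucket name directly rather than a getbucket() result, and credentials can be supplied via environment variables (object and bucket names here are illustrative):

Sys.setenv("AWS_ACCESS_KEY_ID" = "AWS-KEY",
           "AWS_SECRET_ACCESS_KEY" = "AWS-SECRET",
           "AWS_DEFAULT_REGION" = "us-west-2")

s3save(mtcars, object = "mtcars.Rdata", bucket = "test-bucket")
s3load("mtcars.Rdata", bucket = "test-bucket")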

Write CSV into Amazon S3 bucket without storing it on local machine.

Can anyone help me with how to save a .csv file directly into Amazon S3 without saving it locally?

  • Save a data frame directly into S3 as a csv.

I tried this:

put_object(file = "sub_loc_imp.csv", object = "sub_loc_imp", bucket = "dev-sweep")

and it returned this error:

Warning message:
In parse_aws_s3_response(r, Sig) : Forbidden (HTTP 403).
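
Two hedged sketches for writing a data frame straight to S3 without touching the local disk (s3write_using() and raw-vector support in put_object() exist only in newer versions of the package). Note also that the HTTP 403 above suggests a credentials or bucket-permission problem, which is separate from the local-file question.

# Option 1: newer versions provide s3write_using()
s3write_using(sub_loc_imp, FUN = write.csv, row.names = FALSE,
              object = "sub_loc_imp.csv", bucket = "dev-sweep")

# Option 2: write to an in-memory connection and put the raw bytes
con <- rawConnection(raw(0), "r+")
write.csv(sub_loc_imp, con, row.names = FALSE)
put_object(file = rawConnectionValue(con), object = "sub_loc_imp.csv", bucket = "dev-sweep")
close(con)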

curl ssl error when using bucket named with periods?

As far as I can tell, I have two identically configured buckets, one named drat and one named packages.ropensci.org. I can get / copy files to either one using the AWS CLI. When I try to get or push files to drat via aws.s3, things are rosy. When I try packages.ropensci.org, I get the error:

 x = getobject("packages.ropensci.org", "index.html", region="us-west-2", key = Sys.getenv("AWS_ACCESS_KEY_ID"), secret = Sys.getenv("AWS_SECRET_ACCESS_KEY"))
Error in curl::curl_fetch_memory(url, handle = handle) : 
  SSL peer certificate or SSH remote key was not OK

Enter a frame number, or 0 to exit   

1: getobject("packages.ropensci.org", "index.html", region = "us-west-2", key 
2: s3HTTP(verb = "GET", url = paste0("https://", bucket, ".s3.amazonaws.com/",
3: httr::GET(url, H, ...)
4: request_perform(req, hu$handle$handle)
5: request_fetch(req$output, req$url, handle)
6: request_fetch.write_memory(req$output, req$url, handle)
7: curl::curl_fetch_memory(url, handle = handle)

I'm stumped, no idea what is different. I took a quick look at the headers and url in the debugger but nothing stands out to me there either. (If you wanna take a look at this exact example I can email you some credentials)
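
A likely culprit (an assumption, not confirmed here) is that a bucket name containing dots no longer matches the *.s3.amazonaws.com wildcard certificate used by virtual-hosted-style URLs; path-style addressing sidesteps that. A minimal unsigned illustration against the regional path-style endpoint (it may still return 403 without credentials, but the TLS handshake should succeed, which is the point):

library(httr)
r <- GET("https://s3-us-west-2.amazonaws.com/packages.ropensci.org/index.html")
status_code(r)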

Strategy for creating XML request bodies

Most of the PUT methods that allow one to configure a bucket or object require XML structures. In a couple of cases I've just used simple string concatenation to build these, but the structures can get quite complicated. How do we want to approach this? Should we:

  1. use the XML package to build actual XML objects (see the sketch after this list),
  2. stick with string concatenation, or
  3. force the user to write their own XML?
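
If we go with option 1, a hedged sketch of building, e.g., a CreateBucketConfiguration body with the XML package (element names follow the S3 API; the exact structure required is an assumption):

library(XML)
root <- newXMLNode("CreateBucketConfiguration",
                   namespaceDefinitions = "http://s3.amazonaws.com/doc/2006-03-01/")
newXMLNode("LocationConstraint", "us-west-2", parent = root)
body <- saveXML(root)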

GET requests for objects are sensitive to trailing /

For instance, a GET request to https://1000genomes.s3.amazonaws.com/README.pilot_data will succeed, but https://1000genomes.s3.amazonaws.com/README.pilot_data/ will throw a 404.

Right now we always build the url as url + '/' + bucket + path, which has the unintended effect of appending a trailing slash to the url, even if the correct url is passed to s3HTTP.
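
A hedged sketch of the kind of guard intended (variable names are illustrative):

# Only add a separator when there is a non-empty object path,
# so a correct url is never given a spurious trailing slash
path <- if (nzchar(object)) paste0("/", object) else ""
url  <- paste0("https://", bucket, ".s3.amazonaws.com", path)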

unit testing?

Is there a sensible way to go about unit testing (particularly for methods requiring keys)? E.g., does AWS provide testing endpoints with dummy credentials? Or can we define some dummy credentials for testing purposes (e.g., create and then delete a file) and pass them encrypted to Travis?
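
One common pattern, sketched here under the assumption that encrypted credentials are injected as Travis environment variables, is to skip credentialed tests whenever no keys are present:

library(testthat)

test_that("bucket listing works with real credentials", {
  skip_if_not(nzchar(Sys.getenv("AWS_ACCESS_KEY_ID")), "No AWS credentials in environment")
  expect_true(is.list(bucketlist()))
})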

PUT methods for objects

Would love to be helpful here -- was able to get getbucket() to run, but after digging around in the code a bit, I noticed that the header construction was a lot different in getbucket() than in any of the other methods (e.g., all the stuff for objects). I rolled my own function cribbing off of getbucket and was able to write to my s3 bucket.

Any pointers/suggestions about what I should be doing differently? I see a lot of ... arguments in object.r -- should I be building my headers using some constructor, then passing that to putobject, etc.?

will be happy to write up whatever I learn and submit back as a vignette / documentation / whatever

didn't work

> aws.s3::putobject('my_bucket', 1)
Error: is.character(headers) is not TRUE
Called from: request(headers = c(..., .headers))

works

putobject_alm <- function(
  bucket, 
  object,                        
  prefix, 
  delimiter,
  max,
  marker, 
  ...
) {
  if(inherits(bucket, "s3_bucket"))
      bucket <- bucket$Name
  h <- list()
  if(!missing(prefix))
      h$prefix <- prefix
  if(!missing(delimiter))
      h$delimiter <- delimiter
  if(!missing(max))
      h$"max-keys" <- max
  if(!missing(marker))
      h$marker <- marker
  h$`x-amz-content-sha256` <- ""
  r <- s3HTTP("PUT", paste0("https://",bucket,".s3.amazonaws.com/", object), 
    headers = h, ...
  )
  if(inherits(r, "aws_error")) {
      return(r)
  } else {
      structure(r, class = "s3_object")
  }
}

putobject_alm('my_bucket', 1)

Add a func to generate curl connection

I think this can just be a new "verb" in s3HTTP(), returning the connection object to enable reading an object directly as a connection rather than all at once.
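
Until then, a hedged illustration of reading an object as a connection with the curl package directly (unsigned request against a public object, so purely illustrative):

library(curl)
con <- curl("https://1000genomes.s3.amazonaws.com/README.pilot_data")
head(readLines(con))
close(con)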

Is this *really* on CRAN?

The install.packages line drew my interest (there's a ton of S3 clients on Github, but none that seem to be on CRAN).

Do you have any plans to (re)submit it?

how do we want to approach returning the unparsed response?

Saw this on bucketlist() and now on getbucket() -- when parse_response = FALSE, the default behavior of the function is to return only the 'contents' of the response -- which doesn't hold for an unparsed response, and you get

> ex
Bucket: 

named list()

Right now I am handling this by identifying unparsed responses by their class (response) and returning that list without any additional processing -- here's one example, basically just drafting off the way we currently handle errors:

if (inherits(r, "aws_error") || inherits(r, "response")) {
    return(r)
}

This feels a bit clumsy to me, and presumably whatever we come up with for returning unparsed responses is going to become a common design pattern throughout, so I thought we should kick this around for discussion.

  • Do we need a more bulletproof way of identifying unparsed S3 responses? Make them class s3_raw_response?
  • @cboettig I liked how you gave s3HTTP a single return point. Do we want to do that for our get functions as well?

Needs POST methods implemented

Got bogged down by the region thing, so I haven't had a chance to play with this yet, but it's on the to-do list. Might involve more package API breakage.

Restrictions on put object size?

Getting this somewhat cryptic error from calls to objectput now; it looks like the object I'm trying to put may be too large. Any idea where this is documented in the AWS S3 docs? Maybe a POST method would still work?

Error in curl::handle_setopt(handle, .list = req$options) : 
  Option postfieldsize_large (30120) not supported.
Calls: s3copy ... <Anonymous> -> request_perform -> <Anonymous> -> .Call

Don't write to disk for s3save and s3saveRDS

We should be able to POST from an in-memory representation:

mydata <- serialize(iris, NULL)
req <- httr::POST("http://httpbin.org/post", body = mydata)

And thus save disk i/o.
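
The same trick could cover s3saveRDS(); a hedged sketch (assumes put_object() accepts a raw vector for file, which only newer versions do):

con <- rawConnection(raw(0), "r+")
saveRDS(iris, con)
put_object(file = rawConnectionValue(con), object = "iris.rds", bucket = "mybucket")
close(con)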

Dependency on xml2 > 1.0.0

xml2 1.0.0 was just posted to CRAN in July, and > 1.0.0 is only available via install_github(). Might it be possible to revert to a >= 1.0.0 dependency?

Headers argument in getobject function

Hi,
First, thanks guys for the package, which is working really fine.
I get stuck though because I am unable to use the 'headers' argument in getobject function.
The main use I see personally is the ability to use Content-Length and Content-Range in order to download only a specified part of a file.
It would be great if this feature could be implemented in aws.s3.
Thanks again
Jean-Eudes
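
For partial downloads specifically, the standard mechanism is an HTTP Range request header; a hedged sketch that assumes get_object() forwards headers to the signed request (newer versions expose a headers argument):

# Fetch only the first kilobyte of a (public) object
part <- get_object("README.pilot_data", bucket = "1000genomes",
                   headers = list(Range = "bytes=0-1023"))
length(part)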

Tests were expecting 'handle'

Were these correct? I removed them here: 7cdf463 due to failure, but we should restore them and track down the problem if that's something we actually need in the response object.

default region from sys.env

The aws cli and most other clients detect the environment variable AWS_DEFAULT_REGION, in addition to the key and secret key. Would be straightforward to add.
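
A hedged one-liner for what the lookup could look like inside the package (the us-east-1 fallback is an assumption):

region <- Sys.getenv("AWS_DEFAULT_REGION", "us-east-1")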

memDecompress error

Hi team

I may be working in a very strange use case, I'm not sure. Feel free to disregard this if so.

I'm working in r-studio-server hosted on an EC2 instance in AWS (amazon linux OS, R 3.2.2, rstudio-server version 0.99.465). I'm trying to use the aws.s3 package to access an .rds file that is in an s3 bucket. The file is approximately 200mb on disk as an RDS. The EC2 instance is m4.2xlarge, so there should be around 32GB of RAM available.

The bucket is called "chek1", and get_bucket("chek1") works fine.

However, when I do:
> s3readRDS(object = "SAMHDA/RAWdata/vcat.08-14.rds", bucket = "chek1")

I get the following cryptic error message

Error in memDecompress(from = as.vector(r), type = "gzip") : 
  internal error -3 in memDecompress(2)

I'm not sure what's going on here. Does anyone have any ideas/workarounds? I really liked the look and feel of this package and was pretty surprised to get tripped up by this. Googling the error message only returns a random conversation from 2012 between Hadley Wickham and Brian Ripley, lol.

Adam

README: Sys.getenv() must be Sys.setenv()

In the code below, located in README.md file, Sys.getenv() should be Sys.setenv():

Sys.getenv("AWS_ACCESS_KEY_ID" = "mykey", "AWS_SECRET_ACCESS_KEY" = "mysecretkey", "AWS_DEFAULT_REGION" = "us-east-1")
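
That is, the corrected call would be:

Sys.setenv("AWS_ACCESS_KEY_ID" = "mykey", "AWS_SECRET_ACCESS_KEY" = "mysecretkey", "AWS_DEFAULT_REGION" = "us-east-1")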

tests failing?

Sorry for not filing a more intelligent issue, but I haven't had a chance to dig deeply. Maybe something has changed upstream? Builds on Travis that used to pass are failing now, and for some reason my existing aws.s3 scripts are failing to execute successfully (though often not throwing errors, just not actually posting / downloading / listing objects). The same commands with the same credentials work fine using the (Python) aws cli. Thanks!
