cloudyr / aws.s3
Amazon Simple Storage Service (S3) API Client
Home Page: https://cloud.r-project.org/package=aws.s3
All of our current tests are of successful operations. We also need to test errors to make sure they return what is expected.
Is it possible to upload files with a public ACL?
Currently, after uploading a file, I have to go into the AWS console to make it public.
I tried: aws.s3::putobject("myFile", "myBucketPath", headers = c(ACL = "public-read"))
without success. The file is uploaded, but it remains private.
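One likely culprit is the header name: the header S3 actually expects is x-amz-acl, not ACL. A hedged sketch, assuming a version of put_object() that passes arbitrary headers through to the PUT request (the bucket and object names below are just placeholders):

```r
library("aws.s3")

# Upload a file and request a public-read canned ACL via the
# x-amz-acl request header (newer package versions may also
# accept an `acl = "public-read"` argument directly):
put_object(file = "myFile",
           object = "myFile",
           bucket = "myBucketPath",
           headers = c("x-amz-acl" = "public-read"))
```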
I'd like a way to do S3 access without the secret and key, relying on instance roles.
Approach:
$ curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
myrole
{
"Code" : "Success",
"LastUpdated" : "2016-08-22T23:42:03Z",
"Type" : "AWS-HMAC",
"AccessKeyId" : "",
"SecretAccessKey" : "",
"Token" : "",
"Expiration" : "2016-08-23T06:06:28Z"
}
The token is definitely required here -- you can't just use the secret and key. Since the tokens expire, you tend to query for a new token some time before the expiration.
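Building on the curl call above, one possible sketch of fetching those temporary credentials from R, assuming httr is available and the code is running on an EC2 instance with an attached role ("myrole" is the example role name from above):

```r
library("httr")

# The instance metadata service endpoint for role credentials:
base <- "http://169.254.169.254/latest/meta-data/iam/security-credentials/"

# First request returns the role name, second returns the JSON credentials:
role  <- content(GET(base), as = "text")           # e.g. "myrole"
creds <- content(GET(paste0(base, role)),
                 as = "parsed", type = "application/json")

# Export all three values; the session token is required, not optional:
Sys.setenv("AWS_ACCESS_KEY_ID"     = creds$AccessKeyId,
           "AWS_SECRET_ACCESS_KEY" = creds$SecretAccessKey,
           "AWS_SESSION_TOKEN"     = creds$Token)
```

Because the credentials expire (see creds$Expiration), any real implementation would need to re-query the endpoint shortly before expiry.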
Hi, I'm trying to read a csv or xlsx file from S3 directly using this package.
I get the file, but in a format that I can't understand.
Do I need to do something different with the get_object function to read .csv or .xlsx files?
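The likely explanation is that get_object() returns the object body as a raw vector, which then needs to be decoded. A hedged sketch (object and bucket names are placeholders):

```r
library("aws.s3")

# get_object() returns the raw bytes of the object:
r <- get_object(object = "myfile.csv", bucket = "mybucket")

# For a text format like csv, convert the bytes to a string and parse:
df <- read.csv(text = rawToChar(r))
```

For a binary format like xlsx, it is probably easier to save_object() to a tempfile() and then read it with a package such as readxl.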
This will be a shallow wrapper around save_object() and source():
s3source <- function(object, bucket, ...) {
    tmp <- tempfile() # or maybe a raw connection
    save_object(object, bucket, file = tmp)
    source(file = tmp, ...)
}
This will be really useful for quickly executing code on EC2 based upon code (and other files) uploaded from a local machine.
Apparently errors for out-of-region bucket requests currently look a little cryptic (see #45). It would be good to clearly communicate what is going on.
This should use a drat-centered installation workflow rather than devtools::install_github(). drat will be much more reliable because only stable releases can be installed.
This is a question, not an issue.
s3HTTP() uses url = "https://s3.amazonaws.com/" when in the default region. How can I use a different value, like "https://cgiardata.s3.amazonaws.com/"?
Thanks for the excellent package. Just wondering if there is a method to create a new folder in a bucket using aws.s3?
Thanks.
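For context, S3 has no real folders; the console simulates them with zero-byte objects whose keys end in a slash. A hedged sketch, assuming a version of put_object() that accepts a raw vector for the body (names below are placeholders):

```r
library("aws.s3")

# Create an empty object whose key ends in "/"; the S3 console
# will display this as a folder named "newfolder":
put_object(file = raw(0), object = "newfolder/", bucket = "mybucket")
```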
I've installed this wonderfully looking package and tried to retrieve a list of files in one of my S3 buckets, but I seem to be getting following error:
> aws.s3::getbucket(bucket = "kaggle.ml.data") No encoding supplied: defaulting to UTF-8. Error in UseMethod("xmlSApply") : no applicable method for 'xmlSApply' applied to an object of class "c('xml_document', 'xml_node')"
Platform:
Mac OS X 10.11.3 (El Captain)
R version:
R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree" Copyright (C) 2015 The R Foundation for Statistical Computing Platform: x86_64-apple-darwin13.4.0 (64-bit)
We were seeing a signature validation error being returned from the save_object() and put_object() calls (and likely others we didn't test), but not on the get_bucket() call. Eventually I tracked it down to the fact that we have '=' in our keys (storing partitioned Hive tables on S3, so part of the path is '/product_id=1/date=2016-01-01/' for example). get_bucket() puts the prefix in the query parameter, which ends up somewhere in the body of the request; whereas save_object() and put_object() put the key in the path parameter, which then gets appended to the url of the request itself. Manually URL encoding the '=' symbol with '%3D' fixed the issue on the previously failing calls, but for API cleanliness it might be something worth having the library itself handle.
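Until the library handles this itself, the manual workaround described above can be sketched with base R's URLencode(), encoding each path segment while preserving the '/' separators (the key below is the example from this issue):

```r
key <- "product_id=1/date=2016-01-01/part-00000"

# URLencode(reserved = TRUE) percent-encodes '=' as '%3D';
# encode each segment separately so the '/' separators survive:
segments <- strsplit(key, "/", fixed = TRUE)[[1]]
enc <- paste(vapply(segments, URLencode, character(1), reserved = TRUE),
             collapse = "/")
# enc can then be passed as the object key to save_object()/put_object()
```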
Is there a natural way to go about syncing / recursively copying all the files from a bucket (or subdirectory thereof) to the local machine, or vice versa?
On the upload side, I found looping over putobject to work perfectly well (awesome work on getting the PUT method sorted out!).
On the download side, I've had considerably more trouble. The first problem is not knowing what files to download. I've tried looping over getbucket methods to get a list of all files. My problem is made a bit worse by working with a bucket that has ~4K files (these are actually just the small log files that S3 writes automatically when you turn on logging; my plan is to parse them to get download/traffic measurements on publicly exposed S3 buckets, which may be of more general interest).
Anyway, since the API cuts off after 1000, I try some looping, passing new values of marker as the key from the last item in the content list of the previous call. I'm not having much luck with this, as I seem to get the same 1000 returned values regardless of the value of marker in getbucket. Not sure if that's a bug or me doing something wrong. Has anyone tested that marker is not just being ignored when it's added to the header of the requests made by getbucket?
My errors aside, I wonder if there's a more natural way to do this than to loop over getbucket and then to loop over getobject for all the 4K objects. Maybe that's fine but could just use a friendly wrapper. The aws cli provides both a --recursive flag to the cp method and the slightly more concise sync method. Any thoughts?
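The intended pagination loop can be sketched as follows, assuming a get_bucket()-style function that honours the marker parameter (which is exactly what this issue calls into question; bucket name and the $Key field layout are assumptions):

```r
library("aws.s3")

# Page through a bucket 1000 keys at a time, passing the last key of
# each page as the marker for the next request:
all_keys <- character()
marker <- NULL
repeat {
  b <- get_bucket(bucket = "mybucket", marker = marker, max = 1000)
  keys <- vapply(b, function(x) x$Key, character(1))
  if (length(keys) == 0) break
  all_keys <- c(all_keys, keys)
  if (length(keys) < 1000) break   # last (partial) page
  marker <- keys[length(keys)]
}
```

Newer versions of the package may handle this internally (e.g. via a max = Inf option), which would make a sync-style wrapper much simpler to build.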
Okay, this seems a bit nit-picky, but there's some inconsistency in how the various object methods order their arguments; particularly, whether they take a file name first or a bucket name first. I'd argue that they should always take the file name first, as the bucket name is more of a parameter, and this makes it more concise to lapply over a list of files or whatnot.
But maybe it's not worth breaking the package API just to make that change (and maybe the current format is more consistent with other wrappers like boto?). Anyway, just thought I'd mention it.
Some of the object wrappers like getobject set values of arguments to s3HTTP() internally in a way that isn't exposed in their own arguments. E.g., getobject explicitly sets parsed_response = FALSE in its call to s3HTTP, meaning that this argument can no longer be passed through the ... part of the getobject() call as one might expect from the documentation.
As a result, there is no way to get an unparsed response from these functions, as they work by getting an unparsed response from s3HTTP() and then parsing it themselves.
I understand that the default parsing strategy of s3HTTP may not be appropriate for all the object methods. If we cannot modify the s3HTTP parsing strategy to do the right thing in each case and want custom parsing for each object method, then maybe we should set the s3HTTP parse method to FALSE by default, so that the object methods could also respect the parsed_response = FALSE argument and return an unparsed HTTP response object when asked to do so?
Is coveralls not showing numbers because we haven't authorized them on cloudyr? That would be my impression. They want full read-write access to repos, so I'm a little hesitant to grant that authorization. The alternative would be to switch to codecov.io, which doesn't expect as much repo access. But, I don't have a lot of experience with either service.
Either way, let's get a test coverage shield in the README.
I am trying to figure this package out and I am struggling. Specifically, I was hoping to find some example code for the s3save and s3load functions.
For s3save, do you just set up a bucket like so
ex <- getbucket(
bucket = 'test-bucket',
key = "AWS-KEY",
secret = "AWS-SECRET",
region = 'us-west-2'
)
and then use that bucket in the s3save function?
s3save(mtcars,
bucket = ex
, object = "mtcars"
, opts = list(quote = FALSE))
It seems like I have to put something in opts or it errors out, so I put in quote = FALSE so it could pass that to the do.call function.
When I run that I get
No encoding supplied: defaulting to UTF-8.
but nothing shows up in the bucket.
Probably a stackoverflow question but I didn't see any other aws.s3 tagged questions there so I thought this would be a better spot for now
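For what it's worth, a hedged sketch of the usage that the question above seems to be aiming for: s3save() takes the bucket name (or a bucket object) directly, with credentials supplied via environment variables rather than through getbucket() (key/secret values and bucket name are the placeholders from the question):

```r
library("aws.s3")

Sys.setenv("AWS_ACCESS_KEY_ID"     = "AWS-KEY",
           "AWS_SECRET_ACCESS_KEY" = "AWS-SECRET",
           "AWS_DEFAULT_REGION"    = "us-west-2")

# Save the mtcars object into the bucket under the given key;
# no opts argument should be needed in this form:
s3save(mtcars, object = "mtcars.Rdata", bucket = "test-bucket")
```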
For those of us who'd like to avoid disk IO where possible.
It seems like this may already be the intended functionality, given https://github.com/cloudyr/aws.s3/blob/master/R/http.r#L117, but doing so now yields errors.
Perhaps there is a plan to implement this when objects are released, but this would be a fairly simple thing to implement now.
Can anyone help me with how to save a .csv file directly to Amazon S3 without saving it locally?
I tried this-
put_object(file = "sub_loc_imp.csv", object = "sub_loc_imp", bucket = "dev-sweep")
it produced an error:
Warning message:
In parse_aws_s3_response(r, Sig) : Forbidden (HTTP 403).
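(The 403 itself usually indicates a credentials or region problem rather than the upload method.) As for avoiding local disk entirely, one hedged sketch is to write the csv to an in-memory connection and upload the resulting raw vector, assuming a version of put_object() that accepts a raw vector for the body (sub_loc_imp and the bucket name come from the question above):

```r
library("aws.s3")

# Write the csv into an in-memory raw connection instead of a file:
con <- rawConnection(raw(0), "wb")
write.csv(sub_loc_imp, con, row.names = FALSE)

# Upload the accumulated bytes directly:
put_object(file = rawConnectionValue(con),
           object = "sub_loc_imp.csv",
           bucket = "dev-sweep")
close(con)
```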
The PUT ACL methods require a fairly complicated XML structure to handle bucket and/or object permissions.
Another access management feature is a JSON bucket policy structure. There's an online tool to generate bucket policies (and policies for some of the other AWS services).
As far as I can tell, I have two identically configured buckets, one named drat and one named packages.ropensci.org. I can get/copy files to either one using the AWS CLI. When I try to get or push files to drat via aws.s3, things are rosy. When I try packages.ropensci.org, I get the error:
x = getobject("packages.ropensci.org", "index.html", region="us-west-2", key = Sys.getenv("AWS_ACCESS_KEY_ID"), secret = Sys.getenv("AWS_SECRET_ACCESS_KEY"))
Error in curl::curl_fetch_memory(url, handle = handle) :
SSL peer certificate or SSH remote key was not OK
Enter a frame number, or 0 to exit
1: getobject("packages.ropensci.org", "index.html", region = "us-west-2", key
2: s3HTTP(verb = "GET", url = paste0("https://", bucket, ".s3.amazonaws.com/",
3: httr::GET(url, H, ...)
4: request_perform(req, hu$handle$handle)
5: request_fetch(req$output, req$url, handle)
6: request_fetch.write_memory(req$output, req$url, handle)
7: curl::curl_fetch_memory(url, handle = handle)
I'm stumped, no idea what is different. I took a quick look at the headers and url in the debugger but nothing stands out to me there either. (If you wanna take a look at this exact example I can email you some credentials)
This currently doesn't work due to never implementing the XML specification. Now that xml2 implements creating XML documents, this should be trivial without adding a dependency (but we will need to version the xml2 dependency).
Most of the PUT methods that allow one to configure a bucket or object require XML structures. In a couple of cases I've just used simple string concatenation to build these, but the structures can get quite complicated. How do we want to approach this? Should we:
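For reference, a hedged sketch of what building one of these structures with xml2 (rather than string concatenation) might look like; the element names mirror S3's AccessControlPolicy document, and the owner ID is a hypothetical placeholder:

```r
library("xml2")

# Build an AccessControlPolicy document programmatically:
doc    <- xml_new_document()
policy <- xml_add_child(doc, "AccessControlPolicy")

owner <- xml_add_child(policy, "Owner")
xml_add_child(owner, "ID", "example-owner-id")   # hypothetical ID

acl   <- xml_add_child(policy, "AccessControlList")
grant <- xml_add_child(acl, "Grant")
xml_add_child(grant, "Permission", "FULL_CONTROL")

# Serialize for use as a PUT request body:
body <- as.character(doc)
```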
For instance, a GET request to https://1000genomes.s3.amazonaws.com/README.pilot_data will succeed, but https://1000genomes.s3.amazonaws.com/README.pilot_data/ will throw a 404.
Right now we always build the url + '/' + bucket + path, which has the unintended effect of appending a trailing slash to the url, even if the correct url is passed to s3HTTP.
Is there a sensible way to go about unit testing (e.g. particularly for methods requiring keys?) e.g. does AWS provide and testing endpoints with dummy credentials? Or can we define some dummy credentials for testing purposes (e.g. create and then delete a file) and pass them encrypted to travis?
Would love to be helpful here - I was able to get getbucket() to run, but after digging around in the code a bit, I noticed that the header construction was a lot different in getbucket() than in any of the other methods (e.g. all the stuff for objects). I rolled my own function cribbing off of getbucket and was able to write to my s3 bucket.
Any pointers/suggestions about what I should be doing differently? I see a lot of ... arguments in object.r -- should I be building my headers using some constructor, then passing that to putobject, etc?
Will be happy to write up whatever I learn and submit back as a vignette / documentation / whatever.
didn't work
> aws.s3::putobject('my_bucket', 1)
Error: is.character(headers) is not TRUE
Called from: request(headers = c(..., .headers))
works
putobject_alm <- function(bucket, object, prefix, delimiter, max, marker, ...) {
    if (inherits(bucket, "s3_bucket"))
        bucket <- bucket$Name
    h <- list()
    if (!missing(prefix))
        h$prefix <- prefix
    if (!missing(delimiter))
        h$delimiter <- delimiter
    if (!missing(max))
        h$"max-keys" <- max
    if (!missing(marker))
        h$marker <- marker
    h$`x-amz-content-sha256` <- ""
    r <- s3HTTP("PUT", paste0("https://", bucket, ".s3.amazonaws.com/", object),
                headers = h, ...)
    if (inherits(r, "aws_error")) {
        return(r)
    } else {
        structure(r, class = "s3_object")
    }
}
putobject_alm('my_bucket', 1)
Currently I only see documentation for loading an R object or file into a vector.
I think this can just be a new "verb" in s3HTTP(), returning the connection object to enable reading an object directly as a connection rather than all at once.
Meanwhile, parsing of responses could also be more functionalized and a bit more robust
The install.packages line drew my interest (there's a ton of S3 clients on GitHub, but none that seem to be on CRAN).
Do you have any plans to (re)submit it?
Saw this on bucketlist() and now on getbucket() -- when parse_response = FALSE, the default behavior of the function is to return only the 'contents' of the response -- which doesn't hold for an unparsed response, and you get
> ex
Bucket:
named list()
Right now I am handling this by identifying unparsed responses by their class (response) and returning that list without any additional processing -- here's one example, basically just drafting off the way we currently handle errors:
if (inherits(r, "aws_error") || inherits(r, "response")) {
    return(r)
}
This feels a bit clumsy to me, and presumably whatever we come up with for returning unparsed responses is going to become a common design pattern throughout, so I thought we should kick this around for discussion.
Perhaps an s3_raw_response class? ?s3HTTP now has a single return point -- do we want to do that for our get functions as well? Need to do this and update the style guide with a template for it.
Got bogged down by the region thing so didn't have a chance to play with this yet, but it's on the to-do list. Might involve more package API breakage.
Getting this somewhat cryptic error from calls to putobject now; looks like the object I'm trying to put may be too large. Any idea where this is documented in the AWS S3 docs? Maybe a POST method would still work?
Error in curl::handle_setopt(handle, .list = req$options) :
Option postfieldsize_large (30120) not supported.
Calls: s3copy ... <Anonymous> -> request_perform -> <Anonymous> -> .Call
We should be able to POST from an in-memory representation:
mydata <- serialize(iris, NULL)
req <- httr::POST("http://httpbin.org/post", body = mydata)
And thus save disk i/o.
xml2 1.0.0 was just posted to CRAN in July, and > 1.0.0 is only available via install_github. Might it be possible to revert to a >= 1.0.0 dependency?
Hi,
First, thanks guys for the package, which is working really fine.
I get stuck, though, because I am unable to use the 'headers' argument in the getobject function.
The main use I see personally is the ability to use Content-Length and Content-Range in order to download only a specified part of a file.
It would be great if this feature could be implemented in aws.s3.
Thanks again
Jean-Eudes
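For the partial-download use case above, the relevant request header is Range (Content-Range and Content-Length are response headers). A hedged sketch, assuming a version of get_object() that forwards custom headers (object and bucket names are placeholders):

```r
library("aws.s3")

# Request only the first 100 bytes of the object via an HTTP
# Range header; the result is a raw vector of just those bytes:
part <- get_object(object = "bigfile.csv",
                   bucket = "mybucket",
                   headers = list(Range = "bytes=0-99"))
```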
Were these correct? I removed them here: 7cdf463 due to failure, but we should restore them and track down the problem if that's something we actually need in the response object.
The aws cli and most other clients seem to support detecting the environment variable AWS_DEFAULT_REGION, in addition to the key and secret key. Would be straightforward to add.
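The proposed detection could be as simple as the following sketch, falling back to us-east-1 (the S3 default region) when the variable is unset:

```r
# Prefer the environment variable when it is set, otherwise
# fall back to the default S3 region:
region <- Sys.getenv("AWS_DEFAULT_REGION")
if (region == "") {
  region <- "us-east-1"
}
```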
Hi team
I may be working in a very strange use case, I'm not sure. Feel free to disregard this if so.
I'm working in r-studio-server hosted on an EC2 instance in AWS (amazon linux OS, R 3.2.2, rstudio-server version 0.99.465). I'm trying to use the aws.s3 package to access an .rds file that is in an s3 bucket. The file is approximately 200mb on disk as an RDS. The EC2 instance is m4.2xlarge, so there should be around 32GB of RAM available.
The bucket is called "chek1", and get_bucket("chek1")
works fine.
However, when I do:
> s3readRDS(object = "SAMHDA/RAWdata/vcat.08-14.rds", bucket = "chek1")
I get the following cryptic error message
Error in memDecompress(from = as.vector(r), type = "gzip") :
internal error -3 in memDecompress(2)
I'm not sure what's going on here. Does anyone have any ideas/workarounds? I really liked the look/feel of this package, and was pretty surprised to get tripped up with this. Googling the error message only returns a random conversation from 2012 between @hadley wickham and brian ripley lol.
Adam
In the code below, located in the README.md file, Sys.getenv() should be Sys.setenv():
Sys.getenv("AWS_ACCESS_KEY_ID" = "mykey", "AWS_SECRET_ACCESS_KEY" = "mysecretkey", "AWS_DEFAULT_REGION" = "us-east-1")
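The corrected call would be:

```r
# Sys.setenv() assigns the variables; Sys.getenv() only reads them:
Sys.setenv("AWS_ACCESS_KEY_ID"     = "mykey",
           "AWS_SECRET_ACCESS_KEY" = "mysecretkey",
           "AWS_DEFAULT_REGION"    = "us-east-1")
```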
Sorry for not filing a more intelligent issue, but haven't had a chance to dig deeply. Maybe something has changed upstream? Seems that builds on travis that used to pass are failing now, and for some reason my existing aws.s3
scripts are now failing to execute successfully (though often not throwing errors, just not actually posting / downloading / listing objects). The same commands with same credentials seem fine using the (python) aws cli. Thanks!