Uploading larger R objects fails unexpected about irods_client_library_rirods HOT 17 CLOSED

MartinSchobben commented on June 16, 2024

Uploading larger R objects fails unexpected

from irods_client_library_rirods.

Comments (17)

alanking commented on June 16, 2024

I believe that discussion is happening here, if it's of interest: irods/irods_client_rest_cpp#160

korydraughn commented on June 16, 2024

Ok, so when you say chunked, you're talking about manually sending parts of the file, not HTTP's chunked encoding. That's good.

I noticed you're passing truncate=0. Try changing it to truncate=false and see if it works.

MartinSchobben commented on June 16, 2024

oh my this works: truncate=false

trel commented on June 16, 2024

how big is the maximum buffer?

can it be increased... should it be a parameter?

what is the downside to increasing such a value?

MartinSchobben commented on June 16, 2024

Should the buffer size be defined in the header of the curl request? If so, I can add another parameter or define a useful default value. I am also unsure how one knows what buffer size is needed, and what the buffer is for in the first place.

korydraughn commented on June 16, 2024

It sounds like the count parameter is what you want.

It is specified as part of the URL just like logical-path. See https://github.com/irods/irods_client_rest_cpp#stream

&count=<integer>

The /stream endpoint is meant to be used like POSIX read/write functions. You must make multiple calls to stream large amounts of data. The count parameter tells the endpoint how many bytes are to be read or written.

As for the size of the buffer, larger values for count mean fewer network calls. Smaller values for count mean more network calls. Try starting with a count of 8192 bytes and increasing it to see what kind of performance you get. The value you land on can be the default. The user can then choose to override that if they feel it is too small or large.
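
To make the trade-off concrete, here is a minimal sketch of the arithmetic behind "larger count, fewer calls". The helper name requests_needed is hypothetical, and 212424 bytes is only an example size (it is the size of the foo.rds file that appears later in this thread).

```sh
#!/bin/sh
# requests_needed SIZE COUNT
# Number of /stream calls needed to move SIZE bytes when each call
# carries at most COUNT bytes (ceiling division).
requests_needed() {
  size=$1
  count=$2
  echo $(( (size + count - 1) / count ))
}

# Example file size in bytes; purely illustrative.
for count in 1000 8192 65536; do
  echo "count=${count}: $(requests_needed 212424 "$count") requests"
done
# count=1000: 213 requests
# count=8192: 26 requests
# count=65536: 4 requests
```

This only counts requests. Which value performs best in practice still has to be measured against a real server, as suggested above.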

MartinSchobben commented on June 16, 2024

I see. I did set it to 1000 by default, so I will change that but leave it accessible for the user.

MartinSchobben commented on June 16, 2024

I made an example written mostly as a bash script, with only a small piece of R that creates a matrix with 10,000 rows. It roughly mimics what the R function does, but I can't find a value for count that makes this work without exceeding the buffer.

#!/bin/sh

# get token
export SECRETS=$(echo -n rods:rods | base64)
export TOKEN=$(curl -X POST -H "Authorization: Basic ${SECRETS}" http://localhost/irods-rest/0.9.3/auth)

# R object
Rscript -e "foo <- matrix(1:10000); saveRDS(foo, 'foo.rds')"

# url encode with php
export LPATH=$( php -r "echo urlencode('/tempZone/home/rods/foo.rds');"; )

# create file
curl -X PUT -H "Authorization: ${TOKEN}" \
  -H "Accept-Encoding: gzip, deflate, br" \
  -d @foo.rds \
  "http://localhost/irods-rest/0.9.3/stream?logical-path=${LPATH}&offset=0&count=8192"

# delete file
rm foo.rds

trel commented on June 16, 2024

How big is 'foo.rds' on the disk? That is the number that 'should' work for the count=.

korydraughn commented on June 16, 2024

> How big is 'foo.rds' on the disk? That is the number that 'should' work for the count=.

That is true if it is small. The REST API allocates a buffer of size count bytes. If count is too large, the server could throw a std::bad_alloc exception.

offset and count must be used to read/write large files.

MartinSchobben commented on June 16, 2024

Does that mean R has to chop the foo.rds file in pieces and then send the pieces over the REST API to one and the same location?

korydraughn commented on June 16, 2024

That is correct.

The C++ REST API does not support parallel transfer yet, so the speed of a transfer will depend on the size of the file. That is a known issue and will be worked on in a future release of the API.
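
The "chop into pieces and send each piece to the same location" approach could be sketched roughly as follows. This is not how rirods implements it. The function name upload_chunked, the BASE_URL and TOKEN variables, and the DRY_RUN switch are assumptions made for illustration. The endpoint and its logical-path, offset, count and truncate parameters are the ones mentioned in this thread, and truncate=false is used because that is the value reported to work here.

```sh
#!/bin/sh
# Assumed defaults; adjust to your deployment.
BASE_URL="${BASE_URL:-http://localhost/irods-rest/0.9.3}"

# upload_chunked LOCAL_FILE URL_ENCODED_LOGICAL_PATH CHUNK_BYTES
# Sends LOCAL_FILE to /stream in consecutive pieces of at most CHUNK_BYTES.
# With DRY_RUN set, the requests are printed instead of being sent.
upload_chunked() {
  file=$1
  lpath=$2
  chunk=$3
  size=$(( $(wc -c < "$file") ))
  offset=0
  while [ "$offset" -lt "$size" ]; do
    remaining=$((size - offset))
    if [ "$remaining" -lt "$chunk" ]; then
      count=$remaining
    else
      count=$chunk
    fi
    url="${BASE_URL}/stream?logical-path=${lpath}&offset=${offset}&count=${count}&truncate=false"
    if [ -n "${DRY_RUN:-}" ]; then
      echo "PUT ${url}"
    else
      # bs=1 keeps skip/count in bytes: slow, but portable.
      dd if="$file" bs=1 skip="$offset" count="$count" 2>/dev/null |
        curl -sS --fail -X PUT -H "Authorization: ${TOKEN}" \
          --data-binary @- "$url" || return 1
    fi
    offset=$((offset + count))
  done
}
```

A dry run such as `DRY_RUN=1 upload_chunked foo.rds "$LPATH" 3034` prints one PUT line per piece, which makes it easy to check the offsets before anything is sent. LPATH is assumed to be URL-encoded the same way as in the script earlier in this thread.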

MartinSchobben commented on June 16, 2024

I am now able to upload larger objects with chunking. You can see here that the local and iRODS file sizes match:

library("rirods")
rirods:::local_create_irods()
iauth("rods", "rods")
# big object
foo <- matrix(1:100000)
# save locally
saveRDS(foo, "foo.rds")
# check size
file.size("foo.rds")
#> [1] 212424
# put in irods
iput(foo)
# check size on irods
ils(stat = TRUE)
#>                  logical_path status_information.last_write_time
#> 1 /tempZone/home/rods/foo.rds                         1670143173
#>   status_information.size        type
#> 1                  212424 data_object

But something goes wrong during the upload. On closer inspection of the raw vector, it seems that the chunks at the front get overwritten with 00 00 00 bytes, whereas the last chunk faithfully represents the source.

Any ideas?

Created on 2022-12-04 with reprex v2.0.2
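
Matching file sizes do not prove matching contents, which is the situation described above. One way to see which pieces were damaged, sketched under the assumption that a copy of the uploaded object has been downloaded to a local file by some other means, is to compare the two files piece by piece. The function name diff_chunks is hypothetical.

```sh
#!/bin/sh
# diff_chunks ORIGINAL COPY CHUNK_BYTES
# Prints the index and byte offset of every CHUNK_BYTES-sized piece
# whose checksum differs between ORIGINAL and COPY.
diff_chunks() {
  a=$1
  b=$2
  chunk=$3
  size=$(( $(wc -c < "$a") ))
  index=0
  offset=0
  while [ "$offset" -lt "$size" ]; do
    # With bs=CHUNK, skip=INDEX selects the INDEX-th piece of the file.
    sum_a=$(dd if="$a" bs="$chunk" skip="$index" count=1 2>/dev/null | cksum)
    sum_b=$(dd if="$b" bs="$chunk" skip="$index" count=1 2>/dev/null | cksum)
    if [ "$sum_a" != "$sum_b" ]; then
      echo "chunk ${index} (offset ${offset}) differs"
    fi
    index=$((index + 1))
    offset=$((offset + chunk))
  done
}
```

Running it with the same chunk size the upload used should show whether the differences line up with chunk boundaries, for example every piece except the last.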

korydraughn commented on June 16, 2024

So only the last part of the data is correct?

How is the chunked transfer implemented in rirods?
Is there a loop?
What does the HTTP request look like?
What API parameters are being set on the request?

MartinSchobben commented on June 16, 2024

Technically, a loop is at work, and it outputs this response n times:

<- HTTP/1.1 200 OK
<- Server: nginx/1.23.1
<- Date: Tue, 06 Dec 2022 16:54:23 GMT
<- Content-Length: 33
<- Connection: keep-alive
<- Access-Control-Allow-Origin: *
<- Access-Control-Allow-Headers: *
<- Access-Control-Allow-Methods: AUTHORIZATION,ACCEPT,GET,POST,OPTIONS,PUT,DELETE

The requests look like this (the last two steps of the loop, at the end of the file):

-> PUT /irods-rest/0.9.3/stream?logical-path=%2FtempZone%2Fhome%2Frods%2Fbaz.rds&offset=206356&count=3034&truncate=0 HTTP/1.1
-> Host: localhost
-> User-Agent: httr2/0.2.2 r-curl/4.3.3 libcurl/7.68.0
-> Accept: */*
-> Accept-Encoding: deflate, gzip, br
-> Authorization: <REDACTED>
-> Content-Length: 3034

-> PUT /irods-rest/0.9.3/stream?logical-path=%2FtempZone%2Fhome%2Frods%2Fbaz.rds&offset=209390&count=3034&truncate=0 HTTP/1.1
-> Host: localhost
-> User-Agent: httr2/0.2.2 r-curl/4.3.3 libcurl/7.68.0
-> Accept: */*
-> Accept-Encoding: deflate, gzip, br
-> Authorization: <REDACTED>
-> Content-Length: 3034
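
As a quick sanity check on the trace above, the offsets can be verified by hand: each request should start where the previous one ended. The helper next_offset is hypothetical. The numbers are copied from the two requests.

```sh
#!/bin/sh
# next_offset OFFSET COUNT
# Offset at which the following request is expected to start.
next_offset() {
  echo $(( $1 + $2 ))
}

next_offset 206356 3034   # prints 209390, the offset of the second request
next_offset 209390 3034   # prints 212424
```

The final value, 212424, equals the file size reported for foo.rds earlier in this thread. That suggests the requests cover the object up to its last byte, assuming baz.rds was created from the same matrix, which the thread does not state.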

korydraughn commented on June 16, 2024

We've made a tweak to the C++ REST API to allow larger HTTP requests. Before the change, the /stream endpoint for PUT was limited to about 4096 bytes (not good for uploads).

If that seems important to have for your talk, you can try building a package with that change. The PR is available at the following:

irods/irods_client_rest_cpp#175.

Generating a package can be accomplished using the Docker builder provided by the project. See the instructions at the following:

https://github.com/irods/irods_client_rest_cpp#building-with-docker.

And here are the instructions for running with Docker Compose:

https://github.com/irods/irods_client_rest_cpp#starting-the-services-with-docker-compose.

korydraughn commented on June 16, 2024

We're going to change that in the next release to accept 0/1 instead of true/false.

irods/irods_client_rest_cpp#176
