Comments (17)

alanking commented on June 16, 2024

I believe that discussion is happening here, if it's of interest: irods/irods_client_rest_cpp#160

from irods_client_library_rirods.

korydraughn commented on June 16, 2024

Ok, so when you say chunked, you're talking about manually sending parts of the file, not HTTP's chunked encoding. That's good.

I noticed you're passing truncate=0. Try changing it to truncate=false and seeing if it works.
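Why would truncate matter for a multi-part upload? One plausible reading, and it is only an assumption here, is that the server does not recognize truncate=0 and falls back to truncating the object on every request. The local sketch below imitates that behavior with dd. It does not talk to iRODS, and the helper name write_chunk is made up for illustration. Writing at an offset into a freshly truncated file leaves a hole of zero bytes in front of the chunk.

```shell
#!/bin/sh
# write_chunk SRC DST OFFSET COUNT TRUNCATE   (hypothetical helper)
# Copies COUNT bytes at OFFSET from SRC to the same offset in DST.
# TRUNCATE=true empties DST first, imitating truncate-on-open.
write_chunk() {
    if [ "$5" = "true" ]; then
        : > "$2"
    fi
    dd if="$1" of="$2" bs=1 skip="$3" seek="$3" count="$4" conv=notrunc 2>/dev/null
}

# Demo data: 12 bytes written as three 4-byte chunks, once per mode.
demo_dir=$(mktemp -d)
printf 'AAAABBBBCCCC' > "$demo_dir/src"
for mode in true false; do
    for offset in 0 4 8; do
        write_chunk "$demo_dir/src" "$demo_dir/$mode" "$offset" 4 "$mode"
    done
done

# Show zero bytes as dots so the hole is visible.
tr '\000' '.' < "$demo_dir/true";  echo    # prints ........CCCC
tr '\000' '.' < "$demo_dir/false"; echo    # prints AAAABBBBCCCC
```

In both cases the resulting file is 12 bytes long, so a size check alone would not reveal the difference.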

MartinSchobben commented on June 16, 2024

oh my this works: truncate=false

trel commented on June 16, 2024

how big is the maximum buffer?

can it be increased... should it be a parameter?

what is the downside to increasing such a value?

MartinSchobben commented on June 16, 2024

Should the buffer size be defined in the header of the curl request? If so, then I can include another parameter or define a useful value. I am actually also unsure how one knows what buffer size is needed and what it is for anyway.

korydraughn commented on June 16, 2024

It sounds like the count parameter is what you want.

It is specified as part of the URL just like logical-path. See https://github.com/irods/irods_client_rest_cpp#stream

&count=<integer>

The /stream endpoint is meant to be used like POSIX read/write functions. You must make multiple calls to stream large amounts of data. The count parameter tells the endpoint how many bytes are to be read or written.

As for the size of the buffer, larger values for count mean fewer network calls. Smaller values for count mean more network calls. Try starting with a count of 8192 bytes and increasing it to see what kind of performance you get. The value you land on can be the default. The user can then choose to override that if they feel it is too small or large.
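To make the read/write analogy concrete, here is a minimal sketch of how a client might split a transfer into offset/count pairs, one pair per call to /stream. The function name chunk_plan is hypothetical and not part of the REST API. Only the offset and count parameters come from the endpoint described above.

```shell
#!/bin/sh
# chunk_plan SIZE COUNT   (hypothetical helper)
# Prints one "offset count" pair per line, covering SIZE bytes in
# pieces of at most COUNT bytes. The last piece may be shorter.
chunk_plan() {
    size=$1
    count=$2
    # A non-positive count would never make progress.
    [ "$count" -gt 0 ] || return 1
    offset=0
    while [ "$offset" -lt "$size" ]; do
        remaining=$((size - offset))
        if [ "$remaining" -lt "$count" ]; then
            n=$remaining
        else
            n=$count
        fi
        echo "$offset $n"
        offset=$((offset + n))
    done
}
```

For example, `chunk_plan 20000 8192` prints three lines: `0 8192`, `8192 8192`, and `16384 3616`. A larger count gives fewer lines, which means fewer network calls.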

MartinSchobben commented on June 16, 2024

I see. I did set it to 1000 by default, so I will change that but leave it accessible for the user.

MartinSchobben commented on June 16, 2024

I made an example, mostly written as a bash script, with only a piece of R that creates a 10,000-row matrix. This roughly mimics what the R function does, but I can't seem to set count in such a way that this works without exceeding the buffer.

#!/bin/sh

# get token
export SECRETS=$(printf '%s' 'rods:rods' | base64)
export TOKEN=$(curl -X POST -H "Authorization: Basic ${SECRETS}" http://localhost/irods-rest/0.9.3/auth)

# R object
Rscript -e "foo <- matrix(1:10000); saveRDS(foo, 'foo.rds')"

# url encode with php
export LPATH=$( php -r "echo urlencode('/tempZone/home/rods/foo.rds');"; )

# create file (--data-binary sends the bytes unmodified; -d would strip newlines)
curl -X PUT -H "Authorization: ${TOKEN}" \
  -H "Accept-Encoding: gzip, deflate, br" \
  --data-binary @foo.rds \
  "http://localhost/irods-rest/0.9.3/stream?logical-path=${LPATH}&offset=0&count=8192"

# delete file
rm foo.rds

trel commented on June 16, 2024

How big is 'foo.rds' on the disk? That is the number that 'should' work for the count=.

korydraughn commented on June 16, 2024

> How big is 'foo.rds' on the disk? That is the number that 'should' work for the count=.

That is true if the file is small. The REST API allocates a buffer of size count bytes. If count is too large, the server could throw a std::bad_alloc exception.

offset and count must be used to read/write large files.

MartinSchobben commented on June 16, 2024

Does that mean R has to chop the foo.rds file in pieces and then send the pieces over the REST API to one and the same location?

korydraughn commented on June 16, 2024

That is correct.

The C++ REST API does not support parallel transfer yet, so the speed of a transfer will depend on the size of the file. That is a known issue and will be worked on in a future release of the API.
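Sending the pieces one after another to the same logical path could look roughly like the sketch below. It is an illustration under stated assumptions, not how rirods implements it: the function names send_chunk and upload_in_chunks are made up, BASE_URL, LPATH (already URL-encoded), and TOKEN are assumed to be set by the caller, and truncate=false is assumed so that earlier pieces are left in place.

```shell
#!/bin/sh
# send_chunk FILE OFFSET COUNT   (hypothetical helper)
# Sends COUNT bytes of FILE, starting at OFFSET, as one PUT request.
# Assumes BASE_URL, LPATH and TOKEN are set in the environment.
send_chunk() {
    dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null |
        curl -sf -X PUT -H "Authorization: ${TOKEN}" \
            --data-binary @- \
            "${BASE_URL}/stream?logical-path=${LPATH}&offset=$2&count=$3&truncate=false"
}

# upload_in_chunks FILE CHUNK_SIZE   (hypothetical helper)
# Walks through FILE sequentially and calls send_chunk for each piece.
upload_in_chunks() {
    file=$1
    chunk=$2
    [ "$chunk" -gt 0 ] || return 1
    size=$(wc -c < "$file")
    size=$((size + 0))   # drop padding that some wc implementations add
    offset=0
    while [ "$offset" -lt "$size" ]; do
        n=$((size - offset))
        if [ "$n" -gt "$chunk" ]; then
            n=$chunk
        fi
        # Stop at the first failed request instead of leaving a gap.
        send_chunk "$file" "$offset" "$n" || return 1
        offset=$((offset + n))
    done
}
```

A call such as `upload_in_chunks foo.rds 3034` would then issue one PUT per piece. Using bs=1 with dd keeps the byte arithmetic simple but is slow for big files. Because nothing is ever truncated in this sketch, overwriting an existing, longer data object would leave its old tail in place.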

MartinSchobben commented on June 16, 2024

I am now able to upload larger objects with chunking. You can see here that the local and iRODS file sizes match:

library("rirods")
rirods:::local_create_irods()
iauth("rods", "rods")
# big object
foo <- matrix(1:100000)
# save locally
saveRDS(foo, "foo.rds")
# check size
file.size("foo.rds")
#> [1] 212424
# put in irods
iput(foo)
# check size on irods
ils(stat = TRUE)
#>                  logical_path status_information.last_write_time
#> 1 /tempZone/home/rods/foo.rds                         1670143173
#>   status_information.size        type
#> 1                  212424 data_object

But something goes wrong during the upload. On closer inspection of the raw vector, it seems that the chunks at the front get overwritten with 00 00 00 bytes, whereas the last chunk represents the source truthfully.

Any ideas?

Created on 2022-12-04 with reprex v2.0.2
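Matching sizes do not prove matching content, so a byte-level check can help narrow this down. The sketch below assumes the data object has already been downloaded again to a local file by some means. The helper names count_zero_bytes and same_content are hypothetical. An unexpectedly large number of zero bytes in the downloaded copy would be consistent with the overwritten chunks described above.

```shell
#!/bin/sh
# count_zero_bytes FILE   (hypothetical helper)
# Prints how many bytes in FILE are 0x00.
count_zero_bytes() {
    total=$(wc -c < "$1")
    nonzero=$(tr -d '\000' < "$1" | wc -c)
    echo $((total - nonzero))
}

# same_content A B   (hypothetical helper)
# Succeeds only if the two files are byte-for-byte identical.
same_content() {
    cmp -s "$1" "$2"
}
```

For example, `same_content foo.rds downloaded.rds || echo "content differs"` reports a mismatch even when both files have the same size; downloaded.rds is a placeholder name.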

korydraughn commented on June 16, 2024

So only the last part of the data is correct?

How is the chunked transfer implemented in rirods?
Is there a loop?
What does the HTTP request look like?
What API parameters are being set on the request?

MartinSchobben commented on June 16, 2024

Technically a loop is at work, which outputs this response n times:

<- HTTP/1.1 200 OK
<- Server: nginx/1.23.1
<- Date: Tue, 06 Dec 2022 16:54:23 GMT
<- Content-Length: 33
<- Connection: keep-alive
<- Access-Control-Allow-Origin: *
<- Access-Control-Allow-Headers: *
<- Access-Control-Allow-Methods: AUTHORIZATION,ACCEPT,GET,POST,OPTIONS,PUT,DELETE

The request would look like this (2 steps at the end of the loop/file):

-> PUT /irods-rest/0.9.3/stream?logical-path=%2FtempZone%2Fhome%2Frods%2Fbaz.rds&offset=206356&count=3034&truncate=0 HTTP/1.1
-> Host: localhost
-> User-Agent: httr2/0.2.2 r-curl/4.3.3 libcurl/7.68.0
-> Accept: */*
-> Accept-Encoding: deflate, gzip, br
-> Authorization: <REDACTED>
-> Content-Length: 3034
-> PUT /irods-rest/0.9.3/stream?logical-path=%2FtempZone%2Fhome%2Frods%2Fbaz.rds&offset=209390&count=3034&truncate=0 HTTP/1.1
-> Host: localhost
-> User-Agent: httr2/0.2.2 r-curl/4.3.3 libcurl/7.68.0
-> Accept: */*
-> Accept-Encoding: deflate, gzip, br
-> Authorization: <REDACTED>
-> Content-Length: 3034

korydraughn commented on June 16, 2024

We've made a tweak to the C++ REST API to allow larger HTTP requests. Before the change, the /stream endpoint for PUT was limited to about 4096 bytes (not good for uploads).

If that seems important to have for your talk, you can try building a package with that change. The PR is available at the following:

Generating a package can be accomplished using the Docker builder provided by the project. See the instructions at the following:

And here are instructions for running with Docker Compose.

korydraughn commented on June 16, 2024

We're going to change the truncate parameter in the next release to accept 0/1 instead of true/false.
