fail to save big gs (cytolib, closed)

rglab commented on July 25, 2024
fail to save big gs

from cytolib.

Comments (5)

mikejiang commented on July 25, 2024

I ended up implementing the second, distributed approach, i.e. one pb file per sample. Saving now works for a big gs:

> save_gs(gs_big, tmp)
Done
To reload it, use 'load_gs' function

> list.files(tmp)
  [1] "90b6757a-26ab-4158-bfd2-fb4272fd1054.pb" "s1.h5"                                  
  [3] "s1.pb"                                   "s10.h5"                                 
  [5] "s10.pb"                                  "s100.h5"                                
  [7] "s100.pb"                                 "s11.h5"                                 
...
[195] "s96.pb"                                  "s97.h5"                                 
[197] "s97.pb"                                  "s98.h5"                                 
[199] "s98.pb"                                  "s99.h5"                                 
[201] "s99.pb"                                 

Sub-loading is also more efficient than before:

> system.time(gs1 <- load_gs(tmp, select = c("s1", "s100")))
   user  system elapsed 
  2.290   0.068   2.382 
> sampleNames(gs1)
[1] "s1"   "s100"

mikejiang commented on July 25, 2024

This buffer size limitation was introduced by switching to protobuf-lite (RGLab/RProtoBufLib#6 (comment)), which doesn't support iostream. The size restriction therefore comes from using a StringOutputStream wrapped over a single string buffer.

jacobpwagner commented on July 25, 2024

I see. Is it worth switching back then? I kept pretty detailed notes on minimization of the protobuf bundle, so if we want to do that again after moving back to the full library, it should be reasonably quick.

mikejiang commented on July 25, 2024

Yeah, switching back to the full version of protobuf would be one quick solution. There are two other alternatives, both of which require changing the existing message format:

  1. still save to a single pb file, but with multiple string buffer writes to the same file, each preceded by a small integer that records the buffer size (so that they can be reloaded by multiple buffer reads)
  2. write each gh (i.e. sample) to its own pb file
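
For the record, the first alternative above is essentially length-prefixed framing. Below is a minimal Python sketch of that idea, not the cytolib implementation: the fixed 4-byte little-endian prefix and the names write_chunks and read_chunks are assumptions made for illustration, and the byte strings stand in for serialized per-sample messages.

```python
import io
import struct

def write_chunks(stream, chunks):
    """Write each bytes chunk preceded by a 4-byte little-endian length (assumed format)."""
    for chunk in chunks:
        stream.write(struct.pack("<I", len(chunk)))
        stream.write(chunk)

def read_chunks(stream):
    """Yield chunks back by reading a length prefix, then exactly that many bytes."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream
        if len(header) != 4:
            raise ValueError("truncated length prefix")
        (size,) = struct.unpack("<I", header)
        payload = stream.read(size)
        if len(payload) != size:
            raise ValueError("truncated payload")
        yield payload

# Placeholder payloads standing in for serialized per-sample buffers.
samples = [b"sample-1-bytes", b"", b"sample-3-bytes"]
buf = io.BytesIO()
write_chunks(buf, samples)
buf.seek(0)
restored = list(read_chunks(buf))
```

Each buffer stays small on its own, so no single write has to hold the whole gs, but a reader still has to walk the file from the start to reach a given sample.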

The second approach could also help with concurrent loading and with efficient sub-loading through the select argument (i.e. load_gs(path, select = c(1:3))), since it no longer has to load and parse the entire message for all samples.
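
A rough Python sketch of why one file per sample makes sub-loading cheap: only the selected files are opened at all. The helpers save_samples and load_samples are hypothetical names for illustration, not the save_gs/load_gs API, and the sketch ignores everything else that would be stored alongside the per-sample pb files.

```python
import tempfile
from pathlib import Path

def save_samples(directory, samples):
    """Write each sample's serialized bytes to <directory>/<name>.pb."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for name, payload in samples.items():
        (directory / f"{name}.pb").write_bytes(payload)

def load_samples(directory, select=None):
    """Read only the selected samples; read every *.pb file when select is None."""
    directory = Path(directory)
    if select is None:
        select = sorted(p.stem for p in directory.glob("*.pb"))
    # Files for unselected samples are never opened or parsed.
    return {name: (directory / f"{name}.pb").read_bytes() for name in select}

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder payloads standing in for serialized per-sample messages.
    save_samples(tmp, {"s1": b"a", "s2": b"bb", "s100": b"ccc"})
    subset = load_samples(tmp, select=["s1", "s100"])
    everything = load_samples(tmp)
```

Because the files are independent, they could also be read by separate workers, which is the concurrent-loading benefit mentioned above.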

Either of the two could still fail in theory if a single sample reaches the same buffer limit (when the total number of gates is huge and the number of events is large enough). This probably would not happen in practice. (Or I could be wrong on this, given the nature of the faust application.)

Anyway, in the short run, I will do the switching. The discussion above is for future reference.

DillonHammill commented on July 25, 2024

This is great @mikejiang!
