Comments (17)
I think what I have to do is to create a version of the scan reading function that reads only what I need and nothing more. That should cut down on the time spent gathering the additional metadata that my code downstream simply ignores. If that is not going to be good enough, it might be necessary for the vendor to provide some "accelerator" functions, using their deep knowledge of the file format.
Also, I realized that the way data is passed into R at the moment is by generation and subsequent parsing of R source code. So the second trick would be to pass the data as raw bytes, and then disentangle them on the R end using a simpler method than a full-blown "eval", which has to be ready for anything an R programmer can throw at it - thus more complex - thus slower.
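A minimal sketch of that binary-transfer idea in base R (the function names here are made up for illustration): serialize the numbers as little-endian doubles on one end and decode them with readBin() on the other, skipping parse()/eval() entirely.

```r
# Hypothetical sketch: move numeric data as raw bytes instead of R source.
# Encoder side: serialize a numeric vector to 8-byte little-endian doubles.
encode_peaks <- function(x) {
  writeBin(as.double(x), raw(), size = 8, endian = "little")
}

# Decoder side: recover the vector without parse()/eval().
decode_peaks <- function(bytes) {
  readBin(bytes, what = "double", n = length(bytes) / 8,
          size = 8, endian = "little")
}

mz <- c(100.02, 250.5, 499.9)
stopifnot(identical(decode_peaks(encode_peaks(mz)), mz))  # exact round trip
```

Doubles round-trip exactly through their 8-byte binary representation, so no precision is lost compared to printing and re-parsing decimal text.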
from rawrr.
Thank you! I have achieved very comparable results (modulo the start; some caches were not warm enough):
Testing on our files now.
from rawrr.
Regarding your plans of implementing a search engine directly in R: I have big doubts that this makes sense! R is an interpreted language and not suited for heavy data lifting. This is why most R functions that crucially depend on performance are implemented in C.
see http://adv-r.had.co.nz/Performance.html
If you still think you are missing crucial functionality that could be provided by rawrr, please feel free to suggest something and we can think about making it happen, BUT it should make sense from a code design perspective.
from rawrr.
Below is a chart showing the times (it tops out at 8192 spectra because the code crashed; investigating now).
The difference is that each microbenchmark is run on a completely different .raw file to reduce the effect of caching. I used a 24-fraction set of .raw files to make sure I have a fresh one for each query.
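The rotation through the fraction set can be sketched as simple modular indexing (the file names below are hypothetical):

```r
# Cycle through a set of .raw files so each benchmark repetition hits a file
# the OS page cache has not just served (file names are placeholders).
files <- sprintf("fraction_%02d.raw", 1:24)
pick_file <- function(i) files[((i - 1) %% length(files)) + 1]

pick_file(1)   # "fraction_01.raw"
pick_file(24)  # "fraction_24.raw"
pick_file(25)  # wraps around to "fraction_01.raw"
```

With 24 fractions and one file per repetition, no file is re-read until a full cycle has passed, which keeps the cache effect to a minimum.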
Here is the same thing with spectra per second plotted on Y axis. The update you provided did have a dramatic effect on read times. Thank you!
from rawrr.
@romanzenka We know about that: the current version tries to fetch everything. We are going to fix it.
A possible workaround is applying some filtering using rawrr::readIndex('someRawFileName')
and fetching only the scans of interest. Why do you want to read all spectra at once?
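The suggested workaround could look like the sketch below (assuming a local file someRawFileName.raw; the index column names scan and MSOrder follow the rawrr documentation and may differ between versions):

```r
# Sketch: read the lightweight scan index first, then fetch only the scans
# of interest. 'someRawFileName.raw' is a placeholder file name.
idx <- rawrr::readIndex("someRawFileName.raw")

# Keep only the MS2 scan numbers (assumed column names, check your version).
ms2 <- idx$scan[idx$MSOrder == "Ms2"]

S <- rawrr::readSpectrum("someRawFileName.raw", scan = ms2)
```

This avoids paying the per-spectrum metadata cost for scans the downstream code would ignore anyway.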
C
from rawrr.
We are essentially making a specialized "search engine" that processes all spectra.
We understand that going to C / .NET would be best for such a job, but R is otherwise very convenient and has a lot of functionality we like. Being able to do these odd jobs in R would be great.
I'd be willing to try to provide a pull request, but I am afraid I'd collide with your design plans, as you are already aware of this issue.
from rawrr.
Hi @romanzenka,
some comments: it is true that fetching a small number of spectra is relatively slow. This is due to a big processing overhead when calling our managed code (the rawrr.exe) via a system call, plus writing tmp files to disk and then reading and parsing the tmp data. I recommend looking at this presentation, especially slide 5. @cpanse is working on a mechanism that would allow the managed code to provide direct in-memory access via Rcpp, but he is still struggling with details of the code management (which runtime to use and how to link the DLLs). But the first results look very promising and would boost reading speeds, especially for very small and selective data requests on many files!
Hope this helps,
Tobi
from rawrr.
...and because you phrased this statement in a very absolute way:
"It would take 3 hours just to read a single file."
No, it would NOT, since you cannot simply multiply the time it takes to read a single spectrum by n. That would only be the case if you called the rawrr::readSpectrum()
function n times, each targeting a single spectrum. I guess I don't have to go into the details of why this is not smart. ;-) The proof is again on slide 5.
from rawrr.
No, it would NOT, since you cannot simply multiply the time it takes to read a single spectrum by n. That would only be the case if you called the
rawrr::readSpectrum()
function n times, each targeting a single spectrum. I guess I don't have to go into the details of why this is not smart. ;-) The proof is again on slide 5.
I understand that very well, which is why I only call the function once. The speed is still so slow that it is not usable. I suspect that is because the function gathers metadata one spectrum at a time, which likely involves many seeks within the .raw file to gather all that info, plus complex parsing and the like.
from rawrr.
@cpanse is working on a mechanism that would allow the managed code to provide direct in-memory access via Rcpp, but he is still struggling with details of the code management (which runtime to use and how to link the DLLs).
I agree that having the engine in memory, "heated up and raring to go", would be of great benefit if you can pull it off.
The low speed I am experiencing is most likely not a result of writing/parsing text files - that operation takes a tiny fraction of the time considering the size of one spectrum. A second is basically an eon in computer time... my hard drive can pump ~100 MB into memory in a single second. The inefficiency is likely elsewhere, but I shall not speculate before I have numbers.
from rawrr.
A developer from the ProteoWizard/MSconvert project once told me: "When using vendor libraries you need to know how to pet the cat!" So, if you think you know better than @cpanse, please go ahead and suggest changes to our managed code. The C# source is available here. We are always open to pull requests as long as they comply with the Bioc guidelines and fit into the package scope. An example can be found here
from rawrr.
@romanzenka Can you provide more details of your request?
- What data do you want? E.g., centroided peaks or segments (profile)?
- How do you want the data to be read by R? E.g., base64 encoded, one peak list per line, using the scan method.
- Can you provide me access to a raw file you are going to use? (You can also send me an email [email protected] with the download link.)
I think #44 is the ultimate way to go. Meanwhile, I can try to provide a code snippet to solve your issue.
from rawrr.
- At the moment it is incredibly bare-bones. I basically need the precursor m/z and charge, then two arrays (or one interleaved, or whatever) of m/z + intensity pairs, centroided.
- Since I spoke to you, I did some minor benchmarking:
a <- 1:10000 / 7 # some numbers
v <- paste0("list(a=c(", paste(a, collapse=", "), "))")
microbenchmark::microbenchmark(eval(parse(text = v)))
...and the timing suggests the R parse may be fast enough and is not the culprit, so we could spare ourselves the pain of doing a binary transfer or base64.
- I will send you an e-mail; I just need to check that I am not sharing anything "secret" first.
from rawrr.
@romanzenka I hope that helps.
commit 1637d6f on [email protected]:packages/rawrr (check out and R CMD build or wait for two days)
# fetch via ExperimentHub
library(ExperimentHub)
eh <- ExperimentHub::ExperimentHub()
EH4547 <- normalizePath(eh[["EH4547"]])
(rawfile <- paste0(EH4547, ".raw"))
if (!file.exists(rawfile)){
file.copy(EH4547, rawfile)
}
R> bm <- lapply(2^(0:14), function(n, ...){
+ m0 <- microbenchmark::microbenchmark({S <- rawrr::readSpectrum(rawfile, 1:n, mode='default')}, ...)
+ m1 <- microbenchmark::microbenchmark({S <- rawrr::readSpectrum(rawfile, 1:n, mode='barebone')}, ...)
+
+ data.frame(time = c(m0$time, m1$time), mode=c('default', 'barebone'), n=n)
+ }, times=1, unit="nanosecond") |> Reduce(f='rbind')
R> bm
time mode n
1 983118992 default 1
2 906433494 barebone 1
3 902113611 default 2
4 871311213 barebone 2
5 890822867 default 4
6 879356766 barebone 4
7 895267636 default 8
8 909109441 barebone 8
9 930387498 default 16
10 881011362 barebone 16
11 929100467 default 32
12 857490072 barebone 32
13 914358999 default 64
14 872367250 barebone 64
15 962366760 default 128
16 876129902 barebone 128
17 996060642 default 256
18 908822154 barebone 256
19 1170730769 default 512
20 925475452 barebone 512
21 1963340186 default 1024
22 1120511427 barebone 1024
23 3557690212 default 2048
24 1409178241 barebone 2048
25 6165030108 default 4096
26 1976297334 barebone 4096
27 10846751392 default 8192
28 3010938648 barebone 8192
29 29449842481 default 16384
30 6763253400 barebone 16384
R> lattice::xyplot(time ~ n, groups=bm$mode, data=bm, type='b', scale=list(log=TRUE), ylab='time [in nanosecond]', xlab='number of spectra')
R> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.0.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib
locale:
[1] C/UTF-8/C/C/C/C
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] tartare_1.7.2 ExperimentHub_2.1.4 AnnotationHub_3.1.5
[4] BiocFileCache_2.0.0 dbplyr_2.1.1 BiocGenerics_0.39.2
loaded via a namespace (and not attached):
[1] KEGGREST_1.33.0 tidyselect_1.1.1
[3] BiocVersion_3.14.0 purrr_0.3.4
[5] lattice_0.20-44 vctrs_0.3.8
[7] generics_0.1.0 htmltools_0.5.2
[9] stats4_4.1.1 yaml_2.2.1
[11] utf8_1.2.2 interactiveDisplayBase_1.31.2
[13] blob_1.2.2 rlang_0.4.11
[15] pillar_1.6.3 later_1.3.0
[17] withr_2.4.2 glue_1.4.2
[19] DBI_1.1.1 rappdirs_0.3.3
[21] bit64_4.0.5 GenomeInfoDbData_1.2.7
[23] lifecycle_1.0.1 zlibbioc_1.39.0
[25] Biostrings_2.61.2 memoise_2.0.0
[27] Biobase_2.53.0 IRanges_2.27.2
[29] fastmap_1.1.0 httpuv_1.6.3
[31] GenomeInfoDb_1.29.8 curl_4.3.2
[33] fansi_0.5.0 AnnotationDbi_1.55.1
[35] Rcpp_1.0.7 xtable_1.8-4
[37] promises_1.2.0.1 filelock_1.0.2
[39] BiocManager_1.30.16 cachem_1.0.6
[41] S4Vectors_0.31.4 XVector_0.33.0
[43] mime_0.11 bit_4.0.4
[45] microbenchmark_1.4.9 png_0.1-7
[47] digest_0.6.27 dplyr_1.0.7
[49] shiny_1.7.0 grid_4.1.1
[51] tools_4.1.1 bitops_1.0-7
[53] magrittr_2.0.1 RCurl_1.98-1.4
[55] tibble_3.1.4 RSQLite_2.2.8
[57] rawrr_1.3.2 crayon_1.4.1
[59] pkgconfig_2.0.3 ellipsis_0.3.2
[61] rstudioapi_0.13 assertthat_0.2.1
[63] httr_1.4.2 R6_2.5.1
[65] compiler_4.1.1
Cheers
from rawrr.
I have noticed that if I try to read a non-centroided spectrum with "barebone" mode, I get an error - which is 100% OK with me.
I'm updating the test to (a) read only MS2 spectra and (b) cycle through different files, so we do not get overly optimistic results thanks to caching of previously loaded data.
Hopefully I will have plots shortly - what I am curious about seeing is "spectra per second", so I'll modify the plot a bit.
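The change from time to rate is just n divided by seconds. A sketch using a few rows of the bm data frame from the benchmark above (times are in nanoseconds):

```r
# Convert benchmark times (nanoseconds) into spectra per second and plot the
# rate instead of the raw time; 'bm' mimics rows of the data frame shown
# earlier in this thread (columns: time, mode, n).
bm <- data.frame(time = c(983118992, 906433494, 1963340186, 1120511427),
                 mode = c("default", "barebone", "default", "barebone"),
                 n    = c(1, 1, 1024, 1024))

bm$spectra_per_second <- bm$n / (bm$time / 1e9)

lattice::xyplot(spectra_per_second ~ n, groups = bm$mode, data = bm,
                type = "b", scale = list(log = TRUE),
                ylab = "spectra per second", xlab = "number of spectra")
```

On the log-log axes the throughput plateau of each mode is easy to read off directly, which the raw-time plot obscures.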
from rawrr.
Well, I tracked down the bug. If I load 16,384 spectra from a particular file, my R session crashes when it tries to source
the resulting 1.1 GB of R source code. The extraction itself takes about 1 minute, at an impressive ~270 spectra per second... but then R cannot handle the parse on my 32 GB RAM laptop. I get:
negative length vectors are not allowed
I think we ran into R's maximum vector length. That might be a future improvement; for now I will simply run the input in chunks big enough to give me speed, but small enough not to kill R.
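The chunking workaround can be sketched as splitting the scan indices into fixed-size blocks; the chunk size of 4096 below is a guess, and the rawrr call in the comment assumes a rawfile path as in the benchmark above:

```r
# Sketch: split scan indices into chunks small enough for R to parse the
# generated source without hitting vector-length limits.
chunk_scans <- function(scans, chunk_size = 4096) {
  split(scans, ceiling(seq_along(scans) / chunk_size))
}

chunks <- chunk_scans(1:16384)
length(chunks)         # 4 chunks
unique(lengths(chunks))  # each of 4096 scans

# Hypothetical per-chunk usage (needs a real .raw file):
# S <- do.call(c, lapply(chunks, function(idx)
#   rawrr::readSpectrum(rawfile, idx, mode = "barebone")))
```

The chunk size can then be tuned empirically: large enough to amortize the per-call startup cost, small enough that the generated source stays parseable.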
from rawrr.
I have noticed that if I try to read a non-centroided spectrum with "barebone" mode, I get an error - which is 100% OK with me.
Thanks; I fixed that: commit 36f43e1. C
from rawrr.