Comments (7)
@horta it looks like htslib is using libcurl to read from GCS: https://github.com/samtools/htslib/pull/446/commits
from sgkit-vcf.
Thanks Jeff!
The bgen C library opens a file with this interface: struct bgen_file* bgen_file_open(char const* filepath);
I wonder if we could just have something like: struct bgen_file* bgen_file_open(struct ifstream *stream);
Since no command-line tool uses the bgen C library, I think it is better to have Python handle S3/GCS/URL access, authentication, etc., and pass to the C level only a stream of bytes that can be fseek'd and fread'd.
I googled a bit and didn't find much. I would try to avoid C++ (especially in the public interface), mostly to make it easier to use from Python.
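To make the "stream of bytes that can be fseek'd and fread'd" idea concrete, here is a minimal Python-side sketch. The helper name `read_at` is hypothetical and not part of bgen; it only assumes the caller hands over a binary file object that supports `seek()` and `read()`, wherever the bytes actually come from.

```python
import io
from typing import BinaryIO


def read_at(stream: BinaryIO, offset: int, size: int) -> bytes:
    """Read exactly `size` bytes starting at absolute `offset` (hypothetical helper)."""
    if not stream.seekable():
        raise ValueError("stream must support seek()")
    stream.seek(offset)
    data = stream.read(size)
    if len(data) != size:
        raise EOFError(f"wanted {size} bytes at offset {offset}, got {len(data)}")
    return data


# Any seekable binary file object works: a local file, or an in-memory buffer as here.
buffer = io.BytesIO(b"0123456789")
chunk = read_at(buffer, 3, 4)
```

The reader never learns whether the bytes are local or remote, which is the separation suggested above: access and authentication stay on the Python side.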
htslib had to work really hard to get support for remote URLs in there using libcurl. It was a big, deep change to the library that required a lot of work abstracting out the htsfile interface. I think it'd be very difficult to monkey-patch remote URL support onto the bgen reader (if that's what you're talking about here @horta).
That is what I have in mind. The bgen library uses fread, fseek, fclose, fopen, and fwrite. It assumes that the underlying file is fseekable and that serial access is better than random access.
A URL file is not fseekable, but I'm assuming that S3 files might be. Is that true? Looking at the AWS SDK for Python, that does not seem to be the case: https://github.com/boto/botocore/blob/c168486392658dbddd78d130999f85fc0faf0a51/botocore/response.py#L29
The AWS SDK for PHP, however, does seem to support it: https://docs.aws.amazon.com/sdk-for-php/v3/developer-guide/s3-stream-wrapper.html
On the other hand, the section "Opening Seekable Streams" indicates that it is just an fseek over a local buffer (a fake fseek, which amounts to nothing).
If S3 connections are not fseekable, I don't think it makes sense to support reading S3 files, as in practice it would amount to downloading them...
GCS seems to provide arbitrary chunk reading: https://stackoverflow.com/questions/14248333/google-cloud-storage-seeking-within-files
That is not fseek, but a wrapper around it could provide an fseek operation that does not download the whole file, only (mostly) the parts you are interested in.
S3 is the same: https://stackoverflow.com/questions/45108722/is-there-s3-range-read-function-that-allows-to-read-assigned-byte-range-from-a
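Both answers come down to HTTP range requests, so a seek followed by a read can be expressed as one request for a byte range. A small sketch of that translation follows; `range_header` is a hypothetical helper name, and the only fact it relies on is that the HTTP `Range` header uses inclusive start and end positions.

```python
def range_header(offset: int, size: int) -> dict:
    """Build the HTTP Range header for `size` bytes starting at `offset`."""
    if offset < 0 or size <= 0:
        raise ValueError("offset must be >= 0 and size must be > 0")
    last = offset + size - 1  # Range end positions are inclusive
    return {"Range": f"bytes={offset}-{last}"}


# "Seek to byte 100, then read 50 bytes" becomes a single ranged request.
headers = range_header(100, 50)
```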
For future reference, it would be great if someone developed a BytesIO-like abstraction using this libcloud function: https://libcloud.readthedocs.io/en/stable/storage/api.html#libcloud.storage.base.StorageDriver.download_object_range_as_stream. It would make it possible to read cloud files as if they were local, possibly without much wasted bandwidth. That seems too good an idea for no library out there to be doing it already, or possibly I'm missing something.
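A rough sketch of what such an abstraction could look like, using only the standard `io` module. The class name `RangeReader` and the `fetch_range(start, end)` callable are assumptions made for illustration; in practice the callable might wrap a range download such as the libcloud function linked above, but here it is backed by an in-memory blob so the example runs on its own.

```python
import io
from typing import Callable


class RangeReader(io.RawIOBase):
    """Seekable, read-only stream backed by a byte-range fetch function."""

    def __init__(self, fetch_range: Callable[[int, int], bytes], size: int):
        super().__init__()
        self._fetch = fetch_range  # hypothetical: returns bytes in [start, end)
        self._size = size          # total object size, assumed known up front
        self._pos = 0

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return True

    def tell(self) -> int:
        return self._pos

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        # Seeking only moves a local cursor; no bytes are transferred.
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = self._size + offset
        else:
            raise ValueError(f"invalid whence: {whence}")
        if pos < 0:
            raise ValueError("negative seek position")
        self._pos = pos
        return pos

    def readinto(self, b) -> int:
        # Fetch only the bytes requested, clamped to the end of the object.
        start = self._pos
        end = min(start + len(b), self._size)
        if start >= end:
            return 0
        data = self._fetch(start, end)
        n = len(data)
        b[:n] = data
        self._pos += n
        return n


blob = bytes(range(256)) * 4  # stands in for a remote object
calls = []


def fetch(start: int, end: int) -> bytes:
    calls.append((start, end))  # record which ranges were requested
    return blob[start:end]


# BufferedReader adds read-ahead so small reads do not each become a request.
stream = io.BufferedReader(RangeReader(fetch, len(blob)), buffer_size=64)
stream.seek(300)
piece = stream.read(8)
```

The buffer size trades request count against wasted bandwidth: a larger buffer suits the serial access pattern bgen assumes, a smaller one suits scattered reads.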