Comments (7)

hammer commented on September 26, 2024

@horta it looks like htslib is using libcurl to read from GCS https://github.com/samtools/htslib/pull/446/commits

from sgkit-vcf.

horta commented on September 26, 2024

Thanks Jeff!

The bgen C library opens a file through this interface: struct bgen_file* bgen_file_open(char const* filepath);

I wonder if we could just have something like: struct bgen_file* bgen_file_open(struct ifstream *stream);

Since no command-line tool uses the bgen C library, I think it is better to have Python handle S3/GCS/URL access and authentication, and to pass the C level only a stream of bytes that can be fseek'd and fread'd.
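
As a rough illustration of that split, here is a minimal Python sketch in which the reading code depends only on seek() and read(), so the caller decides where the bytes come from. The function read_u32_at and the three-integer payload are made up for this example; they are not part of the bgen API or file format.

```python
import io
import struct
from typing import BinaryIO


def read_u32_at(stream: BinaryIO, offset: int) -> int:
    """Seek to `offset` and read one little-endian unsigned 32-bit integer."""
    stream.seek(offset)
    data = stream.read(4)
    if len(data) != 4:
        raise EOFError(f"expected 4 bytes at offset {offset}, got {len(data)}")
    return struct.unpack("<I", data)[0]


# Toy payload (hypothetical layout): three little-endian 32-bit integers.
payload = struct.pack("<III", 20, 7, 500)

# io.BytesIO stands in for a local or remote file; any object that
# implements seek() and read() could be passed instead.
stream = io.BytesIO(payload)
print(read_u32_at(stream, 4))  # prints 7, the second integer
```

The reader never sees a path or URL, so authentication and transport stay entirely on the caller's side.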

I googled a bit and didn't find much. I would try to avoid C++ (especially in the public interface), mostly to make it easier to use from Python.

jeromekelleher commented on September 26, 2024

Htslib had to work really hard to get support for remote URLs in there using libcurl. It was a big, deep change to the library which required a lot of work abstracting out the htsfile interface. I think it'd be very difficult to monkey-patch remote URL support onto the bgen reader (if that's what you're talking about here @horta).

horta commented on September 26, 2024

That is what I have in mind. The bgen library calls fread, fseek, fclose, fopen, and fwrite. It assumes that the underlying file is fseekable and that sequential access is cheaper than random access.

A URL file is not fseekable, but I'm assuming that S3 files might be. Is that true? Looking at the AWS SDK for Python, that does not seem to be the case: https://github.com/boto/botocore/blob/c168486392658dbddd78d130999f85fc0faf0a51/botocore/response.py#L29

However, it does seem to be the case for the AWS SDK for PHP: https://docs.aws.amazon.com/sdk-for-php/v3/developer-guide/s3-stream-wrapper.html
Reading the section "Opening Seekable Streams", on the other hand, indicates that this is just an fseek over a local buffer (a fake fseek, which gains nothing).

If S3 connections are not fseekable, I don't think it makes sense to support reading S3 files, as in practice it would amount to downloading the whole file...

horta commented on September 26, 2024

GCS seems to provide arbitrary chunk reading: https://stackoverflow.com/questions/14248333/google-cloud-storage-seeking-within-files
That is not fseek, but a wrapper around it could provide an fseek operation that downloads mostly the parts you are interested in rather than the whole file.
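
Chunk reads of this kind are usually expressed as HTTP Range requests, so a wrapper mainly has to translate an (offset, length) pair into a byte range. A minimal sketch, assuming plain HTTP range semantics; range_header is a hypothetical helper name, not a function from any cloud SDK:

```python
def range_header(offset: int, length: int) -> dict:
    """Build an HTTP Range header for `length` bytes starting at `offset`.

    HTTP byte ranges are inclusive on both ends, so the last byte
    requested is offset + length - 1.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


print(range_header(0, 100))     # {'Range': 'bytes=0-99'}
print(range_header(4096, 512))  # {'Range': 'bytes=4096-4607'}
```

An fseek in such a wrapper would only update the stored offset; the request itself is issued by the next read.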

horta commented on September 26, 2024

S3 is the same: https://stackoverflow.com/questions/45108722/is-there-s3-range-read-function-that-allows-to-read-assigned-byte-range-from-a

horta commented on September 26, 2024

For future reference, it would be great if someone developed a BytesIO abstraction using this function from libcloud: https://libcloud.readthedocs.io/en/stable/storage/api.html#libcloud.storage.base.StorageDriver.download_object_range_as_stream It would make it possible to read cloud files as if they were local, possibly without much wasted bandwidth. That seems too good an idea for no library out there to be doing it, or possibly I'm missing something.
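
A minimal sketch of what such an abstraction could look like, assuming the object size is known up front and that some callable can return an arbitrary byte range. RangeReader and fetch are hypothetical names for this example; the demo uses an in-memory stand-in rather than a real cloud call, and a real fetch might wrap a range-download function like the one linked above.

```python
import io
from typing import Callable


class RangeReader(io.RawIOBase):
    """Read-only, seekable file object backed by a byte-range fetcher.

    `fetch(start, end)` is a hypothetical callable returning the bytes in
    [start, end); in practice it would wrap a cloud range-read call.
    """

    def __init__(self, fetch: Callable[[int, int], bytes], size: int):
        super().__init__()
        self._fetch = fetch
        self._size = size
        self._pos = 0

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return True

    def tell(self) -> int:
        return self._pos

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        # Seeking only moves a local cursor; no bytes are transferred.
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = self._size + offset
        else:
            raise ValueError(f"invalid whence: {whence}")
        if pos < 0:
            raise ValueError("negative seek position")
        self._pos = pos
        return pos

    def readinto(self, buffer) -> int:
        # Fetch only the requested window, clipped to the object size.
        end = min(self._pos + len(buffer), self._size)
        if end <= self._pos:
            return 0  # at or past the end of the object
        chunk = self._fetch(self._pos, end)
        n = len(chunk)
        buffer[:n] = chunk
        self._pos += n
        return n


# In-memory stand-in for a remote object, recording every range requested.
blob = bytes(range(256)) * 4
calls = []


def fake_fetch(start: int, end: int) -> bytes:
    calls.append((start, end))
    return blob[start:end]


f = RangeReader(fake_fetch, len(blob))
f.seek(300)
data = f.read(4)
print(data == blob[300:304], calls)
```

Every read() here turns into one fetch, which would be slow for many small reads; wrapping the object in io.BufferedReader is one way to add read-ahead, at the cost of fetching some bytes that may never be used.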
