
Comments (3)

algattik commented on June 18, 2024

Hi, we already deal with many potential combinations of ingestion, processing, and serving technologies before adding an extra dimension of programming language, and each sample carries maintenance overhead over time, which represents real work for a community-led project. The purpose of the repo is that, based on one or more samples together with the Azure documentation, you should be able to build a "skeleton" pipeline in a fairly straightforward manner.

Have you had some difficulties achieving this? What kind of Python samples would you be looking at? We can certainly look at adding one if it can help the community. Are you considering Python Azure functions, or PySpark code?

from streaming-at-scale.

SeaDude commented on June 18, 2024

Hello, thanks for the details. I've mostly been having difficulty finding a Python example of streaming data from Azure Blob Storage to external providers using Azure Functions async methods and chunk()-ing.

The areas that are murky to me are:

  • Do I use the azure.storage.blob libraries and asyncio? OR
  • Do I use the azure.storage.blob.aio libraries?
    • Do the .aio libraries require asyncio operations too?
    • What is the difference between the two libraries?
  • Some chunk()-ing examples run upwards of 500 lines of code, yet I feel like I'm close with the code below.
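On the sync-vs-async question: azure.storage.blob and azure.storage.blob.aio expose the same client classes and method names; the difference is that in the .aio variants every network call is a coroutine, so yes, they require asyncio (or another event-loop runner) too. A minimal sketch, assuming azure-storage-blob is installed; the SDK import is deferred into the function so the snippet itself loads without it:

```python
import asyncio

async def copy_blob_bytes(blob_url, credential):
    # Deferred import: azure.storage.blob.aio mirrors azure.storage.blob,
    # but its clients are async context managers and its I/O methods
    # are coroutines that must be awaited.
    from azure.storage.blob.aio import BlobClient

    async with BlobClient.from_blob_url(blob_url, credential=credential) as client:
        downloader = await client.download_blob()  # coroutine in the .aio variant
        return await downloader.readall()          # also a coroutine

# Entry point: an asyncio event loop is required to drive the coroutines.
# asyncio.run(copy_blob_bytes("https://<account>.blob.core.windows.net/<c>/<b>", cred))
```

The sync azure.storage.blob.BlobClient has the same method names minus async/await; that is essentially the whole difference between the two libraries.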

This code downloads blobs from Azure Storage in 256 MB chunks (synchronously), loads the whole file into memory, then uploads to AWS using multipart upload.

  • It's quite fast at ~2 min / 2 GB, but it doesn't scale once a blob is larger than the App Service Plan's memory: it crashes the Function.

I need an example of piping chunks to boto3, but I haven't figured out how to do it yet. Ideally it would process multiple blobs in parallel.

import io
import logging
import os

import boto3
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobClient
from boto3.s3.transfer import TransferConfig

def create_blob_client(credential, blob_url):
    # 256 MB get sizes so the SDK fetches the blob in large chunks.
    try:
        blob_client = BlobClient.from_blob_url(
            blob_url,
            credential=credential,
            max_single_get_size=256 * 1024 * 1024,
            max_chunk_get_size=256 * 1024 * 1024,
        )
        logging.info('##### Blob client created successfully')
        return blob_client
    except Exception as e:
        logging.error(f'Error creating blob client: {e}')
        raise

def load_blob_into_memory(blob_client):
    # Downloads the *entire* blob into memory -- this is what crashes the
    # Function once a blob outgrows the App Service Plan's memory.
    try:
        blob_data = blob_client.download_blob().readall()
        return io.BytesIO(blob_data)
    except Exception as e:
        logging.error(f'Error downloading blob: {e}')
        raise

def create_boto3_resource(s3_id, s3_secret_key):
    try:
        return boto3.resource(
            's3',
            aws_access_key_id=s3_id,
            aws_secret_access_key=s3_secret_key,
        )
    except Exception as e:
        logging.error(f'Failed to create boto3 resource: {e}')
        raise

def upload_file_to_s3(s3_resource, s3_bucket, aws_dir, blob_byte_stream):
    try:
        config = TransferConfig(
            multipart_threshold=1024 * 25,
            max_concurrency=10,
            multipart_chunksize=1024 * 25,
            use_threads=True,
        )
        s3_resource.Bucket(s3_bucket).upload_fileobj(
            blob_byte_stream, Key=aws_dir, Config=config)
    except Exception as e:
        logging.error(f'Failed to upload file: {e}')
        raise

credentials = DefaultAzureCredential()
blob_url = os.environ['BLOB_URL']  # https://<account>.blob.core.windows.net/<container>/<blob>

blob_client = create_blob_client(credentials, blob_url)

s3_id = os.environ['S3_ID']
s3_secret_key = os.environ['S3_SECRET_KEY']
s3 = create_boto3_resource(s3_id, s3_secret_key)

s3_bucket = os.environ['S3_BUCKET_NAME']
aws_dir = '/test/downloads/test.png'
blob_byte_stream = load_blob_into_memory(blob_client)

upload_file_to_s3(s3, s3_bucket, aws_dir, blob_byte_stream)
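One way to pipe chunks to boto3 without buffering the whole blob (a sketch, not an official SDK pattern): wrap the iterator returned by download_blob().chunks() in a small file-like adapter, and hand that to upload_fileobj, which pulls data incrementally. Only the ChunkStream class below is new; the usage comments reuse names from the snippet above, and peak memory per blob drops to roughly one chunk (max_chunk_get_size), so several blobs can run in parallel:

```python
import io

class ChunkStream(io.RawIOBase):
    """Read-only file-like wrapper around any iterator of byte chunks,
    e.g. the iterator returned by blob_client.download_blob().chunks().
    Lets boto3's upload_fileobj() consume the blob incrementally instead
    of the whole thing being held in memory."""

    def __init__(self, chunk_iter):
        self._chunks = iter(chunk_iter)
        self._leftover = b""  # bytes fetched but not yet handed out

    def readable(self):
        return True

    def readinto(self, buf):
        # Refill from the chunk iterator whenever the buffer runs dry.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # EOF
        n = min(len(buf), len(self._leftover))
        buf[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n

# Hedged usage sketch (names as in the snippet above):
#   stream = ChunkStream(blob_client.download_blob().chunks())
#   s3.Bucket(s3_bucket).upload_fileobj(stream, Key=aws_dir, Config=config)
```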



SeaDude commented on June 18, 2024

Will close now as I've given way more details than I should have :). All good, I was just perusing the repo looking for leads on how to code my solution and noticed there was no Python listed:

[screenshot: repository language statistics, with no Python listed]

My main question was basically "why is this?": I was curious whether Python was slower than C#, less supported by Microsoft, or just hadn't made its way into the repo yet.

No biggie :)

