
Comments (3)

algattik commented on June 18, 2024

Hi, we already deal with many potential combinations of ingestion, processing, and serving technologies before adding an extra dimension of programming language, and each sample carries maintenance overhead over time, which represents real work for a community-led project. The purpose of the repo is that, based on one or more samples together with the Azure documentation, you should be able to build a "skeleton" pipeline in a fairly straightforward manner.

Have you had some difficulties achieving this? What kind of Python samples would you be looking at? We can certainly look at adding one if it can help the community. Are you considering Python Azure functions, or PySpark code?

from streaming-at-scale.

SeaDude commented on June 18, 2024

Hello, thanks for the details. I've mostly been having difficulty finding a Python example of streaming data from Azure Blob Storage to external providers using Azure Functions async methods and chunk()-ing.

The areas that are murky to me are:

  • Do I use the azure.storage.blob libraries and asyncio? OR
  • Do I use the azure.storage.blob.aio libraries?
    • Do the .aio libraries require asyncio operations too?
    • What is the difference between the two libraries?
  • Some chunk()-ing examples run upwards of 500 lines of code, yet I feel like I'm close with the code below.
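On the sync-vs-async question: azure.storage.blob and azure.storage.blob.aio expose the same client classes and method names; the difference is that in the .aio variants every network call is a coroutine, so yes, they require asyncio (or another event-loop runner) too. A minimal sketch, assuming azure-storage-blob is installed; the SDK import is deferred into the function so the snippet itself loads without it:

```python
import asyncio

async def copy_blob_bytes(blob_url, credential):
    # Deferred import: azure.storage.blob.aio mirrors azure.storage.blob,
    # but its clients are async context managers and its I/O methods
    # are coroutines that must be awaited.
    from azure.storage.blob.aio import BlobClient

    async with BlobClient.from_blob_url(blob_url, credential=credential) as client:
        downloader = await client.download_blob()  # coroutine in the .aio variant
        return await downloader.readall()          # also a coroutine

# Entry point: an asyncio event loop is required to drive the coroutines.
# asyncio.run(copy_blob_bytes("https://<account>.blob.core.windows.net/<c>/<b>", cred))
```

The sync azure.storage.blob.BlobClient has the same method names minus async/await; that is essentially the whole difference between the two libraries.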

This code downloads blobs from Azure Storage in 256 MB chunks (synchronously), loads the whole file into memory, then uploads to AWS using multipart upload.

  • It's quite fast at ~2 min / 2 GB, but it doesn't scale once a blob is larger than the App Service Plan's memory: it crashes the Function.

I need an example of piping chunks to boto3, but I haven't figured out how to do it yet. Ideally it would process multiple blobs in parallel.

import io
import logging
import os

import boto3
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobClient
from boto3.s3.transfer import TransferConfig

def create_blob_client(credential, blob_url):
    # 256 MB get sizes so the SDK fetches the blob in large chunks.
    try:
        blob_client = BlobClient.from_blob_url(
            blob_url,
            credential=credential,
            max_single_get_size=256 * 1024 * 1024,
            max_chunk_get_size=256 * 1024 * 1024,
        )
        logging.info('##### Blob client created successfully')
        return blob_client
    except Exception as e:
        logging.error(f'Error creating blob client: {e}')
        raise

def load_blob_into_memory(blob_client):
    # Downloads the *entire* blob into memory -- this is what crashes the
    # Function once a blob outgrows the App Service Plan's memory.
    try:
        blob_data = blob_client.download_blob().readall()
        return io.BytesIO(blob_data)
    except Exception as e:
        logging.error(f'Error downloading blob: {e}')
        raise

def create_boto3_resource(s3_id, s3_secret_key):
    try:
        return boto3.resource(
            's3',
            aws_access_key_id=s3_id,
            aws_secret_access_key=s3_secret_key,
        )
    except Exception as e:
        logging.error(f'Failed to create boto3 resource: {e}')
        raise

def upload_file_to_s3(s3_resource, s3_bucket, aws_dir, blob_byte_stream):
    try:
        config = TransferConfig(
            multipart_threshold=1024 * 25,
            max_concurrency=10,
            multipart_chunksize=1024 * 25,
            use_threads=True,
        )
        s3_resource.Bucket(s3_bucket).upload_fileobj(
            blob_byte_stream, Key=aws_dir, Config=config)
    except Exception as e:
        logging.error(f'Failed to upload file: {e}')
        raise

credentials = DefaultAzureCredential()
blob_url = os.environ['BLOB_URL']  # https://<account>.blob.core.windows.net/<container>/<blob>

blob_client = create_blob_client(credentials, blob_url)

s3_id = os.environ['S3_ID']
s3_secret_key = os.environ['S3_SECRET_KEY']
s3 = create_boto3_resource(s3_id, s3_secret_key)

s3_bucket = os.environ['S3_BUCKET_NAME']
aws_dir = '/test/downloads/test.png'
blob_byte_stream = load_blob_into_memory(blob_client)

upload_file_to_s3(s3, s3_bucket, aws_dir, blob_byte_stream)
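One way to pipe chunks to boto3 without buffering the whole blob (a sketch, not an official SDK pattern): wrap the iterator returned by download_blob().chunks() in a small file-like adapter, and hand that to upload_fileobj, which pulls data incrementally. Only the ChunkStream class below is new; the usage comments reuse names from the snippet above, and peak memory per blob drops to roughly one chunk (max_chunk_get_size), so several blobs can run in parallel:

```python
import io

class ChunkStream(io.RawIOBase):
    """Read-only file-like wrapper around any iterator of byte chunks,
    e.g. the iterator returned by blob_client.download_blob().chunks().
    Lets boto3's upload_fileobj() consume the blob incrementally instead
    of the whole thing being held in memory."""

    def __init__(self, chunk_iter):
        self._chunks = iter(chunk_iter)
        self._leftover = b""  # bytes fetched but not yet handed out

    def readable(self):
        return True

    def readinto(self, buf):
        # Refill from the chunk iterator whenever the buffer runs dry.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # EOF
        n = min(len(buf), len(self._leftover))
        buf[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n

# Hedged usage sketch (names as in the snippet above):
#   stream = ChunkStream(blob_client.download_blob().chunks())
#   s3.Bucket(s3_bucket).upload_fileobj(stream, Key=aws_dir, Config=config)
```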



SeaDude commented on June 18, 2024

Will close now as I've given way more details than I should have :). All good, I was just perusing the repo looking for leads on how to code my solution and noticed there was no Python listed:

[screenshot: repository language statistics, with no Python listed]

My main question was basically "why is this?": I was curious whether Python was slower than C#, less supported by Microsoft, or just hadn't made its way into the repo yet.

No biggie :)

