Comments (3)
Hi, we already deal with many potential combinations of ingestion, processing, and serving technologies before adding the extra dimension of programming language, and each sample carries some maintenance overhead over time, which represents real work for a community-led project. The purpose of the repo is that, based on one or more samples together with the Azure documentation, you should be able to build a "skeleton" pipeline in a fairly straightforward manner.
Have you had some difficulties achieving this? What kind of Python samples would you be looking at? We can certainly look at adding one if it can help the community. Are you considering Python Azure functions, or PySpark code?
from streaming-at-scale.
Hello, thanks for the details. I've mostly been having difficulty finding a Python example of streaming data from Azure Blob Storage to external providers using Azure Function async methods and `chunk()`-ing.
The areas that are murky to me are:
- Do I use the `azure.storage.blob` libraries and `asyncio`, or do I use the `azure.storage.blob.aio` libraries?
- Do the `.aio` libraries require `asyncio` operations too?
- What is the difference between the two libraries?
- Some `chunk()`-ing examples are upwards of 500 lines of code, yet I feel like I'm close with the code below.
This code downloads blobs from Azure Storage in 256 MB chunks (synchronously), loads the whole file into memory, then uploads to AWS using a multipart upload.
- It's quite fast at ~2 min / 2 GB, but this doesn't scale once a blob is larger than the App Service Plan's memory: it crashes the Function.
I need an example of piping chunks to `boto3`, but I haven't figured out how to do it yet. Ideally it would process multiple blobs at once in parallel.
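For the "multiple blobs in parallel" part, my current understanding is that the `.aio` clients expose the same methods as the sync ones but as coroutines, so they do need an event loop. The concurrency pattern would look roughly like this; a stub coroutine stands in for the real `azure.storage.blob.aio` download call, since I haven't verified this end to end:

```python
import asyncio

# Hypothetical stand-in for azure.storage.blob.aio.BlobClient.download_blob();
# in the real .aio package the I/O methods are coroutines, which is why
# asyncio is required to drive them.
async def download_blob_stub(blob_name: str) -> bytes:
    await asyncio.sleep(0)  # stands in for awaited network I/O
    return f'contents of {blob_name}'.encode()

async def process_blob(blob_name: str) -> int:
    data = await download_blob_stub(blob_name)
    # ...upload `data` (or its chunks) to S3 here...
    return len(data)

async def main(blob_names):
    # gather() keeps several per-blob coroutines in flight on one event loop,
    # which is the "multiple blobs at once" part.
    return await asyncio.gather(*(process_blob(n) for n in blob_names))

sizes = asyncio.run(main(['a.png', 'b.png']))
print(sizes)  # -> [17, 17]
```

The same `gather` shape should apply once the stub is swapped for a real async download-and-upload step.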
import io
import logging
import os

import boto3
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobClient
from boto3.s3.transfer import TransferConfig

def create_blob_client(credential, blob_url):
    try:
        # from_blob_url takes the URL first; credential is a keyword argument
        blob_client = BlobClient.from_blob_url(
            blob_url,
            credential=credential,
            max_single_get_size=256 * 1024 * 1024,
            max_chunk_get_size=256 * 1024 * 1024,
        )
        logging.info('##### Blob Client created successfully')
        return blob_client
    except Exception as e:
        logging.error(f'Error creating Blob Client: {e}')
        raise

def load_blob_into_memory(blob_client):
    try:
        blob_data = blob_client.download_blob().readall()
        return io.BytesIO(blob_data)
    except Exception as e:
        logging.error(f'Error downloading blob: {e}')
        raise

def create_boto3_resource(s3_id, s3_secret_key):
    try:
        return boto3.resource(
            's3',
            aws_access_key_id=s3_id,
            aws_secret_access_key=s3_secret_key,
        )
    except Exception as e:
        logging.error(f'Failed to create boto3 resource: {e}')
        raise

def upload_file_to_s3(s3, s3_bucket, aws_dir, blob_byte_stream):
    try:
        config = TransferConfig(
            multipart_threshold=1024 * 25,
            max_concurrency=10,
            multipart_chunksize=1024 * 25,
            use_threads=True,
        )
        s3.Bucket(s3_bucket).upload_fileobj(blob_byte_stream, Key=aws_dir, Config=config)
    except Exception as e:
        logging.error(f'Failed to upload file: {e}')

credentials = DefaultAzureCredential()
blob_url = os.environ['BLOB_URL']  # in the Function this would come from the Event Grid message
blob_client = create_blob_client(credentials, blob_url)
s3 = create_boto3_resource(os.environ['S3_ID'], os.environ['S3_SECRET_KEY'])
s3_bucket = os.environ['S3_BUCKET_NAME']
aws_dir = '/test/downloads/test.png'
blob_byte_stream = load_blob_into_memory(blob_client)
upload_file_to_s3(s3, s3_bucket, aws_dir, blob_byte_stream)
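The shape of the chunk-piping I'm after is roughly the following: drain the iterator that `blob_client.download_blob().chunks()` returns, buffering only up to one S3 part at a time before handing it off to a multipart upload, so memory use is bounded by the part size instead of the blob size. This sketch uses in-memory stand-ins for both the Azure chunk iterator and the boto3 part-upload call (the stand-in names are placeholders, not real SDK calls); note that S3 multipart uploads require every part except the last to be at least 5 MB:

```python
import io

def pipe_chunks(chunk_iter, upload_part, min_part_size=5 * 1024 * 1024):
    """Buffer chunks up to min_part_size, then hand each part to upload_part.

    chunk_iter stands in for blob_client.download_blob().chunks(); upload_part
    stands in for a boto3 multipart upload_part call. Only one part is held
    in memory at a time.
    """
    buf = io.BytesIO()
    part_number = 1
    etags = []
    for chunk in chunk_iter:
        buf.write(chunk)
        if buf.tell() >= min_part_size:
            etags.append(upload_part(part_number, buf.getvalue()))
            part_number += 1
            buf = io.BytesIO()
    if buf.tell():  # flush the final, possibly short, part
        etags.append(upload_part(part_number, buf.getvalue()))
    return etags

# In-memory demo: three fake chunks and a recording uploader.
uploaded = []
def fake_upload_part(n, data):
    uploaded.append((n, len(data)))
    return f'etag-{n}'

parts = pipe_chunks(iter([b'a' * 4, b'b' * 4, b'c' * 2]),
                    fake_upload_part, min_part_size=6)
print(parts)     # -> ['etag-1', 'etag-2']
print(uploaded)  # -> [(1, 8), (2, 2)]
```

With the real SDKs, `fake_upload_part` would presumably be replaced by `s3_client.upload_part(...)` between `create_multipart_upload` and `complete_multipart_upload` calls, but I haven't gotten that wired up yet.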
Will close now as I've given way more details than I should have :). All good, I was just perusing the repo looking for leads on how to code my solution and noticed there was no Python listed.
My main question was basically "why is this?": is Python slower than C#, less supported by Microsoft, or has it just not made its way into the repo yet?
No biggie :)