https://git.davewentzel.com/workshops/synapse
- Synapse workspaces
- Azure DevOps (AzDO) or GitHub (and GitHub Actions)
- pretty much every other Azure data service
- Data Engineers/DBAs/Data Professionals
- Data Scientists
- App Developers
Our company has hundreds of brick-and-mortar stores. Over the years, it has amassed large amounts of historical data stored in disparate systems. The company wants to combine this historical data with near real-time data streams to produce dashboard KPIs and machine learning models that support informed, up-to-the-minute decisions.
Our company has over 5 years and 30 billion rows of transactional sales data in Oracle, finance data stored in SAP HANA, and marketing data in Teradata. It also monitors the data coming in from its Twitter account.
The company needs a solution that can query and analyze the data across all of these sources. Regardless of volume, these queries should return in seconds.
Our company has 100 stores, each equipped with 50 IoT sensors that monitor customer behavior in the store aisles. The sensor data must be ingested in near real time so that patterns can be identified quickly and shared between stores, with the aim of improving sales through last-minute offers and better product placement.
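To get a feel for the scale this implies, here is a rough back-of-the-envelope sketch. The store and sensor counts come from the scenario above; the per-sensor reading rate is purely an assumption for illustration and is not a number from the workshop.

```python
# Back-of-the-envelope sizing for the IoT sensor stream.
STORES = 100             # from the scenario
SENSORS_PER_STORE = 50   # from the scenario
SECONDS_PER_DAY = 24 * 60 * 60


def events_per_day(readings_per_sensor_per_second: float) -> int:
    """Total sensor readings per day for an ASSUMED per-sensor reading rate."""
    total_sensors = STORES * SENSORS_PER_STORE
    return int(total_sensors * readings_per_sensor_per_second * SECONDS_PER_DAY)


total_sensors = STORES * SENSORS_PER_STORE
print(total_sensors)        # 5000
# ASSUMPTION: one reading per sensor per second (hypothetical rate).
print(events_per_day(1.0))  # 432000000
```

Even at a modest assumed rate, the stream adds up quickly, which is why the workshop treats streaming data separately from batch data.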
We will try to build an end-to-end solution using Azure Synapse Analytics. The workshop will cover:
- data loading
- data preparation
- data transformation
- data serving
- machine learning
- batch data
- streaming data
We will try to do everything using the same datasets, but can't guarantee it.
We want to build something like this:
... this is a "reference architecture" for a standard corporate information factory
which focuses on a Synapse-based implementation (aka "SQL-focused" solution). Of course, this is not the only way to do things. In fact, if you prefer to think of your data in terms of "streams", this may be a better model:
...note that here we are using other technologies to process the stream data before it "comes to rest". These are not the ONLY technologies for processing streaming data; in fact, they may not even be the best.
Topic | Lab Name | Description |
---|---|---|
Setup | Lab 000: local development machine setup | You will need some tools installed on your local machine. Let's get those set up. |
Setup | Lab 001: setup Synapse workspace | Understand a little more about the problems this service is trying to solve. |
Setup | Lab 001a: Setup Source Control Integration | Never work in Synapse Live Mode. |
Setup | Lab 021: Version Control Integration | General best practices when doing Synapse browser-based development. |
Setup | Lab 002: import sample data into Synapse | Get familiar with the workspace UI. |
Setup | Lab 003 (Optional): Configure additional users to access a Synapse workspace | You do not need to do this unless everyone in the workshop wants to share access to a single Synapse workspace. |
- Lab 005: creating a linked service to another storage account
  - skip this lab for now.
Understanding your data through data exploration is one of the biggest challenges data professionals face today. I generally do data exploration using either:
- Synapse serverless SQL pool
  - I'm good at SQL so I want to use this to start my analysis, plus, it has a wicked-cool UI for exploring data lake files
- Spark (Databricks or Synapse Spark)
  - If I realize I need something more complex for analysis, like Python or pandas, or I need to do some ML.
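As a rough illustration of the SQL-first route, the sketch below builds the kind of `OPENROWSET` preview query you might paste into a serverless SQL script to peek at data lake files. The helper name `build_openrowset_query` and the storage URL are hypothetical and not part of the workshop materials; only CSV and Parquet are handled here.

```python
def build_openrowset_query(file_url: str, file_format: str = "CSV", top: int = 10) -> str:
    """Return a T-SQL preview query over a file (or wildcard path) in the data lake.

    This helper is a hypothetical convenience for illustration; it only
    assembles text and does not connect to Synapse.
    """
    fmt = file_format.upper()
    if fmt not in ("CSV", "PARQUET"):
        raise ValueError(f"unsupported format in this sketch: {file_format!r}")
    if top < 1:
        raise ValueError("top must be at least 1")

    # Double any single quotes so the URL is a valid T-SQL string literal.
    escaped_url = file_url.replace("'", "''")
    options = [f"BULK '{escaped_url}'", f"FORMAT = '{fmt}'"]
    if fmt == "CSV":
        # Parser 2.0 with a header row is a common choice for CSV exploration.
        options += ["PARSER_VERSION = '2.0'", "HEADER_ROW = TRUE"]

    body = ",\n    ".join(options)
    return f"SELECT TOP {top} *\nFROM OPENROWSET(\n    {body}\n) AS [result]"


# ASSUMPTION: placeholder storage account and container, replace with your own.
print(build_openrowset_query(
    "https://<storage-account>.dfs.core.windows.net/<container>/sales/*.csv"
))
```

Once the preview looks right, the same query text can be wrapped in a view or refined with an explicit column list, which is roughly where the later labs pick up.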
Lab Name | Description |
---|---|
Lab 010: Understanding Data Lakes | An overview of the structure and purpose of a data lake |
Lab 011: Data Discovery and Sandboxing with SQL Serverless | We also look at querying CSV and JSON data. |
Lab 012: Data Discovery and Sandboxing with Spark | |
Lab 020: Shared Metadata | The 3 components of a Synapse Workspace share much of their metadata to aid in reuse. We explore that in this lab. |
Lab 030: Logical Datawarehousing with Synapse Serverless and your Data Lake | Open the SQL file at the link in your Synapse workspace and follow along. |
Thinking about how to leverage your data lake to do ETL with TSQL and Serverless | |
Lab 031: ETL with Synapse Serverless, your data lake, and TSQL | |
There are a lot of different ways to do ELT/ETL using Synapse. We'll explore each way in this section:
Topic | Lab Name | Description |
---|---|---|
General Setup | Lab 050: Understanding Data Factory (Integrate) Best Practices | Even if you are not planning to use ADF/Synapse "Integrate" experience, you will likely want to version control your notebooks and SQL files. We cover things like gitflow as well. |
General Setup | Lab 051: Best Practices for source controlling SQL scripts | Let's walk through what I think is THE BEST WAY to think about how to do data lake-driven ETL. |
Spark | Lab 052: Manipulating a Data Lake with Spark | Import ./notebooks/Lab052.ipynb directly into your Synapse workspace under Notebooks. |
Spark | Lab 053: Understanding Delta Tables with Spark | We'll explore using Delta tables from Synapse Spark pools TODO |
Spark | Lab 054: Sharing Data Between SparkSQL, Scala, and PySpark | Using multiple languages in Spark is the key to solving problems, but sharing variables and DataFrames isn't always intuitive. We'll also look at how to persist data so Serverless Pools can access it. WIP/TODO...see version in workspace |
- Lab 055: Writing a SQL Script to copy data from one data lake zone to another
  - we use Serverless as a SQL-based ELT tool
- Lab 056: Using Spark to write data into Synapse SQL Pools (Dedicated)
- Lab 056a: Using Spark to write data into Synapse SQL Pools - .NET version
- Lab 057: Loading Data from a Data Lake into Synapse SQL Pool using the "Integrate" box-and-line experience (ADF Copy Activity)
- Lab 058: Loading Data from a Data Lake into Synapse SQL Pool using the "Integrate" box-and-line experience (ADF Dataflows)
TODO: load campaign analytics table, might be good for ADF data flows.
- Lab 400: Consuming a Model in Synapse TODO
- Lab 410: Using Cognitive Search with Synapse TODO
- Lab 420: Basic ML lifecycle using Spark and Synapse Dedicated Pools
- Lab 421: Train an AutoML model against an existing Spark dataset
  - this requires you to complete Lab 420.
- Lab 422: Use an existing model to batch inference against Synapse Dedicated Pool data
  - this requires you to complete Lab 420 and Lab 421.
You should probably delete the resource group we created today to control costs.
If you'd rather keep the resources for future reference you can simply PAUSE the dedicated SQL Pool and the charges should be minimal.
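If you prefer to script the cleanup, here is a minimal sketch that assembles the relevant Azure CLI commands. It assumes the Azure CLI is installed and logged in; the pool, workspace, and resource group names are hypothetical placeholders, and the sketch only prints the commands rather than running them.

```python
import shlex
from typing import List


def pause_pool_command(pool_name: str, workspace_name: str, resource_group: str) -> List[str]:
    """Azure CLI arguments to pause a dedicated SQL pool (compute charges stop while paused)."""
    return [
        "az", "synapse", "sql", "pool", "pause",
        "--name", pool_name,
        "--workspace-name", workspace_name,
        "--resource-group", resource_group,
    ]


def delete_group_command(resource_group: str) -> List[str]:
    """Azure CLI arguments to delete the whole resource group without prompting."""
    return ["az", "group", "delete", "--name", resource_group, "--yes", "--no-wait"]


# ASSUMPTION: all three names below are placeholders, not names used by the workshop.
pause_cmd = pause_pool_command("my_sql_pool", "my-synapse-workspace", "my-workshop-rg")
delete_cmd = delete_group_command("my-workshop-rg")

# Print the commands for review; pass either list to subprocess.run(cmd, check=True) to execute it.
print(shlex.join(pause_cmd))
print(shlex.join(delete_cmd))
```

Printing first is deliberate: deleting a resource group is irreversible, so it is worth reading the command before you run it.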
- templates folder has a bunch of my patterns that you may be able to leverage