
This project forked from davew-msft/synapse


License: Apache License 2.0



Synapse End to End Workshop

https://git.davewentzel.com/workshops/synapse

What Technologies are Covered

  • Synapse workspaces
  • Azure DevOps (AzDO) or GitHub (and gh actions)
  • pretty much every other Azure data service

Target audience

  • Data Engineers/DBAs/Data Professionals
  • Data Scientists
  • App Developers

Business Case Background

Our company has hundreds of brick-and-mortar stores. Over the years, they have amassed large amounts of historical data stored in disparate systems. They wish to combine that historical data with near-real-time data streams to produce dashboard KPIs and machine learning models that enable them to make informed, up-to-the-minute decisions.

Our company has over 5 years and 30 billion rows of transactional sales data in Oracle, finance data in SAP HANA, and marketing data in Teradata. They also monitor the data coming in from their Twitter account.

They need a solution that allows them to query and analyze the data across all of these sources. Regardless of volume, they want these queries to return in seconds.

Our company has 100 stores, each equipped with 50 IoT sensors that monitor customer behavior in the store aisles. They need to ingest sensor data in near real time so they can quickly identify patterns that can be shared between stores, with the aim of improving sales through last-minute offers and better product placement.
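The ingest requirement above boils down to windowed aggregation over a sensor stream. Here is a technology-neutral sketch in plain Python of what "identify patterns in near real time" means mechanically; the store/sensor fields are made up for illustration, and the actual labs use Synapse pipelines and Spark for this:

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Hypothetical reading shape: (timestamp, store_id, sensor_id, dwell_seconds)
readings = [
    (datetime(2021, 1, 1, 9, 0, 0), "store-001", "aisle-07", 12.0),
    (datetime(2021, 1, 1, 9, 0, 30), "store-001", "aisle-07", 45.0),
    (datetime(2021, 1, 1, 9, 4, 0), "store-001", "aisle-03", 8.0),
    (datetime(2021, 1, 1, 9, 6, 0), "store-001", "aisle-07", 30.0),
]

WINDOW = timedelta(minutes=5)

def rolling_avg_dwell(readings):
    """Average dwell time per (store, aisle) over a 5-minute sliding window."""
    window = deque()
    snapshots = []
    for ts, store, sensor, dwell in readings:
        window.append((ts, store, sensor, dwell))
        # Evict readings that have fallen out of the window
        while window and ts - window[0][0] > WINDOW:
            window.popleft()
        sums = defaultdict(lambda: [0.0, 0])
        for _, s, a, d in window:
            sums[(s, a)][0] += d
            sums[(s, a)][1] += 1
        snapshots.append({k: total / n for k, (total, n) in sums.items()})
    return snapshots

snapshots = rolling_avg_dwell(readings)
print(snapshots[-1])
```

The same sliding-window shape is what a streaming engine gives you at scale; the labs swap this loop for managed services.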

Workshop Objectives

In this workshop we will build an end-to-end solution using Azure Synapse Analytics. The workshop will cover:

  • data loading
  • data preparation
  • data transformation
  • data serving
  • machine learning
  • batch data
  • streaming data

We will try to do everything using the same datasets, but some labs may introduce their own.

We want to build something like this:

... this is a "reference architecture" for a standard corporate information factory, focused on a Synapse-based (that is, "SQL-focused") implementation. It is, of course, not the only way to do things. If you prefer to think of your data in terms of "streams", this may be a better mental model: ... note that here we are using other technologies to process the stream data before it "comes to rest". These are not the ONLY technologies for processing streaming data; in fact, they may not even be the best.

What is Synapse Data Warehouse?

See Synapse Workspace

Workshop Agenda

Setup and Prep Labs

Topic | Lab Name | Description
----- | -------- | -----------
Setup | Lab 000: local development machine setup | You will need some tools installed on your local machine. Let's get those set up.
Setup | Lab 001: setup Synapse workspace | Understand a little more about what problems this service is trying to solve.
Setup | Lab 001a: Setup Source Control Integration | Never work in Synapse Live Mode.
Setup | Lab 021: Version Control Integration | General best practices when doing Synapse browser-based development.
Setup | Lab 002: import sample data into Synapse | Get familiar with the workspace UI.
Setup | Lab 003 (Optional): Configure additional users to access a Synapse workspace | You do not need to do this unless everyone in the workshop wants to share access to a single Synapse workspace.

Working with Linked Services

Data Discovery and Sandboxing

Understanding data through data exploration is the biggest challenge faced today by data professionals. Generally I do data exploration using either:

  • Synapse serverless SQL pool
    • I'm good at SQL, so I want to use this to start my analysis; plus, it has a wicked-cool UI for exploring data lake files.
  • Spark (Databricks or Synapse Spark)
    • For when I realize I need something more complex for the analysis, like Python or pandas, or I need to do some ML.
Lab Name | Description
-------- | -----------
Lab 010: Understanding Data Lakes | An overview of the structure and purpose of a data lake.
Lab 011: Data Discovery and Sandboxing with SQL Serverless | We also look at querying CSV and JSON data.
Lab 012: Data Discovery and Sandboxing with Spark | We do basic data lake queries using Spark; we will use Lab 052 for a much deeper dive later.
Lab 020: Shared Metadata | The 3 components of a Synapse workspace share much of their metadata to aid in reuse. We explore that in this lab.
Lab 030: Logical Data Warehousing with Synapse Serverless and your Data Lake | Open the linked sql file in your Synapse workspace and follow along.
Lab 031: ETL with Synapse Serverless, your data lake, and TSQL | Thinking about how to leverage your data lake to do ETL with TSQL and Serverless.
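Before opening the labs above, here is a feel for what "sandboxing" a data lake file looks like, as a minimal local analogue in plain Python. The labs themselves do this with serverless SQL's OPENROWSET and with Spark against the lake; the file contents and field names here are invented for illustration:

```python
import csv
import io

# A stand-in for a CSV file sitting in the data lake (fields are hypothetical)
raw = io.StringIO(
    "store_id,sale_date,amount\n"
    "store-001,2021-01-01,19.99\n"
    "store-002,2021-01-01,5.49\n"
    "store-001,2021-01-02,12.00\n"
)

rows = list(csv.DictReader(raw))

# Typical first sandboxing questions: row count, distinct keys, a simple aggregate
total = sum(float(r["amount"]) for r in rows)
stores = sorted({r["store_id"] for r in rows})
print(len(rows), stores, round(total, 2))
```

Serverless SQL answers exactly these kinds of questions directly over files in the lake, without any load step.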

ETL/ELT Options

There are a lot of different ways to do ETL/ELT using Synapse. We'll explore each of them in this section:

Topic | Lab Name | Description
----- | -------- | -----------
General Setup | Lab 050: Understanding Data Factory (Integrate) Best Practices | Even if you are not planning to use the ADF/Synapse "Integrate" experience, you will likely want to version control your notebooks and SQL files. We cover things like gitflow as well.
General Setup | Lab 051: Best Practices for source controlling SQL scripts | Let's walk through what I think is THE BEST WAY to do data lake-driven ETL.
Spark | Lab 052: Manipulating a Data Lake with Spark | Import ./notebooks/Lab052.ipynb directly into your Synapse workspace under Notebooks, then read the instructions and complete the lab.
Spark | Lab 053: Understanding Delta Tables with Spark | We'll explore using Delta tables from Synapse Spark pools. TODO
Spark | Lab 054: Sharing Data Between SparkSQL, Scala, and pySpark | Using multiple languages in Spark is the key to solving problems, but sharing variables and dataframes isn't always intuitive. We'll also look at how to persist data so Serverless pools can access it. WIP/TODO...see version in workspace
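Whichever tool you pick from the table above, the underlying shape is the same extract-transform-load pattern. A hedged, stdlib-only sketch of that shape (the labs do this with pipelines, Spark, or TSQL; the record fields and the quarantine policy here are illustrative assumptions):

```python
import json

# Extract: pretend these lines were landed in the lake by an ingest pipeline
raw_lines = [
    '{"store_id": "store-001", "amount": "19.99", "currency": "USD"}',
    '{"store_id": "store-002", "amount": "bad-data", "currency": "USD"}',
    '{"store_id": "store-001", "amount": "12.00", "currency": "USD"}',
]

def transform(line):
    """Parse, type, and validate one record; return None for rejects."""
    rec = json.loads(line)
    try:
        rec["amount"] = float(rec["amount"])
    except ValueError:
        return None  # quarantine bad rows rather than failing the whole load
    return rec

# Load: in the labs this would be a write to a curated lake zone or a dedicated pool
clean = [r for r in (transform(line) for line in raw_lines) if r is not None]
rejected = len(raw_lines) - len(clean)
print(len(clean), rejected)
```

The design question each lab answers differently is where these three steps run and how bad rows are handled, not what the steps are.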

Using SQL Serverless

Spark to Synapse Dedicated Pools

The ADF (Integrate) box-and-line tools

TODO: load campaign analytics table; might be a good fit for ADF data flows.

Power BI Integration

Security Topics

ML/AI in Synapse

Monitoring

Wrap Up

You should probably delete the resource group we created today to control costs.

If you'd rather keep the resources for future reference, you can simply PAUSE the dedicated SQL pool and the charges should be minimal.

Other Notes

Contributors

  • davewentzel