
microsoft / mcw-cosmos-db-real-time-advanced-analytics

MCW CosmosDB real-time advanced analytics

License: MIT License

C# 35.90% Jupyter Notebook 63.14% PowerShell 0.96%

mcw-cosmos-db-real-time-advanced-analytics's Introduction

Cosmos DB real-time advanced analytics

This workshop is archived and no longer being maintained. Content is read-only.

Woodgrove Bank, which provides payment processing services for commerce, is looking to design and implement a proof-of-concept (PoC) of an innovative fraud detection solution. They want to provide new services to their merchant customers, helping them save costs by applying machine learning and advanced analytics to detect fraudulent transactions. Their customers are located around the world, and the right solution would minimize the latency they experience by distributing as much of the solution as possible, as close as possible, to the regions in which those customers use the service.

March 2022

Target audience

  • Application developer
  • AI developer
  • Data scientist

Abstracts

Workshop

In this workshop, you will learn to design a data pipeline solution that leverages Cosmos DB for both the scalable ingest of streaming data and the globally distributed serving of pre-scored data and machine learning models. The solution leverages Azure Cosmos DB in concert with Azure Synapse Analytics, through Azure Synapse Link, to enable a modern data warehouse solution that can be used to create risk-reduction solutions that score transactions for fraud both in an offline, batch approach and in a near real-time, request/response approach. The Azure Cosmos DB change feed is used for near real-time scoring, while the Azure Cosmos DB analytical store is used for batch processing and high-performance analytical queries.
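As a brief illustration of the batch/analytical side described above, the sketch below shows how a Synapse Spark notebook might read the Cosmos DB analytical store through Azure Synapse Link. The linked service and container names are placeholders, not the lab's actual values.

# Hypothetical linked service and container names; adjust to your own Synapse Link setup.
transactions_df = (spark.read
    .format("cosmos.olap")
    .option("spark.synapse.linkedService", "WoodgroveCosmosDb")
    .option("spark.cosmos.container", "transactions")
    .load())
transactions_df.printSchema()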

At the end of this workshop, you will be better able to design and implement solutions that leverage the strengths of Cosmos DB in support of advanced analytics solutions that require high-throughput ingest, low-latency serving, and global scale, in combination with scalable machine learning, big data, and real-time processing capabilities.

Whiteboard design session

Woodgrove Bank, which provides payment processing services for commerce, is looking to design and implement a PoC of an innovative fraud detection solution. They want to provide new services to their merchant customers, helping them save costs by applying machine learning and advanced analytics to detect fraudulent transactions. Their customers are located around the world, and the right solution would minimize the latency they experience by distributing as much of the solution as possible, as close as possible, to the regions in which those customers use the service.

In this whiteboard design session, you will work in a group to design the data pipeline PoC that could support the needs of Woodgrove Bank.

At the end of this workshop, you will be better able to design solutions that leverage the strengths of Cosmos DB in support of advanced analytics solutions that require high-throughput ingest, low-latency serving, and global scale, in combination with scalable machine learning, big data, and real-time processing capabilities.

Hands-on lab

Woodgrove Bank, which provides payment processing services for commerce, is looking to design and implement a PoC of an innovative fraud detection solution. They want to provide new services to their merchant customers, helping them save costs by applying machine learning and advanced analytics to detect fraudulent transactions. Their customers are located around the world, and the right solution would minimize the latency they experience by distributing as much of the solution as possible, as close as possible, to the regions in which those customers use the service.

In this hands-on lab session, you will implement a PoC of the data pipeline that could support the needs of Woodgrove Bank.

At the end of this workshop, you will be better able to implement solutions that leverage the strengths of Cosmos DB in support of advanced analytics solutions that require high-throughput ingest, low-latency serving, and global scale, in combination with scalable machine learning, big data, and real-time processing capabilities.

Azure services and related products

  • Azure Cosmos DB
  • Azure Synapse Analytics
  • Azure Data Lake Storage Gen2
  • Azure Event Hubs
  • Azure Kubernetes Service
  • Azure Machine Learning
  • Power BI

Azure solutions

Globally Distributed Data

Related references

Help & Support

We welcome feedback and comments from Microsoft SMEs & learning partners who deliver MCWs.

Having trouble?

  • First, verify you have followed all written lab instructions (including the Before the Hands-on lab document).
  • Next, submit an issue with a detailed description of the problem.
  • Do not submit pull requests. Our content authors will make all changes and submit pull requests for approval.

If you are planning to present a workshop, review and test the materials early! We recommend at least two weeks prior.

Please allow 5 - 10 business days for review and resolution of issues.

mcw-cosmos-db-real-time-advanced-analytics's People

Contributors

codingbandit, crpietschmann, daronyondem, dawnmariedesjardins, feaselkl, jamesrcounts, joelhulen, kenmuse, kylebunting, saimachi, timahenning, waltermyersiii, zoinertejada


mcw-cosmos-db-real-time-advanced-analytics's Issues

Events Hubs Request Failed

There is a problem with the Event Hubs connection string: we had to add the EntityPath to it first.

When running the application, the following error appears:

"Event Hubs request eventually failed with: Put token failed. status-code: 404, status-description: The messaging entity 'sb://-.servicebus.windows.net/woodgrove-events' could not be found. TrackingId:51ebabc3-1f2e-4644-890f-dbdb1e78b772_G1, SystemTracker:woodgrove-events.servicebus.windows.net:woodgrove-events, Timestamp:2019-05-03T11:11:50. "

Secret scanning alert

Received this notice. Please review and advise.
The close-as options are Revoked, False positive, Used in tests, and Won't fix.



Thank you!

Lab Feedback

I was running into the error "no module named 'sklearn.compose'" in one of the notebook exercises when using scikit-learn v0.20.1. Uninstalling the existing scikit-learn and installing v0.20.2 resolved this.

Lab feedback

The step "While waiting for your job to start, select Workspace from the left-hand menu, and navigate to the 3-Batch-Score-Transactions notebook under the Exercise 4 folder." is incorrect: the notebook is not under the Exercise 4 folder.

Feedback on Trainer guide - Globally Distributed Data section

  1. Under "Customer Objections portion"
  • Current text: "We are concerned about how much it costs to use Cosmos DB for our solution. What is the real value in using this service?"
  • Change to: "We are concerned about how much it costs to use Cosmos DB for our solution. What is the real value of the service, and how do we set up Cosmos DB in an optimal way?
  1. The "Infographic for common scenarios"
  • It makes more sense for the Cosmos DB in the Data Ingest section to have the "Multi-master," as multi-master enables writes to any region.
  1. Under "Preferred Solution" section
  • Right now, the entire section is a large block of text. Can we break out the sections into smaller paragraphs?
  • Current sentence: With change feed enabled in Cosmos DB, the transactions can be read as a stream of incoming data within an Azure Databricks notebook, using the azure-cosmosdb-spark connector, and stored long-term within an Azure Databricks Delta table backed by Azure Data Lake Storage.
  • Change Feed is always available in any Cosmos DB collection, and user doesn't have to explicitly enable it. They can just read from it.
    Change to: "Using the built-in change feed feature in Cosmos DB..."
  1. Under "Globally distributed data"
  • Current sentence: It is a simple process to add or remove geographical regions associated with a Cosmos DB database at any time with a few clicks, or programmatically through a single API call. Because Cosmos DB automatically indexes the data stored within a Cosmos container (database) upon ingestion, users can query the data without having to deal with a schema or the complications of index management in a globally distributed setup.
  • Adding and removing regions happens at the account level (not the database level). Inside an account, there can be multiple databases, and each database can have multiple containers, aka collections. Containers and databases are not the same thing, as implied by the current wording.
  • Cosmos DB indexes data at a container (aka collection) level.
  • Change: Overall sentence is fine, change to reflect above clarifications.
  1. Current wording: In Azure Cosmos DB, provisioned throughput is represented as request units/second (RUs). RUs measure the cost of both read and write operations against your Cosmos DB container. Because Cosmos DB is designed with transparent horizontal scaling (e.g., scale out) and multi-master replication, you can very quickly and easily increase or decrease the number of RUs to handle thousands to hundreds of millions of requests per second around the globe with a single API call.
    When you set a number of RUs for a container, Cosmos DB ensures that those RUs are available in all regions associated with your Cosmos DB account. When you scale out the number of regions by adding a new one, Cosmos will automatically provision the same quantity of RUs in the newly added region. You cannot selectively assign different RUs to a specific region. These RUs are provisioned for a container (or database) for all associated regions.
  • Provisioned throughput is represented as request units / second (RU/s). Any reference of RUs should be RU/s.
  • Add a sentence that explains Cosmos DB is a guaranteed throughput model.
  • Add a sentence explaining you can provision throughput at a container level, or at a database level.
  1. Current sentence: The Session consistency level is the default, and is suitable for most operations. One thing to consider in your design is, when set to lower consistency level, any arbitrary set of operations can be executed in an ACID-compliant (Atomicity, Consistency, Isolation, Durability) transaction by performing those operations from within a stored procedure.
  • We should explain what session consistency is, and why it's a good choice here.
  • The consistency of the database account (e.g. session, eventual, etc) is independent of whether you can do ACID transactions. They are completely unrelated, and we shouldn't confuse people. Regardless of the consistency level chosen, you can always do ACID transactions across multiple records in a stored procedure, AS LONG as it is scoped to a single logical partition key value. This is another reason why choosing a partition key becomes important, as transactions can only be done across multiple records as long as they all have the same logical partition key value.
  • Clarify above to reflect this. I would completely remove discussion of ACID transactions in this section. Just focus on choosing a consistency level. We can mention of ACID transactions in the section in no. 9.
  1. Current sentence: Read-consistency is tied to the ordering and propagation of the write/update operations. If there are no writes being made to the data set, then the consistency level is not a factor.
  • I would clarify - it's that if no new writes are made to a data set at the same time the reads are happening, then all reads, regardless of consistency level will show the latest data. (See no.8)
  1. Current sentence: In the case of Woodgrove Bank, Cosmos DB is being used for storing suspicious transactions that they identify by performing scheduled batch processing against all transactions. In this case, there are very few writes (which would happen in batch as the suspicious transactions are written out) compared to the number of reads (which might happen each time a customer reviews the flagged transactions). Therefore, a consistency level of Session will suffice for these documents, resulting in higher read throughput (approximately 2x faster) compared to strong and bounded staleness.
  • Need to make it more clear that because the writes are happening in scheduled batches, for most of the read-heavy workload, all reads will get the latest data anyway. So you can choose session consistency.
  1. Current sentence: If Cosmos DB is also being used for data ingest for real-time payment transactions, you can guarantee ACID-compliant transactions by performing the transactions in a stored procedure, if they have multiple, related transactions they would like written at once. Another option is to monitor the Probabilistically Bounded Staleness (PBS) metric of Cosmos DB transactions, which shows how often each transaction actually achieves a stronger consistency level than what is set. This will provide them with the information they need to select the most appropriate consistency level for those transactions.
  • Again, ACID transactions are only over a single partition key value.
  • I would remove the PBS section. It doesn't help choose a consistency level. The way to choose a consistency level is to review the application's requirements, what each level offers, and choose the right one based on that.
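As a minimal sketch of items 5 and 6 above (hypothetical endpoint, key, database, container, and partition key names; azure-cosmos Python SDK; illustrative only, not the lab's code), the snippet below shows consistency set at the client and throughput provisioned at either the database or the container level:

from azure.cosmos import CosmosClient, PartitionKey

# Session is the default consistency level, but it can also be set explicitly on the client.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>",
                      consistency_level="Session")

# Throughput (RU/s) can be provisioned at the database level (shared by its containers)...
database = client.create_database_if_not_exists("Woodgrove", offer_throughput=1000)

# ...or dedicated to a single container.
container = database.create_container_if_not_exists(
    id="transactions",
    partition_key=PartitionKey(path="/ipCountryCode"),  # hypothetical partition key
    offer_throughput=15000,
)

# RU/s can later be scaled up or down with a single call.
container.replace_throughput(750)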

Issue in Exercise 1, Task 4

In Exercise 1, Task 4, Step 37, I didn't find the ttl value in any of the items. I had set Time to Live to the On (no default) option under the Scale & Settings blade in Data Explorer, but I didn't get the expected output.
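A possible explanation, with a minimal sketch (hypothetical endpoint, key, database, container, and item fields; azure-cosmos Python SDK): with Time to Live set to On (no default), items only expire, and only carry a ttl property, when one is set explicitly on the item itself.

from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("Woodgrove").get_container_client("transactions")

# With the container's default TTL set to On (no default), this item expires after 60 seconds;
# items written without a ttl property never expire and will not show a ttl value.
container.upsert_item({"id": "txn-001", "collectionType": "Transaction", "ttl": 60})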

image

Exercise 3 Task 2: Error in notebook Prepare realtime scoring model

After executing the first cell in this notebook, a couple of errors kept me from moving on with the lab.

ERROR: fairlearn 0.4.6 has requirement scikit-learn>=0.22.1, but you'll have scikit-learn 0.20.3 which is incompatible.
ERROR: onnxruntime 1.3.0 has requirement numpy>=1.16.6, but you'll have numpy 1.16.2 which is incompatible.
ERROR: fairlearn 0.4.6 has requirement numpy>=1.17.2, but you'll have numpy 1.16.2 which is incompatible.
ERROR: fairlearn 0.4.6 has requirement scikit-learn>=0.22.1, but you'll have scikit-learn 0.20.3 which is incompatible.

So I changed the cell to:

!pip install scikit-learn==0.22.1
!pip install numpy==1.17.2

From then on, the notebook ran successfully. I don't know if this affects the resulting model, but it was the only way to continue.

Next update suggestions

Suggestions from the 9/2021 (PR #58) update we were unable to include:

SME feedback:
We illustrate in Exercise 6, Task 2 how to query nested JSON using CROSS APPLY, but we should also have an example of how to query a JSON array object, as this is very common with Cosmos DB JSON document structures, where we effectively use embedding for 1:few relationships. You could potentially expand this use case by converting the "phone"/"email" attributes to an array of emails/phones with another "type" property to differentiate them (like home, mobile, or default/other), and illustrate how to explode the array in the SQL serverless query/Spark, as this is a common ask (see the sketch below).
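For the array case, a minimal PySpark sketch (hypothetical DataFrame and column names) of the kind of example being requested: a customers DataFrame whose emails column is an array of objects with address and type properties, exploded to one row per email.

from pyspark.sql.functions import col, explode

# One output row per element of the "emails" array, keeping the parent customerId.
exploded = (customers_df
    .select(col("customerId"), explode(col("emails")).alias("email"))
    .select("customerId", col("email.address"), col("email.type")))
exploded.show()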

The lab showcases the Cosmos DB SQL API, but we also need a similar example/lab for the Cosmos DB Mongo API for a similar workload. I would love to see the same lab ported to also load into the Cosmos DB Mongo API, with a version of the Synapse queries/views/Spark notebooks for it, where they will differ.

From author:
For your first point, we should make note of that suggestion and plan on adding nested JSON queries in the next update.

As for your second point, that is a great suggestion, but it is outside of the scope of this update. I will have to defer to the MCW stakeholders as to whether they wish to invest in a Mongo version in a future update.

ARM Script

Please make sure we have an ARM script to deploy resources for this lab.

MicrosoftCloudWorkshop.com description needs an update.

On MicrosoftCloudWorkshop.com, this MCW shows this:

Design a data pipeline solution leveraging Cosmos DB for scalable ingest and global distribution. Use Azure Databricks Delta with a modern data warehouse to reduce risk.

However, Azure Databricks is not anywhere in the HOL materials. It looks like it was edited out in September 2020.

Azure Databricks Delta was replaced with Azure Synapse Analytics. Perhaps the MicrosoftCloudWorkshop.com description could be:

Design a data pipeline solution leveraging the Azure Cosmos DB change feed in concert with Azure Synapse Analytics to enable a modern data warehouse solution that reduces risk.

Error in Synapse Notebook

In Exercise 2, Task 4, Step 2, while running the code in the Synapse workspace, I get the error below:
NameError : name 'display' is not defined
Traceback (most recent call last):
NameError: name 'display' is not defined
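If the goal at that step is simply to render the DataFrame, a minimal workaround sketch (assuming the cell holds a Spark DataFrame named df) is to use the DataFrame API directly instead of the display helper:

# Prints the first 10 rows without relying on the notebook-specific display() function.
df.show(10, truncate=False)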



Deploy error

Thank you for fixing #51, but unfortunately there is a new error.

{
"status": "Failed",
"error": {
"code": "InvalidTemplate",
"message": "行 '476'、列 '9' のリソース '/subscriptions/09322f29-9523-4776-9f1b-f70a10f0eaeb/resourceGroups/MCW-Cosmos-DB-Real-Time-Advanced-Analytics/providers/Microsoft.Resources/deployments/UpdateSparkPool01' に対するテンプレート言語の式を処理できません。'The template variable 'location' is not found. Please see https://aka.ms/arm-template/#variables for usage details.'",
"additionalInfo": [
{
"type": "TemplateViolation",
"info": {
"lineNumber": 476,
"linePosition": 9,
"path": ""
}
}
]
}
}

Lab feedback

Please remove the subscription ID and resource group names from the notebooks and add a comment telling users to replace them with their own. This is the case in several places (see the sketch below).
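A minimal sketch (hypothetical variable names) of the kind of placeholder and comment the notebooks could use instead of hard-coded values:

# TODO: replace the placeholders below with your own subscription ID and resource group name.
subscription_id = "<your-subscription-id>"
resource_group = "<your-resource-group-name>"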

Exercise 2, Task 10: Error in Notebook

In the HOL step-by-step guide, Exercise 2, Task 10, the Cosmos-DB-Change-Feed notebook gives an error when running command no. 4.
It was working fine two days ago, but after the updates made nine hours ago today, the above-mentioned notebook produces an error.


Before the Hands-on Lab needs an update.

We need to have a VM in the template with all the prerequisites installed, because the lab uses applications that must be installed to run through the lab.

  1. The VM should include VS 2019.
  2. The .NET 5 SDK should be installed.

In the Before the Hands-on Lab document, it is not mentioned anywhere that we need a VM with VS 2019 and the .NET SDK, and that is causing confusion.

Thanks

Exercise 2 Task 10: Issue in running Exercise 2 Notebook named 2-Cosmos-DB-Change-Feed.

Hi @joelhulen, I am running this lab and having an issue with the Exercise 2 notebook 2-Cosmos-DB-Change-Feed.

Mainly, I am facing an issue with Cmd 29: the Spark jobs are taking a long time to execute (more than 40 minutes and still in a running state).


Could you please help me resolve this? (Any suggestion will be helpful.) I can share the environment details with you if needed, so that you can look into this quickly.

Thanks,
Abhishek

Exercise 3 Task 2: Error in notebook Prepare real-time scoring model

I am getting this error on cell 2:
ImportError: cannot import name 'urlretrieve'

I tried commenting out # ws = Workspace.from_config() and following the steps as recommended in step 10, but it did not help.

I ran the following to see the version:
import azureml.core
print(azureml.core.VERSION)

Output:
1.36.0

I also tried the following, but it did not help:
from urllib.request import urlretrieve

I also ran pip install --upgrade azureml-core.

Please help.

Deploy error

When deploying the ARM template, the deployment failed with the error below, even when I changed to a different region or a different subscription.

https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2FMicrosoft%2FMCW-Cosmos-DB-Real-Time-Advanced-Analytics%2Fmaster%2FHands-on%20lab%2FDeployment%2Fenvironment-template.json


{
"status": "Failed",
"error": {
"code": "ValidationFailed",
"message": "Spark pool request validation failed.",
"details": [
{
"code": "SparkComputePropertiesNotAllowed",
"message": "Synapse Spark pool must be created before libraries are installed."
}
]
}
}

Task 2 - Step 13

This step states: "As an experiment, scale the number of requested RU/s for your Cosmos DB collection down to 750. After doing so, you should see increasingly slower transfer rates to Cosmos DB due to throttling. You will also see the pending queue growing at a higher rate. The reason for this is because when the number of writes (remember, writes use 5 RU/s vs. just 1 RU/s for reads) exceeds the allotted amount of RU/s, Cosmos DB sends a 429 response with a retry_after header value to tell the consumer that it is resource-constrained. The SDK automatically handles this by waiting for the specified amount of time, then retrying. After you are done experimenting, set the RU/s back to 15,000."

I like this exercise and this paragraph. However, writes do not always use 5 RU/s and reads do not always use 1 RU/s; that is the typical write/read cost for a 1 KB document, and larger documents will consume more RUs (see the sketch below).
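A minimal sketch (hypothetical names; azure-cosmos Python SDK) of how the actual charge can be inspected per request, which would let the step show that larger documents cost more than the nominal figures:

# The service returns the RU charge for every request in the x-ms-request-charge header.
container.create_item(body=transaction_doc)
charge = container.client_connection.last_response_headers["x-ms-request-charge"]
print(f"This write consumed {charge} RUs")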

import azureml fails

Working through this lab, I got to the Exercise 3 notebook 1-Prepare-Scoring-Web-Service and ran into an error in the "Deploy Model" step. The code is as follows:
import azureml
from azureml.core import Workspace
from azureml.core.model import Model

I receive the following error.

ImportError: No module named 'azureml'

ImportError Traceback (most recent call last)
in ()
----> 1 import azureml
2 from azureml.core import Workspace
3 from azureml.core.model import Model

ImportError: No module named 'azureml'

My cluster configuration is as follows:
5.2 (includes Apache Spark 2.4.0, Scala 2.11)
Python 3
spark.databricks.delta.preview.enabled true

I also tried running on ADB 5.1 (includes Apache Spark 2.4.0, Scala 2.11), Python 3, with spark.databricks.delta.preview.enabled true, as specified in the setup module, but I got the same results.

Can you advise as to the appropriate cluster configuration or location of the azureml module? I tried installing the azureml library from CRAN, but it fails with the following error:
java.lang.RuntimeException: Installation failed with message:

Error installing R package: Could not install package with error: 1: package ‘azureml’ is not available (for R version 3.4.4)

I got this error with both of the above cluster configurations.

Any help would be appreciated.
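For what it's worth, azureml is a Python (PyPI) package rather than a CRAN package, so attaching azureml-sdk to the cluster as a PyPI library (not an R library) and re-running the import is the likely path. A minimal verification sketch, assuming the library has been attached:

# After attaching the azureml-sdk PyPI library to the cluster, confirm it resolves.
import azureml.core
print(azureml.core.VERSION)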

Exercise 2, Task 1: App registration options were updated in the Azure Portal

In Exercise 2, Task 1, there is a step to create a "+New application registration" under App registrations, but due to updates in the Azure Portal, we have to perform those steps under App registrations (Legacy); only there do we have the right options.

Can you please check and fix this ASAP? We have a workshop scheduled next week. Please fix this in the lab guide.

Thanks,
Abhishek

Lab Feedback

Pre-lab

Task 12: Provision Event Hubs, Step 16: "Copy the Connection string-primary key value. Save this value for the Sender policy in Notepad or similar for later." Please add a screenshot here.

Issue in Exercise 1, Task 1

  1. In Exercise 1, Task 1, Step 18, after adding code to the TODO tasks as per the instructions, I get an error while debugging the solution and am not able to view the PaymentGenerator console because of it.

Please find the attached screenshot below for reference.

As per the error message, I navigated to the path C:\CosmosMCW\Hands-on lab\lab-files\TransactionGenerator\bin\Debug\net5.0\TransactionGenerator.exe and checked for the TransactionGenerator.exe file in the extracted starter files, but couldn't find it.



Feedback

Current wording: In Azure Cosmos DB, provisioned throughput is represented as request units/second (RUs). RUs measure the cost of both read and write operations against your Cosmos DB container. Because Cosmos DB is designed with transparent horizontal scaling (e.g., scale out) and multi-master replication, you can very quickly and easily increase or decrease the number of RUs to handle thousands to hundreds of millions of requests per second around the globe with a single API call.
When you set a number of RUs for a container, Cosmos DB ensures that those RUs are available in all regions associated with your Cosmos DB account. When you scale out the number of regions by adding a new one, Cosmos will automatically provision the same quantity of RUs in the newly added region. You cannot selectively assign different RUs to a specific region. These RUs are provisioned for a container (or database) for all associated regions.

Please mention that the service allows customers to increment/decrement throughput in small increments of 1,000 RU/s at the database level and in even smaller increments of 100 RU/s at the container level.

The deployment of the model to ACI fails with the following error

Message: Service deployment polling reached non-successful terminal state, current service state: Transitioning:

Error:
{
"statusCode": 400,
"code": "EnvironmentBuildFailed",
"message": "Failed Building the Environment."
}
InnerException None
ErrorResponse
{
"error": {

Secure String in Program.cs Line 433

I had to add a method to convert the string to a SecureString before passing it to the new DocumentClient constructor. You may want to incorporate this fix.

I also needed to add the following to get the connectionPolicy to work:
private static ConnectionPolicy connectionPolicy;

using System.Security;

// ...

using (_cosmosDbClient = new DocumentClient(new Uri(arguments.CosmosDbEndpointUrl),
    ConvertToSecureString(arguments.CosmosDbAuthorizationKey), connectionPolicy))
{
    // ...
}

// Converts the plain-text authorization key into the SecureString expected by DocumentClient.
private static SecureString ConvertToSecureString(string password)
{
    if (password == null)
        throw new ArgumentNullException("password");

    var securePassword = new SecureString();

    foreach (char c in password)
        securePassword.AppendChar(c);

    securePassword.MakeReadOnly();
    return securePassword;
}

Issue in Exercise 3, Task 2

In Exercise 3, Task 2, Step 8, while installing libraries using !pip install, the pyLDAvis library installation fails with multiple errors, as shown below.


Below is the complete error message:

ERROR: pyldavis 3.3.1 requires sklearn, which is not installed.
ERROR: responsibleai 0.15.0 has requirement interpret-community>=0.22.0, but you'll have interpret-community 0.21.0 which is incompatible.
ERROR: raiwidgets 0.15.0 has requirement ipython==7.16.1, but you'll have ipython 7.16.2 which is incompatible.
ERROR: raiwidgets 0.15.0 has requirement jinja2==2.11.3, but you'll have jinja2 2.11.2 which is incompatible.
ERROR: pyldavis 3.3.1 has requirement numpy>=1.20.0, but you'll have numpy 1.18.5 which is incompatible.
ERROR: pyldavis 3.3.1 has requirement pandas>=1.2.0, but you'll have pandas 0.25.3 which is incompatible.
ERROR: pycaret 2.3.5 has requirement lightgbm>=2.3.1, but you'll have lightgbm 2.3.0 which is incompatible.
ERROR: pycaret 2.3.5 has requirement numpy==1.19.5, but you'll have numpy 1.18.5 which is incompatible.
ERROR: pycaret 2.3.5 has requirement scikit-learn==0.23.2, but you'll have scikit-learn 0.24.2 which is incompatible.
ERROR: azureml-train-automl-runtime 1.36.0 has requirement scikit-learn<0.23.0,>=0.19.0, but you'll have scikit-learn 0.24.2 which is incompatible.
ERROR: azureml-responsibleai 1.37.0 has requirement azureml-core~=1.37.0, but you'll have azureml-core 1.36.0.post2 which is incompatible.
ERROR: azureml-responsibleai 1.37.0 has requirement azureml-dataset-runtime~=1.37.0, but you'll have azureml-dataset-runtime 1.36.0 which is incompatible.
ERROR: azureml-responsibleai 1.37.0 has requirement azureml-interpret~=1.37.0, but you'll have azureml-interpret 1.36.0 which is incompatible.
ERROR: azureml-responsibleai 1.37.0 has requirement azureml-telemetry~=1.37.0, but you'll have azureml-telemetry 1.36.0 which is incompatible.
ERROR: azureml-explain-model 1.37.0 has requirement azureml-interpret~=1.37.0, but you'll have azureml-interpret 1.36.0 which is incompatible.
ERROR: azureml-datadrift 1.37.0 has requirement azureml-core~=1.37.0, but you'll have azureml-core 1.36.0.post2 which is incompatible.
ERROR: azureml-datadrift 1.37.0 has requirement azureml-dataset-runtime[fuse,pandas]~=1.37.0, but you'll have azureml-dataset-runtime 1.36.0 which is incompatible.
ERROR: azureml-datadrift 1.37.0 has requirement azureml-telemetry~=1.37.0, but you'll have azureml-telemetry 1.36.0 which is incompatible.
ERROR: azureml-automl-runtime 1.36.0 has requirement scikit-learn<0.23.0,>=0.19.0, but you'll have scikit-learn 0.24.2 which is incompatible.
ERROR: azureml-automl-dnn-nlp 1.37.0 has requirement azureml-automl-core~=1.37.0, but you'll have azureml-automl-core 1.36.1 which is incompatible.
ERROR: azureml-automl-dnn-nlp 1.37.0 has requirement azureml-automl-runtime~=1.37.0, but you'll have azureml-automl-runtime 1.36.0 which is incompatible.
ERROR: azureml-automl-dnn-nlp 1.37.0 has requirement azureml-core~=1.37.0, but you'll have azureml-core 1.36.0.post2 which is incompatible.
ERROR: azureml-automl-dnn-nlp 1.37.0 has requirement azureml-telemetry~=1.37.0, but you'll have azureml-telemetry 1.36.0 which is incompatible.
ERROR: autokeras 1.0.16.post1 has requirement tensorflow<2.6,>=2.3.0, but you'll have tensorflow 2.1.0 which is incompatible.

January 2020 - content update

This workshop is scheduled for a January 2020 content update. Please review the workshop and provide recommended changes for SME review.

Feedback on Readme

Under the "Outline: Key Concerns for Customer situation" section:

  1. "Minimizing access latency to globally distributed .data."
  • Typo: remove the stray period before "data".
  2. "Providing a unified platform that can support their near term data pipeline needs and provides a long term to standard for their data science, data engineering and development needs."
  • "and provides a long term to standard" -> "and provides a long-term standard platform for..."
  3. "Leveraging Cosmos DB change feed with Event Hubs"
  • Using the change feed with Event Hubs is not part of the solution. The guidance is to choose Event Hubs or Cosmos DB, so this line should be removed.

Clarify current customer situation and how new work fits in

There was some confusion during the WDS this week as to where the POC they are designing fits into the customer's current environment. Update to clarify the following, and add a simple diagram:

  1. The customer is already successfully processing online payments on behalf of their merchants' customers through RESTful APIs they (Woodgrove) provide. The POC should not interrupt this process in any way.

  2. The customer is asking for 2 additions to their current process:

    • A RESTful API that can be called for immediate scoring of a transaction, to see whether it should be blocked due to a reasonably high level of confidence that it is fraudulent.
    • A real-time data ingestion pipeline they can pass data to at the time they save the payment transaction data from within their API. This should sit side-by-side with their current process, not change it.
  3. Perhaps indicate which database Woodgrove is using to store transaction data at this point. This could be a talking point for eventually replacing it with Cosmos DB down the road, since they are interested in proving out the ability to ingest and serve data at a global scale.

  4. Clarify that requirement for real-time scoring of the payment transaction as fraudulent is not the same as the real-time ingest of all payment transaction data.

  5. Clarify that the deeper analysis that is performed in batch may use a slightly different ML model, but the point is that since it is a more intensive run, it needs to be run in batch and score transactions a little less leniently than the real-time scoring model. Remember, we want to minimize false positives during real-time scoring, but do a deeper analysis of the transactions later on and possibly tag those that were not blocked as suspicious.

Failing on spark.sql("SELECT * FROM transactions")

I will be the first to admit I probably missed a step, but I went through the notebooks twice.

Exercise 4 - Task 3: Distributing batch scored data globally using Cosmos DB

Where is the Delta table created?

# Load transactions from the Delta table into a Spark DataFrame
transactions = spark.sql("SELECT * FROM transactions")

Fails with

AnalysisException: 'Table or view not found: transactions; line 1 pos 14'

AnalysisException Traceback (most recent call last)
in ()
1 # Load transactions from the Delta table into a Spark DataFrame
----> 2 transactions = spark.sql("SELECT * FROM transactions")
3
4 # Get a Pandas DataFrame from the Spark DataFrame
5 pandas_df = transactions.toPandas()
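A minimal sketch (assuming a Spark DataFrame of ingested transactions named transactions_df exists from an earlier notebook) of the step that has to run first so the table is registered and spark.sql can find it:

# Persist the DataFrame as a Delta table registered in the metastore under the name "transactions".
(transactions_df.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("transactions"))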

Ex2, Task5, Step2 throws NameError: name 'data_path' is not defined

Exercise 2, Task 5, Step 2

This line of code

transactions = data_path

Throws "NameError: name 'data_path' is not defined"

Based on the screenshots it looks like this is a change in the auto-generated code. It now uses "df" instead of "data_path".

So technically the screenshots are out of date, and the steps to generate the notebook are a little different as well (this can probably wait for the next test/fix).

I am testing this on an AERS subscription so might be different for normal subs.

Pre-lab feedback

Can we have an ARM script to deploy these resources, instead of having to create them manually in the portal?
