azure / azuredatabricksbestpractices

Version 1 of the Technical Best Practices for Azure Databricks, based on real-world customer and technical SME inputs

License: Creative Commons Attribution 4.0 International

Topics: azure, azuredatabricks, scalability, performance, performance-monitoring, security, deployment, spark, python, grafana

azuredatabricksbestpractices's Introduction

Azure Databricks Best Practices

Azure Databricks

Authors:

  • Dhruv Kumar, Senior Solutions Architect, Databricks
  • Premal Shah, Azure Databricks PM, Microsoft
  • Bhanu Prakash, Azure Databricks PM, Microsoft

Written by: Priya Aswani, WW Data Engineering & AI Technical Lead

Published: June 22, 2019

Version: 1.0

Click here for the Best Practices

Disclaimers:

This document is provided “as-is”. Information and views expressed in this document, including URL and other Internet Web site references, may change without notice. You bear the risk of using it.

Some examples depicted herein are provided for illustration only and are fictitious. No real association or connection is intended or should be inferred.

This document does not provide you with any legal rights to any intellectual property in any Microsoft product. You may copy and use this document for your internal, reference purposes.

© 2019 Microsoft. All rights reserved.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Legal Notices

Microsoft and any contributors grant you a license to the Microsoft documentation and other content in this repository under the Creative Commons Attribution 4.0 International Public License, see the LICENSE file, and grant you a license to any code in the repository under the MIT License, see the LICENSE-CODE file.

Microsoft, Windows, Microsoft Azure and/or other Microsoft products and services referenced in the documentation may be either trademarks or registered trademarks of Microsoft in the United States and/or other countries. The licenses for this project do not grant you rights to use any Microsoft names, logos, or trademarks. Microsoft's general trademark guidelines can be found at http://go.microsoft.com/fwlink/?LinkID=254653.

Privacy information can be found at https://privacy.microsoft.com/en-us/

Microsoft and any contributors reserve all other rights, whether under their respective copyrights, patents, or trademarks, whether by implication, estoppel or otherwise.

azuredatabricksbestpractices's People

Contributors

bensadeghi, bhpraka, damutch, dhruvkumar, furmangg, gstaubli, hurtn, kthejoker, microsoft-github-policy-service[bot], microsoftopensource, msftgits, mspreshah, paswani, swan-am-i, teresafds, xigyenge


azuredatabricksbestpractices's Issues

Guidance for scheduled jobs needs updating

What has been written to date is mostly still correct, but the guidance could be expanded to include log and metric isolation as a key reason to favour jobs compute clusters over all-purpose compute clusters for scheduled jobs. Additionally, the instance pools feature is referenced as something that 'will' reduce startup time for short, frequent jobs; instance pools have since been released into GA and the guidance should be updated to reflect that 🙂
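
As a hedged illustration of the suggested direction (the job name, notebook path, pool ID, and runtime version below are hypothetical placeholders, not values taken from the repo or this issue), a scheduled job defined against a fresh jobs compute cluster drawn from a GA instance pool might look roughly like this:

# Sketch of a job definition payload: the job runs on a dedicated jobs cluster
# created from an instance pool, so its logs and metrics stay isolated from
# interactive (all-purpose) workloads. All IDs and versions are placeholders.
job_spec = {
    "name": "nightly-etl",                      # hypothetical job name
    "new_cluster": {                            # jobs compute, not an existing all-purpose cluster
        "spark_version": "6.4.x-scala2.11",     # placeholder runtime version
        "num_workers": 4,
        "instance_pool_id": "<YOUR-POOL-ID>",   # GA instance pools reduce cold-start time
    },
    "notebook_task": {"notebook_path": "/Jobs/nightly_etl"},
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}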

IP address requirements clarification

The IP address requirements section seems confusing:

Each cluster node requires

  • 1 Public IP and 2 Private IPs
  • For a desired cluster size of X: number of Public IPs = X, number of Private IPs = 4X. How is this 4X? Per the bullet above, shouldn't it be 2X?

Also, when I provision a cluster with 2 nodes, it appears to use only 4 IP addresses: 2 public and 2 private. Per the above, it should have used 2 public IPs and either 4 or 8 private IPs, which doesn't seem to be the case.
Please clarify. Thanks.
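
To make the numbers in question concrete, here is a small sketch that only restates the figures quoted in this issue; it is not an authoritative statement of the actual requirement:

# Compare the two readings of the guidance (2 private IPs per node vs. the 4X
# formula) against what was observed for a 2-node cluster. This is arithmetic
# on the figures quoted above, not a confirmed Azure Databricks requirement.
def ip_counts(cluster_size, private_ips_per_node):
    return {"public": cluster_size, "private": cluster_size * private_ips_per_node}

print(ip_counts(2, 2))  # per-node bullet -> {'public': 2, 'private': 4}
print(ip_counts(2, 4))  # 4X formula      -> {'public': 2, 'private': 8}
# Observed when provisioning 2 nodes: 2 public + 2 private, matching neither reading.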

Vnet IP Configuration - is that right?

The VNet recommendation doesn't quite make sense to me.
It seems to be telling me that, to get the maximum number of hosts, I should use a /16 VNet address range. So for example:

10.0.0.0/16 - this gives 2^16 - 2 = 65,534 assignable host IPs, i.e. approximately 16,000 cluster hosts at 4 IPs per host.
However, I don't understand the table. It then seems to say that I should split the /16 into a /18 public subnet and a /18 private subnet. If I do that, the table doesn't add up.

The number of public hosts will be slashed - so how can I achieve 16,000 cluster hosts (roughly 64,000 host IPs) on a /18 subnet?
The document then indicates that I need to create 2 subnets in the VNet range with 2 borrowed subnet bits, and that the private and the public subnets have to borrow the same number of bits:

private: 10.0.0.0/18 -> that's not 64,000 host IPs (a /18 only has 16,384 addresses). This doesn't make sense to me.

Can you give an example of public and private subnet values that achieves the maximum number of cluster hosts?

The best you can achieve is approximately 4,000 cluster hosts (16,384 / 4), but that wastes an awful lot of IPs in the private range.
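
As a sanity check on the subnet arithmetic above, here is a short sketch using Python's ipaddress module. The 4-IPs-per-host divisor is the assumption used in this issue, not a figure confirmed here; the 5 reserved addresses per subnet is standard Azure behaviour:

import ipaddress

# Count usable addresses in the VNet and in the proposed /18 subnets.
# Azure reserves 5 addresses per subnet, hence the "- 5" below.
for cidr in ["10.0.0.0/16", "10.0.0.0/18", "10.0.64.0/18"]:
    net = ipaddress.ip_network(cidr)
    usable = net.num_addresses - 5
    print(f"{cidr}: {usable} usable addresses, ~{usable // 4} hosts at 4 IPs/host")

# 10.0.0.0/16:  65531 usable addresses, ~16382 hosts at 4 IPs/host
# 10.0.0.0/18:  16379 usable addresses, ~4094 hosts at 4 IPs/host
# 10.0.64.0/18: 16379 usable addresses, ~4094 hosts at 4 IPs/host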

Change init script to text

I revisited this issue a few times as I went through many steps to try to get Log Analytics working.

To cut a long story short, you can find the correct init script by going to Log Analytics -> Advanced settings -> Connected Sources -> Linux Servers; at the bottom there is a link to DOWNLOAD AND ONBOARD AGENT FOR LINUX. Use that link to create an init script that looks like this:

script = """
sed -i "s/^exit 101$/exit 0/" /usr/sbin/policy-rc.d 
wget https://raw.githubusercontent.com/Microsoft/OMS-Agent-for-Linux/master/installer/scripts/onboard_agent.sh && sh onboard_agent.sh -w <YOUR-ID> -s <YOUR-KEY> -d opinsights.azure.com
"""

#save script to databricks file system so it can be loaded by VMs
dbutils.fs.put("/databricks/log_init_scripts/configure-omsagent.sh", script, True)

The other parts of the init script shown in the document are not required, i.e. you do not need to do the restart shown in:

sudo su omsagent -c 'python /opt/microsoft/omsconfig/Scripts/PerformRequiredConfigurationChecks.py' 
/opt/microsoft/omsagent/bin/service_control restart <YOUR-ID>
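
For completeness, writing the script to DBFS is only half the job; it still has to be configured as a cluster-scoped init script. A minimal sketch of the relevant fragment of a cluster spec is below (the same thing can be done in the cluster UI under Advanced Options > Init Scripts; treat the exact field names as an assumption rather than something stated in this issue):

# Sketch: fragment of a cluster spec that attaches the script written above
# as a cluster-scoped init script. Other cluster fields (node type, Spark
# version, autoscaling, ...) are omitted for brevity.
cluster_spec_fragment = {
    "init_scripts": [
        {"dbfs": {"destination": "dbfs:/databricks/log_init_scripts/configure-omsagent.sh"}}
    ]
}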

Method public org.apache.spark.sql.streaming.DataStreamReader org.apache.spark.sql.SQLContext.readStream() is not whitelisted on class class org.apache.spark.sql.SQLContext

Hello, I'm having problems reading streams from Event Hubs inside Azure Databricks. The read works on an interactive cluster, but when I try to use the same code on a high concurrency cluster with access control enabled (it only accepts SQL and Python code), I get the following error:

py4j.security.Py4JSecurityException: Method public org.apache.spark.sql.streaming.DataStreamReader org.apache.spark.sql.SQLContext.readStream() is not whitelisted on class class org.apache.spark.sql.SQLContext

Features used:
Runtime 6.4 (Scala 2.11, Spark 2.4.5)

Spark configuration:
spark.databricks.cluster.profile serverless
spark.databricks.io.cache.maxMetaDataCache 1g
spark.databricks.io.cache.maxDiskUsage 80g
spark.databricks.acl.dfAclsEnabled true
spark.databricks.delta.preview.enabled true
spark.databricks.io.cache.compression.enabled false
spark.databricks.repl.allowedLanguages python,sql

Library installed:
com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.6
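
For reference, a minimal sketch of the kind of read that triggers this error is shown below. The connection string is a placeholder, sc and spark are the notebook's built-in SparkContext and SparkSession, and the option name and encryption helper follow the azure-eventhubs-spark connector's usual pattern, so treat the details as assumptions rather than an exact reproduction of the reporter's code:

# Sketch of the failing pattern: a Structured Streaming read from Event Hubs.
# On an interactive cluster this works; on a high concurrency cluster with
# access control enabled, py4j whitelisting rejects readStream on SQLContext.
conn_str = "<YOUR-EVENTHUBS-CONNECTION-STRING>"  # placeholder
ehConf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn_str)
}

df = (spark.readStream
        .format("eventhubs")
        .options(**ehConf)
        .load())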