azure / azuredatabricksbestpractices

Version 1 of the Technical Best Practices for Azure Databricks, based on real-world customer and technical SME inputs

License: Creative Commons Attribution 4.0 International

Topics: azure, azuredatabricks, scalability, performance, performance-monitoring, security, deployment, spark, python, grafana

azuredatabricksbestpractices's Introduction

Azure Databricks Best Practices

Azure Databricks

Authors:

  • Dhruv Kumar, Senior Solutions Architect, Databricks
  • Premal Shah, Azure Databricks PM, Microsoft
  • Bhanu Prakash, Azure Databricks PM, Microsoft

Written by: Priya Aswani, WW Data Engineering & AI Technical Lead

Published: June 22, 2019

Version: 1.0

Click here for the Best Practices

Disclaimers:

This document is provided “as-is”. Information and views expressed in this document, including URL and other Internet Web site references, may change without notice. You bear the risk of using it.

Some examples depicted herein are provided for illustration only and are fictitious. No real association or connection is intended or should be inferred.

This document does not provide you with any legal rights to any intellectual property in any Microsoft product. You may copy and use this document for your internal, reference purposes.

© 2019 Microsoft. All rights reserved.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Legal Notices

Microsoft and any contributors grant you a license to the Microsoft documentation and other content in this repository under the Creative Commons Attribution 4.0 International Public License, see the LICENSE file, and grant you a license to any code in the repository under the MIT License, see the LICENSE-CODE file.

Microsoft, Windows, Microsoft Azure and/or other Microsoft products and services referenced in the documentation may be either trademarks or registered trademarks of Microsoft in the United States and/or other countries. The licenses for this project do not grant you rights to use any Microsoft names, logos, or trademarks. Microsoft's general trademark guidelines can be found at http://go.microsoft.com/fwlink/?LinkID=254653.

Privacy information can be found at https://privacy.microsoft.com/en-us/

Microsoft and any contributors reserve all other rights, whether under their respective copyrights, patents, or trademarks, whether by implication, estoppel or otherwise.

azuredatabricksbestpractices's People

Contributors

bensadeghi, bhpraka, damutch, dhruvkumar, furmangg, gstaubli, hurtn, kthejoker, microsoft-github-policy-service[bot], microsoftopensource, msftgits, mspreshah, paswani, swan-am-i, teresafds, xigyenge


azuredatabricksbestpractices's Issues

Guidance for scheduled jobs needs updating

What has been written to date is mostly still correct, but the guidance could be expanded to include log and metric isolation as a key reason to favour jobs compute clusters over all-purpose compute clusters for scheduled jobs. Additionally, the instance pools feature is referenced as something that 'will' reduce startup time for short, frequent jobs; instance pools have since been released into GA and the guidance should be updated to reflect that 🙂
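
As a hedged illustration of the suggested direction (the job name, notebook path, pool ID, and runtime version below are hypothetical placeholders, not values taken from the repo or this issue), a scheduled job defined against a fresh jobs compute cluster drawn from a GA instance pool might look roughly like this:

# Sketch of a job definition payload: the job runs on a dedicated jobs cluster
# created from an instance pool, so its logs and metrics stay isolated from
# interactive (all-purpose) workloads. All IDs and versions are placeholders.
job_spec = {
    "name": "nightly-etl",                      # hypothetical job name
    "new_cluster": {                            # jobs compute, not an existing all-purpose cluster
        "spark_version": "6.4.x-scala2.11",     # placeholder runtime version
        "num_workers": 4,
        "instance_pool_id": "<YOUR-POOL-ID>",   # GA instance pools reduce cold-start time
    },
    "notebook_task": {"notebook_path": "/Jobs/nightly_etl"},
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}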

IP address requirements clarification

The IP address requirements section seems confusing:

Each cluster node requires

  • 1 Public IP and 2 Private IPs
  • For a desired cluster size of X: number of Public IPs = X, number of Private IPs = 4X. How is this 4X? Per the bullet above, shouldn't it be 2X?

Also, when I provision a cluster with 2 nodes, it appears to use only 4 IP addresses: 2 public and 2 private. Per the above, it should have used 2 public IPs and either 4 or 8 private IPs, which doesn't seem to be the case.
Please clarify. Thanks.
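
To make the numbers in question concrete, here is a small sketch that only restates the figures quoted in this issue; it is not an authoritative statement of the actual requirement:

# Compare the two readings of the guidance (2 private IPs per node vs. the 4X
# formula) against what was observed for a 2-node cluster. This is arithmetic
# on the figures quoted above, not a confirmed Azure Databricks requirement.
def ip_counts(cluster_size, private_ips_per_node):
    return {"public": cluster_size, "private": cluster_size * private_ips_per_node}

print(ip_counts(2, 2))  # per-node bullet -> {'public': 2, 'private': 4}
print(ip_counts(2, 4))  # 4X formula      -> {'public': 2, 'private': 8}
# Observed when provisioning 2 nodes: 2 public + 2 private, matching neither reading.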

Vnet IP Configuration - is that right?

The VNet recommendation doesn't quite make sense to me.
It seems to be telling me that, to get the maximum number of hosts, I should use a /16 VNet address range. So for example:

10.0.0.0/16 - this gives 2^16 - 2 = 65,534 assignable host IPs, i.e. approximately 16,000 cluster hosts at 4 IPs per host.
However, I don't understand the table. It then seems to say that I should split the /16 into a /18 public subnet and a /18 private subnet. If I do that, the table doesn't add up.

The number of public hosts will be slashed - so how can I achieve 16,000 cluster hosts (roughly 64,000 host IPs) on a /18 subnet?
The document then indicates that I need to create 2 subnets in the VNet range with 2 borrowed subnet bits, and that the private and the public subnets have to borrow the same number of bits:

private: 10.0.0.0/18 -> that's not 64,000 host IPs (a /18 only has 16,384 addresses). This doesn't make sense to me.

Can you give an example of public and private subnet values that achieves the maximum number of cluster hosts?

The best you can achieve is approximately 4,000 cluster hosts (16,384 / 4), but that wastes an awful lot of IPs in the private range.
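
As a sanity check on the subnet arithmetic above, here is a short sketch using Python's ipaddress module. The 4-IPs-per-host divisor is the assumption used in this issue, not a figure confirmed here; the 5 reserved addresses per subnet is standard Azure behaviour:

import ipaddress

# Count usable addresses in the VNet and in the proposed /18 subnets.
# Azure reserves 5 addresses per subnet, hence the "- 5" below.
for cidr in ["10.0.0.0/16", "10.0.0.0/18", "10.0.64.0/18"]:
    net = ipaddress.ip_network(cidr)
    usable = net.num_addresses - 5
    print(f"{cidr}: {usable} usable addresses, ~{usable // 4} hosts at 4 IPs/host")

# 10.0.0.0/16:  65531 usable addresses, ~16382 hosts at 4 IPs/host
# 10.0.0.0/18:  16379 usable addresses, ~4094 hosts at 4 IPs/host
# 10.0.64.0/18: 16379 usable addresses, ~4094 hosts at 4 IPs/host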

Change init script to text

I revisited this issue a few times as I went through many steps to try to get Log Analytics working.

To cut a long story short, you can find the correct init script by going to Log Analytics -> Advanced settings -> Connected Sources -> Linux Servers; at the bottom there is a link to DOWNLOAD AND ONBOARD AGENT FOR LINUX. Use that link to create an init script that looks like this:

script = """
sed -i "s/^exit 101$/exit 0/" /usr/sbin/policy-rc.d 
wget https://raw.githubusercontent.com/Microsoft/OMS-Agent-for-Linux/master/installer/scripts/onboard_agent.sh && sh onboard_agent.sh -w <YOUR-ID> -s <YOUR-KEY> -d opinsights.azure.com
"""

#save script to databricks file system so it can be loaded by VMs
dbutils.fs.put("/databricks/log_init_scripts/configure-omsagent.sh", script, True)

The other parts of the init script shown in the document are not required, i.e. you do not need to do the restart shown in:

sudo su omsagent -c 'python /opt/microsoft/omsconfig/Scripts/PerformRequiredConfigurationChecks.py' 
/opt/microsoft/omsagent/bin/service_control restart <YOUR-ID>
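
For completeness, writing the script to DBFS is only half the job; it still has to be configured as a cluster-scoped init script. A minimal sketch of the relevant fragment of a cluster spec is below (the same thing can be done in the cluster UI under Advanced Options > Init Scripts; treat the exact field names as an assumption rather than something stated in this issue):

# Sketch: fragment of a cluster spec that attaches the script written above
# as a cluster-scoped init script. Other cluster fields (node type, Spark
# version, autoscaling, ...) are omitted for brevity.
cluster_spec_fragment = {
    "init_scripts": [
        {"dbfs": {"destination": "dbfs:/databricks/log_init_scripts/configure-omsagent.sh"}}
    ]
}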

Method public org.apache.spark.sql.streaming.DataStreamReader org.apache.spark.sql.SQLContext.readStream() is not whitelisted on class class org.apache.spark.sql.SQLContext

Hello, I'm having problems reading streams from Event Hubs inside Azure Databricks. The read works on an interactive cluster, but when I try to use the same code on a high concurrency cluster with access control enabled (it only accepts SQL and Python code), I get the following error:

py4j.security.Py4JSecurityException: Method public org.apache.spark.sql.streaming.DataStreamReader org.apache.spark.sql.SQLContext.readStream() is not whitelisted on class class org.apache.spark.sql.SQLContext

Features used:
Runtime 6.4 (Scala 2.11, Spark 2.4.5)

Spark configuration:
spark.databricks.cluster.profile serverless
spark.databricks.io.cache.maxMetaDataCache 1g
spark.databricks.io.cache.maxDiskUsage 80g
spark.databricks.acl.dfAclsEnabled true
spark.databricks.delta.preview.enabled true
spark.databricks.io.cache.compression.enabled false
spark.databricks.repl.allowedLanguages python,sql

Library installed:
com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.6
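
For reference, a minimal sketch of the kind of read that triggers this error is shown below. The connection string is a placeholder, sc and spark are the notebook's built-in SparkContext and SparkSession, and the option name and encryption helper follow the azure-eventhubs-spark connector's usual pattern, so treat the details as assumptions rather than an exact reproduction of the reporter's code:

# Sketch of the failing pattern: a Structured Streaming read from Event Hubs.
# On an interactive cluster this works; on a high concurrency cluster with
# access control enabled, py4j whitelisting rejects readStream on SQLContext.
conn_str = "<YOUR-EVENTHUBS-CONNECTION-STRING>"  # placeholder
ehConf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn_str)
}

df = (spark.readStream
        .format("eventhubs")
        .options(**ehConf)
        .load())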