azure_pig

Environment Setup

  1. Create a resource group called pigtest

  2. Create a trial Blob storage account

    1. Go to https://azure.microsoft.com/en-us/offers/ms-azr-0044p/ to register for a free trial account

    2. After registration, click Storage accounts to create a new storage account. Give it a name, e.g. pigteststorage. Choose locally-redundant storage (LRS) as the Type for simplicity and keep the defaults for the other settings

    3. Choose Use existing in the Resource group field, then select pigtest from the dropdown list

  3. Upload the scripts and dataset to the (Blob) storage

    1. Go to the created storage account and create a container: click Blobs under Services, then click the + icon to create a container, e.g. named sources, with Private (default) as the Access type

    2. Upload the script and the CSV dataset to the created container

  4. Create the HDInsight cluster

    1. In the Azure portal, search for the keyword "HDInsight"; HDInsight Cluster will be listed. Select it

    2. In the New HDInsight Cluster tab, fill in each item:

      1. In the Cluster configuration tab, choose Spark as the Cluster Type and Linux as the Operating System, then click Select. (NB! The HBase cluster type does not include Anaconda, ref: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-overview)

      2. In the Credentials configuration tab, configure the required user name and password

      3. In the Data Source configuration tab, configure the data source. NB! Choose From all subscriptions (default) in the Selection Method dropdown, then choose the storage account created in step 2 and enter the name of the container created in step 3 (sources) in the Choose Default Container field

      4. In the Pricing configuration tab, choose the number of nodes you intend to use

      5. In the Resource Group tab, choose Use existing and select pigtest as the resource group

  5. Now that the HDInsight cluster is created, configure Anaconda so that Python can access it

    1. Run config_anaconda.sh through Script Actions
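
The container creation and upload steps above can also be scripted instead of done in the portal. The following is a minimal sketch, assuming the classic `azure` CLI is installed and logged in; `ACCOUNT`, `KEY` and the `run` and `upload_sources` helpers are illustrative names, not part of this repository. By default it only prints the commands (`DRY_RUN=1`).

```shell
#!/bin/sh
# Sketch: upload demo files to the Blob container with the classic `azure` CLI.
# ACCOUNT, KEY and CONTAINER are placeholders; override them via the environment.
ACCOUNT="${ACCOUNT:-pigteststorage}"
KEY="${KEY:-REPLACE_WITH_STORAGE_KEY}"
CONTAINER="${CONTAINER:-sources}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

upload_sources() {
  # Create the container (this fails if it already exists), then upload
  # each file under its base name.
  run azure storage container create -a "$ACCOUNT" -k "$KEY" "$CONTAINER" || return 1
  for f in "$@"; do
    run azure storage blob upload -a "$ACCOUNT" -k "$KEY" "$f" "$CONTAINER" "$(basename "$f")" || return 1
  done
}

# The demo scripts from this repository; add the path of your CSV dataset as well.
upload_sources src/aggregate_by_single_grouped.pig src/aggregate_by_single_grouped.py
```

Run it with `DRY_RUN=0` and a real storage key once the printed commands look right.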

DEMO Introduction

  • Script Action: config_anaconda.sh
  • Scripts to run in the demo:
    • src/aggregate_by_single_grouped.pig
    • src/aggregate_by_single_grouped.py

Automation Solution

=> The automated process should cover the whole cluster lifecycle, from cluster creation to cluster deletion

Azure CLI commands to create an HDInsight cluster and run a script action:

  1. Log in to the Azure CLI (ref: https://docs.microsoft.com/en-us/azure/xplat-cli-connect)
  2. azure config mode arm
  3. azure group create azureclitest NorthEurope
  4. Read the guide on creating HDInsight clusters with the Azure CLI: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-create-linux-clusters-azure-cli
  5. Create a storage account: azure storage account create -g azureclitest --sku-name RAGRS -l NorthEurope --kind Storage azurecliteststorage
  6. Retrieve the key used to access the storage account: key1=$(azure storage account keys list -g azureclitest azurecliteststorage | grep key1 | awk '{print $3}')
  7. Create HDInsight cluster: azure hdinsight cluster create -g azureclitest -l NorthEurope -y Linux --clusterType Spark --defaultStorageAccountName azurecliteststorage.blob.core.windows.net --defaultStorageAccountKey ${key1} --defaultStorageContainer azureclitestcluster --workerNodeCount 2 --userName admin --password 1%XabcdeX%1 --sshUserName rxue --sshPassword 1%XabcdeX%1 azureclitestcluster
  8. Apply a script action to the running cluster (https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-customize-cluster-linux): azure hdinsight script-action create azureclitestcluster -g azureclitest -n config_anaconda -u <scriptURI> -t "headnode;workernode" --persistOnSuccess
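
The list above covers creation only, while the goal stated for the automation also includes deletion. One way to tie the two together is sketched below, under the assumption that everything lives in the `azureclitest` resource group, so that deleting the group removes the cluster and the storage account with it. The `run`, `teardown` and `with_cluster` helpers are hypothetical names, and the script only prints commands unless `DRY_RUN=0`.

```shell
#!/bin/sh
# Sketch: wrap a job in resource-group creation and deletion.
RG="${RG:-azureclitest}"
LOCATION="${LOCATION:-NorthEurope}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

teardown() {
  # Deleting the resource group also deletes the resources inside it,
  # including any data left in the storage account.
  run azure group delete -q "$RG"
}

with_cluster() {
  run azure group create "$RG" "$LOCATION" || return 1
  "$@"              # caller-supplied provisioning and job command
  status=$?
  teardown          # runs whether the job succeeded or failed
  return "$status"
}

# Hypothetical job: replace `echo ...` with a function that runs steps 5 to 8.
with_cluster echo "create storage, create cluster, apply script action, run job"
```

The teardown runs after the job returns, so it does not cover interruption by a signal; a `trap` would be needed for that.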

FAQ

  • What is data/simple_dataset.csv used for?
    • It is a simple test case for running the Pig script in local mode
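
A local-mode run can be sketched as follows, assuming Apache Pig is installed and that the script locates its input file by itself. The `run_local` helper is a hypothetical name, not part of this repository; `pig -x local` is Pig's standard switch for local mode.

```shell
#!/bin/sh
# Sketch: run a Pig script in local mode after two sanity checks.
run_local() {
  script="$1"
  if [ ! -f "$script" ]; then
    echo "missing script: $script" >&2
    return 2
  fi
  if ! command -v pig >/dev/null 2>&1; then
    echo "pig not found on PATH" >&2
    return 127
  fi
  # -x local runs against the local file system instead of a cluster.
  pig -x local "$script"
}
```

From the repository root, call it as `run_local src/aggregate_by_single_grouped.pig`.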
