
cloudera-cca175

Data Ingest

The skills to transfer data between external systems and your cluster. This includes the following basic, useful features:

Import data from a MySQL database into HDFS using Sqoop

sqoop-import

Import the product table from the MySQL clouderacert database

sqoop-import --connect jdbc:mysql://localhost:3306/clouderacert --username root --password cloudera -m 2 --table product

Import all tables from the MySQL clouderacert database

sqoop-import-all-tables --connect jdbc:mysql://localhost:3306/clouderacert --username root --password cloudera -m 2  

Import a subset of tables from the MySQL clouderacert database

sqoop-import-all-tables --connect jdbc:mysql://localhost:3306/clouderacert --username root --password cloudera -m 2 --exclude-tables product_trans

Import a subset of a table's rows from the MySQL clouderacert database

sqoop-import --connect jdbc:mysql://localhost:3306/clouderacert --username root --password cloudera --table product --where "trype='hospital'"

Import a subset of a table's rows from the MySQL clouderacert database into a target directory

sqoop-import --connect jdbc:mysql://localhost:3306/clouderacert --username root --password cloudera --query 'select * from product_trans where product_id=1 AND $CONDITIONS' --target-dir hdfs://quickstart.cloudera:8020/user/cloudera/product_trans -m 2 --split-by product_trans.trans_id

Incremental import

  • By column value
sqoop-import --connect jdbc:mysql://localhost:3306/clouderacert --username root --password cloudera --incremental append --table product_trans --check-column trans_id --last-value 4
  • By last modified date
sqoop-import --connect jdbc:mysql://localhost:3306/clouderacert --username root --password cloudera --incremental lastmodified --table product_trans --check-column <date_column> --last-value <timestamp>
Import job
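No command is given for this item; as a minimal sketch, a Sqoop saved job can wrap the incremental import above so that --last-value is tracked automatically between runs (the job name incr_product_trans is hypothetical, and Sqoop may re-prompt for the password at execution time unless it is configured to store it):

sqoop job --create incr_product_trans -- import --connect jdbc:mysql://localhost:3306/clouderacert --username root --password cloudera --table product_trans --incremental append --check-column trans_id --last-value 4
sqoop job --list
sqoop job --exec incr_product_trans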

sqoop-export

Export data from HDFS to the product table in the MySQL clouderacert database using Sqoop

sqoop-export --connect jdbc:mysql://localhost/clouderacert --username root -P --table productc --export-dir ./product -m 10
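If some rows already exist in the target table, an upsert-style export can be sketched with --update-key and --update-mode (the key column product_id is an assumption, and allowinsert support depends on the JDBC connector):

sqoop-export --connect jdbc:mysql://localhost/clouderacert --username root -P --table productc --export-dir ./product --update-key product_id --update-mode allowinsert -m 10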

Change the delimiter and file format of data during import using Sqoop

sqoop-import --connect jdbc:mysql://localhost/clouderacert --username root -P --table product --fields-terminated-by '|' --lines-terminated-by '\t' --as-textfile -m 1
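The same import can also be stored in a different file format; a sketch using Sqoop's Avro and Parquet options (the --target-dir values are assumptions):

sqoop-import --connect jdbc:mysql://localhost/clouderacert --username root -P --table product --as-avrodatafile --target-dir /user/cloudera/product_avro -m 1
sqoop-import --connect jdbc:mysql://localhost/clouderacert --username root -P --table product --as-parquetfile --target-dir /user/cloudera/product_parquet -m 1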

Ingest real-time and near-real time (NRT) streaming data into HDFS using Flume

Flume Components

Source -> Channel -> Sink

Ingest events into HDFS using a syslog TCP source listening on port 44444, a memory channel, and an HDFS sink (the agent in the configuration below is named SeqGenAgent).

  • Configuration in /usr/lib/flume-ng/conf/seq_gen.conf

# Name the components on the current agent
SeqGenAgent.sources = SeqSource
SeqGenAgent.channels = MemChannel
SeqGenAgent.sinks = HDFS

# Describe/configure the source
SeqGenAgent.sources.SeqSource.type = syslogtcp
SeqGenAgent.sources.SeqSource.port = 44444

# Describe/configure the sink
SeqGenAgent.sinks.HDFS.type = hdfs
SeqGenAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/training/seq_data
SeqGenAgent.sinks.HDFS.hdfs.filePrefix = syslog
SeqGenAgent.sinks.HDFS.hdfs.rollInterval = 0
SeqGenAgent.sinks.HDFS.hdfs.rollCount = 10000
SeqGenAgent.sinks.HDFS.hdfs.fileType = DataStream

# Describe/configure the channel
SeqGenAgent.channels.MemChannel.type = memory
SeqGenAgent.channels.MemChannel.capacity = 1000
SeqGenAgent.channels.MemChannel.transactionCapacity = 100

# Bind the source and sink to the channel
SeqGenAgent.sources.SeqSource.channels = MemChannel
SeqGenAgent.sinks.HDFS.channel = MemChannel

  • Script
sudo flume-ng agent --conf /usr/lib/flume-ng/conf/ -f /usr/lib/flume-ng/conf/seq_gen.conf  -n SeqGenAgent
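To sanity-check the running agent, one option (a sketch, assuming netcat is installed) is to push a syslog-formatted line at port 44444 and then inspect the sink directory:

echo "<13>test event for the HDFS sink" | nc localhost 44444
hdfs dfs -ls /user/training/seq_data
hdfs dfs -cat /user/training/seq_data/syslog*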

Load data into and out of HDFS using the Hadoop File System (FS) commands

Transfer the file mark.csv from the current directory to the HDFS directory /user/training

hdfs dfs -put ./mark.csv /user/training
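Loading data out of HDFS is the reverse direction; a few commonly used FS commands (the local destination name mark_local.csv is an assumption):

hdfs dfs -get /user/training/mark.csv ./mark_local.csv
hdfs dfs -ls /user/training
hdfs dfs -cat /user/training/mark.csv | head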

Transform, Stage, Store

Convert a set of data values in a given format stored in HDFS into new data values and/or a new data format and write them into HDFS. This includes writing Spark applications in both Scala and Python (see note above on exam question format for more information on using either Scala or Python):

The example solutions for this part are available in executable form in src/main/scala/com/github/relmad/cloudera/cca175/SparkPractice.scala. The example is fully portable to a spark-shell execution. If executed from the shell, I recommend using hdfs dfs -rm -r <dirName> instead of the deleteDir(dirName) utility.

Data Analysis

Use Data Definition Language (DDL) to create tables in the Hive metastore for use by Hive and Impala.

Read and/or create a table in the Hive metastore in a given schema

Create a table product with fields product_id (int), product_name (string), product_cost (double), date_purchased (date).

create table product(product_id int,product_name string,product_cost double,date_purchased timestamp);

Verify table creation

show tables;
describe product;
describe formatted product;

Extract an Avro schema from a set of data files using avro-tools

Extract the schema from an example Avro file, product_data.avro

java -jar ./lib/avro-tools-1.8.1.jar getschema ./tmpdata/product_data.avro
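avro-tools can also dump the records themselves, which is handy for checking the data behind the schema (same jar and file as above):

java -jar ./lib/avro-tools-1.8.1.jar tojson ./tmpdata/product_data.avro | head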

Create a table in the Hive metastore using the Avro file format and an external schema file

Create the table product with the Avro schema file product.avsc

CREATE TABLE product
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.url'='hdfs://localhost:8020/user/training/product.avsc')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';

Improve query performance by creating partitioned tables in the Hive metastore

Create table product_part

create table product_part(id int, name string)  
COMMENT 'This is the product part table' 
PARTITIONED BY(year int) 
STORED AS TEXTFILE;

Insert into product_part table from tmp_product_part table

insert overwrite table product_part partition (year=2016) select id,name from tmp_product_part where year=2016; 
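To confirm that the partition exists and that queries can be restricted to it, a quick check (assuming the insert above succeeded):

show partitions product_part;
select * from product_part where year=2016;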

Evolve an Avro schema by changing JSON files

Update the Avro schema of the product table by creating another schema file that adds a new field, then alter the table so that its avro.schema.url property points at the new file.

alter table product set tblproperties ('avro.schema.url'='hdfs://localhost/user/training/product_mod.avsc');
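For reference, a hypothetical product_mod.avsc that adds one nullable field with a default (the field names are assumptions, not taken from the original product.avsc), uploaded to the path used above:

{
  "type": "record",
  "name": "product",
  "fields": [
    {"name": "product_id", "type": "int"},
    {"name": "product_name", "type": "string"},
    {"name": "product_cost", "type": "double"},
    {"name": "product_category", "type": ["null", "string"], "default": null}
  ]
}

hdfs dfs -put product_mod.avsc /user/training/product_mod.avsc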
