The skills to transfer data between external systems and your cluster. This includes the following basic, useful features:
sqoop-import --connect jdbc:mysql://localhost:3306/clouderacert --username root --password cloudera -m 2 --table product
sqoop-import-all-tables --connect jdbc:mysql://localhost:3306/clouderacert --username root --password cloudera -m 2
sqoop-import-all-tables --connect jdbc:mysql://localhost:3306/clouderacert --username root --password cloudera -m 2 --exclude-tables product_trans
sqoop-import --connect jdbc:mysql://localhost:3306/clouderacert --username root --password cloudera --table product --where "type='hospital'"
- Free-form query import ($CONDITIONS is a required placeholder that Sqoop replaces with each mapper's split predicate; --split-by names the column used to generate those splits)
sqoop-import --connect jdbc:mysql://localhost:3306/clouderacert --username root --password cloudera --query 'select * from product_trans where product_id=1 AND $CONDITIONS' --target-dir hdfs://quickstart.cloudera:8020/user/cloudera/product_trans -m 2 --split-by product_trans.trans_id
- Incremental import by column value
sqoop-import --connect jdbc:mysql://localhost:3306/clouderacert --username root --password cloudera --incremental append --table product_trans --check-column trans_id --last-value 4
- Incremental import by last modified date
sqoop-import --connect jdbc:mysql://localhost:3306/clouderacert --username root --password cloudera --incremental lastmodified --table product_trans --check-column <date_column> --last-value <timestamp>
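For a concrete run, the placeholders might be filled in like this (a sketch assuming a hypothetical modified_date timestamp column on product_trans; --append lets the new rows be written into an existing target directory):
sqoop-import --connect jdbc:mysql://localhost:3306/clouderacert --username root --password cloudera --incremental lastmodified --table product_trans --check-column modified_date --last-value "2016-01-01 00:00:00" --append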
Export job
sqoop-export --connect jdbc:mysql://localhost/clouderacert --username root -P --table productc --export-dir ./product -m 10
Import job with custom delimiters and file format
sqoop-import --connect jdbc:mysql://localhost/clouderacert --username root -P --table product --fields-terminated-by '|' --lines-terminated-by '\t' --as-textfile -m 1
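To confirm the delimiter change, inspect the imported file directly (a quick check, assuming Sqoop's default target directory under the HDFS home of the current user, here /user/cloudera/product):
hdfs dfs -cat /user/cloudera/product/part-m-00000 | head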
Source -> Channel -> Sink
Ingest events into HDFS using a syslog TCP source, a memory channel, and an HDFS sink (the SeqGen* component names below are kept from the original sequence-generator example).
- Configuration in /usr/lib/flume-ng/conf/seq_gen.conf
# Naming the components on the current agent
SeqGenAgent.sources = SeqSource
SeqGenAgent.channels = MemChannel
SeqGenAgent.sinks = HDFS
# Describing/Configuring the source
SeqGenAgent.sources.SeqSource.type = syslogtcp
SeqGenAgent.sources.SeqSource.host = localhost
SeqGenAgent.sources.SeqSource.port = 44444
# Describing/Configuring the sink
SeqGenAgent.sinks.HDFS.type = hdfs
SeqGenAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/training/seq_data
SeqGenAgent.sinks.HDFS.hdfs.filePrefix = syslog
SeqGenAgent.sinks.HDFS.hdfs.rollInterval = 0
SeqGenAgent.sinks.HDFS.hdfs.rollCount = 10000
SeqGenAgent.sinks.HDFS.hdfs.fileType = DataStream
# Describing/Configuring the channel
SeqGenAgent.channels.MemChannel.type = memory
SeqGenAgent.channels.MemChannel.capacity = 1000
SeqGenAgent.channels.MemChannel.transactionCapacity = 100
# Binding the source and sink to the channel
SeqGenAgent.sources.SeqSource.channels = MemChannel
SeqGenAgent.sinks.HDFS.channel = MemChannel
- Script to start the agent
sudo flume-ng agent --conf /usr/lib/flume-ng/conf/ -f /usr/lib/flume-ng/conf/seq_gen.conf -n SeqGenAgent
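Since the source is a syslog TCP source, nothing reaches HDFS until events arrive on port 44444. A test event can be pushed in with netcat and the sink output checked against the paths configured above:
echo "<13>Hello Flume" | nc localhost 44444
hdfs dfs -ls /user/training/seq_data
hdfs dfs -cat /user/training/seq_data/syslog.* | head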
Load data into and out of HDFS using the Hadoop File System commands:
hdfs dfs -put ./mark.csv /user/training
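The same objective covers copying data back out of HDFS; with the file just uploaded, for example (mark_local.csv is just an arbitrary local name):
hdfs dfs -ls /user/training
hdfs dfs -get /user/training/mark.csv ./mark_local.csv
hdfs dfs -cat /user/training/mark.csv | head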
Convert a set of data values in a given format stored in HDFS into new data values and/or a new data format and write them into HDFS. This includes writing Spark applications in both Scala and Python (see note above on exam question format for more information on using either Scala or Python):
The example solutions for this part are available in an executable format in src/main/scala/com/github/relmad/cloudera/cca175/SparkPractice.scala. The example is fully portable to a spark-shell execution. If executed from the shell, I recommend using hdfs dfs -rm -r <dirName> instead of the deleteDir(dirName) utility.
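For a rough feel of the kind of conversion the objective describes, here is a minimal sketch that can be pasted into spark-shell (it is not the contents of SparkPractice.scala; the input path, field layout, and delimiters are my own assumptions):
// Read pipe-delimited product records from HDFS (assumed layout: id|name|cost),
// keep products costing more than 100, and write the result back tab-delimited.
val products = sc.textFile("hdfs://localhost:8020/user/training/product")
val expensive = products
  .map(_.split('|'))
  .filter(fields => fields(2).toDouble > 100.0)
  .map(fields => fields.mkString("\t"))
expensive.saveAsTextFile("hdfs://localhost:8020/user/training/product_tab")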
Use Data Definition Language (DDL) to create tables in the Hive metastore for use by Hive and Impala.
## Read and/or create a table in the Hive metastore in a given schema
### Create a table product with fields product_id (int), product_name (string), product_cost (double), date_purchased (date)
create table product(product_id int,product_name string,product_cost double,date_purchased timestamp);
### Verify table creation
show tables;
describe product;
describe formatted product;
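Since the objective also covers Impala, the same table can be checked from impala-shell once the metadata is refreshed (assuming Impala is running on the cluster):
impala-shell -q "INVALIDATE METADATA product"
impala-shell -q "DESCRIBE product"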
## Extract an Avro schema from a set of data files using avro-tools
Extract the schema from an example Avro file product_data.avro
java -jar ./lib/avro-tools-1.8.1.jar getschema ./tmpdata/product_data.avro
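To use the extracted schema with the external-schema table in the next step, one way (a sketch reusing the paths above) is to redirect the output to a file and copy it to HDFS:
java -jar ./lib/avro-tools-1.8.1.jar getschema ./tmpdata/product_data.avro > ./tmpdata/product.avsc
hdfs dfs -put ./tmpdata/product.avsc /user/training/product.avsc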
## Create a table in the Hive metastore using the Avro file format and an external schema file
Create table product with Avro schema file product.avsc
CREATE TABLE product
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.url'='hdfs://localhost:8020/user/training/product.avsc')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';
## Improve query performance by creating partitioned tables in the Hive metastore
Create table product_part
create table product_part(id int, name string)
COMMENT 'This is the product part table'
PARTITIONED BY(year int)
STORED AS TEXTFILE;
Insert into product_part table from tmp_product_part table
insert overwrite table product_part partition (year=2016) select id,name,year from tmp_product_part where year=2016;
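A quick check that the partition was created and contains the inserted rows:
show partitions product_part;
select * from product_part where year=2016 limit 5;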
## Evolve an Avro schema by changing JSON files
Update the Avro schema of product by creating another schema file that adds a new field, then alter the table to point tblproperties at the new Avro schema URL.
alter table product set tblproperties ('avro.schema.url'='hdfs://localhost/user/training/product_mod.avsc');
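For reference, product_mod.avsc might look roughly like this (a sketch: the existing field types are assumed from the product table above, and product_category is a made-up new field; giving new fields a default keeps older Avro data files readable):
{
  "type": "record",
  "name": "product",
  "fields": [
    {"name": "product_id", "type": "int"},
    {"name": "product_name", "type": "string"},
    {"name": "product_cost", "type": "double"},
    {"name": "date_purchased", "type": "long"},
    {"name": "product_category", "type": ["null", "string"], "default": null,
     "doc": "hypothetical new field added during schema evolution"}
  ]
}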