Merlin is a Python framework that simplifies the development and management of Hadoop job workflows. The framework is integrated with the Big Data technology stack, supports Hadoop jobs for Apache MapReduce, streaming MapReduce, Apache Pig, Apache Hive, Apache Sqoop, and Spark, and offers commonly used functionality for developing Extract-Transform-Load (ETL) processes.
- Configure and launch applications that use Java MapReduce, Streaming, Hive, Pig, Spark, Flume, and Kafka.
- Configure and run Sqoop jobs to transfer bulk data between Apache Hadoop and structured datastores.
- Script HDFS / Local File System / FTP operations.
Merlin requires Python 2.7 or later. If you have Python 2.6 or lower, you can download Python 2.7 and run all commands using 'python2.7' instead of 'python'. You can also install Merlin in a virtualenv to avoid conflicts with your system's Python version:
virtualenv ve -p python2.7
source ve/bin/activate
You can install Merlin by downloading the source and using the setup.py script as follows:
git clone https://github.com/epam/Merlin/
cd Merlin
python setup.py install
This setup call installs all of the necessary Python dependencies.
Merlin comes with a comprehensive suite of unit and integration tests.
To run the whole test suite, run
python setup.py nosetests
Individual test cases can be run with the following command:
python setup.py nosetests --tests <package>
For integration testing, Merlin expects a Hadoop installation to be available on localhost.
Documentation can be built with
python setup.py build_sphinx
Here are some short descriptions and code examples to give you a sense of what you're getting into.
Most Hadoop tools provide CLI utilities that can be used to configure and launch a specific Hadoop job from the command line.
Merlin is essentially a wrapper around these CLI utilities, but it also includes some nice features:
- Convenient and simple API for configuring and launching Hadoop jobs from the code.
- Launch pre-configured jobs: job configurations can be loaded from a metastore. A metastore can be a config file that contains settings for more than one job, where the job name is used as the name of the config section containing job-specific settings. A metastore can also be backed by properties from a Hive table.
- Monitor job progress.
- Support for HDFS / FTP / Local File System operations including creating, copying, and deleting files and directories, examining filesystem information about files and directories, etc.
- Logging: Merlin relies on the standard Python logging module to write log messages and provides a default logging configuration. To override the defaults, pass
--log-conf=<path_to_logging_config_file>
on the command line; a minimal example config is sketched just after this list.
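For reference, here is a minimal logging config file in the standard logging.config.fileConfig format understood by Python's logging module. The handler and formatter names are illustrative, not Merlin's actual defaults:

[loggers]
keys=root

[handlers]
keys=console

[formatters]
keys=simple

[logger_root]
level=INFO
handlers=console

[handler_console]
class=StreamHandler
level=INFO
formatter=simple
args=(sys.stdout,)

[formatter_simple]
format=%(asctime)s %(name)s %(levelname)s %(message)s

Saved as, say, logging.ini (a hypothetical file name), it would be passed as --log-conf=logging.ini.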
More extensive examples can be found in the unit / integration tests, the tool module documentation, and the examples sub-module.
MapReduce.prepare_mapreduce_job(
jar="hadoop-mapreduce-examples.jar",
main_class="wordcount",
name="wordcount_job"
).with_config_option("split.by", "'\\t'") \
.with_number_of_reducers(3) \
.with_arguments("/user/vagrant/dmode.txt", "/tmp/test") \
.run()
This will be transformed into the following MapReduce CLI command:
hadoop jar hadoop-mapreduce-examples.jar wordcount -D mapreduce.job.name=wordcount_job -D split.by='\\t' -D mapreduce.job.reduces=3 /user/vagrant/dmode.txt /tmp/test
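The examples in this section assume imports along the following lines. The module paths are assumptions inferred from the class names used here, not confirmed against the Merlin sources, so check the package layout for the exact locations:

from merlin.tools.mapreduce import MapReduce            # assumed module path
from merlin.common.configurations import Configuration  # assumed module path
from merlin.common.metastores import IniFileMetaStore   # assumed module path

path_to_config_file = 'job.ini'  # hypothetical path to the metastore config file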
MapReduce.prepare_mapreduce_job(
jar='mr.jar',
main_class="test.mr.Driver",
config=Configuration.load(metastore=IniFileMetaStore(file=path_to_config_file)),
name='simple_mr_job'
).run()
Job Configurations
[simple_mr_job]
job.config = value.delimiter.char=,
partition.to.process=20142010
args=/input/dir
/output/dir
This will be transformed into the following MapReduce CLI command:
hadoop jar mr.jar test.mr.Driver -D mapreduce.job.name=simple_mr_job -D value.delimiter.char=, \
-D partition.to.process=20142010 /input/dir /output/dir
MapReduce.prepare_streaming_job(
name='job_name'
).take(
'data'
).process_with(
mapper='mapper.py',
reducer='reducer.py'
).save(
'output.txt'
).with_config_option(
key='value.delimiter.char',
value=','
).with_config_option(
key='partition.to.process',
value='20142010'
).run()
This will be transformed into the following MapReduce CLI command:
hadoop jar hadoop-streaming.jar -D mapreduce.job.name=job_name \
-D value.delimiter.char=, -D partition.to.process=20142010 \
-mapper mapper.py -reducer reducer.py -input data -output output.txt
MapReduce.prepare_streaming_job(
config=Configuration.load(metastore=IniFileMetaStore(file=path_to_config_file)),
name='streaming_test_job_with_multiple_inputs'
).run()
Job Configurations
[streaming_test_job_with_multiple_inputs]
mapper= smapper.py
reducer=sreducer.py
input=/raw/20102014
/raw/21102014
/raw/22102014
output=/core/20102014
mapreduce.job.reduces=100
inputformat='org.mr.CustomInputFormat'
outputformat='org.mr.CustomOutputFormat'
libjars=mr_001.jar
mr_002.jar
files=dim1.txt
environment.vars=JAVA_HOME=/java
tmp.dir=/tmp/streaming_test_job_with_multiple_inputs
This will be transformed into the following MapReduce CLI command:
hadoop jar hadoop-streaming.jar -D mapreduce.job.name=streaming_test_job_with_multiple_inputs -files dim1.txt \
-libjars mr_001.jar,mr_002.jar -mapper smapper.py -reducer sreducer.py -numReduceTasks 100 -input /raw/20102014 \
-input /raw/21102014 -input /raw/22102014 -output /core/20102014 -inputformat org.mr.CustomInputFormat \
-outputformat org.mr.CustomOutputFormat -cmdenv JAVA_HOME=/java -cmdenv tmp.dir=/tmp/streaming_test_job_with_multiple_inputs
Pig.load_commands_from_file("/tmp/file").with_parameter("partition", "'010120014'").run()
This will be transformed into the following Pig CLI command:
pig -param partition='010120014' -f "/tmp/file"
Pig.load_commands_from_string("A = LOAD 'data.csv' USING PigStorage()
AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;").run()
This will be transformed into the following Pig CLI command:
pig -e "A = LOAD 'data.csv' USING PigStorage()
AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;"
Pig.load_preconfigured_job(
job_name='pig test',
config=Configuration.load(metastore=IniFileMetaStore(file=path_to_config_file))
).without_split_filter().run()
In order to run the Pig job described below, the configuration file (job.ini) should contain a 'pig test' section:
[pig test]
brief=enabled
SplitFilter=disable
ColumnMapKeyPrune=disable
file=data_processing.pig
The following Pig CLI command will be generated:
pig -brief -optimizer_off SplitFilter -optimizer_off ColumnMapKeyPrune -f data_processing.pig
Hive.load_queries_from_file("/tmp/file").run()
This will be transformed into the following Hive CLI command:
hive -f /tmp/file
Hive.load_queries_from_string('CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);').run()
This will be transformed into the following Hive CLI command:
hive -e "CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);"
hive = Hive.load_preconfigured_job(
name='hive test',
config=Configuration.load(metastore=IniFileMetaStore(file=path_to_config_file))
).with_hive_conf("hello", "world").run()
The configuration file (in our case hive.ini) should contain a 'hive test' section with job-specific parameters. For example:
[hive test]
hive.file=test.hql
define=A=B
C=D
database=hive
hivevar=A=B
C=D
The following Hive CLI command will be generated:
hive -f test.hql --define A=B --define C=D --hiveconf hello=world --hivevar A=B --hivevar C=D --database hive
A basic import of a table named EMPLOYEES in the corp database:
Sqoop.import_data().from_rdbms(
rdbms="mysql",
host="db.foo.com",
database="corp",
username="mysql_user",
password_file=".mysql.pwd"
).table("EMPLOYEES").to_hdfs(target_dir="/data/EMPLOYEES").run()
This will be transformed into the following Sqoop CLI command:
sqoop-import \
--connect jdbc:mysql://db.foo.com/corp \
--username mysql_user \
--password-file .mysql.pwd \
--table EMPLOYEES \
--target-dir /data/EMPLOYEES \
--as-textfile
Sqoop.export_data().to_rdbms(
rdbms="mysql",
username="root",
password_file="/user/cloudera/password",
host="localhost",
database="sqoop_tests"
).table(table="table_name").from_hdfs(
export_dir="/core/data"
).run()
This will be transformed into the following Sqoop CLI command:
sqoop export --connect jdbc:mysql://localhost/sqoop_tests \
--username root \
--password-file /user/cloudera/password \
--export-dir /core/data \
--table table_name
SparkApplication().master(SparkMaster.yarn_client()).application("application_jar", main_class="Main", app_name="Spark").run()
This will be transformed into the following Spark CLI command:
spark-submit --master yarn-client --class Main --name Spark application_jar
SparkApplication.load_preconfigured_job(
name="spark_app",
config=Configuration.load(metastore=IniFileMetaStore(file=path_to_config_file))
).run()
In order to run the Spark job, the configuration file 'config_file.ini' should include a 'spark_app' section. For example:
[spark_app]
application.jar=test.jar
master=local[10]
class=test.SparkApp
name=test_app
jars=lib001.jar
lib002.jar
files=dim001.cache.txt
dim002.cache.txt
properties-file=spark.app.configs
conf=spark.app.name=test_app
spark.executor.memory=512m
The following Spark application submit command will be generated:
spark-submit --master local[10] --class test.SparkApp --name test_app \
--jars lib001.jar,lib002.jar --files dim001.cache.txt,dim002.cache.txt \
--properties-file spark.app.configs --conf "spark.app.name=test_app spark.executor.memory=512m" test.jar
Flume.agent(agent="a1", conf_file="path/to/file")\
.with_jvm_D_option(name="jvm_option", value="test")\
.load_configs_from_dir("path/to/dir")\
.load_plugins_from_dirs(pathes=["path/to/plugins1,path/to/plugins2", "path/to/plugins3"])\
.run()
This will be transformed into the following Flume CLI command:
flume-ng agent --name a1 --conf-file path/to/file --conf path/to/dir \
--plugins-path path/to/plugins1,path/to/plugins2,path/to/plugins3 -Xjvm_option=test
Kafka.start_broker(path_to_config="path/to/config")
Kafka.stop_broker(path_to_config="path/to/config")
These will be transformed into the following Kafka CLI commands:
kafka-server-start.sh path/to/config
kafka-server-stop.sh path/to/config
Kafka.run_producer(name='namespace.Producer', args=["conf1", "conf2"])
This will be transformed into the following Kafka CLI command:
kafka-run-class.sh namespace.Producer conf1 conf2
Merlin provides methods to easily interact with the local file system, HDFS, and FTP/FTPS/SFTP servers. These methods are described in the table below.

HDFS | LocalFS | FTP/SFTP/FTPS | Description
---|---|---|---
exists | exists | exists | Tests whether a file exists.
is_directory | is_directory | is_directory | Tests whether a file is a directory.
list_files | list_files | list_files | Returns a list of the files at the given path.
recursive_list_files | - | - | Recursively returns a list of the files under the given path.
create | create | create | Creates a new empty file or directory.
copy_to_local | - | download_file / download_dir | Copies a file to the local file system.
copy | copy | - | Copies a file.
move | move | - | Moves a file.
size | size | size | Calculates the aggregate length of the files contained in a directory, or the length of a file in the case of a regular file.
merge | - | - | Merges the files at an HDFS path into a single file on the local file system.
delete | delete | delete | Deletes a file if it exists.
permissions | permissions | - | Gets the permissions of a file.
owner | owner | - | Gets a file's owner.
modification_time | modification_time | modification_time | Gets a file's modification time.
- | copy_to_hdfs | - | Copies a file from the local file system to HDFS.
distcp | - | - | Copies files between HDFS clusters.
- | - | upload | Uploads a file to the FTP server.
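As a rough sketch of how these operations might combine in practice (the module paths and constructor signatures below are assumptions made for illustration, not confirmed Merlin API; see the fs sub-module documentation for the real entry points):

# Hypothetical usage; module paths and constructors are assumed, not verified.
from merlin.fs.hdfs import HDFS
from merlin.fs.localfs import LocalFS

hdfs_file = HDFS('/tmp/data/part-00000')
if hdfs_file.exists() and not hdfs_file.is_directory():
    print(hdfs_file.size())                     # length of the file in bytes
    hdfs_file.copy_to_local('/tmp/local_copy')  # download to the local file system

local_copy = LocalFS('/tmp/local_copy')
print(local_copy.owner())                       # file owner, per the table above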