
jydoop's Introduction

jydoop: Efficient and Testable Hadoop map-reduce in Python

Purpose

Querying Hadoop/HBase using custom Java classes is complicated and tedious. It's very difficult to test and debug analyses on small sets of sample data, or without setting up a Hadoop/HBase cluster.

Writing analyses in Python allows for easier local development + testing without having to set up Hadoop or HBase. The same analysis scripts can then be deployed to a production cluster configuration.

Writing Scripts

The quickest way to get started is to have a look at some of the examples in scripts/.

The minimal jydoop job requires only two function definitions (a sketch combining them follows the list):

  • map(key, value, context) The map phase of MapReduce - called once for each input record, and data is written out by calling context.write(new_key, new_value)
  • setupjob(job, args) Determines how data is made available to the job script and processes incoming arguments. Usually you will use an existing implementation of this function.
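
Putting these together, a minimal script might look like the sketch below. It is not a script from the repo: the word-count logic is invented for illustration, and jydoop.setupjob is reused, so the input is assumed to be a jydoop-formatted sequence file whose values are plain strings.

"""wordcount.py - hypothetical minimal jydoop script (sketch)."""
import jydoop

def map(key, value, context):
    # Emit a count of 1 for every whitespace-separated token in the record.
    for word in value.split():
        context.write(word, 1)

# Reuse an existing setupjob implementation rather than writing one from scratch.
setupjob = jydoop.setupjob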

For most Mozilla data sources, there are predefined setupjob functions available that you can reference in your script without implementing your own. There are examples of this included in the scripts/ dir.

In addition to the Mozilla data sources, there is also support for using the output of one jydoop job as the input of another. More on this below.

Besides the required map and setupjob functions, there are a number of optional functions you can implement for full MapReduce functionality (a sketch using several of them follows the list):

  • reduce(key, values, context) The reduce phase of MapReduce - called once for each key (as output by the Map phase) with a list of all values seen for that key. If you do not define a reduce function in your script, it will run as a Map-only job and skip the Reduce phase entirely.
  • combine(key, values, context) An intermediate way of reducing partial value lists for a key. This function is entirely optional, but can improve performance if the logic for reducing values can be done in pieces. A good example of this is count-type jobs, where the overall reduce will still work fine even if some subsets of values have already been summed. The Hadoop Documentation has a nice description of the combine phase.
  • output(path, results) You may override how data is written out to the destination file by implementing this method. The default behaviour is usually fine. The results argument is an iterator on the (key, value) pairs that come from the Reduce step.
  • mapsetup(context) Called before the Map phase
  • mapcleanup(context) Called after the Map phase
  • skip_local_output() If this returns True, then output from the job is not downloaded from HDFS to a local file. The supplied output filename is used as the location in HDFS where data will be stored. If this function is omitted, the default behaviour is to output to a local text file, then remove any data from HDFS.
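
As a sketch of how several of these fit together, the following counting job is hypothetical, not from the repo: the tab-separated output format and the use of jydoop.setupjob are illustrative assumptions.

"""count_values.py - hypothetical sketch combining some optional hooks."""
import jydoop

def map(key, value, context):
    context.write(value, 1)

def combine(key, values, context):
    # Partial sums are safe for counting, so combine mirrors reduce.
    context.write(key, sum(values))

def reduce(key, values, context):
    context.write(key, sum(values))

def output(path, results):
    # Optional override: write tab-separated key/value pairs to the local file.
    f = open(path, "w")
    try:
        for k, v in results:
            f.write("%s\t%s\n" % (k, v))
    finally:
        f.close()

setupjob = jydoop.setupjob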

Testing Locally

To test scripts, use locally saved sample data and FileDriver.py:

python FileDriver.py scripts/osdistribution.py sample.json analysis.out

where sample.json is a newline-delimited JSON dump. See the examples in scripts/ for map-only or map-reduce jobs.

Local testing can be done on any machine with Python installed, and doesn't require access to any extra libraries beyond what is included with jydoop, nor does it require connectivity to the Hadoop cluster.
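
If you don't have sample data handy, a small newline-delimited JSON file can be generated with a throwaway script like the hypothetical one below (the field names are invented for illustration; real telemetry records look different):

"""make_sample.py - hypothetical helper that writes newline-delimited JSON."""
import json

records = [
    {"info": {"OS": "WINNT"}},
    {"info": {"OS": "Darwin"}},
    {"info": {"OS": "Linux"}},
]

with open("sample.json", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")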

Production Setup

Fetch resources

Fetch dependent JARs using

make download

Note: You may need to set the http_proxy environment variable to allow curl to get out to the internet:

export http_proxy=http://myproxy:port

Running as a Hadoop Job

Python scripts are wrapped into driver.jar with the Java driver.

For example, to count the distribution of operating systems on Mozilla telemetry data for March 30th, 2013 run:

make hadoop ARGS="scripts/osdistribution.py outputfile 20130330 20130330"

Supported Types of Jobs

jydoop supports several backend data sources, which fall into the following types:

  • HBase (mapper type HBASE)
  • Plain sequence files (mapper type TEXT)
  • Jydoop-formatted sequence files (mapper type JYDOOP)

The mapper type is set by the setupjob function; the supported values are HBASE, TEXT, and JYDOOP. The default is HBASE, so HBase data sources do not need to specify it. Other mapper types should set the org.mozilla.jydoop.mappertype key in the job Configuration. For example, the TestPilot setupjob function sets the mapper type using:

job.getConfiguration().set("org.mozilla.jydoop.mappertype", "TEXT")

All the different types require at least two arguments, namely the script to be run and the filename where output will be sent.
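
For data sources without a predefined helper, a custom setupjob might look roughly like the sketch below. It assumes the script runs under Jython with the Hadoop classes on the classpath, and the convention that args[0] names the HDFS input path is invented for illustration:

"""custom_setup.py - hypothetical custom setupjob for plain text input (sketch)."""
from org.apache.hadoop.fs import Path
from org.apache.hadoop.mapreduce.lib.input import FileInputFormat

def setupjob(job, args):
    # Tell the driver to use the TEXT mapper and read from the given HDFS path.
    job.getConfiguration().set("org.mozilla.jydoop.mappertype", "TEXT")
    FileInputFormat.addInputPath(job, Path(args[0]))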

Telemetry

Telemetry jobs take two extra arguments, the start date and the end date of the range you want to analyze. Telemetry data is quite large, so it's best to keep the date range to a few days max.

The production example above uses the HBase support to access Telemetry data (using the telemetryutils.hbase_setupjob setup function).

You may also access the most recent 14 days' data in HDFS. This will make jobs finish somewhat more quickly than using HBase. To use the HDFS data, specify the following in your script:

setupjob = telemetryutils.setupjob

Check the osdistribution.hdfs.py script for an example.
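
A minimal Telemetry script along those lines might look like the sketch below. The payload fields ("info"/"OS") are assumptions about the record layout; the scripts in scripts/ are the authoritative reference.

"""os_count.py - hypothetical Telemetry sketch using the HDFS data."""
import json
import jydoop
import telemetryutils

def map(key, value, context):
    payload = json.loads(value)
    # Assumed payload layout; adjust to the actual Telemetry schema.
    context.write(payload.get("info", {}).get("OS", "unknown"), 1)

reduce = jydoop.sumreducer
setupjob = telemetryutils.setupjob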

Firefox Health Report Jobs

FHR jobs don't require any extra arguments beyond the script name and the output file.

To help reduce the boilerplate required to write Firefox Health Report jobs, a special decorator and Python class are made available. From your job script:

from healthreportutils import (
    FHRMapper,
    setupjob,
)


@FHRMapper(only_major_channels=True)
def map(key, payload, context):
    # Skip records from clients that have Telemetry enabled.
    if payload.telemetry_enabled:
        return

    # Process each day's per-provider data here.
    for day, providers in payload.daily_data():
        pass

When the @FHRMapper decorator is applied to a map function, the second argument to the function is automatically converted to a healthreportutils.FHRPayload instance. In addition, special arguments can be passed to the decorator to perform common filtering operations outside of your job.

See the source in healthreportutils.py for complete usage info.

TestPilot

TestPilot jobs access data in Sequence Files in HDFS.

Scripts accessing TestPilot data require three arguments: the TestPilot study name, the start date, and the end date. For example, to run the sample script against the testpilot_micropilot_search_study_2 study from June 10th to June 17th, 2013:

make hadoop ARGS="scripts/testpilot_test.py test_out.txt testpilot_micropilot_search_study_2 2013-06-10 2013-06-17"

Jydoop output -> jydoop input

You can run a jydoop job against the output of a previous jydoop job.

This enables workflows where you first filter or preprocess your input data and store it back in HDFS, then write a number of different analysis scripts that work on the filtered data set.

Normally jydoop jobs remove their output in HDFS once the data has been saved locally.

If you want to keep the output in HDFS instead of saving locally, you can implement the skip_local_output function in your job script (and have it return True). This will cause the data not to be saved locally, and also prevent it from being deleted from HDFS when the job is complete.

You then use the job's output in another job by using the jydoop.setupjob function in your script.

As a simplistic example, if you have a two-stage job which first reads and filters TestPilot data and stores the result into an HDFS location interesting_output, then reads interesting_output to produce local data, you could do the following:

"""stage1.py - Filter TestPilot input for an interesting person"""
import json
import testpilotutils
import jydoop
def map(key, value, context):
    payload = json.loads(value)
    if payload["personid"] == "interesting!":
        context.write(key, value)

def skip_local_output():
    return True

setupjob = testpilotutils.setupjob

"""stage2.py - Count events by type"""
import json
import jydoop
def map(key, value, context):
    payload = json.loads(value)
    for event in payload["events"]:
        context.write(event["type"], 1)

reduce = jydoop.sumreducer
combine = jydoop.sumreducer
setupjob = jydoop.setupjob

Run the jobs:

# Generate the HDFS output:
make hadoop ARGS="scripts/stage1.py interesting_output testpilot_study_1337 2013-06-10 2013-06-24"

# Process HDFS data and output local data:
make hadoop ARGS="scripts/stage2.py final_result.txt interesting_output"

You can then run any other jobs against interesting_output without having to re-filter the original data.

jydoop's People

Contributors

bsmedberg, gavinsharp, indygreg, mreid-moz


jydoop's Issues

log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient)

I accidentally pushed a script that contained some errors. Hadoop's output spewed a lot of the following:

attempt_201304170845_0427_m_000013_2: log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201304170845_0427_m_000013_2: log4j:WARN Please initialize the log4j system properly.
13/04/19 21:29:54 INFO mapred.JobClient: Task Id : attempt_201304170845_0427_m_000009_2, Status : FAILED
Traceback (most recent call last):
File "/data3/hadoop/mapred/mapred/taskTracker/gszorc/jobcache/job_201304170845_0427/jars/job.jar/scripts/healthreportutils.py", line 156, in wrapper
File "scripts/fhr_session_counts.py", line 15, in map
File "/data3/hadoop/mapred/mapred/taskTracker/gszorc/jobcache/job_201304170845_0427/jars/job.jar/scripts/healthreportutils.py", line 26, in get
File "/data3/hadoop/mapred/mapred/taskTracker/gszorc/jobcache/job_201304170845_0427/jars/job.jar/scripts/healthreportutils.py", line 96, in telemetry_enabled
File "/data3/hadoop/mapred/mapred/taskTracker/gszorc/jobcache/job_201304170845_0427/jars/job.jar/scripts/healthreportutils.py", line 121, in iterdays
File "/data3/hadoop/mapred/mapred/taskTracker/gszorc/jobcache/job_201304170845_0427/jars/job.jar/scripts/healthreportutils.py", line 26, in get
File "/data3/hadoop/mapred/mapred/taskTracker/gszorc/jobcache/job_201304170845_0427/jars/job.jar/scripts/healthreportutils.py", line 88, in days
File "/data4/hadoop/m

While the stack trace is mine, I suspect the log4j messages are due to jydoop not ideally integrating with Hadoop/log4j.

Don't shadow map() and reduce() built-ins

Currently script jobs define the map() and possibly reduce() functions, shadowing Python's global map() and reduce() functions. The built-in functions are quite useful. Moreover it's considered bad Python programming practice to shadow built-in symbols (the common Python style checking tools all raise a fuss when this is done).

Please change the API so special symbols in job scripts aren't sharing names with Python built-ins.

As simple alternatives, I'll throw out prefixing the symbols with "jydoop_" or "JYDOOP_" or capitalizing the special symbols, e.g. "def MAP(key, value, context)".

output files lost if terminal connection is lost

I discovered this problem b/c of the flaky wi-fi in the SF office... if my connection to e.g. the Mango cluster is lost, my jobs will complete without issue, but the target output file will not be written.

'screen' can be used as a workaround, but I wonder if it's possible to have jydoop write its output independently of an active shell (or whatevs).

Wiki changes

FYI: The following changes were made to this repository's wiki:

These were made as the result of a recent automated defacement of publicly writable wikis.

Don't overwrite json package

While the streaming Jackson API is much faster than the built-in json module, I think it is a bad idea on principle to overwrite the built-in json module with the monkeypatched functions. There are legitimate cases where someone may want to use the additional features of the built-in json module APIs. Using Jackson shouldn't preclude this from occurring.

I think the Jackson JSON API should be exposed under say "import jacksonjson" or similar.

Java object in output file

I ran some ANR report jobs on Mango. For example,

make hadoop ARGS="scripts/anr.py anr-20130403-20130403 20130403 20130403"

In the hdfs output directory, I see mostly 90-byte long files,

-rw-r--r--   3 nchen users         90 2013-04-09 07:19 /user/nchen/anr-20130403-20130403/part-m-00900
-rw-r--r--   3 nchen users         90 2013-04-09 07:19 /user/nchen/anr-20130403-20130403/part-m-00901
-rw-r--r--   3 nchen users      39244 2013-04-09 07:19 /user/nchen/anr-20130403-20130403/part-m-00902
-rw-r--r--   3 nchen users         90 2013-04-09 07:17 /user/nchen/anr-20130403-20130403/part-m-00903
-rw-r--r--   3 nchen users         90 2013-04-09 07:19 /user/nchen/anr-20130403-20130403/part-m-00904
-rw-r--r--   3 nchen users         90 2013-04-09 07:19 /user/nchen/anr-20130403-20130403/part-m-00905

Examining the content of a 90-byte file, it appears to be a serialized Java object,

0000000: 53 45 51 06 1f 6f 72 67 2e 6d 6f 7a 69 6c 6c 61  SEQ..org.mozilla
0000010: 2e 70 79 64 6f 6f 70 2e 54 79 70 65 57 72 69 74  .pydoop.TypeWrit
0000020: 61 62 6c 65 1f 6f 72 67 2e 6d 6f 7a 69 6c 6c 61  able.org.mozilla
0000030: 2e 70 79 64 6f 6f 70 2e 54 79 70 65 57 72 69 74  .pydoop.TypeWrit
0000040: 61 62 6c 65 00 00 00 00 00 00 9c 60 4e fb 04 2f  able.......`N../
0000050: c2 53 91 26 f8 a4 a9 92 7e 7f                    .S.&....~.

Other files longer than 90 bytes also have this header in them.

I also ran into exceptions when running some jobs. This only happened intermittently so I'm not sure it's related. For example,

org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server null for region , row '', but failed after 10 attempts.
Exceptions:
java.lang.NullPointerException
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1290)
    at org.apache.hadoop.hbase.client.HTable$ClientScanner.nextScanner(HTable.java:1142)
    at org.apache.hadoop.hbase.client.HTable$ClientScanner.initialize(HTable.java:1065)
    at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:559)
    at com.mozilla.hadoop.hbase.mapreduce.MultiScanTableInputFormat$TableRecordReader.restart(MultiScanTableInputFormat.java:232)
    at com.mozilla.hadoop.hbase.mapreduce.MultiScanTa

Better support for modules and sys.path

Currently, sys.path is munged to include just the path of the executing script. This essentially means that all scripts must be in the same directory. I don't believe this will scale, as we will likely inevitably have dozens (possibly hundreds) of scripts and we will likely want to organize them in directories.

I think we should have a directory for shared modules and a directory for scripts. The directory for shared modules is always added to sys.path. And, we continue to add the script's directory to sys.path, if needed.

If nothing else, this is the Python way and it will encourage code re-use among jobs because there is a clear and obvious place for reusable code/modules.
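
A rough sketch of what that could look like (the lib/ directory name and the setup_paths helper are hypothetical, not what jydoop currently does):

"""Hypothetical sys.path setup for the layout proposed above (sketch)."""
import os
import sys

def setup_paths(script_path):
    script_dir = os.path.dirname(os.path.abspath(script_path))
    shared_dir = os.path.join(os.path.dirname(script_dir), "lib")
    # Always expose the shared module directory, then the script's own directory.
    for candidate in (shared_dir, script_dir):
        if candidate not in sys.path:
            sys.path.insert(0, candidate)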

request: identity reducer

For jobs where there is no reducer. Or is it better to just not include one?

def identityreducer(k, vlist, cx):
    for v in vlist:
        cx.write(k, v)

output python (string,string) k/v pairs to HDFS as (Text,Text)

Needed to join the output of jydoop jobs run against FHR data dumps. This is required for FHR de-orphaning, and is a blocker on the new algorithm being tested/deployed. I spent all day trying to do this myself, but I still have no idea what's going on (where can you cast reducer output to (Text,Text)?).

ctrl+c of job should abort job

Currently, if I hit ctrl+c during a |make ARGS="..." hadoop|, the Hadoop job keeps running in the background. It feels unexpected to hit ctrl+c and not have the job aborted.

I'm not sure if this is the default behavior of Hadoop or if running hadoop through make is causing weirdness here. If it's make, fixing this may involve not running jobs through make :/

CODE_OF_CONDUCT.md file missing

As of January 1 2019, Mozilla requires that all GitHub projects include this CODE_OF_CONDUCT.md file in the project root. The file has two parts:

  1. Required Text - All text under the headings Community Participation Guidelines and How to Report is required and should not be altered.
  2. Optional Text - The Project Specific Etiquette heading provides a space to speak more specifically about ways people can work effectively and inclusively together. Some examples of those can be found on the Firefox Debugger project, and Common Voice. (The optional part is commented out in the raw template file, and will not be visible until you modify and uncomment that part.)

If you have any questions about this file, or Code of Conduct policies and procedures, please reach out to [email protected].

(Message COC001)
