shadanan / hadooplink
Mathematica package for reading files off of HDFS
License: Other
Right now the input and output formats for HadoopLink MapReduce jobs are hard-coded as SequenceFileInputFormat and SequenceFileOutputFormat, respectively:
[paul-jean@paul-jeanmaclap: hadoop-link]$ grep "setInputFormatClass" src/java/com/wolfram/hadoop/mapreduce/MathematicaJob.java
job.setInputFormatClass(SequenceFileInputFormat.class);
[paul-jean@paul-jeanmaclap: hadoop-link]$ grep "setOutputFormatClass" src/java/com/wolfram/hadoop/mapreduce/MathematicaJob.java
job.setOutputFormatClass(SequenceFileOutputFormat.class);
It would be nice to be able to choose alternative input/output formats.
The cleanest approach that comes to mind is to add "InputFormat" and "OutputFormat" options to the HadoopMapReduceJob function:
HadoopMapReduceJob[
$$link,
"blah blah",
inputfile,
outputdir,
Function[{k,v}, ...],
Function[{k,vs}, ...],
"InputFormat" -> "Text",
"OutputFormat" -> "SequenceFile" (* or Automatic *)
]
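On the Java side, one way this could work is a small lookup from the option value to the Hadoop format class. The sketch below is a hypothetical helper (FormatOptions is not in the current codebase); it resolves option strings to fully qualified class names, which MathematicaJob could then load and pass to job.setInputFormatClass/setOutputFormatClass. Here it returns Strings so the example runs without the Hadoop jars on the classpath.

```java
import java.util.Map;

// Hypothetical helper: map the proposed "InputFormat"/"OutputFormat"
// option values to Hadoop format class names. "Automatic" (or an
// omitted option) preserves the current SequenceFile behavior.
public class FormatOptions {
    private static final Map<String, String> INPUT_FORMATS = Map.of(
        "Text", "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "SequenceFile", "org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat");

    private static final Map<String, String> OUTPUT_FORMATS = Map.of(
        "Text", "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat",
        "SequenceFile", "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat");

    public static String resolveInputFormat(String option) {
        return resolve(INPUT_FORMATS, option, "InputFormat");
    }

    public static String resolveOutputFormat(String option) {
        return resolve(OUTPUT_FORMATS, option, "OutputFormat");
    }

    private static String resolve(Map<String, String> table, String option, String name) {
        // null or "Automatic" falls back to the existing default.
        if (option == null || option.equals("Automatic")) {
            return table.get("SequenceFile");
        }
        String cls = table.get(option);
        if (cls == null) {
            throw new IllegalArgumentException("Unknown " + name + ": " + option);
        }
        return cls;
    }
}
```

The real implementation would load the class via Class.forName (or accept a user-supplied fully qualified class name) rather than returning a String, but the option-to-default mapping would look the same.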
The reason I want this at the moment is so I can write a simple job that converts text files into SequenceFiles:
Clear[GenomeIndexMapper];
GenomeIndexMapper := Function[{k, v},
Yield[k, v]
]
Clear[GenomeIndexReducer];
GenomeIndexReducer := Function[{k, vs},
While[vs@hasNext[],
Yield[k, vs@next[]]
]
]
With[{
input =
DFSFileNames[$$link, "*.fa",
"/user/paul-jean/hadooplink/genomes/hs/"],
out = "/user/paul-jean/hadooplink/genomes/test"},
If[DFSFileExistsQ[$$link, out], DFSDeleteDirectory[$$link, out]];
HadoopMapReduceJob[
$$link,
"test identity mappers/reducers",
input,
out,
GenomeIndexMapper,
GenomeIndexReducer,
"InputFormat" -> "Text",
"OutputFormat" -> "SequenceFile"
]
]
link = OpenHadoopLink[
"fs.default.name" -> "hdfs://hadoopheadlx.wolfram.com:8020",
"mapred.job.tracker" -> "hadoopheadlx.wolfram.com:8021"];
DFSFileNames[link]
Java::excptn: A Java exception occurred: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FileSystem
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:249).
LoadJavaClass::fail: Java failed to load class org.apache.hadoop.fs.FileSystem. >>
Java::excptn: A Java exception occurred: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:249).
LoadJavaClass::fail: Java failed to load class org.apache.hadoop.conf.Configuration. >>
KeepJavaObject::obj: At least one argument to KeepJavaObject was not a valid Java object. >>
Java::excptn: A Java exception occurred: java.lang.ClassNotFoundException: org.apache.hadoop.fs.Path
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:249).
General::stop: Further output of Java::excptn will be suppressed during this calculation. >>
LoadJavaClass::fail: Java failed to load class org.apache.hadoop.fs.Path. >>
General::stop: Further output of LoadJavaClass::fail will be suppressed during this calculation. >>
StringMatchQ::strse: String or list of strings expected at position 1 in StringMatchQ[OperatingSystem->Unix,*]. >>
Very strange error.
The whole package worked nicely until my cluster crashed. After reinstalling the OS and everything else, lots of errors emerged, including this one.
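The ClassNotFoundException messages above suggest the Hadoop client jars are no longer on the classpath that J/Link's JVM sees after the reinstall (this is a diagnosis, not confirmed by the source). A standalone probe like the following, run in the same JVM, shows which of the required classes are actually visible:

```java
// Classpath probe: report whether the Hadoop classes that HadoopLink
// needs can be loaded by the current JVM. If these print MISSING, the
// Hadoop jars need to be added back to the classpath.
public class ClasspathProbe {
    public static boolean isVisible(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "org.apache.hadoop.fs.FileSystem",
            "org.apache.hadoop.conf.Configuration",
            "org.apache.hadoop.fs.Path"
        };
        for (String c : required) {
            System.out.println(c + " -> " + (isVisible(c) ? "found" : "MISSING"));
        }
    }
}
```

From Mathematica, the equivalent check is to call AddToClassPath with the directory containing the Hadoop jars before loading HadoopLink, then retry LoadJavaClass["org.apache.hadoop.fs.FileSystem"].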
2016-02-11 19:35:06,117 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2016-02-11 19:35:06,184 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2016-02-11 19:35:06,208 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2016-02-11 19:35:06,208 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2016-02-11 19:35:06,231 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2016-02-11 19:35:06,233 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2016-02-11 19:35:06,233 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2016-02-11 19:35:06,726 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.lang.StringIndexOutOfBoundsException: String index out of range: -6
at java.lang.String.substring(String.java:1967)
at com.wolfram.hadoop.mapreduce.MapReduceKernelLink.loadPackageFromJar(Unknown Source)
at com.wolfram.hadoop.mapreduce.MapReduceKernelLink.get(Unknown Source)
at com.wolfram.hadoop.mapreduce.MathematicaMapper.setup(Unknown Source)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:518)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:303)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
2016-02-11 19:35:06,730 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
The crash is located in MapReduceKernelLink.loadPackageFromJar, here:
String packagePath = packageURL.getPath();
int n = packagePath.lastIndexOf("!");
String jarPath = packagePath.substring(5, n);
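The "-6" in the exception means lastIndexOf("!") returned -1 (no "!" in the path), so substring(5, -1) computes a negative length of -6. That happens whenever the package URL is not a jar: URL, e.g. when the package is loaded from a plain directory. A defensive rewrite (a sketch of a possible fix, not the project's actual patch) would guard both the "!" separator and the "file:" prefix:

```java
// Sketch of a guarded replacement for the jar-path extraction in
// loadPackageFromJar. The original code assumes the path always looks
// like "file:/path/to/HadoopLink.jar!/Package.m"; when the "!" is
// absent, lastIndexOf returns -1 and substring(5, -1) throws
// StringIndexOutOfBoundsException: -6.
public class JarPathFix {
    public static String jarPath(String packagePath) {
        int n = packagePath.lastIndexOf('!');
        if (n < 0) {
            // Not inside a jar (e.g. loaded from a directory): use the
            // path as-is instead of crashing.
            return packagePath;
        }
        // Strip the "file:" scheme only if it is actually present.
        int start = packagePath.startsWith("file:") ? 5 : 0;
        return packagePath.substring(start, n);
    }
}
```

With this guard, running the mapper from an exploded directory degrades gracefully instead of killing the child task.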
When using DFSImport[h_HadoopLink, file_String, "SequenceFile"], it reports:
cannot load java class: com.wolfram.hadoop.dfs.SequenceFileImportReader