shadanan / hadooplink
Mathematica package for reading files off of HDFS
License: Other
Right now the input and output formats for HadoopLink MapReduce jobs are hard-coded as SequenceFileInputFormat and SequenceFileOutputFormat, respectively:
[paul-jean@paul-jeanmaclap: hadoop-link]$ grep "setInputFormatClass" src/java/com/wolfram/hadoop/mapreduce/MathematicaJob.java
job.setInputFormatClass(SequenceFileInputFormat.class);
[paul-jean@paul-jeanmaclap: hadoop-link]$ grep "setOutputFormatClass" src/java/com/wolfram/hadoop/mapreduce/MathematicaJob.java
job.setOutputFormatClass(SequenceFileOutputFormat.class);
It would be nice to be able to choose alternative input/output formats.
The cleanest approach that comes to mind is to add "InputFormat" and "OutputFormat" options to the HadoopMapReduceJob function:
HadoopMapReduceJob[
$$link,
"blah blah",
inputfile,
outputdir,
Function[{k,v}, ...],
Function[{k,vs}, ...],
"InputFormat" -> "Text",
"OutputFormat" -> "SequenceFile" (* or Automatic *)
]
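On the Java side, one way this could work is a small lookup from the option value to the Hadoop format class. The sketch below is a hypothetical helper (FormatOptions is not in the current codebase); it resolves option strings to fully qualified class names, which MathematicaJob could then load and pass to job.setInputFormatClass/setOutputFormatClass. Here it returns Strings so the example runs without the Hadoop jars on the classpath.

```java
import java.util.Map;

// Hypothetical helper: map the proposed "InputFormat"/"OutputFormat"
// option values to Hadoop format class names. "Automatic" (or an
// omitted option) preserves the current SequenceFile behavior.
public class FormatOptions {
    private static final Map<String, String> INPUT_FORMATS = Map.of(
        "Text", "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "SequenceFile", "org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat");

    private static final Map<String, String> OUTPUT_FORMATS = Map.of(
        "Text", "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat",
        "SequenceFile", "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat");

    public static String resolveInputFormat(String option) {
        return resolve(INPUT_FORMATS, option, "InputFormat");
    }

    public static String resolveOutputFormat(String option) {
        return resolve(OUTPUT_FORMATS, option, "OutputFormat");
    }

    private static String resolve(Map<String, String> table, String option, String name) {
        // null or "Automatic" falls back to the existing default.
        if (option == null || option.equals("Automatic")) {
            return table.get("SequenceFile");
        }
        String cls = table.get(option);
        if (cls == null) {
            throw new IllegalArgumentException("Unknown " + name + ": " + option);
        }
        return cls;
    }
}
```

The real implementation would load the class via Class.forName (or accept a user-supplied fully qualified class name) rather than returning a String, but the option-to-default mapping would look the same.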
The reason I want this at the moment is so I can write a simple job that converts text files into SequenceFiles:
Clear[GenomeIndexMapper];
GenomeIndexMapper := Function[{k, v},
Yield[k, v]
]
Clear[GenomeIndexReducer];
GenomeIndexReducer := Function[{k, vs},
While[vs@hasNext[],
Yield[k, vs@next[]]
]
]
With[{
input =
DFSFileNames[$$link, "*.fa",
"/user/paul-jean/hadooplink/genomes/hs/"],
out = "/user/paul-jean/hadooplink/genomes/test"},
If[DFSFileExistsQ[$$link, out], DFSDeleteDirectory[$$link, out]];
HadoopMapReduceJob[
$$link,
"test identity mappers/reducers",
input,
out,
GenomeIndexMapper,
GenomeIndexReducer,
"InputFormat" -> "Text",
"OutputFormat" -> "SequenceFile"
]
]
link = OpenHadoopLink[
"fs.default.name" -> "hdfs://hadoopheadlx.wolfram.com:8020",
"mapred.job.tracker" -> "hadoopheadlx.wolfram.com:8021"];
DFSFileNames[link]
Java::excptn: A Java exception occurred: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FileSystem
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:249).
LoadJavaClass::fail: Java failed to load class org.apache.hadoop.fs.FileSystem. >>
Java::excptn: A Java exception occurred: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:249).
LoadJavaClass::fail: Java failed to load class org.apache.hadoop.conf.Configuration. >>
KeepJavaObject::obj: At least one argument to KeepJavaObject was not a valid Java object. >>
Java::excptn: A Java exception occurred: java.lang.ClassNotFoundException: org.apache.hadoop.fs.Path
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:249).
General::stop: Further output of Java::excptn will be suppressed during this calculation. >>
LoadJavaClass::fail: Java failed to load class org.apache.hadoop.fs.Path. >>
General::stop: Further output of LoadJavaClass::fail will be suppressed during this calculation. >>
StringMatchQ::strse: String or list of strings expected at position 1 in StringMatchQ[OperatingSystem->Unix,*]. >>
Very strange error.
The whole package worked nicely until my cluster crashed. After reinstalling the OS and everything else, lots of errors emerged, including this one.
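The ClassNotFoundException messages above suggest the Hadoop client jars are no longer on the classpath that J/Link's JVM sees after the reinstall (this is a diagnosis, not confirmed by the source). A standalone probe like the following, run in the same JVM, shows which of the required classes are actually visible:

```java
// Classpath probe: report whether the Hadoop classes that HadoopLink
// needs can be loaded by the current JVM. If these print MISSING, the
// Hadoop jars need to be added back to the classpath.
public class ClasspathProbe {
    public static boolean isVisible(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "org.apache.hadoop.fs.FileSystem",
            "org.apache.hadoop.conf.Configuration",
            "org.apache.hadoop.fs.Path"
        };
        for (String c : required) {
            System.out.println(c + " -> " + (isVisible(c) ? "found" : "MISSING"));
        }
    }
}
```

From Mathematica, the equivalent check is to call AddToClassPath with the directory containing the Hadoop jars before loading HadoopLink, then retry LoadJavaClass["org.apache.hadoop.fs.FileSystem"].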
2016-02-11 19:35:06,117 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2016-02-11 19:35:06,184 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2016-02-11 19:35:06,208 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2016-02-11 19:35:06,208 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2016-02-11 19:35:06,231 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2016-02-11 19:35:06,233 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2016-02-11 19:35:06,233 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2016-02-11 19:35:06,726 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.lang.StringIndexOutOfBoundsException: String index out of range: -6
at java.lang.String.substring(String.java:1967)
at com.wolfram.hadoop.mapreduce.MapReduceKernelLink.loadPackageFromJar(Unknown Source)
at com.wolfram.hadoop.mapreduce.MapReduceKernelLink.get(Unknown Source)
at com.wolfram.hadoop.mapreduce.MathematicaMapper.setup(Unknown Source)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:518)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:303)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
2016-02-11 19:35:06,730 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
The crash is located in MapReduceKernelLink.loadPackageFromJar, here:
String packagePath = packageURL.getPath();
int n = packagePath.lastIndexOf("!");
String jarPath = packagePath.substring(5, n);
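The "-6" in the exception means lastIndexOf("!") returned -1 (no "!" in the path), so substring(5, -1) computes a negative length of -6. That happens whenever the package URL is not a jar: URL, e.g. when the package is loaded from a plain directory. A defensive rewrite (a sketch of a possible fix, not the project's actual patch) would guard both the "!" separator and the "file:" prefix:

```java
// Sketch of a guarded replacement for the jar-path extraction in
// loadPackageFromJar. The original code assumes the path always looks
// like "file:/path/to/HadoopLink.jar!/Package.m"; when the "!" is
// absent, lastIndexOf returns -1 and substring(5, -1) throws
// StringIndexOutOfBoundsException: -6.
public class JarPathFix {
    public static String jarPath(String packagePath) {
        int n = packagePath.lastIndexOf('!');
        if (n < 0) {
            // Not inside a jar (e.g. loaded from a directory): use the
            // path as-is instead of crashing.
            return packagePath;
        }
        // Strip the "file:" scheme only if it is actually present.
        int start = packagePath.startsWith("file:") ? 5 : 0;
        return packagePath.substring(start, n);
    }
}
```

With this guard, running the mapper from an exploded directory degrades gracefully instead of killing the child task.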
When using DFSImport[h_HadoopLink, file_String, "SequenceFile"], it reports:
cannot load java class: com.wolfram.hadoop.dfs.SequenceFileImportReader