hadooplink's Issues

Support input/output formats other than SequenceFile

Right now the input and output formats for HadoopLink MapReduce jobs are hard-coded as SequenceFileInputFormat and SequenceFileOutputFormat, respectively:

[paul-jean@paul-jeanmaclap: hadoop-link]$ grep "setInputFormatClass" src/java/com/wolfram/hadoop/mapreduce/MathematicaJob.java
    job.setInputFormatClass(SequenceFileInputFormat.class);
[paul-jean@paul-jeanmaclap: hadoop-link]$ grep "setOutputFormatClass" src/java/com/wolfram/hadoop/mapreduce/MathematicaJob.java
    job.setOutputFormatClass(SequenceFileOutputFormat.class);

It would be nice to be able to choose alternative input/output formats.

The cleanest way that comes to mind is to add options "InputFormat" and "OutputFormat" to the HadoopMapReduceJob function:

HadoopMapReduceJob[
 $$link,
 "blah blah",
 inputfile,
 outputdir,
 Function[{k,v}, ...],
 Function[{k,vs}, ...],
 "InputFormat" -> "Text",
 "OutputFormat" -> "SequenceFile" (* or Automatic *)
 ]

The reason I want this at the moment is so I can write a simple job that converts text files into SequenceFiles:

Clear[GenomeIndexMapper];
GenomeIndexMapper := Function[{k, v},
  Yield[k, v]
  ]

Clear[GenomeIndexReducer];
GenomeIndexReducer := Function[{k, vs},
  While[vs@hasNext[],
   Yield[k, vs@next[]]
   ]
  ]

With[{
  input = 
   DFSFileNames[$$link, "*.fa", 
    "/user/paul-jean/hadooplink/genomes/hs/"],
  out = "/user/paul-jean/hadooplink/genomes/test"},
 If[DFSFileExistsQ[$$link, out], DFSDeleteDirectory[$$link, out]];
 HadoopMapReduceJob[
  $$link,
  "test identity mappers/reducers",
  input,
  out,
  GenomeIndexMapper,
  GenomeIndexReducer,
  "InputFormat" -> "Text",
  "OutputFormat" -> "SequenceFile"
  ]
 ]
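
On the Java side, one way to back the proposed "InputFormat" and "OutputFormat" options would be to translate the short option string into a Hadoop format class name before the job is configured. The following is only a sketch under assumptions: FormatNameSketch and its methods are hypothetical and not part of HadoopLink, and how the option value would travel from Mathematica to MathematicaJob is not specified here. The class names in the tables are the standard formats from Hadoop's org.apache.hadoop.mapreduce API, and "Automatic" is assumed to fall back to the current SequenceFile behavior.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: maps option strings such as "Text" or "SequenceFile"
// to fully qualified Hadoop format class names. Not part of HadoopLink.
final class FormatNameSketch {
    private static final Map<String, String> INPUT = new HashMap<>();
    private static final Map<String, String> OUTPUT = new HashMap<>();

    static {
        INPUT.put("Text", "org.apache.hadoop.mapreduce.lib.input.TextInputFormat");
        INPUT.put("SequenceFile", "org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat");
        OUTPUT.put("Text", "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat");
        OUTPUT.put("SequenceFile", "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat");
    }

    static String inputFormatClassName(String option) {
        return resolve(INPUT, option);
    }

    static String outputFormatClassName(String option) {
        return resolve(OUTPUT, option);
    }

    // Assumption: a missing option or "Automatic" keeps today's SequenceFile default.
    private static String resolve(Map<String, String> table, String option) {
        String key = (option == null || option.equals("Automatic")) ? "SequenceFile" : option;
        String className = table.get(key);
        if (className == null) {
            throw new IllegalArgumentException("Unsupported format: " + option);
        }
        return className;
    }
}
```

In MathematicaJob, the resolved name could then be loaded with Class.forName and passed to job.setInputFormatClass or job.setOutputFormatClass in place of the hard-coded classes.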

Java exceptions

link = OpenHadoopLink[
   "fs.default.name" -> "hdfs://hadoopheadlx.wolfram.com:8020",
   "mapred.job.tracker" -> "hadoopheadlx.wolfram.com:8021"];

DFSFileNames[link]
Java::excptn: A Java exception occurred: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FileSystem
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:249).

LoadJavaClass::fail: Java failed to load class org.apache.hadoop.fs.FileSystem. >>

Java::excptn: A Java exception occurred: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:249).

LoadJavaClass::fail: Java failed to load class org.apache.hadoop.conf.Configuration. >>

KeepJavaObject::obj: At least one argument to KeepJavaObject was not a valid Java object. >>

Java::excptn: A Java exception occurred: java.lang.ClassNotFoundException: org.apache.hadoop.fs.Path
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:249).

General::stop: Further output of Java::excptn will be suppressed during this calculation. >>

LoadJavaClass::fail: Java failed to load class org.apache.hadoop.fs.Path. >>

General::stop: Further output of LoadJavaClass::fail will be suppressed during this calculation. >>

StringMatchQ::strse: String or list of strings expected at position 1 in StringMatchQ[OperatingSystem->Unix,*]. >>
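
Every trace above ends in Class.forName throwing ClassNotFoundException, which means the class loader used by the JVM that J/Link started could not find the Hadoop classes. A minimal Java sketch for checking which classes a JVM can see follows; ClasspathCheckSketch and missingClasses are hypothetical helpers for illustration, not part of HadoopLink or Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper: reports which of the given class names
// cannot be loaded by the class loader that loaded this class.
final class ClasspathCheckSketch {
    static List<String> missingClasses(String... classNames) {
        List<String> missing = new ArrayList<>();
        ClassLoader loader = ClasspathCheckSketch.class.getClassLoader();
        for (String name : classNames) {
            try {
                // initialize=false: only test visibility, do not run static initializers
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                missing.add(name);
            }
        }
        return missing;
    }
}
```

Called with the three names from the traces (org.apache.hadoop.fs.FileSystem, org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.Path) in a JVM using the same classpath, an empty result would indicate that the Hadoop jars are visible, and a non-empty one that they are not.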

Error at MapReduceKernelLink.java

Very strange error.

The whole package worked fine until my cluster crashed. After reinstalling the OS and everything else, many errors appeared, including this one.

2016-02-11 19:35:06,117 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2016-02-11 19:35:06,184 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2016-02-11 19:35:06,208 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2016-02-11 19:35:06,208 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2016-02-11 19:35:06,231 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2016-02-11 19:35:06,233 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2016-02-11 19:35:06,233 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2016-02-11 19:35:06,726 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.lang.StringIndexOutOfBoundsException: String index out of range: -6
    at java.lang.String.substring(String.java:1967)
    at com.wolfram.hadoop.mapreduce.MapReduceKernelLink.loadPackageFromJar(Unknown Source)
    at com.wolfram.hadoop.mapreduce.MapReduceKernelLink.get(Unknown Source)
    at com.wolfram.hadoop.mapreduce.MathematicaMapper.setup(Unknown Source)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:518)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:303)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
2016-02-11 19:35:06,730 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task

The exception is thrown from this code in MapReduceKernelLink.loadPackageFromJar:

String packagePath = packageURL.getPath();
int n = packagePath.lastIndexOf("!");
String jarPath = packagePath.substring(5, n);
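
The index -6 in the log is consistent with lastIndexOf("!") returning -1: on Java releases before 9, substring(5, -1) reports endIndex - beginIndex, which is -6. That would happen whenever the package URL path contains no "!", that is, when it does not look like a jar URL path of the form file:/path/to.jar!/entry. Why the URL has that form on the reinstalled cluster is not established here. A guarded version of the extraction could look like the sketch below; JarPathSketch and jarPathOf are hypothetical names, and the fixed offset 5 is assumed to skip a leading "file:" prefix.

```java
// Hypothetical helper: extracts the jar file path from a jar-URL path such as
// "file:/some/dir/lib.jar!/entry", without assuming the "!" is present.
final class JarPathSketch {
    private static final String FILE_PREFIX = "file:";

    /** Returns the path of the jar file, or null if the input has no "!" separator. */
    static String jarPathOf(String packagePath) {
        int n = packagePath.lastIndexOf('!');
        if (n < 0) {
            // Not a jar URL path; the caller decides how to handle a plain file path.
            return null;
        }
        // Skip the "file:" prefix only when it is actually there.
        int start = packagePath.startsWith(FILE_PREFIX) ? FILE_PREFIX.length() : 0;
        return packagePath.substring(start, n);
    }
}
```

With this guard, a path without "!" yields null instead of a StringIndexOutOfBoundsException, so the caller can log the offending URL or load the package from the plain path.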

cannot load java class

Calling DFSImport[h_HadoopLink, file_String, "SequenceFile"] reports:

cannot load java class: com.wolfram.hadoop.dfs.SequenceFileImportReader
