
Hierarchical Distributed Matrix

License: Other


HDM

Build Status

HDM (Hierarchical Distributed Matrix) is a lightweight, optimized, functional framework for data processing and analytics on large-scale data sets.

Build HDM from source code

HDM is built using Apache Maven. To build HDM from source, go to the hdm-core directory and run:

mvn clean install -Dmaven.test.skip=true

Then unzip hdm-engine.zip from the target folder; the unpacked directory contains all the components required to run HDM.

Quick-start an HDM cluster

Start master

Users can start the master node of HDM by executing the shell command ./hdm-deamon.sh start master [parameters] under the root folder of the unpacked hdm-engine, for example:

cd ./hdm-engine
./hdm-deamon.sh start master -p 8998 -h 127.0.1.1

The command above starts a master listening on port 8998 with host address 127.0.1.1.

Start a slave

Users can start a slave node by executing the shell command ./hdm-deamon.sh start slave -p [port of slave] -P [address of the master] -s [number of cores] -mem [amount of JVM memory] -b [port for data transfer], for example:

cd ./hdm-engine
./hdm-deamon.sh start slave -p 10001 -P 172.18.0.1:8998 -s 4 -mem 1G -b 9001

Submit dependencies to the server

Before starting to execute a job, users need to submit the dependencies of an HDM application by executing the shell command: ./hdm-client.sh submit [master url] [application ID] [application version] [author] [dependency file]

./hdm-client.sh submit "akka.tcp://[email protected]:8998/user/smsMaster/ClusterExecutor" \
  "defaultApp" \
  "0.0.1" \
  dwu \
  "/home/ubuntu/dev/workspace/hdm/hdm-benchmark/target/hdm-benchmark-0.0.1.jar"

An application may need multiple dependency libraries; users can add more dependencies to an HDM application by executing:

./hdm-client.sh addDep "akka.tcp://[email protected]:8998/user/smsMaster/ClusterResourceLeader" \
  "defaultApp" \
  "0.0.1" \
  dwu \
  "/home/ubuntu/lib/hdm-benchmark/lib/jna-4.0.0.jar"

Once all the dependencies have been submitted to the cluster, the HDM application can be executed on the cluster without re-submitting the libraries, until users change the code.

Start HDM console

Users can start the HDM console by deploying the hdm-console-0.0.1.war file of hdm-console to any web server such as Apache Tomcat or Jetty.

Programming in HDM

Use Maven to manage dependencies

To program with the HDM APIs, users just need to add hdm-core to their project:

Add dependency using maven:

    <dependency>
      <groupId>org.hdm</groupId>
      <artifactId>hdm-core</artifactId>
      <version>${hdm.version}</version>
    </dependency>

HDM provides pure functional programming interfaces for users to write their data-oriented applications. In HDM, functions are first-class citizens; operations are just wrappers around primitive functions.

Primitives

The basic functions of HDM are listed below:

| Function | Parameters | Description |
| --- | --- | --- |
| NullFunc | N/A | An empty function, which does nothing to the input. |
| Map | f: T -> R | Applies the transformation function f to each record of the input to generate the output. |
| GroupBy | f: T -> K | Applies f to obtain the key K of each record and groups the input records by identical keys. |
| FindBy | f: T -> Bool | Applies the match function to filter out a subset of the input. |
| Reduce | f: (T, T) -> T | Applies the reduce function to aggregate the records by folding them pair by pair. |
| Sort | f: (T, T) -> Bool | Sorts the input by applying the comparison function. |
| Flatten | N/A | Transforms a nested collection of inputs into a non-nested, flat collection. |
| CoGroup | f1: T1 -> K, f2: T2 -> K | Groups two input data sets by matching the group keys computed by the two parameter functions; for each input, the records with the same key are organized into a collection. |
| JoinBy | f1: T1 -> K, f2: T2 -> K | Joins two input data sets by matching the group keys computed by the two parameter functions; for each input, the records with the same key are flattened into tuples. |

Apart from the basic primitives, HDM also provides derived functions based on the data type. For example, for key-value records, HDM provides derived functions such as reduceByKey, findByKey, findValues and mapKeys.
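To illustrate the semantics of two of these derived functions, here is a sketch over plain in-memory Scala collections (illustrative only; HDM's own reduceByKey and mapKeys execute distributed over partitioned data):

```scala
// Plain-Scala sketch of the key-value derived functions' semantics.
object KeyValueSemantics {
  // reduceByKey: group records by key, then fold each group's values pairwise
  def reduceByKey[K, V](data: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    data.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  // mapKeys: transform the key of each record, leaving values untouched
  def mapKeys[K, K2, V](data: Seq[(K, V)])(f: K => K2): Seq[(K2, V)] =
    data.map { case (k, v) => (f(k), v) }
}
```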

Actions

HDM is designed to interact with general programs at runtime; it delays computation until a data-collecting action is triggered. HDM contains a few actions that can trigger the execution of HDM applications on either a remote or a local HDM cluster. Each action has a specific return type for a different purpose:

| Action | Return Type | Description |
| --- | --- | --- |
| compute | HDMRefs | References to the HDMs representing the meta-data (location, data type, size) of the computed output. |
| sample | Iterator | An iterator object that can retrieve a subset of sampled records from the computed output. |
| count | Long | Returns the number of records in the computed output. |
| traverse | Iterator | An iterator object that can retrieve all the records, one by one, from the computed output. |
| trace | ExecutionTrace | Returns a collection of execution traces for the last execution of the application. |

Interaction Pattern

For performance reasons, after the computation is triggered, the client-side program obtains the results asynchronously.

For example, a user can print out the results of the WordCount program using the code below:

wordcount.traverse(context = "10.10.0.100:8999") onComplete {
  case Success(resp) => resp.foreach(println)
  case other => // do nothing
}

Examples

To better illustrate how to program in HDM, here are some examples as below:

WordCount

val wordcount = HDM("/path/data").flatMap(_.split(","))
            .map(w => (w, 1))
            .reduceByKey(_ + _)
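For reference, the word-count semantics (split each line on commas, pair every word with 1, sum per word) can be sketched over an in-memory collection in plain Scala, without HDM:

```scala
object WordCountLocal {
  // split each line on commas, pair every word with 1, then sum per word
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split(","))
      .map(w => (w, 1))
      .groupBy(_._1)
      .map { case (w, ps) => w -> ps.map(_._2).sum }
}
```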

TopK

val k = 100
val topK = HDM(path).map{w => w.split(",")}
           .map{arr => (arr(0).toFloat, arr)}
           .top(k)
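The top-k semantics can likewise be sketched in plain Scala (illustrative only; HDM's top runs distributed, and sortBy/take here is a simple stand-in for it):

```scala
object TopKLocal {
  // parse each line, key it by its numeric first field, keep the k largest keys
  def topK(lines: Seq[String], k: Int): Seq[Float] =
    lines.map(_.split(","))
      .map(arr => (arr(0).toFloat, arr))
      .sortBy(-_._1)
      .take(k)
      .map(_._1)
}
```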

LinearRegression

val input = HDM("hdfs://127.0.0.1:9001/user/data")
val training = input.map(line => line.split("\\s+"))
               .map { arr =>
                  val vec = Vector(arr.drop(1).map(_.toDouble))
                  DataPoint(vec, arr(0).toDouble)
               }
val weights = DenseVector.fill(10){0.1 * Random.nextDouble}

for (i <- 1 to iteration){
  val w = weights
  val grad = training.map{ p =>
    p.x * (1 / (1 + exp(-p.y * (w.dot(p.x)))) - 1) * p.y
  }.reduce(_ + _).collect().next()
  weights -= grad
}
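The gradient computed inside the loop can be sketched in plain Scala over an in-memory training set (a stand-in for the distributed map/reduce above; DataPoint and dot are re-implemented here with plain arrays rather than the vector library the example assumes):

```scala
object LogisticGradientLocal {
  case class DataPoint(x: Array[Double], y: Double)

  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // one logistic-gradient computation, mirroring the map/reduce in the loop above
  def gradient(training: Seq[DataPoint], w: Array[Double]): Array[Double] =
    training.map { p =>
      val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
      p.x.map(_ * scale)
    }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
}
```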

hdm's People

Contributors: dwu-nicta, tiantianwdy

hdm's Issues

Min-min Scheduling: empty.head exception when the number of slots is less than the degree of parallelism

In file: MinExecutionScheduling

def findNextMatching(completionMatrix: mutable.HashMap[String, Vector[Double]]): (String, Int) = {
  //find the minimum execution time of each task
  val minTimeOnRes = completionMatrix.map { v =>
    v._1 -> SchedulingUtils.minWithIndex(v._2)
  }.toVector
  log.trace("min execution time vector:")
  // minTimeOnRes foreach (println(_))
  //find the minimum value among the minimum execution time vector
  val comp = (d1: (String, (Int, Double)), d2: (String, (Int, Double))) => comparison(d1._2._2, d2._2._2)
  val minTimeOfTask = SchedulingUtils.minObjectsWithIndex(minTimeOnRes, comp)
  log.debug(s"find min task:${minTimeOfTask._2} with expected execution time ${minTimeOfTask._2._2._2}")
  val idx = minTimeOfTask._1
  // (taskId, resourceIdx)
  (minTimeOnRes(idx)._1, minTimeOfTask._2._2._1)
}

The code above throws an empty.head exception at the line `v._1 -> SchedulingUtils.minWithIndex(v._2)`.
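One possible defensive fix is to skip resources whose completion-time vector is empty before computing the per-resource minima. A sketch (minWithIndex is re-implemented here as a stand-in, since SchedulingUtils is not shown):

```scala
import scala.collection.mutable

object SafeMin {
  // stand-in for SchedulingUtils.minWithIndex: returns (index, minimum value)
  def minWithIndex(v: Vector[Double]): (Int, Double) = {
    val (d, i) = v.zipWithIndex.minBy(_._1)
    (i, d)
  }

  // skip resources with an empty completion-time vector, so minWithIndex
  // is never called on an empty Vector
  def perResourceMinima(m: mutable.HashMap[String, Vector[Double]]): Vector[(String, (Int, Double))] =
    m.collect { case (res, times) if times.nonEmpty => res -> minWithIndex(times) }.toVector
}
```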
