The ann-benchmark from avulanov

ann-benchmark's Issues

ann benchmark sampling logic

Hi,

I'm reading the Spark version of your ann-benchmark. In the following code, shouldn't the sampling be done separately for every node? It seems the sample is drawn only once, so every node shares the same sample data.

val sample = train.sample(true, 1.0 / i, 11L).collect
val parallelData = dataPartitions.flatMap(x => sample)
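To make the difference concrete, here is a minimal sketch in plain Scala (no Spark) that contrasts one shared sample with one independent sample per partition. The names `sampleWithReplacement`, `numPartitions`, and `k` are hypothetical, and treating `i` from the snippet above as the partition count is an assumption. In Spark, the per-partition variant would presumably be written with `mapPartitionsWithIndex`, seeding each partition differently.

```scala
import scala.util.Random

object PerPartitionSampling {
  // Draw k elements with replacement from data, using a seeded generator.
  def sampleWithReplacement(data: Vector[Int], k: Int, seed: Long): Vector[Int] = {
    val rng = new Random(seed)
    Vector.fill(k)(data(rng.nextInt(data.length)))
  }

  def main(args: Array[String]): Unit = {
    val train = (1 to 1000).toVector
    val numPartitions = 4                  // hypothetical partition count
    val k = train.length / numPartitions   // mirrors the 1.0 / i fraction

    // Shared sample: drawn once with seed 11, then copied to every partition.
    val shared = sampleWithReplacement(train, k, 11L)
    val sharedPerPartition = Vector.fill(numPartitions)(shared)

    // Independent samples: one draw per partition, seed offset by partition index.
    val independent =
      Vector.tabulate(numPartitions)(p => sampleWithReplacement(train, k, 11L + p))

    println(s"distinct shared samples: ${sharedPerPartition.distinct.size}")
    println(s"distinct independent samples: ${independent.distinct.size}")
  }
}
```

Whether the shared sample is a problem depends on what the benchmark measures: if only throughput matters, identical data on every node may be acceptable, while independent draws matter when the statistics of the data affect the result.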

OpenBLAS threads

Hi Alex,

I am trying to reproduce the benchmark results and have a quick question: how many OpenBLAS threads did you use, and what runtime scalability did you get? I expect that with N threads the compute runtime should improve from M secs to M/N/2 secs.

Here is what I am trying:

I have 20 nodes and 16 cores on each node.

SparkContext: 20 nodes, 16 cores, sc.defaultParallelism 320

def gramSize(n: Int) = (n*n+1)/2

// I have not used saxpy from f2jBLAS or NativeBLAS yet, but it will be used here for comparison.
// I am not sure whether f2jBLAS can run on multiple threads, but OpenBLAS should run fine.

val combOp = (v1: Array[Float], v2: Array[Float]) => {
  var i = 0
  while (i < v1.length) {
    v1(i) += v2(i)
    i += 1
  }
  v1
}

val n = gramSize(4096)
// One zero-filled array of length n per element (n was otherwise unused)
val vv = sc.parallelize(0 until sc.defaultParallelism).map(i => Array.fill[Float](n)(0f))
vv.persist
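For a sense of scale, the arithmetic behind the setup can be worked out in a few lines. This is a sketch that assumes each array holds `gramSize(4096)` floats and that there is one array per partition; `packedTriangleSize` is a hypothetical helper added only for comparison, since the upper triangle of an n x n matrix (diagonal included) has n(n+1)/2 entries, slightly more than the expression as posted.

```scala
object GramSizeArithmetic {
  // Expression as posted in the issue (integer division).
  def gramSize(n: Int): Int = (n * n + 1) / 2

  // Number of entries in the upper triangle of an n x n matrix, diagonal included.
  def packedTriangleSize(n: Int): Int = n * (n + 1) / 2

  def main(args: Array[String]): Unit = {
    val n = 4096
    val partitions = 320                    // sc.defaultParallelism in the post
    val floats = gramSize(n)
    val bytesPerArray = floats.toLong * 4L  // 4 bytes per Float

    println(s"gramSize($n) = $floats")
    println(s"packed triangle size = ${packedTriangleSize(n)}")
    println(s"MiB per array = ${bytesPerArray / (1024 * 1024)}")
    println(s"GiB over $partitions arrays = ${bytesPerArray * partitions / (1024.0 * 1024 * 1024)}")
  }
}
```

Under these assumptions each array is about 32 MiB, so the reduce has to combine roughly 10 GiB in total when there are 320 of them.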

Option 1: 320 partitions, 1 thread on combOp per partition

val start = System.nanoTime();
vv.treeReduce(combOp, 2);
val reduceTime = (System.nanoTime() - start)*1e-9
reduceTime: Double = 5.6390302430000006

Option 2: 20 partitions, 1 thread on combOp per partition

val coalescedvv = vv.coalesce(20)
coalescedvv.count

val start = System.nanoTime();
coalescedvv.treeReduce(combOp, 2);
val reduceTime = (System.nanoTime() - start)*1e-9
reduceTime: Double = 3.9140685640000004
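The effect of the partition count on the number of combine rounds can be illustrated locally. The following is a rough sketch in plain Scala that reduces arrays pairwise with the same `combOp`; it is not Spark's `treeReduce` implementation, and `pairwiseReduce` together with the tiny array length of 8 are assumptions made only for illustration.

```scala
object LocalTreeReduce {
  // Same element-wise addition as combOp in the post; it mutates and returns v1.
  val combOp: (Array[Float], Array[Float]) => Array[Float] = (v1, v2) => {
    var i = 0
    while (i < v1.length) {
      v1(i) += v2(i)
      i += 1
    }
    v1
  }

  // Combine neighbours pairwise until one array remains.
  // Returns the final array and the number of rounds that were needed.
  def pairwiseReduce(parts: Vector[Array[Float]]): (Array[Float], Int) = {
    require(parts.nonEmpty, "need at least one array")
    var current = parts
    var rounds = 0
    while (current.length > 1) {
      current = current.grouped(2).map {
        case Vector(a, b) => combOp(a, b)
        case Vector(a)    => a // odd one out is carried to the next round
      }.toVector
      rounds += 1
    }
    (current.head, rounds)
  }

  def main(args: Array[String]): Unit = {
    for (numParts <- Seq(320, 20)) {
      val parts = Vector.fill(numParts)(Array.fill[Float](8)(1f))
      val start = System.nanoTime()
      val (sum, rounds) = pairwiseReduce(parts)
      val seconds = (System.nanoTime() - start) * 1e-9
      println(f"$numParts arrays: $rounds rounds, sum(0) = ${sum(0)}, $seconds%.6f s")
    }
  }
}
```

Because `combOp` writes into its first argument, the input arrays are modified during the reduction. On a cluster, each round also involves moving arrays between executors, which this local sketch does not model.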

Option 3: 20 partitions, OpenBLAS numThread=16 per partition

I am still setting up OpenBLAS on the cluster and will update soon.

Let me know your thoughts. I think that if the underlying operations are dense BLAS level 1, level 2, or level 3, running with more OpenBLAS threads and fewer partitions should help reduce cross-partition shuffle.
