For ACOPF using ADMM, we may need to run hundreds of thousands of TRONs in parallel on

Improving performance on GPUs (3) - running hundreds of thousands of TRONs in parallel on GPUs (exatron.jl, closed)

exanauts commented on June 2, 2024

Improving performance on GPUs (3) - running hundreds of thousands of TRONs in parallel on GPUs

from exatron.jl.

Comments (4)

youngdae commented on June 2, 2024

Performance evaluation on a batch run of TRON (calling the dspcg() routine) looks promising. Times were averaged over 10 randomly generated problems of size n=8, where n is the number of variables. The CPU run used 70 threads and TronDenseMatrix. Each run would correspond to one Newton step.

Batch size	CPU	GPU	Ratio (CPU/GPU)
5,120	1.76739e-01	7.92999e-04	2.27874e+02
10,240	3.84099e-01	1.49459e-03	2.56993e+02
20,480	7.87158e-01	8.77802e-04	8.96737e+02

The GPU seems to scale quite well. Note that the ACOPF problem structure differs from these randomly generated problems, so the results could be different. Once I implement the evaluation routines for ADMM on the GPU, we will see how it performs.

youngdae commented on June 2, 2024

The numbers reported above were incorrect because I didn't synchronize the GPU run. A kernel runs asynchronously, so I should have used CUDA.@sync to measure its actual run time. After adding the CUDA.@sync macro, the gap shrank considerably, to a CPU/GPU ratio of roughly 13 to 26. Experimental settings were:

70 threads and TronDenseMatrix were used for the CPU run.
The number of variables is n=8.
TRON was run over 30 randomly generated QP problems, and its time was averaged.

Batch size	CPU	GPU	Ratio (CPU/GPU)
5,120	1.86585e-02	1.46632e-03	1.27247e+01
10,240	3.99126e-02	2.75390e-03	1.44931e+01
20,480	1.37620e-01	5.26920e-03	2.61179e+01
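
As a quick sanity check on the table above, the following minimal Python sketch recomputes the CPU/GPU ratio from the listed timings and also derives the average GPU time per problem in the batch. The timings are copied from the table; the helper names `speedup` and `gpu_time_per_problem` are made up for this illustration and are not part of ExaTron.jl.

```python
# Timings copied from the table above: batch size -> (CPU time, GPU time).
timings = {
    5120: (1.86585e-02, 1.46632e-03),
    10240: (3.99126e-02, 2.75390e-03),
    20480: (1.37620e-01, 5.26920e-03),
}


def speedup(cpu_time, gpu_time):
    """Return the CPU/GPU ratio; values above 1 mean the GPU was faster."""
    return cpu_time / gpu_time


def gpu_time_per_problem(gpu_time, batch_size):
    """Return the average GPU time attributed to one problem in the batch."""
    return gpu_time / batch_size


ratios = {b: speedup(c, g) for b, (c, g) in timings.items()}
per_problem = {b: gpu_time_per_problem(g, b) for b, (c, g) in timings.items()}

for batch in timings:
    print(f"{batch:>6}  ratio={ratios[batch]:.4f}  "
          f"GPU time/problem={per_problem[batch]:.3e}")
```

If the per-problem GPU time stays roughly flat across batch sizes, total GPU time is growing about linearly with the batch, which is the behavior one would expect once the measurement includes synchronization.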

GPU time makes more sense now: it increases with batch size. However, I wonder why CPU time dropped compared to the previous results. Did I forget to set JULIA_NUM_THREADS? ...
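
The measurement mistake described in this comment (timing an asynchronous launch without waiting for it to finish) can be illustrated with a small analogy. The sketch below is not CUDA.jl code: it uses a Python thread pool as a stand-in for the GPU, and `fake_kernel`, `time_launch_only`, and `time_with_sync` are hypothetical names invented for the illustration. The blocking `future.result()` call plays the role that a synchronization point such as CUDA.@sync plays in the real benchmark.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_kernel(duration):
    """Stand-in for a GPU kernel: simply sleeps for `duration` seconds."""
    time.sleep(duration)
    return duration


def time_launch_only(executor, duration):
    """Time only the asynchronous launch (the measurement mistake)."""
    start = time.perf_counter()
    future = executor.submit(fake_kernel, duration)  # returns immediately
    elapsed = time.perf_counter() - start
    future.result()  # drain the work so it does not leak into later timings
    return elapsed


def time_with_sync(executor, duration):
    """Time the launch plus a wait for completion (the corrected measurement)."""
    start = time.perf_counter()
    future = executor.submit(fake_kernel, duration)
    future.result()  # block until the work is done, like a synchronization point
    return time.perf_counter() - start


with ThreadPoolExecutor(max_workers=1) as pool:
    unsynced = time_launch_only(pool, 0.05)
    synced = time_with_sync(pool, 0.05)

print(f"launch only: {unsynced:.6f} s, with sync: {synced:.6f} s")
```

The launch-only timing reflects just the cost of handing the work off, so it can make the asynchronous side look far faster than it is. Only the synchronized timing covers the actual work.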

youngdae commented on June 2, 2024

Here are more intermediate runtime results (in seconds) on case9241pegase for a direct GPU implementation of the function evaluation and generator/bus/rho update routines. 70 threads were used for the CPU run. These routines are expected to take a much smaller portion of the overall runtime than the branch solve. However, given the large number of GPU threads, occupancy doesn't seem good. We may need to revisit this at some point.

Component	CPU (s)	GPU (s)	Ratio (CPU/GPU)	# GPU threads
Generator	0.00004	0.00012	3.32e-01	1,472
Bus	0.00083	0.00031	2.66e+00	9,248
Function evaluation (branch)	0.00016	0.00011	1.44e+00	16,064
Gradient evaluation (branch)	0.00025	0.00011	2.36e+00	16,064
Hessian evaluation (branch)	0.00019	0.00011	1.73e+00	16,064
Rho update	0.00068	0.00015	4.52e+00	99,200

youngdae commented on June 2, 2024

The above might not belong in this issue, since it is about the ADMM implementation.
