
Add vashishta potential (gpumd issue, 33 comments, closed)

brucefan1983 commented on August 11, 2024
Add vashishta potential


Comments (33)

brucefan1983 commented on August 11, 2024

Thank you for the links. I have started coding and may get it done soon. You are correct that it's only a little bit more complicated than SW. I found the 2007 JAP paper more accurate than the manual (your first link) in describing the dimensions of the parameters H, D, W, and gamma, so I will just follow the paper.
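For reference, the two-body part in that formulation has four terms: steric repulsion, screened Coulomb, screened charge-dipole, and van der Waals. Below is a minimal sketch of how such a term could be evaluated in a device function; the struct layout and parameter names are placeholders rather than the actual GPUMD data structures, and the shift applied at the cutoff in the full formulation is omitted.

```
// Sketch only: names and layout are placeholders, not the actual GPUMD code.
struct Vashishta2Body
{
    double H, eta;     // steric-repulsion strength and exponent
    double Zi, Zj;     // effective charges (screened Coulomb term)
    double lambda1;    // screening length of the Coulomb term
    double D, lambda4; // charge-dipole strength and screening length
    double W;          // van der Waals strength
};

// V2(r) = H/r^eta + Zi*Zj*exp(-r/lambda1)/r - D*exp(-r/lambda4)/r^4 - W/r^6
__device__ double vashishta_two_body(const Vashishta2Body p, double r)
{
    double inv_r  = 1.0 / r;
    double inv_r4 = inv_r * inv_r * inv_r * inv_r;
    return p.H * pow(inv_r, p.eta)
         + p.Zi * p.Zj * inv_r * exp(-r / p.lambda1)
         - p.D * inv_r4 * exp(-r / p.lambda4)
         - p.W * inv_r4 * inv_r * inv_r;
}
```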

brucefan1983 commented on August 11, 2024

Hi andeplane,

Thanks for suggesting this. I just noticed your comments today. The trick you mentioned is very important. I think I may need to use two neighbor lists for this potential (or at least a two-level neighbor list): one with a shorter cutoff for the 3-body part and one with a larger cutoff for the 2-body part. This requires redesigning some parts of GPUMD and will take some time. What materials are you studying?
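For illustration, the shorter list could be obtained by filtering the longer list once per neighbor-list rebuild. A rough sketch (all array names and the storage layout are placeholders, not the actual GPUMD code):

```
// Sketch only: for each atom, keep only the neighbors of the long-cutoff
// (two-body) list that also fall within the shorter three-body cutoff.
__global__ void build_three_body_list(
    int N, double rc3_sq,
    const int* NN2, const int* NL2,   // long-cutoff list (assumed column-major)
    int* NN3, int* NL3,               // short-cutoff list to be built
    const double* x, const double* y, const double* z)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    int count = 0;
    for (int k = 0; k < NN2[i]; ++k)
    {
        int j = NL2[k * N + i];
        double dx = x[j] - x[i], dy = y[j] - y[i], dz = z[j] - z[i];
        // (periodic boundary corrections omitted in this sketch)
        if (dx * dx + dy * dy + dz * dz < rc3_sq)
        {
            NL3[count * N + i] = j;
            ++count;
        }
    }
    NN3[i] = count;
}
```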

Best,
Zheyong

andeplane commented on August 11, 2024

Thanks for your response. I'm currently working on silicon carbide, but I've also applied the potential to nanoporous amorphous silica with and without water. Even without that trick, I believe that GPUMD may be faster than LAMMPS for smaller systems.

Would you be able to help me with a simple implementation so I could test this? For you it could be a matter of an hour or two since SW is already there.

brucefan1983 commented on August 11, 2024

Yes, I am interested in adding this potential to GPUMD. I cannot promise when it will be done, as the holidays are coming, but I will do it as soon as possible. Do you think it is enough to read this paper:

P. Vashishta, R. K. Kalia, A. Nakano, J. P. Rino. J. Appl. Phys. 101, 103515 (2007)

andeplane commented on August 11, 2024

That should be sufficient. The modern formulation is the one mentioned in the LAMMPS docs:
http://lammps.sandia.gov/doc/pair_vashishta.html (probably the same as in that paper).

There is also our reference implementation, which may be good to have: https://github.com/lammps/lammps/blob/master/src/MANYBODY/pair_vashishta.cpp

Looking forward to it :) I'm happy to answer any questions you might have.

brucefan1983 commented on August 11, 2024

Hi andeplane,

I have finished coding. How would you like to get the code? By email? I am not familiar with GitHub and have never tried to use the provided version control tools. Previously, when I updated the code, I simply uploaded the modified files from my computer. However, I have not fully tested this Vashishta potential and don't want to upload the files yet. Perhaps you know how to do this professionally?

brucefan1983 commented on August 11, 2024

From my test, the 3-body part only takes about 10% of the whole computation, and the Vashishta potential is about 10 times more time-consuming than the SW potential. With 8000 atoms, the speed is only about 2x10^6 atom*step/second using a Tesla K40. What performance do you get from LAMMPS in your runs?

brucefan1983 commented on August 11, 2024

After doing some tests, I believe my implementation is correct, so I have updated the code. I also sent an email to your Gmail with an example attached.

andeplane commented on August 11, 2024

Fantastic! I will test this sometime over New Year's. What system did you test? I haven't tried running the system that is in the paper.

If 3-body is only 10%, that is great. Typically it should be even less I think, but that depends on the system of course.

I'll give you benchmarks from LAMMPS later. Fantastic work!

brucefan1983 commented on August 11, 2024

I have tested SiC in the zincblende structure with 8000 atoms in total. With this number of atoms, the performance of the K40 is of course not saturated; it may reach 5x10^6 atom*step/second for larger systems. NPT ensemble at 300 K and 0 GPa. Neighbor list with a skin of 1 Å, updated when needed. Double precision. For your convenience, here are the new files related to this potential:

  1. src/
    --vashishta.cu and vashishta.h: the major source codes for this potential
    --A function in potential.cu which is used to read in parameters for this potential
    --A Vashishta structure defined in common.h
    --An added "case" in the "switch-case" construction in force.cu

  2. potentials/sw/
    A potential file for SiC: sic_vashishta_2007.txt

  3. doc/
    Section 4.7 of the manual contains the formulas and conventions I used.

From these, you can understand what needs to be done to add a new potential model to GPUMD. It took me two days of hard work (not two hours). Looking forward to your test results. When you have time, we can compare the forces computed by GPUMD and LAMMPS on the same structure (with some randomness). I hope the forces are the same.
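As a rough illustration of the added "case" mentioned under src/ above, the dispatch in force.cu conceptually gains one more branch along these lines (the enum values and the launcher name are placeholders, not the actual identifiers):

```
// Sketch only: names are placeholders, not the actual force.cu source.
enum Potential_Model { SW_1985, TERSOFF_1989, VASHISHTA_2007 };

void find_force(Potential_Model model)
{
    switch (model)
    {
        // ... existing cases for the other potentials ...
        case VASHISHTA_2007:
            // launch the force-evaluation kernels defined in vashishta.cu,
            // e.g. find_force_vashishta(parameters, gpu_data);
            break;
        default:
            break;
    }
}
```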

andeplane commented on August 11, 2024

Ok, interesting. In LAMMPS with the GPU package on a P100, I get 13.5e6 atom*step/second for a SiC nanoparticle with 9000 atoms.

Thanks a lot for your effort, I'll get back to you within a week or so :)

brucefan1983 commented on August 11, 2024

Oh, I have not tested with a P100. The speed of the LAMMPS version sounds very fast. Is that the version on your GitHub homepage?

andeplane commented on August 11, 2024

Hmm, your SW benchmarks etc. seem MUCH faster than LAMMPS, and I think that there is an enormous overhead in kernel execution in LAMMPS. So I think it should be possible to achieve ~50 million atom*step/second on a P100 with Vashishta.

The implementation in LAMMPS is here: https://github.com/lammps/lammps/tree/master/lib/gpu
(see vashishta.cu).

brucefan1983 commented on August 11, 2024

I don't know how much faster a P100 is than a K40, but it seems that 50 million atom*step/second is hard to achieve for this potential even on a P100. I have just tested that, using single precision, I can get 5 million atom*step/second on the K40. Not very impressive. Perhaps you can test GPUMD on a P100?

andeplane commented on August 11, 2024

Hmm, interesting. A P100 should be ~2x faster in single precision and 3x in double precision compared to a K40. I will test GPUMD on a P100 (I have done so for SW previously) when I'm back at work in January =)

andeplane commented on August 11, 2024

Single precision on P100:
INFO: Speed of this run = 1.50843e+07 atom*step/second.

With LAMMPS (also SP) I get 7.3e+06, so GPUMD is 2x as fast.

andeplane commented on August 11, 2024

By using 30x30x30 unit cells (512k atoms) and a 7.35 Å cutoff, I get 2.37544e+07 atom*step/second with GPUMD.

brucefan1983 commented on August 11, 2024

I was travelling during the last few days. Are you satisfied with the current performance of GPUMD for the Vashishta potential, or do you think there is still room for improvement? As this potential is much more expensive than the SW potential (perhaps 10 times more), the overhead in LAMMPS you mentioned might not be as important as in the case of the SW potential. Do you want to use GPUMD to do simulations with this potential? If so, you might need to first validate the implementation by comparing forces directly with LAMMPS. Also note that the functionality of GPUMD is very limited. I am still experimenting with some new features, such as the MTK integrator for the NPT ensemble.

andeplane commented on August 11, 2024

I think there is room for improvement, but I'm not sure how to figure that out, hehe. How do you typically profile such applications?

I might use GPUMD with it and would of course compare forces and results carefully before any production run. Most of the GPU simulations I've done so far are just to reach long times in NVT. GPUMD would be sufficient for this =)

One reason why I believe there is more to gain (but I'm not sure) is that the LAMMPS GPU implementation is still very sensitive to CPU speed. I get a rather large increase in speed when I simulate on a system with a powerful CPU.

brucefan1983 commented on August 11, 2024

I usually first check whether the number of registers in the force evaluation kernels can be reduced below certain critical values; I may have used too many registers, such that the occupancy is not optimal. The parameter BLOCK_SIZE_VASHISHTA was set to 64 (any multiple of 32 is allowed, as there is no binary reduction in the force evaluation kernels) and it may not be optimal. You can also try to switch on -DUSE_LDG (check line 75 in common.h) in the makefile. This might result in a few percent performance gain in some cases (but sometimes it does not, in my experience). You may have noticed that I have already avoided using the expensive pow() function, as all the eta parameters in the papers I have read are integers.
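For illustration, the pow() replacement is essentially of the following form (written here as a stand-alone helper; the actual kernels may organize it differently):

```
// Sketch only: since the eta exponents in the published parameter sets are
// small integers, r^(-eta) can be built up by repeated multiplication
// instead of calling pow().
__device__ double inv_power_int(double r, int eta)
{
    double inv_r = 1.0 / r;
    double result = 1.0;
    for (int n = 0; n < eta; ++n) { result *= inv_r; }
    return result; // equals pow(r, -eta) for integer eta >= 0
}
```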

You can also time the individual force evaluation kernels in GPUMD and LAMMPS; the simple tool nvprof is enough for doing this. Based on your descriptions, the performance difference between GPUMD and LAMMPS running with this potential might result from the CPU part. In GPUMD, nearly all the calculations are done on the GPU; only the initialization is done on the CPU.

andeplane commented on August 11, 2024

Yes, I did see that the pow() is gone (I did the same trick in one of my implementations for LAMMPS), very nice! I will check with nvprof on both LAMMPS and GPUMD.

By the way, with my i7-6950X CPU I get 20e6 atom*step/second (14400 atoms) with pure silicon (diamond structure) and Stillinger-Weber on 10 cores. This is more than LAMMPS can do with a P100 GPU. So when a CPU is able to do that, I'm pretty sure GPUs can do more :D

brucefan1983 commented on August 11, 2024

One way to further accelerate this potential is to tabulate the two-body part (using, e.g., cubic splines), because the two-body part is the hot spot (perhaps taking up ~90% of the whole computation time). Recently, I have done some experiments on the spline-based EAM potential and gained some experience. I will try this when I have time. Have you checked whether the current implementation is consistent with LAMMPS?

andeplane commented on August 11, 2024

Hi, I was on the run when I saw this and totally forgot to answer, sorry about that.

That's very cool! I have tested a simple linear interpolation in the CPU version (vashishta/table in LAMMPS), which gives a nice speedup. I assume we'd also get nice speedups on the GPU, although memory reads could be more costly there?

I haven't compared with LAMMPS yet, but nothing I've seen so far scares me :)

brucefan1983 commented on August 11, 2024

Hi, I have tried the linear interpolation as you did in pair_vashishta_table.cpp for LAMMPS. Unlike the 3-5X speedup for the CPU version (reported in the LAMMPS manual), I only got at most a 2X improvement for my GPU version. The performance depends on the table length N. From my test, using __ldg() improves the performance a lot, so it's better to compile with -DUSE_LDG when using this potential. Perhaps you can try to optimize it further?

brucefan1983 commented on August 11, 2024

I have uploaded files related to the tabulated Vashishta potential. Using a table length of N=20000, there is about a 2X speedup. The relative difference of the force compared to the analytical case is of the order of 1.0e-5.

If you want to do some tests, you can start with examples/ex5, where this tabulated potential is used to calculate the phonon density of states of beta-SiC. It takes about 6 min on a K40.

brucefan1983 commented on August 11, 2024

I noticed that you made the linear interpolation in terms of r^2 instead of r. I guess the purpose was to avoid computing r = sqrt(r^2)? It seems that this extra computation does not cost much in the GPU implementation (where memory reads consume more resources). I changed to working with r instead of r^2 in GPUMD, and it turned out that the accuracy of the linear interpolation improved a lot (the force errors are reduced by a factor of 5).
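A minimal sketch of such a lookup indexed by r (the uniform table layout and the names are only illustrative, not the exact GPUMD code):

```
// Sketch only: linear interpolation of a tabulated two-body function.
// The table holds N uniformly spaced samples between r_min and r_max.
__device__ double lookup_table(const double* table, int N,
                               double r_min, double r_max, double r)
{
    double dr = (r_max - r_min) / (N - 1);
    double s = (r - r_min) / dr;        // fractional table index
    int k = (int)s;
    if (k < 0) { k = 0; }
    if (k > N - 2) { k = N - 2; }
    double w = s - k;                   // weight of the right-hand sample
    return (1.0 - w) * table[k] + w * table[k + 1];
}
```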

brucefan1983 commented on August 11, 2024

Hi,

I have further improved the performance of the analytical and tabulated Vashishta potentials. I think I have tried my best to optimize them.

andeplane commented on August 11, 2024

Very cool! I'm currently busy finishing up some papers, so I won't have time to look at this yet, but I definitely will! I'll run the benchmarks soon anyway. Thanks, and awesome work :)

andeplane commented on August 11, 2024

And yeah, I made it linear in rsq because of the square root. On GPUs the square root is more or less free!

Btw, what memory layout did you use for the tabulation? I know that storing it as a texture is much faster due to the random access pattern. This is how positions are stored in the GPU and KOKKOS packages.

brucefan1983 commented on August 11, 2024

There is a compile-time option -DUSE_LDG in GPUMD.

When this is on, some global memory accesses use the __ldg() intrinsic, which behaves similarly to a texture read. I have used this for the tabulated data. It is true that for this potential, using __ldg() is much faster.
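Conceptually, the option works along these lines (the macro below is only a sketch of the idea; the actual code may differ):

```
// Sketch only: route read-only loads (such as the tabulated data) through
// __ldg(), which uses the texture-like read-only cache path, when the
// -DUSE_LDG option is enabled at compile time.
#ifdef USE_LDG
    #define LDG(array, index) __ldg(&(array)[index])
#else
    #define LDG(array, index) (array)[index]
#endif
```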

However, using __ldg() makes some other short-ranged many-body potentials (such as Tersoff) somewhat slower. I may perform more thorough tests in the future and make the best choices for the users automatically.

BTW, I have checked (using some initial structures with some randomness in the positions) that the forces computed by GPUMD are consistent with those from LAMMPS. The agreement is at the level of 1.0e-4, and the difference should be caused by things like unit conversions with slightly different physical constants, different orders of operations, etc.

andeplane commented on August 11, 2024

Ahh, you mentioned that above. I just tried single precision with USE_LDG and see no difference running with or without the table. Maybe this effect is more important for double precision?

Cool that the forces are consistent! Well done :)

brucefan1983 commented on August 11, 2024

Yes, it seems that the performance depends on many factors. Even though single precision is faster, I still used double precision in my published work, so I usually do not care much about the single-precision case :-)

brucefan1983 commented on August 11, 2024

Now I have merged the analytical and tabulated versions of this potential (without affecting the inputs and outputs). I am satisfied with the implementation, so I will close this issue now.
