Comments (10)
At default T31 resolution we have 96*48 = 4608 grid points in the horizontal and 8 in the vertical. If we parallelized across 8 processes in the vertical, each process would take care of one level. For the physics, however, we would instead have the 8 processes share the load of calculating the parametrizations for the 4608 vertical columns, which are independent of each other. So we could put `@distributed` in front of the `for i in 1:nlon, j in 1:nlat` loop. A given process would first copy the 8 vertical levels of a column from the nlon x nlat x nlev prognostic variable arrays into a new array (for contiguous memory access) that is not shared with other processes, do all the calculations, and then write the total physics tendencies back into an nlon x nlat x nlev array (which is shared again).
Given that 8 << 4608 we would never actually hold a full copy of the prognostic variables, as the next column copy can simply overwrite the previous one.
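The copy-compute-writeback pattern described above can be sketched as follows. This is a minimal serial sketch with made-up array names and a placeholder in place of the actual parametrizations, not SpeedyWeather.jl code:

```julia
nlon, nlat, nlev = 96, 48, 8

# a prognostic variable, e.g. temperature, laid out (nlon, nlat, nlev), column-major
temp = rand(nlon, nlat, nlev)
temp_tend = zeros(nlon, nlat, nlev)     # total physics tendencies written back here

column = zeros(nlev)                    # reusable contiguous buffer for one column

for j in 1:nlat, i in 1:nlon
    # copy the vertical column into contiguous memory (the strided read happens once)
    for k in 1:nlev
        column[k] = temp[i, j, k]
    end

    # placeholder for the parametrizations acting on the column
    column_tend = -0.1 .* column        # e.g. a simple relaxation

    # write the tendency back into the shared horizontal fields
    for k in 1:nlev
        temp_tend[i, j, k] = column_tend[k]
    end
end
```

With `@distributed` in front of the outer loop, each worker would need its own `column` buffer to avoid write conflicts.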
from speedyweather.jl.
@white-alistair is currently working on that in #82, which might also be a good testbed for the domain decomposition once we reach that stage.
Yeah, large-scale condensation (LSC) would also work as a test case.
Maybe we keep this issue separate (keen to avoid scope creep on that PR...) and then do some parallelisation experiments before proceeding with other parameterisations? In any case, once we've settled on an approach, I don't think it's a lot of work to refactor whatever parameterisations have already been implemented at that point.
I'd guess it would even make sense to drop the vertical view in favour of an actual deepcopy, so that the non-contiguous memory access happens only once. In the end we only have a few prognostic variables to copy that way, and given that we'll use far fewer processes than nlon*nlat, we'll never crowd the memory with a full copy of all prognostic variables (assuming the garbage collector collects before the end of the `for j in 1:nlat, i in 1:nlon` loop).
I guess my first question is: why not change all loops, including in the dycore, to lat, lon, vert?
Because we probably want the costly spherical transforms to act on arrays that are laid out contiguously in memory. For them the loop order should be vert (outer loop) then horizontal (inner), which means arrays of size nlon x nlat x nlev (column-major) in grid space, or lmax x mmax x nlev in spectral space. Given that the dycore is costly, involves a lot of global communication, and nlon*nlat >> nlev, I'd rather have memory access and loop order optimized for that.
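In Julia's column-major layout that choice means each horizontal level `A[:, :, k]` is one contiguous block, which is what the transforms iterate over. A small sketch (array names made up) to illustrate the stride pattern:

```julia
nlon, nlat, nlev = 96, 48, 8
A = rand(nlon, nlat, nlev)           # grid-point field, column-major

# a horizontal level is contiguous: stride 1 in lon, nlon in lat
level = view(A, :, :, 1)
@assert strides(level) == (1, nlon)

# so the transform-friendly loop order is vertical outside, horizontal inside
for k in 1:nlev                      # outer loop over levels
    for j in 1:nlat, i in 1:nlon     # inner loops walk memory contiguously
        A[i, j, k] *= 2              # placeholder for per-level work
    end
end

# a vertical column, in contrast, is strided by nlon*nlat
column = view(A, 1, 1, :)
@assert strides(column) == (nlon * nlat,)
```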
Based on what I've seen of the parameterisations so far, I think it could make a lot of sense to parallelise in the horizontal.
Regarding the view versus deepcopy thing, this section of the Julia performance tips would seem to support a deepcopy, but in any case we can just do some tests.
> given that we'll use way less processes than nlon*nlat

Can you explain this please?
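The view-versus-deepcopy question comes down to paying the strided access once. A minimal sketch (names made up; the actual decision should come from benchmarking, e.g. with BenchmarkTools.jl):

```julia
nlon, nlat, nlev = 96, 48, 8
A = rand(nlon, nlat, nlev)

i, j = 10, 20

# Option 1: a view — no allocation, but every access strides through
# memory by nlon*nlat elements
col_view = view(A, i, j, :)

# Option 2: copy the column into a contiguous buffer once; all subsequent
# reads inside the parametrizations are then contiguous
col_copy = Vector{Float64}(undef, nlev)
copyto!(col_copy, col_view)

@assert col_copy == collect(col_view)

# downstream work (here just a sum) touches contiguous memory with the copy
@assert sum(col_copy) ≈ sum(col_view)
```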
As an archetypal single-column physics scheme, the convection scheme from SPEEDY might be a nice first target for porting to Julia. It produces just tendencies of specific humidity and (I think) enthalpy (i.e. temperature with a trivial conversion). As you can see from the code, there are two big loops over `ngp`, so it fits exactly the pattern we are talking about.
#123 implements this idea ☝🏼 roughly as

```julia
column = ColumnVariables{NF}(nlev=diagn.nlev)

for ij in eachgridpoint(diagn)      # loop over all horizontal grid points
    reset_column!(column)           # set accumulators back to zero for next grid point
    get_column!(column,diagn,ij,G)  # extract an atmospheric column for contiguous memory access

    # calculate parametrizations
    large_scale_condensation!(column, M)
    ...

    # write tendencies from parametrizations back into horizontal fields
    write_column_tendencies!(diagn,column,ij)
end
```

So this is the only loop over the horizontal, and it could be `@distributed` as every iteration is independent of the others. However, every worker should have its own instance of `column` to avoid conflicts. Otherwise this is also independent of the grid (see #112), as `eachgridpoint(diagn)` just loops over all grid points (regardless of their arrangement) with a single index.
We now use multithreading through `@floop` instead of `@distributed`, but otherwise the idea is as above ☝🏼
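The per-worker `column` requirement can be sketched with standard-library tasks alone (a hand-rolled stand-in for `@floop`; the chunking and all names are made up):

```julia
using Base.Threads

nlev, ngp = 8, 4608
data = rand(nlev, ngp)               # one column per horizontal grid point
tend = zeros(nlev, ngp)              # tendencies written back per column

# split the grid points into one chunk per thread
nchunks = nthreads()
chunks = collect(Iterators.partition(1:ngp, cld(ngp, nchunks)))

@sync for chunk in chunks
    Threads.@spawn begin
        column = zeros(nlev)         # each task owns its own column buffer
        for ij in chunk
            copyto!(column, view(data, :, ij))    # extract the column
            column .*= -0.1                        # placeholder physics
            copyto!(view(tend, :, ij), column)     # write tendencies back
        end
    end
end
```

Because every task allocates its own `column`, no two tasks ever write to the same buffer, which is the same conflict-avoidance that `@floop` provides.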
Benchmarking this (T31, Float64, full F24 grid, 8 levels, Held-Suarez forcing as an example):
nthreads | time | faster |
---|---|---|
1 | 2.6ms | 1x |
2 | 1.4ms | 1.9x |
4 | 755μs | 3.4x |
8 | 400μs | 6.5x |
16 | 255μs | 10.2x |
At higher resolution (T127, 31 levels)
nthreads | time | faster |
---|---|---|
1 | 194ms | 1x |
2 | 100ms | 1.94x |
4 | 54.6ms | 3.55x |
8 | 28ms | 6.9x |
16 | 15.5ms | 12.5x |
and similar scaling for T511, 63 levels
nthreads | time | faster |
---|---|---|
1 | 6.6s | 1x |
16 | 520ms | 12.8x |
@white-alistair not surprising, but good to see. I believe we don't reach 16x because the bottleneck becomes reading from the same data in memory, etc.
As a comparison, using nthreads=16 and the OctaHEALPixGrid (-12.5% grid points, but cubic instead of quadratic truncation), the 15.5ms at T127 become 14ms, which is just the linear scaling with the number of grid points. With Float32 this drops again to
```julia
julia> p,d,m = initialize_speedy(Float32,PrimitiveDryCore,trunc=127,nlev=31,Grid=OctaHEALPixGrid);

julia> @btime SpeedyWeather.parameterization_tendencies!($d,$t0,$m);
  9.951 ms (145 allocations: 83.84 KiB)
```
which is a speedup of 1.4x Float64 -> Float32 for the parameterizations.