as the element count is relevant for that decision too.
That is true for matrix/vector operations, but not for arbitrary actions. I could be running 10 or 100 computations that I want spread over all available processors (a costly optimization problem, for example). If CommonParallel checked the number of elements, they wouldn't be run in parallel (unless I lowered the default element threshold, of course).
from mathnet-numerics.
For example, Double.DenseVector.Add(scalar) is really slow because of the parallel overhead (for vector size = 1000 and 100000 test iterations it is three times slower than the non-parallel version).
Double.DenseVector.Multiply(scalar) is extremely fast, as it uses LinearAlgebraProvider.ScaleArray().
I don't know if using Control.ParallelizeOperation() is the solution, but I think that some Vector operations are slower because of the parallel overhead, and that should be fixed.
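To make the reported overhead concrete, here is a minimal, self-contained sketch (not the actual MathNet benchmark; the library's real code path goes through CommonParallel.For) comparing a parallel and a sequential add-scalar over a small array. The class name, sizes, and iteration count are arbitrary choices for illustration:

```csharp
// Simplified stand-in for the reported behaviour: a trivial add-scalar over a
// small array, once via Parallel.ForEach with a range partitioner (roughly
// what CommonParallel.For does) and once as a plain sequential loop. For work
// this small the parallel version typically loses due to scheduling overhead.
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading.Tasks;

static class ParallelOverheadDemo
{
    public static double[] Run(int size, int iterations)
    {
        var data = new double[size];

        var sw = Stopwatch.StartNew();
        for (var n = 0; n < iterations; n++)
        {
            Parallel.ForEach(Partitioner.Create(0, data.Length), range =>
            {
                for (var i = range.Item1; i < range.Item2; i++) data[i] += 2.0;
            });
        }
        Console.WriteLine("parallel:   {0} ms", sw.ElapsedMilliseconds);

        sw.Restart();
        for (var n = 0; n < iterations; n++)
        {
            for (var i = 0; i < data.Length; i++) data[i] += 2.0;
        }
        Console.WriteLine("sequential: {0} ms", sw.ElapsedMilliseconds);

        return data;
    }

    static void Main()
    {
        Run(1000, 10000);
    }
}
```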
... but I think that some Vector operations are slower because of
the parallel overhead and that should be fixed.
I completely agree; the TPL has significant overhead (and I have never been a fan of it for tight numerical loops). In some places I think we parallelized the code for the sake of parallelizing it, instead of testing whether it made sense to.
What about changing CommonParallel.For() like this:
```csharp
public static void For(int fromInclusive, int toExclusive, Action<int, int> body)
{
    // other code removed to make it shorter

    // iterative
    if (Control.DisableParallelization || Control.NumberOfParallelWorkerThreads < 2)
    {
        body(fromInclusive, toExclusive);
        return;
    }

    // parallel
    Parallel.ForEach(
        Partitioner.Create(fromInclusive, toExclusive),
        new ParallelOptions
        {
            MaxDegreeOfParallelism = Control.NumberOfParallelWorkerThreads
        },
        (range, loopState) => body(range.Item1, range.Item2));
}
```
Then there would be no tight loops with delegate calls, as the action gets a begin and an end index. So it will be faster on both the iterative and the parallel path.
And to be backward compatible (this will have the same speed as the current implementation):
```csharp
public static void For(int fromInclusive, int toExclusive, Action<int> body)
{
    For(fromInclusive, toExclusive, (begin, end) =>
    {
        for (var index = begin; index < end; index++)
        {
            body(index);
        }
    });
}
```
Then we can change all calls to CommonParallel in the Vector classes to the new version.
What do you think?
Did you benchmark it? On the face of it, it seems that is what Parallel.For would do anyway.
Have done some simple tests, no benchmark yet.
The difference is the reduction of delegate calls in a tight loop!
Since CLR v 2, the cost of delegate invocation is very close to that of virtual method invocation, which is used for interface methods.
http://stackoverflow.com/questions/2082735/performance-of-calling-delegates-vs-methods
A delegate call costs about as much as a virtual method call, which is much more expensive than an add/subtract of a double value in an array.
I have never been a fan of it for tight numerical loops
You may run some tests of your own, but I'm really optimistic that it will show a performance boost, as the tight-loop problem exists in the non-parallel code path too.
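The claim can be checked with a small stand-alone sketch; the array size, iteration count, and class name below are arbitrary choices, not code from the library:

```csharp
// Self-contained micro-benchmark sketch (not MathNet code) contrasting a
// per-element delegate call with a range-based delegate that runs a plain
// for-loop inside. The absolute numbers vary by machine; the point is the
// relative cost of the per-element delegate dispatch.
using System;
using System.Diagnostics;

static class DelegateOverhead
{
    public static double[] Run(int size, int iterations)
    {
        var data = new double[size];
        Action<int> perElement = i => data[i] += 1.0;
        Action<int, int> perRange = (begin, end) =>
        {
            for (var i = begin; i < end; i++) data[i] += 1.0;
        };

        // One delegate invocation per element, per iteration.
        var sw = Stopwatch.StartNew();
        for (var n = 0; n < iterations; n++)
            for (var i = 0; i < data.Length; i++) perElement(i);
        Console.WriteLine("per-element delegate: {0} ms", sw.ElapsedMilliseconds);

        // One delegate invocation per iteration; the inner loop is plain.
        sw.Restart();
        for (var n = 0; n < iterations; n++)
            perRange(0, data.Length);
        Console.WriteLine("per-range delegate:   {0} ms", sw.ElapsedMilliseconds);

        return data;
    }

    static void Main()
    {
        Run(1000, 10000);
    }
}
```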
The difference is the reduction of delegate calls in a tight loop!
doh, I missed that.
A couple other things we might want to look at:
- We set the number of threads to Environment.ProcessorCount, which includes virtual cores. Is there a portable way to get only the physical cores?
- How many chunks is the partitioner creating? If it is creating too many, maybe we should create a custom one that creates only number-of-threads (or twice the number of threads) chunks.
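On the second question, a quick way to see the chunk count is to enumerate the partitioner's ranges directly; the sizes below are arbitrary examples, not values from the library:

```csharp
// Sketch: counting the ranges a BCL range partitioner produces when given an
// explicit rangeSize. With an explicit rangeSize the chunk count is fully
// predictable: ceil((toExclusive - fromInclusive) / rangeSize) ranges.
using System;
using System.Collections.Concurrent;
using System.Linq;

static class PartitionerChunks
{
    static void Main()
    {
        // Explicit rangeSize = 250 over [0, 1000) yields exactly 4 ranges.
        var ranges = Partitioner.Create(0, 1000, 250)
                                .GetDynamicPartitions()
                                .ToList();
        Console.WriteLine(ranges.Count); // 4

        foreach (var range in ranges)
            Console.WriteLine("[{0}, {1})", range.Item1, range.Item2);
    }
}
```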
Did some simple test with adding a scalar to a vector:
https://gist.github.com/tibel/1bcf17bf31d41ce888c2
Also tried changing MaxDegreeOfParallelism and the partition size, but the sequential version was always the fastest (on i5-3210M, Win8, .NET 4.5).
@cuda I think you are right. Parallel has only overhead and no speed benefit (at least for vector arithmetic).
And you can see the delegate call overhead in the tests (Tight Loop vs. Partition Loop).
Ran the same test on another PC (Xeon E31245, Win7, .NET 4.5):
- With input size = 100000, Partition Parallel was fastest (PartitionSize = 10000)
- With smaller input sizes, Partition Loop wins
- Partition Loop and Reference Loop speed is nearly the same (the difference is negligible)
@cdrnet I think we should move to the Partition Loop variant and use Partition Parallel for really big vectors only.
Thanks for the benchmarks! Yes, I agree.
I admit I'm positively surprised that the partition loop is (almost) as fast as the reference loop; apparently the compiler has become much smarter at optimizing away range checks in for-loops over arrays than it was back in .NET 2. This should simplify things. Seems like I should do some more benchmarking myself as well...
How can we fix this?
Also, not all operations are equally complex, so ParallelizeOperation() is not ideal for deciding whether something should run in parallel. I think we need to take the complexity of an operation into consideration, and not only the length of an array.
Created a proposal for a ForEach() implementation:
https://gist.github.com/tibel/b39334995ed097b1282e
I hope this can be a first step in solving the performance issues in MathNet.Numerics.
I like the rangeSize parameter. Maybe instead of deprecating the old For, we compute a range size and call the new For.
I think there are two problems with the old For() implementation:
- It uses delegate calls in a tight loop. The benchmark shows that this kills performance.
- The partitioning does not depend on the complexity of the algorithm in the body parameter.
That's why I marked it as Obsolete. Only the caller knows an accurate rangeSize for its algorithm; the implementation itself does not have enough information.
The code Partitioner.Create() uses when not called with a rangeSize:

```csharp
if (toExclusive <= fromInclusive) throw new ArgumentOutOfRangeException("toExclusive");
int coreOversubscriptionRate = 3;
int rangeSize = (toExclusive - fromInclusive) / (Environment.ProcessorCount * coreOversubscriptionRate);
if (rangeSize == 0) rangeSize = 1;
```

This is quite naive, but without further information it cannot do better.
Updated my gist to calculate a rangeSize in For().
At least all calls to CommonParallel.For() in DenseVector and ManagedLinearAlgebraProvider should be changed to CommonParallel.ForEach() to get better performance. For vectors, rangeSize = 10000 seems to be quite good.
@cuda Is the rangeSize calculation OK for you?
@cdrnet Will you integrate this change? Should I create a pull?
I'm not sure about

```csharp
rangeSize = Math.Max(rangeSize, Control.BlockSize);
```

That would preclude parallel looping over a small number of long-running computations (say 100). I would just leave it at:

```csharp
int rangeSize = (toExclusive - fromInclusive) / (Control.NumberOfParallelWorkerThreads * 2);
```
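A minimal sketch of that heuristic, with Control.NumberOfParallelWorkerThreads replaced by a plain parameter since Control is MathNet-specific:

```csharp
// Sketch of the rangeSize heuristic under discussion: split the index space
// into roughly two chunks per worker thread, with a floor of 1 so empty
// ranges are never handed out. The method and class names are illustrative.
using System;

static class RangeSizeSketch
{
    public static int ComputeRangeSize(int fromInclusive, int toExclusive, int workerThreads)
    {
        var rangeSize = (toExclusive - fromInclusive) / (workerThreads * 2);
        return Math.Max(rangeSize, 1); // at least one element per range
    }

    static void Main()
    {
        // 100000 elements on 4 workers: chunks of 12500.
        Console.WriteLine(ComputeRangeSize(0, 100000, 4)); // 12500
        // 100 long-running tasks on 4 workers still parallelize: chunks of 12.
        Console.WriteLine(ComputeRangeSize(0, 100, 4));    // 12
        // Tiny inputs fall back to single-element ranges.
        Console.WriteLine(ComputeRangeSize(0, 3, 4));      // 1
    }
}
```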
@cuda Done: rangeSize = Math.Max(rangeSize, 1); it should be at least one.
@tibel Sorry for the delay. Yes, the proposed implementation looks very good, I'd certainly integrate the change.
FYI: I'm working on it.
Great to hear.
Implemented in mainline, thanks again.