
Comments (21)

cuda commented on May 18, 2024

as the element count is relevant for that decision too.

That is true for matrix/vector operations but not for any arbitrary action. I could be running 10 or 100 computations that I want spread over all available processors (a costly optimization problem, for example). If CommonParallel checked the number of elements, they wouldn't be run in parallel (unless I lowered the default element threshold, of course).
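For illustration, a toy sketch of that situation (hypothetical workload, made-up names): only 100 iterations, but each one is expensive, so an element-count threshold would wrongly force this onto a single core.

```csharp
using System;
using System.Threading.Tasks;

class FewExpensiveTasks
{
    static void Main()
    {
        var results = new double[100];

        // Only 100 iterations, but each is costly, so parallelizing still
        // pays off even though the element count is tiny.
        Parallel.For(0, results.Length, i => results[i] = ExpensiveComputation(i));

        Console.WriteLine(results[0]);
    }

    // Stand-in for a costly optimization step.
    static double ExpensiveComputation(int seed)
    {
        double x = seed;
        for (int k = 0; k < 1000000; k++) x = Math.Sqrt(x + k);
        return x;
    }
}
```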

from mathnet-numerics.

tibel commented on May 18, 2024

For example, Double.DenseVector.Add(scalar) is really slow because of the parallel overhead (for vector size = 1000 and 100000 test iterations it is three times slower than the non-parallel version).

Double.DenseVector.Multiply(scalar) is extremely fast as it uses LinearAlgebraProvider.ScaleArray().

I don't know if using Control.ParallelizeOperation() is the solution, but I think that some Vector operations are slower because of the parallel overhead and that should be fixed.

cuda commented on May 18, 2024

... but I think that some Vector operations are slower because of
the parallel overhead and that should be fixed.

I completely agree, the TPL has significant overhead (and I have never been a fan of it for tight numerical loops). In some places I think we parallelized the code for the sake of parallelizing it instead of testing to see if it made sense to.

tibel commented on May 18, 2024

What about changing CommonParallel.For() like this?

        public static void For(int fromInclusive, int toExclusive, Action<int, int> body)
        {
            // other code removed to make it shorter

            // iterative
            if (Control.DisableParallelization || Control.NumberOfParallelWorkerThreads < 2)
            {
                body(fromInclusive, toExclusive);
                return;
            }

            // parallel path
            Parallel.ForEach(
                Partitioner.Create(fromInclusive, toExclusive),
                new ParallelOptions
                {
                    MaxDegreeOfParallelism = Control.NumberOfParallelWorkerThreads
                },
                (range, loopState) => body(range.Item1, range.Item2)
                );
        }

Then there would be no tight loops with delegate calls, as the action gets a begin and an end index. So it will be faster on both the iterative and the parallel path.

And to be backward compatible (will have the same speed as the current implementation):

        public static void For(int fromInclusive, int toExclusive, Action<int> body)
        {
            For(fromInclusive, toExclusive, (begin, end) =>
                {
                    for (var index = begin; index < end; index++)
                    {
                        body(index);
                    }
                });
        }

Then we can change all calls to CommonParallel in Vector classes to the new version.
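For illustration, a call site such as a vector Add could then move from the per-index overload to the range overload. This is a sketch with assumed variable names (data, result, scalar), not the actual Math.NET code:

```csharp
// Before: one delegate invocation per element.
CommonParallel.For(0, data.Length, index => result[index] = data[index] + scalar);

// After: one delegate invocation per range; the inner loop is a plain
// for-loop over the arrays, avoiding per-element delegate overhead.
CommonParallel.For(0, data.Length, (begin, end) =>
    {
        for (var i = begin; i < end; i++)
        {
            result[i] = data[i] + scalar;
        }
    });
```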
What do you think?

cuda commented on May 18, 2024

Did you benchmark it? On the face of it, it seems that is what Parallel.For would do.

tibel commented on May 18, 2024

Have done some simple tests, no real benchmark yet.

The difference is the reduction of delegate calls in a tight loop!

Since CLR v 2, the cost of delegate invocation is very close to that of virtual method invocation, which is used for interface methods.
http://stackoverflow.com/questions/2082735/performance-of-calling-delegates-vs-methods

A delegate call's cost is comparable to a virtual method call, which is much more expensive than an add/subtract of a double value in an array.

I have never been a fan of it for tight numerical loops

You may run some tests of your own, but I'm really optimistic that it will show a performance boost, as the problem with the tight loop exists in the non-parallel code path too.

cuda commented on May 18, 2024

The difference is the reduction of delegate calls in a tight loop!

doh, I missed that.

A couple other things we might want to look at:

  1. We set number of threads to Environment.ProcessorCount which includes virtual cores. Is there a portable way to only get physical cores?
  2. How many chunks is the partitioner creating? If it is creating too many, maybe we should create a custom one that only creates number-of-threads (or twice the number of threads) chunks.
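On point 2, a custom partitioner may not even be needed: Partitioner.Create has an overload that takes an explicit rangeSize, which gives direct control over the chunk count. A minimal sketch pinning the chunk count to roughly the worker count (variable names assumed):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ChunkDemo
{
    static void Main()
    {
        int from = 0, to = 1000, workers = Environment.ProcessorCount;

        // One chunk per worker (rounded up), instead of the default
        // ProcessorCount * 3 oversubscription of Partitioner.Create(from, to).
        int rangeSize = Math.Max(1, (to - from + workers - 1) / workers);

        Parallel.ForEach(
            Partitioner.Create(from, to, rangeSize),
            range => Console.WriteLine($"chunk [{range.Item1}, {range.Item2})"));
    }
}
```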

tibel commented on May 18, 2024

Did some simple test with adding a scalar to a vector:
https://gist.github.com/tibel/1bcf17bf31d41ce888c2

Also tried changing MaxDegreeOfParallelism and the partition size, but the sequential version was always the fastest (on i5-3210M, Win8, .NET 4.5).

@cuda I think you are right. Parallel has only overhead and no benefit in speed (at least for vector arithmetic).

And you can see the delegate call overhead in the tests (Tight Loop vs. Partition Loop).

tibel commented on May 18, 2024

Ran the same test on another PC (Xeon E31245, Win7, .NET 4.5):

  • With input size = 100000 Partition Parallel was fastest (PartitionSize=10000)
  • with smaller input size Partition Loop wins
  • Partition Loop and Reference Loop speed is nearly the same (difference is negligible)

@cdrnet I think we should move to the Partition Loop variant and use Partition Parallel for really big vectors only.

cdrnet commented on May 18, 2024

Thanks for the benchmarks! Yes, I agree.

I admit I'm positively surprised that the partition loop is (almost) as fast as the reference loop; apparently the compiler has become much smarter at optimizing and avoiding range checks in for-loops over arrays than back in .NET 2. This should simplify things. Seems like I should do more benchmarking again myself as well...

tibel commented on May 18, 2024

How can we fix this?

Also, not all operations are equally complex, so ParallelizeOperation() is not ideal for deciding whether an operation should run in parallel. I think we need to take the complexity of an operation into consideration, not only the length of an array.

tibel commented on May 18, 2024

Created a proposal for ForEach() implementation:
https://gist.github.com/tibel/b39334995ed097b1282e

I hope this can be a first step in solving the performance issues in MathNet.Numerics.

cuda commented on May 18, 2024

I like the rangeSize parameter. Maybe instead of deprecating the old For, we compute a range size and call the new For.

tibel commented on May 18, 2024

I think there are two problems with the old For() implementation:

  • It uses delegate calls in a tight loop. The benchmark shows that this kills performance.
  • The partitioning does not depend on the complexity of the algorithm in body parameter.

That's why I marked it as Obsolete. Only the caller knows an accurate rangeSize for the algorithm; the implementation itself does not have enough information.

The code Partitioner.Create() uses when not called with a rangeSize:

    if (toExclusive <= fromInclusive) throw new ArgumentOutOfRangeException("toExclusive");
    int coreOversubscriptionRate = 3;
    int rangeSize = (toExclusive - fromInclusive) / (Environment.ProcessorCount * coreOversubscriptionRate);
    if (rangeSize == 0) rangeSize = 1;

This is quite naive, but without further information it cannot do better.

tibel commented on May 18, 2024

Updated my gist to calculate a rangeSize in For().

At least, all calls to CommonParallel.For() in DenseVector and ManagedLinearAlgebraProvider should be changed to CommonParallel.ForEach() to get better performance. For vectors, rangeSize = 10000 seems to be quite good.

@cuda Is the rangeSize calculation OK for you?
@cdrnet Will you integrate this change? Should I create a pull request?

cuda commented on May 18, 2024

I'm not sure about

rangeSize = Math.Max(rangeSize, Control.BlockSize);

That would preclude parallel looping over a small number long running computations (say 100). I would just leave it at:

int rangeSize = (toExclusive - fromInclusive) / (Control.NumberOfParallelWorkerThreads * 2); 
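Put together, the computation could look like this (a sketch, clamped to at least one element per range so small inputs still produce a valid partition):

```csharp
// Aim for roughly two chunks per worker thread; clamp so the range
// size never drops to zero for small inputs.
int count = toExclusive - fromInclusive;
int rangeSize = Math.Max(1, count / (Control.NumberOfParallelWorkerThreads * 2));
```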

tibel commented on May 18, 2024

@cuda Done. rangeSize = Math.Max(rangeSize, 1); it should be at least one.

cdrnet commented on May 18, 2024

@tibel Sorry for the delay. Yes, the proposed implementation looks very good, I'd certainly integrate the change.

cdrnet commented on May 18, 2024

FYI: I'm working on it.

tibel commented on May 18, 2024

Great to hear.

cdrnet commented on May 18, 2024

Implemented in mainline, thanks again.
