
Comments (21)

cuda commented on May 18, 2024

as the element count is relevant for that decision too.

That is true for matrix/vector operations but not for any arbitrary action. I could be running 10 or 100 computations that I want spread over all available processors (a costly optimization problem, for example). If CommonParallel checked the number of elements, they wouldn't be run in parallel (unless I lowered the default element threshold, of course).
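For illustration, a toy sketch of that situation (hypothetical workload, made-up names): only 100 iterations, but each one is expensive, so an element-count threshold would wrongly force this onto a single core.

```csharp
using System;
using System.Threading.Tasks;

class FewExpensiveTasks
{
    static void Main()
    {
        var results = new double[100];

        // Only 100 iterations, but each is costly, so parallelizing still
        // pays off even though the element count is tiny.
        Parallel.For(0, results.Length, i => results[i] = ExpensiveComputation(i));

        Console.WriteLine(results[0]);
    }

    // Stand-in for a costly optimization step.
    static double ExpensiveComputation(int seed)
    {
        double x = seed;
        for (int k = 0; k < 1000000; k++) x = Math.Sqrt(x + k);
        return x;
    }
}
```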

from mathnet-numerics.

tibel commented on May 18, 2024

For example, Double.DenseVector.Add(scalar) is really slow because of the parallel overhead (for vector size = 1000 and 100000 test iterations it is three times slower than the non-parallel version).

Double.DenseVector.Multiply(scalar) is extremely fast as it uses LinearAlgebraProvider.ScaleArray().

I don't know if using Control.ParallelizeOperation() is the solution, but I think that some Vector operations are slower because of the parallel overhead and that should be fixed.

cuda commented on May 18, 2024

... but I think that some Vector operations are slower because of
the parallel overhead and that should be fixed.

I completely agree, the TPL has significant overhead (and I have never been a fan of it for tight numerical loops). In some places I think we parallelized the code for the sake of parallelizing it instead of testing to see if it made sense to.

tibel commented on May 18, 2024

What about changing CommonParallel.For() like this?

        public static void For(int fromInclusive, int toExclusive, Action<int, int> body)
        {
            // other code removed to make it shorter

            // iterative
            if (Control.DisableParallelization || Control.NumberOfParallelWorkerThreads < 2)
            {
                body(fromInclusive, toExclusive);
                return;
            }

            // parallel path
            Parallel.ForEach(
                Partitioner.Create(fromInclusive, toExclusive),
                new ParallelOptions
                {
                    MaxDegreeOfParallelism = Control.NumberOfParallelWorkerThreads
                },
                (range, loopState) => body(range.Item1, range.Item2)
                );
        }

Then there would be no tight loops with delegate calls, as the action gets a begin and an end index. So it will be faster on both the iterative and the parallel path.

And to be backward compatible (will have the same speed as the current implementation):

        public static void For(int fromInclusive, int toExclusive, Action<int> body)
        {
            For(fromInclusive, toExclusive, (begin, end) =>
                {
                    for (var index = begin; index < end; index++)
                    {
                        body(index);
                    }
                });
        }

Then we can change all calls to CommonParallel in Vector classes to the new version.
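For illustration, a call site such as a vector Add could then move from the per-index overload to the range overload. This is a sketch with assumed variable names (data, result, scalar), not the actual Math.NET code:

```csharp
// Before: one delegate invocation per element.
CommonParallel.For(0, data.Length, index => result[index] = data[index] + scalar);

// After: one delegate invocation per range; the inner loop is a plain
// for-loop over the arrays, avoiding per-element delegate overhead.
CommonParallel.For(0, data.Length, (begin, end) =>
    {
        for (var i = begin; i < end; i++)
        {
            result[i] = data[i] + scalar;
        }
    });
```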
What do you think?

cuda commented on May 18, 2024

Did you benchmark it? On the face of it, it seems that is what Parallel.For would do.

tibel commented on May 18, 2024

Have done some simple tests, no real benchmark yet.

The difference is the reduction of delegate calls in a tight loop!

Since CLR v 2, the cost of delegate invocation is very close to that of virtual method invocation, which is used for interface methods.
http://stackoverflow.com/questions/2082735/performance-of-calling-delegates-vs-methods

A delegate call's cost is comparable to a virtual method call, which is much more expensive than an add/subtract of a double value in an array.

I have never been a fan of it for tight numerical loops

You may run some tests of your own, but I'm really optimistic that it will show a performance boost, as the problem with the tight loop exists in the non-parallel code path too.

cuda commented on May 18, 2024

The difference is the reduction of delegate calls in a tight loop!

doh, I missed that.

A couple other things we might want to look at:

  1. We set number of threads to Environment.ProcessorCount which includes virtual cores. Is there a portable way to only get physical cores?
  2. How many chunks is the partitioner creating? If it is creating too many, maybe we should create a custom one that only creates number-of-threads (or twice the number of threads) chunks.
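On point 2, a custom partitioner may not even be needed: Partitioner.Create has an overload that takes an explicit rangeSize, which gives direct control over the chunk count. A minimal sketch pinning the chunk count to roughly the worker count (variable names assumed):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ChunkDemo
{
    static void Main()
    {
        int from = 0, to = 1000, workers = Environment.ProcessorCount;

        // One chunk per worker (rounded up), instead of the default
        // ProcessorCount * 3 oversubscription of Partitioner.Create(from, to).
        int rangeSize = Math.Max(1, (to - from + workers - 1) / workers);

        Parallel.ForEach(
            Partitioner.Create(from, to, rangeSize),
            range => Console.WriteLine($"chunk [{range.Item1}, {range.Item2})"));
    }
}
```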

tibel commented on May 18, 2024

Did some simple test with adding a scalar to a vector:
https://gist.github.com/tibel/1bcf17bf31d41ce888c2

Also tried changing MaxDegreeOfParallelism and the partition size, but the sequential version was always the fastest (on i5-3210M, Win8, .NET 4.5).

@cuda I think you are right. Parallel has only overhead and no benefit in speed (at least for vector arithmetic).

And you can see the delegate call overhead in the tests (Tight Loop vs. Partition Loop).

tibel commented on May 18, 2024

Ran the same test on another PC (Xeon E31245, Win7, .NET 4.5):

  • With input size = 100000 Partition Parallel was fastest (PartitionSize=10000)
  • with smaller input size Partition Loop wins
  • Partition Loop and Reference Loop speed is nearly the same (difference is negligible)

@cdrnet I think we should move to the Partition Loop variant and use Partition Parallel for really big vectors only.

cdrnet commented on May 18, 2024

Thanks for the benchmarks! Yes, I agree.

I admit I'm positively surprised that the partition loop is (almost) as fast as the reference loop; apparently the compiler has become much smarter at optimizing and avoiding range checks in for-loops over arrays than back in .NET 2. This should simplify things. Seems like I should do more benchmarking again myself as well...

tibel commented on May 18, 2024

How can we fix this?

Also, not all operations are equally complex, so ParallelizeOperation() is not ideal for deciding whether an operation should run in parallel. I think we need to take the complexity of an operation into consideration, not only the length of an array.

tibel commented on May 18, 2024

Created a proposal for ForEach() implementation:
https://gist.github.com/tibel/b39334995ed097b1282e

I hope this can be a first step in solving the performance issues in MathNet.Numerics.

cuda commented on May 18, 2024

I like the rangeSize parameter. Maybe instead of deprecating the old For, we compute a range size and call the new For.

tibel commented on May 18, 2024

I think there are two problems with the old For() implementation:

  • It uses delegate calls in a tight loop. The benchmark shows that this kills performance.
  • The partitioning does not depend on the complexity of the algorithm in body parameter.

That's why I marked it as Obsolete. Only the caller knows an accurate rangeSize for the algorithm; the implementation itself does not have enough information.

The code Partitioner.Create() uses when not called with a rangeSize:

    if (toExclusive <= fromInclusive) throw new ArgumentOutOfRangeException("toExclusive");
    int coreOversubscriptionRate = 3;
    int rangeSize = (toExclusive - fromInclusive) / (Environment.ProcessorCount * coreOversubscriptionRate);
    if (rangeSize == 0) rangeSize = 1;

This is quite naive, but without further information it cannot do better.

tibel commented on May 18, 2024

Updated my gist to calculate a rangeSize in For().

At least, all calls to CommonParallel.For() in DenseVector and ManagedLinearAlgebraProvider should be changed to CommonParallel.ForEach() to get better performance. For vectors, rangeSize = 10000 seems to be quite good.

@cuda Is the rangeSize calculation OK for you?
@cdrnet Will you integrate this change? Should I create a pull request?

cuda commented on May 18, 2024

I'm not sure about

rangeSize = Math.Max(rangeSize, Control.BlockSize);

That would preclude parallel looping over a small number long running computations (say 100). I would just leave it at:

int rangeSize = (toExclusive - fromInclusive) / (Control.NumberOfParallelWorkerThreads * 2); 
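Put together, the computation could look like this (a sketch, clamped to at least one element per range so small inputs still produce a valid partition):

```csharp
// Aim for roughly two chunks per worker thread; clamp so the range
// size never drops to zero for small inputs.
int count = toExclusive - fromInclusive;
int rangeSize = Math.Max(1, count / (Control.NumberOfParallelWorkerThreads * 2));
```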

tibel commented on May 18, 2024

@cuda Done. rangeSize = Math.Max(rangeSize, 1); it should be at least one.

cdrnet commented on May 18, 2024

@tibel Sorry for the delay. Yes, the proposed implementation looks very good, I'd certainly integrate the change.

cdrnet commented on May 18, 2024

FYI: I'm working on it.

tibel commented on May 18, 2024

Great to hear.

cdrnet commented on May 18, 2024

Implemented in mainline, thanks again.
