
Architecture/Threading idea about connxr (open, 24 comments)

nopeslide commented on July 17, 2024
Architecture/Threading idea


Comments (24)

mdhimes commented on July 17, 2024

In general I think this is a great idea. It looks like a better, higher-level approach than the multithreaded version I had coded up, which performed the multithreading within a given operator.

Have you done any benchmarking yet? Based on what I observed in my multithreaded version, my main concern is that, in some situations, the added overhead of creating/joining threads can eat up any performance gains. I think the higher-level implementation avoids this problem for the most part, but it would be good to benchmark it against the current version of the code to make sure it isn't going to make simple models (e.g., the MNIST example) less efficient. Maybe a single-threaded comparison to evaluate the overhead, and then look at a 2- or 4-threaded run to see how it scales.

alrevuelta commented on July 17, 2024

I haven't thought much about supporting multithreading, but before going into technicalities I would like to discuss some other things. I think we should be clear about what we want to achieve and why, and then the "how" will come later. Some ideas off the top of my head:

  • I assume you are referring to multithreading on a model level. So if a given model has different "branches" that don't depend on each other, we can parallelise them by running both at the same time. Just out of curiosity, do you have any models that could be parallelised?

  • Do you have any idea how we can know if a given model can be parallelised? I don't know if there are tools out there for this, but I don't think it is trivial. We should study this, and it will impact how we "resolve" our jobs.

  • Do you have a particular use case/need that makes you want to implement this? If we add multithreading, that means we are looking for performance, and I don't think someone looking for performance will be using cONNXr. Let's not forget that this is C99.

  • From my point of view we could allocate our limited time to other, more "core" functionalities that I think are more relevant. Of course, if there is a particular need and someone is really interested, we can consider it. @mdhimes is this something that you are looking into?

Regarding node_context, there are some things that I don't understand. So a node is a given operator with its inputs and outputs. Why does a node have different jobs? Could you elaborate more on this?

struct node_context {
   Onnx__NodeProto     *onnx_node;
   Onnx__TensorProto  **inputs;
   Onnx__TensorProto  **outputs;
+  operator_executer    executer;
+  size_t               n_jobs;
+  void               **jobs;
 };

In general I think this is a great idea. It looks like a better, higher-level approach than the multithreaded version I had coded up, which performed the multithreading within a given operator.

So you were parallelising some operators? Just curious, which ones have you tried?

mdhimes commented on July 17, 2024

If we add multithreading, that means we are looking for performance, and I don't think someone looking for performance will be using cONNXr. Let's not forget that this is C99.

While I think that this is true for the average user, I think it's good to keep in mind that a lot of science software is written in older standards, and performance can be a concern in those situations. In my case, we're working with some modern software that is written in C99 for speed and for compatibility with some older modules, and we'd like to squeeze out any performance gains that are possible. So, there is definitely an interest in a multithreaded version of cONNXr. But, I think it is only worth implementing it into the master branch if it doesn't negatively impact the performance of single-threaded uses.

So you were parallelising some operators? Just curious, which ones have you tried?

I parallelized all operators that were implemented at that time (early June) just to see if it was reasonable. I found that most operators didn't benefit from parallelization (the overhead of creating/joining threads ate up the performance gain of using N threads) except when there were large numbers of that operation. From what I recall, the exception to that was the add and matmul operators, which provided some performance gains even for small models.

Do you have any idea how we can know if a given model can be parallelised? I don't know if there are tools out there for this, but I don't think it is trivial. We should study this, and it will impact how we "resolve" our jobs.

This is a great point, and it should probably be figured out first before going down the rabbit hole.

Based on my use case, my initial thought is that you could just parallelize over the zero-th axis. These would be separate, independent cases, so presumably the inference could be carried out in parallel without issue. E.g., if there is a batch of 256 cases to predict on, you could spawn 256 jobs and assign them to N workers. For all of my models, this holds true -- they all take inputs shaped as (num_cases, num_inputs) -- but maybe there are some other models that this is not true for. Thoughts?
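For illustration, a minimal sketch of what such a zero-th-axis split could look like with pthread.h. The flat float layout, run_single_case, and run_batch are hypothetical stand-ins for the example, not existing cONNXr functions:

#include <pthread.h>
#include <stddef.h>

#define N_THREADS 4

struct batch_slice {
    const float *inputs;    /* all cases, shaped (num_cases, num_inputs) */
    float       *outputs;   /* pre-allocated outputs for all cases       */
    size_t       num_inputs;
    size_t       num_outputs;
    size_t       first;     /* first case handled by this worker         */
    size_t       last;      /* one past the last case it handles         */
};

extern void run_single_case(const float *in, float *out);  /* hypothetical per-case inference */

static void *batch_worker(void *arg)
{
    struct batch_slice *s = arg;
    for (size_t i = s->first; i < s->last; i++)
        run_single_case(s->inputs  + i * s->num_inputs,
                        s->outputs + i * s->num_outputs);
    return NULL;
}

void run_batch(const float *inputs, float *outputs,
               size_t num_cases, size_t num_inputs, size_t num_outputs)
{
    pthread_t tid[N_THREADS];
    struct batch_slice slice[N_THREADS];
    size_t chunk = (num_cases + N_THREADS - 1) / N_THREADS;

    for (int t = 0; t < N_THREADS; t++) {
        size_t first = t * chunk;
        size_t last  = (first + chunk < num_cases) ? first + chunk : num_cases;
        slice[t] = (struct batch_slice){ inputs, outputs, num_inputs, num_outputs, first, last };
        pthread_create(&tid[t], NULL, batch_worker, &slice[t]);
    }
    for (int t = 0; t < N_THREADS; t++)
        pthread_join(tid[t], NULL);   /* join all workers before reading outputs */
}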

alrevuelta commented on July 17, 2024

@mdhimes Great input

Based on my use case, my initial thought is that you could just parallelize over the zero-th axis. These would be separate, independent cases, so presumably the inference could be carried out in parallel without issue. E.g., if there is a batch of 256 cases to predict on, you could spawn 256 jobs and assign them to N workers. For all of my models, this holds true -- they all take inputs shaped as (num_cases, num_inputs) -- but maybe there are some other models that this is not true for. Thoughts?

This is something that I never thought of but shouldn't be very difficult to implement. Actually it should be possible to implement this without even touching cONNXr. Beware of #45. Currently there is one global context (that stores inputs and outputs) so if multiple threads are spawned it won't work. Once this issue is fixed we could build something on top to allow this. Might be a nice starting point to play with multithreading.

mdhimes commented on July 17, 2024

This is something that I never thought of but shouldn't be very difficult to implement. Actually it should be possible to implement this without even touching cONNXr. Beware of #45. Currently there is one global context (that stores inputs and outputs) so if multiple threads are spawned it won't work. Once this issue is fixed we could build something on top to allow this. Might be a nice starting point to play with multithreading.

Yeah, the single global context threw a wrench in my early multithreading plans :)

But, assuming that the internal arrays maintain the expected dimensionality, I think it should still be possible to parallelize over the zero-th axis even with the global context. Everything is presumably accessed through pointers to pre-allocated arrays, and since each case along the zero-th axis is independent, there's no risk of a worker trying to access a value that hasn't been filled in yet. As long as all of the workers are joined before trying to access the outputs, it should be fine, I think. Though, I think this would require a reworking of how the operators perform.

nopeslide commented on July 17, 2024

Regarding node_context, there are some things that I don't understand. So a node is a given operator with its inputs and outputs. Why does a node have different jobs? Could you elaborate more on this?

There is a lot of potential parallelism in the execution of any onnx model:

  • batch
    batches are independent and can be executed in any order
  • model
    we have a model with nodes that can run in parallel
  • operator
    almost all operators can be parallelized to a certain degree

I wanted to tackle the batch and operator parallelism.
Instead of having an operator calculate everything in one go, I would split the whole calculation into independent chunks, aka jobs.
A batch element would be a legitimate chunk/job, and so would a partial calculation inside the same batch element.
For example:

  • Add is a linear walk through independent values, therefore parallelizable
  • Conv has groups and feature maps which are parallelizable

This is something that I never thought of but shouldn't be very difficult to implement. Actually it should be possible to implement this without even touching cONNXr. Beware of #45. Currently there is one global context (that stores inputs and outputs) so if multiple threads are spawned it won't work. Once this issue is fixed we could build something on top to allow this. Might be a nice starting point to play with multithreading.

Afaik we do not need to change a thing with this approach, since all jobs of a node need to be independent and therefore can run in any order. These jobs work with already allocated memory and do nothing but fill in the numbers.
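To make "job" concrete, here is a purely illustrative sketch of what a job for an element-wise Add could look like. The struct and function names are made up for the example, not cONNXr definitions; the point is that a job only fills in numbers in memory that was allocated beforehand, and jobs covering disjoint ranges can run in any order or in parallel without locks:

#include <stddef.h>

struct add_job {                /* hypothetical job: one chunk of an Add */
    const float *a;             /* first input, already allocated        */
    const float *b;             /* second input, already allocated       */
    float       *out;           /* output buffer, already allocated      */
    size_t       offset;        /* first element this job covers         */
    size_t       count;         /* number of elements in this chunk      */
};

static void add_job_run(const struct add_job *job)
{
    /* just fill in the numbers for this job's range */
    for (size_t i = job->offset; i < job->offset + job->count; i++)
        job->out[i] = job->a[i] + job->b[i];
}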

nopeslide commented on July 17, 2024

As long as all of the workers are joined before trying to access the outputs, it should be fine, I think. Though, I think this would require a reworking of how the operators perform.

I would try to keep it simple, stupid,
but you are right, we need some kind of synchronization between the workers regarding nodes or jobs.
Either we run the network node by node and wait for all workers to finish each node,
or we run N workers completely independently of each other, each trying to grab the next possible job, which requires a critical/lockable structure that marks jobs as done.
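A rough sketch of the first option, assuming pthreads; job_run and the generic void* job representation are placeholders for the example, not cONNXr API:

#include <pthread.h>
#include <stddef.h>

extern void job_run(void *job);              /* hypothetical: execute one job */

static void *job_thread(void *job)
{
    job_run(job);
    return NULL;
}

#define MAX_JOBS 64

void run_node_then_wait(void **jobs, size_t n_jobs)
{
    pthread_t tid[MAX_JOBS];
    for (size_t j = 0; j < n_jobs && j < MAX_JOBS; j++)
        pthread_create(&tid[j], NULL, job_thread, jobs[j]);
    for (size_t j = 0; j < n_jobs && j < MAX_JOBS; j++)
        pthread_join(tid[j], NULL);          /* barrier: all jobs of this node are done */
}

(The second option, persistent workers pulling from a shared job structure, is sketched further down with the ringbuffer.)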

nopeslide commented on July 17, 2024

@alrevuelta
Keep in mind that with #42 and #40 we already need to exchange data between the "init" (for lack of a better name), aka typeless-operator function, and the actual calculation function. This data is nothing else than what I called a "job".
The only additional thing I propose is: instead of exchanging this one large "job", we could exchange multiple smaller jobs, directly enabling multiple threads if desired.

mdhimes commented on July 17, 2024

I would try to keep it simple, stupid,
but you are right, we need some kind of synchronization between the workers regarding nodes or jobs.
Either we run the network node by node and wait for all workers to finish each node,
or we run N workers completely independently of each other, each trying to grab the next possible job, which requires a critical/lockable structure that marks jobs as done.

This is why I'm in favor of parallelizing over the batch. If using pthread.h, the create/join calls will handle all of that.

If you also parallelize operators, unless there is a more efficient way than I implemented, a lot of the time you won't get much benefit. Add, matmul, and conv are the few that I think would benefit from parallelization in most use cases. I can make a branch on my fork if you'd like to take a look at how I handled that -- it isn't the best coding style per se, but it was quick enough to implement for my purpose of testing.

nopeslide commented on July 17, 2024

If you also parallelize operators, unless there is a more efficient way than I implemented, a lot of the time you won't get much benefit. Add, matmul, and conv are the few that I think would benefit from parallelization in most use cases. I can make a branch on my fork

I would not create a thread for each job, but have a job ringbuffer from which persistent workers grab their next job. If you create a thread for each job, you are fighting the fork/clone overhead.

nopeslide commented on July 17, 2024

This is why I'm in favor of parallelizing over the batch. If using pthread.h, the create/join calls will handle all of that.

As far as I see it, if we abstract the parallelism this way and each operator decides how to be parallelized, it's always at least the batch parallelism and in some cases even more, but the underlying structure can accommodate all kinds of parallelism an operator may offer.
How to parallelize multiple nodes or even the whole network is out-of-scope of this idea, since this would require actual scheduling done outside of the operator execution.

alrevuelta commented on July 17, 2024

I wanted to tackle the batch and operator parallelism.

Okay, I see. So if we leave model parallelism aside for now, these are my concerns:

  • Batch parallelism is interesting and a cool feature to have. Either by building on top or by using the zero-th axis, I would say the impact is reasonable.

  • Regarding operator parallelism and what you have mentioned, I'm worried about its impact. Can't we just parallelise within the operator? I mean, leave all the interfaces/structs/whatever untouched to avoid adding unnecessary complexity to the non-multithreading case, and then just parallelise one particular operator. This means that the operator has to be reimplemented, but locally. Do you think this can be a solution to the problem? One of the cool things about cONNXr is that it's a spider that connects inputs/outputs/attributes with its nodes and the function that has to be run, with a simple and generic interface. If someone doesn't like one particular operator, they can reimplement it to run multithreaded, on a GPU, a HW accelerator, or whatever.
    @mdhimes which approach did you take to parallelise the operators? Were you respecting the interface and just multithreading within the function?

This is something that I never thought of but shouldn't be very difficult to implement. Actually it should be possible to implement this without even touching cONNXr. Beware of #45. Currently there is one global context (that stores inputs and outputs) so if multiple threads are spawned it won't work. Once this issue is fixed we could build something on top to allow this. Might be a nice starting point to play with multithreading.

Afaik we do not need to change a thing with this approach, since all jobs of a node need to be independent and therefore can run in any order. These jobs work with already allocated memory and do nothing but fill in the numbers.

What I mean here is that if we have multiple threads running inference at the same time, the struct all_context will be corrupted, because all the threads would be writing to it at the same time.

nopeslide commented on July 17, 2024

  • Batch parallelism is interesting and a cool feature to have. Either by building on top or by using the zero-th axis, I would say the impact is reasonable.

  • Regarding operator parallelism and what you have mentioned, I'm worried about its impact. Can't we just parallelise within the operator? I mean, leave all the interfaces/structs/whatever untouched to avoid adding unnecessary complexity to the non-multithreading case, and then just parallelise one particular operator. This means that the operator has to be reimplemented, but locally. Do you think this can be a solution to the problem? One of the cool things about cONNXr is that it's a spider that connects inputs/outputs/attributes with its nodes and the function that has to be run, with a simple and generic interface. If someone doesn't like one particular operator, they can reimplement it to run multithreaded, on a GPU, a HW accelerator, or whatever.

In my view, batch parallelism is a specific kind of operator parallelism.
The operator gets the data of multiple batches at once, and it's the operator's job to offload this work how it sees fit.
The problem I see is that, without this tiny, tiny (;D) interface change (turning a necessary void pointer into an array), you would need to build something that is not on top of connxr, but something that replaces each operator, the inference, and our contexts to accommodate efficient threading, so almost everything.

What I mean here is that if we have multiple threads running inference at the same time, the struct all_context will be corrupted, because all the threads would be writing to it at the same time.

There can't be any corruption afaik, because each job is (by definition) completely independent and therefore only writes to its own assigned location, not interfering with anything else. This is what I meant by "filling in the numbers".

nopeslide commented on July 17, 2024

Regarding the necessary void pointer I mentioned:
If our typeless operator sets up the execution context (parallelized or not), it makes sense to provide some things to the actual executer.
For example, optional attributes can be resolved in the typeless operator, and we supply the values the execution needs to act on (resolved here means the decision between default values or values from the attribute).
For this we need to extend the node context with a void pointer to reference the operator-specific execution context.
Sorry if this was unclear; sometimes I jump too far ahead.
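As a small illustration of the idea; the names (conv_exec_context, conv_prepare, conv_execute) and the attribute handling are invented for this example and are not the actual cONNXr interfaces:

#include <stdlib.h>

struct node_context_ext {            /* hypothetical: node context plus the void pointer */
    /* ... existing onnx_node / inputs / outputs fields ... */
    void *executer_context;          /* operator-specific data prepared for the executer */
};

struct conv_exec_context {           /* example payload for a Conv node */
    int group;                       /* resolved: attribute value or default 1 */
    int strides[2];                  /* resolved: attribute value or {1,1}     */
};

void conv_prepare(struct node_context_ext *ctx)   /* "typeless" stage */
{
    struct conv_exec_context *c = malloc(sizeof *c);
    c->group = 1;                    /* would be: attribute present ? its value : default */
    c->strides[0] = c->strides[1] = 1;
    ctx->executer_context = c;       /* hand the resolved values to the executer */
}

void conv_execute(struct node_context_ext *ctx)   /* typed execution stage */
{
    struct conv_exec_context *c = ctx->executer_context;
    (void)c;                         /* ... run the convolution using c->group, c->strides ... */
}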

mdhimes commented on July 17, 2024

@mdhimes which approach did you take to parallelise the operators? Were you respecting the interface and just multithreading within the function?

Yes, mostly. I modified the existing function to use pthread.h to create/join the different threads. I defined a new function for those threads to carry out (this is basically the original lines of code that implement the operator). And, in order to tell the threads where to get the inputs from and where to store the outputs, I defined a new struct that held pointers to those locations and other relevant info (e.g., indices to carry out the operation).

I would not create a thread for each job, but have a job ringbuffer from which persistent workers grab their next job. If you create a thread for each job, you are fighting the fork/clone overhead.

Is there a reason that the full set of jobs can't just be split among the threads when they are created? Is it more efficient to use a ringbuffer? If you have a batch size of M and requested N threads, then the i-th thread would do the inference on the non-inclusive range of elements i -- min(i+N, M). This is the approach I took with parallelizing the operators, which ensures that there are exactly as many jobs as there are threads, as this avoids the overhead related to creating additional jobs and having workers retrieve said jobs. Though, maybe that overhead is negligible.

As far as I see it, if we abstract the parallelism this way and each operator decides how to be parallelized, it's always at least the batch parallelism and in some cases even more, but the underlying structure can accommodate all kinds of parallelism an operator may offer.

This is a great point, I hadn't considered that. I like it.

I think the biggest concern is performance impact, which I suppose we won't really know until it's implemented and tested. Worst-case scenario, if it negatively impacts simple cases, then it could exist as a branch for people that need multithreading.

nopeslide commented on July 17, 2024

Is there a reason that the full set of jobs can't just be split among the threads when they are created?

This would make jobs and threads dependent on each other.
I prefer to divide job creation and job scheduling, so threading is completely optional regarding the creation of all data structures.
Later on, this would allow us to divide the frontend (the onnx parsing code) from the backend (the actual execution).

Is it more efficient to use a ringbuffer? If you have a batch size of M and requested N threads, then the i-th thread would do the inference on the non-inclusive range of elements i -- min(i+N, M). This is the approach I took with parallelizing the operators, which ensures that there are exactly as many jobs as there are threads, as this avoids the overhead related to creating additional jobs and having workers retrieve said jobs. Though, maybe that overhead is negligible.

Static ringbuffers are just neat for job dispatching: you have a mutex-protected ringbuffer into which you insert your jobs, and a fixed number of persistent threads that remove the jobs. Reading from an empty ringbuffer blocks, writing to a full ringbuffer blocks, an insertion unblocks readers, and a removal unblocks writers.
The scheduler simply inserts each job into the ringbuffer. To synchronize between nodes, it can just wait until all threads are blocked while reading.
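For reference, a minimal sketch of such a mutex-protected job ringbuffer using pthreads (not an actual implementation; the size and names are arbitrary). Readers block on an empty buffer, writers block on a full one, and each push/pop signals the other side:

#include <pthread.h>
#include <stddef.h>

#define RING_SIZE 64

struct job_ring {
    void           *jobs[RING_SIZE];
    size_t          head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty, not_full;
};

void ring_init(struct job_ring *r)
{
    r->head = r->tail = r->count = 0;
    pthread_mutex_init(&r->lock, NULL);
    pthread_cond_init(&r->not_empty, NULL);
    pthread_cond_init(&r->not_full, NULL);
}

void ring_push(struct job_ring *r, void *job)
{
    pthread_mutex_lock(&r->lock);
    while (r->count == RING_SIZE)                 /* full: writer blocks  */
        pthread_cond_wait(&r->not_full, &r->lock);
    r->jobs[r->tail] = job;
    r->tail = (r->tail + 1) % RING_SIZE;
    r->count++;
    pthread_cond_signal(&r->not_empty);           /* unblock one reader   */
    pthread_mutex_unlock(&r->lock);
}

void *ring_pop(struct job_ring *r)
{
    pthread_mutex_lock(&r->lock);
    while (r->count == 0)                         /* empty: reader blocks */
        pthread_cond_wait(&r->not_empty, &r->lock);
    void *job = r->jobs[r->head];
    r->head = (r->head + 1) % RING_SIZE;
    r->count--;
    pthread_cond_signal(&r->not_full);            /* unblock one writer   */
    pthread_mutex_unlock(&r->lock);
    return job;
}

Synchronizing between nodes (waiting until all workers are idle) would need an additional counter or condition on top of this sketch.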

nopeslide commented on July 17, 2024

I think the biggest concern is performance impact, which I suppose we won't really know until it's implemented and tested. Worst-case scenario, if it negatively impacts simple cases, then it could exist as a branch for people that need multithreading.

The impact should be minimal, if not non-existent.
If the jobs are large enough to be optimized by the compiler (at least a few loops), I expect no impact.

mdhimes commented on July 17, 2024

I see, thanks for the info. I'm not familiar with ringbuffers so I wasn't sure what impact that might have. If the impact is as minimal as you say, then it sounds like a great approach to me.

I'm not sure how much help I could be in coding it up (I'd have to go learn more about ringbuffers), but I'm happy to do some benchmarking once the method is implemented; I have a pretty wide range of hardware (~decade old server, ~7 year old consumer-grade workstation, ~4 year old i7 laptop, modern AMD EPYC server) to test it out.

nopeslide commented on July 17, 2024

I'm not sure how much help I could be in coding it up (I'd have to go learn more about ringbuffers), but I'm happy to do some benchmarking once the method is implemented; I have a pretty wide range of hardware (~decade old server, ~7 year old consumer-grade workstation, ~4 year old i7 laptop, modern AMD EPYC server) to test it out.

Nice to know, I could throw in some modern dual-socket Xeon monster (2x6x2 threads) optimized for IO.
Realistic numbers/benchmarks are hard to come by; maybe we need some kind of highscore board :D

alrevuelta commented on July 17, 2024

In my view, batch parallelism is a specific kind of operator parallelism.
The operator gets the data of multiple batches at once, and it's the operator's job to offload this work how it sees fit.
The problem I see is that, without this tiny, tiny (;D) interface change (turning a necessary void pointer into an array), you would need to build something that is not on top of connxr, but something that replaces each operator, the inference, and our contexts to accommodate efficient threading, so almost everything.

I might be missing something here, but I see batch parallelism as more straightforward. We have the following function:

inference(Onnx__ModelProto *model, Onnx__TensorProto **inputs, int nInputs)

Can't we just spawn different threads that call this function with different inputs?
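A sketch of what that could look like with pthread.h; the wrapper struct, the include path, and the inference() return type are assumptions here, and as noted above this only becomes safe once the single global context (#45) is fixed:

#include <pthread.h>
#include "onnx.pb-c.h"        /* assumed name/location of the protobuf-c generated types */

extern int inference(Onnx__ModelProto *model, Onnx__TensorProto **inputs,
                     int nInputs);               /* signature as quoted above, return type assumed */

struct inference_args {
    Onnx__ModelProto  *model;
    Onnx__TensorProto **inputs;
    int                nInputs;
};

static void *inference_thread(void *arg)
{
    struct inference_args *a = arg;
    inference(a->model, a->inputs, a->nInputs);  /* not thread-safe until #45 is fixed */
    return NULL;
}

#define N_THREADS 4

void parallel_inference(struct inference_args args[N_THREADS])
{
    pthread_t tid[N_THREADS];
    for (int t = 0; t < N_THREADS; t++)
        pthread_create(&tid[t], NULL, inference_thread, &args[t]);
    for (int t = 0; t < N_THREADS; t++)
        pthread_join(tid[t], NULL);
}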

alrevuelta commented on July 17, 2024

@mdhimes @nopeslide Perhaps a MULTITHREADING_ENABLED variable and some ifdef magic can satisfy all parties?
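For illustration, the switch could look something like this (all names are hypothetical, and the helpers stand for whatever job dispatch mechanism ends up being used):

#include <stddef.h>

struct job_ring;                                  /* e.g. the ringbuffer sketched above */
extern void ring_push(struct job_ring *r, void *job);
extern void ring_wait_idle(struct job_ring *r);   /* hypothetical per-node barrier */
extern void run_job(void *job);                   /* hypothetical single-job runner */

void dispatch_node_jobs(struct job_ring *ring, void **jobs, size_t n_jobs)
{
#ifdef MULTITHREADING_ENABLED
    for (size_t j = 0; j < n_jobs; j++)
        ring_push(ring, jobs[j]);                 /* hand jobs to persistent workers */
    ring_wait_idle(ring);                         /* wait until this node's jobs are done */
#else
    (void)ring;
    for (size_t j = 0; j < n_jobs; j++)
        run_job(jobs[j]);                         /* single-threaded path: no threads, no locks */
#endif
}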

nopeslide commented on July 17, 2024

@mdhimes @nopeslide Perhaps a MULTITHREADING_ENABLED variable and some ifdef magic can satisfy all parties?

Let's wait until the PR is ready; then we can see the code/performance impact and decide how to proceed.
I will do a PR for the build system first (#58 related).

nopeslide commented on July 17, 2024

I might be missing something here, but I see batch parallelism as more straightforward. We have the following function:

inference(Onnx__ModelProto *model, Onnx__TensorProto **inputs, int nInputs)

Can't we just spawn different threads that call this function with different inputs?

We could, but this would either need special input (not a simple onnx tensor) or preprocessing of the onnx tensor input (generating multiple tensors out of one).
I would like to keep connxr as close as possible to onnx, without any special treatment of anything, therefore letting the operator decide how to do its operations.
On the other hand, fair enough, this seems simpler, but I have the feeling it will hurt us in the long run, since it looks like special treatment which could easily be done outside of connxr and therefore has no place in connxr itself.

nopeslide commented on July 17, 2024

@mdhimes
I finished #57, which divides execution of an operator into a prepare stage and an execution stage, and would be a valid base for any "job based" threading attempts.
