Comments (9)
I believe the latency in this scenario depends directly on whether the internal arena is already populated with threads.
If async submits to the arena are frequent enough, they should keep all the threads inside the arena, so the latency should not be too big.
If the gaps between submits are less than 1ms, you can try this patch with increased block time: #1352.
I hope that with threads in place and task_arena::enqueue, the avg latency should be around 5-10us, not 50.
from onetbb.
OK, things are now looking great! I changed the warm-up code to use a barrier to ensure that all of the threads are running, and lowered the concurrency to the number of physical cores. Then I noticed that the longer task_group time seems to occur only for an arena thread that hasn't had a task_group used on it before. Once a thread is re-used, it gets fast. It seems to take less than 200ns to create a task_group, add 4 sub-tasks, and have one of the sub-tasks run (including book-keeping overhead).
Here is some sample output. Note that the latency of the root task is high, but that's fine since we launched a bunch and only have so many physical cores. The latency of the sub-tasks, once the root task is started, settles in to fast numbers. These seem like great results now, so closing this issue. Thank you for all the help!
~/tbb_sched/cmake-build-release$ ./tbb_sched 100 4 2 1 | tail -20
Thread 140541798905408 timeToSubmit=90 timeToSchedule=31850 total=31940
Thread 140541798905408 timeToSubmit=110 timeToSchedule=260 total=370
Thread 140541798905408 timeToSubmit=100 timeToSchedule=190 total=290
Thread 140541798905408 timeToSubmit=110 timeToSchedule=110 total=220
Thread 140541798905408 timeToSubmit=120 timeToSchedule=30 total=150
Thread 140541777913408 timeToSubmit=30 timeToSchedule=33920 total=33950
Thread 140541777913408 timeToSubmit=130 timeToSchedule=320 total=450
Thread 140541777913408 timeToSubmit=140 timeToSchedule=230 total=370
Thread 140541777913408 timeToSubmit=150 timeToSchedule=140 total=290
Thread 140541777913408 timeToSubmit=160 timeToSchedule=40 total=200
Thread 140541756921408 timeToSubmit=40 timeToSchedule=33889 total=33929
Thread 140541756921408 timeToSubmit=100 timeToSchedule=290 total=390
Thread 140541756921408 timeToSubmit=120 timeToSchedule=200 total=320
Thread 140541756921408 timeToSubmit=140 timeToSchedule=110 total=250
Thread 140541756921408 timeToSubmit=150 timeToSchedule=30 total=180
Thread 140541874198080 timeToSubmit=80 timeToSchedule=30960 total=31040
Thread 140541874198080 timeToSubmit=130 timeToSchedule=320 total=450
Thread 140541874198080 timeToSubmit=120 timeToSchedule=230 total=350
Thread 140541874198080 timeToSubmit=140 timeToSchedule=130 total=270
Thread 140541874198080 timeToSubmit=140 timeToSchedule=40 total=180
Hi @yonik, I'm not sure I 100% understood your example, but as far as I can see you want to run a subset of tasks from a thread that is not going to participate in this work.
Serial submission can actually take a significant amount of time, since a single thread submits all the tasks.
Our parallel algorithms use a recursive decomposition pattern, where threads submit no more than log2(N) tasks per tree branch.
So the quickest way to submit a bunch of parallel tasks is to use something like tg.run([&] { tbb::parallel_for(); }).
If there is no point where the thread will call tg.wait(), then task_group::enqueue should be used.
> Hi @yonik, I'm not sure I 100% understood your example, but as far as I can see you want to run a subset of tasks from a thread that is not going to participate in this work.
Yeah, I think it really comes down to the time for default work-stealing to occur. The fact that I launched multiple tasks in my example code isn't actually very relevant. I'm looking for a way to tell TBB "I just created a task, but this thread won't be executing it, so please move/steal this task ASAP". It's almost like work-requesting rather than work-stealing.
> If there is no point where the thread will call tg.wait(), then task_group::enqueue should be used.
I already tried a bunch of variants that I didn't include here: flow graphs, task groups, and arenas all showed about the same latency before work was stolen and started on a different thread.
> I hope that with threads in place and task_arena::enqueue the avg latency should be around 5-10us, not 50.
Yes, that looks to be the case.
I changed the warmup code in the test code above to this:
// create enough tasks that multiple other threads will be spun up.
int res = 0;
for (int i = 0; i < 100000; i++) {
    tg.run([&]{ res++; });
}
tg.wait();
Results of task_group::run (I also started recording which thread did the processing):
Time until first submit=880 this thread is 140097750809856
latency=2440 threadId=140096200492608
latency=3615 threadId=140096183412288
latency=3400 threadId=140097667135040
latency=2480 threadId=140096018372160
latency=3665 threadId=140097651365440
latency=3840 threadId=140096200492608
latency=3740 threadId=140096018372160
latency=3610 threadId=140097667135040
latency=4620 threadId=140096192095808
latency=4140 threadId=140097667135040
Results using arena.enqueue (it does look faster than task_group now):
Time until first submit=910 this thread is 140192042380544
latency=9270 threadId=140190294070848
latency=2530 threadId=140190410532416
latency=3120 threadId=140191946663488
latency=4147 threadId=140190470891072
latency=2760 threadId=140190475146816
latency=2020 threadId=140190311130688
latency=2480 threadId=140190338565696
latency=1410 threadId=140191982396992
latency=2230 threadId=140190442260032
latency=1700 threadId=140190208636480
From an implementation POV, what makes arena.enqueue() that much faster than task_group.run()?
Thanks for helping me understand how this all works!
task_group::run will place the task into the thread's local queue, so other threads can get it only through stealing (which is still efficient).
task_arena::enqueue will place the task into a shared queue (which is still scalable), so other threads will find it faster.
So it will help to reduce latency in your case.
Is there anything else I can help you with, or can we close this issue?
> Is there anything else I can help you with, or can we close this issue?
I've been experimenting a little more and noticed some interesting things about task_groups. Once tasks in a task_group exist, they are very quick to start running (presumably because of the local queue you mentioned). But some combination of the task_group creation and the first task submit is slower (not horrible, but around 10us).
So the question is: assuming one has a root task launched via task_arena.enqueue() and that task runs multiple tasks via a task_group, is there a best practice or way to lower the task_group creation time? Assuming the same arena will always be used, would caching task_group objects speed things up?
EDIT: see the last message! It seems like it's just the first time a task_group is used on a thread.
I'll attach some test code once I clean it up.
Test code:
#include <barrier>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <string>
#include <thread>
#include <vector>
#include <oneapi/tbb/task_arena.h>
#include <oneapi/tbb/task_group.h>

tbb::task_arena arena;
int method = 0; // 0=arena enqueue, 1=new task group

class Msg {
public:
    std::vector<Msg> children;
    std::chrono::time_point<std::chrono::high_resolution_clock> preSubmitTime;  // time when we want to submit a new task
    std::chrono::time_point<std::chrono::high_resolution_clock> postSubmitTime; // time right after we submitted a task
    std::chrono::time_point<std::chrono::high_resolution_clock> processTime;    // time when process was finally called
    std::thread::id threadId;

    void startTimer() {
        preSubmitTime = std::chrono::high_resolution_clock::now();
    }

    void submitted() {
        postSubmitTime = std::chrono::high_resolution_clock::now();
    }

    void process() {
        processTime = std::chrono::high_resolution_clock::now();
        threadId = std::this_thread::get_id();
        switch (method) {
            case 0:
                submitChildrenArena();
                break;
            case 1:
                submitChildrenNewTaskGroup();
                break;
            default:
                std::cerr << "Invalid method " << method << std::endl;
                exit(1);
        }
    }

    void submitChildrenArena() {
        // record the time before submission so its cost is included.
        for (auto& child : children) {
            child.startTimer();
        }
        for (auto& child : children) {
            arena.enqueue([&child] {
                child.process();
            });
            child.submitted();
        }
    }

    void submitChildrenNewTaskGroup() {
        // record the time before task_group creation so its cost is included.
        for (auto& child : children) {
            child.startTimer();
        }
        {
            tbb::task_group tg;
            for (auto& child : children) {
                tg.run([&child] {
                    child.process();
                });
                child.submitted();
            }
            tg.wait();
        }
    }

    // recursively fill out subtasks
    void addSubTasks(int numSubTasks, int depthLeft) {
        if (depthLeft == 0 || numSubTasks == 0) {
            return;
        }
        children.resize(numSubTasks);
        for (auto& child : children) {
            child.addSubTasks(numSubTasks, depthLeft - 1);
        }
    }

    void print(int level = 0) {
        for (int i = 0; i < level; i++) {
            std::cout << " ";
        }
        auto timeToSubmit = std::chrono::duration_cast<std::chrono::nanoseconds>(postSubmitTime - preSubmitTime).count();
        auto timeToSchedule = std::chrono::duration_cast<std::chrono::nanoseconds>(processTime - postSubmitTime).count();
        std::cout << "Thread " << threadId << " timeToSubmit=" << timeToSubmit
                  << " timeToSchedule=" << timeToSchedule
                  << " total=" << timeToSubmit + timeToSchedule << std::endl;
        for (auto& child : children) {
            child.print(level + 1);
        }
    }
};

int main(int argc, char** argv) {
    int numRootTasks = 4;
    int numChildTasks = 4; // per root task
    int depth = 2;
    if (argc <= 1) {
        std::cout << "Usage: " << argv[0] << " numRootTasks numChildTasks depth childSubmitMethod" << std::endl;
        std::cout << "\t\tchildSubmitMethod: 0=arena enqueue, 1=new task group" << std::endl;
        return 1;
    }
    if (argc > 1) {
        numRootTasks = std::stoi(argv[1]);
    }
    if (argc > 2) {
        numChildTasks = std::stoi(argv[2]);
    }
    if (argc > 3) {
        depth = std::stoi(argv[3]);
    }
    if (argc > 4) {
        method = std::stoi(argv[4]);
    }

    std::vector<Msg> rootTasks(numRootTasks);
    for (auto& rootTask : rootTasks) {
        rootTask.addSubTasks(numChildTasks, depth - 1);
    }

    // Try to get all threads to spin up before we continue. This seems to work better than the commented-out code below.
    std::barrier start_barrier(arena.max_concurrency()); // not +1, I think, because one slot is reserved for a master thread.
    for (int i = 0; i < arena.max_concurrency(); i++) {
        arena.enqueue([&] { start_barrier.arrive_and_wait(); });
    }
    start_barrier.arrive_and_wait();

    /*
    // create enough tasks that multiple other threads will be spun up.
    int res = 0;
    arena.execute([&] {
        oneapi::tbb::task_group tg;
        for (int i = 0; i < 100000; i++) {
            tg.run([&] { res++; });
        }
        tg.wait();
    });
    */

    // Independent root tasks submitted to the arena.
    for (auto& rootTask : rootTasks) {
        rootTask.startTimer();
        arena.enqueue([&rootTask] {
            rootTask.process();
        });
        rootTask.submitted();
    }

    // Sleep for a bit to allow the tasks to finish.
    std::this_thread::sleep_for(std::chrono::seconds(1));
    for (auto& rootTask : rootTasks) {
        rootTask.print();
    }
    return 0;
}
Timings for using arena.enqueue for sub-tasks:
~/tbb_sched/cmake-build-release$ ./tbb_sched 4 4 2 0
Thread 140207543887424 timeToSubmit=7730 timeToSchedule=-1525 total=6205
Thread 140207164720704 timeToSubmit=4180 timeToSchedule=898 total=5078
Thread 140207164720704 timeToSubmit=5800 timeToSchedule=618 total=6418
Thread 140207539689024 timeToSubmit=6500 timeToSchedule=490 total=6990
Thread 140207273748032 timeToSubmit=7220 timeToSchedule=950 total=8170
Thread 140207160522304 timeToSubmit=1490 timeToSchedule=1275 total=2765
Thread 140207294740032 timeToSubmit=1580 timeToSchedule=965 total=2545
Thread 140207277946432 timeToSubmit=2560 timeToSchedule=1410 total=3970
Thread 140207315732032 timeToSubmit=3370 timeToSchedule=910 total=4280
Thread 140207290541632 timeToSubmit=4820 timeToSchedule=920 total=5740
Thread 140207177315904 timeToSubmit=2070 timeToSchedule=1055 total=3125
Thread 140207539689024 timeToSubmit=1760 timeToSchedule=710 total=2470
Thread 140207298938432 timeToSubmit=2500 timeToSchedule=1280 total=3780
Thread 140207164720704 timeToSubmit=3080 timeToSchedule=868 total=3948
Thread 140207194109504 timeToSubmit=3860 timeToSchedule=640 total=4500
Thread 140207324128832 timeToSubmit=360 timeToSchedule=1105 total=1465
Thread 140207535490624 timeToSubmit=2110 timeToSchedule=1250 total=3360
Thread 140207311533632 timeToSubmit=3520 timeToSchedule=510 total=4030
Thread 140207315732032 timeToSubmit=4100 timeToSchedule=350 total=4450
Thread 140207273748032 timeToSubmit=4370 timeToSchedule=510 total=4880
Timings for using task_group.run() for sub-tasks:
~/tbb_sched/cmake-build-release$ ./tbb_sched 4 4 2 1
Thread 139911623673408 timeToSubmit=930 timeToSchedule=1912 total=2842
Thread 139910962988608 timeToSubmit=12420 timeToSchedule=1720 total=14140
Thread 139911585887808 timeToSubmit=12750 timeToSchedule=1779 total=14529
Thread 139911623673408 timeToSubmit=12860 timeToSchedule=1210 total=14070
Thread 139911623673408 timeToSubmit=12920 timeToSchedule=630 total=13550
Thread 139911598483008 timeToSubmit=1860 timeToSchedule=1092 total=2952
Thread 139911644665408 timeToSubmit=7290 timeToSchedule=1375 total=8665
Thread 139911611078208 timeToSubmit=7550 timeToSchedule=1370 total=8920
Thread 139910954591808 timeToSubmit=7660 timeToSchedule=1940 total=9600
Thread 139911598483008 timeToSubmit=7690 timeToSchedule=520 total=8210
Thread 139910950393408 timeToSubmit=1080 timeToSchedule=882 total=1962
Thread 139910841366080 timeToSubmit=7630 timeToSchedule=1630 total=9260
Thread 139910967187008 timeToSubmit=7810 timeToSchedule=1685 total=9495
Thread 139910950393408 timeToSubmit=7860 timeToSchedule=1290 total=9150
Thread 139910950393408 timeToSubmit=7900 timeToSchedule=630 total=8530
Thread 139911606879808 timeToSubmit=570 timeToSchedule=1422 total=1992
Thread 139911627871808 timeToSubmit=10180 timeToSchedule=1430 total=11610
Thread 139911606879808 timeToSubmit=10350 timeToSchedule=1470 total=11820
Thread 139911606879808 timeToSubmit=10530 timeToSchedule=1120 total=11650
Thread 139911606879808 timeToSubmit=10590 timeToSchedule=510 total=11100
Nested task groups:
~/tbb_sched/cmake-build-release$ ./tbb_sched 4 1 4 1
Thread 140009002194496 timeToSubmit=2160 timeToSchedule=820 total=2980
Thread 140008977004096 timeToSubmit=6810 timeToSchedule=1185 total=7995
Thread 140008981202496 timeToSubmit=7720 timeToSchedule=1275 total=8995
Thread 140008526681664 timeToSubmit=7400 timeToSchedule=940 total=8340
Thread 140008547673664 timeToSubmit=1200 timeToSchedule=490 total=1690
Thread 140008547673664 timeToSubmit=7320 timeToSchedule=250 total=7570
Thread 140009002194496 timeToSubmit=390 timeToSchedule=170 total=560
Thread 140009002194496 timeToSubmit=1760 timeToSchedule=170 total=1930
Thread 140008985400896 timeToSubmit=460 timeToSchedule=820 total=1280
Thread 140008539276864 timeToSubmit=6800 timeToSchedule=1670 total=8470
Thread 140008539276864 timeToSubmit=9010 timeToSchedule=740 total=9750
Thread 140008985400896 timeToSubmit=350 timeToSchedule=500 total=850
Thread 140008964408896 timeToSubmit=1450 timeToSchedule=1210 total=2660
Thread 140008964408896 timeToSubmit=7200 timeToSchedule=350 total=7550
Thread 140008964408896 timeToSubmit=380 timeToSchedule=140 total=520
Thread 140008964408896 timeToSubmit=210 timeToSchedule=60 total=270