Update: I added a fences branch which contains a possible fix.
This repository contains a reproduction of a hang when using multiple kernels running concurrently and coordinating using pipes. It was adapted from the example project at oneAPI-samples/DirectProgramming/DPC++FPGA/Tutorials/Features/pipes
To reproduce, build the project for the Arria 10 FPGA hardware (I used devcloud) and run with ./pipes.fpga
The project uses multiple kernels running on the FPGA to coordinate events between the host and a long-running persistent kernel and to achieve exact timing on the device.
You can optionally provide an fmax argument on the command line to set the clock frequency used for timing (use the Kernel fmax from pipes.prj/acl_quartus_report.txt).
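The sample output suggests that the fmax value is given in MHz and converted to clock cycles per second (`fmax: 100` becomes `fmax_sec: 100000000`). A minimal sketch of that conversion follows. The helper names, the default of 100, and the acceptance of fractional values are assumptions for illustration and are not taken from the repository's source.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdlib>

// Assumed default, inferred from the "fmax: 100" line printed when the
// program is run without arguments.
constexpr double kDefaultFmaxMhz = 100.0;

// Hypothetical helper: returns the fmax in MHz from argv[1], or the default
// when the argument is absent or is not a positive number.
double ParseFmaxMhz(int argc, const char *const argv[]) {
  if (argc < 2) return kDefaultFmaxMhz;
  char *end = nullptr;
  const double value = std::strtod(argv[1], &end);
  return (end != argv[1] && value > 0.0) ? value : kDefaultFmaxMhz;
}

// Hypothetical helper: converts MHz to clock cycles per second.
std::uint64_t CyclesPerSecond(double fmax_mhz) {
  assert(fmax_mhz > 0.0);
  return static_cast<std::uint64_t>(std::llround(fmax_mhz * 1e6));
}
```

With these helpers, a kernel that must wait one second would count `CyclesPerSecond(fmax)` clock cycles, which is why the value passed on the command line should match the fmax reported for the compiled design.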
The expected output is:
./pipes.fpga
fmax: 100
fmax_sec: 100000000
0: 0.670524
1: 0.670489
2: 0.670543
3: 0.670472
4: 0.670468
5: 0.670567
6: 0.670436
7: 0.670488
8: 0.670475
9: 0.67051
Sending shutdown message to persistent kernel
Waiting for persistent kernel shutdown
Persistent kernel shutdown
0
0
0
0
0
0
0
0
Freeing memory
Success
When the memory access on line 101 is uncommented, the kernel hangs and produces output similar to the following:
./pipes.fpga
fmax: 100
fmax_sec: 100000000
0: 0.657343
1: 1.3146
2: 0.657351
3: 1.31451
4: 0.657344
5: 0.657354
6: 1.31454
At this point the program must be terminated with Ctrl-C.
It appears that the additional logic of the memory access changes the timing and that some pipe messages are missed.
The original documentation for the project is reproduced below.
This FPGA tutorial shows how to use pipes to transfer data between kernels.
Documentation: The oneAPI DPC++ FPGA Optimization Guide provides comprehensive instructions for targeting FPGAs through DPC++. The oneAPI Programming Guide is a general resource for target-independent DPC++ programming.
Optimized for | Description |
---|---|
OS | Linux* Ubuntu* 18.04; Windows* 10 |
Hardware | Intel® Programmable Acceleration Card (PAC) with Intel Arria® 10 GX FPGA; Intel® Programmable Acceleration Card (PAC) with Intel Stratix® 10 SX FPGA |
Software | Intel® oneAPI DPC++ Compiler (Beta); Intel® FPGA Add-On for oneAPI Base Toolkit |
What you will learn | The basics of the DPC++ pipes extension for FPGA; how to declare and use pipes in a DPC++ program |
Time to complete | 15 minutes |
Notice: Limited support in Windows*; compiling for FPGA hardware is not supported in Windows*
This tutorial demonstrates how a kernel in a DPC++ FPGA program transfers data to or from another kernel using the pipe abstraction.
The primary goal of pipes is to allow concurrent execution of kernels that need to exchange data.
A pipe is a FIFO data structure connecting two endpoints that communicate using the pipe's `read` and `write` operations. An endpoint can be either a kernel or an external I/O on the FPGA. Therefore, there are three types of pipes:
- kernel-kernel
- kernel-I/O
- I/O-kernel
This tutorial focuses on kernel-kernel pipes, but the concepts discussed here apply to other kinds of pipes as well.
The `read` and `write` operations have two variants:
- Blocking variant: Blocking operations may not return immediately, but are always successful.
- Non-blocking variant: Non-blocking operations take an extra boolean parameter that is set to `true` if the operation happened successfully.
Data flows in a single direction inside pipes. In other words, for a pipe `P` and two kernels using `P`, one of the kernels exclusively performs `write` operations to `P` while the other kernel exclusively performs `read` operations from `P`. Bidirectional communication can be achieved using two pipes.
Each pipe has a configurable `capacity` parameter describing the number of `write` operations that may be performed without any `read` operations being performed. For example, consider a pipe `P` with capacity 3, and two kernels `K1` and `K2` using `P`. Assume that `K1` performed the following sequence of operations:
`write(1)`, `write(2)`, `write(3)`
In this situation, the pipe is full, because three (the `capacity` of `P`) `write` operations were performed without any `read` operation. A `read` must occur before any other `write` is allowed.
If a `write` is attempted to a full pipe, one of two behaviors occurs:
- If the operation is non-blocking, it returns immediately and its boolean parameter is set to `false`. The `write` does not have any effect.
- If the operation is blocking, it does not return until a `read` is performed by the other endpoint. Once the `read` is performed, the `write` takes place.
The blocking and non-blocking `read` operations have analogous behaviors when the pipe is empty.
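The capacity and non-blocking rules above can be sketched with a small host-side model in plain C++. `ModelPipe` and `ReplayFullPipeExample` are hypothetical names introduced here. This is not the DPC++ pipe class (whose operations are static members and run on the device); it only mirrors the success-flag behavior described in the text.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical host-side model of a pipe with non-blocking operations.
template <typename T, std::size_t kCapacity>
class ModelPipe {
 public:
  // Non-blocking write: has no effect and sets success to false when full.
  void write(const T &value, bool &success) {
    success = fifo_.size() < kCapacity;
    if (success) fifo_.push_back(value);
  }

  // Non-blocking read: sets success to false and returns T{} when empty.
  T read(bool &success) {
    success = !fifo_.empty();
    if (!success) return T{};
    T value = fifo_.front();
    fifo_.pop_front();
    return value;
  }

 private:
  std::deque<T> fifo_;
};

// Replays the capacity-3 example from the text. Returns true if the fourth
// write is rejected while the pipe is full and accepted after one read.
bool ReplayFullPipeExample() {
  ModelPipe<int, 3> p;
  bool ok = false;
  for (int v = 1; v <= 3; ++v) {
    p.write(v, ok);
    assert(ok);  // the first three writes fit within the capacity
  }
  p.write(4, ok);  // the pipe is full, so this write has no effect
  const bool rejected_when_full = !ok;
  const int first = p.read(ok);  // frees one slot; FIFO order yields 1
  assert(ok && first == 1);
  p.write(4, ok);  // now there is room again
  return rejected_when_full && ok;
}
```

A blocking operation corresponds to retrying the non-blocking one until its flag becomes `true`, which is why a blocking `write` to a full pipe cannot complete until the other endpoint performs a `read`.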
In DPC++, pipes are defined as a class with static members. To declare a pipe that transfers integer data and has `capacity=4`, use a type alias:
using ProducerToConsumerPipe = pipe<  // Defined in the DPC++ headers.
    class ProducerConsumerPipe,       // An identifier for the pipe.
    int,                              // The type of data in the pipe.
    4>;                               // The capacity of the pipe.
The `class ProducerConsumerPipe` template parameter is important to the uniqueness of the pipe. This class need not be defined, but must be distinct for each pipe. Consider another type alias with the exact same parameters:
using ProducerToConsumerPipe2 = pipe<  // Defined in the DPC++ headers.
    class ProducerConsumerPipe,        // An identifier for the pipe.
    int,                               // The type of data in the pipe.
    4>;                                // The capacity of the pipe.
The uniqueness of a pipe is derived from a combination of all three template parameters. Since `ProducerToConsumerPipe` and `ProducerToConsumerPipe2` have the same template parameters, they define the same pipe.
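The aliasing rule is ordinary C++ template behavior and can be checked without an FPGA toolchain. The sketch below uses a hypothetical stand-in template, `NamedModelPipe`, in place of the DPC++ `pipe` class. The `Storage` member and the `AnotherPipeId` identifier are likewise assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for the DPC++ pipe class. It only shows how the
// three template parameters determine the identity of the type and of its
// static state.
template <class Name, typename T, std::size_t kCapacity>
struct NamedModelPipe {
  static T &Storage() {
    static T value{};  // one instance per distinct template instantiation
    return value;
  }
};

// The identifier classes are declared by naming them; they are never defined.
using PipeAlias1 = NamedModelPipe<class ProducerConsumerPipe, int, 4>;
using PipeAlias2 = NamedModelPipe<class ProducerConsumerPipe, int, 4>;
using OtherPipe = NamedModelPipe<class AnotherPipeId, int, 4>;

static_assert(std::is_same<PipeAlias1, PipeAlias2>::value,
              "identical template parameters name the same pipe");
static_assert(!std::is_same<PipeAlias1, OtherPipe>::value,
              "a different identifier class names a different pipe");

// Writing through one alias is visible through the other, while the pipe
// with a different identifier is unaffected.
void DemonstrateSharedState() {
  PipeAlias1::Storage() = 7;
  assert(PipeAlias2::Storage() == 7);
  assert(OtherPipe::Storage() == 0);
}
```

In practice this means that two pipes intended to be independent must use distinct identifier classes, even when their data type and capacity are the same.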
This code sample defines a `Consumer` and a `Producer` kernel connected by the pipe `ProducerToConsumerPipe`. Kernels use the `ProducerToConsumerPipe::write` and `ProducerToConsumerPipe::read` methods for communication.
The `Producer` kernel reads integers from the global memory and writes those integers into `ProducerToConsumerPipe`, as shown in the following code snippet:
void Producer(queue &q, buffer<int, 1> &input_buffer) {
  std::cout << "Enqueuing producer...\n";
  auto e = q.submit([&](handler &h) {
    auto input_accessor = input_buffer.get_access<access::mode::read>(h);
    auto num_elements = input_buffer.get_count();
    h.single_task<ProducerTutorial>([=]() {
      for (size_t i = 0; i < num_elements; ++i) {
        ProducerToConsumerPipe::write(input_accessor[i]);
      }
    });
  });
}
The `Consumer` kernel reads integers from `ProducerToConsumerPipe`, processes the integers (`ConsumerWork(i)`), and writes the result into the global memory.
void Consumer(queue &q, buffer<int, 1> &output_buffer) {
  std::cout << "Enqueuing consumer...\n";
  auto e = q.submit([&](handler &h) {
    auto output_accessor = output_buffer.get_access<access::mode::discard_write>(h);
    size_t num_elements = output_buffer.get_count();
    h.single_task<ConsumerTutorial>([=]() {
      for (size_t i = 0; i < num_elements; ++i) {
        int input = ProducerToConsumerPipe::read();
        int answer = ConsumerWork(input);
        output_accessor[i] = answer;
      }
    });
  });
}
NOTE: The `read` and `write` operations used are blocking. If `ConsumerWork` is an expensive operation, then `Producer` might fill `ProducerToConsumerPipe` faster than `Consumer` can read from it, causing `Producer` to block occasionally.
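The effect described in the note can be illustrated with a simplified, single-threaded cycle model in plain C++. This is only a sketch: the one-write-per-cycle producer, the `work_cycles` parameter, the `SimulatePipe` function, and the body of `ConsumerWork` are assumptions made for illustration and do not reflect how the FPGA hardware schedules the kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Placeholder for the tutorial's ConsumerWork; its real body is not shown
// in the text, so doubling the input is an assumption.
int ConsumerWork(int i) { return i * 2; }

struct SimResult {
  std::vector<int> output;      // values produced by the consumer, in order
  std::size_t producer_stalls;  // cycles in which the producer was blocked
};

// Simplified model: the producer tries to write one element per cycle, and
// the consumer needs work_cycles cycles for each element it reads.
SimResult SimulatePipe(const std::vector<int> &input, std::size_t capacity,
                       std::size_t work_cycles) {
  assert(capacity >= 1);
  std::deque<int> pipe;
  SimResult result{{}, 0};
  std::size_t next = 0;  // index of the next input element to write
  std::size_t busy = 0;  // cycles until the consumer can read again
  while (result.output.size() < input.size()) {
    // Consumer: read a new element when idle and data is available.
    if (busy == 0 && !pipe.empty()) {
      result.output.push_back(ConsumerWork(pipe.front()));
      pipe.pop_front();
      busy = work_cycles;
    }
    // Producer: a blocking write stalls while the pipe is full.
    if (next < input.size()) {
      if (pipe.size() < capacity) {
        pipe.push_back(input[next++]);
      } else {
        ++result.producer_stalls;
      }
    }
    if (busy > 0) --busy;
  }
  return result;
}
```

In this model, a consumer that finishes each element in one cycle never stalls the producer, whereas a slower consumer (for example `work_cycles = 3` with capacity 4) lets the pipe fill up and produces a nonzero stall count. The output values are the same in both cases, because blocking operations never drop data.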
- The basics of the DPC++ pipes extension for FPGA
- How to declare and use pipes in a DPC++ program
This code sample is licensed under the MIT license.
The included header `dpc_common.hpp` is located at `%ONEAPI_ROOT%\dev-utilities\latest\include` on your development system.
If running a sample in the Intel DevCloud, remember that you must specify the compute node (FPGA) as well as whether to run in batch or interactive mode. For more information see the Intel® oneAPI Base Toolkit Get Started Guide (https://devcloud.intel.com/oneapi/get-started/base-toolkit/).
When compiling for FPGA hardware, it is recommended to increase the job timeout to 12h.
- Generate the `Makefile` by running `cmake`.

      mkdir build
      cd build

  To compile for the Intel® PAC with Intel Arria® 10 GX FPGA, run `cmake` using the command:

      cmake ..

  Alternatively, to compile for the Intel® PAC with Intel Stratix® 10 SX FPGA, run `cmake` using the command:

      cmake .. -DFPGA_BOARD=intel_s10sx_pac:pac_s10

- Compile the design through the generated `Makefile`. The following build targets are provided, matching the recommended development flow:
  - Compile for emulation (fast compile time, targets emulated FPGA device):

        make fpga_emu

  - Generate the optimization report:

        make report

  - Compile for FPGA hardware (longer compile time, targets FPGA device):

        make fpga

- (Optional) As the above hardware compile may take several hours to complete, an Intel® PAC with Intel Arria® 10 GX FPGA precompiled binary can be downloaded here.
Note: `cmake` is not yet supported on Windows. A build.ninja file is provided instead.
- Enter the source file directory.

      cd src

- Compile the design. The following build targets are provided, matching the recommended development flow:
  - Compile for emulation (fast compile time, targets emulated FPGA device):

        ninja fpga_emu

  - Generate the optimization report:

        ninja report

    If you are targeting Intel® PAC with Intel Stratix® 10 SX FPGA, instead use:

        ninja report_s10_pac

  - Compiling for FPGA hardware is not yet supported on Windows.

You can compile and run this tutorial in the Eclipse* IDE (in Linux*) and the Visual Studio* IDE (in Windows*). For instructions, refer to the following link: Intel® oneAPI DPC++ FPGA Workflows on Third-Party IDEs
Locate `report.html` in the `pipes_report.prj/reports/` or `pipes_s10_pac_report.prj/reports/` directory. Open the report in any of Chrome*, Firefox*, Edge*, or Internet Explorer*.
Navigate to the "System Viewer" to visualize the structure of the kernel system. Identify the pipe connecting the two kernels.
- Run the sample on the FPGA emulator (the kernel executes on the CPU):

      ./pipes.fpga_emu        (Linux)
      pipes.fpga_emu.exe      (Windows)

- Run the sample on the FPGA device:

      ./pipes.fpga            (Linux)

Input Array Size: 1024
Enqueuing producer...
Enqueuing consumer...
PASSED: The results are correct