HLS-Project

Trading System: FPGA-Accelerated Arbitrage Strategies

This repository provides an FPGA-accelerated trading system utilizing HLS streams to implement two arbitrage strategies: Latency Arbitrage and Statistical Arbitrage. The system is optimized for low-latency trading environments, leveraging the speed and parallelism of FPGAs.

High-frequency trading (HFT) systems profit from short-lived price discrepancies, so they must execute trades within microseconds. This system implements the Latency Arbitrage and Statistical Arbitrage strategies through High-Level Synthesis (HLS) and relies on the FPGA's parallel processing to keep execution latency low.

Features

Latency Arbitrage: Executes sell orders when market price falls below a defined threshold with sufficient volume.

Statistical Arbitrage: Executes buy orders when market price exceeds a threshold and volume conditions are met.

FPGA-Accelerated: Uses HLS streams for parallel processing to achieve ultra-low-latency performance.

Configurable Strategies: Dynamically enable/disable arbitrage strategies using a control flag.

Testbench: Includes a testbench to simulate and verify system functionality with predefined market data.

System Architecture

The trading system consists of three major components:

Latency Arbitrage Strategy:

Monitors incoming market data and checks whether the price falls below a defined threshold. If the conditions are met, a sell order is generated.

Statistical Arbitrage Strategy:

Monitors incoming market data and checks whether the price exceeds a defined threshold. If the conditions are met, a buy order is generated.

Trading System Function:

Controls the flow of data for both arbitrage strategies and allows each strategy to be enabled or disabled through control flags. Streams carry the data between stages to keep latency low.

Code Structure

trading_system.cpp

This file contains the implementation of the Latency Arbitrage and Statistical Arbitrage strategies, as well as the core trading_system function.

latency_arbitrage: Generates a sell order if market data price is below a threshold.

statistical_arbitrage: Generates a buy order if market data price exceeds a threshold.

trading_system: Controls data flow and decision-making for both strategies, based on control flags.

trading_system.h

This header file defines the data structures and function prototypes used in the system, such as:

market_data_t: Structure for market data (price and volume).

order_t: Structure for buy/sell orders.

order_type_t: Enum defining buy and sell order types.

Function prototype for the trading_system.

trading_system_TB.cpp

This testbench simulates and verifies the system. It pushes test data into the input streams, runs the trading system, and prints the generated orders, for example:

Generated Latency Arbitrage Order: Price = 100, Volume = 180, Type = Buy

Generated Statistical Arbitrage Order: Price = 130, Volume = 170, Type = Buy

How It Works

Latency Arbitrage

The latency_arbitrage function checks if the incoming market data's price is below a threshold and if the volume is sufficient. If both conditions are satisfied, it generates a sell order.

Statistical Arbitrage

The statistical_arbitrage function checks if the incoming market data's price is above a threshold and if the volume is sufficient. If both conditions are met, it generates a buy order.

Trading System Control

The trading_system function manages both arbitrage strategies using a control flag of type ap_uint<2>. Each bit of the flag enables or disables one arbitrage strategy.
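The control flow described above can be sketched in plain C++. This is only an illustration, not the repository's code: std::queue stands in for hls::stream, a uint8_t stands in for ap_uint<2>, and the thresholds, the field layout, and the bit-to-strategy mapping are all assumptions.

```cpp
#include <cstdint>
#include <queue>

// Hypothetical stand-ins for the types declared in trading_system.h.
enum order_type_t { BUY, SELL };
struct market_data_t { int price; int volume; };
struct order_t { int price; int volume; order_type_t type; };

// Assumed thresholds; the real values are not stated in this README.
const int LOW_PRICE = 110;
const int HIGH_PRICE = 120;
const int MIN_VOLUME = 100;

// Sell when the price is below the threshold and the volume is sufficient.
bool latency_arbitrage(const market_data_t& md, order_t& order) {
    if (md.price < LOW_PRICE && md.volume >= MIN_VOLUME) {
        order.price = md.price;
        order.volume = md.volume;
        order.type = SELL;
        return true;
    }
    return false;
}

// Buy when the price is above the threshold and the volume is sufficient.
bool statistical_arbitrage(const market_data_t& md, order_t& order) {
    if (md.price > HIGH_PRICE && md.volume >= MIN_VOLUME) {
        order.price = md.price;
        order.volume = md.volume;
        order.type = BUY;
        return true;
    }
    return false;
}

// Assumed mapping: bit 0 of ctrl enables latency arbitrage,
// bit 1 enables statistical arbitrage.
void trading_system(std::queue<market_data_t>& in,
                    std::queue<order_t>& latency_orders,
                    std::queue<order_t>& statistical_orders,
                    uint8_t ctrl) {
    while (!in.empty()) {
        market_data_t md = in.front();
        in.pop();
        order_t order = {0, 0, BUY};
        if ((ctrl & 0x1) && latency_arbitrage(md, order)) {
            latency_orders.push(order);
        }
        if ((ctrl & 0x2) && statistical_arbitrage(md, order)) {
            statistical_orders.push(order);
        }
    }
}
```

Calling trading_system with ctrl set to 0x1 or 0x2 runs only one strategy; 0x3 runs both on every market data item.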

HLS-based CRC Computation Module

This section covers the development, verification, and performance analysis of a High-Level Synthesis (HLS) based CRC (Cyclic Redundancy Check) computation module designed for FPGA implementation. The module is pipelined to improve throughput and efficiency and is designed to comply with the AXI4-Stream protocol.

Introduction

Cyclic Redundancy Check (CRC) is a popular method for detecting errors in digital data. It is widely used to ensure the integrity of data in digital networks and storage devices. CRC computation involves generating a short, fixed-size binary sequence, known as a checksum, for each block of data sent or stored. The checksum is calculated based on the polynomial division of the data's binary representation, and the same polynomial is used to verify the integrity of the data at the receiving end. If the receiving end re-computes the CRC and finds a different checksum, it indicates that the data has been altered.

Significance and Applications

CRCs are crucial in applications where data integrity is paramount, including in telecommunications, networking (Ethernet, WiFi), storage (hard disks, SSDs), and in the transport of data across susceptible channels. The simplicity and efficiency of CRC calculations make them particularly suited for hardware implementation, where speed and resource optimization are critical.

Module Description

Input Data Width: 32 bits

CRC Polynomial Width: 6 bits

Output CRC Width: 6 bits

Interface: AXI4 Stream

Pipeline Stages: Implemented with an initiation interval (II) of 1

Inputs and Outputs

Input Ports:

data_in (32 bits): Receives the data block for which the CRC is to be computed.

CRC_polynomial (6 bits): The polynomial used in the CRC computation, defining the error detection capabilities.

clk (1 bit): Clock signal that synchronizes the data processing.

reset (1 bit): Asynchronously resets the computation process.

Output Ports:

CRC_out (6 bits): Outputs the computed CRC value.

valid_out (1 bit): Indicates the validity of the CRC output.

ready_out (1 bit): Signals the readiness to send the CRC result.

Design Implementation

HLS Code Overview

The design utilizes bitwise operations within a pipelined loop to enhance throughput and minimize latency. The core of the CRC computation handles data bit by bit, applying XOR operations based on the CRC polynomial.
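The bit-by-bit computation can be sketched as below. This is a minimal illustration under stated assumptions rather than the module's actual code: the data word is processed MSB first, the CRC register starts at zero, and the 6-bit polynomial argument holds the low six coefficients with the x^6 term implied. A plain uint8_t stands in for a 6-bit ap_uint, and the function name crc6 is hypothetical.

```cpp
#include <cstdint>

// Bit-serial CRC over one 32-bit word, producing a 6-bit result.
// In an HLS design the loop would carry the pipelining directive.
uint8_t crc6(uint32_t data_in, uint8_t crc_polynomial) {
    uint8_t crc = 0;  // assumed initial value
    for (int i = 31; i >= 0; --i) {
        bool in_bit = ((data_in >> i) & 1u) != 0;
        bool msb = ((crc >> 5) & 1u) != 0;
        // Shift the register left by one and keep only six bits.
        crc = static_cast<uint8_t>((crc << 1) & 0x3F);
        // XOR in the polynomial when the shifted-out bit and input bit differ.
        if (in_bit != msb) {
            crc = static_cast<uint8_t>(crc ^ (crc_polynomial & 0x3F));
        }
    }
    return crc;
}
```

With a zero initial value the computation is linear over XOR, which gives a simple self-check: the CRC of a XOR b equals the CRC of a XOR the CRC of b.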

Pipelining Strategy

The pipelining is designed to allow new iterations of the loop to begin every clock cycle, optimizing throughput and ensuring the module processes one input per clock cycle under optimal conditions.

Simulation and Verification

Simulation results indicate successful CRC computations for all test vectors, confirming the module's functionality and robustness.

Performance Analysis

The synthesized design shows a latency of 6 cycles per data packet, with high throughput capabilities evidenced by achieving the targeted initiation interval of 1.

Synthesis Outcomes

RTL synthesis confirms that the design meets the timing and resource criteria for the target FPGA architecture, verifying its readiness for production deployment.

Future Recommendations

Resource Optimization: Further reduce resource utilization by exploring more compact logic designs or alternative synthesis options.

Error Handling Enhancements: Implement additional features for error signaling and correction to handle possible data transmission errors dynamically.

Algorithmic Improvements: Consider exploring different CRC algorithms or configurable polynomial support to broaden the application scope and enhance error detection capabilities.

Conclusion

The HLS-based CRC module effectively meets specified functional and performance requirements, demonstrating efficient use of FPGA resources through advanced pipelining techniques. It is recommended for integration into larger systems where data integrity checks are essential, supporting critical applications across various industries.

HLS-based Up-counter Streaming Module for FPGA Design

FPGAs combine adaptability with speed. This section introduces one example of those capabilities: an up-counter streaming module designed for real-time interfacing, which uses a FIFO generator and a seven-segment display.

Design Overview:

The design comprises an integrated system built on an FPGA platform, utilizing High-Level Synthesis (HLS) for rapid development and iteration. The main focus of the design is an up-counter module, which is part of a larger system that includes a pulse generator, FIFO buffer, debouncer, and display drivers. Together they perform counting operations and display the results in real time.

Key Submodules:

Pulse Generator: Initiates the counting sequence, providing a regular pulse stream to the up-counter.

FIFO Generator: Manages the queuing of count values, ensuring a smooth and orderly data flow.

Debouncer: Filters out any noise from the input buttons, ensuring accurate count increment triggers.

Up-Counter: Lies at the heart of the design, incrementing the count in response to incoming pulses and managing the modulo operation for wrapping the count value.

Inputs/Outputs:

Inputs:

clk_in: The main system clock that synchronizes the operation of all submodules.

reset: Resets the system to a known state.

4-bit sw: User input switches to specify the modulo value for the counter.

up_count: The signal line carrying the current count value.

Outputs:

8-bit seven segment display (display_data): Drives the display segments to show the count.

4-bit seven segment enable (display_enable): Controls which segments are activated on the display.

Design Function:

At its core, the up-counter module functions to increment a count value with each pulse received from the pulse generator. The count is presented through a seven-segment display, which is controlled by the output lines display_data and display_enable. The FIFO generator ensures that the count values are processed in the order they are received.
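The counting behavior can be sketched as a small state holder that is stepped once per clock cycle. This is an illustrative model only: the struct name UpCounter, the wrap rule, and the handling of a zero modulo are assumptions, and a uint8_t stands in for the 4-bit switch value.

```cpp
#include <cstdint>

// Minimal model of the up-counter: one call to tick() per clock cycle.
struct UpCounter {
    uint8_t count = 0;

    // reset  : returns the counter to zero
    // pulse  : one-cycle pulse from the pulse generator
    // modulo : value of the 4-bit sw input (assumed wrap point)
    uint8_t tick(bool reset, bool pulse, uint8_t modulo) {
        modulo &= 0x0F;  // only four switch bits are meaningful
        if (reset || modulo == 0) {
            count = 0;   // assumption: a zero modulo holds the count at zero
        } else if (pulse) {
            // Wrap to zero once the count would reach the modulo value.
            count = (count + 1 >= modulo) ? 0 : static_cast<uint8_t>(count + 1);
        }
        return count;
    }
};
```

With modulo set to 3, successive pulses produce 1, 2, 0, 1, and cycles without a pulse leave the count unchanged.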

Applications:

This module is versatile and can be applied to a variety of scenarios, including but not limited to digital clocks, counters in consumer appliances, and educational tools for digital logic learning. Its real-time processing capability makes it suitable for applications where immediate feedback from user input is necessary.

Conclusion:

The HLS-based up-counter streaming module demonstrates how complex digital systems can be realized with reduced development times, facilitating innovation and creativity in digital design. This project is an enabler for engineers and developers to turn their digital concepts into real-world applications swiftly and effectively.

FPGA-based HLS PMOD Keyboard IP Design

Introduction

FPGA-based PMOD keyboards are fundamental interfaces in embedded systems, enabling user input for a variety of applications. In this report, we present the design and implementation of an FPGA-based HLS PMOD Keyboard IP with two sub-modules: PMOD keyboard refresh and PMOD keyboard. This solution facilitates efficient interaction with external keyboards and provides robust functionality for FPGA-based systems.

Design Overview

The FPGA-based HLS PMOD Keyboard IP comprises two primary sub-modules:

PMOD Keyboard Refresh: Generates refresh signals for scanning the keyboard matrix.

PMOD Keyboard: Interfaces with the PMOD keyboard, scanning rows and detecting key presses.

Input and Output Ports

The HLS PMOD Keyboard IP exposes the following input and output ports:

Input Ports:

ap_clk: Clock signal for the FPGA fabric.

4-bit input_pmod_row: Input for scanning rows of the PMOD keyboard matrix.

ap_rst: Reset signal to initialize the IP.

Output Ports:

4-bit pmod_output_col: Output for detected columns of the PMOD keyboard matrix.

8-bit display_data: Data for driving the seven-segment display.

4-bit display_enable: Enable signals for controlling segments of the seven-segment display.

Conclusion

The FPGA-based HLS PMOD Keyboard IP offers a versatile and efficient solution for integrating PMOD keyboard functionality into FPGA-based systems. Leveraging HLS methodology enables rapid development and optimization, while the modular design ensures scalability and ease of integration. With its robust features and customizable configurations, the HLS PMOD Keyboard IP can be seamlessly integrated into various embedded systems, enhancing user interaction and expanding application possibilities.

FPGA-based HLS Timer with Initialization IP Block Design

FPGA-based timers are essential components in various embedded systems, facilitating timekeeping functionalities crucial for a wide range of applications. I recently designed and implemented an FPGA-based HLS timer with initialization IP for a complex DSP project developed using High-Level Synthesis (HLS) methodology.

🗝️Design Overview

The HLS timer with initialization IP consists of the following key components:

✨Debouncer: Ensures stable input signals by eliminating noise and bouncing effects.

✨Pulse Generator: Generates pulse signals to trigger timer events such as start, stop, and reset.

✨Seven-Segment Driver: Drives the seven-segment display to visualize timer output.

✨Seven-Segment Signal: Generates refresh signals for the seven-segment display.

✨Timer Signal: Provides timing signals for precise operation of the timer.

✨Timer with Initialization: Implements countdown functionality and initialization of timer values.

🗝️Input and Output Ports

The HLS timer IP exposes the following input and output ports:

✨Input Ports:

ap_clk: Clock signal for the FPGA fabric.

6-bit "minutes" switch: Input for setting timer values or configurations.

start: Start signal to initiate timer operations.

ap_rst: Reset signal to reset the timer and associated submodules.

✨Output Ports:

8-bit seven_segment_data: Data for driving the seven-segment display.

4-bit seven_segment_enable: Enable signals for controlling segments of the seven-segment display.
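The countdown-with-initialization behavior can be sketched as follows. This is a hedged model, not the IP's source: it assumes tick() is called once per second by the timer signal block, that start loads the 6-bit minutes value and begins counting, and that the timer stops at 0:00. The struct and member names are hypothetical.

```cpp
#include <cstdint>

// Minimal countdown model: one call to tick() per one-second timing pulse.
struct CountdownTimer {
    uint8_t minutes;
    uint8_t seconds;
    bool running;

    CountdownTimer() : minutes(0), seconds(0), running(false) {}

    void tick(bool reset, bool start, uint8_t minutes_sw) {
        if (reset) {
            minutes = 0;
            seconds = 0;
            running = false;
            return;
        }
        if (start) {
            // Initialization: load the 6-bit switch value.
            minutes = minutes_sw & 0x3F;
            seconds = 0;
            running = (minutes != 0);
            return;
        }
        if (!running) {
            return;
        }
        if (seconds > 0) {
            --seconds;
        } else {
            // While running, seconds == 0 implies minutes > 0.
            --minutes;
            seconds = 59;
        }
        if (minutes == 0 && seconds == 0) {
            running = false;  // countdown finished
        }
    }
};
```

Starting with a switch value of 2 gives 1:59 after the first tick and reaches 0:00 after 120 ticks.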

The FPGA-based HLS timer with initialization IP offers a versatile and efficient solution for implementing timer functionalities in FPGA-based systems.

Leveraging HLS methodology enables rapid development and optimization, while the modular design ensures scalability and ease of integration. With its robust features and customizable configurations, the HLS timer IP can be seamlessly integrated into various embedded systems, meeting diverse application requirements.

High-Level Synthesis (HLS) Based Median Filter for FPGA Applications

Median filtering is a powerful technique in digital signal processing used for noise reduction and image enhancement. In this report, we delve into the implementation of a median filter using High-Level Synthesis (HLS) targeting FPGA platforms. HLS offers a high-level abstraction for designing complex algorithms and accelerates development by automatically generating hardware descriptions from C/C++ code.

How the Median Filter Works:

The med function computes the median of three input values.

The median_filter function applies the median filter to input data in_data.

The filter maintains a sliding window of size 3 to compute the median.

The output out_data is the median value, and out_data_vld indicates valid output.
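A minimal sketch of the two functions described above, written in standard C++ so it runs on a desktop compiler. The data type, the struct wrapper, and the rule that the output becomes valid once three samples have arrived are assumptions; the HLS source may keep the window in static variables instead.

```cpp
#include <algorithm>

typedef int data_t;  // assumed sample type

// Median of three values using min/max only (maps well to comparators).
data_t med(data_t a, data_t b, data_t c) {
    return std::max(std::min(a, b), std::min(std::max(a, b), c));
}

// Sliding-window median filter with a window of size 3.
struct MedianFilter {
    data_t prev1 = 0;  // most recent previous sample
    data_t prev2 = 0;  // sample before that
    int filled = 0;    // number of previous samples stored (saturates at 2)

    // Returns out_data_vld; writes the median to out_data.
    bool median_filter(data_t in_data, data_t& out_data) {
        out_data = med(prev2, prev1, in_data);
        prev2 = prev1;
        prev1 = in_data;
        if (filled < 2) {
            ++filled;
            return false;  // window not yet full
        }
        return true;
    }
};
```

Feeding the samples 5, 100, 7, 6 yields two invalid outputs followed by 7 and 7, so the outlier 100 never reaches the output.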

Key Applications:

Image Processing: Median filtering removes salt-and-pepper noise from images while preserving edges.

Signal Processing: Used in audio and biomedical signal processing to remove impulsive noise without blurring the signal.

Digital Communications: Employed for channel equalization in wireless communication systems to mitigate multipath interference effects.

Real-Time Systems: Finds applications in robotics and automotive safety systems due to its simplicity and effectiveness.

HLS-based IIR Filter

In the realm of FPGA (Field-Programmable Gate Array) design, the efficient implementation of digital signal processing (DSP) algorithms is a critical aspect. Among these algorithms, IIR (Infinite Impulse Response) filters stand out for their versatility and effectiveness in a wide range of applications.

The code represents a second-order IIR filter implementation. Here's a breakdown of the key components:

Interface Definition: The iir function takes an input x and produces an output y. Both x and y are of type DATA_TYPE, and they are passed by reference (&) to allow modifications.

Interface Directives: The #pragma HLS INTERFACE directives specify the interface properties of the function. In this case, ap_ctrl_hs selects the block-level handshake control protocol (start, done, idle, and ready signals), while ap_none means that x and y are plain ports with no I/O handshake protocol.

Pipeline Directive: The #pragma HLS PIPELINE II=2 directive pipelines the function with an initiation interval (II) of 2, so the function can accept a new input sample every two clock cycles.

The core of the function performs the IIR filtering operation using the provided coefficients (b0, b1, b2, a1, and a2). It maintains internal state variables (xn1, xn2, yn1, and yn2) to store past input and output samples, ensuring correct filter behavior.
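The filtering step can be sketched as a direct-form I biquad. This is an illustration under assumptions: DATA_TYPE is taken to be float, the feedback terms are subtracted (the sign convention is not stated in this README), and the state lives in a struct rather than in static variables so the example is easy to reset. The HLS pragmas are omitted.

```cpp
typedef float DATA_TYPE;  // assumed sample type

// Second-order IIR section (direct form I).
struct Biquad {
    DATA_TYPE b0, b1, b2, a1, a2;   // coefficients
    DATA_TYPE xn1, xn2, yn1, yn2;   // past inputs and outputs

    Biquad(DATA_TYPE b0_, DATA_TYPE b1_, DATA_TYPE b2_,
           DATA_TYPE a1_, DATA_TYPE a2_)
        : b0(b0_), b1(b1_), b2(b2_), a1(a1_), a2(a2_),
          xn1(0), xn2(0), yn1(0), yn2(0) {}

    void iir(const DATA_TYPE& x, DATA_TYPE& y) {
        // y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
        y = b0 * x + b1 * xn1 + b2 * xn2 - a1 * yn1 - a2 * yn2;
        // Shift the delay lines for the next sample.
        xn2 = xn1;
        xn1 = x;
        yn2 = yn1;
        yn1 = y;
    }
};
```

With b0 = 1, a1 = -0.5, and the other coefficients zero, a unit impulse produces 1, 0.5, 0.25, which is the decaying response expected from a single feedback term.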

HLS-based QAM Module for Wireless Communication Application (Corresponding RTL Verilog HDL files are included)

lms_equalizer: Implements the LMS equalizer module that updates the filter coefficients based on the error signal (difference between desired output and actual output) using the LMS algorithm. It also updates the input sample delay line to prepare for the next iteration.

qam_modulation: Implements QAM modulation by mapping input data bits onto a QAM constellation in the I-Q plane. The design provided is for QAM-4 modulation, but the constellation points can be adjusted for other QAM schemes.

qam_demodulation: Implements QAM demodulation by computing the Euclidean distance between the received symbol and each constellation point and detecting the closest point as the demodulated data.

transmit_filter: Implements the transmit filter module that applies the FIR filter to the input sample to shape the transmitted signal spectrum. The FIR filter coefficients can be optimized based on the desired spectral characteristics.

lms_adaptive_filter: Implements the LMS adaptive filtering function that updates the filter coefficients based on the error signal (difference between desired output and filtered output) using the LMS algorithm.

fir_filter: Implements a simple FIR filter module that computes the output of the equalizer filter using the current input sample and filter coefficients.

dfe_equalizer: Implements the Decision Feedback Equalization (DFE) function that computes the equalized sample by subtracting decision feedback from the filtered output.

Here's an overview of the design process:

System Specification:

The requirements of the QAM system are reviewed including the modulation scheme (e.g., QAM-16, QAM-64), bandwidth, data rate, signal-to-noise ratio (SNR) requirements, and any specific application constraints.

Modulation and Demodulation:

For QAM modulation, I mapped the input data bits onto a complex constellation, such as symbols in the I-Q plane. For demodulation, I implemented algorithms to estimate the transmitted symbols based on received signals and perform symbol detection.
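The mapping and nearest-point detection can be sketched for QAM-4 as below. The constellation values, the Gray-coded bit assignment, and the Symbol type are assumptions made for illustration; only the function names qam_modulation and qam_demodulation come from the module list above.

```cpp
#include <cstdint>

struct Symbol { float i; float q; };  // hypothetical I-Q sample type

// Assumed Gray-coded QAM-4 constellation, indexed by the 2-bit input.
static const Symbol CONSTELLATION[4] = {
    { 1.0f,  1.0f},   // 00
    {-1.0f,  1.0f},   // 01
    { 1.0f, -1.0f},   // 10
    {-1.0f, -1.0f}    // 11
};

// Map two data bits onto a constellation point.
Symbol qam_modulation(uint8_t bits) {
    return CONSTELLATION[bits & 0x3];
}

// Pick the constellation point with the smallest squared Euclidean
// distance to the received symbol and return its 2-bit index.
uint8_t qam_demodulation(Symbol rx) {
    uint8_t best = 0;
    float best_dist = 0.0f;
    for (uint8_t k = 0; k < 4; ++k) {
        float di = rx.i - CONSTELLATION[k].i;
        float dq = rx.q - CONSTELLATION[k].q;
        float dist = di * di + dq * dq;
        if (k == 0 || dist < best_dist) {
            best_dist = dist;
            best = k;
        }
    }
    return best;
}
```

Comparing squared distances avoids a square root, which matters in hardware and does not change which point is closest.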

Transmit Filter:

A transmit filter is designed and implemented to shape the transmitted signal spectrum and meet regulatory requirements. It is built as a Finite Impulse Response (FIR) filter whose coefficients are optimized to achieve the desired spectral characteristics while minimizing implementation complexity.

Receive Filter:

A receive filter is designed and implemented to mitigate noise and interference and improve the signal-to-noise ratio (SNR) at the receiver. Adaptive filtering techniques such as Least Mean Squares (LMS) or Decision Feedback Equalization (DFE) are used for adaptive equalization in the receive filter.
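One LMS iteration, as used by the lms_equalizer and lms_adaptive_filter modules, can be sketched as follows. The tap count, the step size, and the struct layout are assumptions chosen for illustration, not values from the design.

```cpp
const int TAPS = 4;      // assumed filter length
const float MU = 0.1f;   // assumed LMS step size

struct LmsEqualizer {
    float w[TAPS];  // filter coefficients
    float x[TAPS];  // input sample delay line

    LmsEqualizer() {
        for (int k = 0; k < TAPS; ++k) {
            w[k] = 0.0f;
            x[k] = 0.0f;
        }
    }

    // Processes one sample and returns the filter output computed
    // before the coefficients are updated.
    float step(float sample, float desired) {
        // Shift the delay line and insert the new sample.
        for (int k = TAPS - 1; k > 0; --k) {
            x[k] = x[k - 1];
        }
        x[0] = sample;

        // FIR output with the current coefficients.
        float y = 0.0f;
        for (int k = 0; k < TAPS; ++k) {
            y += w[k] * x[k];
        }

        // Error signal and LMS coefficient update.
        float e = desired - y;
        for (int k = 0; k < TAPS; ++k) {
            w[k] += MU * e * x[k];
        }
        return y;
    }
};
```

The step size trades convergence speed against stability: larger values adapt faster but can diverge when the input power is high.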

Equalization:

An equalizer is implemented to compensate for channel distortion and improve signal quality. Adaptive equalization techniques such as Decision Feedback Equalization (DFE) or Maximum Likelihood Sequence Estimation (MLSE) can be employed. Channel effects such as multipath propagation, phase distortion, and frequency-selective fading are estimated and compensated for.

HLS D Flip-Flop:

The code defines a function HLS_dff that models a series of digital flip-flops (DFFs) in a High-Level Synthesis (HLS) environment for FPGAs. A flip-flop is a basic digital memory element that stores one bit of information. The function takes a single input data, which represents the input bit to be stored, and outputs three bits, n1, n2, and n3, each representing the state of one flip-flop in the series.

The #pragma HLS INTERFACE directives specify that the ports for data, n1, n2, and n3 have no specific interface protocol (ap_none), which means they are treated as simple wires with no handshaking or protocol overhead. The ap_ctrl_none on the return port indicates that there is no control interface, and the function should not expect any start or done signals.

The #pragma HLS PIPELINE directive with II=1 specifies that the design should be pipelined with an initiation interval of 1. This means that on every clock cycle, a new set of inputs can be processed by the function, leading to a continuous flow of data and a potentially higher throughput.

Inside the function, there are four static boolean variables (dff_0, dff_1, dff_2, dff_3) declared to hold the state of each flip-flop. The volatile keyword on dff_0 indicates that this variable's value can change at any time and should not be optimized away by the compiler. On each function call, the flip-flops are updated in a chain, where dff_3 takes the value of dff_2, dff_2 takes the value of dff_1, and so on, with dff_0 taking the new input data. The outputs n1, n2, and n3 are then assigned the values of dff_1, dff_2, and dff_3, respectively.
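The update order described above can be sketched in standard C++. The HLS INTERFACE and PIPELINE pragmas and the volatile qualifier are left out so the example compiles cleanly on a desktop compiler; the initial flip-flop values are assumed to be zero.

```cpp
// Chain of four flip-flops; n1..n3 expose the last three stages.
void HLS_dff(bool data, bool& n1, bool& n2, bool& n3) {
    // Static variables keep their values between calls, like registers.
    static bool dff_0 = false;
    static bool dff_1 = false;
    static bool dff_2 = false;
    static bool dff_3 = false;

    // Update from the end of the chain so each stage reads the
    // previous stage's old value.
    dff_3 = dff_2;
    dff_2 = dff_1;
    dff_1 = dff_0;
    dff_0 = data;

    n1 = dff_1;
    n2 = dff_2;
    n3 = dff_3;
}
```

Updating the stages in reverse order is what makes sequential C code behave like registers that all sample on the same clock edge.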

The screenshots provide additional context:

The csim.log shows that the C simulation of the design (csim.exe) has run with zero errors, suggesting that the function behaves correctly in simulation.

The timing estimate shows a target of 10 ns with an uncertainty of 2.70 ns. No latency or timing violations are reported, so the design meets its timing requirements.

The Vitis HLS console logs indicate successful generation of RTL and co-simulation passes, further confirming functional correctness.

The waveform shows the clock, data, and the output signals (n1, n2, n3). The behavior of the flip-flops can be inferred from the transitions of these signals: each output takes on the value of the stage before it one clock cycle later, demonstrating the shift-register behavior created by the series of flip-flops.

This design is typically used in digital circuits where a sequence of bits needs to be captured and transferred on each clock cycle, effectively creating a delay line or a shift register. In an FPGA, this would translate to a series of flip-flop components connected in sequence, each storing the bit value of the previous stage.

HLS Register

The HLS function HLS_reg is defined with two parameters: a single bit data input and a 5-bit unsigned integer n output passed by reference. The design uses the ap_uint<5> data type to define a 5-bit wide register. In HLS, ap_uint is a datatype provided by Xilinx libraries for arbitrary precision arithmetic, which in this case, allows us to define the width of our integer at 5 bits specifically.

The #pragma HLS INTERFACE directives are used to specify the interface configuration for the ports. Setting ap_ctrl_none for the return port indicates that the block does not use the default control signals (start, done, idle). This means that the function will continuously run and process data without needing to be explicitly started or stopped. The ports data and n are configured with ap_none, suggesting they are simple connections without any sideband signals or protocol.

The #pragma HLS PIPELINE directive with an initiation interval (II) of 1 tells the HLS tool to attempt to pipeline this function so that it can accept new input data every clock cycle, which maximizes throughput.

Inside the function, there's a static 5-bit register reg which initially holds the value 0b00000. Each time the function is invoked, the register contents are shifted right by one bit, and the data input is placed into the most significant bit (reg[4]). The updated value of reg is then assigned to the output n.
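The shift operation can be sketched as below. A uint8_t stands in for ap_uint<5> so the example builds without the Xilinx headers, and the HLS pragmas are omitted; everything else follows the description above.

```cpp
#include <cstdint>

// 5-bit shift register: shifts right and loads data into bit 4.
void HLS_reg(bool data, uint8_t& n) {
    static uint8_t reg = 0;  // 0b00000 at start-up

    // Shift right by one, then place the input in the most significant
    // of the five bits. Bits 5..7 of the uint8_t always stay zero.
    reg = static_cast<uint8_t>((reg >> 1) | (data ? 0x10 : 0x00));
    n = reg;
}
```

Feeding the bit sequence 1, 0, 1, 1 produces 0x10, 0x08, 0x14, 0x1A, so the oldest bit sits closest to the least significant position.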

From the attached screenshots, we can observe that:

The csim.log file indicates that the C simulation has been run successfully with no errors, implying functional correctness of the design at the C level.

The timing estimate does not show any issues, with a target and estimated timing of 10 ns and an uncertainty of 2.70 ns.

The synthesis report indicates that the design meets the timing requirements with no latency or iterations, and it is set up for pipelining.

The co-simulation report confirms that the RTL simulation passes, ensuring that the synthesized design behaves as expected when compared to the C model.

The waveform depicts the behavior of the data input and the n output over time, confirming the shift register operation where each bit in n is shifted in each clock cycle, and the input data is placed into the most significant bit position.

This design would typically be used in scenarios where a sequence of bits needs to be captured and shifted in time, such as in serial communication interfaces or for temporary storage and manipulation of bits within a larger digital system. The shift register's ability to operate every clock cycle maximizes the data handling efficiency, which is essential for high-performance FPGA applications.

HLS FPGA Design Rotate-With-Load Module:

Rotate-with-load is a common function in cryptographic algorithms and data manipulation tasks. This project covers the full HLS design flow, including C simulation, synthesis, schedule viewing, co-simulation, and RTL waveform simulation.

Design Specification

The provided HLS code implements a rotate-with-load function. The function takes a data_in input, which is conditionally loaded into a register rotate_reg when the load signal is high. Upon a high rotate signal, the rotate_reg is right-rotated by one position, with the output provided in data_out. The use of the ap_uint data type suggests a templated width, providing flexibility for different data widths.
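The load and rotate behavior can be sketched as follows. Several details are assumptions: the function name rotate_with_load is hypothetical, the width is fixed at 8 bits with uint8_t standing in for ap_uint, and load is given priority when both control signals are high. The HLS pragmas are omitted.

```cpp
#include <cstdint>

// Rotate-with-load over an assumed 8-bit register.
void rotate_with_load(uint8_t data_in, bool load, bool rotate,
                      uint8_t& data_out) {
    static uint8_t rotate_reg = 0;

    if (load) {
        // Assumption: load takes priority over rotate.
        rotate_reg = data_in;
    } else if (rotate) {
        // Right-rotate by one: the LSB wraps around to the MSB.
        rotate_reg = static_cast<uint8_t>((rotate_reg >> 1) |
                                          ((rotate_reg & 1u) << 7));
    }
    data_out = rotate_reg;
}
```

Loading 0x81 and rotating twice gives 0xC0 and then 0x60; with both controls low the register holds its value.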

HLS Pragmas and Interface Configuration

Several HLS interface pragmas are declared for the function arguments, optimizing the design's control and data flow:

ap_ctrl_none: Removes the block-level control signals (start, done, idle, ready), so the block runs continuously without being explicitly started.

ap_none: Specifies that the ports will have no handshaking signals, simplifying the interface.

PIPELINE II=1: Enables full pipelining with an initiation interval of 1, maximizing throughput.

Design Flow

C Simulation (csimulation): The initial stage involves verifying the functional correctness of the HLS code. The console output indicates that the C simulation was successfully completed with no errors, demonstrating functional correctness of the HLS code under testbench conditions.

Synthesis: The synthesis summary shows that the design targets a 5 ns clock period, achieving an estimated 2.637 ns timing, which is well within the target. The synthesis reports no timing violations, suggesting an efficient translation of the HLS code to RTL.

Schedule Viewer: The viewer provides a graphical representation of the operation schedule. The data suggests that operations are tightly packed with no evident pipeline stalls or wasted clock cycles, indicative of an efficient HLS scheduling result.

Co-Simulation Report: The co-simulation integrates the RTL simulation with the C model, ensuring the RTL implementation functions as intended. The console output indicates all simulation runs passed with no errors, showcasing the RTL's fidelity to the original HLS design.

RTL Simulation Waveform: The waveform visualization confirms the expected operation of the design. Signals load and rotate control the behavior of the rotate_reg, as seen by the changes in data_out, which align with the specified behavior.

Performance and Resource Utilization:

The synthesis report highlights the efficient use of FPGA resources, with only 18 flip-flops (FFs) and 53 look-up tables (LUTs) used, and no block RAM (BRAM) or DSP slices consumed. The design is pipelined, which is reflected in the timing estimate and schedule viewer results, indicating high throughput capability.

Functional Behavior:

The waveforms corroborate the expected functional behavior, with the rotate_reg correctly loading and rotating based on the control signals.

Challenges and Improvements:

The rotate_reg register always rotates by a fixed single bit, which limits the function's flexibility. Future improvements could include parameterizing the rotate amount to support varying rotate operations. The code also contains no explicit error handling or corner-case handling; robustness could be improved by adding error detection and management logic.

The HLS-based FPGA design for a rotate-with-load function exhibits a promising blend of high performance, functional correctness, and efficient resource utilization. The design flow from csimulation to RTL waveform analysis showcases the effectiveness of HLS in streamlining FPGA design, delivering rapid prototyping capabilities while maintaining high-quality synthesis results. With further optimization and robustness features, the design could be well-suited for demanding applications that require dynamic data manipulation with high throughput requirements.