
arc-research-lab / charm

CHARM: Composing Heterogeneous Accelerators on Versal ACAP Architecture

License: MIT License

deeplearning fpga heterogeneous-computing design-space-exploration high-level-synthesis versalacap electronic-design-automation domain-specific-architecture acap versal

charm's Introduction

Team

Principal Investigator: Dr. Peipei Zhou, https://peipeizhou-eecs.github.io/

Ph.D. Students: Jinming Zhuang (Lead) and Zhuoping Yang

Faculty Collaborators: Drs. Jingtong Hu, Alex Jones, Deming Chen, and Jason Cong

Student Collaborators: Jason Lau and Hanchen Ye

AMD Collaborators: Stephen Neuendorffer, Jack Lo, and Kristof Denolf


CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture (FPGA'23)

High Performance, Low Power Matrix Multiply Design on ACAP: from Architecture, Design Challenges and DSE Perspectives (DAC'23).

SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration (FPGA'24)

ACM/IEEE Reference Format

  1. Jinming Zhuang, Jason Lau, Hanchen Ye, Zhuoping Yang, Yubo Du, Jack Lo, Kristof Denolf, Stephen Neuendorffer, Alex Jones, Jingtong Hu, Deming Chen, Jason Cong, Peipei Zhou. 2023. CHARM: Composing Heterogeneous Accelerators for Matrix Multiply on Versal ACAP Architecture. In Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA ’23), February 12–14, 2023, Monterey, CA, USA. ACM, New York, NY, USA, 12 pages.

ACM PDF: https://doi.org/10.1145/3543622.3573210 Author Version PDF: https://peipeizhou-eecs.github.io/publication/fpga23/

  2. Jinming Zhuang, Zhuoping Yang, Peipei Zhou. 2023. High Performance, Low Power Matrix Multiply Design on ACAP: from Architecture, Design Challenges and DSE Perspectives. In Proceedings of the 60th ACM/IEEE Design Automation Conference (DAC ’23), July 9–13, 2023, San Francisco, CA, USA. https://doi.org/10.1109/DAC56929.2023.10247981

Author Version PDF: https://arxiv.org/pdf/2305.18698.pdf

  3. Jinming Zhuang, Zhuoping Yang, Shixin Ji, Heng Huang, Alex K. Jones, Jingtong Hu, Yiyu Shi, Peipei Zhou. 2024. SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration. In Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA ’24). ACM, New York, NY, USA.

ACM PDF: https://doi.org/10.1145/3626202.3637569

New Release! 2023.05.29

PyACAP 1.0: Python-Based Automatic Code Generation for Versal ACAP

  • What's new? In this release, we provide a complete Python interface for matrix multiply with the 32-bit floating-point (fp32) data type on the Versal ACAP VCK190 and VCK5000 platforms.
  • Overall Compilation Flow:



  • Python Interface Introduction:
    Quick Start: Running project_setup.py
python project_setup.py
import numpy as np
from charm import *

# Define the left-hand-side (A) and right-hand-side (B) operands
A = np.random.rand(4096, 4096).astype(np.float32)
B = np.random.rand(4096, 4096).astype(np.float32)

# Create the charm object (prj_dir is the path of the project directory to generate)
prj_dir = './charm_prj'  # placeholder path; set this for your environment
automm = charm(prj_dir)

# Launch the charm DSE to find an optimized hardware configuration
Versal_config = automm.cdse(A, B)

# Launch the charm automatic code generator to emit the code for AIE, PL, and host CPU
device = 'vck190'  # supported devices are vck190 and vck5000
automm.cacg(Versal_config, device)

# Run the Vitis compilation flow
automm.build()

Overview

In this repo, we use general matrix-matrix multiplication (GEMM) applications as an example and provide a detailed description of how to build a system-level design on the AMD Versal VCK190 platform. By going through this repo, users can learn:

  • How to design a highly efficient single AIE kernel by leveraging the 7-way very long instruction word (VLIW) architecture.
  • How to sustain 400 AIEs with the limited I/O interfaces between the AIE array and the PL by using a broadcast-packet mechanism.
  • How to transfer data between the PL and the AIEs by using a bubble-free pipeline strategy.

We provide an automatic code generation and compilation flow with which users can build the system on Versal step by step by changing the configuration files.

Dependencies

To play with the Charming Accelerators, the following software and hardware dependencies are required:

  • Linux System with "tar" installed
  • AMD/Xilinx Vitis 2021.1 (Version 2021.1 guarantees that the designs in the example folder compile correctly)
  • AMD/Xilinx XRT Library
  • AMD/Xilinx Versal VCK190 (Vitis 2021.1)
  • AMD/Xilinx Versal VCK5000 (Python Interface Vitis 2021.2)

Environment Setup

1. To quickly boot and run experiments on the board instead of building the platform and Linux from scratch, users can download the platform package (VCK190 Base 2021.1) and the PetaLinux common image (Versal common image) from the following link:

https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/embedded-platforms/2021-1.html

2. Install the platform and Petalinux

unzip xilinx_vck190_base_202110_1.zip
tar -xf xilinx-versal-common-v2021.1.tar.gz
cd xilinx-versal-common-v2021.1
sh sdk.sh

3. VCK190 Base 2021.1: This package contains the pre-built Versal extensible embedded platform. During compilation, users need to specify the platform path in the following format.

PLATFORM=${PATH}/xilinx_vck190_base_202110_1/xilinx_vck190_base_202110_1.xpfm

4. Versal common image: This package includes the PetaLinux system boot files and the cross-compilation environment needed for the ARM CPU. During compilation, users need to point SYSROOT and EDGE_COMMON_SW to the following paths.

SYSROOT = ${PATH}/sysroots/cortexa72-cortexa53-xilinx-linux
EDGE_COMMON_SW=${PATH}/xilinx-versal-common-v2021.1

5. Vitis and Cross-compilation Environment Setup

source /opt/tools/xilinx/Vitis/2021.1/settings64.sh
source /opt/xilinx/xrt/setup.sh
unset LD_LIBRARY_PATH  # if needed
source ${PATH}/environment-setup-cortexa72-cortexa53-xilinx-linux

6. Project Setup and Compilation

Users can generate the customized project by setting up the configuration file and directly running the following command:

./project_setup.sh ./config_files/input.cfg ${Project_DIR}
cd ${Project_DIR}
make all PLATFORM=${PATH} EDGE_COMMON_SW_PATH=${PATH} SYSROOT_PATH=${PATH}

7. On Board Execution for MM with Arbitrary Sizes

After copying the SD card image to a micro SD card and booting up the system, run the following commands to get the execution results. {M}, {K}, {N} refer to the size of the MM. To reduce the overhead of calling the API when running the kernel, users can specify the number of {iteration}s of the MM; the average throughput is then reported. To verify the correctness of the MM kernel, {verify} should be set to 1, otherwise 0. For example, running a 1024*1024*1024 MM for 100 iterations without verifying the result: ./hostexe mm_hw.xclbin 1024 1024 1024 100 0

cd /mnt/sd-mmcblk0p1
./hostexe mm_hw.xclbin {M} {K} {N} {iteration} {verify}
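
As a sanity check on the reported numbers, the average throughput of an M*K*N MM can be estimated with the standard 2*M*K*N floating-point-operation count. The helper below is only a back-of-the-envelope sketch with illustrative numbers, not part of the host code:

# Estimate average GEMM throughput from the MM size, iteration count, and
# measured wall-clock time (assumes the standard 2*M*K*N FLOPs per MM).
def gemm_gflops(M, K, N, iterations, total_seconds):
    return 2.0 * M * K * N * iterations / total_seconds / 1e9

# Example: 100 iterations of a 1024*1024*1024 MM finishing in 0.5 s
print(round(gemm_gflops(1024, 1024, 1024, 100, 0.5), 1))  # ~429.5 GFLOPS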

Step-by-Step Tutorial

In this part, we first introduce the overall MM tiling strategy, which consists of four levels of tiling. In the later parts, we illustrate how we handle each of these tiling levels.

Overall MM Tiling Strategy:

Given a large matrix multiplication (MM) of size (M*K) * (K*N), referred to as M*K*N, the listing below shows the four levels of tiling used to handle this MM (from innermost to outermost):

  • Line 16-20: MM calculated on a single AIE core.
  • Line 12-14: The spatial distribution unrolled across different AIE cores in AIE Array.
  • Line 7-9: The sequential processing of data stored in PL on-chip memories.
  • Line 2-4: The temporal processing of data stored in off-chip memory.

We visualize the on-chip buffer-level tiling in the right figure. We refer to the MM calculated on a single AIE as the "Tile" level and to the MM unrolled across the AIE array as the "Batch" level. The strategy for mapping the tiled MM onto the AIE array is illustrated later.
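
As a reference for the loop structure above, here is a minimal NumPy sketch of the four tiling levels. The parameter names follow the configuration file (TI/TK/TJ, A/B/C, X/Y/Z), but the default values and the Python loop order are illustrative only; on the real device the Batch level is unrolled spatially across AIE cores rather than executed as loops, and all sizes are assumed to divide evenly.

import numpy as np

def tiled_mm(lhs, rhs,
             TI=32, TK=32, TJ=32,   # level 1: tile computed on a single AIE
             A=6, B=4, C=16,        # level 2: Batch level unrolled on the AIE array
             X=8, Y=1, Z=2):        # level 3: PL on-chip buffer level
    # Level 4 (outermost) walks the off-chip matrices in on-chip-sized blocks.
    M, K = lhs.shape
    _, N = rhs.shape
    out = np.zeros((M, N), dtype=lhs.dtype)
    MT, KT, NT = X * A * TI, Y * B * TK, Z * C * TJ   # on-chip block sizes
    for m0 in range(0, M, MT):                        # level 4: off-chip (temporal)
        for n0 in range(0, N, NT):
            for k0 in range(0, K, KT):
                for x in range(X):                    # level 3: on-chip (sequential)
                    for z in range(Z):
                        for y in range(Y):
                            for a in range(A):        # level 2: AIE array (spatial)
                                for c in range(C):
                                    for b in range(B):
                                        i = m0 + (x * A + a) * TI
                                        j = n0 + (z * C + c) * TJ
                                        k = k0 + (y * B + b) * TK
                                        # level 1: TI*TK*TJ MM on one AIE core
                                        out[i:i+TI, j:j+TJ] += lhs[i:i+TI, k:k+TK] @ rhs[k:k+TK, j:j+TJ]
    return out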

Single AIE Programming:

In this part, we demonstrate the coding style of calculating MM with size TI*TK*TJ in a single AIE which corresponds to the first level of tiling.

The AIE is a very long instruction word (VLIW) processor that can issue up to seven operations in parallel in one VLIW word:

  • Two Load Operations: Load up to two 256-bit words (the data width can vary) from AIE local memory into AIE registers.
  • One Store Operation: Store up to one 256-bit word from an AIE register back to AIE local memory.
  • One Vector Operation: A vector matrix multiply or add.
  • One Scalar Operation
  • Two Move Operations: Move data between scalar and vector registers.


The key challenge of programming a single AIE is keeping instructions issued back to back while working within the 32KB local memory and 2KB of local registers of a single AIE (for integer data types there are an additional 3KB of accumulator registers).

We provide our source code that achieves 95% efficiency when calculating 32*32*32 MM in src/aie/mm_kernel0.cc. The visualization of the algorithm is shown below:

The insights of programming AIE are:

  • Manually unroll the innermost loop to make it fully pipelined: View the report under "Work/aie/xx_xx/xx_xx.log" (here xx_xx is the position of the AIE) after compilation and make sure that 1) "Critical cycle of length" equals the number of vector operations, and 2) the cycle count after folding matches the cycle count implied by "Critical cycle of length". The following figure is reported by a cycle-accurate simulator (Vitis Analyzer) provided by AMD. In our innermost loop (line 43), we manually unroll 16 vector MAC operations and, after checking the report, confirm that they are fully pipelined.

  • Avoid conditional (judgment) statements in the innermost loop: A conditional prevents the compiler from fully pipelining the MAC instructions. As can be seen at line 98, since we need to store the data from the registers back to local memory when finishing the last iteration of the reduction dimension (the K loop), we manually peel the last iteration of the innermost loop out of the "for loop" region to avoid a conditional inside it (see the sketch after this list).
  • Use __restrict and the chess_prepare_for_pipelining pragma to improve efficiency through software pipelining.
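
The loop-peeling idea in the bullets above can be illustrated outside the AIE toolchain. The NumPy sketch below only mirrors the control structure (branch-free steady-state MACs, with the final reduction step and the store peeled out of the loop); it is not the AIE intrinsic code in src/aie/mm_kernel0.cc.

import numpy as np

def tile_mm_peeled(A_tile, B_tile, acc_out, TK=32):
    # The accumulator stays in "registers" for the whole reduction.
    acc = np.zeros(acc_out.shape, dtype=A_tile.dtype)
    K = A_tile.shape[1]
    for k in range(0, K - TK, TK):                      # steady state: MACs only,
        acc += A_tile[:, k:k+TK] @ B_tile[k:k+TK, :]    # no branch in the loop body
    k = K - TK                                          # peeled last reduction step:
    acc += A_tile[:, k:k+TK] @ B_tile[k:k+TK, :]        # final MAC, then the single
    acc_out[:, :] = acc                                 # unconditional store back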

Automatic Code Generation (ACG):

The tools for automatically generating the source code are under the "src_gen" folder.

ACG takes the platform information and a user-specified design point as input and automatically generates the system-level design by launching the following four template-based components sequentially:

Kernel_Gen: Kernel_Gen is launched to generate both the single AI Engine (AIE) C code and the adaptive data flow (ADF) graph code in C++ for verifying the correctness of a single-kernel design. MM kernels with the fp32 data type, in different shapes that fit in a single kernel, are supported in the current version.

AIE_ArrGen: AIE_ArrGen is launched to generate new ADF graph code that defines how packet-switched streams are connected to the AIE array, which contains 400 AIEs. A single kernel calculating a 32x32x32 MM with the fp32 data type is currently supported for scaling out to the AIE array.

PL_Gen: Based on the AIE array created by AIE_ArrGen, PL_Gen is launched to generate the PL streams, the scheduling-controller C/C++ HLS modules that communicate with the AIE array and the PL on-chip buffers, and the off-chip AXI data-transfer modules that communicate with DDR. Different system-level designs varying in on-chip buffer size and buffer implementation option (BRAM or URAM) are supported for the fp32 data type.

Host_Gen: Host_Gen is launched to generate the system control logic running on the ARM CPU by using AMD XRT APIs.

Compilation: After code generation, the vendor tools, i.e., the AIE compiler and the v++ compiler, take the ADF graph and the HLS C/C++ code as input, respectively. Their output object files, libadf.a and kernel.xo, are linked into an xclbin file that includes the hardware information of the design for the target platform. The C++ compiler compiles the XRT-API-based host code into an executable that runs on the CPU.

Configuration File

We provide a configuration file template under "./config_files/input.cfg". Users can specify the platform, data type, kernel type, and mapping strategy of each level in this file. The feasible options for each parameter are shown in parentheses. The rules for using this configuration file are listed below:

  • Platform refers to the hardware platform used in the project. VCK5000 and VCK190 are supported in the current framework.
  • DATA_TYPE: the framework currently supports the fp32, int16, and int8 data types.
  • KernelGen, AIEArrGen, SysGen decide whether the corresponding ACG should be launched (1 means launch).
  • KRL_TYPE selects one of the two types of MM kernels provided in our source files.
  • I, K, J refer to the MM size stored and calculated in a single AIE.
  • A, B, C refer to the BATCH-level parameters.
  • X, Y, Z refer to the on-chip-level parameters.
  • LHS_BUFF, RHS_BUFF, OUT_BUFF decide the implementation option for the LHS, RHS, and output buffers: 1 means URAM and 0 means BRAM. For example, LHS_BUFF=1 means the LHS buffer is implemented with URAM.
Platform:VCK190;
DATA_TYPE:fp32;
KernelGen:1;
	KRL_TYPE:0;
	I:32;
	K:32;
	J:32;
AIEArrGen:1;
	NUM_PACK:4;
	A:6;
	B:4;
	C:16;
	A_BRO:4;
	C_BRO:3;
SysGen:1;
	X:8;
	Y:1;
	Z:2;
	LHS_BUFF:0;
	RHS_BUFF:0;
	OUT_BUFF:1;
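
For reference, the "KEY:VALUE;" entries above can be consumed with a few lines of Python. This is only a sketch of the format (one entry per line, indentation ignored), not charm's actual parser:

# Parse the KEY:VALUE; configuration format shown above.
def parse_cfg(path):
    cfg = {}
    with open(path) as f:
        for raw in f:
            line = raw.strip().rstrip(';')
            if not line:
                continue
            key, _, value = line.partition(':')
            cfg[key.strip()] = value.strip()
    return cfg

cfg = parse_cfg('./config_files/input.cfg')
# ACG stages launched when the corresponding flag is 1
stages = [s for s in ('KernelGen', 'AIEArrGen', 'SysGen') if cfg.get(s) == '1']
# Buffer implementation: 1 = URAM, 0 = BRAM
lhs_impl = 'URAM' if cfg.get('LHS_BUFF') == '1' else 'BRAM'
print(stages, lhs_impl)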

Applications

We provide four applications under the example folder, including BERT for natural language processing, NCF for recommendation, ViT for vision classification, and MLP for multi-layer perceptron classification or regression. The expected throughput should match the results shown in the following figure:

[Figure: expected throughput of the four applications]

To quickly reproduce the results, we provide the pre-built object files for the AIE, PL, and ARM CPU in the pre_built folder. Users can go to the corresponding folder and run the following command to create the SD card image for on-board execution.

make package EDGE_COMMON_SW_PATH=${PATH} SYSROOT_PATH=${PATH}

Errata Sheet

  1. A typo appears in Table 6 of the paper. The latency for MLP should be "119ms" instead of "11.9ms".
  2. In Table 5, the size of the fifth layer of ViT should be 3072*1024*3072 instead of 3072*1024*3048.

Acknowledgement

We acknowledge the support from the University of Pittsburgh New Faculty Start-up Grant, NSF awards #2213701, #2217003, and the support from CRISP, one of six SRC JUMP centers. We thank AMD/Xilinx for FPGA and software donation, and support from the AMD/Xilinx Center of Excellence at UIUC, the AMD/Xilinx Heterogeneous Accelerated Compute Cluster at UCLA, and the Center for Research Computing (CRC) at the University of Pittsburgh.

References:
[1] AIE Architecture (AM009, 2021.1)
[2] AIE Instructions and APIs (UG1078, UG1529)
[3] AIE Coding Example (UG1079, 2021.1)
[4] Versal Programming Environment (UG1076, 2021.1)
[5] Introduction to FP32 Programming of AIE

charm's People

Contributors: jinmingzhuang, peipeizhou-eecs, ussamazahid96


charm's Issues

ERROR: [v++ 60-602] Source file does not exist: hw.xclbin

Hi,

I am facing a problem when running the command
python project_setup.py
I got the error:
ERROR: [v++ 60-602] Source file does not exist: /home/[myusername]/xxxx/CHARM/prj_try/build_dir.hw.xilinx_vck190_base_202220_1/hw.xclbin
I'm wondering if this has something to do with the Vitis 2022.2 version I'm using, or a misconfiguration somewhere.

The complete log is as follows:

mkdir -p ./build_dir.hw.xilinx_vck190_base_202220_1
v++ -l -t hw --platform /s3/Xilinx_Vitis_2022.2/Vitis/2022.2/base_platforms/xilinx_vck190_base_202220_1/xilinx_vck190_base_202220_1.xpfm --save-temps --optimize 2 --hls.jobs 8 --config ./conn.cfg --clock.defaultFreqHz 220000000 --temp_dir ./build_dir.hw.xilinx_vck190_base_202220_1 --vivado.synth.jobs 8 --vivado.impl.jobs 8 -o'build_dir.hw.xilinx_vck190_base_202220_1/hw.xclbin' _x.hw.xilinx_vck190_base_202220_1/dma.xo libadf.a | tee ./build_dir.hw.xilinx_vck190_base_202220_1/hpc_xclbin.log
Option Map File Used: '/s3/Xilinx_Vitis_2022.2/Vitis/2022.2/data/vitis/vpp/optMap.xml'

****** v++ v2022.2 (64-bit)
**** SW Build 3671529 on 2022-10-13-17:52:11
** Copyright 1986-2022 Xilinx, Inc. All Rights Reserved.

objcopy: warning: /home/[myusername]/xxxx/CHARM/prj_try/.Xil/v++-37415-[myusername]/hw.o: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010002
objcopy: warning: /home/[myusername]/xxxx/CHARM/prj_try/.Xil/v++-37415-[myusername]/hw.o: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010001
objcopy: warning: /home/[myusername]/xxxx/CHARM/prj_try/.Xil/v++-37415-[myusername]/sw.o: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010002
objcopy: warning: /home/[myusername]/xxxx/CHARM/prj_try/.Xil/v++-37415-[myusername]/sw.o: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010001
INFO: [v++ 60-1306] Additional information associated with this v++ link can be found at:
Reports: /home/[myusername]/xxxx/CHARM/prj_try/build_dir.hw.xilinx_vck190_base_202220_1/reports/link
Log files: /home/[myusername]/xxxx/CHARM/prj_try/build_dir.hw.xilinx_vck190_base_202220_1/logs/link
Running Dispatch Server on port: 37909
INFO: [v++ 60-1548] Creating build summary session with primary output /home/[myusername]/xxxx/CHARM/prj_try/build_dir.hw.xilinx_vck190_base_202220_1/hw.xclbin.link_summary, at Fri Jul 21 16:59:57 2023
INFO: [v++ 60-1315] Creating rulecheck session with output '/home/[myusername]/xxxx/CHARM/prj_try/build_dir.hw.xilinx_vck190_base_202220_1/reports/link/v++_link_hw_guidance.html', at Fri Jul 21 16:59:57 2023
INFO: [v++ 60-895] Target platform: /s3/Xilinx_Vitis_2022.2/Vitis/2022.2/base_platforms/xilinx_vck190_base_202220_1/xilinx_vck190_base_202220_1.xpfm
INFO: [v++ 60-1578] This platform contains Xilinx Shell Archive '/s3/Xilinx_Vitis_2022.2/Vitis/2022.2/base_platforms/xilinx_vck190_base_202220_1/hw/hw.xsa'
ERROR: [v++ 82-4223] Output file type of .xsa is required. A different output file type has been specified: build_dir.hw.xilinx_vck190_base_202220_1/hw.xclbin
INFO: [v++ 60-1653] Closing dispatch client.
v++ -p -t hw -f /s3/Xilinx_Vitis_2022.2/Vitis/2022.2/base_platforms/xilinx_vck190_base_202220_1/xilinx_vck190_base_202220_1.xpfm
--package.out_dir ./package.hw
--package.rootfs /opt/xilinx-versal-common-v2022.2/rootfs.ext4
--package.kernel_image /opt/xilinx-versal-common-v2022.2/Image
--package.boot_mode=sd
--package.image_format=ext4
--package.defer_aie_run
--package.sd_file hostexe
libadf.a ./build_dir.hw.xilinx_vck190_base_202220_1/hw.xclbin -o mm_hw.xclbin
Option Map File Used: '/s3/Xilinx_Vitis_2022.2/Vitis/2022.2/data/vitis/vpp/optMap.xml'

****** v++ v2022.2 (64-bit)
**** SW Build 3671529 on 2022-10-13-17:52:11
** Copyright 1986-2022 Xilinx, Inc. All Rights Reserved.

ERROR: [v++ 60-602] Source file does not exist: /home/[myusername]/xxxx/CHARM/prj_try/build_dir.hw.xilinx_vck190_base_202220_1/hw.xclbin
INFO: [v++ 60-1662] Stopping dispatch session having empty uuid.
INFO: [v++ 60-1653] Closing dispatch client.
make: *** [Makefile:185: package_hw] Error 1

Thanks
Yiou

Some questions about the paper and code of CHARM

Thank you for open-sourcing the CHARM project. This is an excellent GEMM accelerator work with outstanding performance. While reading the CHARM paper and code, we had a few questions and would really appreciate it if you could answer them.

Firstly, some questions about the paper:

  1. In the last paragraph of section 4.2, it says "... so that a tile of LHS with size (X x A x TI) x (Y x B x TK) can be reused on-chip for (Z x TJ) times"; why is it (Z x TJ) times and not (Z x C x TJ) times?
  2. In the "1st Step: Workload Assignment" portion of section 5.4, "... mapping an application with n kernels to num accs suffers..." is the first place the number of accelerators is mentioned. Is this a user-provided input parameter, or an output of CDSE?
  3. Also in the 1st step, it would be better to explain how the time complexity is reduced to C(n-1, num-1). Where does this expression come from?

Then, as for the code from the GitHub:

  1. In the input.cfg file, KRL_TYPE can only be 0 or 1; however, in src_gen/AIE_ArrGen/gen_graph.sh, line 89, and src_gen/Kernel_Gen/gen_grah.sh, line 73, the scripts check whether kernel_type == "int32". If the kernel type can only be 0 or 1, why check whether it equals int32?
  2. After running code generation, what is the purpose of these three files: mm_graph_x3_type1.h, mm_graph_x3_type0.h, mm_graph_x3_col.h? They are not included in the top function or anywhere else.
  3. Looks like there is no data set for testing, could you provide some data set for a demo?
  4. Our platform is VCK5000; however, when we compile the project by following the instructions, we get some errors. We'd like to double-check whether we need more specific modifications or instructions to run the project on VCK5000.
  5. After searching the CHARM repo, we only found the CACG (code generation) part of the CHARM framework, but not the design-search parts (e.g., CDSE, CDAC, and CRTS). Could you please point us to the location of the source code for these parts? Additionally, how could we get the parameters in input.cfg? How do we generate different accs for different MMs? How do we generate accs for non-MM functions? For example, in example/BERT there are files for different sizes, like mm_graph_large.h, mm_graph_small.h, and dma_large.h, but according to the sources in src_gen, there is no sh file that generates a file ending with _large.h or _small.h. It would be great if you could shed some light on these parts.

Thank you very much in advance and looking forward to your reply!

About pipeline view in paper DAC'23

Dear ARC Lab,
Thanks for releasing the code open source; CHARM is a very nice project. I tried replicating the pipeline view in the DAC'23 paper. Could you please describe how to configure parameters such as A_BRO and C_BRO to generate a 1*4 AIE array? Moreover, could you please tell me how to generate the aiesim data corresponding to the customized AIE array?
Thanks !

non-GEMM operations

Hi,

In the paper, there are evaluations on non-GEMM operations such as softmax, layernorm, etc. However, I think the current release does not include them. I am wondering if you have an implementation or any references.

About MM size in paper and demo

Could you please describe the input sizes of the original models, especially for ViT and BERT?
And how do you organize the matrices to form the MxK and KxN matrices shown in Table 5?

VCK190 Simulator?

Hi @JinmingZhuang
Great work! I am working in machine learning, but I'm a beginner with FPGAs and am learning about machine learning implementation on FPGAs. Your work seems recent and a good starting point. Since FPGA boards, and the VCK190 in particular, are not easily available to everyone, could you, for learning purposes, also make a tutorial or give some pointers for beginners like me on how to run your code in a Xilinx simulator on Linux/Ubuntu? I'm also not sure if such a simulator exists. Please let me know.

Thank you!
Srikanth

How to profile DDR bandwidth?

Hi,
I have seen that there is a Bandwidth Profiler before the DSE in Fig. 6 of the CHARM paper,
and I want to reproduce it on my VCK5000 since it has 4 DRAM banks.
But in cdse.py, the bandwidth part is hard-coded; how did you profile the bandwidth of the FPGA?

# One-Time Profling of DDR Bandwidth
    BW_L_S=(12*DDR_BANK)*freq_rate
    BW_R_S=(12*DDR_BANK)*freq_rate
    BW_O_S=(8.5*DDR_BANK)*freq_rate

    BW_L_DR=(8*DDR_BANK)*freq_rate
    BW_R_DL=(8*DDR_BANK)*freq_rate

    BW_L_DO=(8*DDR_BANK)*freq_rate
    BW_R_DO=(8*DDR_BANK)*freq_rate
    BW_O_D=(6*DDR_BANK)*freq_rate

    BW_L_T=(7*DDR_BANK)*freq_rate
    BW_R_T=(7*DDR_BANK)*freq_rate
    BW_O_T=(7*DDR_BANK)*freq_rate

Thank you.

[AIE-ROUTER-3] Demand of 200 at node AIE_INTF_B3_CORE_X24Y0/AIE_PL_TO_AIE_0 exceeds

Dear ARC Lab,
Thanks for releasing the code open-source and the nice work. I tried replicating the flow as per the instructions and got the following error when executing make all PLATFORM=${PATH} EDGE_COMMON_SW_PATH=${PATH} SYSROOT_PATH={PATH}
Just checking if there is any quick hint/resolution that you are aware of?

Regards,
Rajesh

Total Number of unique Switch FIFOs: 0
Running AIE Post-Map Finalizer.
Post-Map Finalizer succeeded.
Ordered merge post process: Adding BD config broadcast nets
Running AIE Post-Map Finalizer.
Post-Map Finalizer succeeded.
Check AIE-ROUTER has run: 12 errors
ERROR: [AIE-ROUTER-3] Demand of 200 at node AIE_INTF_B3_CORE_X24Y0/AIE_PL_TO_AIE_0 exceeds limit of 100.
ERROR: [AIE-ROUTER-3] Demand of 200 at node AIE_INTF_B3_CORE_X24Y0/AIE_PL_TO_AIE_4 exceeds limit of 100.
ERROR: [AIE-ROUTER-3] Demand of 200 at node AIE_INTF_B3_CORE_X24Y0/AIE_PL_TO_AIE_0_PIN exceeds limit of 100.
ERROR: [AIE-ROUTER-3] Demand of 200 at node AIE_INTF_B3_CORE_X24Y0/AIE_PL_TO_AIE_4_PIN exceeds limit of 100.
ERROR: [AIE-ROUTER-3] Demand of 200 at node AIE_INTF_B3_CORE_X24Y0/AIE_SWITCH_S_SOUTH_CH0_PIN exceeds limit of 100.
ERROR: [AIE-ROUTER-3] Demand of 200 at node AIE_INTF_B3_CORE_X24Y0/AIE_SWITCH_S_SOUTH_CH4_PIN exceeds limit of 100.
ERROR: [AIE-ROUTER-3] Demand of 200 at node AIE_INTF_B0_CORE_X25Y0/AIE_PL_TO_AIE_0 exceeds limit of 100.
ERROR: [AIE-ROUTER-3] Demand of 200 at node AIE_INTF_B0_CORE_X25Y0/AIE_PL_TO_AIE_4 exceeds limit of 100.
ERROR: [AIE-ROUTER-3] Demand of 200 at node AIE_INTF_B0_CORE_X25Y0/AIE_PL_TO_AIE_0_PIN exceeds limit of 100.
ERROR: [AIE-ROUTER-3] Demand of 200 at node AIE_INTF_B0_CORE_X25Y0/AIE_PL_TO_AIE_4_PIN exceeds limit of 100.
ERROR: [AIE-ROUTER-3] Demand of 200 at node AIE_INTF_B0_CORE_X25Y0/AIE_SWITCH_S_SOUTH_CH0_PIN exceeds limit of 100.
ERROR: [AIE-ROUTER-3] Demand of 200 at node AIE_INTF_B0_CORE_X25Y0/AIE_SWITCH_S_SOUTH_CH4_PIN exceeds limit of 100.
HDRTPinDelayHelperV10:: Clearing pin delay helper
Releasing pin delay helper for floorplan 0x6661ea70
HDRTPinDelayHelperV10:: Releasing Pin Dly Helper
NodeGraph Released.
DeviceData Released.
ERROR:MathEngineUDMRouter:###UDM Router Did NOT Finish Successfully
/tools/Xilinx/Vitis/2022.2/aietools/bin/aieir_be: line 98: kill: (-871611) - No such process
Compilation Failed
INFO: [aiecompiler 77-5805] Run completed. Additional information can be found in:
Guidance: ./Work/reports/guidance.html

INFO: [aiecompiler 77-5806] Use the vitis_analyzer tool to visualize and navigate the relevant reports. Run the following command.
vitis_analyzer ./Work/mm_top.aiecompile_summary
/tools/Xilinx/Vitis/2022.2/aietools/bin/aiecompiler: line 83: kill: (-871297) - No such process

wrong output of aie simulation of int8 mm_kernel0

Dear Author,

I am currently practicing small-scale matrix multiplication using a single AIE tile. I have tried mm_kernel0.cc for various data types in the src path of your project (as mm_kernel0 includes two inputs and one output, theoretically it should be the basic unit capable of performing matrix multiplication).

I created separate projects for each data type of mm_kernel0.cc to conduct AIE Simulation. Moreover, I generated the test data and Golden data according to the required data type and size by using Python code.

According to my AIE simulator results, the simulation data for int16, int32, and fp32 data types of mm_kernel0.cc are all correct. However, the int8 data type alone does not yield the correct simulation results, which I find puzzling. For convenience, I have attached my test project for the int8 type. In this project, the input matrices used for testing are two identical 01 diagonal matrices. I transposed them for AIE input (as you know, AIE matrix multiplication requires column-wise storage of matrices). You can directly see the two transposed input matrices, along with the golden result, and the AIE output I saved, in the data/ directory. You'll notice that these two results are completely different. If you have time, you can recompile the simulation to get my results by running python int8_test_data_gen.py && make all. You can also check the other input matrix if modifying the Python code.

I don't think there is any issue with my testing process, as I can obtain correct output results for the other three data types using the same process, with only the int8 output being incorrect. I would like to know if there are any additional conditions to be aware of to obtain the correct results for int8 type mm_kernel0, or if there is something missing or incorrect in my process?

Thank you for your guidance.

mm_int8.zip
