FPGA-Accelerated Simulation Framework Automatically Transforming Arbitrary RTL

License: Other

midas's Introduction

Golden Gate (MIDAS II)

Golden Gate is an optimizing FIRRTL compiler for generating FPGA-accelerated simulators automatically from Chisel-based RTL designs, and is the basis for simulator compilation in FireSim.

Golden Gate is the successor to MIDAS, which was originally based on the Strober sample-based energy simulation framework. Golden Gate differs from prior work in that it is, to our knowledge, the first compiler to support automatic multi-model composition: it can break apart a block of RTL into a graph of models. Golden Gate uses this feature to identify and replace FPGA-hostile blocks with multi-host-cycle models that consume fewer FPGA resources while still exactly representing the behavior of the source RTL. In our ICCAD 2019 paper, we leverage this feature to optimize multi-ported RAMs in order to fit an extra two BOOM cores (6, up from 4) on a Xilinx VU9P.

Changes From MIDAS

Golden Gate inherits nearly all of the features of MIDAS, including FASED memory timing models, assertion synthesis, and printf synthesis, but there are some notable changes:

1. Support for Resource Optimizations

As mentioned above, Golden Gate can identify and optimize FPGA-hostile structures in the target RTL. This is described at length in our ICCAD 2019 paper. Currently, Golden Gate only supports optimizing multi-ported memories, but other resource-reducing optimizations are under development.

2. Different Inputs and Invocation Model (FIRRTL Stage).

Golden Gate is not invoked in the same process as the target generator. Instead, it is invoked as a separate process and provided with three inputs:

FIRRTL for the target design
Associated FIRRTL annotations for that design
A compiler parameterization (derived from Rocket Chip's Config system)

This permits decoupling the target generator from the compiler, and enables the reuse of the same FIRRTL between multiple simulation or EDA backends. midas.Compiler will be removed in the next release.

3. Endpoints Have Been Replaced With Target-to-Host Bridges.

Unlike endpoints, which were instantiated by matching on a Chisel I/O type, target-to-host bridges (or bridges, for short) are instantiated directly in the target's RTL (i.e., in Chisel). Bridges can also be instantiated anywhere in the module hierarchy, and can more effectively capture module-hierarchy-dependent parameterization information from the target. This makes it easier to have multiple instances of the same bridge with different parameterizations.

4. The Input Target Design Must Be Closed

The FIRRTL passed to Golden Gate must expose no dangling I/O (with the exception of one input clock): instead the target should be wrapped in a module that instantiates the appropriate bridges. This wrapper module is directly analogous to a test harness used in software-based RTL simulation. How these bridges are instantiated is left to the user, but multiple different examples can be found in FireSim. One benefit of this "closed-world" approach is that the topology of the simulator (as a network of simulation models) is guaranteed to match the topology of the input design.

5. Different Underlying Dataflow Network Formalism

Golden Gate uses the Latency-Insensitive Bounded-Dataflow Network (LI-BDN) target formalism. This makes it possible to model combinational paths that span multiple models, and to prove properties about target-cycle exactness and deadlock freedom in the resulting simulator.

Documentation

Golden Gate's documentation is hosted in FireSim's Read-The-Docs.

Related Publications

Albert Magyar, David T. Biancolin, Jack Koenig, Sanjit Seshia, Jonathan Bachrach, Krste Asanović, “Golden Gate: Bridging The Resource-Efficiency Gap Between ASICs and FPGA Prototypes”, To appear at ICCAD '19. (Paper PDF)
David Biancolin, Sagar Karandikar, Donggyu Kim, Jack Koenig, Andrew Waterman, Jonathan Bachrach, Krste Asanović, “FASED: FPGA-Accelerated Simulation and Evaluation of DRAM”, In proceedings of the 27th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, CA, February 2019. (Paper PDF)
Donggyu Kim, Christopher Celio, Sagar Karandikar, David Biancolin, Jonathan Bachrach, and Krste Asanović, “DESSERT: Debugging RTL Effectively with State Snapshotting for Error Replays across Trillions of cycles”, In proceedings of the 28th International Conference on Field Programmable Logic & Applications (FPL 2018), Dublin, Ireland, August 2018. (IEEE Xplore)
Sagar Karandikar, Howard Mao, Donggyu Kim, David Biancolin, Alon Amid, Dayeol Lee, Nathan Pemberton, Emmanuel Amaro, Colin Schmidt, Aditya Chopra, Qijing Huang, Kyle Kovacs, Borivoje Nikolić, Randy Katz, Jonathan Bachrach, and Krste Asanović, “FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud”, In proceedings of the 45th ACM/IEEE International Symposium on Computer Architecture (ISCA 2018), Los Angeles, June 2018. (Paper PDF, IEEE Xplore) Selected as one of IEEE Micro’s “Top Picks from Computer Architecture Conferences, 2018”.
Donggyu Kim, Christopher Celio, David Biancolin, Jonathan Bachrach, and Krste Asanović, "Evaluation of RISC-V RTL with FPGA-Accelerated Simulation", The First Workshop on Computer Architecture Research with RISC-V (CARRV 2017), Boston, MA, USA, Oct 2017. (Paper PDF)
Donggyu Kim, Adam Izraelevitz, Christopher Celio, Hokeun Kim, Brian Zimmer, Yunsup Lee, Jonathan Bachrach, and Krste Asanović, "Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL", International Symposium on Computer Architecture (ISCA-2016), Seoul, Korea, June 2016. (ACM DL, Slides)

Dependencies

This repository depends on the following projects:

Chisel: In the current version, target RTL that MIDAS transforms must be written in Chisel. Additionally, the MIDAS RTL libraries are all written in Chisel.
FIRRTL: Transformations of target-RTL are performed using FIRRTL compiler passes.
RocketChip: Rocket Chip is not only a chip generator, but also a collection of useful libraries for various hardware designs.
barstools: Some additional technology-dependent custom transforms (e.g., the macro compiler) are required when Strober energy modelling is enabled.

midas's Issues

Synthesized Prints Cycle Count Mismatch

There seems to be a mismatch between the cycle count associated with synthesized prints and the target cycle count. This seems to manifest in the case of sparse prints (~370000 print statements out of 6246847756 target cycles). As a particular example, the last print statement in an experiment indicates CYCLE: 167551685485, while the end of simulation indicates Runs 6246847756 cycles.

No tester dependency

With #6, ZynqShimTester is not working because of weird timing behavior of testers, so it's time for strober to graduate from chisel tester.

Expose MSHRs as a runtime-configurable setting.

MSHRs can and should be made a runtime-configurable setting.

@farzadfch

step() in peek poke interface in simif is deceptive

step() in simif can be deceptive for users who are familiar with Chisel's PeekPokeTesters. Consider the following example:

import chisel3._

class ShiftRegister extends Module {
  val io = IO(new Bundle {
    val in1  = Input(UInt(8.W))
    val in2  = Input(UInt(8.W))
    val out = Output(UInt(8.W))
    val enable = Input(Bool())
  })
  val out = RegInit(0.U(8.W))
  io.out := out
  when (io.enable) {
    out := io.in1 + io.in2
  }
}

#include "simif.h"

class ShiftRegister_t: virtual simif_t
{
public:
  void run() {
    std::vector<uint32_t> reg(4);
    target_reset();
    poke(io_enable, 0);
    step(5);
    poke(io_enable, 1);
    step(1);
    poke(io_in1, 10);
    poke(io_in2, 20);
    step(1);
    expect(io_out, 30);
  }
};

In chisel-testers, the poke-step-expect sequence works as expected, but since simif's "step" is really "fire one cycle of targetFire", the expect line actually dequeues the old value of io_out right before the posedge, which is different from what the chisel-testers do.

Suggestions: either document this, or perhaps rename as "targetFireStep()" or some other disambiguated name, or provide step(1) as an alias for targetFireStep(2), etc.
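To make the difference concrete, here is a minimal sketch of the two step semantics. It is a toy model of the example register, not simif's actual implementation: ToyTarget, token_step, tester_step and peek_out are hypothetical names, and the assumption is that each fired target cycle hands the host an output token captured just before the clock edge.

```cpp
#include <cstdint>

// Toy model only (NOT simif's real implementation): one enabled register
// fed by in1 + in2, advanced one target cycle at a time.
struct ToyTarget {
    uint8_t in1 = 0, in2 = 0;
    bool enable = false;
    uint8_t reg = 0;        // target register state
    uint8_t out_token = 0;  // last output token handed to the host

    // Assumed token semantics: the output token is captured from the state
    // before the clock edge, and only then does the register update.
    void fire_one_cycle() {
        out_token = reg;
        if (enable) reg = static_cast<uint8_t>(in1 + in2);
    }

    // Token-style step: a later peek_out() sees the pre-edge value.
    void token_step(int n) {
        for (int i = 0; i < n; ++i) fire_one_cycle();
    }

    // chisel-testers-style step: a later peek_out() sees the post-edge value.
    void tester_step(int n) {
        token_step(n);
        out_token = reg;  // refresh the visible output after the last edge
    }

    uint8_t peek_out() const { return out_token; }
};
```

With enable set and inputs 10 and 20, one token_step leaves peek_out() at the old value 0 and a second one yields 30, whereas a single tester_step already yields 30.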

@davidbiancolin

Golden Gate Release Checklist

Endpoint-based black box
Conversion to FIRRTL stage
Blackbox support in FAME 1 transform
Dead code clean up

Allow endpoints to specify initial token count in their channels.

Currently, every target DecoupledIO between an endpoint and the transformed-RTL model is given a decoupled channel with latency = 1 (that is, it is seeded with one initial token).

While this will be fixed in the new FAME compiler, a short-term solution would be to allow endpoints to specify what sort of channel (or latency) they'd like on the interconnect between themselves and the transformed RTL.

Croak more obviously when the driver is mismatched with a bitstream.

Make sure output dir exists before writing a conf file.

Update README to explain deps on RISCV toolchain.

[MIDAS 2] Bring up FASED Configuration Generation

The new endpoint system will break how this is currently invoked.

Remove requires where possible in the FASED functional model

The functional model needs to be more flexible, since it's being driven by an edge which can vary widely from target to target.
E.g., non-power-of-two multi-queues:

Generating a Midas Memory Model
  Max Read Requests: 16
  Max Write Requests: 16
  Max Read Length: 8
  Max Write Length: 8
  Max Read ID Reuse: 3
  Max Write ID Reuse: 3
Timing Model Parameters
  Timing Model Class: Latency Bandwidth Pipe
  No LLC Model Instantiated
[error] (run-main-0) java.lang.IllegalArgumentException: requirement failed
[error] java.lang.IllegalArgumentException: requirement failed
[error]         at scala.Predef$.require(Predef.scala:264)
[error]         at midas.widgets.MultiQueue.<init>(Lib.scala:118)

Issue with subclasses in widget matching

We recently made changes to IceNet/SimpleNIC that changed the NICIO to be a subclass of SerialIO instead of StreamIO. Unfortunately, this caused midas widget mapping to break, because the SerialWidget was matching on SerialIO and all its subclasses, so it mistakenly matched the NICIO with the SerialWidget instead of the SimpleNICWidget. The solution was to change SimSerialIO's matchType function to explicitly return false for NICIO. I'm not sure how exactly this issue could be avoided in the future.
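The failure mode can be sketched outside of Chisel. The C++ below is only an illustration: the two structs are hypothetical stand-ins for the I/O types named above, and the matcher functions are not midas code. It shows why a subclass-inclusive check on the base type captures the subclass, and one possible mitigation (an assumption, not something the project adopted): test the most-derived type first.

```cpp
#include <string>

// Hypothetical stand-ins for the I/O types named in the issue.
struct SerialIO { virtual ~SerialIO() = default; };
struct NICIO : SerialIO {};

// Subclass-inclusive matching with the base type checked first:
// a NICIO is-a SerialIO, so the serial widget claims it.
inline std::string match_base_first(const SerialIO& io) {
    if (dynamic_cast<const SerialIO*>(&io)) return "SerialWidget";
    if (dynamic_cast<const NICIO*>(&io)) return "SimpleNICWidget";  // never reached
    return "none";
}

// Possible mitigation: check the most-derived type before its base.
inline std::string match_derived_first(const SerialIO& io) {
    if (dynamic_cast<const NICIO*>(&io)) return "SimpleNICWidget";
    return "SerialWidget";
}
```

The derived-first ordering avoids hard-coding an exclusion for NICIO in the serial matcher, at the cost of having to keep the check order consistent with the class hierarchy.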

Master doesn't properly measure runtime, reports wrong simulation frequency

For long-running workloads we see stuff like this:

==> spec-test/473.astar.test.err <==
SEED: 7282986
time elapsed: 18446744072974.2 s, simulation speed = 0.00 KHz
*** PASSED *** after 74901751853 cycles
Runs 74901751853 cycles
[PASS] MidasTop Test
SEED: 7282989

real    130m58.391s
user    0m0.237s
sys     0m0.526s
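The reported elapsed time is just below 2^64 microseconds expressed in seconds, which would be consistent with an unsigned subtraction wrapping around. That is a guess from the number alone, not a confirmed diagnosis. The sketch below reproduces the arithmetic and shows a monotonic-clock alternative; both function names are hypothetical and neither is taken from the midas driver.

```cpp
#include <chrono>
#include <cstdint>

// Unsigned subtraction wraps modulo 2^64 when end_us < start_us, so a
// small negative difference turns into an enormous positive one.
inline double naive_elapsed_seconds(uint64_t start_us, uint64_t end_us) {
    return static_cast<double>(end_us - start_us) / 1e6;
}

// Alternative: steady_clock is monotonic, so end - start cannot be
// negative when end is sampled after start.
inline double elapsed_seconds(std::chrono::steady_clock::time_point start,
                              std::chrono::steady_clock::time_point end) {
    return std::chrono::duration<double>(end - start).count();
}
```

For example, naive_elapsed_seconds(735351616, 0) evaluates to roughly 1.8e13 seconds, the same order of magnitude as the value in the log, and dividing the cycle count by such a number rounds the simulation speed to 0.00 KHz.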

FAME-1 transforms should more intelligently bundle target I/O into FAME channels

I'm just going to start opening issues for things that need obvious improvement. It'll be easy to track them here.

Presently, all leaf signals are broken into fame decoupled bundles, despite the fact all input tokens will be consumed on the same host cycle and all outputs are being produced on the same cycle.

We should only create fame decoupled bundles for subsets of the output that can be produced on different cycles, and subsets of the input that can be consumed on different host cycles. An example of this would be a fame-1 decoupled target with multiple clock-domains (whose frequencies differ).

For fame-1 decoupled targets with a single clock, there should be only a single output and a single input fame-1 channel produced.
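The proposed grouping can be sketched as follows. This is an illustration only: Port, ChannelKey and bundle_ports are hypothetical names, and the sketch assumes every leaf port is tagged with the clock domain it belongs to. Ports are bundled into one channel per (clock domain, direction) pair instead of one channel per leaf signal.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical description of one leaf port of the target.
struct Port {
    std::string name;
    std::string clock;  // clock domain the port belongs to (assumed known)
    bool is_input;
};

// One channel per (clock domain, direction) pair.
using ChannelKey = std::pair<std::string, bool>;

// Group leaf ports so that signals consumed or produced together share
// a single channel.
inline std::map<ChannelKey, std::vector<std::string>>
bundle_ports(const std::vector<Port>& ports) {
    std::map<ChannelKey, std::vector<std::string>> channels;
    for (const Port& p : ports) {
        channels[ChannelKey{p.clock, p.is_input}].push_back(p.name);
    }
    return channels;
}
```

For a single-clock target this yields exactly two channels, one for inputs and one for outputs, no matter how many leaf signals the I/O contains.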

Remote set url for zc706_MIG

The remote url for
zc706_MIG/fpga-images-zc706 is
[email protected]:ucb-bar/fpga-images-zc706.git

This makes it inaccessible to those outside of the project members.

Fix:
Edit .gitmodules
git submodule sync

[DRAM model] queue-size setting by using mmReg

The DRAM FRFCFSModel and the PCRAM model use queues and buffers whose sizes are configurable through mmRegs. However, if, for example, transactionQueueDepth=8 is applied to the model, the model sets the queue depth to "000". As discussed with David, this may be caused by an overflow, so the code below should be modified to increase the register size.

===
class FirstReadyFCFSMMRegIO(val cfg:FirstReadyFCFSConfig) extends BaseDRAMMMRegIO(cfg) {
val schedulerWindowSize = Input(UInt(log2Ceil(cfg.schedulerWindowSize).W))
val transactionQueueDepth = Input(UInt(log2Ceil(cfg.transactionQueueDepth).W))
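The suspected overflow is easy to check by hand: log2Ceil(8) is 3, and a 3-bit register cannot hold the value 8, only 0 through 7. The sketch below mirrors that arithmetic in C++. The helper names are hypothetical, and widening to log2Ceil(n + 1) bits is one possible fix under the assumption that the register must be able to hold the value n itself.

```cpp
#include <cstdint>

// Smallest w such that 2^w >= n, for n >= 1 (the usual meaning of log2Ceil).
constexpr unsigned log2_ceil(uint64_t n) {
    unsigned w = 0;
    while ((uint64_t{1} << w) < n) ++w;
    return w;
}

// Value that survives when `value` is written to a `width`-bit register
// (width must be less than 64).
constexpr uint64_t truncate_to_width(uint64_t value, unsigned width) {
    return value & ((uint64_t{1} << width) - 1);
}
```

Here truncate_to_width(8, log2_ceil(8)) is 0, matching the observed "000", while truncate_to_width(8, log2_ceil(8 + 1)) keeps the value 8.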

SerialWidget's target-time behavior influenced by simulation stalls

In a modified design (modified to also stall when myStall is high), the SerialWidget inBuf starts to be filled while the stall signal is active. As a result, the inBuf already contains an element once the simulation is resumed, and this element is immediately sent to the target. In the golden design, by contrast, the inBuf sends the element 2 rocket-chip cycles later than in the modified design, because the inBuf had to be filled first.

midas is overdependent on rocketchip

For now, midas depends on rocketchip, so regardless of the target design, we have to import rocketchip to use midas. This is unacceptable in most cases, including midas-examples. midas needs to depend only on chisel and firrtl. Here are the stats for code shared from rocketchip:

config: 71 lines
arbiters: 68 lines for nasti
multi-width FIFO: 79 lines
misc: 74 lines (from util/Misc.scala)

Thus, we use only <300 lines from >10K lines of rocketchip. If we don't need parameterized bundles any more, this is even less. I don't see any justification for the rocketchip dependency right now. We may use tilelink for midas, but that is an uncertain future. Also, don't tell me that submoduling rocketchip and its build time do not matter at all. Even not importing riscv-tools is very hard without a script. Writing a build system for rocketchip is even harder.

Here's my plan to cut off the rocketchip dependency:

Have our own config system until parameterization is provided by chisel.
Just copy some util code. This code is unlikely to change. Copying code is very ugly, but I believe it is less ugly than importing the whole of rocketchip and spending effort on a bump whenever it changes. If some modules need more rocketchip code, they are in the wrong place. (They should reside in firesim.)

Of course, we should cut off the barstools dependency too.

[DRAM Model] Accept target-address offset and memory system size in configuration

This will let us better allocate host-dram.

[DRAM Model] Better indicate that the functional model is undersized

The quick fix: add messages to assertions.