scalesim-project / scale-sim-v2

Repository to host and maintain scale-sim-v2 code

License: MIT License

Shell 1.11% Python 89.54% Makefile 0.40% Verilog 8.96%


scale-sim-v2's Issues

Where is the interface to DRAMsim2?

I only see the statistics given by the end of the simulation, whereas the paper reads "SCALE-SIM allows for modeling the main memory behavior by generating accurate read and write bandwidths at the interface, which can then be fed into a DRAM simulator e.g., DRAM-Sim2 [22]." So where do you dump the trace file to be fed into DRAMsim2? Or how could you invoke DRAMsim2?
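In the meantime, such a hand-off can be scripted from the emitted trace files. The sketch below assumes each trace row has the layout (cycle, addr, addr, ...) with -1 marking empty request slots; the function name and the downstream consumer are hypothetical, not part of the tool:

```python
# Hypothetical helper: turn a SCALE-Sim DRAM trace into a per-cycle request
# list that could be replayed into an external DRAM simulator.
# Assumed row layout: cycle, addr, addr, ... with -1 meaning "no request".
import csv

def trace_to_requests(path):
    requests = []  # list of (cycle, [addresses])
    with open(path) as f:
        for row in csv.reader(f):
            fields = [c.strip() for c in row if c.strip()]
            if not fields:
                continue
            cycle = int(float(fields[0]))
            addrs = [int(float(a)) for a in fields[1:] if float(a) >= 0]
            if addrs:  # skip cycles with no outstanding requests
                requests.append((cycle, addrs))
    return requests
```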

Provide a new method to get a dummy config string with default parameters

It would be useful to have a method in the scale_config class that generates a list with default configurations. This could act as a template for code that passes the configuration as a string, as opposed to reading a dummy config file, when using scalesim as a library.

Hint:
Temporarily bypass the check for a valid conf in the following implementation:

    def get_conf_as_list(self):
        out_list = []

        if not self.valid_conf_flag:
            print("ERROR: scale_config.get_conf_as_list: Configuration is not valid")
            return

       ...
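A minimal sketch of the requested helper, under the assumption that the config is consumed as a flat list of strings. The function name and default values are illustrative; the field names mirror the keys that appear in scale-sim config files but the exact list layout expected by scalesim should be checked against get_conf_as_list():

```python
# Hypothetical sketch of the requested default-config generator.
# Field names follow scale-sim cfg keys; values are illustrative defaults.
def get_default_conf_as_list():
    defaults = {
        'run_name': 'scale_run',
        'ArrayHeight': 32, 'ArrayWidth': 32,
        'IfmapSramSzkB': 64, 'FilterSramSzkB': 64, 'OfmapSramSzkB': 64,
        'IfmapOffset': 0, 'FilterOffset': 10000000, 'OfmapOffset': 20000000,
        'Dataflow': 'os', 'Bandwidth': 10, 'MemoryBanks': 1,
    }
    # Return everything as strings, so callers can pass it in place of
    # the contents of a dummy config file.
    return [str(v) for v in defaults.values()]
```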

Number of SRAM reads for ofmap for input stationary dataflow is unrealistically small

The problem is that the number of SRAM ofmap reads for the input stationary dataflow is unrealistically small. Upon inspecting the code, it seems that the per-iteration read counts are not accumulated across iterations over the output matrix; instead, each iteration overwrites the previous count, so the final number reported is just the read count of the last iteration. Hope this is clear.
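The suspected bug reduces to a classic accumulate-vs-assign mistake; a minimal illustration with hypothetical names:

```python
# Minimal illustration of the suspected bug: per-iteration read counts must
# be accumulated, not assigned. Names are illustrative, not from the repo.
def total_ofmap_reads(reads_per_iteration):
    total = 0
    for reads in reads_per_iteration:
        # Buggy version would do: total = reads  (keeps only the last value)
        total += reads
    return total
```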

Issue with -i gemm argument

Hi,
I experience an issue when I try scale-sim with the argument -i gemm. The tool raises an assertion saying that the size per row in the csv file should be at least 4. However, in the documentation example, the MNK input format seems to have exactly 4 parameters (Layer Name, M, N, K). Am I missing something here?

My csv has the same format as the example in README of scale-sim for GEMM operations.

Thanks in advance,
Kostas.

Add support to take inputs in GEMM format

Support for this statement:

The tool however expects the inputs to be in the convolution format by default. When using the mnk format for input, please specify using the -i gemm switch, as shown in the example below.

$ python3 <scale sim repo root>/scalesim/scale.py -c <path_to_config_file> -t <path_to_mnk_topology_file> -i gemm

Likely typo in `memory_map.py`

I just remembered that a while ago I noticed a very likely typo in memory_map.py.

The following line is probably missing a `:`:

self.ifmap_map_list.append(elems[self.num_banks])

I.e., the line should be

self.ifmap_map_list.append(elems[:self.num_banks])

While I do not understand the code 100%, I am pretty sure that the `:` is missing, because very similar lines that come after it have a `:` in this position.
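A quick demonstration of the difference the missing `:` makes: indexing returns a single element, slicing returns the leading sub-list.

```python
# elems[num_banks] picks one element; elems[:num_banks] takes a prefix.
elems = [0, 1, 2, 3, 4]
num_banks = 3

single = elems[num_banks]    # the element at position 3
prefix = elems[:num_banks]   # the first 3 elements
```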

Eyeriss configuration

Thanks for your kindness and great work.

At cfg file,
IfmapSramSzkB: 6144
FilterSramSzkB: 6144
OfmapSramSzkB: 2048
Does this mean the processor SRAM size is (6144 + 6144 + 2048) kB?

In the Eyeriss cfg, the SRAM sizes for (IA, W, OA) are 108, 108, 108, but Eyeriss has only 108 kB in total.
Moreover, Eyeriss has a register file in each PE. How does the simulator handle that (RF size, SRAM bandwidth)?

Make a dir

Hello!

I'm using scale-sim-v2 on Windows (Anaconda 3, Jupyter Notebook, Python 3.10). The only error I got is in directory creation (simulator.py and single_layer_sim.py): the creation is OS-specific instead of using the portable os.mkdir() function. It would be great if someone could update that for the next version.

Cheers!
N.
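For reference, a portable sketch along the lines the poster suggests, using the standard library's os.makedirs (which works on both Windows and POSIX); the helper name is illustrative:

```python
# Portable output-directory creation: os.makedirs handles nested paths on
# Windows and POSIX alike, and exist_ok=True avoids a crash if the
# directory already exists.
import os

def ensure_outdir(path):
    os.makedirs(path, exist_ok=True)
    return path
```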

Tile size

Hi,
How do I get ofmap tile size (number of outputs generated in one iteration of DRAM)? Can you please point out the exact location in the code, where I can get an estimate about the total number of ofmap tiles as well as the size?

Dram access & bandwidth

Thanks for your great works.
I would like to ask you a question.
I have run many experiments with scale-sim, and the results involving DRAM seem to be wrong.
The total DRAM access count is not a function of the SRAM size in your simulator, which causes it to report far more DRAM accesses than the accelerator would actually require.
Also, setting the simulator to USER mode causes the accelerator to stall even when it has enough DRAM bandwidth.
I am not sure if I am wrong, but I do know how to design a TPU with my own Verilog code.

issue in systolic_compute_ws.py

In line 184, inside the function create_ifmap_demand_mat(), the variable inter_fold_gap_suffix should equal self.arr_col + self.arr_row - 2 instead of self.arr_col - 1. The last input needs self.arr_row - 1 cycles to reach the last column of the PE array and then self.arr_col - 1 cycles to reduce along the last column.
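The poster's cycle arithmetic can be sketched as a tiny helper (names are illustrative, and this only restates the reasoning above, not the repo's actual code):

```python
# Fill + drain accounting as described in the issue: the last input needs
# arr_row - 1 cycles to reach the last column, then arr_col - 1 cycles to
# reduce along it.
def ws_drain_cycles(arr_row, arr_col):
    reach_last_column = arr_row - 1
    reduce_last_column = arr_col - 1
    return reach_last_column + reduce_last_column  # = arr_row + arr_col - 2
```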

ISPASS documentation Figure 9.(b),(c)

I'm trying to reproduce Figure 9(b) and (c) of the ISPASS paper.
In the caption right below the figure, it says 2^14 MACs and 2^16 MACs for Fig. 9(b) and (c).

But on the next page (p. 65 of ISPASS), it says "Figure 9(b-c) ~~~ 4096 and 16384 MAC units respectively", which confuses me a bit (maybe I misinterpreted it). Can you give me advice on this?

Also, about reproducing the TF0 layer, which has 31999 84 1024 (Sr, T, Sc): considering 2^14 MACs (Figure 9.b), R,C is 64x64.
I thought the topology file should be in one of these two formats:
Layer name, IFMAP Height, IFMAP Width, Filter Height, Filter Width, Channels, Num Filter, Strides,
TF0, 64, 84, 64, 84, 250, 8, 1, ...(1)
TF0, 16000, 84, 512, 84, 1, 1, 1, ...(2)
(Maybe these are wrong too, sadly.)

I figured out from the basic tutorial how to run scale-sim, but I am having trouble setting up the configs and topologies. So may I ask you for a topology file and a config file for Figure 9(b, c)? (to help my understanding)

Thank you for Reading.
Hope to get an answer for this.
Have a Good Day.

Enable Batching Support

Enable batching of multiple IFMAP matrices for a single run.

Note:
Details to be updated soon.

Main memory wasted

As the input topology grows larger, the cost of main (DDR) memory on the host grows much larger. Variables like 'self.ifmap_prefetch_matrix' waste too much memory because they are never released. I hope this can be solved in the future. Thank you!
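A minimal pattern for the kind of release the poster asks for. The helper and the second attribute name are hypothetical; only 'ifmap_prefetch_matrix' is named in the issue:

```python
# Drop references to large per-layer matrices once a layer's run is done,
# so Python's garbage collector can reclaim them. Attribute names are
# illustrative (only ifmap_prefetch_matrix is quoted in the issue).
import gc

def release_layer_buffers(layer_obj):
    for attr in ('ifmap_prefetch_matrix', 'filter_prefetch_matrix'):
        if hasattr(layer_obj, attr):
            setattr(layer_obj, attr, None)
    gc.collect()
```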

How to set the batch size in a training progress

Hi, I want to use the simulator to simulate the training process of a CNN for research purposes, but I'm confused about the batch size setting. Does it default to 1 or to some other constant value? Furthermore, I'm wondering what the simulation cycle count of each layer in the output means. Does it mean that training one sample, once through the CNN, to update the weights costs that many cycles?
Looking forward to anyone's reply to help me use this simulator better.
Thanks very much!

variables mean

I want to know what these variables mean. Can you tell me?
They are in scale-sim-v2-main/scalesim/memory/read_buffer_estimate_bw.py:
# Tracking variables
self.num_items_per_set = -1
self.elems_current_set = 0
self.current_set_id = 0
self.read_buffer_set_start_id = -1
self.read_buffer_set_end_id = -1
self.prefetch_buffer_set_start_id = -1
self.prefetch_buffer_set_end_id = -1
self.last_prefetch_start_cycle = -2
self.last_prefetch_end_cycle = -1
self.first_request_rcvd_cycle = 0

Memory leak in read_buffer_estimate_bw method

A significant memory leak is observed in the read_buffer_estimate_bw.py file whenever the tool is run in estimate bandwidth mode. Sets are used to manage prefetches and do fast lookups. The fast lookups come at a cost of higher memory usage. Only the sets which contain elements in the active read buffer portion of the SRAM need to be stored to perform lookup operations. The remaining sets become redundant and need to be removed.
The attached graph shows the difference created by solving this memory leak. As large as 96% of memory can be saved by solving this issue.
(Screenshot from 2021-12-11 attached.)
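A hedged sketch of the proposed pruning, assuming the prefetch sets are keyed by a monotonically increasing set id; all names here are illustrative, not the repo's actual structures:

```python
# Keep only the sets that overlap the active read-buffer window; delete the
# rest so lookup memory stays bounded. sets_by_id maps set id -> set.
def prune_stale_sets(sets_by_id, active_start_id, active_end_id):
    stale = [sid for sid in sets_by_id
             if sid < active_start_id or sid > active_end_id]
    for sid in stale:
        del sets_by_id[sid]
    return sets_by_id
```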

Can I get the tutorial 3 script?

In the ASPLOS tutorial, there is a Tutorial 3 section, but I cannot find the Tutorial 3 script in this repository.

There are scripts for Tutorials 1 and 2, but not 3. So, can we get it?

Basic question about memory access patterns

Thank you for creating and releasing this tool. It helped me understand systolic array functionality to a large extent.

I have a basic question: after the simulation ends, the tool produces DRAM and SRAM read and write trace files. In each of these files, each line lists the memory addresses where data are read from/written to, for each computation cycle of the systolic array. These addresses are not consecutive in memory. This may be a basic question, but I wanted to understand how data are accessed from multiple distinct memory locations in the same processor clock cycle or how is this implemented in real hardware.

Additionally, the ISPASS paper associated with this work presents a number of energy consumption graphs based on Scale-Sim simulations. It would be great if you could shed some light on how these energy results were obtained.

Hope you can clarify my doubts. Thank you!

asking about order or unit

Thanks for your great work.

My question is about units and ordering in your code.

  1. In the cfg file, does Bandwidth mean the max DRAM bandwidth in GB/s? What is InterfaceBandwidth: CALC?
  2. In BANDWIDTH_REPORT.csv, what does Avg Bandwidth mean? GB/s? In 1), a bandwidth was set, but the Avg DRAM BW exceeds 10.
  3. Do the DRAM and SRAM bandwidths affect Overall Utilization? What is its exact definition? Does it differ from Compute Utilization? They show the same result.
  4. How are depthwise (DW) layers mapped onto the TPU and scale model? The mapping efficiency is much lower than I expect (less than 3% in COMPUTE_REPORT.csv).
  5. In DETAILED_ACCESS_REPORT.csv, does DRAM FilterReads mean the number of DRAM accesses for weights, or bytes?

Your work is very helpful. Thanks again.

Error While Running Scalesim

I am getting a TypeError while running Scalesim

Traceback (most recent call last):
File "scalesim/scale.py", line 37, in
input_type_gemm=gemm_input
TypeError: __init__() got an unexpected keyword argument 'input_type_gemm'

Here is the command that I used
python3 scalesim/scale.py -c configs/scale.cfg -t topologies/conv_nets/yolo_tiny.csv -p OUTPUT

Make release 2.0.2

  1. Tag in Github
  2. Release in Github
  3. Release notes to changelog.md
  4. Push to PyPi with new version tag
  5. Push to ARM

Experimenting with a small example: weird pauses and indices

After some experimenting with scale-sim-v2 (and SCALE-SIM v1), I noticed a few things in the results of scale-sim-v2 that I just could not quite make sense of:

The example I based my experiments on was:

  • config
    [general]
    run_name=conf_4x4_os
    
    [architecture_presets]
    ArrayHeight:    4
    ArrayWidth:     4
    IfmapSramSzkB:  128
    FilterSramSzkB: 128
    OfmapSramSzkB:  128
    IfmapOffset:    0
    FilterOffset:   10000000
    OfmapOffset:    20000000
    Bandwidth:      10
    Dataflow:       os
    MemoryBanks:    1
    
    [run_presets]
    InterfaceBandwidth: CALC
  • topology
    Layer name IFMAP Height IFMAP Width Filter Height Filter Width Channels Num Filter Strides
    Conv1 4 4 3 3 1 5 1

The output is the following (empty cells originally contained "-1" entries):

  • IFMAP_SRAM_TRACE.csv + FILTER_SRAM_TRACE.csv

    1 0       10000000      
    2 1 1     10000001 10000009    
    3 2 2 4   10000002 10000010 10000018  
    4 4 3 5 5 10000003 10000011 10000019 10000027
    5 5 5 6 6 10000004 10000012 10000020 10000028
    6 6 6 8 7 10000005 10000013 10000021 10000029
    7 8 7 9 9 10000006 10000014 10000022 10000030
    8 9 9 10 10 10000007 10000015 10000023 10000031
    9 10 10 12 11 10000008 10000016 10000024 10000032
    10   11 13 13   10000017 10000025 10000033
    11     14 14     10000026 10000034
    12       15       10000035
    13                
    14                
    15                
    16 0       10000036      
    17 1 1     10000037      
    18 2 2 4   10000038      
    19 4 3 5 5 10000039      
    20 5 5 6 6 10000040      
    21 6 6 8 7 10000041      
    22 8 7 9 9 10000042      
    23 9 9 10 10 10000043      
    24 10 10 12 11 10000044      
    25   11 13 13        
    26     14 14        
    27       15        
    28                
    29                
    30                
  • OFMAP_SRAM_TRACE.csv

    0        
    1        
    2        
    3        
    4        
    5        
    6        
    7        
    8 20000015      
    9 20000010 20000016    
    10 20000005 20000011 20000017  
    11 20000000 20000006 20000012 20000018
    12   20000001 20000007 20000013
    13     20000002 20000008
    14       20000003
    15        
    16        
    17        
    18        
    19        
    20        
    21        
    22        
    23 20000019      
    24 20000014      
    25 20000009      
    26 20000004      
    27        
    28        
    29        

1. Why are there pauses between executing filter 1 through 4 and executing filter 5?

SCALE-SIM v1 was able to execute them directly after each other.

The output of V1 looked like:

0 0       10000000      
1 1 1     10000001 10000009    
2 2 2 4   10000002 10000010 10000018  
3 4 3 5 5 10000003 10000011 10000019 10000027
4 5 5 6 6 10000004 10000012 10000020 10000028
5 6 6 8 7 10000005 10000013 10000021 10000029
6 8 7 9 9 10000006 10000014 10000022 10000030
7 9 9 10 10 10000007 10000015 10000023 10000031
8 10 10 12 11 10000008 10000016 10000024 10000032
9 0 11 13 13 10000036 10000017 10000025 10000033
10 1 1 14 14 10000037   10000026 10000034
11 2 2 4 15 10000038     10000035
12 4 3 5 5 10000039      
13 5 5 6 6 10000040      
14 6 6 8 7 10000041      
15 8 7 9 9 10000042      
16 9 9 10 10 10000043      
17 10 10 12 11 10000044      
18   11 13 13        
19     14 14        
20       15        

(The OFMAP trace of V1 looked kind of weird and different from V2, though.)

In general, the total cycle numbers of V2 currently differ a lot from V1.

2. The cycles for SRAM read and write in the output do not line up.

They are off by about 2 cycles.
Why do the read indices begin at 1 and the write indices at 0?


I had some additional weird findings when running bigger examples and restricting the RAM bandwidth.
For example, the DRAM read amount increased when increasing the SRAM size. When looking at the traces, there were a lot of "1" values that probably should have been "-1".
I could post the example if you would be interested in those results as well.

Errors due to usage of deprecated 'np.int'

The alias 'np.int' for the builtin 'int' was deprecated in NumPy 1.20.0 and has since been removed. This leads to errors when running the example commands from the README.md after a fresh install.
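A minimal illustration of the fix: since np.int was only an alias for the builtin int (removed in NumPy 1.24), replacing it with int, or with an explicit width such as np.int64, works on current NumPy:

```python
# dtype=int replaces the removed np.int alias; np.int64 is the explicit
# fixed-width alternative if a specific width was intended.
import numpy as np

arr = np.zeros(4, dtype=int)          # instead of dtype=np.int
arr64 = np.zeros(4, dtype=np.int64)   # explicit 64-bit integers
```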

Pipeline at MAC Unit

Firstly, I would like to thank you for this tool.
I wanted to ask a simple question: is there any way to simulate a MAC unit that contains a pipeline? For example, I want to simulate a systolic array whose MAC units produce an output every 3 clock cycles.
Thank you

Scale-sim-v2 does not give "cycle accurate simulation"

According to the code, the total cycle count is computed like this:

    def get_total_cycles(self):
        assert self.all_layer_run_done, 'Layer runs are not done yet'

        total_cycles = 0
        for layer_obj in self.single_layer_sim_object_list:
            cycles_this_layer = int(layer_obj.get_compute_report_items[0])
            total_cycles += cycles_this_layer

        return total_cycles

Consequently, it does not model the behavior of each individual cycle, right? The simulator only "computes" the number of cycles to be spent; it does not "simulate" each cycle.

Stall cycles reported on "ESTIMATE BANDWIDTH" mode

**anand$** python scalesim/scale.py -t topologies/conv_nets/alexnet_part.csv -c configs/scale.cfg
******************* SCALE SIM **********************
Array Size: 32x32
SRAM IFMAP (kB): 64
SRAM Filter (kB): 64
SRAM OFMAP (kB): 64
Dataflow: Weight Stationary
CSV file path: topologies/conv_nets/alexnet_part.csv
Number of Remote Memory Banks: 1
Working in ESTIMATE BANDWIDTH mode.
Running Layer 0
100%|████████████████████████████████████████████████████████| 112284/112284 [01:37<00:00, 1154.54it/s]
Compute cycles: 138802
**Stall cycles: 26519**
Overall utilization: 74.17%
Mapping efficiency: 94.53%
Average IFMAP DRAM BW: 31.969 words/cycle
Average Filter DRAM BW: 31.984 words/cycle
Average OFMAP DRAM BW: 25.242 words/cycle
Saving traces: Done!
************ SCALE SIM Run Complete ****************

Curious about Why compute cycles do not change when all of IfmapSramSzkB/FilterSramSzkB/OfmapSramSzkB increase

@AnandS09 @jmjos @boukhary123 @ritikraj7
Okay, I am sorry to trouble you! When I use the following memory options,

IfmapSramSzkB = 16
FilterSramSzkB = 16
OfmapSramSzkB = 16
<other options remain default>
IfmapSramSzkB = 32
FilterSramSzkB = 32
OfmapSramSzkB = 32
<other options remain default>
IfmapSramSzkB = 64
FilterSramSzkB = 64
OfmapSramSzkB = 64
<other options remain default>
IfmapSramSzkB = 128
FilterSramSzkB = 128
OfmapSramSzkB = 128
<other options remain default>

I find no change in compute cycles!

In DETAILED_ACCESS_REPORT.csv, I find that the compute cycles can be regarded as the runtime of each layer!

So I guess the problem may be attributable to the scale of the systolic array: with a 32 x 32 array, 16/32/64/128 KiB of SRAM is enough to feed the computation with no stalls. Following this thought, with a larger systolic array the computation might run into stalls.

Then I try a large systolic array (4096 x 4096) and use the following sram schemes:

IfmapSramSzkB = 1
FilterSramSzkB = 1
OfmapSramSzkB = 1
<other options remain default except array>
IfmapSramSzkB = 2
FilterSramSzkB = 2
OfmapSramSzkB = 2
<other options remain default except array>
IfmapSramSzkB = 3
FilterSramSzkB = 3
OfmapSramSzkB = 3
<other options remain default except array>

Unluckily, the compute cycles of all three of these schemes show no difference; they share the same values.

Please spare your precious time, thanks very much!

GEMM OS cycle count, prefetch and demand matrices sizes

Dear all,

Thanks for writing this simulator. I am trying to understand GEMM on systolic arrays with the output stationary dataflow, and I am using scalesim for that. I have a few questions; it would be great if you could answer them:
I don't understand the shape of the prefetch and demand matrices for GEMM output stationary operations.
For example:

For a GEMM multiplication with MNK 2,2,2 with Output Stationary on a 2x2 systolic array:

  1. ifmap_prefetch_mat: (1, 4), filter_prefetch_mat: (1, 4)
  2. ifmap_demand_mat: (4, 2), filter_demand_mat: (1, 4), ofmap_demand_mat: (4, 2)
    Re: 2) if filter demand matrix shape is 1,4 how do we use the second column in the systolic array? 🤔

For the simulation time: on paper and pencil I get 4 cycles; scalesim gives 3 cycles.

For a GEMM multiplication with MNK 4,4,4 with Output Stationary on a 2x2 systolic array:

  1. ifmap_prefetch_mat: (1, 16), filter_prefetch_mat: (1, 16)
  2. ifmap_demand_mat: (10, 4), filter_demand_mat: (1, 16), ofmap_demand_mat: (10, 4)

Re. simulation time, I get 10 cycles on pen and paper; 9 cycles from the simulation.
Questions)

  1. Are the prefetch matrices calculated based on the number of elements to be prefetched from DRAM?
  2. How are the demand matrix dimensions calculated for GEMM?
  3. Compared to scalesim, I always get one extra cycle for GEMM OS in my paper-and-pencil calculations. Please find the slides attached.

Kind Regards,
Kartik,
PhD student, Ghent University
systolic_matmul_4x4.pptx

AssertionError: IFMAP and Filter demands out of sync in systolic_compute_ws.py with GEMM inputs

Scalesim fails to run the default test_mnk_input.csv from the repo using WS method in GEMM mode. The code asserts with the following error log as shown below:

====================================================
Array Size:     256x256
SRAM IFMAP (kB):        6144
SRAM Filter (kB):       6144
SRAM OFMAP (kB):        2048
Dataflow:       Weight Stationary
CSV file path:  topologies/GEMM_mnk/test_mnk_input.csv
Number of Remote Memory Banks:  1
Working in ESTIMATE BANDWIDTH mode.

====================================================
Running Layer 0
(1149, 256)   (894, 256)
Traceback (most recent call last):
  File "/nethome/dkadiyala3/git_repos/scale-sim-v2/scalesim/scale.py", line 39, in <module>
    s.run_scale(top_path=logpath)
  File "/nethome/dkadiyala3/venv_py3.9/lib/python3.9/site-packages/scalesim-2.0.1-py3.9.egg/scalesim/scale_sim.py", line 86, in run_scale
    self.run_once()
  File "/nethome/dkadiyala3/venv_py3.9/lib/python3.9/site-packages/scalesim-2.0.1-py3.9.egg/scalesim/scale_sim.py", line 105, in run_once
    self.runner.run()
  File "/nethome/dkadiyala3/venv_py3.9/lib/python3.9/site-packages/scalesim-2.0.1-py3.9.egg/scalesim/simulator.py", line 79, in run
    single_layer_obj.run()
  File "/nethome/dkadiyala3/venv_py3.9/lib/python3.9/site-packages/scalesim-2.0.1-py3.9.egg/scalesim/single_layer_sim.py", line 126, in run
    ifmap_demand_mat, filter_demand_mat, ofmap_demand_mat = self.compute_system.get_demand_matrices()
  File "/nethome/dkadiyala3/venv_py3.9/lib/python3.9/site-packages/scalesim-2.0.1-py3.9.egg/scalesim/compute/systolic_compute_ws.py", line 361, in get_demand_matrices
  File "/nethome/dkadiyala3/venv_py3.9/lib/python3.9/site-packages/scalesim-2.0.1-py3.9.egg/scalesim/compute/systolic_compute_ws.py", line 174, in create_demand_matrices
**AssertionError: IFMAP and Filter demands out of sync**

========================================
Steps to reproduce this issue.

  1. Install scalesim following the steps mentioned in the README
  2. then run scalesim using the following command:
    python3 scalesim/scale.py -c configs/google.cfg -t topologies/GEMM_mnk/test_mnk_input.csv -i gemm

===================== ROOT CAUSE OF ISSUE ==========================

This happens due to an incorrect generation of the demand matrices in the function create_ifmap_demand_mat() in systolic_compute_ws.py.
The following lines in the code are redundant, along with the previous transformations on this_fold_matrix.

                # Add skew to the IFMAP demand matrix to reflect systolic pipeline fill
                this_fold_demand = skew_matrix(this_fold_demand)

===================== POSSIBLE FIXES =======================
Method 1: Comment out the above-mentioned line, and that should fix the issue.

Let's say Sr (K-dim) = 256, Sc (N-dim) = 64, and T (m-dim) = 128.
Our IFMAP is (T x Sr) =128 x 256 and our Filter is (Sr x Sc) = 256 x 64.
Let's consider our systolic array as 256 x 256.

In this case, let's consider the following matrices

A =  transpose(IFMAP) = (Sr X T) = 256 x 128,
B =  to get the last elem of IFMAP to just reach the beginning of Systolic array = ( Sr, (T + arr_row-1)) = (256, (128+255)) = (256, 383) 
C = for the last elem of IFMAP to traverse the systolic array from beginning to right-most end = (Sr, (T + arr_row-1 + arr_col-1)) = (256, 638)
D  =  to pre-fill the weights into the systolic array = (Sr, (T + arr_row-1 + arr_col-1 + arr_row)) = (256 x 894)

Therefore the final demand matrix is trans(D) = (894, 256) which matches the same for Filter-matrix (894,256).

However, this is not the right way to do it, since we need the skew generation to replicate the actual dataflow into the systolic array. Hence, the method below resolves it properly:

Method-2:
Keep the skew_matrix transformation, since it makes more sense:

A. skew_matrix(IFMAP) -> (T+arr_row-1, Sr) = (128+255,256)
B. Add cycles to traverse the systolic array = concat(A, (arr_col-1, Sr)) = (383+255, 256) = (638, 256)
C. Finally add the cycles to fill the weights in = concat(B, (arr_row, Sr)) = (638+256, 256) = (894, 256)

Fix in the code:

def create_ifmap_demand_mat(self):
        assert self.params_set_flag, 'Parameters are not set'

        inter_fold_gap_prefix = self.arr_row
        inter_fold_gap_prefix_mat = np.ones((inter_fold_gap_prefix, self.arr_row)) * -1

        #inter_fold_gap_suffix = self.arr_row + self.arr_col - 2
        inter_fold_gap_suffix = self.arr_col - 1
        #The last input need self.arr_col - 1 cycles to reduce along the last column.

        inter_fold_gap_suffix_mat = np.ones((inter_fold_gap_suffix, self.arr_row)) * -1

        for fc in range(self.col_fold):
            for fr in range(self.row_fold):
                col_start_id = fr * self.arr_row
                col_end_idx = min(col_start_id + self.arr_row, self.Sr)
                delta = self.arr_row - (col_end_idx - col_start_id)

                # Indexing the cols with row start and row end idx are correct
                # See the comment on ifmap_prefetch generation
                this_fold_demand = self.ifmap_op_mat[:,col_start_id: col_end_idx]
                self.ifmap_reads += this_fold_demand.shape[0] * this_fold_demand.shape[1]
                
                # Take into account under utilization
                if delta > 0:
                    null_req_mat = np.ones((self.T, delta)) * -1
                    this_fold_demand = np.concatenate((this_fold_demand, null_req_mat), axis=1)

                # Add skew to the IFMAP demand matrix to reflect systolic pipeline fill
                this_fold_demand = skew_matrix(this_fold_demand)

                # Account for the cycles for input to traverse systolic array
                this_fold_demand = np.concatenate((this_fold_demand, inter_fold_gap_suffix_mat), axis=0)

                # Account for the cycles for weights to load
                this_fold_demand = np.concatenate((inter_fold_gap_prefix_mat, this_fold_demand), axis=0)

                if fr == 0 and fc == 0:
                    self.ifmap_demand_matrix = this_fold_demand
                else:
                    self.ifmap_demand_matrix = np.concatenate((self.ifmap_demand_matrix, this_fold_demand), axis=0)

This code works and I have tested it on multiple cases (inputs that are bigger than, smaller than, and equal to the size of the systolic array).

I request the repo owners to please review my issue. If the proposed fix is good, then I will formally send a pull request with the fixed code along with the unit testcases. Thank you.

Regards,
Divya Kiran K.
[email protected]

Where is the MMIO and interrupt simulation?

In the paper, the authors claim that "The CPU is the bus master which interacts with the accelerator by writing task descriptors to memory-mapped registers inside the accelerator." And in Figure 1 there is an interrupt interface. However, I cannot find the corresponding code in this codebase. Could you point it out for me?

Accelergy energy model integration with Scale-Sim

Integrate accelergy energy model with Scale-Sim.
The energy model is specified in the config file and is called during the execution of Scale-Sim Code. Once the run is complete the results are generated in a separate file and also shown in the command line.

Killed when run large IFMAP Height, IFMAP Width

Config
[general]
run_name = GoogleTPU_v1_t

[architecture_presets]
ArrayHeight: 256
ArrayWidth: 256
IfmapSramSzkB: 8192
FilterSramSzkB: 8192
OfmapSramSzkB: 8192
IfmapOffset: 0
FilterOffset: 10000000
OfmapOffset: 20000000
Dataflow : ws
Bandwidth : 10
MemoryBanks: 1

[run_presets]
InterfaceBandwidth: CALC

Topology
Layer name, IFMAP Height, IFMAP Width, Filter Height, Filter Width, Channels, Num Filter, Stride height,Stride width,
Conv52, 1080, 1920, 3, 3, 32, 8, 1, 1
