scalesim-project / scale-sim-v2

Repository to host and maintain scale-sim-v2 code

License: MIT License

Shell 1.11% Python 89.54% Makefile 0.40% Verilog 8.96%


scale-sim-v2's Issues

Where is the interface to DRAMsim2?

I only see the statistics given by the end of the simulation, whereas the paper reads "SCALE-SIM allows for modeling the main memory behavior by generating accurate read and write bandwidths at the interface, which can then be fed into a DRAM simulator e.g., DRAM-Sim2 [22]." So where do you dump the trace file to be fed into DRAMsim2? Or how could you invoke DRAMsim2?
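In the meantime, such a hand-off can be scripted from the emitted trace files. The sketch below assumes each trace row has the layout (cycle, addr, addr, ...) with -1 marking empty request slots; the function name and the downstream consumer are hypothetical, not part of the tool:

```python
# Hypothetical helper: turn a SCALE-Sim DRAM trace into a per-cycle request
# list that could be replayed into an external DRAM simulator.
# Assumed row layout: cycle, addr, addr, ... with -1 meaning "no request".
import csv

def trace_to_requests(path):
    requests = []  # list of (cycle, [addresses])
    with open(path) as f:
        for row in csv.reader(f):
            fields = [c.strip() for c in row if c.strip()]
            if not fields:
                continue
            cycle = int(float(fields[0]))
            addrs = [int(float(a)) for a in fields[1:] if float(a) >= 0]
            if addrs:  # skip cycles with no outstanding requests
                requests.append((cycle, addrs))
    return requests
```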

Provide a new method to get a dummy config string with default parameters

It would be useful to have a method in the scale_config class that generates a list with default configurations. This could act as a template for code that passes the configuration as a string, as opposed to reading a dummy config file, when using scalesim as a library.

Hint:
Temporarily bypass the check for a valid conf in the following implementation:

    def get_conf_as_list(self):
        out_list = []

        if not self.valid_conf_flag:
            print("ERROR: scale_config.get_conf_as_list: Configuration is not valid")
            return

       ...
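A minimal sketch of the requested helper, under the assumption that the config is consumed as a flat list of strings. The function name and default values are illustrative; the field names mirror the keys that appear in scale-sim config files but the exact list layout expected by scalesim should be checked against get_conf_as_list():

```python
# Hypothetical sketch of the requested default-config generator.
# Field names follow scale-sim cfg keys; values are illustrative defaults.
def get_default_conf_as_list():
    defaults = {
        'run_name': 'scale_run',
        'ArrayHeight': 32, 'ArrayWidth': 32,
        'IfmapSramSzkB': 64, 'FilterSramSzkB': 64, 'OfmapSramSzkB': 64,
        'IfmapOffset': 0, 'FilterOffset': 10000000, 'OfmapOffset': 20000000,
        'Dataflow': 'os', 'Bandwidth': 10, 'MemoryBanks': 1,
    }
    # Return everything as strings, so callers can pass it in place of
    # the contents of a dummy config file.
    return [str(v) for v in defaults.values()]
```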

Number of SRAM reads for ofmap for input stationary dataflow is unrealistically small

The problem is that the number of SRAM ofmap reads for the input stationary dataflow is unrealistically small. Upon inspecting the code, it seems that the per-iteration read counts are not accumulated across iterations over the output matrix; instead, each iteration overwrites the previous count, so the final number reported is just the read count of the last iteration. Hope this is clear.
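The suspected bug reduces to a classic accumulate-vs-assign mistake; a minimal illustration with hypothetical names:

```python
# Minimal illustration of the suspected bug: per-iteration read counts must
# be accumulated, not assigned. Names are illustrative, not from the repo.
def total_ofmap_reads(reads_per_iteration):
    total = 0
    for reads in reads_per_iteration:
        # Buggy version would do: total = reads  (keeps only the last value)
        total += reads
    return total
```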

Issue with -i gemm argument

Hi,
I experience an issue when I try scale-sim with the argument -i gemm. The tool raises an assertion saying that the size per row in the csv file should be at least 4. However, in the documentation example, the MNK input format seems to have exactly 4 parameters (Layer Name, M, N, K). Am I missing something here?

My csv has the same format as the example in README of scale-sim for GEMM operations.

Thanks in advance,
Kostas.

Add support to take inputs in GEMM format

Support for this statement:

The tool however expects the inputs to be in the convolution format by default. When using the mnk format for input, please specify using the -i gemm switch, as shown in the example below.

$ python3 <scale sim repo root>/scalesim/scale.py -c <path_to_config_file> -t <path_to_mnk_topology_file> -i gemm

Likely typo in `memory_map.py`

I just remembered that a while ago I noticed a very likely typo in memory_map.py.

The following line is probably missing a `:`:

self.ifmap_map_list.append(elems[self.num_banks])

I.e., the line should be

self.ifmap_map_list.append(elems[:self.num_banks])

While I do not understand the code 100%, I am pretty sure that the `:` is missing, because very similar lines that come after it have a `:` in this position.
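A quick demonstration of the difference the missing `:` makes: indexing returns a single element, slicing returns the leading sub-list.

```python
# elems[num_banks] picks one element; elems[:num_banks] takes a prefix.
elems = [0, 1, 2, 3, 4]
num_banks = 3

single = elems[num_banks]    # the element at position 3
prefix = elems[:num_banks]   # the first 3 elements
```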

Eyeriss configuration

Thanks for your kindness and great work.

At cfg file,
IfmapSramSzkB: 6144
FilterSramSzkB: 6144
OfmapSramSzkB: 2048
Does this mean the processor SRAM size is (6144 + 6144 + 2048) kB?

In the Eyeriss cfg, the SRAM sizes for (IA, W, OA) are 108, 108, 108, but Eyeriss has only 108 kB in total.
Moreover, Eyeriss has a register file in each PE. How does the simulator handle that (RF size, SRAM bandwidth)?

Make a dir

Hello!

I'm using scale-sim-v2 on Windows (Anaconda 3, Jupyter Notebook, Python 3.10). The only error I got is in directory creation (simulator.py and single_layer_sim.py): the creation is OS-specific instead of using the portable os.mkdir() function. It would be great if someone could update that for the next version.

Cheers!
N.
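For reference, a portable sketch along the lines the poster suggests, using the standard library's os.makedirs (which works on both Windows and POSIX); the helper name is illustrative:

```python
# Portable output-directory creation: os.makedirs handles nested paths on
# Windows and POSIX alike, and exist_ok=True avoids a crash if the
# directory already exists.
import os

def ensure_outdir(path):
    os.makedirs(path, exist_ok=True)
    return path
```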

Tile size

Hi,
How do I get ofmap tile size (number of outputs generated in one iteration of DRAM)? Can you please point out the exact location in the code, where I can get an estimate about the total number of ofmap tiles as well as the size?

Dram access & bandwidth

Thanks for your great works.
I would like to ask you a question.
I have run many experiments with scale-sim, and the results involving DRAM seem to be wrong.
The total DRAM access count is not a function of the SRAM size in your simulator, which causes it to report far more DRAM accesses than the accelerator would actually require.
Also, setting the simulator to USER mode causes the accelerator to stall even when it has enough DRAM bandwidth.
I am not sure if I am wrong, but I do know how to design a TPU with my own Verilog code.

issue in systolic_compute_ws.py

In line 184, inside the function create_ifmap_demand_mat(), the variable inter_fold_gap_suffix should equal self.arr_col + self.arr_row - 2 instead of self.arr_col - 1. The last input needs self.arr_row - 1 cycles to reach the last column of the PE array and then self.arr_col - 1 cycles to reduce along the last column.
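The poster's cycle arithmetic can be sketched as a tiny helper (names are illustrative, and this only restates the reasoning above, not the repo's actual code):

```python
# Fill + drain accounting as described in the issue: the last input needs
# arr_row - 1 cycles to reach the last column, then arr_col - 1 cycles to
# reduce along it.
def ws_drain_cycles(arr_row, arr_col):
    reach_last_column = arr_row - 1
    reduce_last_column = arr_col - 1
    return reach_last_column + reduce_last_column  # = arr_row + arr_col - 2
```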

ISPASS documentation Figure 9.(b),(c)

I'm trying to reproduce Figure 9(b) and (c) of the ISPASS paper.
In the caption right below the figure, it says 2^14 MACs and 2^16 MACs for Fig. 9(b) and (c).

But on the next page (p. 65 of ISPASS), it says "Figure 9(b-c) ~~~ 4096 and 16384 MAC units respectively", which confuses me a bit (maybe I misinterpreted it). Can you give me advice on this?

Also, about reproducing the TF0 layer, which has 31999 84 1024 (Sr, T, Sc): considering 2^14 MACs (Figure 9.b), R,C is 64x64.
I thought the topology file should be in one of these two formats:
Layer name, IFMAP Height, IFMAP Width, Filter Height, Filter Width, Channels, Num Filter, Strides,
TF0, 64, 84, 64, 84, 250, 8, 1, ...(1)
TF0, 16000, 84, 512, 84, 1, 1, 1, ...(2)
(Maybe these are wrong too, sadly.)

I figured out from the basic tutorial how to run scale-sim, but I am having trouble setting up the configs and topologies. So may I ask you for a topology file and a config file for Figure 9(b, c)? (to help my understanding)

Thank you for Reading.
Hope to get an answer for this.
Have a Good Day.

Enable Batching Support

Enable batching of multiple IFMAP matrices for a single run.

Note:
Details to be updated soon.

Main memory wasted

As the input topology grows larger, the cost of main (DDR) memory on the host grows much larger. Variables like 'self.ifmap_prefetch_matrix' waste too much memory because they are never released. I hope this can be solved in the future. Thank you!
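A minimal pattern for the kind of release the poster asks for. The helper and the second attribute name are hypothetical; only 'ifmap_prefetch_matrix' is named in the issue:

```python
# Drop references to large per-layer matrices once a layer's run is done,
# so Python's garbage collector can reclaim them. Attribute names are
# illustrative (only ifmap_prefetch_matrix is quoted in the issue).
import gc

def release_layer_buffers(layer_obj):
    for attr in ('ifmap_prefetch_matrix', 'filter_prefetch_matrix'):
        if hasattr(layer_obj, attr):
            setattr(layer_obj, attr, None)
    gc.collect()
```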

How to set the batch size in a training progress

Hi, I want to use the simulator to simulate the training process of a CNN for research purposes, but I'm confused about the batch size setting. Does it default to 1 or to some other constant value? Furthermore, I'm wondering what the simulation cycle count of each layer in the output means. Does it mean that training one sample, once through the CNN, to update the weights costs that many cycles?
Looking forward to anyone's reply to help me use this simulator better.
Thanks very much!

variables mean

I want to know what these variables mean. Can you tell me?
They are in scale-sim-v2-main/scalesim/memory/read_buffer_estimate_bw.py:
# Tracking variables
self.num_items_per_set = -1
self.elems_current_set = 0
self.current_set_id = 0
self.read_buffer_set_start_id = -1
self.read_buffer_set_end_id = -1
self.prefetch_buffer_set_start_id = -1
self.prefetch_buffer_set_end_id = -1
self.last_prefetch_start_cycle = -2
self.last_prefetch_end_cycle = -1
self.first_request_rcvd_cycle = 0

Memory leak in read_buffer_estimate_bw method

A significant memory leak is observed in the read_buffer_estimate_bw.py file whenever the tool is run in estimate bandwidth mode. Sets are used to manage prefetches and do fast lookups. The fast lookups come at a cost of higher memory usage. Only the sets which contain elements in the active read buffer portion of the SRAM need to be stored to perform lookup operations. The remaining sets become redundant and need to be removed.
The attached graph shows the difference created by solving this memory leak. As large as 96% of memory can be saved by solving this issue.
(Screenshot from 2021-12-11 attached.)
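A hedged sketch of the proposed pruning, assuming the prefetch sets are keyed by a monotonically increasing set id; all names here are illustrative, not the repo's actual structures:

```python
# Keep only the sets that overlap the active read-buffer window; delete the
# rest so lookup memory stays bounded. sets_by_id maps set id -> set.
def prune_stale_sets(sets_by_id, active_start_id, active_end_id):
    stale = [sid for sid in sets_by_id
             if sid < active_start_id or sid > active_end_id]
    for sid in stale:
        del sets_by_id[sid]
    return sets_by_id
```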

Can I get the tutorial 3 script?

In the ASPLOS tutorial, there is a Tutorial 3 section, but I cannot find the Tutorial 3 script in this repository.

There are scripts for Tutorials 1 and 2, but not 3. So, can we get it?

Basic question about memory access patterns

Thank you for creating and releasing this tool. It helped me understand systolic array functionality to a large extent.

I have a basic question: after the simulation ends, the tool produces DRAM and SRAM read and write trace files. In each of these files, each line lists the memory addresses where data are read from/written to, for each computation cycle of the systolic array. These addresses are not consecutive in memory. This may be a basic question, but I wanted to understand how data are accessed from multiple distinct memory locations in the same processor clock cycle or how is this implemented in real hardware.

Additionally, the ISPASS paper associated with this work presents a number of energy consumption graphs based on Scale-Sim simulations. It would be great if you could shed some light on how these energy results were obtained.

Hope you can clarify my doubts. Thank you!

asking about order or unit

Thanks for your great work.

My question is about units and ordering in your code.

  1. In the cfg file, does Bandwidth mean the max DRAM bandwidth in GB/s? What is InterfaceBandwidth: CALC?
  2. In BANDWIDTH_REPORT.csv, what does Avg Bandwidth mean? GB/s? In 1), a bandwidth was set, but the Avg DRAM BW exceeds 10.
  3. Do the DRAM and SRAM bandwidths affect Overall Utilization? What is its exact definition? Does it differ from Compute Utilization? They show the same result.
  4. How are depthwise (DW) layers mapped onto the TPU and scale model? The mapping efficiency is much lower than I expect (less than 3% in COMPUTE_REPORT.csv).
  5. In DETAILED_ACCESS_REPORT.csv, does DRAM FilterReads mean the number of DRAM accesses for weights, or bytes?

Your work is very helpful. Thanks again.

Error While Running Scalesim

I am getting a TypeError while running Scalesim

Traceback (most recent call last):
File "scalesim/scale.py", line 37, in
input_type_gemm=gemm_input
TypeError: __init__() got an unexpected keyword argument 'input_type_gemm'

Here is the command that I used
python3 scalesim/scale.py -c configs/scale.cfg -t topologies/conv_nets/yolo_tiny.csv -p OUTPUT

Make release 2.0.2

  1. Tag in Github
  2. Release in Github
  3. Release notes to changelog.md
  4. Push to PyPi with new version tag
  5. Push to ARM

Experimenting with a small example: weird pauses and indices

After some experimenting with scale-sim-v2 (and SCALE-SIM v1), I noticed a few things in the results of scale-sim-v2 that I just could not quite make sense of:

The example I based my experiments on was:

  • config
    [general]
    run_name=conf_4x4_os
    
    [architecture_presets]
    ArrayHeight:    4
    ArrayWidth:     4
    IfmapSramSzkB:  128
    FilterSramSzkB: 128
    OfmapSramSzkB:  128
    IfmapOffset:    0
    FilterOffset:   10000000
    OfmapOffset:    20000000
    Bandwidth:      10
    Dataflow:       os
    MemoryBanks:    1
    
    [run_presets]
    InterfaceBandwidth: CALC
  • topology
    Layer name IFMAP Height IFMAP Width Filter Height Filter Width Channels Num Filter Strides
    Conv1 4 4 3 3 1 5 1

The output is the following (empty cells originally contained "-1" entries):

  • IFMAP_SRAM_TRACE.csv + FILTER_SRAM_TRACE.csv

    1 0       10000000      
    2 1 1     10000001 10000009    
    3 2 2 4   10000002 10000010 10000018  
    4 4 3 5 5 10000003 10000011 10000019 10000027
    5 5 5 6 6 10000004 10000012 10000020 10000028
    6 6 6 8 7 10000005 10000013 10000021 10000029
    7 8 7 9 9 10000006 10000014 10000022 10000030
    8 9 9 10 10 10000007 10000015 10000023 10000031
    9 10 10 12 11 10000008 10000016 10000024 10000032
    10   11 13 13   10000017 10000025 10000033
    11     14 14     10000026 10000034
    12       15       10000035
    13                
    14                
    15                
    16 0       10000036      
    17 1 1     10000037      
    18 2 2 4   10000038      
    19 4 3 5 5 10000039      
    20 5 5 6 6 10000040      
    21 6 6 8 7 10000041      
    22 8 7 9 9 10000042      
    23 9 9 10 10 10000043      
    24 10 10 12 11 10000044      
    25   11 13 13        
    26     14 14        
    27       15        
    28                
    29                
    30                
  • OFMAP_SRAM_TRACE.csv

    0        
    1        
    2        
    3        
    4        
    5        
    6        
    7        
    8 20000015      
    9 20000010 20000016    
    10 20000005 20000011 20000017  
    11 20000000 20000006 20000012 20000018
    12   20000001 20000007 20000013
    13     20000002 20000008
    14       20000003
    15        
    16        
    17        
    18        
    19        
    20        
    21        
    22        
    23 20000019      
    24 20000014      
    25 20000009      
    26 20000004      
    27        
    28        
    29        

1. Why are there pauses between executing filter 1 through 4 and executing filter 5?

SCALE-SIM v1 was able to execute them directly after each other.

The output of V1 looked like:

0 0       10000000      
1 1 1     10000001 10000009    
2 2 2 4   10000002 10000010 10000018  
3 4 3 5 5 10000003 10000011 10000019 10000027
4 5 5 6 6 10000004 10000012 10000020 10000028
5 6 6 8 7 10000005 10000013 10000021 10000029
6 8 7 9 9 10000006 10000014 10000022 10000030
7 9 9 10 10 10000007 10000015 10000023 10000031
8 10 10 12 11 10000008 10000016 10000024 10000032
9 0 11 13 13 10000036 10000017 10000025 10000033
10 1 1 14 14 10000037   10000026 10000034
11 2 2 4 15 10000038     10000035
12 4 3 5 5 10000039      
13 5 5 6 6 10000040      
14 6 6 8 7 10000041      
15 8 7 9 9 10000042      
16 9 9 10 10 10000043      
17 10 10 12 11 10000044      
18   11 13 13        
19     14 14        
20       15        

(The OFMAP trace of V1 looked kind of weird and different from V2, though.)

In general, the total cycle numbers of V2 currently differ a lot from V1.

2. The cycles for SRAM read and write in the output do not line up.

They are off by about 2 cycles.
Why do the read indices begin at 1 and the write indices at 0?


I had some additional weird findings when running bigger examples and restricting the RAM bandwidth.
For example, the DRAM read amount increased when increasing the SRAM size. When looking at the traces, there were a lot of "1" values that probably should have been "-1".
I could post the example if you would be interested in those results as well.

Errors due to usage of deprecated 'np.int'

The alias 'np.int' for the builtin 'int' was deprecated in NumPy 1.20.0 and has since been removed. This leads to errors when running the example commands from the README.md after a fresh install.
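A minimal illustration of the fix: since np.int was only an alias for the builtin int (removed in NumPy 1.24), replacing it with int, or with an explicit width such as np.int64, works on current NumPy:

```python
# dtype=int replaces the removed np.int alias; np.int64 is the explicit
# fixed-width alternative if a specific width was intended.
import numpy as np

arr = np.zeros(4, dtype=int)          # instead of dtype=np.int
arr64 = np.zeros(4, dtype=np.int64)   # explicit 64-bit integers
```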

Pipeline at MAC Unit

Firstly, I would like to thank you for this tool.
I wanted to ask a simple question: is there any way to simulate a MAC unit that contains a pipeline? For example, I want to simulate a systolic array whose MAC units produce an output every 3 clock cycles.
Thank you

Scale-sim-v2 does not give "cycle accurate simulation"

According to the code, the total cycle count is computed like this:

    def get_total_cycles(self):
        assert self.all_layer_run_done, 'Layer runs are not done yet'

        total_cycles = 0
        for layer_obj in self.single_layer_sim_object_list:
            cycles_this_layer = int(layer_obj.get_compute_report_items[0])
            total_cycles += cycles_this_layer

        return total_cycles

Consequently, it does not model the behavior of each individual cycle, right? The simulator only "computes" the number of cycles to be spent; it does not "simulate" each cycle.

Stall cycles reported on "ESTIMATE BANDWIDTH" mode

**anand$** python scalesim/scale.py -t topologies/conv_nets/alexnet_part.csv -c configs/scale.cfg
******************* SCALE SIM **********************
Array Size: 32x32
SRAM IFMAP (kB): 64
SRAM Filter (kB): 64
SRAM OFMAP (kB): 64
Dataflow: Weight Stationary
CSV file path: topologies/conv_nets/alexnet_part.csv
Number of Remote Memory Banks: 1
Working in ESTIMATE BANDWIDTH mode.
Running Layer 0
100%|████████████████████████████████████████████████████████| 112284/112284 [01:37<00:00, 1154.54it/s]
Compute cycles: 138802
**Stall cycles: 26519**
Overall utilization: 74.17%
Mapping efficiency: 94.53%
Average IFMAP DRAM BW: 31.969 words/cycle
Average Filter DRAM BW: 31.984 words/cycle
Average OFMAP DRAM BW: 25.242 words/cycle
Saving traces: Done!
************ SCALE SIM Run Complete ****************

Curious about Why compute cycles do not change when all of IfmapSramSzkB/FilterSramSzkB/OfmapSramSzkB increase

@AnandS09 @jmjos @boukhary123 @ritikraj7
Okay, I am sorry to trouble you! When I use the following memory options,

IfmapSramSzkB = 16
FilterSramSzkB = 16
OfmapSramSzkB = 16
<other options remain default>
IfmapSramSzkB = 32
FilterSramSzkB = 32
OfmapSramSzkB = 32
<other options remain default>
IfmapSramSzkB = 64
FilterSramSzkB = 64
OfmapSramSzkB = 64
<other options remain default>
IfmapSramSzkB = 128
FilterSramSzkB = 128
OfmapSramSzkB = 128
<other options remain default>

I find no change in compute cycles!

In DETAILED_ACCESS_REPORT.csv, I find that the compute cycles can be regarded as the runtime of each layer!

So I guess the problem may be attributable to the scale of the systolic array: with a 32 x 32 array, 16/32/64/128 KiB of SRAM is enough to feed the computation with no stalls. Following this thought, with a larger systolic array the computation might run into stalls.

Then I try a large systolic array (4096 x 4096) and use the following sram schemes:

IfmapSramSzkB = 1
FilterSramSzkB = 1
OfmapSramSzkB = 1
<other options remain default except array>
IfmapSramSzkB = 2
FilterSramSzkB = 2
OfmapSramSzkB = 2
<other options remain default except array>
IfmapSramSzkB = 3
FilterSramSzkB = 3
OfmapSramSzkB = 3
<other options remain default except array>

Unluckily, the compute cycles of all three of these schemes show no difference; they share the same values.

Please spare your precious time, thanks very much!

GEMM OS cycle count, prefetch and demand matrices sizes

Dear all,

Thanks for writing this simulator. I am trying to understand GEMM on systolic arrays with the output stationary dataflow, and I am using scalesim for that. I have a few questions; it would be great if you could answer them:
I don't understand the shape of the prefetch and demand matrices for GEMM output stationary operations.
For example:

For a GEMM multiplication with MNK 2,2,2 with Output Stationary on a 2x2 systolic array:

  1. ifmap_prefetch_mat: (1, 4), filter_prefetch_mat: (1, 4)
  2. ifmap_demand_mat: (4, 2), filter_demand_mat: (1, 4), ofmap_demand_mat: (4, 2)
    Re: 2) if filter demand matrix shape is 1,4 how do we use the second column in the systolic array? 🤔

For the simulation time: on paper and pencil I get 4 cycles; scalesim gives 3 cycles.

For a GEMM multiplication with MNK 4,4,4 with Output Stationary on a 2x2 systolic array:

  1. ifmap_prefetch_mat: (1, 16), filter_prefetch_mat: (1, 16)
  2. ifmap_demand_mat: (10, 4), filter_demand_mat: (1, 16), ofmap_demand_mat: (10, 4)

Re. simulation time, I get 10 cycles on pen and paper; 9 cycles from the simulation.
Questions)

  1. Are the prefetch matrices calculated based on the number of elements to be prefetched from DRAM?
  2. How are the demand matrix dimensions calculated for GEMM?
  3. Compared to scalesim, I always get one extra cycle for GEMM OS in my paper-and-pencil calculations. Please find the slides attached.

Kind Regards,
Kartik,
PhD student, Ghent University
systolic_matmul_4x4.pptx

AssertionError: IFMAP and Filter demands out of sync in systolic_compute_ws.py with GEMM inputs

Scalesim fails to run the default test_mnk_input.csv from the repo using WS method in GEMM mode. The code asserts with the following error log as shown below:

====================================================
Array Size:     256x256
SRAM IFMAP (kB):        6144
SRAM Filter (kB):       6144
SRAM OFMAP (kB):        2048
Dataflow:       Weight Stationary
CSV file path:  topologies/GEMM_mnk/test_mnk_input.csv
Number of Remote Memory Banks:  1
Working in ESTIMATE BANDWIDTH mode.

====================================================
Running Layer 0
(1149, 256)   (894, 256)
Traceback (most recent call last):
  File "/nethome/dkadiyala3/git_repos/scale-sim-v2/scalesim/scale.py", line 39, in <module>
    s.run_scale(top_path=logpath)
  File "/nethome/dkadiyala3/venv_py3.9/lib/python3.9/site-packages/scalesim-2.0.1-py3.9.egg/scalesim/scale_sim.py", line 86, in run_scale
    self.run_once()
  File "/nethome/dkadiyala3/venv_py3.9/lib/python3.9/site-packages/scalesim-2.0.1-py3.9.egg/scalesim/scale_sim.py", line 105, in run_once
    self.runner.run()
  File "/nethome/dkadiyala3/venv_py3.9/lib/python3.9/site-packages/scalesim-2.0.1-py3.9.egg/scalesim/simulator.py", line 79, in run
    single_layer_obj.run()
  File "/nethome/dkadiyala3/venv_py3.9/lib/python3.9/site-packages/scalesim-2.0.1-py3.9.egg/scalesim/single_layer_sim.py", line 126, in run
    ifmap_demand_mat, filter_demand_mat, ofmap_demand_mat = self.compute_system.get_demand_matrices()
  File "/nethome/dkadiyala3/venv_py3.9/lib/python3.9/site-packages/scalesim-2.0.1-py3.9.egg/scalesim/compute/systolic_compute_ws.py", line 361, in get_demand_matrices
  File "/nethome/dkadiyala3/venv_py3.9/lib/python3.9/site-packages/scalesim-2.0.1-py3.9.egg/scalesim/compute/systolic_compute_ws.py", line 174, in create_demand_matrices
**AssertionError: IFMAP and Filter demands out of sync**

========================================
Steps to reproduce this issue.

  1. Install scalesim following the steps mentioned in the README
  2. then run scalesim using the following command:
    python3 scalesim/scale.py -c configs/google.cfg -t topologies/GEMM_mnk/test_mnk_input.csv -i gemm

===================== ROOT CAUSE OF ISSUE ==========================

This happens due to an incorrect generation of the demand matrices in the function create_ifmap_demand_mat() in systolic_compute_ws.py.
The following lines in the code are redundant, along with the previous transformations on this_fold_matrix.

                # Add skew to the IFMAP demand matrix to reflect systolic pipeline fill
                this_fold_demand = skew_matrix(this_fold_demand)

===================== POSSIBLE FIXES =======================
Method 1: Comment out the above-mentioned line, and that should fix the issue.

Let's say Sr (K-dim) = 256, Sc (N-dim) = 64, and T (m-dim) = 128.
Our IFMAP is (T x Sr) =128 x 256 and our Filter is (Sr x Sc) = 256 x 64.
Let's consider our systolic array as 256 x 256.

In this case, let's consider the following matrices

A =  transpose(IFMAP) = (Sr X T) = 256 x 128,
B =  to get the last elem of IFMAP to just reach the beginning of Systolic array = ( Sr, (T + arr_row-1)) = (256, (128+255)) = (256, 383) 
C = for the last elem of IFMAP to traverse the systolic array from beginning to right-most end = (Sr, (T + arr_row-1 + arr_col-1)) = (256, 638)
D  =  to pre-fill the weights into the systolic array = (Sr, (T + arr_row-1 + arr_col-1 + arr_row)) = (256 x 894)

Therefore the final demand matrix is trans(D) = (894, 256) which matches the same for Filter-matrix (894,256).

However, this is not the right way to do it, since we need the skew generation to replicate the actual dataflow into the systolic array. Hence, the method below resolves it properly:

Method-2:
Keep the skew_matrix transformation, since it makes more sense:

A. skew_matrix(IFMAP) -> (T+arr_row-1, Sr) = (128+255,256)
B. Add cycles to traverse the systolic array = concat(A, (arr_col-1, Sr)) = (383+255, 256) = (638, 256)
C. Finally add the cycles to fill the weights in = concat(B, (arr_row, Sr)) = (638+256, 256) = (894, 256)

Fix in the code:

def create_ifmap_demand_mat(self):
        assert self.params_set_flag, 'Parameters are not set'

        inter_fold_gap_prefix = self.arr_row
        inter_fold_gap_prefix_mat = np.ones((inter_fold_gap_prefix, self.arr_row)) * -1

        #inter_fold_gap_suffix = self.arr_row + self.arr_col - 2
        inter_fold_gap_suffix = self.arr_col - 1
        #The last input need self.arr_col - 1 cycles to reduce along the last column.

        inter_fold_gap_suffix_mat = np.ones((inter_fold_gap_suffix, self.arr_row)) * -1

        for fc in range(self.col_fold):
            for fr in range(self.row_fold):
                col_start_id = fr * self.arr_row
                col_end_idx = min(col_start_id + self.arr_row, self.Sr)
                delta = self.arr_row - (col_end_idx - col_start_id)

                # Indexing the cols with row start and row end idx are correct
                # See the comment on ifmap_prefetch generation
                this_fold_demand = self.ifmap_op_mat[:,col_start_id: col_end_idx]
                self.ifmap_reads += this_fold_demand.shape[0] * this_fold_demand.shape[1]
                
                # Take into account under utilization
                if delta > 0:
                    null_req_mat = np.ones((self.T, delta)) * -1
                    this_fold_demand = np.concatenate((this_fold_demand, null_req_mat), axis=1)

                # Add skew to the IFMAP demand matrix to reflect systolic pipeline fill
                this_fold_demand = skew_matrix(this_fold_demand)

                # Account for the cycles for input to traverse systolic array
                this_fold_demand = np.concatenate((this_fold_demand, inter_fold_gap_suffix_mat), axis=0)

                # Account for the cycles for weights to load
                this_fold_demand = np.concatenate((inter_fold_gap_prefix_mat, this_fold_demand), axis=0)

                if fr == 0 and fc == 0:
                    self.ifmap_demand_matrix = this_fold_demand
                else:
                    self.ifmap_demand_matrix = np.concatenate((self.ifmap_demand_matrix, this_fold_demand), axis=0)

This code works and I have tested it on multiple cases (inputs that are bigger than, smaller than, and equal to the size of the systolic array).

I request the repo owners to please review my issue. If the proposed fix is good, then I will formally send a pull request with the fixed code along with the unit testcases. Thank you.

Regards,
Divya Kiran K.
[email protected]

Where is the MMIO and interrupt simulation?

In the paper, the authors claim that "The CPU is the bus master which interacts with the accelerator by writing task descriptors to memory-mapped registers inside the accelerator." And in Figure 1 there is an interrupt interface. However, I cannot find the corresponding code in this codebase. Could you point it out for me?

Accelergy energy model integration with Scale-Sim

Integrate accelergy energy model with Scale-Sim.
The energy model is specified in the config file and is called during the execution of Scale-Sim Code. Once the run is complete the results are generated in a separate file and also shown in the command line.

Killed when run large IFMAP Height, IFMAP Width

Config
[general]
run_name = GoogleTPU_v1_t

[architecture_presets]
ArrayHeight: 256
ArrayWidth: 256
IfmapSramSzkB: 8192
FilterSramSzkB: 8192
OfmapSramSzkB: 8192
IfmapOffset: 0
FilterOffset: 10000000
OfmapOffset: 20000000
Dataflow : ws
Bandwidth : 10
MemoryBanks: 1

[run_presets]
InterfaceBandwidth: CALC

Topology
Layer name, IFMAP Height, IFMAP Width, Filter Height, Filter Width, Channels, Num Filter, Stride height,Stride width,
Conv52, 1080, 1920, 3, 3, 32, 8, 1, 1
