openpower-cores / a2i

License: Other

Tcl 0.39% Verilog 0.01% VHDL 99.61% Python 0.01%

a2i's Introduction

This repo has been archived and relocated.

The new home is: https://git.openpower.foundation/cores/a2i

It is mirrored at: https://github.com/OpenPOWERFoundation/a2i

A2I

NOTICE

The license has been modified (see the LICENSE file for details), and the repository is moving soon to an OpenPower Foundation location.

The new repo will be accessible through both Github and Gitlab.

After the move is completed, this readme will be updated, and the repo will be changed to 'archived' state.

The Project

Release of the A2I POWER processor core RTL and associated FPGA implementation (used ADM-PCIE-9V3 FPGA)

See Project Info for details.

The Core

The A2I core was created as a high-frequency four-threaded design, optimized for throughput and targeted for 3+ GHz in 45nm technology.

It is a 27 FO4 implementation, with an in-order pipeline supporting 1-4 threads. It fully supports Power ISA 2.06 using Book III-E. The core was also designed to support pluggable implementations of MMU and AXU logic macros. This includes elimination of the MMU and using ERAT-only mode for translation/protection.

The History

The A2I platform was developed following IBM's game core designs. It was designed to balance performance and power and provide high streaming throughput. It supported chip, sim, and FPGA implementations through the use of a configurable latch/array library.

A2I was developed as the "wire-speed processor" for a high-throughput edge-of-network (PowerEN) SoC design. This chip included four L2's with four A2I per L2, connected through an interconnect called PBus. The units outside the core included multiple accelerators attached to the PBus. External interfaces included DDR3, PCI Gen2, and Ethernet. The chip was built and performed at ~2.3GHz (the core was throttled for power savings), but was not released.

The A2I core was then selected as the general purpose processor for BlueGene/Q, the successor to BlueGene/L and BlueGene/P supercomputers. In this design, eighteen A2I cores were included on one chip, along with cache and memory controllers, and internal networking components. The design ran at 1.6 GHz, to meet power/performance goals, and included a special-purpose AXU (high-bandwidth FPU). Multiple BlueGene/Q installations have been ranked in the top 10 of the TOP500 list for many years (#1,#3,#7,#8 in 2012), and three are still ranked in the TOP500 as of June 2020.

The Future

There may be uses for this core where a full feature-set is needed, and its limitations can be overcome by the intended environment. Specifically, single-thread performance is limited by the in-order implementation, requiring a well-behaved application set to enable efficient use of the pipeline to cover pipeline dependencies, branch misprediction, etc.

The design of the A2L2 interface (core-to-L2/nest) is straightforward, and offers multiple configurable options for data interfacing. There is also some configurability for handling certain Power-specific features (core vs. L2).

The ability to add an AXU that is tightly-coupled to the core enables many possibilities for special-purpose designs, like an open distributed Web 3.0 hardware/software system integrating streaming encryption, blockchain, semantic query, etc.

Technology Scaling

A comparison of the design in original technology and scaled to 7nm (fixed-point, no MMU):

Technology  Freq      Pwr     Freq Sort  Pwr Sort  Area      Vdd
45nm        2.30 GHz  0.88 W  -          -         2.90 mm²  0.97 V
7nm         3.90 GHz  0.44 W  4.17 GHz   0.47 W    0.17 mm²  1.1 V
7nm         3.75 GHz  0.35 W  4.03 GHz   0.37 W    0.17 mm²  1.0 V
7nm         3.55 GHz  0.27 W  3.87 GHz   0.29 W    0.17 mm²  0.9 V
7nm         3.07 GHz  0.18 W  3.60 GHz   0.21 W    0.17 mm²  0.8 V
7nm         2.40 GHz  0.08 W  3.00 GHz   0.14 W    0.17 mm²  0.7 V

These estimates are based on a semicustom design in representative foundry processes (IBM 45nm/Samsung 7nm).
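
As a rough way to read the table, the Freq and Pwr columns can be combined into an average energy per clock cycle. The sketch below only rearranges figures already listed above; the helper name `energy_per_cycle_nj` is made up for illustration, and the result says nothing about workload or leakage split.

```python
# (label, frequency in GHz, power in W), copied from the Freq and Pwr columns above.
POINTS = [
    ("45nm @ 0.97 V", 2.30, 0.88),
    ("7nm  @ 1.1 V", 3.90, 0.44),
    ("7nm  @ 1.0 V", 3.75, 0.35),
    ("7nm  @ 0.9 V", 3.55, 0.27),
    ("7nm  @ 0.8 V", 3.07, 0.18),
    ("7nm  @ 0.7 V", 2.40, 0.08),
]


def energy_per_cycle_nj(power_w, freq_ghz):
    """Average energy per clock cycle in nanojoules (W / GHz = nJ)."""
    return power_w / freq_ghz


if __name__ == "__main__":
    for label, freq, power in POINTS:
        print(f"{label}: {energy_per_cycle_nj(power, freq):.3f} nJ/cycle")
```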

Compliancy

The A2I core is compliant with Power ISA 2.06 and will need updates to be compliant with either version 3.0c or 3.1. Power ISA 3.0c and 3.1 are the two Power ISA versions contributed to OpenPOWER Foundation by IBM. Changes will include:

radix translation
op updates, to eliminate noncompliant ones and add missing ones required for a given compliancy level
various 'mode' and other changes to meet the open specification targeted compliancy level

Miscellaneous

PVR = Ver 48/Rev 2

a2i's Issues

The 'pgsize_eq' of the TLB entries with xbit

According to the BGQ article, if a 1MB page has a 64KB hole, what is the 'pgsize_eq' of the remaining 960KB TLB entry, and what is the page size of the 64KB hole?

I think the 960KB entry has a 'pgsize_eq' of 1MB and the 64KB hole has a 'pgsize_eq' of 64KB.

Is that right?

simulation environment

Hi,
I tried compiling the A2 code with irun, vcs-mx, and modelsim, but there are many compile errors, such as:
(1) macro definitions.
(2) library calls.
Could you provide the simulation environment and a simple testbench demo? Also, which simulation tools and which versions did you use? Thanks.

cache coherency

Hi Guys,

Without changing the code of the A2 core, is it possible to implement cache coherency with a bus-snooping protocol? Or is a directory-based protocol the only choice?

Many thanks.

The speculative execution of instructions relying on previous loads

The A2I manual says that A2I supports speculatively issuing instructions that depend on a load, on the assumption that the load will hit in the data cache. I remember this technique was also used in the Alpha 21264, cool!

My question is: if I want to learn more about this technique in A2I, which files and which key signals should I pay attention to? In iuq_fxu_dep.vhdl there is some code related to a "shadow pipeline". Is it related to this speculative technique?

Error happens when two threads co-execute together

Hi Guys,

I have two threads executing together. One thread, T0, executes mfspr to read values from a slowspr into a GPR, and the other thread, T1, executes normal ALU instructions. I found that this can cause errors, because the slowspr can return its value late while the mfspr and ALU datapaths to the GPR are the same. In other words, T0 and T1 can write the GPR at the same time through the same data path and control path. Any comments about this?

Many thanks

build and run videos are no longer available

Video files linked in /rel/doc/a2_build_video.md and /rel/doc/a2_run_video.md are no longer available.

About the "ex3_l_s_q_val" signal in "xuq_lsu_l2cmdq" module

Hi guys,

It seems to me that the input signal "ex3_l_s_q_val" of the "xuq_lsu_l2cmdq" module can be used to indicate a load miss happens. In this case, "xuq_lsu_l2cmdq" can use this signal to decide whether to insert the current load in the LMQ or not.
However, after tracing the source of signal "ex3_l_s_q_val" in xuq_lsu_dc_cntrl.vhdl, I could not find its relation with load miss. Below is the source code in xuq_lsu_dc_cntrl.vhdl.

ex3_l2_op_d <= (l2_ctype or is_mem_bar_op or ex2_msgsnd_instr_q or ex2_mtspr_trace_q or ex2_dci_instr_q or ex2_ici_instr_q) and not ex2_stg_flush;

l_s_q_val <= ex3_l2_op_q;

Did I miss anything? If the signal "ex3_l_s_q_val" has no relation with load miss, how can xuq_lsu_l2cmdq decide to insert the current load into the LMQ or not?

Thanks for your help in advance! Let's enjoy the beauty of A2I!

Cheers,
Xia

Questions regarding testing A2I with Coremark on FPGA without OS

Hi @openpowerwtf, we are currently testing A2I with CoreMark (https://github.com/eembc/coremark) on an FPGA without an OS or L2 cache (in our case, the core jumps to the CoreMark program address space and starts running it as soon as it finishes running boot.s). The highest score we have achieved is 1.33 CoreMark/MHz with the compiler optimization option set to -Ofast. We have not been able to make our hardware platform work with an OS, so I'm wondering whether running CoreMark under an OS would improve the final score. Also, how much does the lack of an L2 cache hurt the CoreMark result? We couldn't find a Linux kernel that is compatible with the a2i core, and porting and trimming a kernel from scratch would be very time-consuming. Is there a good reference kernel that you can recommend? Thanks!

The RAW/WAW dependency check in iuq_fxu_dep.vhdl

Hi,
I am reading the code related to the RAW/WAW dependency detection in the iuq_fxu_dep.vhdl file.

I am curious why we only need to check the dependency between the current instruction and the instructions in the IS2, RF0, RF1, EX1, and EX2 pipeline stages. The code is listed below.

raw_s1_cmp: entity work.iuq_fxu_dep_cmp(iuq_fxu_dep_cmp) 
port map (
     is1_v      => fdec_fdep_is1_s1_vld,

     is2_v      => sp_L2(IS2).ta_vld,
     rf0_v      => sp_L2(RF0).ta_vld,
     rf1_v      => sp_L2(RF1).ta_vld,
     ex1_v      => sp_L2(EX1).ta_vld,
     ex2_v      => sp_L2(EX2).ta_vld,
     lm0_v      => sp_L2_LM(0).ta_vld,
     lm1_v      => sp_L2_LM(1).ta_vld,
     lm2_v      => sp_L2_LM(2).ta_vld,
     lm3_v      => sp_L2_LM(3).ta_vld,
     lm4_v      => sp_L2_LM(4).ta_vld,
     lm5_v      => sp_L2_LM(5).ta_vld,
     lm6_v      => sp_L2_LM(6).ta_vld,
     lm7_v      => sp_L2_LM(7).ta_vld,

     is1_ad     => fdec_fdep_is1_s1,

     is2_ad     => sp_L2(IS2).ta,
     rf0_ad     => sp_L2(RF0).ta,
     rf1_ad     => sp_L2(RF1).ta,
     ex1_ad     => sp_L2(EX1).ta,
     ex2_ad     => sp_L2(EX2).ta,
     lm0_ad     => sp_L2_LM(0).ta,
     lm1_ad     => sp_L2_LM(1).ta,
     lm2_ad     => sp_L2_LM(2).ta,
     lm3_ad     => sp_L2_LM(3).ta,
     lm4_ad     => sp_L2_LM(4).ta,
     lm5_ad     => sp_L2_LM(5).ta,
     lm6_ad     => sp_L2_LM(6).ta,
     lm7_ad     => sp_L2_LM(7).ta,

     ad_hit_b   => RAW_s1_hit_b
);

In other words, why do we not need to pay attention to the instructions in EX3, EX4, and the later pipeline stages? I guess bypassing the data might be the answer, but I am not sure.

BTW, I think IS1 represents the IU stages before instruction decode and IS2 represents the IU stages after instruction decode, right?

Many thanks
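
The port map above can be read as a compare of one source register against every valid in-flight target. The following is a minimal behavioral sketch of that reading, not the contents of iuq_fxu_dep_cmp (which are not shown here): the `Slot` class and `raw_hit` function are hypothetical names, and the result is returned active-high, whereas the VHDL output `ad_hit_b` appears to be active-low.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    """One tracked instruction: target-valid flag and target register address."""
    ta_vld: bool
    ta: int


def raw_hit(src_vld, src_ad, slots):
    """True if a valid source matches any valid target among the tracked slots.

    `slots` stands in for IS2, RF0, RF1, EX1, EX2 and the eight LM entries
    wired up in the port map (assumed meaning, based only on the names).
    """
    return src_vld and any(s.ta_vld and s.ta == src_ad for s in slots)


if __name__ == "__main__":
    in_flight = [Slot(True, 5), Slot(False, 7), Slot(True, 9)]
    print(raw_hit(True, 9, in_flight))   # source r9 matches a valid target
    print(raw_hit(True, 7, in_flight))   # r7 target is not valid, so no hit
```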

How to quickly find the key information in the combinational logic

Hi Guys,

In A2I, there is a lot of combinational logic. For example, in the xuq_add_loc.vhdl file, there are many "and", "or", and "not" gates, as shown below.

u_g01t  : g01_t  (0 to 7) <= not g01_b(0 to 7) ; 
u_g01not: g01_not(0 to 7) <= not g01_t(0 to 7) ; 
u_z01b:   z01_b  (0 to 7) <= not t01_b(0 to 7) ;
u_p01b:   p01_b  (0 to 7) <= not( g01_not(0 to 7) and z01_b(0 to 7) );
u_p01:    p01    (0 to 7) <= not( p01_b(0 to 7) );

u_g08i_0: g08_b(0) <= not g08(0) ;
u_g08i_1: g08_b(1) <= not g08(1) ;
u_g08i_2: g08_b(2) <= not g08(2) ;
u_g08i_3: g08_b(3) <= not g08(3) ;
u_g08i_4: g08_b(4) <= not g08(4) ;
u_g08i_5: g08_b(5) <= not g08(5) ;
u_g08i_6: g08_b(6) <= not g08(6) ;
u_g08i_7: g08_b(7) <= not g08(7) ;

u_t08i_0: t08_b(0) <= not t08(0) ;
u_t08i_1: t08_b(1) <= not t08(1) ;
u_t08i_2: t08_b(2) <= not t08(2) ;
u_t08i_3: t08_b(3) <= not t08(3) ;
u_t08i_4: t08_b(4) <= not t08(4) ;
u_t08i_5: t08_b(5) <= not t08(5) ;
u_t08i_6: t08_b(6) <= not t08(6) ;
u_t08i_7: t08_b(7) <= not t08(7) ;


u_sum_0_0: sum_0(0) <= not(  ( p01(0) and  g08(1) ) or  ( p01_b(0) and  g08_b(1) )   ); --output--
u_sum_0_1: sum_0(1) <= not(  ( p01(1) and  g08(2) ) or  ( p01_b(1) and  g08_b(2) )   ); --output--
u_sum_0_2: sum_0(2) <= not(  ( p01(2) and  g08(3) ) or  ( p01_b(2) and  g08_b(3) )   ); --output--
u_sum_0_3: sum_0(3) <= not(  ( p01(3) and  g08(4) ) or  ( p01_b(3) and  g08_b(4) )   ); --output--
u_sum_0_4: sum_0(4) <= not(  ( p01(4) and  g08(5) ) or  ( p01_b(4) and  g08_b(5) )   ); --output--
u_sum_0_5: sum_0(5) <= not(  ( p01(5) and  g08(6) ) or  ( p01_b(5) and  g08_b(6) )   ); --output--
u_sum_0_6: sum_0(6) <= not(  ( p01(6) and  g08(7) ) or  ( p01_b(6) and  g08_b(7) )   ); --output--

I know this file is related to the ALU add operation, but I have no idea how it works. Any advice on how to read this code?
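
One way to approach gate-level code like this is to model a single bit in software and check it against the arithmetic it is expected to implement. The sketch below follows the quoted equations literally and assumes (this is not stated in the snippet) that `g01_b` is an inverted generate `not (a and b)`, that `t01_b` is an inverted transmit `not (a or b)`, and that `g08(i+1)` acts as the carry into bit `i`. Under those assumptions the `sum_0` gate reduces to `a xor b xor carry`.

```python
from itertools import product


def p01_from_inputs(g01_b, t01_b):
    """Follow the quoted equations for one bit: returns (p01, p01_b)."""
    g01_t = 1 - g01_b                    # u_g01t
    g01_not = 1 - g01_t                  # u_g01not
    z01_b = 1 - t01_b                    # u_z01b
    p01_b = 1 - (g01_not & z01_b)        # u_p01b
    p01 = 1 - p01_b                      # u_p01
    return p01, p01_b


def sum_gate(p01, p01_b, g08, g08_b):
    """u_sum_0_x: not((p and g) or (p_b and g_b)), i.e. an XOR of p and g."""
    return 1 - ((p01 & g08) | (p01_b & g08_b))


def check_all():
    """Exhaustively compare the gate form with a xor b xor carry."""
    for a, b, carry in product((0, 1), repeat=3):
        g01_b = 1 - (a & b)              # assumed meaning of g01_b
        t01_b = 1 - (a | b)              # assumed meaning of t01_b
        p01, p01_b = p01_from_inputs(g01_b, t01_b)
        if sum_gate(p01, p01_b, carry, 1 - carry) != (a ^ b ^ carry):
            return False
    return True


if __name__ == "__main__":
    print("gate form equals a xor b xor carry:", check_all())
```

The same approach scales to the wider signals: model one statement at a time, then compare the composed result with the plain arithmetic on a small exhaustive or random input set.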

Many thanks

The order of writing the GPRs

Hi All,

In the pipeline, the A2I core writes the data into GPRs (general purpose registers) after the ex7 stage which is named as ex8 or rf1. My current understanding of A2I is that different instructions of one thread can actually write the data into the register file out-of-order.

For example, take two instructions: "Load RegA, addr1" and "Add RegB, RegC, RegD". If the load instruction misses in the L1 cache, it is stored in the LMQ and waits to access the L2 cache. If the add instruction keeps executing in the meantime, it writes its data into the GPR before the load instruction does. In that case, the register file writes for these two instructions, the load and the add, happen out of program order. Is this understanding right?

If I understand correctly, how can A2I guarantee precise interrupts or exceptions? For example, what happens if an exception occurs after the add instruction has written the GPRs?

Thanks!

How to map the branch prediction algorithm to the RTL code

I read the branch prediction section, which describes a common technique and is easy to understand.
However, when I tried to map that understanding onto the RTL code in iuq_bp.vhdl, I could not find the relation.

For example, Figure D-3 in A2_BGQ.pdf shows 1024 entries in the BP table, accessed by IFAR(50:59). The code below generates only an 8-bit address, which confuses me.

iu1_bh_ti0gs1_rd_addr(0 to 7) <= (ic_bp_iu1_ifar(52 to 55) xor iu1_gshare(0 to 3)) & ic_bp_iu1_ifar(56 to 59);
iu1_bh_ti1gs1_rd_addr(0 to 7) <= iu1_tid_enc(0 to 1) & (ic_bp_iu1_ifar(54 to 57) xor iu1_gshare(0 to 3)) & ic_bp_iu1_ifar(58 to 59);

I tracked the signal "iu1_bh_ti0gs1_rd_addr" and found that it is an input of tri_bht.vhdl. "ary_r_data" seems to be the output of the BP table, but what is "data_out", which depends on INIT_MASK and "r_addr_q(0)"? I am also confused by the code below.

data_out(0 to 7) <= gate(ary_r_data(0 to 7) xor (INIT_MASK(0 to 1) & INIT_MASK(0 to 1) & INIT_MASK(0 to 1) & INIT_MASK(0 to 1)), r_addr_q(0) = '0') or
                    gate(ary_r_data(8 to 15) xor (INIT_MASK(0 to 1) & INIT_MASK(0 to 1) & INIT_MASK(0 to 1) & INIT_MASK(0 to 1)), r_addr_q(0) = '1');

After reading the data, these data are processed again in iuq_bp.vhdl.

with ic_bp_iu3_ifar(60 to 61) select
iu3_0_br_hist <= iu3_3_bh_rd_data(0 to 1) when "11",
                 iu3_2_bh_rd_data(0 to 1) when "10",
                 iu3_1_bh_rd_data(0 to 1) when "01",
                 iu3_0_bh_rd_data(0 to 1) when others;

with ic_bp_iu3_ifar(60 to 61) select
iu3_1_br_hist <= iu3_3_bh_rd_data(0 to 1) when "10",
                 iu3_2_bh_rd_data(0 to 1) when "01",
                 iu3_1_bh_rd_data(0 to 1) when others;

with ic_bp_iu3_ifar(60 to 61) select
iu3_2_br_hist <= iu3_3_bh_rd_data(0 to 1) when "01",
                 iu3_2_bh_rd_data(0 to 1) when others;

iu3_3_br_hist <= iu3_3_bh_rd_data(0 to 1);

Can someone tell me how to understand this code from the architectural view? Any hints are welcome.
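
A software model of the quoted statements can make the indexing easier to follow. The sketch below mirrors the two read-address expressions and the history-select statements as written, assuming IBM bit numbering on a 64-bit address (bit 0 is the most significant bit). The function names are hypothetical, and the reading that IFAR(60:61) is the position of the first instruction within a four-instruction fetch group is an assumption, not something the snippet confirms.

```python
def field(ea, first, last):
    """Extract bits first..last of a 64-bit value, with bit 0 as the MSB."""
    width = last - first + 1
    return (ea >> (63 - last)) & ((1 << width) - 1)


def bht_index_ti0(ifar, gshare):
    """Mirror iu1_bh_ti0gs1_rd_addr: (ifar(52:55) xor gshare) & ifar(56:59)."""
    return ((field(ifar, 52, 55) ^ (gshare & 0xF)) << 4) | field(ifar, 56, 59)


def bht_index_ti1(ifar, gshare, tid_enc):
    """Mirror iu1_bh_ti1gs1_rd_addr: tid & (ifar(54:57) xor gshare) & ifar(58:59)."""
    hashed = field(ifar, 54, 57) ^ (gshare & 0xF)
    return ((tid_enc & 0x3) << 6) | (hashed << 2) | field(ifar, 58, 59)


def align_history(rd_data, ifar):
    """Mirror the iu3_N_br_hist selects.

    Slot k takes read column k + offset, where offset = ifar(60:61).
    When k + offset runs past column 3, the VHDL `others` branch returns
    column k; those slots are presumably unused.
    """
    offset = field(ifar, 60, 61)
    return [rd_data[k + offset] if k + offset <= 3 else rd_data[k] for k in range(4)]


if __name__ == "__main__":
    print(hex(bht_index_ti0(0xA50, 0x3)))
    print(hex(bht_index_ti1(0xA50, 0x3, 2)))
    print(align_history(["h0", "h1", "h2", "h3"], 0x4))
```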

About some unknown instructions in A2 Core

Hi all,

When I was reading the A2 code, I found some instruction decodes I do not recognize in xuq_dec_b.vhdl, shown below. I guess these instructions are related to DITC (direct inter-thread communication), but I cannot find any information about them in the A2 manual or the Power ISA manual. Any help?

rf1_is_mfdp                  <=  '1' when rf1_opcode_is_31_q(3) = '1'  and   rf1_instr_21to30_04_q(21 to 30)    = "0000100011"                              else '0';
rf1_is_mfdpx                 <=  '1' when rf1_opcode_is_31_q(3) = '1'  and   rf1_instr_21to30_04_q(21 to 30)    = "0000000011"                              else '0';
rf1_is_mtdp                  <=  '1' when rf1_opcode_is_31_q(3) = '1'  and   rf1_instr_21to30_04_q(21 to 30)    = "0001100011"                              else '0';
rf1_is_mtdpx                 <=  '1' when rf1_opcode_is_31_q(4) = '1'  and   rf1_instr_21to30_04_q(21 to 30)    = "0001000011"                              else '0';
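
For reference, the decodes above compare instruction bits 21 to 30 (the extended opcode field of a 32-bit Power instruction) against fixed patterns under primary opcode 31. The sketch below shows that field extraction; the mnemonic table contains only the four names and patterns quoted above, `decode` and `encode` are hypothetical helpers, and nothing here describes what the instructions actually do.

```python
# Extended-opcode patterns copied from the quoted decode lines.
XO_NAMES = {
    0b0000100011: "mfdp",
    0b0000000011: "mfdpx",
    0b0001100011: "mtdp",
    0b0001000011: "mtdpx",
}


def decode(instr):
    """Return the mnemonic for a 32-bit instruction word, or None if unknown.

    Bits 0:5 (the six most significant bits) are the primary opcode and
    bits 21:30 are the extended opcode; bit 31 is the least significant bit.
    """
    primary = (instr >> 26) & 0x3F
    xo = (instr >> 1) & 0x3FF
    if primary != 31:
        return None
    return XO_NAMES.get(xo)


def encode(primary, xo):
    """Build a word with only the opcode fields set (operands left as zero)."""
    return ((primary & 0x3F) << 26) | ((xo & 0x3FF) << 1)


if __name__ == "__main__":
    print(decode(encode(31, 0b0000100011)))
    print(decode(encode(31, 0b0001000011)))
```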

How to calculate the set number based on the EA (effective address) in the MMU module

There are 512 entries in the TLB, organized as 4 ways by 128 sets. The a2i core also supports several page sizes, such as 4KB, 64KB, 1MB and so on. I am curious how the MMU calculates the set number when an effective/virtual address arrives.
In mmq_tlb_ctl.vhdl, I found the logic related to the set number calculation, shown below.

size_1G_hashed_addr(6) <=  tlb_tag0_q(33) xor tlb_tag0_q(tagpos_pid+pid_width-1);
size_1G_hashed_addr(5) <=  tlb_tag0_q(32) xor tlb_tag0_q(tagpos_pid+pid_width-2);
size_1G_hashed_addr(4) <=  tlb_tag0_q(31) xor tlb_tag0_q(tagpos_pid+pid_width-3);
size_1G_hashed_addr(3) <=  tlb_tag0_q(30) xor tlb_tag0_q(tagpos_pid+pid_width-4);
size_1G_hashed_addr(2) <=  tlb_tag0_q(29) xor tlb_tag0_q(tagpos_pid+pid_width-5);
size_1G_hashed_addr(1) <=  tlb_tag0_q(28) xor tlb_tag0_q(tagpos_pid+pid_width-6);
size_1G_hashed_addr(0) <=  tlb_tag0_q(27) xor tlb_tag0_q(tagpos_pid+pid_width-7);
size_1G_hashed_tid0_addr(6) <=  tlb_tag0_q(33);
size_1G_hashed_tid0_addr(5) <=  tlb_tag0_q(32);
size_1G_hashed_tid0_addr(4) <=  tlb_tag0_q(31);
size_1G_hashed_tid0_addr(3) <=  tlb_tag0_q(30);
size_1G_hashed_tid0_addr(2) <=  tlb_tag0_q(29);
size_1G_hashed_tid0_addr(1) <=  tlb_tag0_q(28);
size_1G_hashed_tid0_addr(0) <=  tlb_tag0_q(27);
size_256M_hashed_addr(6) <=  tlb_tag0_q(35)                     xor tlb_tag0_q(tagpos_pid+pid_width-1);
size_256M_hashed_addr(5) <=  tlb_tag0_q(34)                     xor tlb_tag0_q(tagpos_pid+pid_width-2);
size_256M_hashed_addr(4) <=  tlb_tag0_q(33)                     xor tlb_tag0_q(tagpos_pid+pid_width-3);
size_256M_hashed_addr(3) <=  tlb_tag0_q(32)                     xor tlb_tag0_q(tagpos_pid+pid_width-4);
size_256M_hashed_addr(2) <=  tlb_tag0_q(31)                     xor tlb_tag0_q(tagpos_pid+pid_width-5);
size_256M_hashed_addr(1) <=  tlb_tag0_q(30) xor tlb_tag0_q(28) xor tlb_tag0_q(tagpos_pid+pid_width-6);
size_256M_hashed_addr(0) <=  tlb_tag0_q(29) xor tlb_tag0_q(27) xor tlb_tag0_q(tagpos_pid+pid_width-7);
size_256M_hashed_tid0_addr(6) <=  tlb_tag0_q(35);
size_256M_hashed_tid0_addr(5) <=  tlb_tag0_q(34);
size_256M_hashed_tid0_addr(4) <=  tlb_tag0_q(33);
size_256M_hashed_tid0_addr(3) <=  tlb_tag0_q(32);
size_256M_hashed_tid0_addr(2) <=  tlb_tag0_q(31);
size_256M_hashed_tid0_addr(1) <=  tlb_tag0_q(30) xor tlb_tag0_q(28);
size_256M_hashed_tid0_addr(0) <=  tlb_tag0_q(29) xor tlb_tag0_q(27);
size_16M_hashed_addr(6) <=  tlb_tag0_q(39)                     xor tlb_tag0_q(tagpos_pid+pid_width-1);
size_16M_hashed_addr(5) <=  tlb_tag0_q(38)                     xor tlb_tag0_q(tagpos_pid+pid_width-2);
size_16M_hashed_addr(4) <=  tlb_tag0_q(37)                     xor tlb_tag0_q(tagpos_pid+pid_width-3);
size_16M_hashed_addr(3) <=  tlb_tag0_q(36) xor tlb_tag0_q(32) xor tlb_tag0_q(tagpos_pid+pid_width-4);
size_16M_hashed_addr(2) <=  tlb_tag0_q(35) xor tlb_tag0_q(31) xor tlb_tag0_q(tagpos_pid+pid_width-5);
size_16M_hashed_addr(1) <=  tlb_tag0_q(34) xor tlb_tag0_q(30) xor tlb_tag0_q(tagpos_pid+pid_width-6);
size_16M_hashed_addr(0) <=  tlb_tag0_q(33) xor tlb_tag0_q(29) xor tlb_tag0_q(tagpos_pid+pid_width-7);
size_16M_hashed_tid0_addr(6) <=  tlb_tag0_q(39);
size_16M_hashed_tid0_addr(5) <=  tlb_tag0_q(38);
size_16M_hashed_tid0_addr(4) <=  tlb_tag0_q(37);
size_16M_hashed_tid0_addr(3) <=  tlb_tag0_q(36) xor tlb_tag0_q(32);
size_16M_hashed_tid0_addr(2) <=  tlb_tag0_q(35) xor tlb_tag0_q(31);
size_16M_hashed_tid0_addr(1) <=  tlb_tag0_q(34) xor tlb_tag0_q(30);
size_16M_hashed_tid0_addr(0) <=  tlb_tag0_q(33) xor tlb_tag0_q(29);
size_1M_hashed_addr(6) <=  tlb_tag0_q(43) xor tlb_tag0_q(36) xor tlb_tag0_q(tagpos_pid+pid_width-1);
size_1M_hashed_addr(5) <=  tlb_tag0_q(42) xor tlb_tag0_q(35) xor tlb_tag0_q(tagpos_pid+pid_width-2);
size_1M_hashed_addr(4) <=  tlb_tag0_q(41) xor tlb_tag0_q(34) xor tlb_tag0_q(tagpos_pid+pid_width-3);
size_1M_hashed_addr(3) <=  tlb_tag0_q(40) xor tlb_tag0_q(33) xor tlb_tag0_q(tagpos_pid+pid_width-4);
size_1M_hashed_addr(2) <=  tlb_tag0_q(39) xor tlb_tag0_q(32) xor tlb_tag0_q(tagpos_pid+pid_width-5);
size_1M_hashed_addr(1) <=  tlb_tag0_q(38) xor tlb_tag0_q(31) xor tlb_tag0_q(tagpos_pid+pid_width-6);
size_1M_hashed_addr(0) <=  tlb_tag0_q(37) xor tlb_tag0_q(30) xor tlb_tag0_q(tagpos_pid+pid_width-7);
size_1M_hashed_tid0_addr(6) <=  tlb_tag0_q(43) xor tlb_tag0_q(36);
size_1M_hashed_tid0_addr(5) <=  tlb_tag0_q(42) xor tlb_tag0_q(35);
size_1M_hashed_tid0_addr(4) <=  tlb_tag0_q(41) xor tlb_tag0_q(34);
size_1M_hashed_tid0_addr(3) <=  tlb_tag0_q(40) xor tlb_tag0_q(33);
size_1M_hashed_tid0_addr(2) <=  tlb_tag0_q(39) xor tlb_tag0_q(32);
size_1M_hashed_tid0_addr(1) <=  tlb_tag0_q(38) xor tlb_tag0_q(31);
size_1M_hashed_tid0_addr(0) <=  tlb_tag0_q(37) xor tlb_tag0_q(30);
size_64K_hashed_addr(6) <=  tlb_tag0_q(47)                     xor tlb_tag0_q(37) xor tlb_tag0_q(tagpos_pid+pid_width-1);
size_64K_hashed_addr(5) <=  tlb_tag0_q(46)                     xor tlb_tag0_q(36) xor tlb_tag0_q(tagpos_pid+pid_width-2);
size_64K_hashed_addr(4) <=  tlb_tag0_q(45)                     xor tlb_tag0_q(35) xor tlb_tag0_q(tagpos_pid+pid_width-3);
size_64K_hashed_addr(3) <=  tlb_tag0_q(44)                     xor tlb_tag0_q(34) xor tlb_tag0_q(tagpos_pid+pid_width-4);
size_64K_hashed_addr(2) <=  tlb_tag0_q(43) xor tlb_tag0_q(40) xor tlb_tag0_q(33) xor tlb_tag0_q(tagpos_pid+pid_width-5);
size_64K_hashed_addr(1) <=  tlb_tag0_q(42) xor tlb_tag0_q(39) xor tlb_tag0_q(32) xor tlb_tag0_q(tagpos_pid+pid_width-6);
size_64K_hashed_addr(0) <=  tlb_tag0_q(41) xor tlb_tag0_q(38) xor tlb_tag0_q(31) xor tlb_tag0_q(tagpos_pid+pid_width-7);
size_64K_hashed_tid0_addr(6) <=  tlb_tag0_q(47)                     xor tlb_tag0_q(37);
size_64K_hashed_tid0_addr(5) <=  tlb_tag0_q(46)                     xor tlb_tag0_q(36);
size_64K_hashed_tid0_addr(4) <=  tlb_tag0_q(45)                     xor tlb_tag0_q(35);
size_64K_hashed_tid0_addr(3) <=  tlb_tag0_q(44)                     xor tlb_tag0_q(34);
size_64K_hashed_tid0_addr(2) <=  tlb_tag0_q(43) xor tlb_tag0_q(40) xor tlb_tag0_q(33);
size_64K_hashed_tid0_addr(1) <=  tlb_tag0_q(42) xor tlb_tag0_q(39) xor tlb_tag0_q(32);
size_64K_hashed_tid0_addr(0) <=  tlb_tag0_q(41) xor tlb_tag0_q(38) xor tlb_tag0_q(31);
size_4K_hashed_addr(6) <=  tlb_tag0_q(51) xor tlb_tag0_q(44) xor tlb_tag0_q(37) xor tlb_tag0_q(tagpos_pid+pid_width-1);
size_4K_hashed_addr(5) <=  tlb_tag0_q(50) xor tlb_tag0_q(43) xor tlb_tag0_q(36) xor tlb_tag0_q(tagpos_pid+pid_width-2);
size_4K_hashed_addr(4) <=  tlb_tag0_q(49) xor tlb_tag0_q(42) xor tlb_tag0_q(35) xor tlb_tag0_q(tagpos_pid+pid_width-3);
size_4K_hashed_addr(3) <=  tlb_tag0_q(48) xor tlb_tag0_q(41) xor tlb_tag0_q(34) xor tlb_tag0_q(tagpos_pid+pid_width-4);
size_4K_hashed_addr(2) <=  tlb_tag0_q(47) xor tlb_tag0_q(40) xor tlb_tag0_q(33) xor tlb_tag0_q(tagpos_pid+pid_width-5);
size_4K_hashed_addr(1) <=  tlb_tag0_q(46) xor tlb_tag0_q(39) xor tlb_tag0_q(32) xor tlb_tag0_q(tagpos_pid+pid_width-6);
size_4K_hashed_addr(0) <=  tlb_tag0_q(45) xor tlb_tag0_q(38) xor tlb_tag0_q(31) xor tlb_tag0_q(tagpos_pid+pid_width-7);
size_4K_hashed_tid0_addr(6) <=  tlb_tag0_q(51) xor tlb_tag0_q(44) xor tlb_tag0_q(37);
size_4K_hashed_tid0_addr(5) <=  tlb_tag0_q(50) xor tlb_tag0_q(43) xor tlb_tag0_q(36);
size_4K_hashed_tid0_addr(4) <=  tlb_tag0_q(49) xor tlb_tag0_q(42) xor tlb_tag0_q(35);
size_4K_hashed_tid0_addr(3) <=  tlb_tag0_q(48) xor tlb_tag0_q(41) xor tlb_tag0_q(34);
size_4K_hashed_tid0_addr(2) <=  tlb_tag0_q(47) xor tlb_tag0_q(40) xor tlb_tag0_q(33);
size_4K_hashed_tid0_addr(1) <=  tlb_tag0_q(46) xor tlb_tag0_q(39) xor tlb_tag0_q(32);
size_4K_hashed_tid0_addr(0) <=  tlb_tag0_q(45) xor tlb_tag0_q(38) xor tlb_tag0_q(31);

It seems to me that the set number is calculated with a different hash function for each page size. As far as I know, a normal virtual-to-physical translation first calculates the set number, then compares the tags of the 4 ways in that set to find the matching way, and finally does the translation.

If this is the translation process that the A2I core follows, my question is: when an effective address arrives, how can the MMU know which page size this address is using?
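
The regular hash equations above (4K, 1M, 1G) can be modeled compactly. The sketch below reproduces them under two assumptions that the snippet does not confirm: that the `tlb_tag0_q` indices 27 to 51 correspond to effective-address bits in IBM numbering (bit 0 is the MSB of a 64-bit address), and that `addr(6)` is the least significant bit of the 7-bit set number. `hashed_set` and `candidate_sets` are hypothetical names. The last function illustrates one generic way a set-associative TLB can handle mixed page sizes, by computing one candidate set per size and probing each; whether and in what order A2I does this is not established by the code shown.

```python
# EA bit numbers feeding addr(6) for each page size, copied from the equations
# above; addr(i) uses each of these bits minus (6 - i).
TERMS = {
    "4K": (51, 44, 37),
    "1M": (43, 36),
    "1G": (33,),
}


def ea_bit(ea, n):
    """Bit n of a 64-bit address, with bit 0 as the most significant bit."""
    return (ea >> (63 - n)) & 1


def hashed_set(ea, size, pid=0):
    """7-bit set number for one page size.

    With pid=0 this matches the *_hashed_tid0_addr equations; otherwise the
    low seven PID bits are XORed in, as in the *_hashed_addr equations.
    """
    result = 0
    for i in range(7):                      # addr(0) first, so it ends up as MSB
        bit = 0
        for top in TERMS[size]:
            bit ^= ea_bit(ea, top - (6 - i))
        result = (result << 1) | bit
    return result ^ (pid & 0x7F)


def candidate_sets(ea, pid=0):
    """One candidate set per modeled page size (illustrative probing scheme)."""
    return {size: hashed_set(ea, size, pid) for size in TERMS}


if __name__ == "__main__":
    print(candidate_sets(0x0000_0000_1234_5000, pid=0x15))
```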

CoreMark and Simulation

@openpowerwtf Hi buddy, has your team already used CoreMark to test the functionality of the A2 processor core? If so, may I know the result? We are planning to test the A2 core on an FPGA using CoreMark and UART. Do you have any suggestions for this plan?
We want to send the CoreMark result from our FPGA board to a PC through a UART cable and the axi_uart IP, and to do the same in simulation. We ran into a few issues in simulation. We wrote our own boot code, based on the simple a2_boot code you provided, to test the UART IP both on the board and in simulation. In simulation, after modelsim had run for a while, we got the fatal error shown in the attachments. To solve this, we changed line 1025 in a2l2_axi.vhdl from shl_ordered <= rotl(req_p1_any_shl, stq_head_q) to shl_ordered <= rotl(req_p1_any_shl, stq_head_q(2 to 3)) and line 1027 from shl_youngest <= rotr(shl_ordered_youngest, stq_head_q) to shl_youngest <= rotr(shl_ordered_youngest, stq_head_q(2 to 3)), and that fixed the issue. Do you think this change might cause other issues in the design? Is it normal to encounter this fatal error in simulation? Under what circumstances would it be triggered, and is there a way to prevent it? Thx!

Assistance Required in building A2I core in Vivado Windows

I am trying to build the a2i core in Vivado on a Windows machine. I watched the video uploaded for the IBM a2i core and tried to follow the same steps on Windows:
In VIVADO -> TCL CONSOLE, I ran the Tcl scripts for the IPs mentioned (reverserator32, reverserator 64, ...) to build the core and saved the project in a folder (say, ip_repo).
When I then run create_a2x_project.tcl followed by fixup_a2x_bd.tcl, nothing comes up.
I would appreciate your assistance with this.

Code not compiling

The provided code fails to compile (with Modelsim) due to

Declarations are missing for a number of attributes. For example like_builtin used here.
The references to libraries latches and macros. They are not used so the code compiles without them. Here is an example.

Questions about MMQ TLB xbit hole

Hi, guys. I'm looking at the memory xbit part.

If the xbit is set in a TLB entry, how is the entry matched against the corresponding effective address?

As far as I know, the matching process does not distinguish the "hole" from the rest of the page.

The process just identifies the page-size part of the TLB entry and of the incoming address (bits 0 to N, where N = 64 - log2(page-size)),

and then matches bits N to 63 of the entry and the incoming address.

Does anyone have the same or different thoughts?

A design drawback of thread issue selection?

Dear all,

I am running the simulation of 4 threads together to test the issue unit. Here is the source code of the program which is super easy.

Below is the simulation wave and I find some problems with the i_afi_is2_take signal.

In particular, in the time window highlighted by the vertical red line, i_afi_is2_take[0] stays high. This is because the "addi 17, 7, 10" instruction of thread0 is stalled in the iuq_fxu_issue unit. Meanwhile, the "fadd 7, 4, 5" instruction has already entered the iuq_axu_fu_dec unit and pulls the signals i_afd_is1_instr_v and i_axu_is1_early_v high. The iuq_axu_fu_iss unit then sets iu_is2_take_t[0], i.e., i_afi_is2_take[0], high.

I noticed that i_afi_is2_take[0] staying high does not affect correctness, since the signal i_afd_is2_t0_instr_v is low. However, I think the incorrect value of i_afi_is2_take will affect the scheduling scheme, since the scheduling counter will increase. Any comments?

Many thanks

Questions about the medium priority when selecting threads to issue

According to the A2_BGQ manual, the selection logic for the issue logic is a simple round-robin scheme with three levels of priority to allow software more flexibility. As for the medium priority, the manual mentioned:
Medium priority instructions only issue when no high-priority instruction is available. The same time-out counter eventually promotes the medium-priority instruction to high-priority to prevent starvation.
But I didn't find the medium counter in iuq_fxu_issue.vhdl. So how could it issue?
Besides, the manual gave an example of round-robin logic when selecting a thread to issue on Page 81(figure 2-4).

In my opinion, thread 2 and thread 3 are both promoted to high priority after 3 cycles. So I think thread 3, rather than thread 1, should be issued after thread 2. Why is that not the case?
Many thanks!

Address space of an SoC using A2I core

Hi,
I am curious about the memory address space of an SoC, such as Blue Gene/Q, that uses the A2I core. I also searched for the address space of PowerPC and Power-series servers but did not find any useful information. Are there any recommended references?

Missing testbenches

I would like to see your testbenches provided. Without them it will be difficult to create pull requests with any confidence that the proposed changes work. Do you intend to open-source the tests?

Problem in simulation of the a2l2_axi.vhdl file

Hi buddy @openpowerwtf, we are currently simulating the a2l2_axi.vhdl file. We first initialized the cache by storing random data to all addresses consecutively. After all store commands finished, we issued load (001000) and ifetch (000000) requests in two consecutive cycles to fetch 4 words. The an_ac_reld_data we got was correct in this case. Then we again issued load (001000) and ifetch (000000) requests in two consecutive cycles, this time to fetch 16 words. For the first group of 16 words, the last 4 words were missing: they were not returned on an_ac_reld_data (all zero) because rld_data_qw3 was all zero. The second group of 16 words was fetched into an_ac_reld_data properly. What might cause the last 4 words to go missing? Is this a bug? The corresponding waveform was uploaded here. Thanks!

The below-water mark in iuq_ib_buff.vhdl

In iuq_ib_buff.vhdl, the "ib_ic_below_water" signal is set to 1 when "buffer4_valid_l2=1'b0" and the "ib_ic_empty" signal is set to 1 when "buffer1_valid_l2=1'b0".

ib_ic_empty <= not (buffer1_valid_l2 or stall_l2(0));
ib_ic_below_water <= (not buffer4_valid_l2) or (not buffer5_valid_l2 and not stall_l2(0));

Then, in iuq_ic_select.vhdl file, "low_mask_d" signal is set when "ib_ic_below_water" is 1 and "high_mask_d" is set when "ib_ic_empty" is 1. I am confused here.

According to the A2I manual, "There are two watermarks within the instruction buffer that determine a thread’s priority level for fetches that are empty and half-empty, a half-empty level gives the thread a low-priority fetch request."

If "ib_ic_below_water" represents half-empty, I think it should be set to 1 when "buffer4_valid_l2" is 1.

With the current code, it seems to me that "ib_ic_empty" and "ib_ic_below_water" will both be set to 1 if the instruction buffer is totally empty. This looks like a conflict, since low_mask_d and high_mask_d will be set to 1 at the same time.

Why choose ex4 to handle interrupts and exceptions

As shown in xuq_cpl.vhdl, ex4 is the stage that handles interrupts and exceptions. Why ex4 instead of ex3 or ex5? Any comments about this?

How to match the incoming effective/virtual address with a given TLB entry when considering the xbit?

In mmq_tlb_matchline.vhdl, there are two kinds of signals:
function_34_51 <= not(entry_xbit) or
                  not(pgsize_eq_1G) or
                  or_reduce(entry_epn_b(34 to 51) and addr_in(34 to 51));
-- 256M
function_36_51 <= not(entry_xbit) or
                  not(pgsize_eq_256M) or
                  or_reduce(entry_epn_b(36 to 51) and addr_in(36 to 51));
-- 16M
function_40_51 <= not(entry_xbit) or
                  not(pgsize_eq_16M) or
                  or_reduce(entry_epn_b(40 to 51) and addr_in(40 to 51));
-- 1M
function_44_51 <= not(entry_xbit) or
                  not(pgsize_eq_1M) or
                  or_reduce(entry_epn_b(44 to 51) and addr_in(44 to 51));
-- 64K
function_48_51 <= not(entry_xbit) or
                  not(pgsize_eq_64K) or
                  or_reduce(entry_epn_b(48 to 51) and addr_in(48 to 51));

comp_or_48_51 <= and_reduce(match_line(48 to 51)) or pgsize_gte_64K;
comp_or_44_47 <= and_reduce(match_line(44 to 47)) or pgsize_gte_1M;
comp_or_40_43 <= and_reduce(match_line(40 to 43)) or pgsize_gte_16M;
comp_or_36_39 <= and_reduce(match_line(36 to 39)) or pgsize_gte_256M;
comp_or_34_35 <= and_reduce(match_line(34 to 35)) or pgsize_gte_1G;

When the xbit is used, what do the signals function_n_51 and comp_or_n1_n2 mean? And what do they actually do when matching the effective/virtual address with the TLB entry?
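As a way to experiment with the quoted equations, the following Python sketch evaluates them literally. It assumes that `entry_epn_b` is the bit-inverted entry EPN (the `_b` suffix read as "bar") and that bits are numbered 0 to 51 with bit 0 as the most significant; both are assumptions, and `field` is a hypothetical helper. It only reproduces the boolean expressions and does not claim to describe the intended X-bit semantics.

```python
EPN_BITS = 52  # bits 0..51, with bit 0 taken as the most significant (assumption)


def field(value, first, last):
    """Extract bits first..last (inclusive) from a 52-bit value."""
    width = last - first + 1
    return (value >> (EPN_BITS - 1 - last)) & ((1 << width) - 1)


def function_n_51(entry_xbit, pgsize_eq, entry_epn, addr_in, first):
    """Literal model of function_<first>_51 from the quoted VHDL."""
    # Assumption: entry_epn_b is the inverted entry EPN.
    entry_epn_b = ~entry_epn & ((1 << EPN_BITS) - 1)
    # or_reduce(entry_epn_b(first to 51) and addr_in(first to 51))
    hit = field(entry_epn_b, first, 51) & field(addr_in, first, 51)
    return (not entry_xbit) or (not pgsize_eq) or (hit != 0)


def comp_or(match_line_bits, pgsize_gte):
    """Literal model of comp_or_a_b: the bit compare is bypassed for larger pages."""
    return all(match_line_bits) or pgsize_gte


if __name__ == "__main__":
    # 1M page case (function_44_51), entry EPN bits 44..51 = 0x0F
    for addr_low in (0x05, 0x0F, 0x10, 0xF0):
        result = function_n_51(True, True, 0x0F, addr_low, first=44)
        print(f"addr(44:51)=0x{addr_low:02X} -> function_44_51={int(result)}")
```

With the X bit set and the matching page size, the term is 1 only when the address has a 1 somewhere in the low bits where the entry EPN has a 0. Otherwise (X bit clear or a different page size) the term is forced to 1.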

Resources required?

How many BRAM and LUT resources are needed on the FPGA for this core when the full design (including the AXI bridge) is synthesized?

Any hints about the ignore_flush_is0 signal

Hi guys,

I cannot clearly understand the code below and its comments. Can anyone give me more hints about this? What does "before or after the "real" fxu selection" mean?

`-- During fdiv/fsqrt the axu may select this thread before or after the "real" fxu selection.
-- If the axu selects this thread earlier than the fxu, s1 is simply updated early.
-- If the axu selects this thread later than the fxu, ucode instructions would get wiped out by the flush
-- This signal protects the instruction from being flushed
ignore_flush_is0 <= (fdiv_is0 or fsqrt_is0) and isfu_dec_is0; -- these opcodes will not change the FpScr or any Fpr. Only scratch reg s0 will be changed
`

Many thanks

about the expand_type

Hi Guys,

The comment about expand_type is "-- 0 = ibm (Umbra), 1 = non-ibm, 2 = ibm (MPG)".
What does this mean? Can someone give any detailed explanation?

Many thanks

Unbalanced instruction fetch latency and instruction fetch width

Dear all,

I noticed that there is a 5-cycle delay, i.e., IU0, IU1, IU2, IU3 and IU4, between the thread selection and putting the instruction into the instruction buffer so that it can be issued. Since we can only fetch 4 instructions per cycle in the ideal case, are there any unbalanced cases here?
In particular, if each thread could issue one instruction per cycle in the ideal case, I think the current instruction fetch unit would cause thread starvation, i.e., a thread cannot issue instructions because its instruction buffer is empty. I know this will not happen in A2I since 4 threads need to compete for the 2 execution units here. Just an architectural design discussion:)

Cheers,

L1 cache interface

Hi, in order to implement an L2 cache, we want to build a block-level testbench containing the L1 and L2 caches. How could I get the L1 interface, or the L1 RTL block?

The LRU algorithm in I-ERAT

I spent one week reading the I-ERAT source code to understand how the LRU algorithm works. I still have not found the right way to read the code, and I am totally lost in the combinational logic.
I guess the core LRU implementation does not take much code, but the watermark and other SPR operations complicate the design a lot.

Can someone help me quickly understand the LRU algorithm in A2I? I know the classic LRU algorithm that is widely taught in computer courses... Here, it seems to be a totally different LRU from what I learned before:))
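For readers comparing against textbook LRU: hardware designs often use a tree pseudo-LRU rather than true LRU, because it needs only N-1 state bits for N entries. The sketch below shows that generic technique in Python. Whether the I-ERAT uses exactly this scheme is an assumption not confirmed by anything in this thread, and the watermark and SPR interactions mentioned above are not modeled. The class name and bit encoding are hypothetical.

```python
class TreePLRU:
    """Generic tree pseudo-LRU over a power-of-two number of ways."""

    def __init__(self, ways=16):
        assert ways >= 2 and ways & (ways - 1) == 0, "ways must be a power of two"
        self.ways = ways
        # Heap-style node numbering 1..ways-1; index 0 is unused.
        # Encoding (assumption): bit 0 = next victim is in the left half, 1 = right half.
        self.bits = [0] * ways

    def touch(self, way):
        """Record an access: make every node on the path point away from `way`."""
        node, lo, span = 1, 0, self.ways
        while span > 1:
            half = span // 2
            go_right = way >= lo + half
            self.bits[node] = 0 if go_right else 1
            node = 2 * node + (1 if go_right else 0)
            if go_right:
                lo += half
            span = half

    def victim(self):
        """Follow the node bits from the root to pick the replacement way."""
        node, lo, span = 1, 0, self.ways
        while span > 1:
            half = span // 2
            go_right = self.bits[node] == 1
            node = 2 * node + (1 if go_right else 0)
            if go_right:
                lo += half
            span = half
        return lo


if __name__ == "__main__":
    plru = TreePLRU(ways=4)
    for way in (0, 2, 1, 3):
        plru.touch(way)
        print(f"touched way {way}, next victim would be way {plru.victim()}")
```

The point of the sketch is that each access updates only the bits on one root-to-leaf path, so the update and victim selection reduce to a handful of gates per node, which may be why the RTL looks nothing like the list-based LRU from courses.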

About the memory consistency model in A2

Hi guys,

Power ISA book III-S mentions "Stores are not performed out-of-order (even if the Store instructions that caused them were executed out-of-order).",

However, many papers mention that "the memory consistency model in Power is very relaxed, where stores can be out-of-order".

My question is: can stores be out-of-order in Power ISA and in A2? I might be misunderstanding the Power ISA book...

Many thanks

About the payload bits of msgsnd instruction

Hi,
As mentioned in 7.7.2 Doorbell Message Filtering, if bit 37 of the payload is set, the message is accepted by all processors regardless of the value of the PIR register and the value of PIRTAG.

However, if you look at the xuq_spr_cspr.vhdl file, the doorbell signal is created as shown below.

set_dbell(t) <= lsu_xu_dbell_val_q and lsu_xu_dbell_type_q = "00000" and lsu_xu_dbell_lpid_match_q and (lsu_xu_dbell_brdcast_q or (dbell_pir_match and dbell_pir_thread(t)));
It is clear that lsu_xu_dbell_brdcast_q can only affect the PIR comparison. In other words, if broadcast is set in the message but the LPIDTAG does not match, the message cannot be accepted.

Any comments?
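The point about the broadcast bit can be checked with a small truth-function model of the quoted assignment. This is a Python sketch that mirrors the expression term by term for a single thread; the argument names are shortened versions of the signals in the excerpt, and it makes no claim about what the architecture intends.

```python
def set_dbell(val, dbell_type, lpid_match, brdcast, pir_match, pir_thread):
    """Literal model of the quoted set_dbell(t) assignment in xuq_spr_cspr.vhdl."""
    return bool(
        val
        and dbell_type == "00000"
        and lpid_match
        and (brdcast or (pir_match and pir_thread))
    )


if __name__ == "__main__":
    # Broadcast set, PIR does not match, LPID matches: accepted.
    print(set_dbell(True, "00000", True, True, False, False))
    # Broadcast set, but LPID does not match: not accepted.
    print(set_dbell(True, "00000", False, True, True, True))
```

In this model the broadcast term sits inside the parentheses with the PIR terms, so it can bypass only the PIR comparison and never the LPID match, which is the behavior described in the post.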

How to deal with wrong target address prediction of link stack

Hi Guys,

Based on my understanding, the link stack can give a wrong prediction of the branch target address. How does A2 detect this case and flush the pipeline? I did not find any related logic in iuq_bp.vhdl.

Many thanks

Fix regression with new GHDL with locally static entity attributes

CI started breaking here (no a2i code change, just new ghdl/vunit:mcode docker images)
https://github.com/openpower-cores/a2i/actions/runs/2843708471

It seems to come down to a new check in:

/src/rel/src/vhdl/clib/c_prism_csa42.vhdl:58:14: attribute expression for 'entity' must be locally static
   ATTRIBUTE PIN_BIT_INFORMATION of c_prism_csa42 : entity is
             ^
/src/rel/src/vhdl/clib/c_prism_csa42.vhdl:58:14:note: (you can use -frelaxed to turn this error into a warning)
   ATTRIBUTE PIN_BIT_INFORMATION of c_prism_csa42 : entity is

pin_bit_information is defined as

  type pbi_el_t is array(0 to 3) of string;
  type pbi_t is array(integer range <>) of pbi_el_t;
  attribute pin_bit_information: pbi_t;

I think the unconstrained pbi_t array is the issue.

@tgingold Any chance you could help work out how to fix this? pin_bit_information is used in lots of locations with different sizes, so constraining it will be difficult without a major rework.

Question regarding the address map of A2i core

Hi @openpowerwtf, does the A2I core have an address map similar to the one mentioned in the attachment? When we first allocated a 512M address space to DDR memory starting at address 0x8000_0000, DDR did not work properly. We were able to write data to DDR and read correct data back through JTAG, but when the core started running, we did not see any data being transferred over the AXI bus between the core and DDR. After we changed the DDR starting address to 0x2000_0000, the core and DDR started working properly, and we saw data being transferred over the AXI bus between them. So we are wondering whether the A2I core also has an address map for certain peripherals. For example, if we want to add DDR, an SPI bus connected to an SD card, and a UART as peripherals for the A2I core, what would be the ideal starting addresses for them? Thx!

How could I simulate it in Vivado or any other tools?

In Vivado, it reports many errors concentrated in the ibm library, e.g. a boolean does not match an integer literal, and the invert function has too many elements.

What's more, I tried irun 15.2 and VCS Mx; both reported many compile errors.

How do you run the simulation and verification? How could I run a high-level simulation?

thanks

What's the function of field LPID?

In the A2_BGQ, the Logical Partition ID's function is described as below:

The LPID is part of the virtual address and is used during address translation comparing LPID to the TLPID field in the TLB entry to determine a matching TLB entry.

In the TLB entry, the TLPID's function is described as below:
Specifies the value that is written to the TLB entry TLPID field by the execution of tlbwe. This field is loaded from the TLB entry TLPID by the execution of tlbre and by the execution of tlbsx that finds a matching entry. The TLB entry TLPID identifies the logical partition associated with the TLB entry.

What does the logical partition really mean? And what does this field do for matching a TLB entry?

Question regarding how to drive signal ac_an_req_ld_xfr_len and ac_an_req_ttype (hwsync) of a2l2_axi.vhdl in simulation.

@openpowerwtf When we set req_ld_xfr_len to b’111 (32 bytes) for load ttype in simulation, we didn’t get correct reld_data. We only saw the first 16 bytes on reld_data and the last 16 bytes were missing, but it looked good on axi_rdata. It worked fine with req_ld_xfr_len being set to other values apart from b’111. Is there any specific requirement for driving this signal in simulation?

Under what circumstances should we use hwsync (req_ttype=b’101011) in simulation? Would it affect the performance of the L2 cache if we didn’t use it at all? Thx!

Why msgclr does not set UsesSPR in iuq_fxu_decode.vhdl

Hi,

I find that msgclr does not set the UsesSPR signal in iuq_fxu_decode.vhdl. This causes an error when "mtspr CCR2, 0x20000; msgclr 0" is executed.
In particular, "mtspr CCR2, 0x20000" sets EN_PC to enable the msgclr instruction. However, since msgclr does not set UsesSPR in iuq_fxu_decode, the dependency check fails. Thus, msgclr is executed while EN_PC is still 0, and an error occurs.
Any comments?

Thanks

How to fetch later instructions if a branch instruction is met

The thread fetches the instruction at IU0, but the branch prediction comes out at IU5 due to the slow access to the BHT. If the thread meets a branch instruction, how does it fetch later instructions? Does it wait until the branch prediction result comes out, or does it continue fetching sequential instructions and flush if the prediction is a jump?

Any comments about this?
