
Comments (6)

Dolu1990 commented on June 16, 2024

Hi,

As far as I understand, using Vec means using more LUTs, because all the signals have to be synthesized.

Yes, right, that's not the way to go.

Indeed, using the AXI Stream is a bit slower but uses fewer LUTs, am I right?

Yes, serializing things is a better tradeoff, I think, then eventually buffering things into a Mem (RAM) for later reuse.

from vexriscv.

lk-davidegironi commented on June 16, 2024

Thank you @Dolu1990,
So, moving to the streaming approach, I'm using code taken from here: #53
I'm trying, without success, to make it work with readStreamNonBlocking. I have to investigate the valid and ready signals further.
I'll keep you updated. Any further help is appreciated.

Apb3Axis is connected in the main SoC Scala code; the Apb3Axis class looks like this:

package lk.lib

import spinal.core._
import spinal.lib._
import spinal.lib.bus.amba3.apb.{Apb3, Apb3Config, Apb3SlaveFactory}

case class Apb3Axis(apb3Config: Apb3Config) extends Component {
  val io = new Bundle {
    val apb = slave(Apb3(apb3Config))
    val input = slave(Stream(Bits(32 bits)))
    val output = master(Stream(Bits(32 bits)))
  }

  val ctrl = Apb3SlaveFactory(io.apb)

  // The input stream uses readStreamNonBlocking, which is not working; the commented-out StreamFifo code below does work
  ctrl.readStreamNonBlocking(io.input.queue(128), address = 0)
  //val ififo = StreamFifo(dataType = Bits(32 bits), depth = 128)
  //ififo.io.push << io.input
  //ctrl.read(ififo.io.pop.payload, address = 0);
  //val ififoPopReady = ctrl.drive(ififo.io.pop.ready, address = 4)
  //ctrl.read(ififo.io.pop.valid, address = 8);
  //when(ififo.io.pop.valid) { ififoPopReady := False }


  val wordCount = (1 + widthOf(io.input.payload) - 1) / 32 + 1
  val wordAddressInc = 32 / 8
  val addressHigh = 0 + (2 - 1) * wordAddressInc
  SpinalInfo("Wordcount: " + wordCount)
  SpinalInfo("addressHigh: " + addressHigh)

  // The output stream uses a StreamFifo for now; it still needs to be converted to createAndDriveFlow
  val ofifo = StreamFifo(dataType = Bits(32 bits), depth = 128)
  ofifo.io.pop >> io.output
  ctrl.drive(ofifo.io.push.payload, address = 12)
  val ofifoPushValid = ctrl.drive(ofifo.io.push.valid, address = 16)
  ctrl.read(ofifo.io.push.ready, address = 20)
  when(ofifo.io.push.ready) { ofifoPushValid := False }
  //val writeFlow = ctrl.createAndDriveFlow(Bits(32 bits), address = 0)
  //writeFlow.toStream.stage() >> ofifo.io.push

}

The main C sample code is below:


typedef struct
{
  volatile uint32_t IN_DATA;
  volatile uint32_t IN_READY;
  volatile uint32_t IN_VALID;
  volatile uint32_t OUT_DATA;
  volatile uint32_t OUT_VALID;
  volatile uint32_t OUT_READY;
} AXIS_Reg;
#define AXIS ((AXIS_Reg *)(0xF0060000))

// inside the main function
while (1)
	{
		while (AXIS->IN_VALID == 0)
		{
			asm volatile("");
		}
		AXIS->OUT_DATA = 3 + AXIS->IN_DATA;
		AXIS->OUT_VALID = 0xFFFF;
		while (AXIS->OUT_VALID != 0)
		{
			asm volatile("");
		}
		AXIS->IN_READY = 0xFFFF;
		while (AXIS->IN_READY != 0)
		{
			asm volatile("");
		}
	}

Then the synthesized Verilog top module.
I'm looking at axis_input_payload and axis_output_payload in the logic analyzer;
they work (the output is the input + 3 on each tick) if the input stream is implemented using StreamFifo (the commented-out code in Apb3Axis), but not when using readStreamNonBlocking.


 //tick_1s_tick ticks every 1 second

  reg axis_input_valid;
  wire axis_input_ready;
  reg [31:0] axis_input_payload;
  wire axis_output_valid;
  reg axis_output_ready;
  wire [31:0] axis_output_payload;

    always @(posedge clk)
    begin
        if(tick_1s_tick)
        begin
            axis_input_valid <= 1'b1;
            axis_input_payload <= samplePayload;
        end
        else
        begin
            axis_input_valid <= 1'b0;
        end

        if(axis_output_valid)
        begin
            axis_output_ready <= 1'b1;
        end
        else
        begin
            axis_output_ready <= 1'b0;
        end

    end

    Soc Soc_inst(
        // all the other signals
        .io_axis_input_valid(axis_input_valid),
        .io_axis_input_ready(axis_input_ready),
        .io_axis_input_payload(axis_input_payload),
        .io_axis_output_valid(axis_output_valid),
        .io_axis_output_ready(axis_output_ready),
        .io_axis_output_payload(axis_output_payload)
    );


lk-davidegironi commented on June 16, 2024

I was able to make it work, but I have a performance problem.

First, here is how I made it work using readStreamNonBlocking and createAndDriveFlow, with a 31-bit payload.

Apb3Axis now looks like this:

case class Apb3Axis(apb3Config: Apb3Config) extends Component {
  val io = new Bundle {
    val apb = slave(Apb3(apb3Config))
    val input = slave(Stream(Bits(31 bits)))
    val output = master(Stream(Bits(31 bits)))
  }

  val busCtrl = Apb3SlaveFactory(io.apb)

  val ioinputqueue = io.input.queueLowLatency(128)
  busCtrl.readStreamNonBlocking(
    ioinputqueue,
    address = 0,
    validBitOffset = 31,
    payloadBitOffset = 0
  )

  val ofifo = StreamFifoLowLatency(dataType = Bits(31 bits), depth = 128)
  ofifo.io.pop >> io.output
  val writeFlow = busCtrl.createAndDriveFlow(Bits(31 bits), address = 4)
  writeFlow.toStream.stage() >> ofifo.io.push
}

Software side (roughly as below). Note that I'm using a GPIO output to check the timing on the analyzer.

typedef struct
{
  volatile uint32_t IN_PAYLOAD;
  volatile uint32_t OUT_PAYLOAD;
} AXIS_Reg;
#define AXIS ((AXIS_Reg *)(0xF0060000))

#define IN_PAYLOAD_VALID_MASK 0x80000000
#define IN_PAYLOAD_VALID_SHIFT 31
#define IN_PAYLOAD_DATA_MASK 0x7FFFFFFF
#define IN_PAYLOAD_DATA_SHIFT 0

//
// in main function, main while loop
//
    while (1)
    {
        uint32_t payload = AXIS->IN_PAYLOAD;
        if (((payload & IN_PAYLOAD_VALID_MASK) >> IN_PAYLOAD_VALID_SHIFT) == 1) {
            uint32_t data = (payload & IN_PAYLOAD_DATA_MASK) >> IN_PAYLOAD_DATA_SHIFT;
            if (data == 10) {
                gpioA_setOutputBit(0);
                gpioA_clearOutputBit(0);
            }
            // AXIS->OUT_PAYLOAD = data;
        }
    }

Verilog side

reg axis_input_valid;
  wire axis_input_ready;
  reg [30:0] axis_input_payload;
  wire axis_output_valid;
  reg axis_output_ready;
  wire [30:0] axis_output_payload;
  reg [30:0] axis_output_payload_reg;
  
  reg [30:0] signaldata_reg;
  reg [30:0] signalretdata_reg;
  initial  signaldata_reg = 0; 
  
    always @(posedge clk)
    begin
        // start sending once every second (this will run at 10 kHz in the future)
        if(tick_1s_tick)
        begin
            signaldata_reg <= 1;
        end

        // send 32 data payloads while the input stream is ready
        if(signaldata_reg >= 1 && signaldata_reg <= 32 && axis_input_ready)
        begin
            signaldata_reg <= signaldata_reg + 1;
            axis_input_valid <= 1'b1;
            axis_input_payload <= signaldata_reg;
        end
        else
        begin
            axis_input_valid <= 1'b0;
        end

        // receiving output and moving to data
        axis_output_ready <= 1'b1;
        if(axis_output_valid)
        begin
            signalretdata_reg <= axis_output_payload;
        end
    end

   Soc Soc_inst(
        // all the other signals
        .io_axis_input_valid(axis_input_valid),
        .io_axis_input_ready(axis_input_ready),
        .io_axis_input_payload(axis_input_payload),
        .io_axis_output_valid(axis_output_valid),
        .io_axis_output_ready(axis_output_ready),
        .io_axis_output_payload(axis_output_payload)
    );

The code above also works with StreamFifo and .queue. I've found no difference between those and the LowLatency variants.
I'm running that code on a Briey-based SoC, with a 72 MHz main clock, on a Tang Primer 20K.

In the future I'm going to keep the 31-bit payload, putting a command in the first 7 bits and using the other 24 for data.
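
As an illustration of that layout on the software side, here is a minimal C sketch. It assumes the command sits in the top 7 bits of the 31-bit payload (bits 30..24), the data in bits 23..0, and the valid flag in bit 31, as set by validBitOffset = 31 above. The macro and function names are hypothetical and not part of any existing header.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed word layout as read from the APB register (hypothetical names):
 *   bit  31      valid flag added by readStreamNonBlocking (validBitOffset = 31)
 *   bits 30..24  7-bit command
 *   bits 23..0   24-bit data
 */
#define AXIS_WORD_VALID_MASK 0x80000000u
#define AXIS_WORD_CMD_SHIFT  24
#define AXIS_WORD_CMD_MASK   0x7Fu
#define AXIS_WORD_DATA_MASK  0x00FFFFFFu

/* Returns non-zero when the word read from the register carries a payload. */
static inline int axis_word_valid(uint32_t word)
{
    return (word & AXIS_WORD_VALID_MASK) != 0;
}

/* Extracts the 7-bit command field. */
static inline uint32_t axis_word_cmd(uint32_t word)
{
    return (word >> AXIS_WORD_CMD_SHIFT) & AXIS_WORD_CMD_MASK;
}

/* Extracts the 24-bit data field. */
static inline uint32_t axis_word_data(uint32_t word)
{
    return word & AXIS_WORD_DATA_MASK;
}

/* Builds a 31-bit output payload from a command and a data value.
 * Data wider than 24 bits is truncated. */
static inline uint32_t axis_pack(uint32_t cmd, uint32_t data)
{
    assert(cmd <= AXIS_WORD_CMD_MASK);
    return (cmd << AXIS_WORD_CMD_SHIFT) | (data & AXIS_WORD_DATA_MASK);
}
```

In the main loop this would read one word from AXIS->IN_PAYLOAD, test it with axis_word_valid(), and only then decode the command and data fields.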

My problem is performance. I've tried StreamFifo and .queue instead of queueLowLatency and StreamFifoLowLatency, but it makes no difference.
I've used the GPIO output to measure how long it takes for a signal to be read (or sent). It seems that reading a signal takes many cycles. If you look at the 1024-cycle capture below, you will notice that the GPIO toggle happens at about cycle 530, for data number 10. That means reading a burst of 32 payloads will take almost 1700 cycles, which is too much for my requirements. I hope to send data at 10 kHz, which means I have 7200 cycles per loop to do the (simple) math in my software.
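
The numbers in the paragraph above can be checked with a little arithmetic. The C sketch below only restates the figures from the post (72 MHz clock, 10 kHz rate, cycle 530 at the tenth payload, 32 payloads per burst). The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Clock cycles available in one period of a periodic task. */
static uint32_t cycles_per_period(uint32_t clk_hz, uint32_t rate_hz)
{
    assert(rate_hz != 0);
    return clk_hz / rate_hz;
}

/* Average cost of one read, from an elapsed cycle count and the number
 * of reads completed in that time. */
static uint32_t cycles_per_read(uint32_t elapsed_cycles, uint32_t reads_done)
{
    assert(reads_done != 0);
    return elapsed_cycles / reads_done;
}

/* Total cost of reading a burst of n payloads. */
static uint32_t cycles_for_burst(uint32_t per_read, uint32_t n)
{
    return per_read * n;
}
```

With the values from the post, cycles_per_period(72000000, 10000) gives 7200, cycles_per_read(530, 10) gives 53, and cycles_for_burst(53, 32) gives 1696. So the reads alone would use roughly a quarter of the 7200-cycle budget, assuming the per-read cost stays constant over the burst.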

Using StreamFifo and .queue instead of queueLowLatency and StreamFifoLowLatency makes no difference. Running the SoC at 72 MHz or 12 MHz makes no difference. The FPGA logic, Briey, and the buses (AXI + APB3) all run in the same clock domain.

Am I missing something?

Note for the image: here I'm using a payload that contains a command in the first 7 bits and uses the other 24 for data.

Capture

Writing data back to the output ('AXIS->OUT_PAYLOAD = xxx') makes no difference in timing, which means the SoC-to-FPGA direction is fast. It's just the input that is a little too slow for me.

Capture2

Thanks for your help.


Dolu1990 commented on June 16, 2024

Hi, looking at your simulation, it shows things from time 0, right?
The thing is, the CPU will need a bit of time to reach the while loop.


lk-davidegironi commented on June 16, 2024

Thank you @Dolu1990
The analyzer is triggered on the tick_1s_tick edge, and it shows things from time 0. If you look at signalnum_reg you will see the 32 signals loaded into the StreamFifo in 32 cycles (zoomed view below).
I'm testing the InterruptCtrl, but that will not make a difference for now, because in the while loop I'm always reading.

The other image zooms in on the timing of one load (uint32_t payload = AXIS->IN_PAYLOAD;) plus one unload (AXIS->OUT_PAYLOAD = data;).
It's almost 90 cycles.

Maybe something involving DMA could help?
Sorry for my dumbness, but I've only just entered the FPGA SoC world.

I know I'm asking a lot from this core. My plan is to make this work on VexRiscv (even at a slower speed), because I like this project (portable and customizable); then, when I'm ready, I may move to a hard core (Xilinx ARM), changing the buses of course.

The Verilog below contains the actual payload content (signal number + integer), which should clarify it:

always @(posedge clk)
    begin
        if(tick_1s_tick)
        begin
            signalnum_reg <= 1;
        end
        if(signalnum_reg >= 1 && signalnum_reg <= 32)
        begin

            if(axis_input_ready)
            begin
                signalnum_reg <= signalnum_reg + 1;
                signaldata_reg <= signaldata_reg + 1;
            end
            
            axis_input_valid <= 1'b1;
            axis_input_payload <= {signalnum_reg, signaldata_reg};
        end
        else
        begin
            axis_input_valid <= 1'b0;
        end

        axis_output_ready <= 1'b1;
        if(axis_output_valid)
        begin
            signalretnum_reg <= axis_output_payload[30-:7];
            signalretdata_reg <= axis_output_payload[23:0];
        end

    end

Capture
Capture2


Dolu1990 commented on June 16, 2024

Hmm, one thing to be careful about as well is that on the first attempt you will hit i$ / d$ refills, so to take measurements you really have to run the code more than once and then measure the last execution.

Are your pictures from the very first execution?
Was your code compiled with -O3?
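
One way to follow that advice is to time the same code several times and keep only the last result. The C sketch below is a rough illustration, not code from this thread. The cycle source is passed in as a function pointer because the right one depends on the SoC (a timer peripheral, or a cycle counter CSR if the VexRiscv configuration exposes one). All names are hypothetical, and the sim_* functions are host-side stand-ins that only imitate a slow first run.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t (*cycle_reader_fn)(void); /* returns the current cycle count */
typedef void (*workload_fn)(void);         /* code under measurement */

/* Runs the workload `runs` times and returns the cycle count of the last
 * run only, so that the cache refills of the first run are not counted. */
static uint32_t measure_last_run(cycle_reader_fn read_cycles,
                                 workload_fn work, unsigned runs)
{
    assert(runs > 0);
    uint32_t elapsed = 0;
    for (unsigned i = 0; i < runs; i++) {
        uint32_t start = read_cycles();
        work();
        /* Unsigned subtraction stays correct across counter wraparound. */
        elapsed = read_cycles() - start;
    }
    return elapsed;
}

/* Host-side stand-ins: the first call costs 100 "cycles" (cold caches),
 * every later call costs 20 (warm caches). */
static uint32_t sim_cycles = 0;
static unsigned sim_calls = 0;

static uint32_t sim_read_cycles(void)
{
    return sim_cycles;
}

static void sim_workload(void)
{
    sim_cycles += (sim_calls++ == 0) ? 100u : 20u;
}
```

On the target, sim_read_cycles and sim_workload would be replaced by the real cycle source and by the body of the read loop being measured.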
