lukego/easynic

EasyNIC: an easy-to-use host interface for network cards
Multiple CPU cores should be able to read and write on the EasyNIC at the same time.
Ideally these cores should all share the same transmit/receive interface and efficiently distribute traffic using an algorithm implemented on the CPU. This will likely require some accommodations to be made in the transmit/receive interface to avoid expensive synchronization between the cores and possibly some special rules for memory alignment/arenas/etc as floated on #5.
(The alternative of having the NIC switch packets between multiple transmit/receive interfaces would be nice but it may be prohibitively complex to implement a sufficiently general dispatching mechanism. The trouble with features like RSS is that they only cover special cases of protocols and hashing rules and so on.)
PCIe bandwidth is more scarce than it used to be. The host interface has to be designed to use PCIe bandwidth efficiently. This will probably mean organizing DMA into fewer, longer contiguous transfers rather than many smaller, scattered ones.
Why is PCIe bandwidth more scarce? Because Ethernet bandwidth has been increasing in powers of 10 while PCIe bandwidth has been increasing in powers of 2. The basic numerical relationship changed with the transition from 10G/40G to 25G/100G:
Ethernet bandwidth | PCIe bandwidth | PCIe-to-Ethernet ratio |
---|---|---|
10G | 16G (PCIe 2.0 x4) | 1.6x |
40G | 64G (PCIe 3.0 x8) | 1.6x |
25G | 32G (PCIe 3.0 x4) | 1.28x |
50G | 64G (PCIe 3.0 x8) | 1.28x |
100G | 128G (PCIe 3.0 x16) | 1.28x |
200G | 256G (PCIe 4.0 x16) | 1.28x |
In the good old days of 10G/40G the PCIe links had 60% extra capacity for overhead such as transferring DMA descriptors and for the PCIe protocols themselves. These modern times of 25G/100G are leaner and only 28% is available. This means that we must treat PCIe bandwidth as a scarce resource because any wastage is likely to actually impact operational performance.
Just came upon the transcript of a recent Recode podcast interviewing Hennessy and Patterson about their work on RISC. They did a great job of showing that CPU instruction sets don't have to be complex to be efficient. Inspiring stuff! They are sharing the Turing award for that work this year.
Suppose we wanted to build a 10G EasyNIC. Is there suitable hardware available?
Surprisingly (to me), the answer seems to be yes. Here is a suggestion from a Twitter conversation with @daveshah1 and others:
An EasyNIC 10G would pair a Lattice ECP5-5G FPGA with an external 10G PHY.

The FPGA would use programmable logic to implement the PCIe endpoint, the Ethernet MAC, and the EasyNIC driver interface. The I/O interface from FPGA to PCIe would be the 4 x 5Gbps SERDES on the ECP5-5G. The I/O interface from FPGA to the 10G PHY would be 32x311Mbps using ordinary I/O pins on the FPGA (which are known to support twice this bit rate for DDR3 memory).
The cost of the FPGA seems to be ~$50 and the PHY ~$20. There are no license fees for the developer tools or hardware features thanks to full ECP5 support in the Yosys open source hardware toolchain.
The NIC might be able to have multiple 10G ports each using separate silicon and connecting to a "bifurcated" PCIe slot.
This is an exciting possibility. Overall this NIC would seem comparable to the Intel 82599. This would make it a practical NIC for many serious applications.
If we decide to move forward with this approach then we could start by developing a 1G NIC using an off-the-shelf ECP5 Versa development board that costs ~$200 and includes PCIe and 10/100/1000 RJ45 connectivity.
Hi,

I have been reading through the discussions; they are pretty cool! The facts in the README.md and the discussions look sensible. Conceptually, though, I cannot quite digest the "why."

For instance, it is stated that the success of RISC-V inspires this project. How? Is there more information we could follow?

I'm not experienced in device drivers, but I have played with the IXGBE driver and `netdevice.h` in the Linux kernel for a project, and I have reviewed the codebase around that area.

I find this project interesting! As data-center Ethernet moves toward 100G and 400G, it becomes intriguing how system software and hardware should behave to manage such rates. I have read a paper that seems to have some synergy with this project (same problem, different perspective, different solution), though I know that paper is out of scope for this repo.

I believe that if there were some illustrative documentation, links, or papers to follow and study, then not only I but more people could get involved in such interesting projects.

Thanks,
Alireza
The initial Transmit design (#5) expects the NIC to fetch a variable-length buffer with DMA. This seems likely to be awkward and inefficient on the silicon side, since the device will need to speculatively read ahead and scan for the terminator.

Consider adding a register `TX_BLOCK_SIZE` where the host writes the exact size of the next block before its address is written to `TX_BLOCK_SEND`. This way the device can always fetch exactly the memory containing the block. (The block would not need the zero-length terminator anymore, either.)
We will need a realistic and accessible benchmarking setup to validate the design. For example we need to be able to experimentally work out what special accommodations the DMA design needs to make to the CPU regarding alignment etc (see #9).
How to do it? Here are a few ideas, ordered from hardest/most realistic downwards:
The last one seems very convenient. Has any research been done (e.g. down to the PMU level) on how well data in the L3 cache (e.g. an array too large to fit into L2) works as a proxy for freshly DMA'd data on x86? (Maybe you have looked at this, @emmericp?)
One possible consequence of optimizing PCIe efficiency (#3) is to avoid using individual descriptors for each packet that is transmitted and received.
If consecutive packets are streamed to and from large memory buffers then two bytes of per-packet metadata may be sufficient e.g. to indicate the packet length and the Ethernet FCS validity.
This would be considerably more streamlined than the Intel and Mellanox approaches that typically require between 16 and 48 bytes of metadata for each packet. These scatter-gather designs are burdened with transferring the 64-bit address of each packet's individual buffer(s) and often with other non-essential metadata.
Has to be really easy to interface with on a driver, and really easy to implement in silicon, and really efficient with PCIe bandwidth. (Like the Transmit interface #1.)
On the weekend I spoke at FOSDEM about "How a ConnectX device driver works." This is a 10-minute talk that could alternatively be titled "How close can you come to using ConnectX as an off-the-shelf EasyNIC?" It covers the theme of trying to keep complicated hardware details from screwing up software.
Suggestion from @blitz via Twitter: Consider avoiding the synchronous read across PCIe from the transmit path (`TX_BLOCK_AVAIL`) for the sake of efficiency. The read does not have to be made frequently, perhaps only once every thousand packets, but it will be slow.
It could make sense to have transmit/receive/etc. state DMA'd into host memory at regular intervals so that it can be read from ordinary RAM rather than fetched synchronously from the device.
Suggestion from @blitz via Twitter: Avoid non-idempotent streaming writes to the same register, such as the way the host writes multiple addresses back-to-back into `TX_BLOCK_SEND` and requires the device to process all of them in FIFO order. This is awkward to handle on the hardware side. What would be a better approach?