lukego/easynic

EasyNIC: an easy-to-use host interface for network cards
Multiple CPU cores should be able to read and write on the EasyNIC at the same time.
Ideally these cores should all share the same transmit/receive interface and efficiently distribute traffic using an algorithm implemented on the CPU. This will likely require some accommodations to be made in the transmit/receive interface to avoid expensive synchronization between the cores and possibly some special rules for memory alignment/arenas/etc as floated on #5.
(The alternative of having the NIC switch packets between multiple transmit/receive interfaces would be nice but it may be prohibitively complex to implement a sufficiently general dispatching mechanism. The trouble with features like RSS is that they only cover special cases of protocols and hashing rules and so on.)
PCIe bandwidth is more scarce than it used to be. The host interface has to be designed to use PCIe bandwidth efficiently. This will probably mean organizing DMA into fewer, longer contiguous transfers rather than many smaller, scattered ones.
Why is PCIe bandwidth more scarce? Because Ethernet bandwidth has been increasing in powers of 10 while PCIe bandwidth has been increasing in powers of 2. The basic numerical relationship changed with the transition from 10G/40G to 25G/100G:
Ethernet bandwidth | PCIe bandwidth | PCIe-to-Ethernet ratio |
---|---|---|
10G | 16G (PCIe 2.0 x4) | 1.6x |
40G | 64G (PCIe 3.0 x8) | 1.6x |
25G | 32G (PCIe 3.0 x4) | 1.28x |
50G | 64G (PCIe 3.0 x8) | 1.28x |
100G | 128G (PCIe 3.0 x16) | 1.28x |
200G | 256G (PCIe 4.0 x16) | 1.28x |
In the good old days of 10G/40G the PCIe links had 60% extra capacity for overhead such as transferring DMA descriptors and for the PCIe protocols themselves. These modern times of 25G/100G are leaner and only 28% is available. This means that we must treat PCIe bandwidth as a scarce resource because any wastage is likely to actually impact operational performance.
Just came upon the transcript of a recent Recode podcast interviewing Hennessy and Patterson about their work on RISC. They did a great job of showing that CPU instruction sets don't have to be complex to be efficient. Inspiring stuff! They are sharing the Turing award for that work this year.
Suppose we wanted to build a 10G EasyNIC. Is there suitable hardware available?
Surprisingly (to me), the answer seems to be yes. Here is a suggestion from a Twitter conversation with @daveshah1 and others:
An EasyNIC 10G would pair a Lattice ECP5-5G FPGA with an external 10G PHY.

The FPGA would use programmable logic to implement the PCIe endpoint, the Ethernet MAC, and the EasyNIC driver interface. The I/O interface from FPGA to PCIe would be the 4 x 5Gbps SERDES on the ECP5-5G. The I/O interface from FPGA to the 10G PHY would be 32x311Mbps using ordinary I/O pins on the FPGA (which are known to support twice this bit rate for DDR3 memory).
The cost of the FPGA seems to be ~$50 and the PHY ~$20. There are no license fees for the developer tools or hardware features thanks to full ECP5 support in the Yosys open source hardware toolchain.
The NIC might be able to have multiple 10G ports each using separate silicon and connecting to a "bifurcated" PCIe slot.
This is an exciting possibility. Overall this NIC would seem comparable to the Intel 82599. This would make it a practical NIC for many serious applications.
If we decide to move forward with this approach then we could start by developing a 1G NIC using an off-the-shelf ECP5 Versa development board that costs ~$200 and includes PCIe and 10/100/1000 RJ45 connectivity.
Hi,

I have been reading through the discussions; they are pretty cool! The facts in the README.md and the discussions look sensible. Conceptually, though, I cannot quite digest the "why."

For instance, it is stated that the success of RISC-V inspires this project. How? Is there more information we could follow?

I'm not experienced in device drivers, but I have played with the IXGBE driver and `netdevice.h` in the Linux kernel for a project, and I have reviewed the codebase around that area.

I find this project interesting! As data-center Ethernet moves toward 100G and 400G, it becomes intriguing how system software and hardware should behave to manage such rates. I have read a paper that seems to have some synergy with this project (same problem, different perspective, different solution), though I know that paper is out of scope for this repo.

I believe that if there were some illustrative documentation, links, or papers to follow and study, then not only I but more people could get involved in such interesting projects.

Thanks,
Alireza
The initial Transmit design (#5) expects the NIC to fetch a variable-length buffer with DMA. This seems likely to be awkward and inefficient on the silicon side, since the device will need to speculatively read ahead and scan for the terminator.

Consider adding a register `TX_BLOCK_SIZE` where the host writes the exact size of the next block before its address is written to `TX_BLOCK_SEND`. This way the device can always fetch exactly the memory containing the block. (The block would not need the zero-length terminator anymore, either.)
We will need a realistic and accessible benchmarking setup to validate the design. For example we need to be able to experimentally work out what special accommodations the DMA design needs to make to the CPU regarding alignment etc (see #9).
How to do it? Here are a few ideas, ordered from hardest/most realistic downwards:
The last one seems very convenient. Has any research been done (e.g. down to the PMU level) on how well data in the L3 cache (e.g. an array too large to fit into L2) works as a proxy for freshly DMA'd data on x86? (Maybe you have looked at this, @emmericp?)
One possible consequence of optimizing PCIe efficiency (#3) is to avoid using individual descriptors for each packet that is transmitted and received.
If consecutive packets are streamed to and from large memory buffers then two bytes of per-packet metadata may be sufficient e.g. to indicate the packet length and the Ethernet FCS validity.
This would be considerably more streamlined than the Intel and Mellanox approaches that typically require between 16 and 48 bytes of metadata for each packet. These scatter-gather designs are burdened with transferring the 64-bit address of each packet's individual buffer(s) and often with other non-essential metadata.
Has to be really easy to interface with on a driver, and really easy to implement in silicon, and really efficient with PCIe bandwidth. (Like the Transmit interface #1.)
On the weekend I spoke at FOSDEM about "How a ConnectX device driver works." This is a 10-minute talk that could alternatively be titled "How close can you come to using ConnectX as an off-the-shelf EasyNIC?" It covers the theme of trying to keep complicated hardware details from screwing up software.
Suggestion from @blitz via Twitter: Consider avoiding the synchronous read across PCIe from the transmit path (`TX_BLOCK_AVAIL`) for the sake of efficiency. The read does not have to be made frequently, perhaps only once every thousand packets, but it will be slow.
It could make sense to have transmit/receive/etc. state DMA'd into host memory at regular intervals so that it can be read from ordinary RAM rather than fetched synchronously from the device.
Suggestion from @blitz via Twitter: Avoid non-idempotent streaming writes to the same register, such as the way the host writes multiple addresses back-to-back into `TX_BLOCK_SEND` and requires the device to process all of them in FIFO order. This is awkward to handle on the hardware side. What would be a better approach?