The DNPCIE_400G_VUP_HBM_LL is a PCIe-based FPGA board designed to minimize input-to-output processing latency on 10, 40, or 100 Gb Ethernet packets. The primary application is low-cost, low-latency, high-throughput trading without CPU intervention. Every variable that affects input-to-output latency has been analyzed and minimized. Raw 10/40/100 GbE packets can be analyzed and acted upon without a MAC, interrupts, or an operating system adding delay to the process. This configurable hardware computing platform can achieve the theoretical minimum Ethernet packet processing latency.
1. The FPGA - Xilinx Virtex UltraScale+/UltraScale with HBM
We use a single FPGA from the Xilinx Virtex UltraScale+ family in the H2104 package. This package supports 416 I/Os, the majority of which are utilized. Most are dedicated to off-chip DDR4. The Virtex UltraScale+ FPGA contains high-speed transceivers capable of 25 Gb/s. Sixteen of these transceivers are used for a 16-lane GEN3/4 PCIe interface. Four sets of 4 GTY transceivers are connected to QSFP28 sockets for 40/100 GbE (or 4 channels of 10 GbE per socket). Assuming a VU35P, sixteen additional GTY transceivers are attached to Samtec Firefly connectors and can be used for high-speed board-to-board communication using cables, or for more 10/40/100 GbE ports.
Two possible UltraScale+ FPGAs can be stuffed: VU35P and VU33P. These FPGAs come in a variety of speed grades (-3, -2/2L, -1), with -3 the fastest. Table 1 depicts the resources of the FPGAs with the Xilinx marketing exaggerations excised. These are large FPGAs: the VU35P is capable of handling ~10M ASIC gates of logic, and remember that the internal FPGA memory and multiplier blocks are not part of this number. UltraScale+ adds large blocks of internal 4k x 72 RAM (UltraRAM). Features of the Xilinx UltraScale+ FPGAs include efficient, dual-register 6-input look-up table (LUT) logic, 36 Kb (2 x 18 Kb) block RAMs, and third-generation DSP slices (each with a 27 x 18 multiplier and a 48-bit accumulator). Floating-point functions can be implemented using these DSP slices.
2. High Bandwidth Memory (HBM)
Compared to the other UltraScale+ FPGAs, the big difference in the VU35P/VU33P is the addition of 8GB of High Bandwidth Memory (HBM). This memory is divided into two separate banks of 4GB each, each with a 1024-bit DQ bus. The HBM can be split into several separate memories, and IP from Xilinx enables 16 AXI3 ports with a 256-bit interface. The data bandwidth can be as much as 20x higher than that of external DDR4.
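To put the "20x" figure in perspective, here is a small back-of-the-envelope sketch. It assumes the comparison baseline is the single 64-bit PC4-2400 DDR4 bank on this card; the datasheet does not say which DDR4 configuration the 20x refers to, so treat the result as an illustrative upper bound rather than a specification. The function name is ours.

```python
# Peak theoretical bandwidth arithmetic. The DDR4 figures come from this
# datasheet; using them as the baseline for "20x" is an ASSUMPTION.

DDR4_BUS_BITS = 64        # single 1G x 64 bank
DDR4_RATE_MBPS = 2400     # PC4-2400: 2400 Mb/s per data pin
HBM_SPEEDUP_MAX = 20      # "as much as 20x"


def peak_bandwidth_gbytes(bus_bits: int, rate_mbps_per_pin: float) -> float:
    """Peak bandwidth in GB/s (10^9 bytes/s) for a parallel data bus."""
    return bus_bits * rate_mbps_per_pin * 1e6 / 8 / 1e9


ddr4_gbs = peak_bandwidth_gbytes(DDR4_BUS_BITS, DDR4_RATE_MBPS)
hbm_upper_gbs = HBM_SPEEDUP_MAX * ddr4_gbs

print(f"DDR4, 64-bit @ 2400 Mb/s: {ddr4_gbs:.1f} GB/s")
print(f"20x that baseline:        {hbm_upper_gbs:.0f} GB/s")
```

Under that assumption the DDR4 bank peaks at 19.2 GB/s, which would put the HBM ceiling in the neighborhood of 384 GB/s. Real sustained numbers depend on access patterns and on how the AXI ports are used.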
3. Low Latency Network Interface
4 channels of 40/100 GbE or a mix of 10 GbE via quad QSFP28
The Virtex UltraScale+ FPGA has transceivers capable of 25 Gb/s per lane. The physical interface (PHY) is handled using quad QSFP28 modules for 40/100 GbE. With the proper cable, each socket can be split into 4 separate channels of 10 GbE. Raw Ethernet packets can be accessed directly by bypassing the MAC.
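For a sense of the time scales involved at these line rates, the sketch below computes how long one frame occupies the wire. This is plain arithmetic on standard Ethernet framing (8 bytes of preamble and start-of-frame delimiter ahead of the frame); it says nothing about the latency of any particular FPGA design, and the function and constant names are ours.

```python
# Serialization time of one Ethernet frame at the line rates this board
# supports. Illustrative arithmetic only, not a measurement of the board.

PREAMBLE_SFD_BYTES = 8          # 7-byte preamble + 1-byte start-of-frame delimiter
MIN_FRAME_BYTES = 64            # minimum Ethernet frame, including FCS
LINE_RATES_GBPS = (10, 40, 100)


def wire_time_ns(frame_bytes: int, rate_gbps: float) -> float:
    """Nanoseconds needed to clock preamble + frame onto the wire."""
    bits = (frame_bytes + PREAMBLE_SFD_BYTES) * 8
    return bits / rate_gbps     # bits / (Gbit/s) = ns


for rate in LINE_RATES_GBPS:
    print(f"{rate:>3} GbE: {wire_time_ns(MIN_FRAME_BYTES, rate):6.2f} ns")
```

A minimum-size frame takes 57.6 ns at 10 GbE, 14.4 ns at 40 GbE, and 5.76 ns at 100 GbE. Logic that acts on header fields as they arrive can start working before the full frame has been received.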
4. External Memory
DDR4 - 16GB of local bulk memory
PC4-2400 DDR4 chips are mounted on the card, providing a single 64-bit bank of DDR4 memory configured as 1G x 64. This memory bank is tested at the maximum FPGA I/O frequency: 1200 MHz (2400 Mb/s with DDR).
To minimize data synchronization across clock boundaries, it probably makes sense to clock this DDR4 interface at a 7x multiple of the base Ethernet frequency of 156.25 MHz, which is 1093.75 MHz. A 7x phase-synchronous clock can be easily generated internal to the FPGA, allowing zero-latency synchronous data transfers between the Ethernet packet receiving logic and the DDR4 memory controller. The DDR4 controller can be optimized in any way you choose. We, of course, provide several Verilog examples at no charge that you are welcome to use. All functions of the DDR4 DRAM can be exploited and optimized. Up to 8 banks can be open at once. Timing variables such as CAS latency and precharge can be tailored to the minimum given your operating frequency and the timing specification of the exact DDR4 memory utilized. The only real limitation is the amount of time and effort spent customizing the DDR4 memory controller to your needs.
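The choice of 7x falls out of simple arithmetic: it is the largest integer multiple of 156.25 MHz that stays at or below the 1200 MHz DDR4 clock. A minimal sketch of that calculation, with names of our own choosing:

```python
# Largest integer multiple of the Ethernet base clock that fits under the
# DDR4 clock limit. Fractions keep the 156.25 MHz arithmetic exact.
from fractions import Fraction

ETH_BASE_MHZ = Fraction(15625, 100)   # 156.25 MHz
DDR4_MAX_MHZ = 1200                   # PC4-2400 clock frequency


def largest_integer_multiple(base_mhz: Fraction, limit_mhz: int):
    """Return (n, n * base_mhz) for the largest n with n * base_mhz <= limit_mhz."""
    n = int(limit_mhz // base_mhz)
    return n, base_mhz * n


multiplier, ddr4_clock_mhz = largest_integer_multiple(ETH_BASE_MHZ, DDR4_MAX_MHZ)
data_rate_mbps = 2 * ddr4_clock_mhz   # double data rate: two transfers per clock

print(multiplier, float(ddr4_clock_mhz), float(data_rate_mbps))
```

This yields a multiplier of 7, a 1093.75 MHz clock, and 2187.5 Mb/s per pin. An 8x clock would be 1250 MHz, which exceeds the 1200 MHz limit.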
5. PCIe - Customizable 16-lane, GEN3/4 PCI Express
PCIe is connected directly to the FPGA via 16 lanes of GTY transceivers. The interface is GEN2, GEN3, and GEN4 capable. We ship GEN3 PCIe IP that is a full-function, fixed, 16-lane master/target. To gain access to the PCIe interface, this IP must be integrated with your application. The Dini Group PCIe IP provides a flexible interface that allows the user access to multiple DMA engines, scratchpad memories, interrupts, and other endpoint-related functions to maximize performance while utilizing minimal FPGA resources. Drivers (required), with 'C' source, for several operating systems are included at no charge.
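As a rough guide to what these link configurations deliver, the sketch below computes raw link bandwidth from the standard PCIe signaling rates (8 GT/s for GEN3, 16 GT/s for GEN4, both with 128b/130b encoding). It ignores packet and protocol overhead, so achievable DMA throughput will be lower. The names are ours and are not part of the Dini Group IP.

```python
# Raw PCIe link bandwidth per direction, before TLP/DLLP protocol overhead.

ENCODING_EFFICIENCY = 128 / 130               # 128b/130b line encoding
TRANSFER_RATE_GT = {"GEN3": 8.0, "GEN4": 16.0}


def pcie_raw_gbytes_per_s(gen: str, lanes: int) -> float:
    """Raw one-direction bandwidth in GB/s for a given generation and width."""
    return TRANSFER_RATE_GT[gen] * ENCODING_EFFICIENCY * lanes / 8


for gen, lanes in (("GEN3", 16), ("GEN4", 8), ("GEN4", 16)):
    print(f"{gen} x{lanes:<2}: {pcie_raw_gbytes_per_s(gen, lanes):.2f} GB/s")
```

GEN3 x16 and GEN4 x8 come out identical at about 15.75 GB/s, so running GEN4 on 8 lanes gives up no raw bandwidth relative to the shipped 16-lane GEN3 IP.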
6. How Everything Works...
With direct data feeds such as NASDAQ ITCH and OUCH, the DNPCIE_400G_VUP_HBM_LL contains all of the basic functions required to minimize the amount of time it takes to receive Ethernet packets, process them, and respond deterministically. By using the FPGA to process Ethernet packets, the processor and operating system are removed from the critical path, and traditional sources of latency such as interrupts and context switching no longer hinder performance. Not a single clock cycle is wasted. For algorithms requiring processing, FPGA resources can be hard coded to perform the task. This includes real-time Monte Carlo analysis and floating-point computation.
• Quad QSFP28 sockets. Each socket can be:
  ◦ 4 ports of 10 GbE, or
  ◦ 1 port of 40 GbE, or
  ◦ 1 port of 100 GbE (UltraScale+ only)
• 4 separate Samtec Firefly connectors for MTP
  ◦ 4 GTY lanes per connector
  ◦ Additional 10/40/100 GbE ports or board-to-board connections
• Hosted in a 16-lane GEN3/GEN4 PCIe slot (GEN4 with 8 lanes)
  ◦ Compatible with Xilinx PCI Express Solutions
  ◦ Compatible with Northwest Logic PCI Express Solutions
  ◦ Full-height, GPU-length PCIe card
• Fully compatible with our optional TCP Offload Engine (TOE/TOE128)
• Optional FIX board support package (DN_FBSP)
• Functioning reference design with:
  ◦ 10 GbE/40 GbE/100 GbE MAC
  ◦ TCP/IP Offload Engine (TOE)
    ▪ Up to 128 sessions
  ◦ FIX protocol parser
  ◦ PCIe Interface (16-lane, GEN3)
  ◦ QDRII+ Controller
  ◦ DDR4 Controller
• Xilinx Virtex UltraScale+ FPGA
  ◦ XCVU35P or XCVU33P (H2104)
  ◦ With VU35P:
    ▪ 10M ASIC gates (ASIC measure)
    ▪ 871k flip-flop/6-input LUTs (1.7M total FFs)
    ▪ 29 Mbytes total FPGA block memory
    ▪ 8GB HBM memory
    ▪ 5,952 multipliers: 27 x 18
• DDR4 Memory (16GB total) in 5 separate blocks
  ◦ 4 blocks: 1G x 16
  ◦ 1 block: 1G x 64
  ◦ 1200 MHz operation, PC4-2400
  ◦ DDR4 interface compatible with Vivado MIG
• QDRII+ configured as 1M x 18
  ◦ 1200 Mb/s (600 MHz)
• SMBus-based thermal management
• 5 bits of 1.8V general purpose I/O (GPIO)
• Full support for embedded logic analyzers via JTAG interface
  ◦ ChipScope Integrated Logic Analyzer (ILA), Exostiv, and other third-party debug solutions
• Five FPGA-controlled LEDs
  ◦ 1 RGB tri-color LED piped to front panel
  ◦ 4 green LEDs on-board
  ◦ Enough debug LEDs to illuminate virtually nothing.
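
The external memory figures in the feature list can be cross-checked with a few lines of arithmetic. The sketch below assumes that "1G" in "1G x 16" means 2^30 locations and that the quoted data rates are per pin with two transfers per clock; the function names are ours.

```python
# Cross-check of the DDR4 capacity and the double-data-rate figures.

GI = 2**30   # ASSUMPTION: "1G" means 2**30 memory locations

# (number of blocks, depth in locations, width in bits), from the feature list
DDR4_BLOCKS = [
    (4, GI, 16),   # 4 blocks: 1G x 16
    (1, GI, 64),   # 1 block:  1G x 64
]


def capacity_bytes(count: int, depth: int, width_bits: int) -> int:
    """Total bytes stored by `count` blocks of depth x width_bits."""
    return count * depth * width_bits // 8


def ddr_rate_mbps(clock_mhz: int) -> int:
    """Per-pin data rate when data moves on both clock edges."""
    return 2 * clock_mhz


total_gib = sum(capacity_bytes(*block) for block in DDR4_BLOCKS) / GI

print(total_gib)             # DDR4 total, in GiB
print(ddr_rate_mbps(1200))   # DDR4:   1200 MHz clock
print(ddr_rate_mbps(600))    # QDRII+: 600 MHz clock
```

The four x16 blocks and the single x64 block contribute 8 GiB each, which matches the 16GB total. The 1200 MHz and 600 MHz clocks correspond to the 2400 Mb/s and 1200 Mb/s per-pin rates quoted for DDR4 and QDRII+.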