This product is intended for FPGA-based algorithmic acceleration tasks that require access to large amounts of local memory. The DNVUF2_HPC_PCIe hosts two Xilinx FPGAs from the UltraScale and UltraScale+ families. Each FPGA has multiple banks of high-performance DDR4 memory. Data movement to/from the FPGAs is accomplished via an 8-lane, GEN3 PCIe interface. Each of the two FPGAs (A and B in the block diagram) has six separate 1G x 16 DDR4 (16 Gb) memories and a bank of 1G x 64 DDR4.
1. The FPGA - Xilinx Virtex UltraScale+/UltraScale with HBM
We use a single FPGA from the Xilinx Virtex UltraScale+ family in the H2104 package. This package supports 416 I/Os, the majority of which are utilized. Most are dedicated to off-chip DDR4. The Virtex UltraScale+ FPGA contains high-speed transceivers capable of 25 Gb/s. Sixteen of these transceivers are used for a 16-lane GEN3/4 PCIe interface. Four sets of 4 GTY transceivers are connected to QSFP28 sockets for 40/100 GbE Ethernet (or 4 channels of 10 GbE). Assuming a VU35P, sixteen additional GTY transceivers are attached to Samtec Firefly connectors and can be used for high-speed board-to-board communication using cables, or for more 10/40/100 GbE ports.
Two possible UltraScale+ FPGAs can be stuffed: VU35P and VU33P. These FPGAs come in a variety of speed grades (-3, -2/2L, -1), with -3 the fastest. Table 1 depicts the resources of the FPGAs with the Xilinx marketing exaggerations excised. These are large FPGAs: the VU35P is capable of handling ~10M ASIC gates of logic, and remember that the internal FPGA memory and multiplier blocks are not part of this number. UltraScale+ adds large blocks of internal 4k x 72 RAM (UltraRAM). Features of the Xilinx UltraScale+ FPGAs include efficient, dual-register 6-input look-up table (LUT) logic, 36 Kb (2 x 18 Kb) block RAMs, and third-generation DSP slices (each includes a 27 x 18 multiplier and a 48-bit accumulator). Floating point functions can be implemented using these DSP slices.
2. High Bandwidth Memory (HBM)
Compared to the other UltraScale+ FPGAs, the big difference in the VU35P/VU33P is the addition of 8GB of High Bandwidth Memory (HBM). This memory is divided into two separate banks of 4GB each, with a 1024-bit DQ bus. The HBMs can be split into several separate memories, and IP from Xilinx enables 16 AXI3 ports with a 256-bit interface. The data bandwidth can be as much as 20x that of external DDR4.
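To put the 20x claim in absolute terms, the short Python sketch below works out the peak rate of one 64-bit DDR4 bank at 2400 Mb/s per pin (the PC4-2400 figure quoted for this card's external memory) and scales it by 20. The function name is hypothetical, the numbers are theoretical peaks only, and the result is an upper bound implied by the text rather than a measured or guaranteed HBM figure.

```python
def peak_bandwidth_gbytes(bus_width_bits: int, mbits_per_pin: float) -> float:
    """Peak transfer rate in GB/s (10^9 bytes/s) of a parallel memory bus.

    Hypothetical helper for illustration: bus width in bits times the
    per-pin data rate in Mb/s, converted from megabits to gigabytes.
    """
    return bus_width_bits * mbits_per_pin / 8 / 1000


# One 64-bit DDR4 bank at 2400 Mb/s per pin (PC4-2400).
ddr4 = peak_bandwidth_gbytes(64, 2400)

# "As much as 20x" external DDR4, taken at face value as an upper bound.
hbm_upper_bound = 20 * ddr4

print(f"DDR4 64-bit bank peak: {ddr4:.1f} GB/s")            # 19.2 GB/s
print(f"20x that figure:       {hbm_upper_bound:.0f} GB/s")  # 384 GB/s
```

Sustained bandwidth in a real design will be lower and depends on access pattern and controller configuration.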
3. Low Latency Network Interface
4 channels of 40/100 GbE or a mix of 10 GbE via quad QSFP28
The Virtex UltraScale+ FPGA has transceivers capable of 25 GbE. The physical interface (PHY) is handled using dual QSFP28 modules for 40/100 GbE. With the proper cable this can be split into 4 separate channels of 10 GbE. Raw Ethernet packets can be accessed directly by bypassing the MAC.
4. External Memory
DDR4 - 16GB of local bulk memory
PC4-2400 DDR4 chips are mounted on the card, providing a single 64-bit bank of DDR4 memory configured as 1G x 64. This memory bank is tested at the maximum FPGA I/O frequency: 1200 MHz (2400 Mb/s with DDR).
To minimize data synchronization across clock boundaries, it probably makes sense to clock this DDR4 interface at a 7x multiple of the base Ethernet frequency of 156.25 MHz, which is 1093.75 MHz. A 7x phase-synchronous clock can be easily generated internal to the FPGA, allowing zero-latency synchronous data transfers between the Ethernet packet receiving logic and the DDR4 memory controller. The DDR4 controller can be optimized in any way you choose. We, of course, provide several Verilog examples at no charge that you are welcome to use. All functions of the DDR4 DRAM can be exploited and optimized. Up to 8 banks can be open at once. Timing variables such as CAS latency and precharge can be tailored to the minimum given your operating frequency and the timing specification of the exact DDR4 memory utilized. The only real limitation is the amount of time and effort spent customizing the DDR4 memory controller to your needs.
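The choice of 7x can be checked with a few lines of arithmetic. The Python sketch below (names are illustrative, not part of any Dini Group deliverable) finds the largest integer multiple of the 156.25 MHz Ethernet base clock that stays within the 1200 MHz maximum quoted above, and reports the resulting DDR4 clock and per-pin data rate.

```python
ETH_BASE_MHZ = 156.25        # base Ethernet frequency from the text
MAX_DDR4_CLOCK_MHZ = 1200.0  # maximum tested DDR4 clock from the text


def ddr4_clock_for_multiple(multiple: int) -> float:
    """DDR4 clock in MHz when run at an integer multiple of the Ethernet clock."""
    return multiple * ETH_BASE_MHZ


def largest_usable_multiple(limit_mhz: float) -> int:
    """Largest integer multiple of the Ethernet clock not exceeding limit_mhz."""
    return int(limit_mhz // ETH_BASE_MHZ)


multiple = largest_usable_multiple(MAX_DDR4_CLOCK_MHZ)
clock_mhz = ddr4_clock_for_multiple(multiple)
data_rate_mbps = 2 * clock_mhz  # DDR: two transfers per clock cycle

print(f"multiple:  {multiple}x")                      # 7x
print(f"clock:     {clock_mhz} MHz")                  # 1093.75 MHz
print(f"data rate: {data_rate_mbps} Mb/s per pin")    # 2187.5 Mb/s per pin
```

An 8x multiple would give 1250 MHz, which exceeds the 1200 MHz limit, so 7x is the highest phase-synchronous ratio available under these assumptions.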
5. PCIe - Customizable 16-lane, GEN3/4 PCI Express
PCIe is connected directly to the FPGA via 16 lanes of GTY transceivers. The interface is fully GEN2, GEN3, and GEN4 capable. We ship GEN3 PCIe IP that is a full-function, fixed, 16-lane master/target. To gain access to the PCIe interface, this IP must be integrated with your application. The Dini Group PCIe IP provides a flexible interface that allows the user access to multiple DMA engines, scratchpad memories, interrupts, and other endpoint-related functions to maximize performance while utilizing minimal FPGA resources. Drivers (required), with 'C' source, are included at no charge for several operating systems.
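For sizing DMA traffic, it can help to know the theoretical ceiling of a 16-lane link. The Python sketch below assumes the standard PCIe signaling rates (8 GT/s per lane for GEN3, 16 GT/s for GEN4) and 128b/130b line encoding; the function and constant names are hypothetical. The result is per direction and excludes packet and protocol overhead, so achievable DMA throughput will be lower.

```python
# Per-lane signaling rate in GT/s for each PCIe generation (standard values).
GT_PER_S = {3: 8.0, 4: 16.0}

# GEN3 and GEN4 both use 128b/130b line encoding.
ENCODING_EFFICIENCY = 128 / 130


def pcie_gbytes_per_s(gen: int, lanes: int) -> float:
    """Theoretical one-direction link bandwidth in GB/s, before protocol overhead."""
    return GT_PER_S[gen] * ENCODING_EFFICIENCY * lanes / 8


for gen in (3, 4):
    print(f"GEN{gen} x16: {pcie_gbytes_per_s(gen, 16):.2f} GB/s per direction")
# GEN3 x16: 15.75 GB/s per direction
# GEN4 x16: 31.51 GB/s per direction
```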
6. How Everything Works...
With direct data feeds such as NASDAQ ITCH and OUCH, the DNPCIE_400G_VUP_HBM_LL contains all of the basic functions required to minimize the amount of time it takes to receive Ethernet packets, process them, and respond deterministically. By using the FPGA to process Ethernet packets, the processor and operating system are removed from the critical path, and traditional sources of latency such as interrupts and context switching no longer hinder performance. Not a single clock cycle is wasted. For algorithms requiring processing, FPGA resources can be hard coded to perform the task. This includes real-time Monte Carlo analysis and floating point.
- PCI Express (8-lane, GEN3) FPGA-based algorithm acceleration peripheral with dual FPGAs:
  - Virtex UltraScale+: (-3 is fastest)
    - VU11P-3, -2, -1 (reduced DDR4 memory capacity; see block diagram)
  - Virtex UltraScale:
  - Kintex UltraScale: (note limited GTH speeds)
- Xilinx FPGA UltraScale: XC11P
  - 2.58M flip-flops per FPGA
    - 1.3M flip-flops with 6-input LUT
  - 8,928 - 27x18 multiplier + 48-bit accumulator per FPGA
  - 4,032 - 18-kbit block RAM per FPGA
  - 960 - 4k x 72 UltraRAM blocks
  - 43MB total internal block memory
  - Fully dual-ported
  - Each 36-kbit block RAM configurable as: 32Kx1, 16Kx2, 8Kx4, 4Kx9 (or 8), 2Kx18 (or 16), 1Kx36 (or 32), or 512x72 (or 64)
  - Support package for OpenCL/SDAccel (consult factory for availability)
- 2 Ethernet ports: QSFP28
  - 100 GbE (Virtex UltraScale/+ only), or
  - 40 GbE, or
  - 4 ports 10 GbE
- Fixed, 8-lane GEN3 PCIe interface and controller
- FPGA to FPGA interconnect
  - 16 bidirectional MGT connections, each 12.5 Gb/s
    - ~25 GB/s data transfer throughput A->B and B->A
- Memory. Lots of it at high performance. Each FPGA has:
  - 6 separate 1G x 16 DDR4 (PC4-2400) memories
    - Each with separate address, data, and control
  - 1G x 64 DDR4 (PC4-2400) for bulk storage
- Two independent low-skew global clock networks
  - Distributed differentially and balanced
- Fast and painless FPGA configuration via PCIe
  - On-board battery for AES bitstream encryption
- Full support for embedded logic analyzers via JTAG interface
  - ChipScope, Certus, and other third-party debug solutions
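
Two figures in the feature list can be cross-checked from the other numbers given there: the ~25GB/s FPGA-to-FPGA throughput (16 MGT connections at 12.5Gb/s each) and the 43MB of internal memory (4,032 18-kbit block RAMs plus 960 UltraRAM blocks of 4k x 72). The Python sketch below is only a sanity check with hypothetical function names. It assumes "kbit" and "4k" mean 1024-based units and "MB" means 2^20 bytes, and it ignores any line-encoding overhead on the MGT links.

```python
def interconnect_gbytes_per_s(links: int = 16, gbits_per_link: float = 12.5) -> float:
    """Aggregate one-direction FPGA-to-FPGA rate in GB/s (raw line rate)."""
    return links * gbits_per_link / 8


def internal_memory_mib(block_rams_18k: int = 4032, ultrarams: int = 960) -> float:
    """Total on-chip RAM in MiB from 18-kbit block RAMs and 4k x 72 UltraRAMs."""
    block_ram_bits = block_rams_18k * 18 * 1024
    ultraram_bits = ultrarams * 4096 * 72
    return (block_ram_bits + ultraram_bits) / 8 / 2**20


print(f"interconnect:    {interconnect_gbytes_per_s():.1f} GB/s")  # 25.0 GB/s
print(f"internal memory: {internal_memory_mib():.1f} MiB")          # 42.6 MiB
```

Both results agree with the listed values once rounded.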