The DNVUF2_HPC_PCIE_Cluster is a complete, 5U rack-mount FPGA acceleration cluster. The standard configuration contains the following:
- Trenton Systems HEP8225 Dual Xeon Processor Card
- 7 DNVUF2_HPC_PCIE cards with 2 VU095-2 FPGAs per card
- 1.5 TB SATA II Hard Drive
This system contains the maximum number of cost-effective FPGAs that can reasonably be integrated into a 5U chassis; power and cooling are the constraining variables. In short, the DNVUF2_HPC_PCIE_Cluster is a massive number of large, high-performance FPGAs integrated with an excellent dual Xeon-based host processor. Because it is your own box, this solution avoids the security problems associated with cloud-based FPGA farms: your data and algorithms remain proprietary to you. A partial list of possible applications includes:
- low-latency analysis
- encryption/decryption (cryptography)
1. The Processor Card - Dual Intel Xeon
Central to the DNVUF2_HPC_PCIE_Cluster is the Trenton Systems HEP8225 HDEC Series System Host Board (other boards may be substituted). This single-board computer has dual Intel Xeon Broadwell processors clocked at 2.4 GHz; each processor has 14 cores and 35 MB of cache. The host card has 8 DIMM slots and can be stuffed with up to 512 GB of DDR4 RAM, with a maximum of 64 GB per slot. The processor card has two 100/1000/10000BASE-T Ethernet ports, along with four USB 2.0 ports. The 5U chassis can host up to 2 SATA drives. Power and cooling are provided for up to 7 DNVUF2_HPC_PCIE cards. Power is cabled to the FPGA cards separately rather than drawn from the motherboard, allowing the cards to exceed the 25 W PCIe slot limit; the power budget is 75 W per board. Note that this requires a lot of airflow, and the fans are noisy. Fully populated, the system is perhaps too noisy to be in close quarters with an engineer.
2. The FPGAs: DNVUF2_HPC_PCIE - 2 Xilinx UltraScale/UltraScale+ FPGAs
The DNVUF2_HPC_PCIe hosts two Xilinx FPGAs from the UltraScale and UltraScale+ families. Each FPGA has multiple banks of high-performance DDR4 memory. Data movement to/from the FPGAs is accomplished via an 8-lane, GEN3 PCIe interface. Each of the two FPGAs (A and B in the block diagram) has six separate 1G x 16 DDR4 (16Gb) memories and a bank of 1G x 64 DDR4. Up to seven of these cards can be populated in the 5U chassis.
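As a rough sanity check on what an 8-lane GEN3 link can move, the arithmetic below uses the nominal PCIe Gen3 figures (8 GT/s per lane, 128b/130b encoding); these are generic numbers, not measurements of this board, and TLP/protocol overhead will reduce the achievable rate further.

```python
# Back-of-the-envelope PCIe Gen3 x8 bandwidth (nominal spec figures,
# not a measured number for the DNVUF2_HPC_PCIe).
GT_PER_S = 8.0             # Gen3 line rate per lane, GT/s
ENCODING = 128.0 / 130.0   # Gen3 128b/130b encoding efficiency
LANES = 8

gbps_per_lane = GT_PER_S * ENCODING        # ~7.88 Gb/s usable per lane
total_GBps = gbps_per_lane * LANES / 8.0   # ~7.88 GB/s per direction

print(f"{total_GBps:.2f} GB/s per direction before protocol overhead")
```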
3. Virtex UltraScale/UltraScale+ FPGAs from Xilinx
We use the B2104 package. In this package, the largest device is the UltraScale+ VU13P. Nearly all of the I/Os are utilized. Other than a GEN3 PCIe controller, 100% of the resources of the two FPGAs are dedicated to the user application. With two XCVU13Ps, the DNVUF2_HPC_PCIe is capable of nearly 40 million gates of ASIC logic with plenty of resource margin. Features of the Virtex UltraScale/UltraScale+ FPGAs include efficient, dual-register 6-input look-up table (LUT) logic, 36 Kb (2 x 18 Kb) block RAMs, and third-generation DSP slices (including 27x18 multipliers and a 48-bit accumulator). Floating-point functions can be implemented using these DSP slices. UltraScale+ adds many megabytes of UltraRAM in a 4K x 72 configuration. The list of possible FPGA stuffing options is lengthy, but these three FPGAs are the most interesting:
4. Virtex UltraScale+ VU13P
- Largest number of DSP blocks for multipliers
- Largest amount of internal block RAM
5. Virtex UltraScale VU095
- Most cost-effective Virtex UltraScale device
6. Kintex UltraScale KU115
- Most resources for the lowest cost
7. QSFP28 for 40 GbE or 100 GbE - Board to board or other
The GTY transceivers native to UltraScale are capable of 25 Gbps. From FPGA A, eight of these transceivers are connected to two QSFP28s, enabling dual 100 GbE interfaces. Each QSFP28 interface can also be used for 40 GbE, or 4x 10 GbE. Raw Ethernet packets can be accessed directly by bypassing the MAC if low latency is required. Note that the GTH transceivers on Kintex UltraScale are limited to 10 Gbps, restricting the QSFP28s to 40 GbE (or 4x 10 GbE). The eight ports can be cabled separately in any manner between boards in the chassis or externally. You get to choose the network routing architecture; it is not fixed. Figure 2 shows the links connected in a daisy chain. Any or all of the links can be used for an external interface if capturing and processing large amounts of streaming data is required.
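The lane budgets above follow from the usual round numbers for QSFP28: four lanes per cage, 25 Gb/s lanes for 100 GbE on GTY, 10 Gb/s lanes for 40 GbE on GTH. (On the wire, 100 GbE actually runs 4 x 25.78125 Gb/s with 64b/66b encoding; the sketch below uses the nominal payload rates only.)

```python
# Nominal QSFP28 lane arithmetic (round payload numbers, not
# measured line rates on this board).
LANES_PER_QSFP28 = 4

gty_lane_gbps = 25   # GTY transceivers (Virtex UltraScale/UltraScale+)
gth_lane_gbps = 10   # GTH on Kintex UltraScale, per the note above

eth_100g = LANES_PER_QSFP28 * gty_lane_gbps   # 100 GbE per QSFP28
eth_40g = LANES_PER_QSFP28 * gth_lane_gbps    # 40 GbE fallback on GTH
```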
8. FPGA to FPGA
FPGA to FPGA data transfers internal to the board are done via 16 high-speed serial links. These links use GTH transceivers and are characterized at 10 Gb/s. The links are bidirectional, and TX/RX are independent. We expect to be able to transfer 16 GB/second in each direction simultaneously.
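The 16 GB/s figure is consistent with 16 links at 10 Gb/s if the links carry 8b/10b-encoded traffic (80% payload efficiency). The encoding assumption is ours, not stated above; a minimal sketch:

```python
# Reproducing the ~16 GB/s inter-FPGA figure, assuming (our
# assumption) 8b/10b encoding on the 10 Gb/s GTH links.
LINKS = 16
LINE_RATE_GBPS = 10.0
EFFICIENCY = 0.8   # 8b/10b: 8 payload bits per 10 line bits

payload_gbps = LINKS * LINE_RATE_GBPS * EFFICIENCY   # 128 Gb/s
payload_GBps = payload_gbps / 8                      # 16 GB/s per direction
```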
9. Memory - DDR4
The availability of large amounts of local high-speed memory is pivotal to FPGA-based algorithmic acceleration, and the DNVUF2_HPC_PCIe is optimized accordingly. Each FPGA has six individually accessed 16 Gb DDR4 memories, each organized as 1G x 16. In addition, each FPGA has a bulk DDR4 memory bank organized as 1G x 64. All seven banks are slated to run at PC4-2400. The Xilinx Memory Interface Generator (MIG) works well, so no separate memory controller IP is required. The seven memories can be used independently or grouped in any manner that best fits your application. As always, we provide examples and reference designs to help you with your memory interfaces. Please check with us to make sure that what we ship at no charge meets your requirements.
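The theoretical peak bandwidth per FPGA follows from the PC4-2400 rate (2400 MT/s) and the bank widths above; this is raw pin bandwidth, before refresh, turnaround, and controller overhead:

```python
# Theoretical peak DDR4 bandwidth per FPGA at PC4-2400 (raw pin
# bandwidth; real throughput will be lower).
MT_PER_S = 2400e6   # transfers per second at PC4-2400
X16_BANKS = 6

x16_GBps = MT_PER_S * 16 / 8 / 1e9   # 4.8 GB/s per 1G x 16 bank
x64_GBps = MT_PER_S * 64 / 8 / 1e9   # 19.2 GB/s for the 1G x 64 bank

peak_per_fpga = X16_BANKS * x16_GBps + x64_GBps   # ~48 GB/s aggregate
```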
10. Power Consumption
The PCI Express specification limits slot power to 25 watts; the DNVUF2_HPC_PCIe is capable of consuming significantly more. In addition to the PCIe fingers, a separate connector provides a second path for power. The product ships with heat sinks adequate for 75 watts of dissipation, but airflow in the chassis is required to remove the heat. Contact the factory if you require high-reliability, fanless heat sinks.
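The per-board figures above imply the following budget for a fully populated chassis (nameplate numbers from this page; actual draw depends on the user design):

```python
# Rough FPGA-card power budget for a fully populated chassis.
CARDS = 7
WATTS_PER_CARD = 75   # per-board budget stated above
SLOT_LIMIT_W = 25     # PCIe slot power limit cited above

fpga_budget_w = CARDS * WATTS_PER_CARD          # 525 W across the cards
aux_w_per_card = WATTS_PER_CARD - SLOT_LIMIT_W  # 50 W via the aux connector
```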
- 5U Rackmount Chassis containing:
  - Trenton Systems HEP8225 Dual Xeon Processor Card
  - 7 DNVUF2_HPC_PCIE Dual UltraScale/UltraScale+ cards
  - Other configurations with different CPU-to-FPGA ratios are available
  - 2 bays for SATA/600 hard drives
  - None of the security issues associated with cloud-based FPGA farms
- Processor card from Trenton Systems: HEP8225
  - Dual Intel Xeon E5-2680 v4 processors (Broadwell), 2.4 GHz
    - 14 cores, 35 MB cache per processor
  - Options to 512 GB DDR4 memory
  - VGA video
  - 10/100/1000/10000BASE-T Ethernet (2 ports)
  - USB 3.0 (4 ports total)
    - 2 ports on front panel
    - 2 ports on back bracket
  - Supports most Linux and Windows distributions
- DNVUF2_HPC_PCIe FPGA HPC Acceleration card
  - 2 Xilinx Virtex UltraScale or UltraScale+ FPGAs (B2104 package)
    - Virtex UltraScale+ (largest): VU13P
    - Virtex UltraScale (largest): VU190
    - Kintex UltraScale (most cost-effective): KU115
  - 80 GB of DDR4 memory