The Critical Importance of High-Speed Networking in HPC
High Performance Computing (HPC) has three principal elements: the compute servers, shared storage, and the network that ties everything together. The network is the heart of any HPC cluster, since it connects everything. Yet I can't count how many times I have seen customer installations pair the highest-performing compute servers and large Lustre parallel file systems with a slow network, with predictably disappointing results. One customer asked me to upgrade their Lustre storage system with faster storage because they felt it was too slow. A quick look at their infrastructure showed it was loaded with blocking top-of-rack switches. Replacing only those switches solved the problem, at far less cost than upgrading the Lustre system (which wouldn't have solved the problem anyway).
The Importance of the Network in HPC is Getting More Critical
The new technology at the ends of the network – GPUs in the servers and NVMe in storage – is now directly connected by Remote Direct Memory Access (RDMA), bypassing the server CPU altogether, and can easily overwhelm networks. What's more, HPC has historically been latency-sensitive, but new workloads such as AI/ML are more throughput-intensive, making proper network choice and design fundamental to HPC cluster operation. For HPC/AI clusters, the higher the bandwidth and the lower the latency, the better.
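As a rough illustration of why small-message workloads care about latency while bulk AI/ML data movement cares about bandwidth, a first-order model of transfer time is wire latency plus serialization time. The function and numbers below (1.5 µs latency, a 400 Gb/s link) are hypothetical placeholders for illustration, not measurements of any particular fabric:

```python
def transfer_time_us(message_bytes, latency_us, bandwidth_gbps):
    """First-order model: total time = wire latency + serialization time.

    bandwidth_gbps is in gigabits per second; 1 Gb/s moves 125 bytes/us.
    """
    return latency_us + message_bytes / (bandwidth_gbps * 125.0)

# Hypothetical link: 1.5 us latency, 400 Gb/s bandwidth.
small = transfer_time_us(64, 1.5, 400)     # tiny MPI-style message
large = transfer_time_us(2**30, 1.5, 400)  # 1 GiB bulk transfer

# Small messages are dominated by latency; bulk transfers by bandwidth.
print(f"64 B:  {small:.4f} us (latency share {1.5 / small:.1%})")
print(f"1 GiB: {large:.1f} us (latency share {1.5 / large:.4%})")
```

Under these assumed numbers, latency accounts for nearly all of the 64-byte transfer time but a vanishing fraction of the 1 GiB transfer, which is why both figures of merit matter for mixed HPC/AI clusters.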
NVIDIA Quantum InfiniBand Networking to the Rescue
So, the moral of the story is: don’t scrimp on the network when procuring your next HPC cluster; you will only be shooting yourself in the foot. The NVIDIA Quantum InfiniBand networking platform is a great place to start. InfiniBand combines high performance and scalability with the most advanced adaptive routing capabilities available, making it uniquely suited to delivering leading I/O performance for the most demanding next-generation workloads that will be hitting your HPC cluster. NVIDIA Quantum InfiniBand supports RDMA in hardware, enabling the high-bandwidth, low-latency remote memory operations required for GPUs, including NVIDIA’s Magnum IO GPUDirect Storage protocol.
NVIDIA Quantum-2 InfiniBand switch systems deliver the highest performance and port density available, with the Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™, advanced management features such as self-healing network capabilities, quality of service, and enhanced virtual lane mapping, and NVIDIA In-Network Computing acceleration engines that provide the performance the AI and scientific applications we care about demand. InfiniBand adaptive routing reroutes data around congestion, increasing HPC application performance.
My HPC Customers’ Biggest Concern
The biggest fear my current customers have is making sure they are feeding their very fast and expensive GPU servers as quickly as those servers can consume data, so we deliver a balanced architecture designed to feed these computational beasts without bottlenecks. That means architecting not only the network but also a high-performance storage system. In HPC going forward, “the network is what drives the bus,” and selecting the right network for your HPC system will ensure the high utilization that optimizes your return on investment.
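One way to sanity-check such a balanced design is a back-of-envelope comparison of what the GPUs in a node can consume against what the network and storage can deliver to it. The function and every number below are hypothetical placeholders for illustration, not vendor specifications:

```python
def feed_check(gpu_demand_gbs, nic_gbs, storage_gbs):
    """Return (deliverable GB/s per node, limiting component).

    gpu_demand_gbs: aggregate ingest rate the node's GPUs can sustain.
    nic_gbs / storage_gbs: per-node network and storage throughput.
    """
    supply = min(nic_gbs, storage_gbs)
    if supply >= gpu_demand_gbs:
        return gpu_demand_gbs, "none"  # GPUs stay fully fed
    return supply, ("network" if nic_gbs < storage_gbs else "storage")

# Hypothetical node: 8 GPUs x 5 GB/s ingest = 40 GB/s demand,
# one 400 Gb/s NIC (~50 GB/s), 30 GB/s of parallel-file-system throughput.
print(feed_check(40, 50, 30))  # storage limits the node here
print(feed_check(40, 50, 45))  # balanced: GPUs stay fully fed
```

The point of the exercise is that the weakest link sets the delivered rate: in the first configuration, upgrading the network would change nothing, because storage is the limiter, exactly the kind of imbalance a holistic design avoids.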