server-parts.eu Blog

InfiniBand Networking for NVIDIA H100/H200 GPU Clusters

  • Apr 30
  • 6 min read

Updated: May 3

Designing clusters with NVIDIA H100 Tensor Core GPUs and NVIDIA H200 Tensor Core GPUs quickly stops being about GPUs. At 16, 32, or 64 nodes, performance is defined by how efficiently nodes exchange data during synchronized operations.


That makes InfiniBand design—ports, rails, switches, and especially cabling—the deciding factor.




In this article, “clusters” means clusters built from 8× GPU baseboard servers, such as:

  • Dell PowerEdge XE9680

  • HPE Cray XD670

  • Supermicro AS-8125GS-TNHR


[Figure: InfiniBand design for NVIDIA H100/H200 GPU clusters: 400G/800G NDR networking with Quantum-2 switches, ConnectX-7 NICs, Clos topology, breakout AOC cabling, and NCCL performance optimization for AI and HPC.]


InfiniBand for NVIDIA H100/H200 GPU Clusters: Traffic Behavior & Scaling


Distributed training relies on NCCL collectives such as AllReduce and AllGather. These generate synchronized, many-to-many traffic where all GPUs communicate at the same time, in cycles.


NCCL uses all available network paths automatically. With 4 or 8 ports per server, traffic is striped across them, so performance depends on balanced paths, not just peak bandwidth.


A single weaker path can slow down the entire cluster, for example:

  • longer cable

  • additional network hop

  • misaligned rails
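
To illustrate why one weak path matters, here is a minimal Python sketch. It assumes a deliberately simplified model (not NCCL's actual scheduling) in which each synchronized step stripes data evenly across all rails and completes only when the slowest rail finishes. The function name and the example rates are illustrative.

```python
def effective_striped_bandwidth(rail_gbps):
    """Aggregate bandwidth (Gb/s) when data is striped evenly across rails.

    Simplified model: every rail carries an equal share, so a step
    completes only when the slowest rail has finished its share.
    """
    if not rail_gbps:
        raise ValueError("at least one rail is required")
    return len(rail_gbps) * min(rail_gbps)


balanced = effective_striped_bandwidth([400] * 8)          # eight healthy rails
degraded = effective_striped_bandwidth([400] * 7 + [200])  # one rail at half rate
print(balanced, degraded)
```

Under this model, a single rail running at half rate halves the usable bandwidth of the whole server, even though seven of the eight rails are healthy.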


Inside a server, GPUs communicate over NVLink/NVSwitch at much higher bandwidth than InfiniBand provides between servers. The inter-node network is therefore the narrower stage: if its bandwidth is insufficient, GPUs idle waiting for communication instead of computing.



InfiniBand for NVIDIA H100/H200 GPU Clusters: Port & NIC Layout


The starting point is the number of ports per server.


Typical 8× GPU systems support:

  • 2× 400G ports

  • 4× 400G ports

  • 8× 400G ports


These are implemented using NVIDIA ConnectX-7.


Most high-end HGX-based systems (including DGX-class designs) use 8× single-port NICs, often consolidated through 4 OSFP cages, where each OSFP port internally carries two 400G connections (e.g., via DensiLink-style internal cabling).


This allows clean 1:1 GPU-to-rail mapping, which simplifies topology and improves balance.


Alternative implementations using dual-port NICs exist and are electrically equivalent, but consolidated OSFP designs are more common in reference architectures.


Scaling ports improves performance, but with trade-offs:

  • 2 → 4 ports: clear and immediate gain

  • 4 → 8 ports: better scaling for 32+ node clusters

  • higher port counts: limited by PCIe bandwidth. Each 400G NIC needs a full PCIe Gen5 x16 link, and shared PCIe paths inside the server can reduce effective throughput even when the port count is sufficient.
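
The PCIe constraint can be checked with simple arithmetic. The sketch below uses the published per-lane signalling rates for PCIe Gen3 to Gen5 and their 128b/130b line coding. It computes raw link rate only, so real throughput is lower once protocol overhead is included. The function names are illustrative.

```python
# Per-lane signalling rate in GT/s for recent PCIe generations.
PCIE_GT_PER_LANE = {3: 8.0, 4: 16.0, 5: 32.0}
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line coding used by Gen3 to Gen5


def pcie_raw_gbps(gen, lanes=16):
    """Raw one-direction link bandwidth in Gb/s, before protocol overhead."""
    return PCIE_GT_PER_LANE[gen] * lanes * ENCODING_EFFICIENCY


def can_feed_port(gen, lanes, port_gbps):
    """True if the raw PCIe link rate is at least the port line rate."""
    return pcie_raw_gbps(gen, lanes) >= port_gbps


for gen in (4, 5):
    print(f"Gen{gen} x16: {pcie_raw_gbps(gen):.0f} Gb/s, "
          f"feeds 400G: {can_feed_port(gen, 16, 400)}")
```

A Gen4 x16 slot falls well short of a 400G port, while a Gen5 x16 slot has headroom for one 400G port but not for two.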


NICs must be evenly distributed across CPU sockets (NUMA nodes). If traffic crosses sockets via UPI/Infinity Fabric, latency increases and bandwidth drops.
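
One way to check NIC placement on a Linux host is to read each RDMA device's NUMA node from sysfs. The following is a minimal sketch under assumptions: the `mlx5_*` device names in the example are placeholders, the helper names are hypothetical, and some single-socket systems report `-1` as the NUMA node.

```python
from collections import Counter
from pathlib import Path


def read_nic_numa_nodes(sysfs_root="/sys/class/infiniband"):
    """Map each RDMA device name to its NUMA node using Linux sysfs."""
    nodes = {}
    for dev in sorted(Path(sysfs_root).glob("*")):
        numa_file = dev / "device" / "numa_node"
        if numa_file.exists():
            nodes[dev.name] = int(numa_file.read_text().strip())
    return nodes


def is_balanced(nic_to_node, all_nodes):
    """True when every NUMA node in `all_nodes` hosts the same number of NICs."""
    counts = Counter(nic_to_node.values())
    return len({counts[node] for node in all_nodes}) == 1


# Placeholder layout: four NICs per socket on a two-socket server.
example = {f"mlx5_{i}": 0 if i < 4 else 1 for i in range(8)}
print(is_balanced(example, all_nodes=(0, 1)))
```

On a real host, `is_balanced(read_nic_numa_nodes(), all_nodes=(0, 1))` would apply the same check to the live system.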


Not all “400G-capable” configurations are equivalent in practice. Common NIC variants include:

  • Single-port 400G (NDR) – 1× 400G per card

  • Dual-port 400G – 2× 400G per card (higher PCIe pressure, since both ports typically share one x16 host link)

  • Dual-port 200G (HDR) – 2× 200G, often mistaken for 400G capability


A dual-port 200G NIC is not the same as a single-port 400G NIC: each rail runs at 200G, half the bandwidth of a 400G rail. For 8× GPU servers, the target is 8× 400G rails, ideally with single-port NICs for clean mapping.


GPU-to-NIC mapping (ideally, each of the 8 GPUs connects to its own network rail):

  • 8 NIC ports → 1:1 GPU-to-rail mapping

  • fewer NICs → multiple GPUs share the same rail


When rails are shared, communication contention increases, reducing efficiency during NCCL collectives such as AllReduce.
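
The effect of port count on rail sharing is simple arithmetic. The sketch below assumes 8 GPUs per server and 400G ports, as in the rest of this article. The function name is illustrative.

```python
import math

GPUS_PER_SERVER = 8  # 8x GPU baseboard, as assumed throughout this article


def rail_sharing(ports, port_gbps=400):
    """Return (GPUs per rail, network Gb/s available per GPU)."""
    if ports < 1:
        raise ValueError("ports must be >= 1")
    gpus_per_rail = math.ceil(GPUS_PER_SERVER / ports)
    per_gpu_gbps = ports * port_gbps / GPUS_PER_SERVER
    return gpus_per_rail, per_gpu_gbps


for ports in (2, 4, 8):
    print(ports, rail_sharing(ports))
```

With 2 ports, four GPUs contend for each rail and each GPU has 100 Gb/s on average; with 8 ports, each GPU has a dedicated 400 Gb/s rail.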



InfiniBand for NVIDIA H100/H200 GPU Clusters: Rail-Optimized Topology


Each port represents a separate communication path, known as a rail. At scale, these rails must be explicitly designed.


A rail-optimized topology assigns each rail consistently across the fabric. For example, “port 1” from every server connects to the same logical leaf group, forming a low-diameter path for that rail. This aligns with NVIDIA’s recommended design for AI fabrics.


The benefits:

  • fewer hops per communication

  • predictable latency

  • better load distribution


Without this, traffic spreads across mixed paths and additional hops, increasing congestion and reducing efficiency during synchronized NCCL operations.
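
The rail assignment rule can be expressed as a small wiring plan. The sketch below is illustrative only: the `leaf-group-N` labels and function names are hypothetical, and a real plan would also assign specific switches and ports within each group.

```python
def rail_optimized_plan(num_servers, ports_per_server):
    """Map every (server, port) pair to a leaf group, one group per rail."""
    return {
        (server, port): f"leaf-group-{port}"
        for server in range(num_servers)
        for port in range(ports_per_server)
    }


def is_rail_consistent(plan):
    """True if each port index lands on exactly one leaf group fabric-wide."""
    groups_by_port = {}
    for (_server, port), group in plan.items():
        groups_by_port.setdefault(port, set()).add(group)
    return all(len(groups) == 1 for groups in groups_by_port.values())


plan = rail_optimized_plan(num_servers=4, ports_per_server=8)
print(plan[(0, 1)], plan[(3, 1)])  # port 1 of both servers shares a leaf group
print(is_rail_consistent(plan))
```

Running `is_rail_consistent` against an as-built cabling export is one way to catch a server whose ports were patched into the wrong leaf group.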



InfiniBand for NVIDIA H100/H200 GPU Clusters: Switch Architecture & Port Usage


Switch design must match the port strategy.


Using NVIDIA Quantum-2 QM9700:

  • 64 × 400G NDR ports

  • implemented via 32 OSFP cages (2× 400G, i.e. 800G aggregate per cage)

  • each cage can fan out to two separate 400G endpoints (breakout) or carry both 400G ports to one peer switch, so downlink and uplink usage must be balanced to avoid oversubscription

  • total throughput: 51.2 Tb/s aggregate bidirectional (25.6 Tb/s per direction), non-blocking


This allows flexible usage:

  • breakout for server connectivity

  • full-bandwidth links for inter-switch connections


For larger clusters, newer platforms such as NVIDIA Quantum-X800 provide:

  • native 800G ports

  • significantly higher radix (e.g., 100+ ports depending on chassis)

  • better scaling for very large AI environments


In practice, switch ports are divided into:

  • server-facing connections (downlinks)

  • inter-switch connections (uplinks)


For AI workloads, the goal is near 1:1 aggregate bandwidth between these layers to maintain non-blocking behavior under burst traffic.
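
The downlink-to-uplink balance on a leaf switch can be expressed as an oversubscription ratio. This is a minimal sketch with an illustrative function name; the 48/16 split is a made-up counterexample, not a recommended configuration.

```python
from fractions import Fraction


def oversubscription(downlink_ports, uplink_ports, down_gbps=400, up_gbps=400):
    """Ratio of server-facing to inter-switch bandwidth on one leaf switch."""
    if uplink_ports <= 0:
        raise ValueError("a leaf needs at least one uplink")
    return Fraction(downlink_ports * down_gbps, uplink_ports * up_gbps)


# A 64-port leaf split evenly is 1:1; 48 down / 16 up would be 3:1.
print(oversubscription(32, 32))
print(oversubscription(48, 16))
```

A result of 1 corresponds to the near 1:1 target described above; anything larger means the uplinks cannot carry all server traffic at once.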



InfiniBand for NVIDIA H100/H200 GPU Clusters: Cable Types, Function & Selection


Cables define how the network is physically realized.


Breakout AOC cables (800G → 2×400G) - Used for switch-to-server connections.

Interface compatibility (OSFP vs QSFP112) must be verified, as mismatches can prevent links from coming up.


Function:

  • splits one switch port into two server connections

  • maximizes port utilization


Key considerations:

  • maintain rail consistency

  • avoid mixing rails within a breakout

  • keep cable lengths consistent per rail


800G AOC cables (direct) - Used for switch-to-switch connections.

Function:

  • carries full bandwidth between switches

  • forms the fabric backbone


Key considerations:

  • must match server-side bandwidth

  • should be evenly distributed across rails

  • required for predictable Clos behavior


DAC cables - Used only within racks.

Function:

  • low-latency short connections


Limitations:

  • distance

  • cable thickness at scale


Cable behavior and impact on performance

At 400G/800G speeds, cables influence:

  • latency

  • signal integrity

  • path symmetry


At these speeds, AOC cables are preferred over DAC beyond short distances due to better signal integrity and easier cable management.


Even small inconsistencies—like uneven cable lengths—can create measurable imbalance between rails. NCCL assumes uniform paths and cannot compensate for this.
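
Cable length consistency per rail can be checked against a cabling inventory before installation. The sketch below is hypothetical: the inventory format (rail index, length in metres), the tolerance parameter, and the example lengths are all assumptions made for illustration.

```python
from collections import defaultdict


def rail_length_spread(cables):
    """Return {rail: longest minus shortest cable} in metres.

    `cables` is an iterable of (rail, length_m) tuples, for example
    exported from a cabling spreadsheet.
    """
    lengths = defaultdict(list)
    for rail, length_m in cables:
        lengths[rail].append(length_m)
    return {rail: max(v) - min(v) for rail, v in lengths.items()}


def uneven_rails(cables, tolerance_m=0.0):
    """Rails whose cable lengths differ by more than the tolerance."""
    spread = rail_length_spread(cables)
    return sorted(rail for rail, s in spread.items() if s > tolerance_m)


inventory = [(0, 3.0), (0, 3.0), (1, 3.0), (1, 5.0)]  # made-up lengths
print(uneven_rails(inventory))
```

Here rail 1 is flagged because its two cables differ by 2 m, while rail 0 is uniform.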



InfiniBand for NVIDIA H100/H200 GPU Clusters: Cable Quantities & Planning


Define:

  • N = number of servers

  • P = ports per server


Breakout cables = (N × P) / 2


Inter-switch cables depend on design, but for near non-blocking fabrics:

  • typically 0.8–1.2× breakout count


Example: 32-node cluster, 8 ports per server
  • 256 server ports

  • 128 breakout cables

  • ~100–155 inter-switch cables (0.8–1.2 × 128)


Example: 64-node cluster, 8 ports per server
  • 512 server ports

  • 256 breakout cables

  • ~205–310 inter-switch cables (0.8–1.2 × 256)


Exact numbers depend on:

  • oversubscription ratio

  • redundancy design

  • required bisection bandwidth
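
The planning formulas in this section can be collected into one helper. This sketch applies the breakout formula (N × P) / 2 and the 0.8–1.2× planning range stated above. The function and parameter names are illustrative, and the result is a first estimate rather than a bill of materials.

```python
def cable_estimate(servers, ports_per_server, ratio_low=0.8, ratio_high=1.2):
    """Breakout cable count and an inter-switch cable range for planning."""
    server_ports = servers * ports_per_server
    if server_ports % 2:
        raise ValueError("each breakout cable serves two ports; count must be even")
    breakout = server_ports // 2
    inter_switch = (round(breakout * ratio_low), round(breakout * ratio_high))
    return {
        "server_ports": server_ports,
        "breakout": breakout,
        "inter_switch_range": inter_switch,
    }


for n in (32, 64):
    print(n, cable_estimate(n, 8))
```

Narrowing `ratio_low` and `ratio_high` to the chosen oversubscription ratio turns the range into a single figure for a specific design.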


Cable planning considerations

At scale:

  • symmetry is critical

  • cables must be labeled and traceable

  • routing must preserve rail separation

  • airflow must not be blocked

  • serviceability must be maintained


Many large clusters fail operationally due to cable complexity, not bandwidth limits.



InfiniBand for NVIDIA H100/H200 GPU Clusters: Clos Topology & Scaling


Clos (spine–leaf) topology is standard beyond a single switch.


It provides:

  • multiple equal-cost paths

  • support for adaptive routing

  • scalability across racks


Quantum-2 includes:

  • adaptive routing

  • SHARP (in-network reduction)


These improve efficiency but do not replace correct topology or rail symmetry. Oversubscription must also be controlled: NCCL-based AI workloads assume near 1:1 bandwidth, and even moderate oversubscription can create bottlenecks.
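
For a rough sense of scale, switch counts for a non-blocking two-tier leaf–spine fabric follow from the switch radix. The sketch below assumes 64-port switches (as on the QM9700), half of each leaf's ports used as uplinks, and at least one link from every leaf to every spine. It ignores rail grouping, redundancy, and whether uplinks divide evenly across spines, so treat it as a first estimate.

```python
import math


def two_tier_clos(endpoints, radix=64):
    """Leaf and spine counts for a non-blocking two-tier fabric.

    Each leaf uses half its ports as downlinks and half as uplinks.
    Returns None when the endpoints exceed what two tiers can serve.
    """
    down_per_leaf = radix // 2
    if endpoints > radix * down_per_leaf:
        return None  # more leaves than a spine has ports: needs a third tier
    leaves = math.ceil(endpoints / down_per_leaf)
    spines = math.ceil(leaves * down_per_leaf / radix)
    return {"leaves": leaves, "spines": spines}


for nodes in (32, 64, 256, 512):
    print(nodes, two_tier_clos(nodes * 8))
```

With 8 ports per server, this model reaches its two-tier limit at 256 nodes (2048 endpoints); 512 nodes returns `None`, meaning another switch tier is needed.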



InfiniBand for NVIDIA H100/H200 GPU Clusters: Monitoring & Validation


Monitoring is essential.


Fabric-level:

  • NVIDIA UFM → congestion, routing, utilization


Application-level:

  • NCCL_DEBUG=INFO → logs which network devices and transports NCCL selected at startup

Validation should include:

  • verifying all rails are used

  • checking topology mapping

  • running NCCL benchmarks (e.g., all_reduce_perf)


A well-designed system should achieve 85–95% of theoretical bandwidth, depending on message size and cluster scale.
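
Benchmark output can be turned into an efficiency figure as sketched below. The AllReduce factor 2(n−1)/n is the conversion nccl-tests documents between algorithm bandwidth and bus bandwidth. Comparing bus bandwidth against the server's aggregate NIC line rate is an assumption of this sketch, and the measured value is made up rather than a benchmark result.

```python
def allreduce_bus_bandwidth(alg_bw_gbytes, ranks):
    """Convert AllReduce algorithm bandwidth to bus bandwidth (GB/s).

    Uses the 2*(n-1)/n factor documented by nccl-tests for AllReduce.
    """
    return alg_bw_gbytes * 2 * (ranks - 1) / ranks


def efficiency(bus_bw_gbytes, ports_per_server, port_gbps=400):
    """Fraction of the server's aggregate NIC line rate reached by busbw."""
    theoretical_gbytes = ports_per_server * port_gbps / 8  # bits to bytes
    return bus_bw_gbytes / theoretical_gbytes


measured = 360.0  # GB/s, a made-up busbw value, not a benchmark result
print(f"{efficiency(measured, ports_per_server=8):.0%}")
```

Against the 85–95% target above, a result well below that range at large message sizes points to a rail, NUMA, or oversubscription problem rather than to NCCL itself.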




InfiniBand for NVIDIA H100/H200 GPU Clusters: Practical Limitations


Key constraints:

  • PCIe bandwidth vs GPU traffic

  • NUMA placement

  • cable density and routing

  • cost scaling with ports

  • diminishing returns at higher port counts


In practice, clusters often underperform due to issues like wrong NIC types, mixed cable lengths, uneven NUMA placement, or oversubscribed topology.



InfiniBand for NVIDIA H100/H200 GPU Clusters: Design Patterns


Requirements change as clusters grow. Below are practical guidelines for clusters built from servers with 8 GPUs per node:


16-Node GPU Cluster (128 GPUs):

  • single switch (fits within 64 × 400G ports at up to 4 ports per server) or small leaf–spine setup

  • rail optimization usually not needed

  • simple cabling


32-Node GPU Cluster (256 GPUs):

  • rail-optimized topology recommended

  • consistent rail mapping across servers

  • leaf–spine layout begins

  • small path differences impact performance


64-Node GPU Cluster (512 GPUs):

  • full leaf–spine (Clos) topology required

  • strict rail symmetry

  • NUMA-aware NIC placement

  • uneven paths reduce NCCL efficiency


128-Node GPU Cluster (1024 GPUs):

  • multi-tier Clos topology

  • balanced links between layers

  • rails consistent across racks

  • multiple spine switches required


256-Node GPU Cluster (2048 GPUs):

  • large-scale fabric design

  • strict rail consistency across the cluster

  • larger spine layer or extra switch tier

  • inter-switch bandwidth must match server traffic


512-Node GPU Cluster (4096+ GPUs):

  • multi-cluster scaling

  • additional core layer between clusters

  • careful inter-cluster traffic planning

  • higher complexity in routing, monitoring, and cabling




GPU cluster performance at scale depends on aligning ports, rails, switches, and cables so that communication stays efficient. Otherwise, hardware potential is wasted.


FAQ – InfiniBand for NVIDIA H100/H200 GPU Clusters


How many InfiniBand ports are needed for NVIDIA H100/H200 GPU clusters?

Most NVIDIA H100/H200 GPU clusters use 4–8× 400G InfiniBand ports per server to ensure sufficient bandwidth for NCCL communication.


What is the best network topology for H100/H200 GPU clusters?

A rail-optimized Clos (spine–leaf) topology is the standard design for scalable NVIDIA H100/H200 GPU clusters.


How many cables are required for an InfiniBand GPU cluster?

Cable count scales with port count: typically (servers × ports per server ÷ 2) breakout cables, plus a similar number of inter-switch cables.


What cable types are used in InfiniBand H100/H200 clusters?

Most deployments use 800G to 2×400G breakout AOC cables for servers and 800G AOC cables for switch-to-switch connections.


Why use InfiniBand instead of Ethernet for H100/H200 GPU clusters?

InfiniBand provides lower latency, native RDMA, and more predictable NCCL performance compared to Ethernet-based solutions.


