NVIDIA Blackwell Ultra B300: Full Specs, 288GB HBM3e Memory, 15 PFLOPS FP4, Architecture & GB300 Platform
The NVIDIA B300 GPU, also called Blackwell Ultra, is the newest version of NVIDIA’s Blackwell architecture built for hyperscale AI infrastructure, large language models (LLMs), and AI inference workloads.
It improves on the NVIDIA B200 with higher GPU memory capacity and better inference performance, and works with the Grace-Blackwell GB300 platform that combines NVIDIA Grace CPUs and Blackwell GPUs to run trillion-parameter AI models in rack-scale AI systems.
This article covers the NVIDIA B300 and GB300 architecture, including GPU microarchitecture, compute units, HBM memory, NVLink interconnect, AI server platforms, rack-scale AI infrastructure, and the software stack used in modern AI data centers.
NVIDIA B300 Blackwell Ultra Architecture Overview
The NVIDIA B300 GPU is based on the Blackwell architecture, designed for AI training and inference workloads. It builds on the original Blackwell platform with higher memory capacity and improved inference throughput.
The architecture includes several major components:
Streaming Multiprocessors (SMs)
CUDA cores for general computation
5th-generation Tensor Cores
Transformer Engine for AI workloads
HBM3e high-bandwidth memory
NVLink 5 GPU interconnect
NVSwitch GPU fabric
SXM module packaging
Together, these components create a GPU architecture capable of delivering extremely high performance for transformer-based models and large-scale distributed AI workloads.
NVIDIA B300 GPU Microarchitecture
At the core of the NVIDIA B300 GPU are Streaming Multiprocessors (SMs), the main compute units of the Blackwell architecture. These units execute thousands of threads in parallel using NVIDIA’s SIMT (Single Instruction Multiple Thread) execution model, which is essential for AI workloads and scientific computing.
Key B300 Compute Architecture
| Component | Details |
| --- | --- |
| Architecture | Blackwell |
| Streaming Multiprocessors | ~160 SMs |
| GPU Design | Dual-die architecture |
| Execution Model | SIMT (Single Instruction Multiple Thread) |
| Warp Size | 32 threads per warp |
| Transistor Count | ~208 billion transistors |
Each SM (Streaming Multiprocessor) in the B300 contains several key compute components:
CUDA cores for general GPU compute
Tensor Cores for AI and deep learning operations
Warp schedulers managing thread execution
Register files for fast thread data access
Shared memory for inter-thread communication
Special Function Units (SFUs) for complex math operations
Threads run in warps of 32, with warp schedulers distributing instructions across execution pipelines to sustain high AI training and inference throughput. The B300's ~208 billion transistors are similar to the B200's; its gains come from architectural optimizations and higher power headroom rather than a larger transistor budget.
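As a back-of-envelope illustration of the warp model described above, the number of warps a kernel launch produces is simply its thread count divided by the warp size (the ~160 SM count is the estimate quoted in the table, not a confirmed figure):

```python
import math

WARP_SIZE = 32   # threads per warp (see table above)
NUM_SMS = 160    # approximate SM count quoted for the B300

def warps_for(threads: int) -> int:
    """Number of 32-thread warps needed to cover a thread count."""
    return math.ceil(threads / WARP_SIZE)

# A launch of 1 million threads maps to 31,250 warps, which the
# hardware scheduler spreads across the SMs.
total_warps = warps_for(1_000_000)
print(total_warps)                       # 31250
print(math.ceil(total_warps / NUM_SMS))  # 196 warps per SM on average
```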
CUDA Cores and Compute Pipelines - NVIDIA B300
CUDA cores handle general-purpose GPU computation and execute floating-point and integer instructions used in machine learning, HPC, and data processing workloads.
| Component | Function |
| --- | --- |
| CUDA Cores | Execute general GPU compute operations |
| Floating-Point Pipelines | Handle FP calculations used in AI and HPC |
| Integer Pipelines | Process integer operations and data tasks |
| Tensor Compute Pipelines | Accelerate AI and deep learning workloads |
| Memory Pipelines | Manage data movement across the GPU |
These execution pipelines run concurrently, allowing the GPU to perform massive numbers of operations in parallel.
Tensor Cores and AI Acceleration - NVIDIA B300
The NVIDIA B300 uses 5th-generation Tensor Cores, specialized compute units designed to accelerate deep learning workloads. Tensor Cores perform the matrix multiplication operations that are fundamental to neural network training and inference.
Supported precision formats include:
FP4
FP8
BF16
FP16
TF32
FP32 accumulation
These precision modes allow the GPU to balance performance and numerical accuracy depending on the AI workload being executed.
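To make the low end of that precision list concrete, the sketch below enumerates the values a 4-bit E2M1 float can represent (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1). NVFP4 pairs low-precision elements like these with per-block scale factors; whether it uses exactly this element layout is an assumption based on public descriptions of FP4, not a confirmed NVIDIA detail:

```python
def e2m1_values() -> list[float]:
    """Enumerate the non-negative magnitudes of a 4-bit E2M1 float."""
    vals = set()
    for exp in range(4):        # 2 exponent bits
        for man in range(2):    # 1 mantissa bit
            if exp == 0:        # subnormal: man/2 * 2^(1 - bias)
                vals.add(man * 0.5)
            else:               # normal: (1 + man/2) * 2^(exp - bias)
                vals.add((1 + man / 2) * 2 ** (exp - 1))
    return sorted(vals)

print(e2m1_values())  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Only eight magnitudes exist per sign, which is why the scale factors and FP32 accumulation matter so much for preserving accuracy.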
NVFP4 Precision and AI Performance - NVIDIA B300
One of the most important improvements in the B300 generation is inference performance using FP4 precision. NVIDIA refers to its optimized FP4 format as NVFP4, which enables extremely high throughput while maintaining model accuracy.
Approximate AI compute performance:
NVIDIA B200 GPU: ~9 PFLOPS dense FP4
NVIDIA B300 GPU: ~14–15 PFLOPS dense FP4
This represents roughly a 55–67% performance improvement in inference-heavy workloads.
System-level performance examples include:
NVIDIA DGX B300 (8 GPUs): approximately 108–144 PFLOPS FP4
NVIDIA GB300 NVL72 rack: roughly 1.1–1.44 exaFLOPS FP4 compute performance
These improvements are designed specifically for AI reasoning models, mixture-of-experts architectures, and large language model inference.
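The figures above can be sanity-checked with simple arithmetic, using the approximate per-GPU numbers quoted in this article:

```python
B200_FP4_PFLOPS = 9.0
B300_FP4_PFLOPS = 15.0   # upper end of the ~14-15 PFLOPS range above

# Per-GPU improvement: 15/9 - 1 = 67% (the upper end of the 55-67% range)
improvement = B300_FP4_PFLOPS / B200_FP4_PFLOPS - 1
print(f"{improvement:.0%}")

# 8-GPU DGX B300: 8 x 15 = 120 PFLOPS, within the 108-144 range quoted
print(8 * B300_FP4_PFLOPS)

# 72-GPU GB300 NVL72: 72 x 15 = 1080 PFLOPS = 1.08 exaFLOPS dense FP4
print(72 * B300_FP4_PFLOPS / 1000)
```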
Transformer Engine - NVIDIA B300
Blackwell GPUs, including the NVIDIA B300, include a Transformer Engine that speeds up transformer-based AI models (such as LLMs). It automatically adjusts precision (for example FP4 or FP8) to increase performance while keeping model accuracy.
| Feature | Purpose |
| --- | --- |
| Dynamic Precision Scaling | Automatically selects FP4 or FP8 for optimal performance |
| Optimized Transformer Operations | Accelerates transformer-based neural networks |
| Inference Efficiency | Improves AI inference throughput |
| Memory Optimization | Reduces memory consumption |
These capabilities are critical for large AI models with billions or trillions of parameters.
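The dynamic scaling idea can be sketched in a few lines: choose a scale so the tensor's largest value maps to the top of the low-precision range, quantize, then dequantize. This is a minimal illustration only; the uniform quantization grid and the 6.0 format maximum are simplifying assumptions, not the Transformer Engine's actual algorithm:

```python
def quantize_dequantize(xs, fmt_max=6.0, steps=16):
    """Scale a tensor into a low-precision range, quantize, dequantize."""
    amax = max(abs(x) for x in xs)
    scale = amax / fmt_max if amax else 1.0   # map largest value to fmt_max
    grid = fmt_max / (steps / 2)              # crude uniform-grid stand-in
    out = []
    for x in xs:
        q = round((x / scale) / grid) * grid  # quantize in scaled space
        out.append(q * scale)                 # dequantize back
    return out

print(quantize_dequantize([0.01, -0.2, 0.5, -1.0]))
```

Note how small values collapse to zero while large ones survive almost exactly; per-tensor (or per-block) scaling keeps the usable dynamic range centered on the data.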
Multi-Die GPU Design - NVIDIA B300
Blackwell GPUs, including the NVIDIA B300, use a multi-die architecture instead of a single monolithic silicon die. The B300 consists of two large compute dies connected by a high-bandwidth internal interconnect, enabling larger GPU designs.
| Component | Description |
| --- | --- |
| GPU Design | Multi-die architecture |
| Compute Dies | Two large compute dies |
| Interconnect | High-bandwidth internal interconnect |
| Memory Behavior | Unified compute and memory system |
Advantages of the dual-die architecture:
Enables GPUs larger than single silicon die limits
Supports higher transistor counts
Improves manufacturing yield
Maintains unified GPU compute and memory behavior
Memory Architecture - NVIDIA B300
The NVIDIA B300 GPU uses HBM3e (High Bandwidth Memory), a stacked memory technology designed for data-intensive AI workloads. Compared to the B200, the B300 significantly increases memory capacity to support larger AI models.
NVIDIA B200 vs B300 Memory Capacity
| GPU | Memory Type | Capacity |
| --- | --- | --- |
| NVIDIA B200 | HBM3e | 192 GB |
| NVIDIA B300 | HBM3e | 288 GB |
The increase is achieved using 12-high HBM3e stacks, compared to 8-high stacks in B200 GPUs.
NVIDIA B300 Memory Specifications
| Component | Specification |
| --- | --- |
| HBM Memory Stacks | 8 stacks |
| Memory Interface | ~8192-bit |
| Memory Bandwidth | ~8 TB/s |
The 50% memory increase allows much larger AI models to run on a single GPU and reduces the need for model sharding across multiple GPUs.
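A rough way to see what the extra capacity buys is to estimate how many GPUs are needed just to hold a model's weights. This ignores KV cache, activations, and framework overhead, and the 400B-parameter model is a hypothetical example, not a specific benchmark:

```python
import math

def gpus_needed(params_billion: float, bytes_per_param: float,
                capacity_gb: float) -> int:
    """GPUs needed to hold the weights alone (1B params x 1 byte = 1 GB)."""
    weights_gb = params_billion * bytes_per_param
    return math.ceil(weights_gb / capacity_gb)

# Hypothetical 400B-parameter model stored in FP8 (1 byte per parameter):
print(gpus_needed(400, 1, 288))  # 2 B300 GPUs
print(gpus_needed(400, 1, 192))  # 3 B200 GPUs
```

Fewer shards means less inter-GPU communication per token, which compounds with the bandwidth improvements described below.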
Cache Hierarchy - NVIDIA B300
The GPU includes multiple cache levels that reduce memory latency and improve performance.
| Cache Level | Location | Purpose |
| --- | --- | --- |
| L1 Cache | Inside each Streaming Multiprocessor (SM) | Stores frequently used data and supports shared memory between threads |
| L2 Cache | Shared across the entire GPU | Caches global memory accesses and reduces traffic to HBM memory |
Note: Exact cache sizes for the B300 GPU have not been publicly disclosed.
NVLink 5 Interconnect - NVIDIA B300
| Feature | Details |
| --- | --- |
| NVLink Generation | NVLink 5 |
| Connections per GPU | 18 NVLink links |
| Total Bandwidth | ~1.8 TB/s bidirectional |
What this means in practice
GPUs communicate directly with each other, without sending data through the CPU.
This enables faster scaling in multi-GPU systems.
It is critical for distributed AI training and large-scale inference workloads.
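The per-link share of that bandwidth follows directly from the table above:

```python
TOTAL_BIDIR_TB_S = 1.8   # per-GPU NVLink 5 bandwidth (table above)
NUM_LINKS = 18           # NVLink links per GPU

per_link_gb_s = TOTAL_BIDIR_TB_S * 1000 / NUM_LINKS
print(per_link_gb_s)  # 100.0 GB/s bidirectional per link
```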
NVSwitch Fabric - NVIDIA B300
| Component | Function |
| --- | --- |
| NVSwitch | Connects many GPUs into a high-speed switching fabric so every GPU can communicate with every other GPU with very low latency |
Main benefits:
Full GPU mesh connectivity – every GPU can talk directly to the others
High bandwidth communication – fast data exchange between GPUs
Efficient model parallelism – large AI models can run across multiple GPUs
| Example System | NVLink Fabric Bandwidth |
| --- | --- |
| NVIDIA GB300 NVL72 | ~130 TB/s |
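Multiplying the per-GPU NVLink bandwidth by the GPU count reproduces the fabric figure above:

```python
GPUS = 72
PER_GPU_TB_S = 1.8   # NVLink 5 bidirectional bandwidth per GPU

fabric_tb_s = round(GPUS * PER_GPU_TB_S, 1)
print(fabric_tb_s)   # 129.6 TB/s, matching the ~130 TB/s figure
```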
Grace-Blackwell GB300 Platform - NVIDIA B300
The NVIDIA GB300 platform integrates NVIDIA’s Grace CPU with Blackwell GPUs.
A GB300 superchip contains:
one Grace CPU
two B300 GPUs
The CPU and GPUs are connected through NVLink-C2C, a coherent interconnect that allows direct memory sharing between CPU and GPU. This architecture significantly reduces data transfer latency compared to traditional PCIe connections.
Grace CPU Architecture - NVIDIA B300
The Grace CPU is an ARM-based processor designed specifically for AI and high-performance computing environments.
Key characteristics include:
large number of ARM cores
high memory bandwidth
optimized data processing for AI pipelines
Grace CPUs coordinate data movement, system management, and workload scheduling across GPU clusters.
Server Platforms - NVIDIA B300
NVIDIA B300 GPUs are deployed in several server platforms.
NVIDIA HGX B300 systems typically contain eight GPUs connected through NVLink and NVSwitch. These platforms are used by OEM server manufacturers.
NVIDIA DGX B300 servers are NVIDIA’s integrated AI infrastructure systems that combine GPUs, networking, and optimized cooling into a single platform.
Rack-Scale AI Systems - NVIDIA B300
One of the largest deployments of the architecture is the GB300 NVL72 rack system.
A single NVL72 rack includes:
72 Blackwell GPUs
36 Grace CPUs
large NVLink interconnect fabric
The rack behaves like a unified AI accelerator capable of training or running extremely large models.
Networking Infrastructure - NVIDIA B300
Large GPU clusters rely on high-performance networking technologies such as:
InfiniBand
Spectrum-X Ethernet
ConnectX SuperNICs
BlueField DPUs
These networks allow clusters to scale across thousands of GPUs in hyperscale AI data centers.
GPU Virtualization and Multi-Instance GPU - NVIDIA B300
Blackwell GPUs support Multi-Instance GPU (MIG) technology, which allows a single GPU to be partitioned into multiple logical GPUs.
This enables multiple users or workloads to share GPU resources securely while maintaining performance isolation.
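For sizing purposes, MIG partitioning can be sketched as a simple division of GPU memory across instances. MIG profile counts for the B300 have not been published, so the 7-instance ceiling below is carried over from prior data-center generations (A100/H100) as an assumption:

```python
TOTAL_MEM_GB = 288   # B300 HBM3e capacity

def per_instance_memory(instances: int) -> float:
    """Memory available to each instance under an even MIG split."""
    return TOTAL_MEM_GB / instances

# Hypothetical even splits into 2, 4, and 7 instances:
for n in (2, 4, 7):
    print(n, round(per_instance_memory(n), 1))
```

Even a 7-way split leaves each instance with roughly 41 GB, more than many previous full GPUs offered.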
Power and Cooling - NVIDIA B300
The B300 GPU operates at significantly higher power levels than previous generations.
Typical power envelope:
NVIDIA B200: ~1000–1200 W
NVIDIA B300: ~1000–1400 W configurable TDP
The higher power envelope allows higher sustained Tensor Core performance and improved inference throughput. Because of this power level, most B300-based servers rely on liquid cooling systems and high-density rack power infrastructure.
In large rack-scale systems such as GB300 NVL72, NVIDIA also implements power smoothing technologies that can reduce peak power spikes by up to 30%.
AI Workloads - NVIDIA B300
The B300 architecture is designed primarily for AI workloads including:
large language model training
large-scale inference
mixture-of-experts models
generative AI
reasoning systems
These workloads benefit from large GPU memory, extremely high memory bandwidth, and high-speed GPU interconnects.
Software Ecosystem - NVIDIA B300
Blackwell GPUs operate within NVIDIA’s AI software ecosystem.
Key components include:
CUDA
cuDNN
TensorRT
NCCL
Triton Inference Server
NeMo AI framework
These tools provide the infrastructure needed for training and deploying AI models across large GPU clusters.
The NVIDIA B300 GPU and GB300 platform, built on the Blackwell architecture, combine larger GPU memory, higher inference performance, and high-speed interconnects like NVLink and NVSwitch to power rack-scale AI systems (such as GB300 NVL72) that run extremely large models across dozens or even hundreds of GPUs.
FAQ - NVIDIA B300
What is the NVIDIA B300?
The NVIDIA B300 is a next-generation AI accelerator based on the Blackwell architecture, designed for large-scale AI training and inference workloads in data centers and hyperscale environments.
What is the difference between NVIDIA B300 and B200?
The B300 is an upgraded Blackwell GPU that offers more memory (up to 288 GB HBM3e vs 192 GB on B200), higher inference throughput, and improved performance for large AI models.
What is the difference between NVIDIA B300 and GB300?
The B300 is the GPU itself, while GB300 refers to a full AI computing platform that combines B300 GPUs with NVIDIA Grace CPUs, high-speed NVLink connectivity, and rack-scale infrastructure.
How many GPUs are in NVIDIA B300 systems?
A B300 is a single GPU, but it is typically deployed in multi-GPU systems where dozens of GPUs are connected using NVLink and NVSwitch technologies.
How much memory does the NVIDIA B300 have?
The NVIDIA B300 GPU supports up to 288 GB of HBM3e memory, enabling large AI models and high-performance AI workloads to run efficiently on a single GPU.
Sources - NVIDIA B300
NVIDIA Blackwell Architecture – Official overview of the Blackwell GPU architecture, including NVLink, NVSwitch, and AI infrastructure design:
https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/
NVIDIA DGX B300 System Documentation – Technical documentation describing DGX B300 servers with 8 B300 GPUs and up to 288 GB HBM3e memory per GPU:
https://docs.nvidia.com/dgx/dgxb300-user-guide/introduction-to-dgxb300.html
NVIDIA GB300 NVL72 Platform – Official page explaining the GB300 rack-scale AI system with 72 Blackwell GPUs and NVLink interconnect fabric:
https://www.nvidia.com/en-us/data-center/gb300-nvl72/
NVIDIA NVLink Technology – Explanation of NVLink GPU interconnect technology used to scale GPU clusters and enable high-bandwidth communication:
https://www.nvidia.com/en-us/data-center/nvlink/
Inside NVIDIA Blackwell Ultra – NVIDIA developer blog explaining Blackwell Ultra GPUs, including the 288 GB HBM3e memory used in B300: