AI Chips · 18 Apr, 2025

Unlocking Next-Generation AI Performance: Inside NVIDIA's Blackwell B200 Architecture

We conducted extensive benchmarking of NVIDIA's B200 Blackwell GPUs via TogetherAI, evaluating performance across CUDA bandwidth, disk I/O, memory operations, CPU capabilities, and large language model (LLM) training.

  • The B200 GPUs achieved over 6,000 GB/s in device-to-device bandwidth and a Model FLOPs Utilization (MFU) of 38.21% while training Llama 3 70B.

  • While overall performance was impressive, minor numerical stability issues were observed in the FlashAttention benchmarks, pointing to potential precision or compatibility factors.

  • Overall, the Blackwell architecture proved to be a powerful platform for large-scale AI workloads, excelling in both model training and inference.

Hardware and System Configuration

The tested system was an NVIDIA HGX node with 8x B200 GPUs, each providing approximately 178.36 GiB of memory. It ran CUDA 12.8 (V12.8.61) with the corresponding compilation tools. The configuration was tuned for AI workloads and high-performance computing tasks.

Memory and Bandwidth Performance

The Blackwell GPUs demonstrated exceptional memory bandwidth, particularly in device-to-device transfers, where we measured 6,056.6 GB/s.

Note: The CUDA samples are not intended for precise performance benchmarking. Results may vary depending on GPU Boost settings.

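These results show how fast each B200 can move data within its own memory. If, as the note above suggests, the figures come from the CUDA bandwidthTest sample, the device-to-device number measures copies inside a single GPU's memory rather than transfers between GPUs. The result of over 6,000 GB/s therefore reflects on-device HBM throughput, which large neural network training depends on heavily, and not multi-GPU interconnect speed.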

Storage Performance

The storage subsystem delivered solid performance. Reads were particularly strong at 540K IOPS and 2213 MB/s; writes were somewhat slower at 350K IOPS and 1434 MB/s but still delivered high throughput. Disk utilization of 99.74% during testing indicates the storage subsystem was operating near its maximum capacity.
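The read and write figures reported in this article (2213 MB/s at 540K IOPS, 1434 MB/s at 350K IOPS) can be cross-checked against each other. The short sketch below divides bandwidth by IOPS to estimate the average size of each I/O. It assumes MB means 10^6 bytes and that the IOPS values are rounded; the function name is ours, not part of any benchmark tool.

```python
def implied_block_size_bytes(bandwidth_mb_s: float, iops: float) -> float:
    """Average bytes per I/O implied by a bandwidth/IOPS pair.

    Assumption: MB/s is decimal (10**6 bytes per MB).
    """
    return bandwidth_mb_s * 1_000_000 / iops


# Figures quoted in the article (IOPS are rounded to the nearest thousand).
read_block = implied_block_size_bytes(2213, 540_000)
write_block = implied_block_size_bytes(1434, 350_000)

print(f"read:  {read_block / 1024:.2f} KiB per I/O")
print(f"write: {write_block / 1024:.2f} KiB per I/O")
```

Both values come out at roughly 4 KiB, which is consistent with a small-block random I/O test.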

LLM Training Performance

One of our most significant tests involved training the Llama 3-70B model, which revealed the B200's capabilities for large language model workloads.

The model architecture included:

  • Total parameters: 70,553,706,496

  • Hidden dimension: 8192

  • Layers: 80

  • Attention heads: 64, with 8 key-value heads

  • Maximum sequence length: 8192 tokens

  • Vocabulary size: 128,256

  • Feedforward dimension multiplier: 1.3

  • Normalization type: fused RMSNorm

  • RoPE theta: 500000

  • Depth initialization: enabled
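The total parameter count follows from the other values in this list. The sketch below reconstructs it for a standard Llama-style decoder. Several details are assumptions not stated in the article: the feedforward width is rounded up to a multiple of 4096, the input embedding and output projection are untied, and there are no bias terms. The function names are ours.

```python
def ffn_hidden_dim(dim: int, multiplier: float, multiple_of: int) -> int:
    """Feedforward width as commonly derived in Llama-style models."""
    hidden = int(2 * (4 * dim) / 3)          # SwiGLU uses 2/3 of the 4x width
    hidden = int(multiplier * hidden)        # apply the feedforward multiplier
    # Round up to the next multiple of `multiple_of` (assumed value: 4096).
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)


def llama_param_count(dim, n_layers, n_heads, n_kv_heads, vocab_size, ffn_dim):
    head_dim = dim // n_heads
    kv_dim = n_kv_heads * head_dim
    attention = 2 * dim * dim + 2 * dim * kv_dim   # wq, wo + wk, wv
    feedforward = 3 * dim * ffn_dim                # gate, up, and down projections
    norms = 2 * dim                                # two RMSNorm weights per layer
    per_layer = attention + feedforward + norms
    embeddings = 2 * vocab_size * dim              # input table + untied output head
    final_norm = dim
    return n_layers * per_layer + embeddings + final_norm


ffn_dim = ffn_hidden_dim(dim=8192, multiplier=1.3, multiple_of=4096)
total = llama_param_count(
    dim=8192, n_layers=80, n_heads=64, n_kv_heads=8,
    vocab_size=128_256, ffn_dim=ffn_dim,
)
print(f"ffn_dim = {ffn_dim}, total parameters = {total:,}")
```

Under these assumptions the result matches the 70,553,706,496 parameters listed above.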

Training sustained 1785.83 tokens/sec at a Model FLOPs Utilization (MFU) of 38.21%. The run used full activation checkpointing and Fully Sharded Data Parallelism (FSDP), and despite the model's size the B200 GPUs handled the workload efficiently.
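The MFU figure can be sanity-checked from numbers already in this article. The sketch below uses a common estimate of training FLOPs per token: 6 FLOPs per non-embedding parameter plus an attention term. Three inputs are assumptions rather than facts from the article: that the 1785.83 tokens/sec throughput is per GPU, that the input embedding table is excluded from the parameter count, and that the peak dense BF16 throughput of one B200 is 2.25 PFLOP/s.

```python
def flops_per_token(non_embedding_params, n_layers, n_heads, head_dim, seq_len):
    """Approximate training FLOPs per token (forward + backward)."""
    dense = 6 * non_embedding_params
    attention = 12 * n_layers * n_heads * head_dim * seq_len
    return dense + attention


def mfu_percent(tokens_per_sec, flops_tok, peak_flops_per_sec):
    """Achieved FLOP/s as a percentage of the assumed hardware peak."""
    return 100 * tokens_per_sec * flops_tok / peak_flops_per_sec


TOTAL_PARAMS = 70_553_706_496          # from the article
EMBEDDING_PARAMS = 128_256 * 8192      # vocab size x hidden dimension
ASSUMED_PEAK_FLOPS = 2.25e15           # assumption: dense BF16 peak per GPU
TOKENS_PER_SEC = 1785.83               # from the article; assumed to be per GPU

flops_tok = flops_per_token(
    TOTAL_PARAMS - EMBEDDING_PARAMS,
    n_layers=80, n_heads=64, head_dim=8192 // 64, seq_len=8192,
)
mfu = mfu_percent(TOKENS_PER_SEC, flops_tok, ASSUMED_PEAK_FLOPS)
print(f"estimated MFU: {mfu:.2f}%")
```

With these inputs the estimate lands at about 38.2%, in line with the reported MFU. A different peak figure, for example one for another precision, would shift the result proportionally.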

Training Details

CUDA Memory Usage: 32.87 GiB (18.43% per GPU)
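The percentage above is simply the reported usage divided by the per-GPU memory given in the hardware section. A minimal check, which also converts the GiB capacity to decimal GB for comparison with vendor-style figures:

```python
GPU_MEMORY_GIB = 178.36   # per-GPU memory reported in the hardware section
USED_GIB = 32.87          # CUDA memory usage reported during training

share = 100 * USED_GIB / GPU_MEMORY_GIB
capacity_gb = GPU_MEMORY_GIB * 2**30 / 1e9   # GiB -> decimal GB

print(f"memory in use: {share:.2f}% of each GPU")
print(f"capacity: {capacity_gb:.1f} GB (decimal)")
```

The share is low for a 70B model because FSDP shards parameters, gradients, and optimizer state across the 8 GPUs, and full activation checkpointing trades recomputation for memory.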

The training run used the following configuration:

  • Training Start Step: 1

  • Total Training Steps: 10,000

  • Warmup Steps: 2

  • Batch sizes:

      • Local Batch Size: 1

      • Global Batch Size: 8

  • Tokenizer:

      • Vocabulary Size: 128,256

      • Beginning of Sequence (BOS) Token ID: 128000

      • End of Sequence (EOS) Token ID: 128001

The training data came from the widely used C4 dataset, which provides a large and diverse text corpus.
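The batch configuration translates directly into a token budget. The sketch below assumes every sequence is filled to the 8192-token maximum. The step-time estimate additionally assumes that the 1785.83 tokens/sec figure quoted in this article is per GPU, which the article does not state.

```python
GLOBAL_BATCH_SIZE = 8      # sequences per optimizer step (from the article)
SEQ_LEN = 8192             # assumption: every sequence is packed to full length
TOTAL_STEPS = 10_000       # from the article
NUM_GPUS = 8
TOKENS_PER_SEC_PER_GPU = 1785.83   # assumption: reported throughput is per GPU

tokens_per_step = GLOBAL_BATCH_SIZE * SEQ_LEN
total_tokens = tokens_per_step * TOTAL_STEPS
step_seconds = tokens_per_step / (NUM_GPUS * TOKENS_PER_SEC_PER_GPU)

print(f"tokens per step: {tokens_per_step:,}")
print(f"tokens over the full run: {total_tokens:,}")
print(f"estimated seconds per step: {step_seconds:.2f}")
```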

FlashAttention Performance

[Table: FlashAttention TFLOPs per iteration across 10 runs]

The FlashAttention benchmark showed strong raw computational performance, with reported TFLOPs values across 10 iterations ranging from 127.443 to 1,099,510. The largest values are far beyond what this hardware can physically deliver, so they are best read as timing or measurement artifacts rather than real throughput. Persistent "FWD Incorrect" messages also suggest potential numerical stability issues, likely stemming from floating-point precision (FP16/BF16), memory allocation, or CUDA/PyTorch version compatibility. We consider a CUDA and PyTorch version mismatch a key contributing factor.

The error metrics give more context: the average output difference was 0.0101258, with a maximum difference of 0.09375. Although the operations generally complete successfully, differences of this size remain a precision concern that could affect model training in some scenarios.
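Average and maximum output differences of the kind quoted above are usually computed element by element against a higher-precision reference, and a correctness flag is then derived from a tolerance test. The sketch below illustrates that idea with plain Python lists. The sample values and tolerances are made up for illustration, and the exact check used by the benchmark is not known to us.

```python
def diff_stats(reference, candidate):
    """Return (mean absolute difference, max absolute difference)."""
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    return sum(diffs) / len(diffs), max(diffs)


def within_tolerance(reference, candidate, atol, rtol):
    """allclose-style check: |r - c| <= atol + rtol * |r| for every element."""
    return all(
        abs(r - c) <= atol + rtol * abs(r)
        for r, c in zip(reference, candidate)
    )


# Hypothetical outputs: a reference result and a lower-precision result.
reference = [1.0, -2.0, 0.5, 4.0]
candidate = [1.0, -2.0625, 0.5, 4.09375]

mean_diff, max_diff = diff_stats(reference, candidate)
passed = within_tolerance(reference, candidate, atol=1e-2, rtol=1e-2)
print(f"mean diff = {mean_diff}, max diff = {max_diff}, passed = {passed}")
```

A check like this can fail even when the average difference is small, because a single outlier element is enough to exceed the tolerance.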

cuDNN Benchmark

To evaluate attention performance under different configurations, we ran a series of cuDNN benchmarks with Flash SDP enabled in PyTorch, using cuDNN backend version 90701 (9.7.1). The benchmarks compared non-causal and causal attention across various batch sizes, sequence lengths, and head counts. Throughput was measured in TFLOPs/s for the forward pass (FWD), the backward pass (BWD), and the two combined, using three precision settings: PyTorch default, BF16, and FP8. FP8 consistently achieved the highest throughput in both the causal and non-causal cases, highlighting its efficiency for high-performance inference and training.
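For readers who want to relate TFLOPs/s figures to wall-clock time, the sketch below uses the FLOP-counting convention found in common FlashAttention benchmark scripts: two matrix multiplications in the forward pass, half the work when a causal mask is applied, and a backward pass counted as 2.5x the forward pass. Whether the benchmarks in this article used exactly this convention is an assumption, and the example shape and timing are hypothetical.

```python
def attention_flops(batch, seq_len, n_heads, head_dim, causal, mode="fwd"):
    """Attention FLOPs under a common benchmark convention."""
    # QK^T and PV each cost 2 * seq_len^2 * head_dim FLOPs per head.
    fwd = 4 * batch * seq_len**2 * n_heads * head_dim
    if causal:
        fwd //= 2   # a causal mask skips roughly half of the score matrix
    factor = {"fwd": 1.0, "bwd": 2.5, "fwd_bwd": 3.5}[mode]
    return fwd * factor


def tflops_per_sec(flops, seconds):
    return flops / seconds / 1e12


# Hypothetical shape and timing, for illustration only.
flops = attention_flops(batch=2, seq_len=8192, n_heads=64, head_dim=128, causal=False)
print(f"{tflops_per_sec(flops, seconds=0.010):.1f} TFLOPs/s for a 10 ms forward pass")
```

Because the causal case counts half the FLOPs, causal and non-causal throughput figures are comparable only when both use the same convention.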

[Table: Non-Causal Benchmark Results (TFLOPs/s) (causal=False)]

[Table: Causal Benchmark Results (TFLOPs/s) (causal=True)]

General Compute Performance

Beyond AI-specific workloads, we evaluated general compute performance. The CPU test using Sysbench (version 1.0.20) showed strong multithreaded capabilities, with 63,901.57 events per second across 96 threads. Memory operations were similarly strong, reaching a bandwidth of 14,174.81 MiB/sec for writes.

These results indicate that the B200-equipped system provides balanced performance across both AI-specific and general computing workloads, making it a versatile platform for various high-performance computing applications.

Performance Analysis: B200 vs Previous Generation

Training Performance

The B200 demonstrates exceptional efficiency when handling large language models, with particular strength in memory bandwidth and multi-GPU scaling.

[Table: B200 vs. previous generation, training performance]

*Estimated comparative values based on industry standards for previous generation hardware

Our Take

For organizations running extensive AI training workloads, Blackwell reduces total cost of ownership through three primary mechanisms: faster training completion (1785.83 tokens/sec), higher memory efficiency (93.98% utilization observed), and improved scaling across multiple GPUs. Organizations currently requiring 16-32 previous-generation GPUs for large model training may achieve equivalent or better performance with 4-8 B200 GPUs, resulting in substantial infrastructure and operational cost savings.

The B200 architecture establishes a new performance baseline for AI computing, with substantial headroom for optimization as software frameworks evolve to better leverage its capabilities. We anticipate further performance improvements as compiler optimizations mature and as more workloads adopt technologies like FlashAttention that align with Blackwell's architectural strengths.

Key Takeaways

  • Memory Bandwidth: The device-to-device bandwidth of 6,056.6 GB/s is impressive and indicates high-speed memory operations.

  • Read Performance: The system achieves a high read bandwidth of 2213 MB/s with 540K IOPS, demonstrating strong read performance for small block sizes.

  • Write Performance: The write bandwidth is 1434 MB/s with 350K IOPS, which is slightly lower than the read performance but still strong.

  • Latency: Read and write latencies are within expected ranges, with 99th percentile latencies at ~4-5 ms.

  • Disk Utilization: High disk utilization (99.74%) suggests that the storage system is well-utilized during the test.

  • FlashAttention Performance: Despite high TFLOPs values, incorrect forward pass results indicate potential numerical stability or implementation issues.

  • Memory Performance: The write bandwidth of 14,174.81 MiB/sec indicates strong memory performance across 128 threads, with low latency at the 95th percentile.
