February 8, 2024

Diving Deeper: Insights from Our LLM Inference Testing

This blog post explores the performance of a Supermicro server with AMD Instinct™ MI300X accelerators for Large Language Model inference, focusing on ZeRO-inference with the BLOOM 176B LLM. Key findings include better per-token throughput and the ability to scale to much larger batch sizes than competing 80GB accelerators, highlighting the MI300X's potential in both inference and training tasks for AI and machine learning. The post details technical insights and benchmarks, demonstrating the server's capabilities in handling complex LLM workloads.


Introduction

In a previous post, we discussed the impressive performance of our new Supermicro server equipped with AMD Instinct™ MI300X accelerators in the realm of Large Language Model (LLM) inference. 
Today, we're going to take a closer look at the technical aspects and benchmark results of these tests, focusing in particular on the ZeRO-inference technique used with the BigScience BLOOM large language model.
An important detail to note is that all our tests were conducted using the bfloat16 data type, which is known for its balance of performance and precision.
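
For readers less familiar with the format, the trade-off is that bfloat16 keeps float32's 8-bit exponent (and therefore its dynamic range) while shrinking the mantissa to 7 bits (giving up some precision). A small, purely illustrative PyTorch snippet makes the point; it is not part of the benchmark itself:

    import torch

    x = torch.tensor([3.0e38, 1.0001], dtype=torch.float32)

    # bfloat16 keeps float32's exponent range, so the huge value survives,
    # but with only ~7 mantissa bits the small offset in 1.0001 is rounded away.
    print(x.to(torch.bfloat16))   # roughly [3.0e+38, 1.0000]

    # float16 carries more mantissa bits but a far smaller range: 3.0e38 overflows to inf.
    print(x.to(torch.float16))    # [inf, 1.0000]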

Benchmark Analysis

ZeRO Inference with Bloom 176B LLM
Our tests measured milliseconds per generated token using 8 x AMD Instinct™ MI300X 192GB accelerators, with all computations carried out using the bfloat16 data type to optimize both performance and memory usage. Because this figure divides wall-clock time by every token produced across the batch, it falls as the batch size, and therefore the aggregate throughput, grows.
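
Before the numbers, here is a minimal sketch of how a ZeRO-inference run over BLOOM can be wired up with DeepSpeed and Hugging Face Transformers. It is an illustrative outline of the technique under the software versions listed in the Setup section, not the exact script behind our results, and would be launched with something like deepspeed --num_gpus 8.

    import os
    import torch
    import deepspeed
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from transformers.integrations import HfDeepSpeedConfig

    model_name = "bigscience/bloom"            # the 176B BLOOM checkpoint
    local_rank = int(os.getenv("LOCAL_RANK", "0"))

    ds_config = {
        "bf16": {"enabled": True},             # run everything in bfloat16, as in our tests
        "zero_optimization": {"stage": 3},     # ZeRO stage 3: parameters sharded across the 8 GPUs
        "train_micro_batch_size_per_gpu": 1,   # required field, not used during inference
    }

    # Creating this object before from_pretrained() makes Transformers load the
    # checkpoint straight into ZeRO-3 partitions instead of materializing it whole.
    dschf = HfDeepSpeedConfig(ds_config)       # keep a reference alive

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

    engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
    engine.module.eval()

    prompts = ["DeepSpeed ZeRO-inference lets large models"] * 8   # e.g. batch size 8
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(f"cuda:{local_rank}")
    with torch.no_grad():
        # synced_gpus keeps all ranks stepping together, which ZeRO-3 generation requires
        outputs = engine.module.generate(**inputs, max_new_tokens=100, synced_gpus=True)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))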

The findings were insightful:

Batch Size       1        8        16       32       64       128      256
msecs/token      267.35   32.72    16.92    9.34     5.00     2.62     oom
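
To make clear what the table is reporting: the figure is total wall-clock generation time divided by the total number of tokens produced across the whole batch, which is why it drops as the batch grows. A hedged sketch of how such a figure could be measured follows; the helper name and details are ours, not the original benchmark script.

    import time
    import torch

    def msecs_per_token(engine, tokenizer, prompts, max_new_tokens=100):
        # Illustrative helper: time one batched generate() call and return
        # milliseconds per generated token (total time / total tokens produced).
        inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(engine.device)
        torch.cuda.synchronize()          # ROCm builds of PyTorch expose the same torch.cuda API
        start = time.perf_counter()
        with torch.no_grad():
            out = engine.module.generate(**inputs, max_new_tokens=max_new_tokens, synced_gpus=True)
        torch.cuda.synchronize()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        generated = (out.shape[1] - inputs["input_ids"].shape[1]) * out.shape[0]
        return elapsed_ms / generated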



Batch Size and Throughput Analysis

Small Batch Sizes (1, 8, 16):
At smaller batch sizes, the MI300X exhibited higher latency compared to its performance at larger batch sizes, a common occurrence due to reduced parallelization.
However, when comparing these results to those of the competitor's 80GB models, the MI300X demonstrated improved throughput efficiency, indicating more effective utilization of the accelerator's capabilities.

Optimal Batch Sizes (32, 64, 128):
These batch sizes were where the MI300X truly shined, showing a significant improvement in throughput efficiency compared to both its smaller batch sizes and the competitor's models.
This enhanced efficiency is crucial for applications that require real-time LLM processing.

Batch Size 256:
At this batch size, the MI300X faced 'out of memory' (oom) errors, highlighting its limits in handling extremely large batch sizes.
This was a notable contrast to the competitor's models, which encountered oom errors at much smaller batch sizes, underscoring the MI300X's superior memory management and scalability.

In contrast, tests using 8 x 80GB counterparts showed starkly different results, with 'oom' errors starting from a batch size of 16.
This highlights the critical role of memory capacity in managing large-scale LLM inferences.

Batch Size       1        8        16       32       64       128      256
msecs/token      283      34.88    oom      oom      oom      oom      oom
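
The 'oom' entries in both tables simply mark the point where a run failed with an out-of-memory error. A sweep that records them might look like the following sketch, reusing the illustrative msecs_per_token helper from above:

    import torch

    results = {}
    for bs in (1, 8, 16, 32, 64, 128, 256):
        try:
            results[bs] = msecs_per_token(engine, tokenizer, ["Sample prompt"] * bs)
        except torch.cuda.OutOfMemoryError:
            results[bs] = "oom"            # the accelerators ran out of HBM at this batch size
            torch.cuda.empty_cache()       # release cached blocks before any further attempts
    print(results)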

Comparative Analysis:

Throughput Efficiency:
The MI300X accelerators demonstrated remarkable throughput efficiency.
For instance, at a batch size of 8, the MI300X processed a token every 32.72 milliseconds, outperforming the 80GB models, which clocked in at 34.88 milliseconds, an advantage of roughly 6% per token.
This efficiency isn't just beneficial for inference tasks but also suggests potential advantages in training scenarios, where speed and responsiveness are crucial.

Scalability:
The ability of the MI300X to handle larger batch sizes without encountering memory limitations contrasts sharply with the 80GB models.
This scalability indicates superior memory management and processing power, which are critical factors not only for inference but also for the demanding requirements of model training.

Implications for LLM Performance:
The results from our tests imply that the MI300X can offer faster and more reliable LLM inference, and by extension, these benefits could significantly enhance training processes, especially with large and complex datasets.

Conclusion:
Our ZeRO-inference testing with the Bloom 176B LLM has highlighted the advanced capabilities of our server equipped with MI300X accelerators.
The exceptional throughput efficiency and scalability in processing large batch sizes not only bolster our technology's position in AI and machine learning for inference tasks but also hold promising potential for the training phase of LLMs.
As we continue to explore these technologies, we are particularly excited about the potential enhancements they could bring to both the training and inference phases in the AI domain, paving the way for more efficient and powerful machine learning models.

Setup:

To ensure transparency and provide a complete picture of our testing process, below is a list of some of the key components and software used during our inference tests:

  • Operating System: Ubuntu 22.04 LTS.
  • Containerization: Tests were conducted within Docker containers to ensure a consistent and isolated environment. (Docker version: 24.0.7, build afdd53b)
  • Python Version: Python 3.10.12
  • Deep Learning Framework: PyTorch (2.3.0.dev-rocm6.0)
  • Optimization library: DeepSpeed (0.13.1)

Additional Tools/Software: (some more impactful than others ;-))

  • Bitsandbytes (0.42.0-devp)
  • Diffusers (0.25.0)
  • Flash-attn (0.2.0-devp)
  • Transformers (4.37.0)


Additional Notes:

Please note that our tests were conducted using a range of open-source tools and scripts.
While these tools are widely available, we made some minor modifications and optimizations to enhance performance specific to our testing environment and the unique capabilities of our hardware. 
These adjustments were crucial in achieving the results we've shared.

Up Next:

Stay tuned for our next blog post, where we might dive into the world of AI-driven image generation, focusing on the impressive capabilities of our MI300X accelerators.
Alternatively, we could explore how the MI300X excels at 8-bit quantization.
We'll either share insights from our work with the Stable Diffusion XL model, highlighting the impact of model recompilation on performance, or delve into the technical nuances of 8-bit quantization and its benefits.
This upcoming piece will offer a glimpse into the evolving potential of AI in image generation or the technical advancements in AI processing efficiency.

> Join us as we continue exploring the cutting-edge of AI technology <
