February 10, 2024

Diving Deeper: Insights from Our LLM Inference Testing, Part 2

In our latest blog post, we delve into the enhanced performance of AMD Instinct™ MI300X accelerators for LLM inference, highlighting significant improvements in throughput efficiency and scalability following key updates to our software and hardware setup. Discover how these advancements are shaping the future of AI and machine learning in our comprehensive analysis of updated benchmarks.


Introduction

Following our previous exploration of the AMD Instinct™ MI300X accelerators' capabilities in Large Language Model (LLM) inference, we've recently made some pivotal updates to our testing environment, leading to even more compelling results. In this short post, we'll walk through these updates and their impact on our benchmarks, focusing on the ZeRO-inference technique with the BigScience BLOOM large language model.
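For readers who haven't used ZeRO-inference before, the rough shape of such a run looks like the sketch below. This is a minimal, illustrative example under stated assumptions (Hugging Face Transformers plus DeepSpeed ZeRO stage 3 with CPU parameter offload), not our actual benchmark harness; the model name, config values, and prompt are placeholders.

    # Minimal ZeRO-inference sketch (illustrative, not our benchmark script).
    # ZeRO stage 3 partitions and offloads the model weights so a very large
    # checkpoint can generate on GPUs that cannot hold it resident in full.
    import torch
    import deepspeed
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from transformers.integrations import HfDeepSpeedConfig  # import path varies by transformers version

    model_name = "bigscience/bloom"  # placeholder checkpoint id

    ds_config = {
        "bf16": {"enabled": True},  # we benchmarked in bfloat16
        "zero_optimization": {
            "stage": 3,
            "offload_param": {"device": "cpu", "pin_memory": True},
        },
        "train_micro_batch_size_per_gpu": 1,  # required config field, unused for inference
    }

    # Must be constructed before from_pretrained so weights are partitioned as they load.
    dschf = HfDeepSpeedConfig(ds_config)

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

    engine = deepspeed.initialize(model=model, config=ds_config)[0]
    engine.module.eval()

    inputs = tokenizer("DeepSpeed ZeRO-inference lets us", return_tensors="pt").to("cuda")
    with torch.no_grad():
        out = engine.module.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(out[0], skip_special_tokens=True))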

Key Updates:

Docker Image: We transitioned to a new Docker image, rocm/pytorch:rocm6.0.2_ubuntu22.04_py3.10_pytorch_2.1.2_v2, aligning more closely with our hardware capabilities.
Python Version: Upgraded from Python 3.10.12 to Python 3.10.13, ensuring we leverage the latest stable features and optimizations.
ROCm Version: Updated to ROCm version 6.0.2, enhancing our compute performance and efficiency.
DeepSpeed: Compiled DeepSpeed (0.12.3-devp) from source, specifically optimized for the MI300X, to enhance our model training and inference processes.
PyTorch Version: Reverted to PyTorch 2.1.2 from 2.3.0-dev-rocm6.0, finding it to be more stable and efficient for our specific use case.
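A quick way to confirm that a rebuilt container actually matches the versions above is a small sanity-check script along the following lines. This is a hypothetical helper, not part of our benchmark harness; the expected values in the comments are simply the versions listed above.

    # Print the versions of the software stack described above. On ROCm builds
    # of PyTorch, torch.version.hip is populated and the "cuda" device API
    # maps to the AMD GPUs.
    import sys
    import torch
    import deepspeed

    print("Python   :", sys.version.split()[0])   # expected 3.10.13
    print("PyTorch  :", torch.__version__)        # expected 2.1.2
    print("HIP/ROCm :", torch.version.hip)        # expected a 6.0.x build
    print("DeepSpeed:", deepspeed.__version__)    # expected 0.12.3 (built from source)
    print("GPUs     :", torch.cuda.device_count())
    if torch.cuda.is_available():
        print("Device 0 :", torch.cuda.get_device_name(0))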

Updated Benchmark Analysis:

Our new ZeRO-inference results are as follows:

Batch size     1        8       16      32     64     128    256
ms per token   214.11   27.08   13.52   6.77   3.41   1.74   OOM

DS-ZeRO, batch size 1, bfloat16

Our previous ZeRO-inference results are as follows:

Batch size     1        8       16      32     64     128    256
ms per token   267.35   32.72   16.92   9.34   5.00   2.62   OOM

Comparing these to our previous results, we observe a significant improvement across all batch sizes: per-token latency drops by roughly 17% to 34%, with the largest gains at the larger batch sizes (a short script quantifying this follows the breakdown below). Notably:

Small Batch Sizes (1, 8, 16):
Reduced latency was observed, indicating more efficient processing with the updated setup.

Medium to Large Batch Sizes (32, 64, 128):
The continued decline in milliseconds per token at these sizes reinforces the MI300X's capability in managing larger workloads effectively.

Batch Size 256:
'Out of memory' (OOM) errors persist at this batch size, consistent with our previous findings and indicative of the upper limit of this configuration's memory capacity.
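The size of the improvement can be quantified directly from the two tables above. The short script below uses only the numbers reported there and prints the relative reduction in per-token latency at each batch size, which works out to roughly 17–34%:

    # Per-token latency (ms) from the tables above; the OOM entry at 256 is omitted.
    batch_sizes = [1, 8, 16, 32, 64, 128]
    previous    = [267.35, 32.72, 16.92, 9.34, 5.00, 2.62]
    updated     = [214.11, 27.08, 13.52, 6.77, 3.41, 1.74]

    for bs, old, new in zip(batch_sizes, previous, updated):
        reduction = (old - new) / old * 100.0
        print(f"batch {bs:>3}: {old:>7.2f} ms -> {new:>6.2f} ms  ({reduction:4.1f}% lower)")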

Conclusion:

These updated benchmarks with the BLOOM 176B LLM and our revised setup underscore how quickly the AI and machine learning software stack is evolving. Our continued efforts to refine and optimize our hardware and software configuration have produced notable performance improvements, reaffirming the MI300X's position as a powerhouse for AI workloads. We remain committed to pushing the boundaries of what's possible and to sharing our insights along the way.

Check out the original post for more detailed information and references.
