April 11, 2024

Advanced Optimization in Image Generation: Boosting Iterations per second by up to 128%

In our latest blog, we delve into enhancing AI image generation using the Stable Diffusion XL model and our custom framework, Paiton. We've achieved a significant speed increase, from ~5 to ~11 iterations per second, by optimizing model architecture and kernel functions. This breakthrough with the AMD Instinct™ MI300X is just the beginning of our journey in AI innovation, highlighting a focus on potential rather than comparisons to other architectures.

Advanced Optimization in Image Generation: Boosting Iterations per second by up to 128%
Introduction:

Welcome back to our technical deep-dive series! 

Following our exploration of LLM inference with the AMD Instinct™ MI300X accelerators, we're now pivoting to a distinct yet equally captivating area of AI: image generation. In this installment, we're excited to share our latest advancements in enhancing the performance of image generation, specifically using the Stable Diffusion XL model. Employing the PyTorch framework and integrating the advanced capabilities of the diffusers library, we've developed a unique, custom inferencing framework (we named it Paiton :)). This approach has let us significantly boost our image generation speeds, moving from an initial rate of ~5 iterations per second (it/s) to an outstanding ~11 it/s, without the use of external optimization libraries (e.g. tomesd).
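Paiton itself isn't public yet, but the starting point is easy to reproduce with stock tooling. As a point of reference, here is a minimal sketch of a baseline SDXL setup with diffusers in fp16; the checkpoint name and prompt are illustrative assumptions, not necessarily what we benchmarked:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the public SDXL base checkpoint with half-precision (fp16) weights.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)

# ROCm builds of PyTorch expose AMD accelerators through the "cuda" device name.
pipe = pipe.to("cuda")

# Generate one 1024x1024 image with 20 denoising steps (the settings used in the
# benchmark section below).
image = pipe(
    "a photorealistic astronaut riding a horse",
    num_inference_steps=20,
    height=1024,
    width=1024,
).images[0]
image.save("baseline.png")
```

A stock pipeline along these lines is roughly where the ~5 it/s starting point comes from; everything described below happens on top of it.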

We want to clarify that this blog serves solely to demonstrate the capabilities of Paiton.
It's not intended as a comparison with other architectures; rather, it's an illustration of what we can achieve with focused effort.
Additionally, the results we've attained on one of our MI300X accelerators clearly indicate there is considerable room for improvement, and we anticipate even better performance in the future.
Once again, this blog is a showcase of our potential and a glimpse into what we can accomplish.

Optimization Journey:

Our journey began with a baseline setup built on publicly available tools like PyTorch. While that setup was effective, we knew there was untapped potential.

Here's a snapshot of our approach:

  • Transforming Model Architecture: We took a deep dive into the core architecture of our image generation models. This entailed a comprehensive review and transformation of the neural network structures, converting them for peak performance on specific hardware configurations. 
  • Advanced Kernel Function Management: One of the pivotal enhancements involved an innovative approach to kernel function management. Our technique encompassed the dynamic execution of multiple, complex kernel functions in a coordinated manner. This was crucial in reducing computational overhead while enhancing the speed and efficiency of image processing tasks (see the sketch after this list).
  • Optimized Computational Pipelines: We developed a series of optimized computational pipelines that significantly reduced latency and increased throughput. By strategically organizing data flow and processing steps, we ensured minimal wait times and maximum efficiency, particularly in the areas of data transfer and memory utilization.
  • Fine-Tuning Algorithm Efficiency: Beyond hardware and structural optimizations, we also refined the algorithms underpinning the image generation models. This included tweaking various parameters and operations to align more closely with the target hardware's capabilities, thereby squeezing out every ounce of performance possible. In short, we convert the model so that it fits the specific hardware efficiently.
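Paiton's internals (built on Composable Kernel) are out of scope for this post, but the flavour of the kernel-level work described above, reordering memory layouts and fusing elementwise work into the surrounding matmul/convolution kernels, can be approximated with stock PyTorch. The sketch below is an analogy using torch.compile, not Paiton's actual implementation:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Lay out UNet activations/weights channels-last so convolution-heavy blocks
# read memory contiguously on the accelerator.
pipe.unet.to(memory_format=torch.channels_last)

# Let the compiler trace the UNet, fuse elementwise ops into neighbouring
# matmul/convolution kernels, and autotune kernel configurations for the
# resident hardware.
pipe.unet = torch.compile(pipe.unet, mode="max-autotune", fullgraph=True)

# The first call pays the one-time compilation cost; later calls reuse the
# fused kernels.
_ = pipe("warm-up prompt", num_inference_steps=20, height=1024, width=1024)
```

The broad difference is that Paiton's conversion targets Composable Kernel rather than a general-purpose compiler.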
Benchmarking Success:

The results were nothing short of spectacular. Our benchmark tests, covered in more detail in this video, showcase the remarkable jump in image generation speed, a testament to our team's dedication and expertise.

  • Stable Diffusion XL model
  • Number of inference steps: 20
  • Half precision (fp16)
  • Generation width/height: 1024x1024
In short: a 128% improvement!
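(The ~5 and ~11 it/s figures are rounded; the 128% comes from the unrounded measurements.) Our exact benchmarking harness isn't reproduced here, but a simple way to estimate iterations per second from wall-clock time under the settings above is a small helper like the hypothetical one below:

```python
import time
import torch

def iterations_per_second(pipe, prompt, steps=20, runs=5):
    """Estimate denoising iterations per second for a diffusers pipeline.

    Hypothetical helper mirroring the benchmark settings above
    (20 steps, fp16 weights, 1024x1024); not our exact harness.
    """
    # Warm-up so one-time costs (kernel compilation, caching) don't skew results.
    pipe(prompt, num_inference_steps=steps, height=1024, width=1024)
    torch.cuda.synchronize()

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        pipe(prompt, num_inference_steps=steps, height=1024, width=1024)
        torch.cuda.synchronize()
        timings.append(time.perf_counter() - start)

    # Steps divided by mean wall-clock time per image.
    return steps / (sum(timings) / len(timings))
```

Because this times the whole pipeline call, it also includes text encoding and VAE decoding, so it will read slightly lower than the per-step rate the sampler's progress bar reports.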

Technical Setup:

To ensure transparency and provide a complete picture of our testing process, below is a list of the key components and software used during our inference tests:

  • Operating System: Ubuntu 22.04 LTS
  • ROCm: 6.0.2
  • Containerization: Tests were conducted within Docker containers to ensure a consistent and isolated environment.
    Docker version: 24.0.7, build afdd53b.
  • Deep Learning Framework:
    PyTorch (2.3.0a0, compiled from source)
  • Additional Tools/Software: (some more impactful than others ;-))
    Diffusers (0.24.0)
    Transformers (4.38.0)
  • Paiton (0.1.0: our own custom framework utilizing Composable Kernel)
Conclusion:

The strides we have made in image generation efficiency are not just benchmarks for our team; they represent the beginning of a much larger journey. Our choice to harness the capabilities of AMD over other competitors was driven by our recognition of its untapped potential and our affinity for embracing challenging technological frontiers.

We see AMD not just as a provider of advanced technology, but as a catalyst for innovation. Our decision to work with AMD stems from our belief in the immense possibilities their technology holds. The journey with AMD brings forth distinctive opportunities, and it’s precisely these opportunities that invigorate us. They push us to think differently, to innovate, and to break new ground.

This is more than just an optimization of image generation; it's a statement about the future. By choosing AMD, we're aligning ourselves with a vision of continual growth and breakthrough. We've begun to unlock the true potential of their technology, and what we've achieved so far is just the tip of the iceberg. Our work with AMD opens the door to a realm of possibilities in AI-driven applications, setting the stage for even more groundbreaking advancements in art, design, and beyond.

In embarking on this journey with AMD, we're not just chasing improved metrics; we're pioneering a path of technological innovation. We're excited for what the future holds and are confident that this partnership will lead us to redefine the boundaries of what's possible in AI and machine learning.

Additional Notes:

With the launch of Paiton (v0.1), we’re just beginning to tap into the capabilities of the AMD Instinct™ MI300X accelerators. Its advanced hardware, featuring high computational power and efficient memory architecture, holds immense potential yet to be fully utilized. Our focus now shifts to enhancing our software to harness this hardware’s full capacity. The synergy between the MI300X’s robust specifications and our optimized software solutions is a major step towards future breakthroughs in our journey of optimization.

As we share our progress with Paiton and the AMD Instinct™ MI300X accelerators, it is again important to note that this blog post is not about comparisons with other architectures. Instead, it’s an announcement and a celebration of what we’ve accomplished so far, and an exciting preview of the possibilities that lie ahead. We’re on a path of continuous innovation, and there is much more to explore and achieve.

"This is just a glimpse into the future we are building".

Up Next:

We’re already investigating the possibilities of utilizing the same techniques for LLM inference.
Stay tuned for more updates as we continue to push the boundaries of what's possible in AI!

Want to be part of this journey?
We recently partnered with NScale; come check us out and see what we can do together!

Authors:

Elio Van Puyvelde

Kian Mohadjerin
