China has found a clever workaround for NVIDIA's restricted AI accelerators, as DeepSeek's latest project dramatically enhances performance on Hopper H800 AI accelerators.
DeepSeek’s FlashMLA: Transforming China’s AI Power with NVIDIA’s Hopper GPUs
It looks like China is forging its own path in tech innovation, with companies like DeepSeek leveraging software ingenuity to push hardware boundaries. DeepSeek’s new development is an eye-opener. The company has figured out how to maximize output from NVIDIA’s "cut-down" Hopper H800 GPUs by fine-tuning memory use and optimizing how resources are allocated during tasks.
During DeepSeek's ongoing "Open Source" week, the company revealed FlashMLA, a decoding kernel built for NVIDIA's Hopper GPUs. Before diving into how it works, let's look at the improvements it claims to deliver, which are quite extraordinary.
DeepSeek reports that FlashMLA achieves 580 TFLOPS for BF16 matrix multiplication on the Hopper H800, which the company puts at nearly eight times what standard kernels typically sustain. On top of that, its memory management pushes effective bandwidth to 3,000 GB/s, approaching the H800's theoretical peak. Impressively, these gains come from savvy coding rather than any hardware changes.
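For readers who want a feel for the kernel, here is a minimal sketch following the usage pattern published in DeepSeek's FlashMLA repository. It assumes a Hopper GPU with the flash_mla package built, and the tensor shapes (576-dim queries, 512-dim values, 64-token cache blocks) are illustrative assumptions rather than a definitive recipe.

```python
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# Decoding step: one query token per sequence, many query heads sharing
# a single latent KV head (shapes are assumptions for illustration).
batch, s_q, h_q, h_kv = 4, 1, 128, 1
d, dv, block_size, num_blocks = 576, 512, 64, 256

cache_seqlens = torch.full((batch,), 1024, dtype=torch.int32, device="cuda")
q = torch.randn(batch, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")
kvcache = torch.randn(num_blocks, block_size, h_kv, d,
                      dtype=torch.bfloat16, device="cuda")
# Paged cache: each row maps a sequence's logical blocks to physical blocks.
block_table = torch.arange(batch * 64, dtype=torch.int32,
                           device="cuda").view(batch, 64)

# Scheduling metadata is computed once per decoding step and reused.
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens, s_q * h_q // h_kv, h_kv)

out, lse = flash_mla_with_kvcache(
    q, kvcache, block_table, cache_seqlens, dv,
    tile_scheduler_metadata, num_splits, causal=True)
```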
DeepSeek's FlashMLA uses "low-rank key-value compression," which squeezes the attention key-value cache into a compact latent representation that is decompressed on the fly, cutting memory use by a reported 40%-60%. Another smart feature is its block-based paging system, which allocates cache memory in fixed-size blocks on demand rather than reserving it up front, outperforming fixed allocation methods. This is especially useful for workloads with variable-length sequences; both ideas are sketched in code below.
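To make the two ideas concrete, here is a minimal sketch assuming illustrative sizes (a 256-dimensional latent, 64-token blocks), not FlashMLA's actual internals. The first half caches a small shared latent instead of full per-head keys and values; the second allocates cache blocks only as tokens arrive.

```python
import numpy as np

# --- Low-rank KV compression (toy dimensions, assumed) ---
n_heads, d_head, d_model, d_latent = 16, 64, 1024, 256
seq_len = 4096
rng = np.random.default_rng(0)

# Learned projections (random stand-ins here): compress hidden states into
# a small latent, then expand back to per-head keys/values on demand.
W_down = rng.standard_normal((d_model, d_latent)) * 0.02
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02

hidden = rng.standard_normal((seq_len, d_model))

# Only the latent is cached; K and V are reconstructed when attention runs.
kv_latent = hidden @ W_down                       # (seq_len, d_latent)
k = (kv_latent @ W_up_k).reshape(seq_len, n_heads, d_head)
v = (kv_latent @ W_up_v).reshape(seq_len, n_heads, d_head)

full_cache = 2 * seq_len * n_heads * d_head       # floats for naive K+V cache
latent_cache = seq_len * d_latent                 # floats for the latent cache
print(f"latent cache is {latent_cache / full_cache:.1%} of a naive K/V cache")

# --- Block-based paging: physical blocks are allocated only when a token
# actually lands in a new logical block, so short sequences waste nothing. ---
BLOCK = 64                                        # tokens per block (assumed)
block_table = {}                                  # logical -> physical block
free_blocks = list(range(1024))
for pos in range(seq_len):
    lb = pos // BLOCK
    if lb not in block_table:
        block_table[lb] = free_blocks.pop()       # allocate on demand
print(f"{len(block_table)} blocks of {BLOCK} tokens cover {seq_len} tokens")
```

The exact memory saving depends on how small the latent is relative to the full per-head cache; the ratio printed here is a property of the toy sizes above, not a measurement of FlashMLA.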
DeepSeek's advancements highlight that AI performance does not hinge on raw hardware alone but on a tapestry of factors, with software optimization playing an outsized role. For now, FlashMLA is tailored to Hopper GPUs, but it's exciting to imagine the performance leaps we might see when it runs on the uncut H100.