Not as impenetrable as you might think, but still more than Intel or AMD would like
Analysis Nvidia is facing its stiffest competition in years with new accelerators from Intel and AMD that challenge its best chips on memory capacity, performance, and price.
However, it's not enough just to build a competitive part: you also have to have software that can harness all those FLOPS -- something Nvidia has spent the better part of two decades building with its CUDA runtime.
Nvidia is well established within the developer community. A lot of codebases have been written and optimized for their specific brand of hardware, while competing frameworks for low-level GPU programming are far less mature. This early momentum is often referred to as "the CUDA moat."
But just how deep is this moat in reality?
As you might have already guessed, the answer to that question really depends on what you're trying to achieve. If you're doing low-level programming for GPUs, the CUDA moat is very real. Existing code has to be ported, refactored, and optimized to run on alternative hardware.
This is hard to avoid as certain hardware calls that exist in CUDA and Nvidia chips simply don't exist for Intel or AMD hardware and vice versa. This means that bringing code originally developed for CUDA three, four, or even 10 years ago to AMD's ROCm or Intel's OneAPI is a commitment on the part of developers.
Because of this, Intel and AMD have invested heavily in tools to automate the process of translating CUDA source code to run on their respective platforms. AMD has HIPIFY, which helps to automate the conversion of CUDA to HIP C-plus-plus code.
"If somebody is actually doing some kernel authoring, and they're very used to writing it in CUDA, or they actually have a bunch of CUDA code lying around, I would argue, we're the only other real alternative to be able to have a smooth migration path, because we have HIPIFY, you get HIP C-plus-plus, and you can compile," AMD SVP of the AI Group Vamsi Boppana told El Reg.
And while there may be some truth to that, HIPIFY isn't perfect. One of the challenges our sibling site The Next Platform has previously highlighted is that HIPIFY doesn't take into account device-side template arguments in texture memory or multiple CUDA header files, and thus requires manual intervention by developers.
Intel, meanwhile, has SYCL, which is similar to HIPIFY in that it handles most of the heavy lifting -- purportedly up to 95 percent -- of porting CUDA code to a format that can run on non-Nvidia accelerators including those from AMD and others.
Finally, we'd be remiss not to mention that AMD quietly funded the development of a project to enable untranslated CUDA code to run natively on its hardware. However, earlier this year, it abandoned the effort and took steps to quash further work by the developers for reasons we've previously discussed in depth.
But while the CUDA moat is certainly a reality for developers looking to expand support for alternative hardware platforms, the number of devs writing code at a kernel level is relatively few, or at least that's what executives from both Intel and AMD tell El Reg.
"A couple years ago it was all CUDA, and people were programming at the direct level. But now, we see most developers are programming at the PyTorch level or other frameworks," Bill Pearson, VP of datacenter and AI software at Intel, told The Register in a recent interview.
AMD is seeing something quite similar with its Instinct accelerators, which have enjoyed a sudden uptick in adoption over the last year. "The reality is people want to write at higher levels of abstraction," Boppana said.
PyTorch in particular has become a go to for many AI chip companies peddling an Nvidia alternative. It's not the only high-level programming language for modern accelerators, there's also TensorFlow, JAX, and probably something else we're forgetting, but it's among the most popular.
AMD, for its part, has enjoyed native ROCm support with PyTorch for years, while support for Intel GPUs started rolling out earlier this year. We'll touch on IPEX and Gaudi's special brand of PyTorch in a bit, but before that, let's talk about PyTorch in general as it's not necessarily the silver bullet that chipmakers sometimes make it out to be.
The idea behind PyTorch is that it exists above frameworks like CUDA, ROCm, or OneAPI and simply calls the appropriate backend based on the hardware installed in the system. In theory, this means that code written for PyTorch should run on just about anything that supports it.
"For people that are using modern frameworks like PyTorch with a set of standard libraries that are supported, I think it's an extremely easy, low friction path to come to our GPUs," Boppana said.
In practice, we're not quite there yet. Just because PyTorch works on these accelerators doesn't mean there aren't still headaches.
The fact remains that many of the libraries and modules used to build PyTorch apps have been slow to add support for alternative architectures. Because of this, there's often a degree of refactoring required to get existing scripts to run.
BitsandBytes is just one example. The quantization library is commonly used in inferencing and QLORA finetuning to reduce the memory footprint of LLMs so that these workloads can be completed on a single GPU or node.
Unfortunately, until very recently, BitsandBytes hasn't supported either Intel or AMD hardware natively. This means you can't exactly run and expect it to work, as you would on Nvidia hardware. Instead, Intel and AMD users have had to track down a vendor-specific fork of the code and hope it works with the latest version of PyTorch and Python.
To be clear, it's not just BitsandBytes. This is the reality for a lot of libraries out there.
"Developers are expecting libraries to be there. You've got GEMMs that are already built and performance and optimized, and we have to make sure that our versions of those same GEMMs and libraries and things are there," Pearson explained.
Often, this involves chipmakers working with the community to fork the codebase, modify it, run on their hardware, and then, ideally, contribute the tweaks back to the mainline branch.
"If we think that there is a dominant market need for a specific library or a set of technologies then we will gravitate and push for that. More importantly we want to say the community will push for it, because we can only do so many of those," Boppana said. "If the community starts making contributions, then we would absolutely foster that."
The good news is this is very much happening.
"Those dependencies, those little blocks that are unique and depend on the lower layers, are still there in some cases, but they're increasingly rare, and they're going away, you know, bit by bit," Pearson said.
Since first running into compatibility issues with BitsandBytes, we've observed that support for both AMD and Intel GPUs has been extended via an experimental "multi-backend" version. However, at the time of publication, installing it still isn't as simple as on Nvidia hardware.
But while support is improving, there's still plenty of ground to cover.
As you might imagine, the fragmentation of compatible libraries creates something of a software compatibility minefield. It's all too easy to find yourself in a scenario where unless you have the right version of Python, PyTorch, BitsandBytes, and who knows what else, your script will just error out.
To be perfectly fair, Nvidia customers are bound to run into this too, but it's only made more complicated by needing to track down and in many cases compile compatible versions of popular libraries.
Intel, AMD, and Nvidia have taken steps to mitigate some of these challenges by offering preconfigured container images that serve as dev environments. As we've previously explored, these container images can be as simple as a preconfigured ROCm, OneAPI, or CUDA environment or include a full build out PyTorch install.
"We have a container, for example, the PyTorch container, you can go and grab for Gaudi that has all the libraries that are needed," Pearson explained.
The availability of these containers only becomes even more relevant when you realize that PyTorch support doesn't always mean the same thing across hardware vendors.
This is particularly true of Intel, which offers custom versions of PyTorch tuned for its GPUs and Gaudi3 accelerators. Head over to the PyTorch website, scroll down, and you'll quickly realize there isn't an option for OneAPI or Gaudi? That's because PyTorch support on Gaudi accelerators relies on a custom version of the library developed by Intel.
It was a similar situation for Intel GPUs up until recently, when native PyTorch support was added. Native support is still in preview, so it might not be obvious from the PyTorch homepage, but it is there and we've tested it.
However, up until Intel's GPUs added native PyTorch support, they relied on a custom build called Intel Extension for PyTorch, or IPEX for short. The package includes a variety of performance optimizations and libraries intended to make running code on its accelerators more seamless.
"We've done the work in terms of optimizing the in terms of first building the libraries and then optimizing the GEMMs and things inside those libraries, and then being able to have a developer easily take and either write PyTorch code using templates that we provide, or take our existing code and just shift it from CUDA to Gaudi," Pearson explained.
"As they do that their experience will be largely, I'll call it easy, in that it's not zero work, but it's also not a lot of work to go and do that. It involves changing the target from Nvidia to Gaudi, and then changing the output to those same targets," he added.
While we haven't tested Intel's Gaudi platform, we can say that a lot of PyTorch scripts out there only required a few tweaks to get them to run under IPEX.
With chipmakers pushing for native support for popular frameworks and libraries, it's becoming less of a question of whether something will work and more often a question of how performant it is.
One of the things that makes building applications for x86 or Arm CPUs so easy is the hardware needed to do so is everywhere. You can start building on your notebook, desktop, or workstation and scale up to a dedicated build environment with a CICD pipeline as it matures.
It's a similar situation for developers building on Nvidia hardware. With a few exceptions, CUDA runs the same on a mobile GPU as it does on a $30,000 H100. This means that if you've got a system with an Nvidia GPU onboard, you've already got everything you need to start developing.
Things aren't quite as seamless for its rivals. In the desktop and workstation space, AMD has its Radeon and Radeon Pro GPUs, which use its RDNA microarchitecture, while its datacenter-focused Instinct chips use its CDNA architecture. Despite these differences, AMD extended ROCm support to select 7000-series Radeon cards in an attempt to shore up its developer pipeline.
For the most part, we've found that things just work. With that said, we have run into trouble with things like Flash Attention 2, a rather important library that helps to keep the memory consumption of genAI models manageable at larger context lengths. Flash Attention 2 has been supported on Instinct parts for a while now, while efforts to bring the library to Radeon are still ongoing.
In the case of Flash Attention, the advent of ahead-of-time Triton kernel libraries for efficient memory utilization can be used to overcome this limitation in some scenarios. For example, we used an experimental AoT Triton kernel in our recent Axolotl finetuning guide.
Much of this is down to market prioritization. As you might expect, the number of folks that want to use gaming GPUs to write ML apps is relatively small compared to those trying to build training clusters and inference massive trillion-plus parameter models.
"We are super focused still on ensuring that Instinct is where a lot of our priorities go to," Boppana admitted, adding that despite MI300X's debut a little over a year ago, the part is widely available in the cloud either via long-term contracts or on-demand.
Boppana, however, recognizes the need for workstation class hardware. "My own personal view is that (the cloud) cannot be the only approach, and developers love a workstation below their desk so they can hack away," he said.
The situation is even more complicated for Intel. Intel's flagship AI accelerator is Gaudi3 and at least for the moment, there aren't any workstation versions of the part.
What's more, Gaudi accelerators don't actually support OneAPI and instead rely on Habana's own SynapseAI software stack. This means developers looking to build apps on Gaudi are really limited to using Intel's Tiber developer cloud.
With that said, Intel's GPUs, including its Arc gaming cards, all support OneAPI. So if you wanted to write code that can scale up or out to a cluster of Datacenter Flex or GPU Max cards, you certainly could. Having said that, we have observed that software support usually comes to datacenter Flex first before arriving on Arc.
The other elephant in the room is that GPU Max -- AKA Ponte Vecchio -- has been sunset. So outside of Argonne National Labs, which has more than 60,000 of them in the Aurora system, we're not sure how many folks will be writing code for them in production.
This may change with the introduction of Falcon Shores next year, which will feature a GPU architecture along with some Gaudi-like design elements onboard. Presumably the part will also support OneAPI as well.
While the CUDA moat remains an ongoing challenge for developers and chipmakers alike, if all you want to do is serve up LLMs at scale, it's not really something you need to worry about at all.
Regardless of what hardware you opt for - whether it be Intel, AMD, Cerebras, SambaNova, Qualcomm, or someone else - all of these vendors have either developed, or contributed the necessary code needed to get LLMs spewing out tokens.
Pretty much everyone has a framework for streamlining the deployment of chatbot style applications like retrieval augmented generation (RAG) on their hardware as well. If you're not familiar with the concept, we encourage you to check out our hands-on guide here.
However, in some cases, the ability to serve up an OpenAI-compatible API server is all developers are really looking for. Many AI applications today are really just wrappers built around an OpenAI, Anthropic, or Mistral AI API server, which means the code never has to interface with a GPU or AI accelerator directly.
It also means that code is relatively portable. For example, a proof-of-concept could be built using Anthropic's Claude and then the API server and key could be swapped out for a local instance of vLLM or TensorRT-LLM once it's put in production for security or compliance reasons.
But while just about any AI accelerator can serve up an LLM, that doesn't mean it's as performant or efficient as it could be.
In its most recent MLPerf Inference submissions, AMD showed its MI300X accelerators performing on par with Nvidia's venerable H100. In theory, the part's higher memory bandwidth and faster floating point perf should give it a bigger advantage, but for a first submission, that's not nothing.
Since then, we've seen a number of updates to ROCm designed to improve the performance of popular model runners including vLLM and SG-Lang that promise to unlock additional performance.
And while Guadi3 has yet to make its debut on either MLPerf training or inference, for a long time, Gaudi2 was the only competing part to warrant a mention from Nvidia.
Suffice to say, for a lot of developers, Nvidia's CUDA moat isn't nearly as deep as you might think, but still deeper than AMD or Intel would like. ®