ZLUDA, the CUDA translation layer that nearly shut down last year before being saved by an unknown party, this week shared an update on its steady technical progress and team expansion over the past quarter, reports Phoronix. The project continues to build out its ability to run CUDA workloads on non-Nvidia GPUs; for now, the focus is primarily on AI rather than other workloads. Still, work has also begun on 32-bit PhysX support, which is required for compatibility with older CUDA-based games.
Perhaps the most important news for the ZLUDA project is that its development team has grown from one to two full-time developers. The second developer, Violet, joined less than a month ago and has already delivered important improvements, particularly in advancing support for large language model (LLM) workloads through the llm.c project, according to the update.
32-bit PhysX
A community contributor named @Groowy began the initial work on 32-bit PhysX support in ZLUDA by collecting detailed CUDA logs, which quickly revealed several bugs. Since some of these problems could also affect 64-bit CUDA functionality, fixing them was added to the official roadmap. Completing full 32-bit PhysX support, however, will still depend on further help from open-source contributors.
Compatibility with llm.c
The ZLUDA developers are working on support for llm.c, a small example program that runs a GPT-2 model using CUDA. While the test itself is modest, it matters because it is the first time ZLUDA has had to handle both the core CUDA functions and specialized libraries such as cuBLAS (Nvidia's accelerated linear algebra library) in a single workload.
The test program makes 8,186 separate calls to CUDA functions, spread across 44 different APIs. Initially, ZLUDA crashed on the very first call; thanks to many updates contributed by Violet, it now gets as far as the 552nd call before failing. The team has already completed support for 16 of the 44 required functions, bringing it closer to running the whole test successfully. Once it does, that groundwork should help ZLUDA support bigger frameworks such as PyTorch in the future.
Improving ZLUDA's accuracy
ZLUDA's core objective is to run standard CUDA programs on non-Nvidia GPUs while matching the behavior of Nvidia hardware as precisely as possible. This means each instruction must either deliver bit-identical results or stay within strict numerical tolerances compared with Nvidia hardware. Earlier versions of ZLUDA, before the major code reset, often compromised on accuracy by skipping certain instruction modifiers or failing to maintain full precision.
The current implementation has made substantial progress on this front. To ensure accuracy, it runs PTX "sweep" tests (systematic checks written against Nvidia's intermediate GPU language) to confirm that every instruction and modifier combination produces correct results across all inputs, an approach the project had not used before. Running these checks exposed several compiler defects, which have since been fixed. ZLUDA admits that not every instruction has gone through this rigorous validation yet, but stresses that some of the most complex cases, such as the cvt instruction, are now confirmed bit-accurate.
Improving logging
The foundation for getting any CUDA-based software to work on ZLUDA, whether a game, a 3D application, or an ML framework, is a log of how the program communicates with CUDA. That means tracking direct API calls, undocumented parts of the CUDA runtime (and drivers), and any use of specialized performance libraries.
With the recent update, ZLUDA's logging system has been significantly upgraded. The new implementation captures a wider range of activity that was not visible before, including detailed traces of internal behavior, such as when cuBLAS relies on cuBLASLt or how cuDNN interacts with the lower-level Driver API.
Runtime compiler compatibility
Modern GPU frameworks such as CUDA, ROCm/HIP, ZLUDA, and OpenCL all need to compile device code dynamically at run time, so that older GPU programs can still be built and executed correctly on newer hardware generations without changes to the original code.
In AMD's ROCm/HIP ecosystem, this on-the-fly compilation depends on the comgr library (short for ROCm-CompilerSupport), a compact library with extensive capabilities to handle tasks like compiling, linking, and disassembling code, available on both Linux and Windows.
With ROCm/HIP version 6.4, a significant application binary interface (ABI) change occurred: the numeric codes representing actions were rearranged in a new v3 ABI. As a result, ZLUDA could accidentally invoke the wrong operation, for example attempting to link instead of compile, which led to errors. The situation was worse on Windows, where the library reported itself as version 2.9 but internally used the v3 ABI, mixing the two behaviors. The ZLUDA team has also addressed these problems recently.