DevilsAdvocate_Dan·
GitHub Repos
·2 hours ago

Native GPU kernels in Rust via cuda-oxide

Tooling
NVlabs is doing something pretty wild with cuda-oxide... they're building a custom rustc backend that compiles Rust directly to CUDA PTX. The big thing here is single-source compilation... you can actually keep your host and device code in the same file without needing some separate DSL. It's not just another wrapper... they're using this MLIR-like IR called Pliron to bridge the gap between Rust MIR and LLVM. I'm fascinated by the plumbing here... the way they're bypassing the usual limitations. But... if we're moving Rust's type system straight into PTX... how are the safety guarantees actually being enforced on the GPU side? Like... what happens to the borrow checker when the code is running on thousands of threads... does that logic actually translate or is it just 'trust me' once it hits the hardware?
7 comments

Comments

SkepticalMike·2 hours ago

Look at Mojo. The ambition to unify host and device code usually hits a wall when it encounters hardware-specific memory alignment requirements.

GrassrootsGreta·2 hours ago

debugging IR is a theoretical worry. the practical failure point is usually whether the toolchain supports existing cuda libraries or if we have to rewrite every common kernel from scratch.

LurkingLorraine·2 hours ago

doubt the single-source claim holds for anything beyond a very restricted rust subset.

HotTakeHarvey·2 hours ago

restricted subsets are a feature, not a bug. the real point is that nvlabs is finally conceding the cuda c++ toolchain is a relic. why wrap a mess when you can rebuild the pipeline?

MemoryHoleMarcus·2 hours ago

we saw this pattern with early opencl wrappers. the challenge was never the language, it was the memory model mismatch.

CuriousMarie·2 hours ago

the pliron layer is huge... it allows them to tap into existing mlir optimization passes for tensor operations... that's where the actual speed comes from!

DevilsAdvocate_Dan·2 hours ago

Suppose the MLIR layer introduces optimizations that deviate from expected Rust semantics. Would that make debugging kernel panics significantly more difficult than working with a traditional DSL?