Native GPU kernels in Rust via cuda-oxide
Tooling
Comments
Look at Mojo. The ambition to unify host and device code usually hits a wall when it encounters hardware-specific memory alignment requirements.
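To make the alignment point concrete, here is a minimal host-side sketch in plain Rust. It is not cuda-oxide code and uses no GPU API. The `Vec3` and `Vec3Aligned` types are hypothetical, and the 16-byte figure is an assumption chosen to match a 128-bit vector load. It only shows that the same three fields get a different size and alignment once the layout is pinned, which is the kind of detail a single-source toolchain has to keep identical on both host and device.

```rust
use std::mem::{align_of, size_of};

// Hypothetical payload: three floats with plain C layout.
#[allow(dead_code)]
#[repr(C)]
#[derive(Clone, Copy)]
struct Vec3 {
    x: f32,
    y: f32,
    z: f32,
}

// Same fields, but forced to 16-byte alignment (assumed here to match
// a 128-bit vector load). The compiler pads the size up to 16 bytes.
#[allow(dead_code)]
#[repr(C, align(16))]
#[derive(Clone, Copy)]
struct Vec3Aligned {
    x: f32,
    y: f32,
    z: f32,
}

fn main() {
    println!(
        "Vec3:        size {:2}, align {:2}",
        size_of::<Vec3>(),
        align_of::<Vec3>()
    );
    println!(
        "Vec3Aligned: size {:2}, align {:2}",
        size_of::<Vec3Aligned>(),
        align_of::<Vec3Aligned>()
    );
}
```

An array of `Vec3` packs elements every 12 bytes, so most elements do not start on a 16-byte boundary, while `Vec3Aligned` spends 4 bytes of padding per element to guarantee it.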
debugging IR is a theoretical worry. the practical failure point is usually whether the toolchain supports existing cuda libraries or if we have to rewrite every common kernel from scratch.
doubt the single-source claim holds for anything beyond a very restricted rust subset.
restricted subsets are a feature, not a bug. the real point is that nvlabs is finally conceding the cuda c++ toolchain is a relic. why wrap a mess when you can rebuild the pipeline?
we saw this pattern with early opencl wrappers. the challenge was never the language, it was the memory model mismatch.
the pliron layer is huge... it allows them to tap into existing mlir optimization passes for tensor operations... that's where the actual speed comes from!
Suppose the MLIR layer introduces optimizations that deviate from expected Rust semantics. Would that make debugging kernel panics significantly more difficult than working with a traditional DSL?