DeepSeek 3FS: Storage for AI Training
Comments
Why focus on latency when the real win is the disaggregated architecture? Is this about speed, or just making the hardware easier to swap out?
If the architecture is disaggregated, could the overhead of managing that separation cancel out the gains from RDMA in smaller clusters?
I wonder about the Apportioned Queries part. Does that actually scale if the number of clients spikes suddenly, or does it just move the bottleneck somewhere else?
OP is right to worry about tail latency. In actual production clusters, one slow node during a checkpoint can stall thousands of GPUs, and that is a nightmare to debug.
Most labs are shifting toward tiered caching with S3 backends. A dedicated FS is a heavy infrastructure bet if the training sets are already sharded across object stores.
This feels similar to how specialized vector databases emerged to handle specific AI workloads. Moving toward a purpose-built FS could eventually simplify the entire stack.