2026 Spring

Specific Requirements

We adopt a topic-based organization this semester. Presenters should select papers that fit the semester topics. Papers are no longer restricted to SOSP, OSDI, or any other specific venue, provided that the paper is confirmed with the organizers at registration time and meets the quality bar.
The presentation format is flexible. For one paper, a full discussion should last 45–50 minutes, while a sharing presentation should last 30–35 minutes. Presenters are expected to prepare around 30–40 slides.
The presentation should clearly introduce the main idea, technical contributions, and key strengths and weaknesses of the paper. Relevant background and related work may also be included when appropriate.

Other Information

The recorded video and a text summary of each session will be uploaded to bilibili and zhihu as soon as possible.

Schedule

April 7

💡 Kick-off meeting
🙎‍♂️ Zewen Jin, Chizheng Fang, Yuzhe Li, Mulong Li and Shen Fu
📕 slides

April 14

Topic I

💡 [CVPR26] AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation
🙎‍♂️ Haoyue Tan
📕 slides
📃 Q&A summary, 📺 video

Topic II

💡 [arXiv] IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
🙎‍♂️ Ruibo Liu, Ouxiang Zhou
📕 slides
📺 video

April 21

Topic I

💡 [arXiv] Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection
🙎‍♂️ Chengjie Tang, Shen Fu
📕 slides
📺 video

April 28

Topic I

💡 [arXiv] PROBE: Co-Balancing Computation and Communication in MoE Inference via Real-Time Predictive Prefetching
🙎‍♂️ Qinghe Wang
📕 slides

Topic II

💡 [arXiv] FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving
🙎‍♂️ Long Zhao
📕 slides

May 12

💡 Comprehensive introduction to the DeepSeek-V4 technical report (Part I)
🙎‍♂️ Chengru Yang, Chengjie Tang, Ouxiang Zhou, Shen Fu, Ruibo Liu, Yinhe Chen
📕 slides

May 19

💡 Comprehensive introduction to the DeepSeek-V4 technical report (Part II)
🙎‍♂️ Congkun Ai, Yuzhe Li, Jiahui Tan, Chenhan Wang, Chizheng Fang
📕 slides

May 26

💡 [arXiv] JANUS: Disaggregating Attention and Experts for Scalable MoE Inference
🙎‍♂️ Chizheng Fang
📕 slides

June 2

Topic I

💡 [MLSys26] AccelOpt: Self-improving Agents for AI Accelerator Kernel Optimization
🙎‍♂️ Yuhang Wang, Jingwen Sun
📕 slides

Topic II

💡 [arXiv] CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
🙎‍♂️ Xingye You, Ziqi Chen

June 9

Topic I

💡 [ICLR26] SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations
🙎‍♂️ Jiahui Tan

Topic II

💡 [PLDI26] Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs
🙎‍♂️ Congkun Ai, Zewen Jin

June 16

Topic I

💡 Optimizing Fused Attention Operators on the Ascend Decoupled Architecture
🙎‍♂️ Huaman Zhou

Topic II

💡 [MLSys26] BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding
🙎‍♂️ Zijian Dai, Sen Han
📕 slides

June 23

Topic I

💡 [SIGMOD26] Twenty Years of Bigtable
🙎‍♂️ Chuannan Zhang
📕 slides

June 30

Topic I

💡 [FAST26] OdinANN: Direct Insert for Consistently Stable Performance in Billion-Scale Graph-Based Vector Search
🙎‍♂️ Mulong Li
📕 slides

Topic II

💡 [SOSP25] Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market
🙎‍♂️ Bosen Yang, Jiyang Wang
📕 slides

July 7

Topic I

💡 [arXiv] ReMP: Low-Downtime Runtime Model-Parallelism Reconfiguration for LLM Serving
🙎‍♂️ Jiaan Zhu
📕 slides

Topic II

💡 [OSDI25] To PRI or Not To PRI, That’s the Question
🙎‍♂️ Yicheng Zhang, Luofan Chen
📕 slides

July 14

Topic I

💡 [OSDI26] Strata: Hierarchical Context Caching for Long Context Language Model Serving
🙎‍♂️ Mingxuan Liu
📕 slides

Topic II

💡 [MLSys26] fabric-lib: RDMA Point-to-Point Communication for LLM Systems
🙎‍♂️ Mingxuan Liu
📕 slides