2025 Spring

Specific Requirements

We focus on the latest papers from SOSP and OSDI, as well as papers released on arXiv. For each session, the presenters select one paper from SOSP or OSDI and one from arXiv.
The presentation follows a "1+N" format, where one person delivers the main content while supporting members assist with preparation and manage the Q&A session. These supporting members are also encouraged to contribute to the presentation.
The discussion should provide a thorough analysis of the paper’s strengths and weaknesses, along with a comprehensive review of related work from the past three years. The presentation must be at least 45 minutes long.

Other Information

The playback video and text summary will be uploaded to Bilibili and Zhihu as soon as possible.

Schedule

February 25

💡 Kick-off meeting
🙎‍♂️ Jiyang Wang, Kunzhao Xu and Cheng Li
📕 slides

March 11

💡 Comprehensive introduction to DeepSeek-AI's technical report (PART Ⅰ)
🙎‍♂️ Xin Ren, Tonghuan Xiao, Jiahui Tan, Yandong Shi, Kunzhao Xu, Yifei Liu, Chongzhuo Yang, Jiaan Zhu, Zewen Jin, Yinhe Chen, Ping Gong, Guanbin Xu, Haiquan Wang, Quan Zhou and Chaoyi Ruan
📕 MLA slides, 📕 DualPipe slides, 📕 FP8 Training slides, 📕 MTP slides
📃 Q&A summary, 📺 video

March 18

Topic Ⅰ

💡 Comprehensive introduction to DeepSeek-AI's technical report (PART Ⅱ)
🙎‍♂️ Xin Ren, Tonghuan Xiao, Jiahui Tan, Yandong Shi, Kunzhao Xu, Yifei Liu, Chongzhuo Yang, Jiaan Zhu, Zewen Jin, Yinhe Chen, Ping Gong, Guanbin Xu, Haiquan Wang, Quan Zhou and Chaoyi Ruan
📕 RL slides, 📕 3FS slides
📃 Q&A summary, 📺 video

Topic Ⅱ

💡 [OSDI'24] Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation
🙎‍♂️ Chengru Yang
📕 slides
📃 Q&A summary, 📺 video

March 25

Topic Ⅰ

💡 [OSDI'24] FairyWren: A Sustainable Cache for Emerging Write-Read-Erase Flash Interfaces
🙎‍♂️ Qingyuan Chen
📕 slides

Topic Ⅱ

💡 [arXiv] fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving
🙎‍♂️ Jia He, Jiaqi Ruan
📕 slides

Summary and Video

📃 Q&A Summary, 📺 video

April 1

Topic Ⅰ

💡 [SOSP'24] CHIME: A Cache-Efficient and High-Performance Hybrid Index on Disaggregated Memory
🙎‍♂️ Sen Han
📕 slides

Topic Ⅱ

💡 [arXiv] Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
🙎‍♂️ Tonghuan Xiao, Xin Ren
📕 slides

Summary and Video

📃 Q&A Summary, 📺 video

April 8

Topic Ⅰ

💡 [OSDI'25] Achieving Low-Latency Graph-Based Vector Search via Aligning Best-First Search Algorithm with SSD
🙎‍♂️ Hengyu Liang

Topic Ⅱ

💡 [arXiv] Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline
🙎‍♂️ Jiawei Yi
📕 slides
📃 Q&A summary, 📺 video

April 15

💡 [arXiv] Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot
🙎‍♂️ Juncheng Zhang
📕 slides
📃 Q&A summary, 📺 video

April 22

Topic Ⅰ

💡 [OSDI'24] Llumnix: Dynamic Scheduling for Large Language Model Serving
🙎‍♂️ Kunzhao Xu
📕 slides

Topic Ⅱ

💡 [SOSP'24] Enabling Parallelism Hot Switching for Efficient Training of Large Language Models
🙎‍♂️ Qinghe Wang
📕 slides

Summary and Video

📃 Q&A Summary, 📺 video

April 29

Topic Ⅰ

💡 [SOSP'24] Tiered Memory Management: Access Latency is the Key!
🙎‍♂️ Lijun Miao

Topic Ⅱ

💡 [arXiv] ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs
🙎‍♂️ Long Zhao

Summary and Video

📃 Q&A Summary, 📺 video

May 6

Topic Ⅰ

💡 [OSDI'25] Fast and Live Model Auto Scaling with O(1) Host Caching
🙎‍♂️ Chenhan Wang

Topic Ⅱ

💡 [arXiv] Training-free and Adaptive Sparse Attention for Efficient Long Video Generation
🙎‍♂️ Shiyi Wang

May 13

Topic Ⅰ

💡 [SOSP'24] OZZ: Identifying Kernel Out-of-Order Concurrency Bugs with In-Vivo Memory Access Reordering
🙎‍♂️ Jiyang Wang
📕 slides

Topic Ⅱ

💡 [arXiv] AsyncFS: Metadata Updates Made Asynchronous for Distributed Filesystems with In-Network Coordination
🙎‍♂️ Chongzhuo Yang
📕 slides

Summary and Video

📃 Q&A Summary, 📺 video

May 20

💡 [arXiv] Down with the Hierarchy: The ‘H’ in HNSW Stands for “Hubs”
🙎‍♂️ Bosen Yang
📕 slides

Summary and Video

📃 Q&A Summary, 📺 video

May 27

Topic Ⅰ

💡 [OSDI'24] dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving
🙎‍♂️ Chizheng Fang
📕 slides

Topic Ⅱ

💡 [arXiv] CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
🙎‍♂️ Yicheng Zhang
📕 slides

Summary and Video

📃 Q&A Summary, 📺 video

June 3

Topic Ⅰ

💡 [SOSP'24] Reducing Cross-Cloud/Region Costs with the Auto-Configuring MACARON Cache
🙎‍♂️ Chao Bi
📕 slides

Topic Ⅱ

💡 [arXiv] RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference
🙎‍♂️ Xiaoqi Li
📕 slides

Summary and Video

📃 Q&A Summary, 📺 video

June 10

Topic Ⅰ

💡 [SOSP'24] LazyLog: A New Shared Log Abstraction for Low-Latency Applications
🙎‍♂️ Jiaxuan Liu
📕 slides

Topic Ⅱ

💡 [arXiv] FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
🙎‍♂️ Zewen Jin
📕 slides

Summary and Video

📃 Q&A Summary, 📺 video

June 17

Topic Ⅰ

💡 [OSDI'25] WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training
🙎‍♂️ Shen Fu

Topic Ⅱ

💡 [arXiv] Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler
🙎‍♂️ Ouxiang Zhou
📕 slides

Summary and Video

📃 Q&A Summary, 📺 video

June 24

Topic Ⅰ

💡 [SOSP'24] VPRI: Efficient I/O Page Fault Handling via Software-Hardware Co-Design for IaaS Clouds
🙎‍♂️ Zheng Yang
📕 slides

Topic Ⅱ

💡 [arXiv] StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation
🙎‍♂️ Muxin Liu
📕 slides

Summary and Video

📺 video