This seminar explores the evolution of software systems that treat an entire datacenter as a single, powerful computer. Over the past two decades, systems design has shifted from managing individual servers to orchestrating vast fleets of machines through sophisticated software control planes. Understanding this journey is critical for any practitioner building or operating modern cloud-native applications, because the abstractions developed during this era, from distributed file systems to serverless functions, are the foundation of the cloud.
In this course, we will read and discuss the seminal research papers that defined this field. We will trace the path from the foundational abstractions for large-scale data processing to the modern paradigms of cluster orchestration and serverless computing. Throughout our discussions, we will focus on identifying a set of cross-cutting principles that provide a framework for reasoning about these complex systems.
Broader Themes
We will continuously revisit five key themes:
- The Enduring Power of Abstraction: How new layers of abstraction hide complexity and enable new programming models.
- Complexity Doesn’t Disappear, It Moves: How solving one problem often creates a new, higher-level challenge elsewhere in the stack.
- The Tension Between Generality and Specialization: The trade-offs between building one system for all workloads versus specialized systems for specific niches.
- Performance is Redefined at Every Layer: How the definition of “high performance” evolves as we move up the abstraction stack.
- The Unit of Trust is Shrinking: The progression toward increasingly fine-grained security models.
Week 1: The Blueprint for a Warehouse-Scale Computer
- Core Question: How do you build reliable storage and computation on top of thousands of unreliable machines?
- Required Reading:
- The Google File System (SOSP ‘03) - Ghemawat, Gobioff, and Leung.
- MapReduce: Simplified Data Processing on Large Clusters (OSDI ‘04) - Dean and Ghemawat.
- Discussion Focus: We begin with the two papers that arguably kickstarted the cloud computing era. We will dissect the design principles: designing for failure, optimizing for huge files and sequential reads, and the power of a simple programming model.
- Broader Themes: The Enduring Power of Abstraction; The Tension Between Generality and Specialization.
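As a warm-up for the Week 1 discussion, the "simple programming model" of MapReduce can be sketched in a few lines. This is a minimal, single-process illustration only: the names `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical and do not come from the paper, and the sketch omits everything that makes the real system interesting (partitioning across machines, re-execution on failure, locality-aware scheduling).

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(doc_id: str, text: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every word in one input document."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> int:
    """Reduce step: combine all values emitted for one key."""
    return sum(counts)


def run_mapreduce(
    inputs: Dict[str, str],
    mapper: Callable[[str, str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], int],
) -> Dict[str, int]:
    """Toy driver: run the mapper, group values by key, then run the reducer."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in inputs.items():
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # the "shuffle" phase
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


# Hypothetical input data, used only for illustration.
docs = {"a.txt": "the quick fox", "b.txt": "the lazy dog the end"}
word_counts = run_mapreduce(docs, map_words, reduce_counts)
print(word_counts)
```

The point for discussion is that the user writes only the two small functions at the top; everything in the driver is the part a production framework would distribute and make fault-tolerant.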
Week 2: A Database for Planet-Scale Data
- Core Question: Once you have a file system and a computation model, how do you store and serve structured data at a global scale?
- Required Reading:
- Bigtable: A Distributed Storage System for Structured Data (OSDI ‘06) - Chang et al.
- Optional Reading:
- Dynamo: Amazon’s Highly Available Key-value Store (SOSP ‘07) - DeCandia et al.
- Discussion Focus: We will explore the Bigtable data model and architecture and how it builds upon GFS and other Google infrastructure. The discussion will contrast its design with traditional RDBMS and, if people read the optional paper, with Dynamo’s “always-on” philosophy.
- Broader Themes: Complexity Doesn’t Disappear, It Moves; The Enduring Power of Abstraction.
Week 3: The Datacenter Operating System
- Core Question: How do you manage resources and schedule millions of jobs from thousands of users across an entire datacenter?
- Required Reading:
- Large-scale cluster management at Google with Borg (EuroSys ‘15) - Verma et al.
- Omega: flexible, scalable schedulers for large compute clusters (EuroSys ‘13) - Schwarzkopf et al.
- Discussion Focus: This session introduces the “datacenter OS.” We will spend most of our time on Borg, the predecessor of Kubernetes. We will then use Omega to discuss the architectural trade-offs in scheduler design (monolithic vs. two-level vs. shared-state) and the implications for scalability and flexibility.
- Broader Themes: The Tension Between Generality and Specialization; Complexity Doesn’t Disappear, It Moves.
Week 4: Evolving Workloads and Schedulers
- Core Question: How must schedulers adapt when workloads shift from long-running jobs to short, latency-sensitive tasks?
- Required Reading:
- Sparrow: Distributed, Low Latency Scheduling (SOSP ‘13) - Ousterhout et al.
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (NSDI ‘12) - Zaharia et al.
- Discussion Focus: Sparrow departs sharply from the centralized Borg/Omega model, using decentralized, randomized sampling to place very short tasks for interactive analytics. We’ll pair this with the Spark (RDD) paper to understand the kind of workloads that drove the need for such schedulers, and to see the next evolution of the MapReduce computation model.
- Broader Themes: Performance is Redefined at Every Layer; The Tension Between Generality and Specialization.
Week 5: The Serverless Paradigm & The Cold Start Problem
- Core Question: What does it take to provide strong isolation for thousands of functions, yet start them in milliseconds?
- Required Reading:
- Firecracker: Lightweight Virtualization for Serverless Applications (NSDI ‘20) - Agache et al.
- Fork in the Road: Reflections and Optimizations for Cold Start Latency in Production Serverless Systems (OSDI ‘25) - Chai et al.
- Discussion Focus: We pivot to the serverless abstraction. The Firecracker paper is a masterclass in focused engineering and VMM design. We’ll couple it with the “Fork in the Road” paper to get a real-world view of where cold start latency actually comes from beyond just the runtime.
- Broader Themes: Performance is Redefined at Every Layer; The Unit of Trust is Shrinking.
Week 6: Making Serverless Stateful
- Core Question: If functions are ephemeral, where do you put the data? How can we make state access fast and efficient?
- Required Reading:
- Fault-tolerant and Transactional Stateful Serverless Workflows (Beldi) (OSDI ‘20) - Zhang et al.
- Catalyst: A Serverless Framework for Scalable and Transparent Data Caching (SOSP ‘21) - Wang et al.
- Discussion Focus: We tackle the Achilles’ heel of serverless: state. These two papers present complementary solutions. Beldi layers exactly-once semantics and transactions over existing cloud storage so that stateful functions survive failures and retries, while Catalyst designs a transparent caching layer.
- Broader Themes: Complexity Doesn’t Disappear, It Moves.
Week 7: Orchestrating Serverless Workflows
- Core Question: How do you compose simple functions into complex, reliable, and high-performance applications?
- Required Reading:
- DataFlower: Exploiting the Data-flow Paradigm for Serverless Workflow Orchestration (ASPLOS ‘24) - Li et al.
- Halfmoon: Log-Optimal Fault-Tolerant Stateful Serverless Computing (SOSP ‘23) - Qi, Liu, and Jin.
- Discussion Focus: We explore the next level of the serverless challenge: building entire applications. We will contrast DataFlower’s performance-oriented data-flow model with Halfmoon’s focus on fault-tolerance and consistency for stateful workflows.
- Broader Themes: The Enduring Power of Abstraction; Complexity Doesn’t Disappear, It Moves.
Week 8: Securing the Serverless Application
- Core Question: Beyond simple isolation, how do you protect a serverless application’s logic and data from sophisticated attacks?
- Required Reading:
- Guarding Serverless Applications with Kalium (USENIX Security ‘23) - Jegan et al.
- Discussion Focus: Kalium addresses a threat unique to serverless: attacks on the control flow between functions. This paper forces us to think about security at the application-logic layer, not just the infrastructure layer.
- Broader Themes: The Unit of Trust is Shrinking.
Week 9: Security at Cloud Scale
- Core Question: Zooming back out, how do you manage permissions globally? And can you trust the cloud provider itself?
- Required Reading:
- Zanzibar: Google’s Consistent, Global Authorization System (USENIX ATC ‘19) - Pang et al.
- Confidential Computing for the Public Cloud (IEEE Security & Privacy ‘21) - Rane.
- Discussion Focus: This week serves as a capstone for security. Zanzibar tackles the monumental challenge of distributed authorization. The Confidential Computing survey then looks to the future, asking how we can protect data even from a compromised hypervisor or a malicious cloud operator.
- Broader Themes: The Unit of Trust is Shrinking; Complexity Doesn’t Disappear, It Moves.
Week 10: Synthesis and Critical Perspectives
- Core Question: After two decades of abstraction, are our systems actually better? What are the fundamental costs of this complexity?
- Required Reading:
- Serverless Computing: One Step Forward, Two Steps Back (CIDR ‘19) - Hellerstein et al.
- Optional Reading:
- A Plea for Lean Software (IEEE Computer ‘95) - Wirth.
- Discussion Focus: We end the seminar with a critical reflection. The Hellerstein et al. paper is a well-argued critique of the serverless paradigm, forcing us to weigh the pros and cons of the abstractions we’ve studied. This final session will be dedicated to synthesizing the broader themes and debating the trajectory of systems research.
- Broader Themes: All five themes.