Theme 3: Fine-grained Communication and Coordination

The trend toward data centers with millions of accelerators supporting a wide variety of heterogeneous functionality will quickly accelerate, introducing networking challenges. We envision hierarchical, flexible, and reconfigurable network topologies that leverage accelerators for protocol and infrastructure tasks. This evolvable hardware will be matched with an evolvable communication software stack that specializes to the accelerators available, substantially reducing the Datacenter Tax. Furthermore, to prevent accelerators from sitting idle because the scheduler fails to assign them work or because they are waiting for remote data to arrive, we envision a novel runtime that bundles computation into small buckets and ships them to where the data lives. Moreover, accelerators in network switches and SmartNICs use their unique vantage points to perform a variety of computations, further improving efficiency. Finally, we propose new memory and accelerator designs to speed up geo-distributed data stores, since they are a fundamental component of the cloud infrastructure.

WAN switches, datacenter switches, and SmartNICs can perform in-network computing (Courtesy of Manya Ghobadi).

The computing infrastructure will include highly heterogeneous distributed memory and storage resources. As workloads relentlessly increase their data needs, the memory reachable by processors as local memory will expand across an entire rack, creating a formidable memory wall that we will meet with novel processor structures and gracefully degrading coherence mechanisms. To utilize heterogeneous memory and storage assets efficiently, we will develop new abstractions that allow applications to select the type of asset they need. Moreover, we will develop theory-grounded, scalable algorithms to apportion these assets efficiently among thousands of competing applications in the datacenter and billions of allocation requests. Ubiquitous intelligent memory and storage blocks distributed across the memory hierarchy will be harnessed to operate in a coordinated manner.

Heterogeneous Intelligent Memory and Storage (IMS) blocks present in multiple locations of the memory hierarchy of a distributed machine (Courtesy of Steven Swanson).

Papers and Presentations:

Homunculus: Auto-Generating Efficient Data-Plane ML Pipelines for Datacenter Networks

Tushar Swamy, Annus Zulfiqar, Luigi Nardi, Muhammad Shahbaz, Kunle Olukotun

ASPLOS 2023, March 2023

10.1145/3582016.3582022

 

ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale

William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, Tushar Krishna

2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2023

10.1109/ISPASS57527.2023.00035

 

The Slow Path Needs an Accelerator Too!

Annus Zulfiqar, Ben Pfaff, William Tu, Gianni Antichi, Muhammad Shahbaz

ACM SIGCOMM Computer Communication Review, Vol 53, Issue 1, April 2023

10.1145/3594255.3594259

 

eZNS: An Elastic Zoned Namespace for Commodity ZNS SSDs

Jaehong Min, Chenxingyu Zhao, Ming Liu, Arvind Krishnamurthy

Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation, April 2023

 

Electrode: Accelerating Distributed Protocols with eBPF

Yang Zhou, Zezhou Wang, Sowmya Dharanipragada, Minlan Yu

Proceedings of the 20th USENIX Symposium on Network Systems Design and Implementation, April 2023

 

RecD: Deduplication for End-to-End Deep Learning Recommendation Model Training Infrastructure

Mark Zhao, Dhruv Choudhary, Devashish Tyagi, Ajay Somani, Max Kaplan, Sung-Han Lin, Sarunya Pumma, Jongsoo Park, Aarti Basant, Niket Agarwal, Carole-Jean Wu, and Christos Kozyrakis

Proceedings of the 6th Conference on Machine Learning and Systems, June 2023

arXiv:2211.05239

 

Tectonic-Shift: A Composite Storage Fabric for Large-Scale ML Training

Mark Zhao, Satadru Pan, Niket Agarwal, Zhaoduo Wen, David Su, Anand Natarajan, Pavan Kumar, Shiva Shankar P, Ritesh Tijoriwala, Karan Asher, Hao Wu, Aarti Basant, Daniel Ford, Deli David, Nezih Yigitbasi, Pratap Singh, Carole-Jean Wu, Christos Kozyrakis

Proceedings of the 2023 USENIX Annual Technical Conference, July 2023

 

Cloud-Native 5G Mobile Core

Jingqi Huang, Bilal Saleem, Jiayi Meng, Iftekhar Alam, Ajay Thakur, Christian Maciocco, Muhammad Shahbaz, and Y. Charlie Hu

SRC TechCon, September 2023

 

Direct Telemetry Access

Jonatan Langlet, Ran Ben Basat, Gabriele Oliaro, Michael Mitzenmacher, Minlan Yu, Gianni Antichi

ACM SIGCOMM '23: Proceedings of the ACM SIGCOMM 2023 Conference, September 2023

10.1145/3603269.3604827

 

Dissecting Overheads of Service Mesh Sidecars

Xiangfeng Zhu, Guozhen She, Bowen Xue, Yu Zhang, Yongsu Zhang, Xuan Kelvin Zou, XiongChun Duan, Peng He, Arvind Krishnamurthy, Matthew Lentz, Danyang Zhuo, Ratul Mahajan

SoCC '23: Proceedings of the 2023 ACM Symposium on Cloud Computing, October 2023

10.1145/3620678.3624652

 

Modeling and Generating Control-Plane Traffic for Cellular Networks

Jiayi Meng, Jingqi Huang, Y. Charlie Hu, Yaron Koral, Xiaojun Lin, Muhammad Shahbaz, Abhigyan Sharma

IMC '23: Proceedings of the 2023 ACM on Internet Measurement Conference, October 2023

10.1145/3618257.3624808

 

Anticipatory Resource Allocation for ML Training

Tapan Chugh, Srikanth Kandula, Arvind Krishnamurthy, Ratul Mahajan, Ishai Menache

SoCC '23: Proceedings of the 2023 ACM Symposium on Cloud Computing, October 2023

10.1145/3620678.3624669

 

On-Fiber Photonic Computing

Mingran Yang, Zhizhen Zhong, Manya Ghobadi

HotNets ’23, November 2023

10.1145/3626111.3628177

 

Optimal Oblivious Routing with Concave Objectives for Structured Networks

Kanatip Chitavisutthivong, Sucha Supittayapornpong, Pooria Namyar, Mingyang Zhang, Minlan Yu, Ramesh Govindan

IEEE/ACM Transactions on Networking, December 2023

10.1109/TNET.2023.3264632

 

2024

SoK Paper: Power Side-Channel Malware Detection
Alexander Cathis, Ge Li, Shijia Wei, Michael Orshansky, Mohit Tiwari, Andreas Gerstlauer
HASP '24: Proceedings of the International Workshop on Hardware and Architectural Support for Security and Privacy 2024
10.1145/3696843.3696849

 

Obsidian: Cooperative State-Space Exploration for Performant Inference on Secure ML Accelerators
Sarbartha Banerjee, Shijia Wei, Prakash Ramrakhyani, Mohit Tiwari
10.48550/arXiv.2409.02817

 

SoK: A Systems Perspective on Compound AI Threats and Countermeasures
Sarbartha Banerjee, Prateek Sahu, Mulong Luo, Anjo Vahldiek-Oberwagner, Neeraja J. Yadwadkar, Mohit Tiwari
10.48550/arXiv.2411.13459

 

RLDetect: Using Reinforcement Learning for Timing Leakage Detection in Constant Time Security Primitives
John P Ali, Radu Teodorescu, Carter Yagemann
TechCon 2024

 

MICROSAMPLER: A Framework for Microarchitecture-Level Leakage Detection in Constant Time Execution
Moein Ghaniyoun, Kristin Barber, Saikat Majumdar, Yinqian Zhang, Radu Teodorescu
TechCon 2024

 

Quick, Thorough and Scalable Pre-Silicon Verification with G-QED
Saranyu Chattopadhyay and Subhasish Mitra
TechCon 2024

 

ConfusedPilot: Compromising Enterprise Information Integrity and Confidentiality with Copilot for Microsoft 365
Ayush RoyChowdhury, Mulong Luo, Prateek Sahu, Sarbartha Banerjee, Mohit Tiwari
10.48550/arXiv.2408.04870

 

RTL Verification for Secure Speculation Using Contract Shadow Logic
Qinhan Tan, Yuheng Yang, Thomas Bourgeat, Sharad Malik, Mengjia Yan
ASPLOS '25: Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1
10.1145/3669940.3707243

 

Industrial Tutorial: G-QED for Pre-Silicon Verification
Subhasish Mitra, Saranyu Chattopadhyay, Mohammad Rahmani Fadiheh
Design and Verification Conference Europe 2024

 

NetBlocks: Staging Layouts for High-Performance Custom Host Network Stacks
Ajay Brahmakshatriya, Chris Rinard, Manya Ghobadi, Saman Amarasinghe
Proceedings of the ACM on Programming Languages, 8 (PLDI)
10.1145/3656396

 

Voltage Noise-Based Adversarial Attacks on Machine Learning Inference in Multi-Tenant FPGA Accelerators
Saikat Majumdar, Radu Teodorescu
2024 IEEE International Symposium on Hardware Oriented Security and Trust (HOST)
10.1109/HOST55342.2024.10545401

 

NeuroBack: Improving CDCL SAT Solving using Graph Neural Networks
Wenxi Wang, Yang Hu, Mohit Tiwari, Sarfraz Khurshid, Kenneth McMillan, Risto Miikkulainen
ICLR '24
10.48550/arXiv.2110.14053

 

Interactive Greybox Penetration Testing for Cloud Access Control using IAM Modeling and Deep Reinforcement Learning
Yang Hu, Wenxi Wang, Sarfraz Khurshid, Mohit Tiwari
arXiv:2304.14540

 

Fixing Privilege Escalations in Cloud Access Control with MaxSAT and Graph Neural Networks
Yang Hu, Wenxi Wang, Sarfraz Khurshid, Kenneth L. McMillan, Mohit Tiwari
2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE)
10.1109/ASE56229.2023.00167

 

DINT: Fast In-Kernel Distributed Transactions with eBPF
Yang Zhou, Xingyu Xiang, Matthew Kiley, Sowmya Dharanipragada, Minlan Yu
NSDI'24: Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation
10.5555/3691825.3691848