Announcements
2024/02/19: Course site is up. Prospective students, please check back for updates.
Course Instructor Course Assistants Class Time and Location
Office hours: right after class
Course Description
AIOps stands for Autonomous IT Operations or Artificial Intelligence for IT Operations. It is an interdisciplinary research field between Machine Learning and Systems/Networking, which is why this course historically carried the title “Advanced Network Management”. If you are interested in learning how a large distributed system can be run better with the help of machine learning, this course is for you. If you want to learn how machine learning can help solve challenging problems in a very complex system, this course is for you. If you are interested in learning how to apply Large Language Models (LLMs) to solve Ops challenges in real-world systems, this course is for you.
Imagine that you are running a large Internet-based service with hundreds of thousands of servers and many software modules. You want to achieve 99.999% service reliability, but terabytes of machine-generated monitoring data and hundreds of operators (IT operations engineers) alone won’t get you there: the software/hardware system is too complex and too large, and the monitoring data too voluminous, for manual analysis. What do we do? Machine learning and large foundation models to the rescue!
This course will cover the latest progress in the major topics of AIOps, using case studies from recent research papers published at top conferences across the major computer science fields, including Machine Learning, Data Mining, Large Language Models and their Applications, Systems/Networking, Software Engineering, Databases, Multimedia, etc. See the figure below 🙂
Through these case studies, we will show how the latest Machine Learning algorithms and LLMs are applied to solve the unique challenges in AIOps. The basics of these Machine Learning algorithms and LLMs will be briefly reviewed in an easy-to-understand way, without going through the detailed theory behind them. Thus, by the end of the course, you will understand roughly how these algorithms work and how they can be applied to solve real-world problems.
- Deep Learning
- Deep Neural Networks for Time Series or Sequences
- Deep Generative Models (VAE, GAN)
- Natural Language Processing
- Large Language Models (Agents, Retrieval Augmented Generation, etc.)
- Causal Inference
The major topics of AIOps often coincide with their more general counterparts in Machine Learning and LLMs. The major difference is that the data in AIOps are machine-generated, while the data in Machine Learning and LLMs can be more general:
- Anomaly Detection in Time Series, Logs (semi-structured text), Traces (program execution traces), and Graphs
- Anomaly Localization
- Causal Inference and its Application in Root Cause Analysis
- Metric Foundation Models and Ops Foundation Models
This course is a graduate course and is primarily project-oriented.
Grading Policies
Attendance: 10%;
Project (Metric Foundation Models): 90%;
Course Information
《MIT 6.S191 Introduction to Deep Learning》, with video and slides.
《Site Reliability Engineering: How Google Runs Production Systems》, by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy.