AIOps Spring 2024 – Course Home

 



Announcements

2024/02/19: The course site is up. Prospective students, please check back for updates.


Course Instructor

Associate Professor
Department of Computer Science and Technology
Office: East Main Building 9-319
Phone: +86(10)62792837
Office hours: right after class

Course Assistants

Zhe Xie
Email: xiez22(at)mails(dot)tsinghua(dot)edu(dot)cn
Office: East Main Building 9-323
Office hours:

Class Time and Location

Class Time: Wednesday 9:50am – 12:15pm (please see the detailed schedule in the course syllabus)

Course Description

AIOps stands for Autonomous IT Operations or Artificial Intelligence for IT Operations. It is an interdisciplinary research field between Machine Learning and Systems/Networking, which is why this course historically carried the title “Advanced Network Management”. If you are interested in learning how a large distributed system can be run better with the help of machine learning, this course is for you. If you want to learn how machine learning can help solve challenging problems in a very complex system, this course is for you. If you are interested in learning how to apply Large Language Models (LLMs) to solve Ops challenges in real-world systems, this course is for you.

Imagine that you are running a large Internet-based service with hundreds of thousands of servers and many software modules. You want to achieve 99.999% service reliability, but terabytes of machine-generated monitoring data and hundreds of operators (IT operations engineers) alone won’t get you there, because of the high complexity and sheer scale of the software/hardware system and the vast amount of data it produces. What do we do? Machine learning and large foundation models to the rescue!
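To make the 99.999% target concrete, here is a minimal sketch of the downtime budget it implies. It assumes that "service reliability" is read as availability (the fraction of time the service is up) and that the budget is computed over a 365-day year; the function name `downtime_budget_minutes` is hypothetical and used only for illustration.

```python
def downtime_budget_minutes(availability_percent: float, period_days: float = 365.0) -> float:
    """Allowed downtime, in minutes, for an availability target over a period.

    Assumption: reliability is interpreted as availability, and the
    default period is a 365-day year.
    """
    unavailable_fraction = 1.0 - availability_percent / 100.0
    minutes_in_period = period_days * 24 * 60
    return unavailable_fraction * minutes_in_period


if __name__ == "__main__":
    # Compare three common availability targets.
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% -> {budget:.2f} minutes of downtime per year")
```

Under these assumptions, 99.999% leaves roughly 5.26 minutes of downtime per year, which is why detection and diagnosis have to be fast and largely automated.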

This course will cover the latest progress in the major topics of AIOps, using case studies from recent research papers published at top conferences in all major computer science fields, including Machine Learning, Data Mining, Large Language Models and their Applications, Systems/Networking, Software Engineering, Databases, Multimedia, etc. See the figure below 🙂

Through these case studies, we will show how the latest Machine Learning algorithms and LLMs are applied to solve the unique challenges in AIOps. The basics of these Machine Learning algorithms and LLMs will be briefly reviewed in an easy-to-understand way, without going through the detailed theory behind them. Thus, by the end of the course, you will understand roughly how these algorithms work and how they can be applied to solve real-world problems. The techniques reviewed include:

  1. Deep Learning
  2. Deep Neural Networks for Time Series or Sequence
  3. Deep Generative Model (VAE, GAN)
  4. Natural Language Processing
  5. Large Language Models (Agents, Retrieval-Augmented Generation, etc.)
  6. Causal Inference

The major topics of AIOps often coincide with their more general counterparts in Machine Learning and LLMs. The major difference is that the data in AIOps are machine-generated, while the data in Machine Learning and LLMs can be more general:

  1. Anomaly Detection in Time Series, Logs (semi-structured text), Traces (program execution trace), and Graphs
  2. Anomaly Localization
  3. Causal Inference and Its Application in Root Cause Analysis
  4. Metric Foundation Models and Ops Foundation Models
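As a rough illustration of the first topic, anomaly detection in time series, the sketch below flags points that deviate strongly from a trailing window of recent values. This is a simple rolling z-score baseline chosen for brevity, not one of the deep-learning methods the course covers; the function name, window size, threshold, and sample latency values are all assumptions made for this example.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices of points far from the mean of the preceding window.

    A point is flagged when it lies more than `threshold` sample standard
    deviations away from the mean of the `window` values before it.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: the z-score is undefined, so skip this point.
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up latency samples (milliseconds) with one obvious spike at index 7.
latency_ms = [100, 102, 99, 101, 100, 103, 98, 250, 101, 100]

if __name__ == "__main__":
    print(rolling_zscore_anomalies(latency_ms))
```

On the sample data only the spike at index 7 is flagged. A baseline like this breaks down on seasonal or drifting metrics, which is part of the motivation for learned models.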

This course is a graduate course and is primarily project-oriented.


Grading Policies

Attendance: 10%

Project (Metric Foundation Models): 90%


Course Information

Course Number
80240663

Credit
3

Required text
None.

Reference texts
《Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking》, by Foster Provost & Tom Fawcett

《MIT 6.S191 Introduction to Deep Learning》, with video and slides

《Site Reliability Engineering: How Google Runs Production Systems》, by Betsy Beyer, Chris Jones, Jennifer Petoff & Niall Richard Murphy

Prerequisites
You are expected to be familiar with at least one programming language.

 

