Course Outline

Introduction to AIOps

  • Understanding AIOps and its significance.
  • Contrasting traditional monitoring with AIOps-driven observability.
  • Exploring AIOps architecture and essential components.

Collecting and Normalizing Operational Data

  • Identifying types of observability data: metrics, logs, and traces.
  • Ingesting data from diverse sources such as servers, containers, and cloud environments.
  • Utilizing agents and exporters such as Prometheus exporters, Beats, and Fluentd.

Data Correlation and Anomaly Detection

  • Employing time series correlation and statistical methods.
  • Applying ML models for effective anomaly detection.
  • Identifying incidents across distributed systems.
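
As an illustration of the statistical methods mentioned above, the following is a minimal sketch of rolling z-score anomaly detection on a single metric series. The function name, the window size, the threshold, and the sample CPU values are all assumptions chosen for the example; they are not taken from the course material.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices of points that deviate from the preceding
    `window` samples by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)  # sample standard deviation
        if sigma == 0:
            # Flat history: the z-score is undefined, so skip this point.
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical CPU utilisation samples (percent) with one spike.
cpu = [41.0, 43.0, 42.0, 40.0, 44.0, 42.0, 95.0, 43.0, 41.0]
print(rolling_zscore_anomalies(cpu))
```

A fixed z-score threshold is only one simple option; seasonal metrics usually need a baseline that accounts for time of day or day of week.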

Alerting and Noise Reduction

  • Designing intelligent alert rules and thresholds.
  • Implementing suppression, deduplication, and alert grouping strategies.
  • Integrating with platforms such as Alertmanager, Slack, PagerDuty, or Opsgenie.
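
The deduplication and grouping strategies listed above can be sketched in a few lines. This is an illustrative example only: the alert fields (`service`, `name`, `ts`), the five-minute window, and the helper names are hypothetical and do not reflect the data model of Alertmanager or any other platform.

```python
from collections import defaultdict


def fingerprint(alert):
    """Identify 'the same' alert by service and alert name (an assumption)."""
    return (alert["service"], alert["name"])


def deduplicate(alerts, window_seconds=300):
    """Keep at most one alert per fingerprint per time window."""
    last_kept = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = fingerprint(alert)
        previous = last_kept.get(key)
        if previous is None or alert["ts"] - previous >= window_seconds:
            kept.append(alert)
            last_kept[key] = alert["ts"]  # the window restarts only when an alert is kept
    return kept


def group_by_service(alerts):
    """Bundle alerts so that each service produces a single notification."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert["service"]].append(alert["name"])
    return dict(groups)


# Hypothetical alert stream; timestamps are seconds since an arbitrary start.
alerts = [
    {"service": "checkout", "name": "HighLatency", "ts": 0},
    {"service": "checkout", "name": "HighLatency", "ts": 60},
    {"service": "checkout", "name": "ErrorRate", "ts": 90},
    {"service": "search", "name": "HighLatency", "ts": 120},
    {"service": "checkout", "name": "HighLatency", "ts": 400},
]
kept = deduplicate(alerts)
print(group_by_service(kept))
```

Here the repeat at 60 seconds is dropped as a duplicate, while the one at 400 seconds falls outside the window and is kept.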

Root Cause Analysis and Visualization

  • Utilizing dashboards to visualize metrics and identify trends.
  • Examining events and timelines to support root cause analysis (RCA).
  • Tracing issues across layers using distributed tracing tools.

Automation and Remediation

  • Triggering automated scripts or workflows in response to incidents.
  • Integrating with ITSM systems like ServiceNow and Jira.
  • Reviewing use cases such as self-healing, scaling, and traffic rerouting.

Open Source and Commercial AIOps Platforms

  • Surveying tools including Prometheus, Grafana, ELK, Moogsoft, and Dynatrace.
  • Establishing evaluation criteria for selecting an appropriate AIOps platform.
  • Participating in a demo and hands-on session with a selected stack.

Summary and Next Steps

Requirements

  • A foundational understanding of IT operations and system monitoring concepts.
  • Prior experience with monitoring tools or dashboards.
  • Familiarity with basic log and metric formats.

Audience

  • Operations teams managing infrastructure and applications.
  • Site Reliability Engineers (SREs).
  • Teams focused on IT monitoring and observability.

Duration

  • 14 hours
