Course Outline

Fundamentals of Agentic Systems in Production

  • Agentic architectures: loops, tools, memory, and orchestration layers.
  • The agent lifecycle: from development and deployment to continuous operation.
  • Challenges of managing agents at production scale.

Infrastructure and Deployment Models

  • Deploying agents within containerized and cloud environments.
  • Scaling patterns: horizontal versus vertical scaling, concurrency, and throttling.
  • Multi-agent orchestration and workload balancing.
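
As a rough illustration of the concurrency and throttling topic, the sketch below caps the number of agent calls in flight with an asyncio semaphore. The name `run_throttled` and the `max_concurrency` parameter are hypothetical, chosen for this example; a real deployment would likely combine this with rate limits imposed by the model provider.

```python
import asyncio


async def run_throttled(factories, max_concurrency=2):
    """Run coroutine factories with at most `max_concurrency` in flight.

    `factories` is a list of zero-argument callables that each return a
    coroutine (for example, one agent invocation). Hypothetical helper.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(factory):
        # Each call waits here until a slot is free.
        async with semaphore:
            return await factory()

    # gather() returns results in the same order as the input list.
    return await asyncio.gather(*(guarded(f) for f in factories))
```

A caller would pass one factory per pending agent request and await the combined result; raising `max_concurrency` trades lower queueing delay for higher load on downstream services.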

Monitoring and Observability

  • Key metrics: latency, success rate, memory consumption, and agent call depth.
  • Tracing agent activity and call graphs.
  • Instrumenting observability using Prometheus, OpenTelemetry, and Grafana.

Logging, Auditing, and Compliance

  • Centralized logging and structured event collection.
  • Ensuring compliance and auditability within agentic workflows.
  • Designing audit trails and replay mechanisms for debugging.

Performance Tuning and Resource Optimization

  • Reducing inference overhead and optimizing agent orchestration cycles.
  • Model caching and lightweight embeddings for accelerated retrieval.
  • Load testing and stress scenarios for AI pipelines.

Cost Control and Governance

  • Understanding cost drivers for agents: API calls, memory, compute, and external integrations.
  • Tracking agent-level costs and implementing chargeback models.
  • Establishing automation policies to prevent agent sprawl and idle resource consumption.
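
To make the cost-tracking and chargeback topic concrete, here is a minimal sketch that attributes token spend to the team owning each agent. The `Usage` record, the field names, and the per-token prices are all assumptions made for illustration; actual rates and billing dimensions depend on the provider.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens; substitute real provider rates.
PRICE_PER_1K_INPUT = 0.50
PRICE_PER_1K_OUTPUT = 1.50


@dataclass(frozen=True)
class Usage:
    """One metered agent call (illustrative schema)."""
    agent_id: str
    team: str
    input_tokens: int
    output_tokens: int


def usage_cost(usage):
    """Cost of a single call under the assumed per-token prices."""
    return (usage.input_tokens / 1000 * PRICE_PER_1K_INPUT
            + usage.output_tokens / 1000 * PRICE_PER_1K_OUTPUT)


def chargeback(records):
    """Sum call costs per team so spend can be billed back to its owner."""
    totals = defaultdict(float)
    for usage in records:
        totals[usage.team] += usage_cost(usage)
    return dict(totals)
```

Grouping by `team` rather than `agent_id` is one design choice among several; grouping by agent instead would surface individual agents that run up disproportionate cost.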

CI/CD and Rollout Strategies for Agents

  • Integrating agent pipelines into CI/CD systems.
  • Testing, versioning, and rollback strategies for iterative agent updates.
  • Progressive rollouts and safe deployment mechanisms.

Failure Recovery and Reliability Engineering

  • Designing for fault tolerance and graceful degradation.
  • Implementing retry, timeout, and circuit breaker patterns for agent reliability.
  • Incident response and post-mortem frameworks for AI operations.
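
The retry and circuit breaker patterns named above can be sketched as follows. This is a simplified illustration, not a production implementation: `CircuitBreaker`, `CircuitOpenError`, and `call_with_retry` are hypothetical names, the thresholds are arbitrary, and timeouts are assumed to be enforced by whatever client performs the underlying call.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and a call is rejected untried."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock              # injectable for testing
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retry(fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry `fn` with exponential backoff; never retry an open circuit."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

In use, an agent's tool call would typically be wrapped as `call_with_retry(lambda: breaker.call(tool, payload))`, so transient errors are retried while a persistently failing dependency is cut off quickly.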

Capstone Project

  • Build and deploy an agentic AI system with comprehensive monitoring and cost tracking.
  • Simulate load, measure performance, and optimize resource usage.
  • Present the final architecture and monitoring dashboard to peers.

Summary and Next Steps

Requirements

  • Proficiency in MLOps and production machine learning environments.
  • Hands-on experience with containerized deployments (Docker and Kubernetes).
  • Familiarity with cloud cost optimization strategies and observability tools.

Target Audience

  • MLOps Engineers.
  • Site Reliability Engineers (SREs).
  • Engineering leaders responsible for AI infrastructure.

Duration

  • 21 hours.
