Course Outline

Core Performance Concepts and Metrics

  • Analysis of latency, throughput, power consumption, and resource utilization
  • Distinguishing between system-level and model-level bottlenecks
  • Differentiating profiling approaches for inference versus training
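As a rough illustration of the latency and throughput metrics listed above, the following is a minimal sketch in plain Python. The names `measure` and `fake_inference` are hypothetical and not part of any vendor SDK; the stand-in workload should be replaced with a real model call. On an accelerator that executes asynchronously, the device would also need to be synchronized before the timer is stopped.

```python
import statistics
import time


def measure(fn, batch_size, warmup=3, iters=20):
    """Time fn() and return latency percentiles (ms) and throughput (samples/s)."""
    for _ in range(warmup):
        fn()  # warm-up runs are discarded (caches, lazy initialization)
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    mean_s = statistics.fmean(samples)
    # Nearest-rank style index for the 95th percentile, clamped to the last sample.
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "p50_ms": statistics.median(samples) * 1e3,
        "p95_ms": samples[p95_index] * 1e3,
        "throughput": batch_size / mean_s,
    }


def fake_inference():
    # Hypothetical stand-in for a model call; replace with a real forward pass.
    sum(i * i for i in range(10_000))


stats = measure(fake_inference, batch_size=8)
print(stats)
```

Reporting a tail percentile alongside the median is a deliberate choice here: inference workloads are usually judged on tail latency, while training is more often judged on sustained throughput.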

Profiling on Huawei Ascend

  • Leveraging CANN Profiler and MindInsight
  • Diagnostics for kernels and operators
  • Analyzing offload patterns and memory mapping

Profiling on Biren GPUs

  • Utilizing Biren SDK performance monitoring capabilities
  • Optimizing kernel fusion, memory alignment, and execution queues
  • Conducting power and temperature-aware profiling

Profiling on Cambricon MLU

  • Employing BANGPy and Neuware performance utilities
  • Gaining kernel-level visibility and interpreting logs
  • Integrating the MLU profiler with deployment frameworks

Optimizing at the Graph and Model Level

  • Strategies for graph pruning and quantization
  • Operator fusion and restructuring computational graphs
  • Standardizing input sizes and tuning batch configurations
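One common way to standardize input sizes is to pad each input up to the nearest of a small set of fixed "bucket" lengths, so that a stack which compiles per input shape sees only a few distinct shapes. The sketch below assumes this bucketing approach; the helper names `pick_bucket` and `pad_to_bucket` and the bucket values are illustrative only and are not taken from any of the platforms above.

```python
def pick_bucket(length, buckets):
    """Return the smallest bucket size that can hold `length` items."""
    for size in sorted(buckets):
        if length <= size:
            return size
    raise ValueError(f"length {length} exceeds largest bucket {max(buckets)}")


def pad_to_bucket(seq, buckets, pad_value=0):
    """Pad a list on the right so its length matches its bucket size."""
    target = pick_bucket(len(seq), buckets)
    return seq + [pad_value] * (target - len(seq))


# Hypothetical bucket sizes; in practice these are chosen from the input-length distribution.
BUCKETS = [4, 8, 16]

padded = pad_to_bucket([1, 2, 3], BUCKETS)       # padded to length 4
padded_long = pad_to_bucket(list(range(5)), BUCKETS)  # padded to length 8
```

Fewer buckets mean fewer distinct shapes but more wasted computation on padding, so the bucket set is itself a tuning parameter alongside batch size.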

Memory and Kernel Optimization

  • Enhancing memory layout and reuse patterns
  • Managing buffers efficiently across different chipsets
  • Applying platform-specific kernel tuning techniques

Best Practices for Cross-Platform Environments

  • Ensuring performance portability through abstraction strategies
  • Developing shared tuning pipelines for multi-chip setups
  • Case study: tuning an object detection model across Huawei Ascend, Biren GPUs, and Cambricon MLU

Conclusion and Next Steps

Requirements

  • Hands-on experience with AI model training or deployment pipelines
  • Foundational knowledge of GPU/MLU compute principles and model optimization techniques
  • Basic proficiency with performance profiling tools and metrics

Target Audience

  • Performance engineers
  • Machine learning infrastructure teams
  • AI system architects

Duration

  • 21 hours
