Thank you for sending your enquiry! One of our team members will contact you shortly.
Thank you for sending your booking! One of our team members will contact you shortly.
Course Outline
AI Sovereignty and Local LLM Deployment
- Identifying risks associated with cloud LLMs: data retention policies, training on user inputs, and foreign jurisdictional issues.
- Understanding Ollama's architecture: the model server, registry, and its OpenAI-compatible API layer.
- Comparing Ollama with alternatives such as vLLM, llama.cpp, and Text Generation Inference.
- Reviewing model licensing for Llama, Mistral, Qwen, and Gemma.
Installation and Hardware Configuration
- Deploying Ollama on Linux with CUDA and ROCm compatibility.
- Implementing CPU-only fallback strategies and optimizing with AVX/AVX2 instructions.
- Setting up Docker deployment with persistent volume mapping.
- Configuring multi-GPU environments and managing VRAM allocation.
Model Management
- Downloading models from the Ollama registry using commands like 'ollama pull llama3'.
- Importing GGUF models sourced from HuggingFace and TheBloke.
- Evaluating quantization levels: balancing precision in Q4_K_M, Q5_K_M, and Q8_0 formats.
- Managing model switching and understanding limits for concurrent model loading.
Custom Modelfiles
- Crafting Modelfile syntax using directives like FROM, PARAMETER, SYSTEM, and TEMPLATE.
- Tuning key parameters such as temperature, top_p, and repeat_penalty.
- Engineering system prompts to define role-specific model behaviors.
- Creating and publishing bespoke models to the local registry.
API Integration
- Utilizing the OpenAI-compatible /v1/chat/completions endpoint.
- Implementing streaming responses and enforcing JSON mode.
- Integrating local models with LangChain, LlamaIndex, and custom applications.
- Managing authentication and rate limiting via reverse proxies.
Performance Optimization
- Configuring context window sizes and managing KV cache efficiency.
- Executing batch inference and handling parallel requests.
- Allocating CPU threads and ensuring NUMA (Non-Uniform Memory Access) awareness.
- Monitoring GPU utilization and tracking memory pressure.
Security and Compliance
- Establishing network isolation for model serving endpoints.
- Setting up input filtering and output moderation pipelines.
- Maintaining audit logs for prompts and generated completions.
- Verifying model provenance through hash checks.
Requirements
- Intermediate proficiency in Linux and container administration.
- A conceptual understanding of machine learning principles and transformer models.
- Familiarity with REST APIs and JSON data formats.
Target Audience
- AI engineers and developers looking to migrate away from cloud LLM APIs.
- Organizations handling sensitive data that restricts the use of public cloud models.
- Government and defense units requiring air-gapped language models.
14 Hours