Get in Touch

Course Outline

AI Sovereignty and Local LLM Deployment

  • Identifying risks associated with cloud LLMs: data retention policies, training on user inputs, and foreign jurisdictional issues.
  • Understanding Ollama's architecture: the model server, registry, and its OpenAI-compatible API layer.
  • Comparing Ollama with alternatives such as vLLM, llama.cpp, and Text Generation Inference.
  • Reviewing model licensing for Llama, Mistral, Qwen, and Gemma.

Installation and Hardware Configuration

  • Deploying Ollama on Linux with CUDA and ROCm compatibility.
  • Implementing CPU-only fallback strategies and optimizing with AVX/AVX2 instructions.
  • Setting up Docker deployment with persistent volume mapping.
  • Configuring multi-GPU environments and managing VRAM allocation.

Model Management

  • Downloading models from the Ollama registry using commands like 'ollama pull llama3'.
  • Importing GGUF models sourced from HuggingFace and TheBloke.
  • Evaluating quantization levels: balancing precision in Q4_K_M, Q5_K_M, and Q8_0 formats.
  • Managing model switching and understanding limits for concurrent model loading.

Custom Modelfiles

  • Crafting Modelfile syntax using directives like FROM, PARAMETER, SYSTEM, and TEMPLATE.
  • Tuning key parameters such as temperature, top_p, and repeat_penalty.
  • Engineering system prompts to define role-specific model behaviors.
  • Creating and publishing bespoke models to the local registry.

API Integration

  • Utilizing the OpenAI-compatible /v1/chat/completions endpoint.
  • Implementing streaming responses and enforcing JSON mode.
  • Integrating local models with LangChain, LlamaIndex, and custom applications.
  • Managing authentication and rate limiting via reverse proxies.

Performance Optimization

  • Configuring context window sizes and managing KV cache efficiency.
  • Executing batch inference and handling parallel requests.
  • Allocating CPU threads and ensuring NUMA (Non-Uniform Memory Access) awareness.
  • Monitoring GPU utilization and tracking memory pressure.

Security and Compliance

  • Establishing network isolation for model serving endpoints.
  • Setting up input filtering and output moderation pipelines.
  • Maintaining audit logs for prompts and generated completions.
  • Verifying model provenance through hash checks.

Requirements

  • Intermediate proficiency in Linux and container administration.
  • A conceptual understanding of machine learning principles and transformer models.
  • Familiarity with REST APIs and JSON data formats.

Target Audience

  • AI engineers and developers looking to migrate away from cloud LLM APIs.
  • Organizations handling sensitive data that restricts the use of public cloud models.
  • Government and defense units requiring air-gapped language models.
 14 Hours

Number of participants


Price per participant

Upcoming Courses

Related Categories