Running Ollama on Kubernetes: A Complete Guide to Local LLM Deployment

Collabnix

Collabnix Team

June 24, 2025 at 18:25

Learn how to deploy and scale Ollama LLM models on Kubernetes clusters for production-ready AI applications

Building RAG Applications with Ollama and Python: Complete 2025 Tutorial

Collabnix

Collabnix Team

June 24, 2025 at 16:30

Retrieval-Augmented Generation (RAG) has revolutionized how we build intelligent applications that can access and reason over external knowledge bases. In this comprehensive tutorial, we’ll explore how to build production-ready RAG applications using Ollama and Python, leveraging the latest techniques and best practices for 2025. What is RAG and Why Use Ollama? Retrieval-Augmented Generation combines the […]

AI in Real-World Applications: Beyond Code Generation

Collabnix

Collabnix Team

June 24, 2025 at 13:20

A technical exploration of autonomous AI systems that move beyond content generation to real-world execution

Portainer + Talos Linux

Bret Fisher

June 23, 2025 at 19:21

Portainer now manages Talos Linux Kubernetes

At KubeCon London, I took a workshop on using Portainer to deploy and manage Talos Linux and Kubernetes together. Portainer now manages Talos Linux, and I got to spend some time with these two tools as a one-stop shop for deploying the unique Talos OS (from Sidero Labs, Inc.) and setting up a Kubernetes cluster with just a few clicks inside Portainer.

The Portainer Workshop: https://academy.portainer.io/yachtops/

Talos Linux: https://www.talos.dev/

Agentic AI in Customer Service: The Complete Technical Implementation Guide for 2025

Collabnix

Collabnix Team

June 23, 2025 at 06:37

Let’s get one thing straight—if you’re still deploying rule-based chatbots in 2025, you’re essentially bringing a flip phone to a smartphone convention. I’ve been in the trenches with AI implementations for years, and I can tell you that the shift from reactive customer service bots to autonomous agentic AI isn’t just evolutionary—it’s revolutionary. And frankly, […]

Docker Scout Tutorial: Build Secure Container Images

Collabnix

Collabnix Team

June 23, 2025 at 05:19

Learn how to implement comprehensive security scanning in your Docker workflow to identify vulnerabilities before they reach production.

10 Agentic AI Tools That Will Replace ChatGPT in 2025

Collabnix

Collabnix Team

June 23, 2025 at 04:34

Stop settling for AI that just answers questions. The future belongs to AI that actually does the work. If you’re still using ChatGPT like it’s 2023, you’re about to be left behind. While you’ve been asking ChatGPT to write emails, a revolutionary shift is happening in the AI world—and it’s called Agentic AI. Here’s the […]

GitHub Copilot VS Code Setup: Docker MCP Toolkit Tutorial 2025

Collabnix

Collabnix Team

June 22, 2025 at 18:02

VS Code developers using GitHub Copilot are already experiencing the power of AI-assisted development. But what if your AI assistant could do more than just write code?

Ollama vs ChatGPT 2025: Complete Technical Comparison Guide

Collabnix

Collabnix Team

June 19, 2025 at 03:28

Ollama vs ChatGPT 2025: A Comprehensive Comparison A comprehensive technical analysis comparing local LLM deployment via Ollama against cloud-based ChatGPT APIs, including performance benchmarks, cost analysis, and implementation strategies The artificial intelligence landscape has reached a critical inflection point in 2025. Organizations worldwide face a fundamental strategic decision that will define their AI capabilities for […]

Best Ollama Models 2025: Performance Comparison Guide

Collabnix

Collabnix Team

June 19, 2025 at 03:09

Top Picks for Best Ollama Models 2025 A comprehensive technical analysis of the most powerful local language models available through Ollama, including benchmarks, implementation guides, and optimization strategies Introduction to Ollama’s 2025 Ecosystem The landscape of local language model deployment has dramatically evolved in 2025, with Ollama establishing itself as the de facto standard for […]

Docker Multi-Stage Builds for Python Developers: A Complete Guide

Collabnix

Collabnix Team

June 19, 2025 at 02:44

As a Python developer, you’ve probably experienced the pain of slow Docker builds, bloated images filled with build tools, and the frustration of waiting 10+ minutes for a simple code change to rebuild. Docker multi-stage builds solve these problems elegantly, and they’re particularly powerful for Python applications. In this comprehensive guide, we’ll explore how to […]

Optimize Your AI Containers with Docker Multi-Stage Builds: A Complete Guide

Collabnix

Collabnix Team

June 19, 2025 at 02:08

If you’re developing AI applications, you’ve probably experienced the frustration of slow Docker builds, bloated container images, and inefficient caching. Every time you tweak your model code, you’re stuck waiting for dependencies to reinstall, and your production images are loaded with unnecessary build tools. Docker multi-stage builds solve these problems elegantly, and they’re particularly powerful […]

Docker State of App Dev: Security

Docker

Rebecca Floyd

June 18, 2025 at 15:51

Security is a team sport: why everyone owns it now

Six security takeaways from Docker’s 2025 State of Application Development Report.

In the evolving world of software development, one thing is clear — security is no longer a siloed specialty. It’s a team sport, especially when vulnerabilities strike. That’s one of several key security findings in the 2025 Docker State of Application Development Survey.

Here’s what else we learned about security from our second annual report, which was based on an online survey of over 4,500 industry professionals.

1. Security isn’t someone else’s problem

Forget the myth that only “security people” handle security. Across orgs big and small, roles are blending. If you’re writing code, you’re in the security game. As one respondent put it, “We don’t have dedicated teams — we all do it.” According to the survey, just 1 in 5 organizations outsource security. And it’s top of mind at most others: only 1% of respondents say security is not a concern at their organization.

One exception to this trend: In larger organizations (50 or more employees), software security is more likely to be the exclusive domain of security engineers, with other types of engineers playing less of a role.

2. Everyone thinks they’re in charge of security

Team leads from multiple corners report that they’re the ones focused on security. Seasoned developers are as likely to zero in on it as are mid-career security engineers. And they’re both right. Security has become woven into every function — devs, leads, and ops alike.

3. When vulnerabilities hit, it’s all hands on deck

No turf wars here. When scan alerts go off, everyone pitches in — whether it’s security engineers helping experienced devs to decode scan results, engineering managers overseeing the incident, or DevOps engineers filling in where needed.

Fixing vulnerabilities is also a major time suck. Among security-related tasks that respondents routinely deal with, it was the most selected option across all roles. Worth noting: Last year’s State of Application Development Survey identified security/vulnerability remediation tools as a key area where better tools were needed in the development process.

4. Security isn’t the bottleneck — planning and execution are

Surprisingly, security doesn’t crack the top 10 issues holding teams back. Planning and execution-type activities are bigger sticking points. Translation? Security is better integrated into the workflow than many give it credit for.

5. Shift-left is yesterday’s news

The once-pervasive mantra of “shift security left” is now only the 9th most important trend. Has the shift left already happened? Are AI and cloud complexity drowning it out? Or is this further evidence that security is, by necessity, shifting everywhere?

Again, perhaps security tools have gotten better, making it easier to shift left. (Our 2024 survey identified the shift-left approach as a possible source of frustration for developers and an area where more effective tools could make a difference.) Or perhaps there’s simply broader acceptance of the shift-left trend.

6. Shifting security left may not be the buzziest trend, but it’s still influential

The impact of shifting security left pales beside more dominant trends such as Generative AI and infrastructure as code. But it’s still a strong influence for developers in leadership roles.

Bottom line: Security is no longer a roadblock; it’s a reflex. Teams aren’t asking, “Who owns security?” — they’re asking, “How can we all do it better?”

Why Docker Chose OCI Artifacts for AI Model Packaging

Docker

Emily Casey

June 18, 2025 at 13:32

As AI development accelerates, developers need tools that let them move fast without having to reinvent their workflows. Docker Model Runner introduces a new specification for packaging large language models (LLMs) as OCI artifacts — a format developers already know and trust. It brings model sharing into the same workflows used for containers, with support for OCI registries like Docker Hub.

By using OCI artifacts, teams can skip custom toolchains and work with models the same way they do with container images. In this post, we’ll share why we chose OCI artifacts, how the format works, and what it unlocks for GenAI developers.

Why OCI artifacts?

One of Docker’s goals is to make genAI application development accessible to a larger community of developers. We can do this by helping models become first-class citizens within the cloud native ecosystem.

When models are packaged as OCI artifacts, developers can get started with AI development without the need to learn, vet, and adopt a new distribution toolchain. Instead, developers can discover new models on Hub and distribute variants publicly or privately via existing OCI registries, just like they do with container images today! For teams using Docker Hub, enterprise features like Registry Access Management (RAM) provide policy-based controls and guardrails to help enforce secure, consistent access.

Packaging models as OCI artifacts also paves the way for deeper integration between inference runners like Docker Model Runner and existing tools like containerd and Kubernetes.

Understanding OCI images and artifacts

Many of these advantages apply equally to OCI images and OCI artifacts. To understand why images can be a less optimal fit for LLMs and why a custom artifact specification conveys additional advantages, it helps to first revisit the components of an OCI image and its generic cousin, the OCI artifact.

What are OCI images?

OCI images are a standardized format for container images, defined by the Open Container Initiative (OCI). They package everything needed to run a container: metadata, configuration, and filesystem layers.

An OCI image is composed of three main components:

An image manifest – a JSON file containing references to an image configuration and a set of filesystem layers.
An image configuration – a JSON file containing the layer ordering and OCI runtime configuration.
One or more layers – TAR archives (typically compressed), containing filesystem changesets that, applied in order, produce a container root filesystem.

Below is an example manifest from the busybox image:

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:7b4721e214600044496305a20ca3902677e572127d4d976ed0e54da0137c243a",
    "size": 477
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:189fdd1508372905e80cc3edcdb56cdc4fa216aebef6f332dd3cba6e300238ea",
      "size": 1844697
    }
  ],
  "annotations": {
    "org.opencontainers.image.url": "https://github.com/docker-library/busybox",
    "org.opencontainers.image.version": "1.37.0-glibc"
  }
}

Because the image manifest contains content-addressable references to all image components, the hash of the manifest file, otherwise known as the image digest, can be used to uniquely identify an image.
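
To make the idea of an image digest concrete, here is a minimal Python sketch. It assumes the digest is the SHA-256 of the exact manifest bytes, written in the same `sha256:<hex>` form used by the descriptors above. The abbreviated manifest and the helper name `manifest_digest` are illustrative only and do not come from any Docker tool.

```python
import hashlib
import json


def manifest_digest(manifest_bytes: bytes) -> str:
    """Return a digest string for the exact bytes of a manifest."""
    return "sha256:" + hashlib.sha256(manifest_bytes).hexdigest()


# Hypothetical, abbreviated manifest used only for illustration.
manifest = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "config": {
        "mediaType": "application/vnd.oci.image.config.v1+json",
        "size": 477,
    },
    "layers": [],
}

# The digest covers the serialized bytes, so two encodings of the same JSON
# document produce two different digests.
compact = json.dumps(manifest, separators=(",", ":")).encode("utf-8")
pretty = json.dumps(manifest, indent=2).encode("utf-8")

digest_compact = manifest_digest(compact)
digest_pretty = manifest_digest(pretty)
```

Because the identifier depends on the bytes rather than on the parsed JSON, tools that copy manifests between registries generally need to preserve them byte for byte.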

What are OCI artifacts?

OCI artifacts offer a way to extend the OCI image format to support distributing content beyond container images. They follow the same structure: a manifest, a config file, and one or more layers.

The artifact guidance in the OCI image specifications describes how this same basic structure (manifest + config + layers) can be used to distribute other types of content.

The artifact type is designated by the config file’s media type. For example, in the manifest below config.mediaType is set to application/vnd.cncf.helm.config.v1+json. This indicates to registries and other tooling that the artifact is a Helm chart and should be parsed accordingly.

{
  "schemaVersion": 2,
  "config": {
    "mediaType": "application/vnd.cncf.helm.config.v1+json",
    "digest": "sha256:8ec7c0f2f6860037c19b54c3cfbab48d9b4b21b485a93d87b64690fdb68c2111",
    "size": 117
  },
  "layers": [
    {
      "mediaType": "application/vnd.cncf.helm.chart.content.v1.tar+gzip",
      "digest": "sha256:1b251d38cfe948dfc0a5745b7af5ca574ecb61e52aed10b19039db39af6e1617",
      "size": 2487
    }
  ]
}

In an OCI artifact, layers may be of any media type and are not restricted to filesystem changesets. Whoever defines the artifact type defines the supported layer types and determines how the contents should be used and interpreted.
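
The dispatch on config.mediaType described above can be sketched in a few lines of Python. The two media types are the ones that appear in the example manifests in this post. The lookup table, its labels, and the function name `artifact_kind` are assumptions made for illustration, not part of any registry implementation.

```python
# Config media types seen in the example manifests; the labels are illustrative.
KNOWN_CONFIG_TYPES = {
    "application/vnd.oci.image.config.v1+json": "container image",
    "application/vnd.cncf.helm.config.v1+json": "Helm chart",
}


def artifact_kind(manifest: dict) -> str:
    """Classify a manifest by the media type of its config descriptor."""
    media_type = manifest.get("config", {}).get("mediaType", "")
    return KNOWN_CONFIG_TYPES.get(media_type, "unknown artifact")


# Abbreviated stand-ins for the manifests shown above.
helm_manifest = {
    "schemaVersion": 2,
    "config": {"mediaType": "application/vnd.cncf.helm.config.v1+json"},
    "layers": [],
}
image_manifest = {
    "schemaVersion": 2,
    "config": {"mediaType": "application/vnd.oci.image.config.v1+json"},
    "layers": [],
}

helm_kind = artifact_kind(helm_manifest)
image_kind = artifact_kind(image_manifest)
```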

Using container images vs. custom artifact types

With this background in mind, while we could have packaged LLMs as container images, defining a custom type has some important advantages:

A custom artifact type allows us to define a domain-specific config schema. Programmatic access to key metadata provides a support structure for an ecosystem of useful tools specifically tailored to AI use-cases.
A custom artifact type allows us to package content in formats other than compressed TAR archives, thus avoiding performance issues that arise when LLMs are packaged as image layers. For more details on how model layers are different and why it matters, see the Layers section below.
A custom type ensures that models are packaged and distributed separately from inference engines. This separation is important because it allows users to consume the variant of the inference engine optimized for their system without requiring every model to be packaged in combination with every engine.
A custom artifact type frees us from the expectations that typically accompany a container image. Standalone models are not executable without an inference engine. Packaging as a custom type makes clear that they are not independently runnable, thus avoiding confusion and unexpected errors.

Docker Model Artifacts

Now that we understand the high-level goals, let’s dig deeper into the details of the format.

Media Types

The model specification defines the following media types:

application/vnd.docker.ai.model.config.v0.1+json – identifies a model config JSON file. This value in config.mediaType in a manifest identifies an artifact as a Docker model with config file adhering to v0.1 of the specification.
application/vnd.docker.ai.gguf.v3 – indicates that a layer contains a model packaged as a GGUF file.
application/vnd.docker.ai.license – indicates that a layer contains a plain text software license file.

Expect more media types to be defined in the future as we add runtime configuration, add support for new features like projectors and LoRA adapters, and expand the supported packaging formats for model files.

Manifest

A model manifest is formatted like an image manifest and distinguished by its config.mediaType. The following example manifest, taken from the ai/gemma3 model, references a model config JSON and two layers, one containing a GGUF file and the other containing the model’s license.

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.docker.ai.model.config.v0.1+json",
    "size": 372,
    "digest": "sha256:22273fd2f4e6dbaf5b5dae5c5e1064ca7d0ff8877d308eb0faf0e6569be41539"
  },
  "layers": [
    {
      "mediaType": "application/vnd.docker.ai.gguf.v3",
      "size": 2489757856,
      "digest": "sha256:09b370de51ad3bde8c3aea3559a769a59e7772e813667ddbafc96ab2dc1adaa7"
    },
    {
      "mediaType": "application/vnd.docker.ai.license",
      "size": 8346,
      "digest": "sha256:a4b03d96571f0ad98b1253bb134944e508a4e9b9de328909bdc90e3f960823e5"
    }
  ]
}

Model ID

The manifest digest uniquely identifies the model and is used by Docker Model Runner as the model ID.

Model Config JSON

The model configuration is a JSON file that surfaces important metadata about the model, such as size, parameter count, and quantization, as well as metadata about the artifact provenance (like the creation timestamp).

The following example comes from the ai/gemma3 model on Docker Hub:

{
  "config": {
    "format": "gguf",
    "quantization": "IQ2_XXS/Q4_K_M",
    "parameters": "3.88 B",
    "architecture": "gemma3",
    "size": "2.31 GiB"
  },
  "descriptor": {
    "created": "2025-03-26T09:57:32.086694+01:00"
  },
  "rootfs": {
    "type": "rootfs",
    "diff_ids": [
      "sha256:09b370de51ad3bde8c3aea3559a769a59e7772e813667ddbafc96ab2dc1adaa7",
      "sha256:a4b03d96571f0ad98b1253bb134944e508a4e9b9de328909bdc90e3f960823e5"
    ]
  }
}

By defining a domain-specific configuration schema, we allow tools to access and use model metadata cheaply — by fetching and parsing a small JSON file — only fetching the model itself when needed.

For example, a registry frontend like Docker Hub can directly surface this data to users who can, in turn, use it to compare models or select based on system capabilities and requirements. Tooling might use this data to estimate memory requirements for a given model. It could then assist in the selection process by suggesting the best variant that is compatible with the available resources.
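
As a rough illustration of the memory-estimation idea, the Python sketch below parses the size string from a model config and checks it against an available-memory budget. It assumes the size field always looks like "2.31 GiB", as in the example above. The 1.2 overhead factor and the function names are hypothetical placeholders, not values or APIs from Docker Model Runner.

```python
import json

# Binary unit multipliers; assumes the size field uses one of these suffixes.
UNITS = {"KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def parse_size(text: str) -> int:
    """Convert a string such as '2.31 GiB' into a byte count."""
    value, unit = text.split()
    return int(float(value) * UNITS[unit])


def fits_in_memory(config_json: str, available_bytes: int, overhead: float = 1.2) -> bool:
    """Guess whether a model fits, using a hypothetical overhead multiplier."""
    config = json.loads(config_json)["config"]
    return parse_size(config["size"]) * overhead <= available_bytes


# Only the small config JSON is needed; the model file itself is never fetched.
config_json = """
{
  "config": {
    "format": "gguf",
    "quantization": "IQ2_XXS/Q4_K_M",
    "parameters": "3.88 B",
    "architecture": "gemma3",
    "size": "2.31 GiB"
  }
}
"""

fits_4_gib = fits_in_memory(config_json, 4 * 1024**3)
fits_2_gib = fits_in_memory(config_json, 2 * 1024**3)
```

A real estimate would also have to account for context size and the inference engine's own buffers, which this sketch folds into a single multiplier.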

Layers

Layers in a model artifact differ from layers within an OCI image in two important respects.

Unlike an image layer, where compression is recommended, model layers are always uncompressed. Because models are large, high-entropy files, compressing them provides a negligible reduction in size, while compressing and decompressing them is time- and compute-intensive.

In contrast to a layer in an OCI image, which contains multiple files in an archive, each “layer” in a model artifact must contain a single raw file. This allows runtimes like Docker Model Runner to reduce disk usage on the client machine by storing a single uncompressed copy of the model. This file can then be directly memory mapped by the inference engine at runtime.

The lack of file names, hierarchy, and metadata (e.g. modification time) ensures that identical model files always result in identical reusable layer blobs. This prevents unnecessary duplication, which is particularly important when working with LLMs, given the file size.
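
The deduplication argument can be checked with a small Python experiment. It hashes the same payload twice as a raw blob, then hashes two TAR archives that wrap that payload with different modification times. The payload, the file name, and the timestamps are stand-ins chosen for illustration.

```python
import hashlib
import io
import tarfile


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def tar_layer(payload: bytes, name: str, mtime: int) -> bytes:
    """Pack the payload into an uncompressed TAR archive with the given metadata."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        info.mtime = mtime
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


payload = b"stand-in for model weights"

# Raw blobs: identical content always hashes to the same digest.
raw_a = sha256_hex(payload)
raw_b = sha256_hex(payload)

# TAR layers: the same content with a different timestamp hashes differently.
tar_a = sha256_hex(tar_layer(payload, "model.gguf", mtime=1_700_000_000))
tar_b = sha256_hex(tar_layer(payload, "model.gguf", mtime=1_700_000_001))
```

With multi-gigabyte model files, the difference between one shared blob and two near-identical archives is the difference between storing the model once and storing it twice.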

You may have noticed that these “layers” are not really filesystem layers at all. They are files, but they do not specify a filesystem. So, how does this work at runtime? When Docker Model Runner runs a model, instead of finding the GGUF file by name in a model filesystem, the desired file is identified by its media type (application/vnd.docker.ai.gguf.v3) and fetched from the model store. For more information on the Model Runner architecture, please see the architecture overview in this accompanying blog post.
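
Lookup by media type, rather than by file name, might look like the following Python sketch. The manifest is an abbreviated copy of the ai/gemma3 example above. The helper `find_layer` is hypothetical and simply returns the first matching layer descriptor; how Docker Model Runner resolves that digest inside its model store is not shown here.

```python
GGUF_MEDIA_TYPE = "application/vnd.docker.ai.gguf.v3"


def find_layer(manifest: dict, media_type: str) -> dict:
    """Return the first layer descriptor with the requested media type."""
    for layer in manifest.get("layers", []):
        if layer.get("mediaType") == media_type:
            return layer
    raise LookupError(f"no layer with media type {media_type!r}")


# Abbreviated copy of the model manifest shown earlier in this post.
manifest = {
    "schemaVersion": 2,
    "layers": [
        {
            "mediaType": "application/vnd.docker.ai.gguf.v3",
            "size": 2489757856,
            "digest": "sha256:09b370de51ad3bde8c3aea3559a769a59e7772e813667ddbafc96ab2dc1adaa7",
        },
        {
            "mediaType": "application/vnd.docker.ai.license",
            "size": 8346,
            "digest": "sha256:a4b03d96571f0ad98b1253bb134944e508a4e9b9de328909bdc90e3f960823e5",
        },
    ],
}

# The digest, not a path inside a filesystem, identifies the blob to load.
gguf_layer = find_layer(manifest, GGUF_MEDIA_TYPE)
gguf_digest = gguf_layer["digest"]
```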

Distribution

Like OCI images and other OCI artifacts, Docker model artifacts are distributed via registries like Docker Hub, Artifactory, or Azure Container Registry that comply with the OCI distribution specification.

Discovery

Docker Hub

The Docker Hub Gen AI catalog aids in the discovery of popular models. These models are packaged in the format described here and are compatible with Docker Model Runner and any other runtime that supports the OCI specification.

Hugging Face

If you are accustomed to exploring models on Hugging Face, there’s good news! Hugging Face now supports on-demand conversion to the Docker Model Artifact format when you pull from Hugging Face with docker model pull.

What’s Next?

Hopefully, you now have a better understanding of the Docker OCI Model format and how it supports our goal of making AI app development more accessible to developers via familiar workflows and commands. But this version of the artifact format is just the beginning! In the future, you can expect enhancements to the packaging format that bring this level of accessibility and flexibility to a broader range of use cases. Future versions will support:

Additional runtime configuration options like templates, context size, and default parameters. This will allow users to configure models for specific use cases and distribute that config alongside the model, as a single immutable artifact.
LoRA adapters, allowing users to extend existing model artifacts with use-case-specific fine-tuning.
Multi-modal projectors, enabling users to package multi-modal models, such as language-and-vision models, using LLaVA-style projectors.
Model index files that provide a set of models with different parameter counts and quantizations, allowing runtimes to select the best option for the available resources.

In addition to adding features, we are committed to fostering an open ecosystem. Expect:

Deeper integrations into containerd for a more native runtime experience.
Efforts to harmonize with ModelPack and other model packaging standards to improve interoperability.

These advancements show our ongoing commitment to making the OCI artifact a versatile and flexible way to package and run AI models, delivering the same ease and reliability developers already expect from Docker.

Learn more

Get an inside look at the design architecture of the Docker Model Runner.
Read our quickstart guide to Docker Model Runner.
Find documentation for Model Runner.

Behind the scenes: How we designed Docker Model Runner and what’s next

Docker

Jacob Howard

June 18, 2025 at 13:30

The last few years have made it clear that AI models will continue to be a fundamental component of many applications. The catch is that they’re also a fundamentally different type of component, with complex software and hardware requirements that don’t (yet) fit neatly into the constraints of container-oriented development lifecycles and architectures. To help address this problem, Docker launched the Docker Model Runner with Docker Desktop 4.40. Since then, we’ve been working aggressively to expand Docker Model Runner with additional OS and hardware support, deeper integration with popular Docker tools, and improvements to both performance and usability.
For those interested in Docker Model Runner and its future, we offer a behind-the-scenes look at its design, development, and roadmap.

Note: Docker Model Runner is really two components: the model runner and the model distribution specification. In this article, we’ll be covering the former, but be sure to check out the companion blog post by Emily Casey for the equally important distribution side of the story.

Design goals

Docker Model Runner’s primary design goal was to allow users to run AI models locally and to access them from both containers and host processes. While that’s simple enough to articulate, it still leaves an enormous design space in which to find a solution. Fortunately, we had some additional constraints: we were a small engineering team, and we had some ambitious timelines. Most importantly, we didn’t want to compromise on UX, even if we couldn’t deliver it all at once. In the end, this motivated design decisions that have so far allowed us to deliver a viable solution while leaving plenty of room for future improvement.

Multiple backends

One thing we knew early on was that we weren’t going to write our own inference engine (Docker’s wheelhouse is containerized development, not low-level inference engines). We’re also big proponents of open-source, and there were just so many great existing solutions! There’s llama.cpp, vLLM, MLX, ONNX, and PyTorch, just to name a few.

Of course, being spoiled for choice can also be a curse — which to choose? The obvious answer was: as many as possible, but not all at once.

We decided to go with llama.cpp for our initial implementation, but we intentionally designed our APIs with an additional, optional path component (the {name} in /engines/{name}) to allow users to take advantage of multiple future backends. We also designed interfaces and stubbed out implementations for other backends to enforce good development hygiene and to avoid becoming tethered to one “initial” implementation.
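
One way to picture the optional {name} component is as a small path-building helper, sketched below in Python. The function `inference_path` is hypothetical. The way the engine prefix is joined to an OpenAI-style endpoint such as /v1/chat/completions is an assumption made for illustration, so check the Model Runner documentation for the actual routes.

```python
from typing import Optional


def inference_path(endpoint: str, engine: Optional[str] = None) -> str:
    """Build a request path, inserting the engine name only when one is given."""
    prefix = "/engines" if engine is None else f"/engines/{engine}"
    return f"{prefix}/{endpoint.lstrip('/')}"


# Without a name, the server is free to choose its default backend.
default_path = inference_path("/v1/chat/completions")

# With a name, the caller pins the request to one backend.
pinned_path = inference_path("/v1/chat/completions", engine="llama.cpp")
```

Keeping the engine in the path, rather than in the request body, leaves the OpenAI-compatible payload untouched, so existing clients only need a different base URL.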

OpenAI API compatibility

The second design choice we had to make was how to expose inference to consumers in containers. While there was also a fair amount of choice in the inference API space, we found that the OpenAI API standard seemed to offer the best initial tooling compatibility. We were also motivated by the fact that several teams inside Docker were already using this API for various real-world products. While we may support additional APIs in the future, we’ve so far found that this API surface is sufficient for most applications. One gap that we know exists is full compatibility with this API surface, which is something we’re working on iteratively.

This decision also drove our choice of llama.cpp as our initial backend. The llama.cpp project already offered a turnkey option for OpenAI API compatibility through its server implementation. While we had to make some small modifications (e.g. Unix domain socket support), this offered us the fastest path to a solution. We’ve also started contributing these small patches upstream, and we hope to expand our contributions to these projects in the future.

First-class citizenship for models in the Docker API

While the OpenAI API standard was the most ubiquitous option amongst existing tooling, we also knew that we wanted models to be first-class citizens in the Docker Engine API. Models have a fundamentally different execution lifecycle than the processes that typically make up the ENTRYPOINTs of containers, and thus, they don’t fit well under the standard /containers endpoints of the Docker Engine API. However, much like containers, images, networks, and volumes, models are such a fundamental component that they really deserve their own API resource type. This motivated the addition of a set of /models endpoints, closely modeled after the /images endpoints, but separate for reasons that are best discussed in the distribution blog post.

GPU acceleration

Another critical design goal was support for GPU acceleration of inference operations. Even the smallest useful models are extremely computationally demanding, while more sophisticated models (such as those with tool-calling capabilities) would be a stretch to fit onto local hardware at all. GPU support was going to be non-negotiable for a useful experience.

Unfortunately, passing GPUs across the VM boundary in Docker Desktop, especially in a way that would be reliable across platforms and offer a usable computation API inside containers, was going to be either impossible or very flaky.

As a compromise, we decided to run inference operations outside of the Docker Desktop VM and simply proxy API calls from the VM to the host. While there are some risks with this approach, we are working on initiatives to mitigate these with containerd-hosted sandboxing on macOS and Windows. Moreover, with Docker-provided models and application-provided prompts, the risk is somewhat lower, especially given that inference consists primarily of numerical operations. We assess the risk in Docker Desktop to be about on par with accessing host-side services via host.docker.internal (something already enabled by default).

However, agents that drive tool usage with model output can cause more significant side effects, and that’s something we needed to address. Fortunately, using the Docker MCP Toolkit, we’re able to perform tool invocation inside ephemeral containers, offering reliable encapsulation of the side effects that models might drive. This hybrid approach allows us to offer the best possible local performance with relative peace of mind when using tools.

Outside the context of Docker Desktop, for example, in Docker CE, we’re in a significantly better position due to the lack of a VM boundary (or at least a very transparent VM boundary in the case of a hypervisor) between the host hardware and containers. When running in standalone mode in Docker CE, the Docker Model Runner will have direct access to host hardware (e.g. via the NVIDIA Container Toolkit) and will run inference operations within a container.

Modularity, iteration, and open-sourcing

As previously mentioned, the Docker Model Runner team is relatively small, which meant that we couldn’t rely on a monolithic architecture if we wanted to effectively parallelize the development work for Docker Model Runner. Moreover, we had an early and overarching directive: open-source as much as possible.

We decided on three high-level components around which we could organize development work: the model runner, the model distribution tooling, and the model CLI plugin.

Breaking up these components allowed us to divide work more effectively, iterate faster, and define clean API boundaries between different concerns. While there have been some tricky dependency hurdles (in particular when integrating with closed-source components), we’ve found that the modular approach has facilitated faster incremental changes and support for new platforms.

The High-Level Architecture

At a high level, the Docker Model Runner architecture is composed of the three components mentioned above (the runner, the distribution code, and the CLI), but there are also some interesting sub-components within each:

Figure 1: Docker Model Runner high-level architecture

How these components are packaged and hosted (and how they interact) also depends on the platform where they’re deployed. In each case it looks slightly different. Sometimes they run on the host, sometimes they run in a VM, sometimes they run in a container, but the overall architecture looks the same.

Model storage and client

The core architectural component is the model store. This component, provided by the model distribution code, is where the actual model tensor files are stored. These files are stored differently (and separately) from images because (1) they’re high-entropy and not particularly compressible and (2) the inference engine needs direct access to the files so that it can do things like mapping them into its virtual address space via mmap(). For more information, see the accompanying model distribution blog post.
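
The benefit of keeping a single raw file becomes clearer with a small memory-mapping example. The Python sketch below writes a tiny stand-in file and maps it read-only, so bytes are paged in on access rather than copied up front. The file contents and the helper name `read_prefix` are illustrative, and a real inference engine would map the file through its own native code.

```python
import mmap
import os
import tempfile


def read_prefix(path: str, length: int = 4) -> bytes:
    """Map a file read-only and return its first bytes without reading it all."""
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[:length]


# Create a small stand-in for a model file; real models are gigabytes in size.
with tempfile.NamedTemporaryFile(delete=False, suffix=".gguf") as tmp:
    tmp.write(b"GGUF" + b"\x00" * 1020)
    model_path = tmp.name

prefix = read_prefix(model_path)
os.remove(model_path)
```

This only works directly on an uncompressed file. A compressed or archived layer would have to be unpacked to a second copy on disk before it could be mapped.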

The model distribution code also provides the model distribution client. This component performs operations (such as pulling models) using the model distribution protocol against OCI registries.

Model runner

Built on top of the model store is the model runner. The model runner maps inbound inference API requests (e.g. /v1/chat/completions or /v1/embeddings requests) to processes hosting pairs of inference engines and models. It includes scheduler, loader, and runner components that coordinate the loading of models in and out of memory so that concurrent requests can be serviced, even if models can’t be loaded simultaneously (e.g. due to resource constraints). This makes the execution lifecycle of models different from that of containers, with engines and models operating as ephemeral processes (mostly hidden from users) that can be terminated and unloaded from memory as necessary (or when idle). A different backend process is run for each combination of engine (e.g. llama.cpp) and model (e.g. ai/qwen3:8B-Q4_K_M) as required by inference API requests (though multiple requests targeting the same pair will reuse the same runner and backend processes if possible).
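
The pairing of engines and models can be sketched as a small pool keyed by (engine, model). The Python class below is a simplified illustration, not the scheduler's actual design: backends are represented by plain strings instead of processes, and least-recently-used eviction is an assumed policy that the post does not specify.

```python
from collections import OrderedDict
from typing import Tuple


class BackendPool:
    """Toy pool that keeps at most max_loaded (engine, model) backends alive."""

    def __init__(self, max_loaded: int) -> None:
        self.max_loaded = max_loaded
        self.launches = 0
        self._loaded: "OrderedDict[Tuple[str, str], str]" = OrderedDict()

    def acquire(self, engine: str, model: str) -> str:
        key = (engine, model)
        if key in self._loaded:
            # Same pair as an earlier request: reuse the running backend.
            self._loaded.move_to_end(key)
            return self._loaded[key]
        if len(self._loaded) >= self.max_loaded:
            # Assumed policy: unload the least recently used backend.
            self._loaded.popitem(last=False)
        self.launches += 1
        handle = f"backend-{self.launches}:{engine}:{model}"
        self._loaded[key] = handle
        return handle


pool = BackendPool(max_loaded=1)
first = pool.acquire("llama.cpp", "ai/qwen3:8B-Q4_K_M")
again = pool.acquire("llama.cpp", "ai/qwen3:8B-Q4_K_M")  # reused, no new launch
other = pool.acquire("llama.cpp", "ai/gemma3")           # evicts the first pair
```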

The runner also includes an installer service that can dynamically download backend binaries and libraries, allowing users to selectively enable features (such as CUDA support) that might require downloading hundreds of MBs of dependencies.

Finally, the model runner serves as the central server for all Docker Model Runner APIs, including the /models APIs (which it routes to the model distribution code) and the /engines APIs (which it routes to its scheduler). This API server will always opt to hold in-flight requests until the resources (primarily RAM or VRAM) are available to service them, rather than returning something like a 503 response. This is critical for a number of usage patterns, such as multiple agents running with different models or concurrent requests for both embedding and completion.
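The "hold the request instead of returning 503" idea can be illustrated with a condition variable. This is only a sketch under assumptions: `MemoryGate` and its megabyte accounting are hypothetical names and units, not part of Docker Model Runner.

```python
import threading


class MemoryGate:
    """Callers block until enough memory is free, instead of failing fast."""

    def __init__(self, total_mb):
        self.free_mb = total_mb
        self.cond = threading.Condition()

    def acquire(self, mb):
        with self.cond:
            # Wait (rather than reject) until the request can be serviced.
            self.cond.wait_for(lambda: self.free_mb >= mb)
            self.free_mb -= mb

    def release(self, mb):
        with self.cond:
            self.free_mb += mb
            # Wake every waiter so each can re-check its own requirement.
            self.cond.notify_all()
```

A request that needs more than the total would wait forever in this sketch, so a real server would need an upfront size check or a timeout.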

Model CLI

The primary user-facing component of the Docker Model Runner architecture is the model CLI. This component is a standard Docker CLI plugin that offers an interface very similar to the docker image command. While the lifecycle of model execution is different from that of containers, the concepts (such as pushing, pulling, and running) should be familiar enough to existing Docker users.

The model CLI communicates with the model runner’s APIs to perform almost all of its operations (though the transport for that communication varies by platform). The model CLI is context-aware, allowing it to determine if it’s talking to a Docker Desktop model runner, Docker CE model runner, or a model runner on some custom platform. Because we’re using the standard Docker CLI plugin framework, we get all of the standard Docker Context functionality for free, making this detection much easier.

API design and routing

As previously mentioned, the Docker Model Runner comprises two sets of APIs: the Docker-style APIs and the OpenAI-compatible APIs. The Docker-style APIs (modeled after the /image APIs) include the following endpoints:

POST /models/create (Model pulling)
GET /models (Model listing)
GET /models/{namespace}/{name} (Model metadata)
DELETE /models/{namespace}/{name} (Model deletion)

The bodies for these requests look very similar to their image analogs. There’s no documentation at the moment, but you can get a glimpse of the format by looking at their corresponding Go types.

In contrast, the OpenAI endpoints follow a different but still RESTful convention:

GET /engines/{engine}/v1/models (OpenAI-format model listing)
GET /engines/{engine}/v1/models/{namespace}/{name} (OpenAI-format model metadata)
POST /engines/{engine}/v1/chat/completions (Chat completions)
POST /engines/{engine}/v1/completions (Text completions (legacy endpoint))
POST /engines/{engine}/v1/embeddings (Create embeddings)

At this point in time, only one {engine} value is supported (llama.cpp), and it can also be omitted to use the default (llama.cpp) engine.
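As a quick illustration of the two path conventions, the following Python helpers build request paths from the endpoint lists above. The function names are hypothetical. The behavior when the engine is omitted follows the /engines/v1 form used elsewhere in this post.

```python
def engine_path(endpoint, engine=None):
    """Build an OpenAI-style path such as /engines/llama.cpp/v1/embeddings."""
    endpoint = endpoint.strip("/")
    if engine is None:
        # Omitting the engine segment selects the default engine.
        return f"/engines/v1/{endpoint}"
    return f"/engines/{engine}/v1/{endpoint}"


def model_path(namespace=None, name=None):
    """Build a Docker-style path: /models or /models/{namespace}/{name}."""
    if namespace is None and name is None:
        return "/models"
    if not namespace or not name:
        raise ValueError("namespace and name must be given together")
    return f"/models/{namespace}/{name}"
```

For example, `engine_path("chat/completions")` gives `/engines/v1/chat/completions`, while `model_path("ai", "smollm2")` gives `/models/ai/smollm2`.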

We make these APIs available on several different endpoints:

First, in Docker Desktop, they’re available on the Docker socket (/var/run/docker.sock), both inside and outside containers. This is in service of our design goal of having models as a first-class citizen in the Docker Engine API. At the moment, these endpoints are prefixed with a /exp/vDD4.40 path (to avoid dependencies on APIs that may evolve during development), but we’ll likely remove this prefix in the next few releases since these APIs have now mostly stabilized and will evolve in a backward-compatible way.

Second, also in Docker Desktop, we make the APIs available on a special model-runner.docker.internal endpoint that’s accessible just from containers (though not currently from ECI containers, because we want to have inference sandboxing implemented first). This TCP-based endpoint exposes just the /models and /engines API endpoints (not the whole Docker API) and is designed to serve existing tooling (which likely can’t access APIs via a Unix domain socket). No /exp/vDD4.40 prefix is used in this case.

Finally, in both Docker Desktop and Docker CE, we make the /models and /engines API endpoints available on a host TCP endpoint (localhost:12434 by default, again without any /exp/vDD4.40 prefix). In Docker Desktop this is optional and not enabled by default. In Docker CE, it’s a critical component of how the API endpoints are accessed, because we currently lack the integration to add endpoints to Docker CE’s /var/run/docker.sock or to inject a custom model-runner.docker.internal hostname. From containers, we therefore advise using the standard 172.17.0.1 host gateway address to access this localhost-exposed port (e.g. setting your OpenAI API base URL to http://172.17.0.1:12434/engines/v1). Hopefully we’ll be able to unify this across Docker platforms in the near future (see our roadmap below).
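The three access routes can be summarized in a small lookup table. The route names below are made up for illustration, the `unix://` string is just a label for the socket transport, and the sketch assumes the model-runner.docker.internal endpoint speaks plain HTTP on the default port.

```python
# Route name -> (transport address, path prefix before /models or /engines).
ROUTES = {
    # Docker Desktop, via the Docker socket (inside or outside containers).
    "desktop-socket": ("unix:///var/run/docker.sock", "/exp/vDD4.40"),
    # Docker Desktop, reachable from containers only.
    "desktop-container": ("http://model-runner.docker.internal", ""),
    # Docker Desktop (opt-in) or Docker CE, from the host itself.
    "host-tcp": ("http://localhost:12434", ""),
    # Docker CE, from a container using the host gateway address.
    "ce-container": ("http://172.17.0.1:12434", ""),
}


def openai_base(route):
    """Return (transport address, OpenAI base path) for a named route."""
    transport, prefix = ROUTES[route]
    return transport, f"{prefix}/engines/v1"
```

Only the Docker socket route carries the experimental prefix, so `openai_base("desktop-socket")` yields the path `/exp/vDD4.40/engines/v1`. The TCP routes yield `/engines/v1`.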

First up: Docker Desktop

The natural first step for Docker Model Runner was integration into Docker Desktop. In Docker Desktop, we have more direct control over integration with the Docker Engine, as well as existing processes that we can use to host the model runner components. In this case, the model runner and model distribution components live in the Docker Desktop host backend process (the com.docker.backend process you may have seen running) and we use special middleware and networking magic to route requests on /var/run/docker.sock and model-runner.docker.internal to the model runner’s API server. Since the individual inference backend processes run as subprocesses of com.docker.backend, there’s no risk of a crash in Docker Desktop if, for example, an inference backend is killed by an Out Of Memory (OOM) error.

We started initially with support for macOS on Apple Silicon, because it provided the most uniform platform for developing the model runner functionality, but we implemented most of the functionality along the way to build and test for all Docker Desktop platforms. This made it significantly easier to port to Windows on AMD64 and ARM64 platforms, as well as the GPU variations that we found there.

The one complexity with Windows was the larger size of the supporting library dependencies for the GPU-based backends. It wouldn’t have been feasible (or tolerated) to add another 500 MB – 1 GB to the Docker Desktop for Windows installer, so we decided to default to a CPU-based backend in Docker Desktop for Windows with optional support for the GPU backend. This was the primary motivating factor for the dynamic installer component of the model runner (in addition to our desire for incremental updates to different backends).

This all sounds like a very well-planned exercise, and we did indeed start with a three-component design and strictly enforced API boundaries, but in truth we started with the model runner service code as a sub-package of the Docker Desktop source code. This made it much easier to iterate quickly, especially as we were exploring the architecture for the different services. Fortunately, by sticking to a relatively strict isolation policy for the code, and enforcing clean dependencies through APIs and interfaces, we were able to easily extract the code (kudos to the excellent git-filter-repo tool) into a separate repository for the purposes of open-sourcing.

Next stop: Docker CE

Aside from Docker’s penchant for open-sourcing, one of the main reasons that we wanted to make the Docker Model Runner source code publicly available was to support integration into Docker CE. Our goal was to package the docker model command in the same way as docker buildx and docker compose.

The trick with Docker CE is that we wanted to ship Docker Model Runner as a “vanilla” Docker CLI plugin (i.e. without any special privileges or API access), which meant that we didn’t have a backend process that could host the model runner service. However, in the Docker CE case, the boundary between host hardware and container processes is much less disruptive, meaning that we could actually run Docker Model Runner in a container and simply make any accelerator hardware available to it directly. So, much like a standalone BuildKit builder container, we run the Docker Model Runner as a standalone container in Docker CE, with a special named volume for model storage (meaning you can uninstall the runner without having to re-pull models). This “installation” is performed by the model CLI automatically (and when necessary) by pulling the docker/model-runner image and starting a container. Explicit configuration for the runner can also be specified using the docker model install-runner command. If you want, you can also remove the model runner (and optionally the model storage) using docker model uninstall-runner.

This unfortunately leads to one small compromise with the UX: we don’t currently support the model runner APIs on /var/run/docker.sock or on the special model-runner.docker.internal URL. Instead, the model runner API server listens on the host system’s loopback interface at localhost:12434 (by default), which is available inside most containers at 172.17.0.1:12434. If desired, users can also make this available on model-runner.docker.internal:12434 by utilizing something like --add-host=model-runner.docker.internal:host-gateway when running docker run or docker create commands. This can also be achieved by using the extra_hosts key in a Compose YAML file. We have plans to make this more ergonomic in future releases.
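For scripted setups, the host mapping can be added when assembling the docker run command line. The helper below is a hypothetical convenience wrapper, not part of any Docker tooling. Note that the flag is spelled with two ASCII hyphens.

```python
import shlex

HOST_MAPPING = "model-runner.docker.internal:host-gateway"


def docker_run_args(image, *command):
    """Build a docker run argv that maps the model runner hostname to the host."""
    return [
        "docker", "run", "--rm",
        f"--add-host={HOST_MAPPING}",
        image, *command,
    ]


def as_shell(args):
    """Render the argv as a copy-pasteable shell command."""
    return shlex.join(args)
```

Inside a container started this way, the API would then be reachable at model-runner.docker.internal:12434, matching the address described above.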

The road ahead…

The status quo is Docker Model Runner support in Docker Desktop on macOS and Windows and support for Docker CE on Linux (including WSL2), but that’s definitely not the end of the story. Over the next few months, we have a number of initiatives planned that we think will reshape the user experience, performance, and security of Docker Model Runner.

Additional GUI and CLI functionality

The most visible functionality coming out over the next few months will be in the model CLI and the “Models” tab in the Docker Desktop dashboard. Expect to see new commands (such as df, ps, and unload) that will provide more direct support for monitoring and controlling model execution. Also, expect to see new and expanded layouts and functionality in the Models tab.

Expanded OpenAI API support

A less-visible but equally important aspect of the Docker Model Runner user experience is our compatibility with the OpenAI API. There are dozens of endpoints and parameters to support (and we already support many), so we will work to expand API surface compatibility with a focus on practical use cases and prioritization of compatibility with existing tools.

containerd and Moby integration

One of the longer-term initiatives that we’re looking at is integration with containerd. containerd already provides a modular runtime system that allows for task execution coordinated with storage. We believe this is the right way forward and that it will allow us to better codify the relationship between model storage, model execution, and model execution sandboxing.

In combination with the containerd work, we would also like tighter integration with the Moby project. While our existing Docker CE integration offers a viable and performant solution, we believe that better ergonomics could be achieved with more direct integration. In particular, niceties like support for model-runner.docker.internal DNS resolution in Docker CE are on our radar. Perhaps the biggest win from this tighter integration would be to expose Docker Model Runner APIs on the Docker socket and to include the API endpoints (e.g. /models) in the official Docker Engine API documentation.

Kubernetes

One of the product goals for Docker Model Runner was a consistent experience from development inner loop to production, and Kubernetes is inarguably a part of that path. The existing Docker Model Runner images that we’re using for Docker CE will also work within a Kubernetes cluster, and we’re currently developing instructions to set up a Docker Model Runner instance in a Kubernetes cluster. The big difference with Kubernetes is the variety of cluster and application architectures in use, so we’ll likely end up with different “recipes” for how to configure the Docker Model Runner in different scenarios.

vLLM

One of the things we’ve heard from a number of customers is that vLLM forms a core component of their production stack. This was also the first alternate backend that we stubbed out in the model runner repository, and the time has come to start poking at an implementation.

Even more to come…

Finally, there are some bits that we just can’t talk about yet, but they will fundamentally shift the way that developers interact with models. Be sure to tune in to Docker’s sessions at WeAreDevelopers from July 9–11 for some exciting announcements around AI-related initiatives at Docker.

Learn more

Explore the story behind our model distribution specification
Read our quickstart guide to Docker Model Runner.
Find documentation for Model Runner.
Subscribe to the Docker Navigator Newsletter.
New to Docker? Create an account.
Have questions? The Docker community is here to help.

Understanding the n8n app and Its Solutions

Collabnix

Tanvir Kour

June 18, 2025 at 06:44

In today’s digital world, we use dozens of different apps and services every day. Email, Slack, Google Sheets, databases, social media, CRM systems – the list goes on. While each tool serves its purpose, getting them to work together smoothly can be a nightmare. Enter n8n (pronounced “n-eight-n”), a powerful workflow automation platform that connects […]

LM Studio vs Ollama: Picking the Right Tool for Local LLM Use

Collabnix

Tanvir Kour

June 16, 2025 at 11:31

LM Studio prioritizes ease of use with a polished GUI ideal for beginners, while Ollama offers greater flexibility and control through its developer-friendly command-line interface and REST API. Choose LM Studio if you want a plug-and-play experience with visual controls, or Ollama if you prefer command-line power and deeper customization options. The landscape of local […]

Neural Networks and AI’s Impact on the Evolution of Gaming Experience

Collabnix

Tanvir Kour

June 16, 2025 at 07:57

Artificial intelligence is no longer just a background tool in the gaming world. Neural networks now shape how stories unfold, how enemies react, and how levels are generated. Players are seeing more personalized, dynamic, and unpredictable experiences thanks to AI integration. But what happens when systems collect too much personal data during gameplay? What if […]

How to Build, Run, and Package AI Models Locally with Docker Model Runner

Docker

Vladimir Mikhalev

June 12, 2025 at 16:00

Introduction

As a Senior DevOps Engineer and Docker Captain, I’ve helped build AI systems for everything from retail personalization to medical imaging. One truth stands out: AI capabilities are core to modern infrastructure.

This guide will show you how to run and package local AI models with Docker Model Runner — a lightweight, developer-friendly tool for working with AI models pulled from Docker Hub or Hugging Face. You’ll learn how to run models in the CLI or via API, publish your own model artifacts, and do it all without setting up Python environments or web servers.

What is AI in Development?

Artificial Intelligence (AI) refers to systems that mimic human intelligence, including:

Making decisions via machine learning
Understanding language through NLP
Recognizing images with computer vision
Learning from new data automatically

Common Types of AI in Development:

Machine Learning (ML): Learns from structured and unstructured data
Deep Learning: Neural networks for pattern recognition
Natural Language Processing (NLP): Understands/generates human language
Computer Vision: Recognizes and interprets images

Why Package and Run Your Own AI Model?

Local model packaging and execution offer full control over your AI workflows. Instead of relying on external APIs, you can run models directly on your machine — unlocking:

Faster inference with local compute (no latency from API calls)
Greater privacy by keeping data and prompts on your own hardware
Customization through packaging and versioning your own models
Seamless CI/CD integration with tools like Docker and GitHub Actions
Offline capabilities for edge use cases or constrained environments

Platforms like Docker and Hugging Face make cutting-edge AI models instantly accessible without building from scratch. Running them locally means lower latency, better privacy, and faster iteration.

Real-World Use Cases for AI

Chatbots & Virtual Assistants: Automate support (e.g., ChatGPT, Alexa)
Generative AI: Create text, art, music (e.g., Midjourney, Lensa)
Dev Tools: Autocomplete and debug code (e.g., GitHub Copilot)
Retail Intelligence: Recommend products based on behavior
Medical Imaging: Analyze scans for faster diagnosis

How to Package and Run AI Models Locally with Docker Model Runner

Prerequisites:

Docker Desktop 4.40+ installed
Experimental features and Model Runner enabled in Docker Desktop settings
(Recommended) Windows 11 with NVIDIA GPU or Mac with Apple Silicon
Internet access for downloading models from Docker Hub or Hugging Face

Step 0 — Enable Docker Model Runner

Open Docker Desktop

Go to Settings → Features in development

Under the Experimental features tab, enable Access experimental features

Click Apply and restart

Quit and reopen Docker Desktop to ensure changes take effect

Reopen Settings → Features in development

Switch to the Beta tab and check Enable Docker Model Runner

(Optional) Enable host-side TCP support to access the API from localhost

Once enabled, you can use the docker model CLI and manage models in the Models tab.

Screenshot of Docker Desktop’s Features in development tab with Docker Model Runner and Dev Environments enabled.

Step 1: Pull a Model

From Docker Hub:

docker model pull ai/smollm2

Or from Hugging Face (GGUF format):

docker model pull hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF

Note: Only GGUF models are supported. GGUF (commonly expanded as GPT-Generated Unified Format) is a lightweight binary file format designed for efficient local inference, especially with CPU-optimized runtimes like llama.cpp. It includes the model weights, tokenizer, and metadata all in one place, making it ideal for packaging and distributing LLMs in containerized environments.

Step 2: Tag and Push to Local Registry (Optional)

If you want to push models to a private or local registry:

Tag model with your registry’s address:

docker model tag hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF localhost:6000/foobar

Run a local Docker registry:

docker run -d -p 6000:5000 --name registry registry:2

Push the model to the local registry:

docker model push localhost:6000/foobar

Check your local models with:

docker model list

Step 3: Run the Model

Run a prompt (one-shot)

docker model run ai/smollm2 "What is Docker?"

Interactive chat mode

docker model run ai/smollm2

Note: Models are loaded into memory on demand and unloaded after 5 minutes of inactivity.

Step 4: Test via OpenAI-Compatible API

To call the model from the host:

Enable TCP host access for Model Runner (via Docker Desktop GUI or CLI)

Screenshot of Docker Desktop’s Features in development tab showing host-side TCP support enabled for Docker Model Runner.

docker desktop enable model-runner --tcp 12434

Send a prompt using the OpenAI-compatible chat endpoint:

curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Tell me about the fall of Rome."}
    ]
  }'

Note: No API key required — this runs locally and securely on your machine.
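If you would rather call the endpoint from a script, the same request can be built with Python's standard library. This is a minimal sketch: the function names are illustrative, and `send` assumes the response follows the usual OpenAI chat-completion shape and that Model Runner is listening on localhost:12434.

```python
import json
import urllib.request

BASE_URL = "http://localhost:12434/engines/llama.cpp/v1"


def build_chat_request(model, user_prompt,
                       system_prompt="You are a helpful assistant."):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request; needs a running Model Runner with host-side TCP enabled."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumed OpenAI-format response: the reply text is in the first choice.
    return body["choices"][0]["message"]["content"]
```

Usage would look like `print(send(build_chat_request("ai/smollm2", "Tell me about the fall of Rome.")))`.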

Step 5: Package Your Own Model

You can package your own pre-trained GGUF model as a Docker-compatible artifact if you already have a .gguf file — such as one downloaded from Hugging Face or converted using tools like llama.cpp.

Note: This guide assumes you already have a .gguf model file. It does not cover how to train or convert models to GGUF.

docker model package \
  --gguf "$(pwd)/model.gguf" \
  --license "$(pwd)/LICENSE.txt" \
  --push registry.example.com/ai/custom-llm:v1

This is ideal for custom-trained or private models. You can now pull it like any other model:

docker model pull registry.example.com/ai/custom-llm:v1
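In a CI job, the packaging command can be assembled programmatically. The wrapper below is a hypothetical helper that uses only the flags shown in this step (--gguf, --license, --push). It resolves file paths to absolute ones, mirroring the $(pwd) usage above.

```python
import subprocess
from pathlib import Path


def package_command(gguf, reference, license_file=None):
    """Build the argv for docker model package using the flags shown above."""
    args = ["docker", "model", "package", "--gguf", str(Path(gguf).resolve())]
    if license_file is not None:
        args += ["--license", str(Path(license_file).resolve())]
    args += ["--push", reference]
    return args


def run_package(gguf, reference, license_file=None):
    """Run the command; raises CalledProcessError if packaging fails."""
    subprocess.run(package_command(gguf, reference, license_file), check=True)
```

Passing the arguments as a list avoids shell quoting issues when paths contain spaces.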

Step 6: Optimize & Iterate

Use docker model logs to monitor model usage and debug issues
Set up CI/CD to automate pulls, scans, and packaging
Track model lineage and training versions to ensure consistency
Use explicit version tags (:v1, :2025-05, etc.) instead of latest when packaging custom models
Only one model can be loaded at a time; requesting a new model will unload the previous one.

Compose Integration (Optional)

Docker Compose v2.35+ (included in Docker Desktop 4.41+) introduces support for AI model services using a new provider.type: model. You can define models directly in your compose.yml and reference them in app services using depends_on.

During docker compose up, Docker Model Runner automatically pulls the model and starts it on the host system, then injects connection details into dependent services using environment variables such as MY_MODEL_URL and MY_MODEL_MODEL, where MY_MODEL matches the name of the model service.

This enables seamless multi-container AI applications with zero extra glue code.
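On the application side, a dependent service only needs to read the injected variables. The sketch below assumes the variable prefix is the service name upper-cased with hyphens turned into underscores. That mapping is an assumption for illustration, so check the variables your containers actually receive.

```python
import os


def model_env(service_name, environ=None):
    """Read the connection details injected for a Compose model service."""
    environ = os.environ if environ is None else environ
    # Assumption: prefix = service name, upper-cased, "-" replaced by "_".
    prefix = service_name.upper().replace("-", "_")
    return {
        "url": environ[f"{prefix}_URL"],
        "model": environ[f"{prefix}_MODEL"],
    }
```

For a model service named my_model, this reads MY_MODEL_URL and MY_MODEL_MODEL and raises KeyError if either is missing.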

Navigating AI Development Challenges

Latency: Use quantized GGUF models
Security: Never run unknown models; validate sources and attach licenses
Compliance: Mask PII, respect data consent
Costs: Run locally to avoid cloud compute bills

Best Practices

Prefer GGUF models for optimal CPU inference
Use the --license flag when packaging custom models to ensure compliance
Use versioned tags (e.g., :v1, :2025-05) instead of latest
Monitor model logs using docker model logs
Validate model sources before pulling or packaging
Only pull models from trusted sources (e.g., Docker Hub’s ai/ namespace or verified Hugging Face repos).
Review the license and usage terms for each model before packaging or deploying.
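The versioned-tag practice is easy to enforce in a pipeline. Here is a small, hedged sketch of a check that flags references without an explicit tag. The helper names are hypothetical, digest references (with @sha256:) are not handled, and treating an untagged reference as latest follows the usual Docker image convention.

```python
def tag_of(reference):
    """Return the tag of a model reference, or "latest" if none is given."""
    # Look only at the last path segment so a registry port is not mistaken
    # for a tag (e.g. localhost:6000/foobar has no tag).
    last_segment = reference.rsplit("/", 1)[-1]
    _name, sep, tag = last_segment.partition(":")
    return tag if sep else "latest"


def is_pinned(reference):
    """True when the reference carries an explicit tag other than latest."""
    return tag_of(reference) != "latest"
```

A CI step could call `is_pinned` on every reference it is about to push and fail the job when it returns False.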

The Road Ahead

Support for Retrieval-Augmented Generation (RAG)
Expanded multimodal support (text + images, video, audio)
LLMs as services in Docker Compose (Requires Docker Compose v2.35+)
More granular Model Dashboard features in Docker Desktop
Secure packaging and deployment pipelines for private AI models

Docker Model Runner lets DevOps teams treat models like any other artifact — pulled, tagged, versioned, tested, and deployed.

Final Thoughts

You don’t need a GPU cluster or external API to build AI apps. Learn more and explore everything you can do with Docker Model Runner:

Pull prebuilt models from Docker Hub or Hugging Face
Run them locally using the CLI, API, or Docker Desktop’s Model tab
Package and push your own models as OCI artifacts
Integrate with your CI/CD pipelines securely

You’re not just deploying containers — you’re delivering intelligence.

Learn more

Read our quickstart guide to Docker Model Runner.
Find documentation for Model Runner.
Subscribe to the Docker Navigator Newsletter.
New to Docker? Create an account.
Have questions? The Docker community is here to help.
