Vue normale

Il y a de nouveaux articles disponibles, cliquez pour rafraîchir la page.
Aujourd’hui — 25 juin 2025Flux principal

Building RAG Applications with Ollama and Python: Complete 2025 Tutorial

24 juin 2025 à 16:30
Retrieval-Augmented Generation (RAG) has revolutionized how we build intelligent applications that can access and reason over external knowledge bases. In this comprehensive tutorial, we’ll explore how to build production-ready RAG applications using Ollama and Python, leveraging the latest techniques and best practices for 2025. What is RAG and Why Use Ollama? Retrieval-Augmented Generation combines the […]
À partir d’avant-hierFlux principal

Agentic AI in Customer Service: The Complete Technical Implementation Guide for 2025

23 juin 2025 à 06:37
Let’s get one thing straight—if you’re still deploying rule-based chatbots in 2025, you’re essentially bringing a flip phone to a smartphone convention. I’ve been in the trenches with AI implementations for years, and I can tell you that the shift from reactive customer service bots to autonomous agentic AI isn’t just evolutionary—it’s revolutionary. And frankly, […]

10 Agentic AI Tools That Will Replace ChatGPT in 2025

23 juin 2025 à 04:34
Stop settling for AI that just answers questions. The future belongs to AI that actually does the work. If you’re still using ChatGPT like it’s 2023, you’re about to be left behind. While you’ve been asking ChatGPT to write emails, a revolutionary shift is happening in the AI world—and it’s called Agentic AI. Here’s the […]

Ollama vs ChatGPT 2025: Complete Technical Comparison Guide

19 juin 2025 à 03:28
Ollama vs ChatGPT 2025: A Comprehensive Comparison A  comprehensive technical analysis comparing local LLM deployment via Ollama against cloud-based ChatGPT APIs, including performance benchmarks, cost analysis, and implementation strategies The artificial intelligence landscape has reached a critical inflection point in 2025. Organizations worldwide face a fundamental strategic decision that will define their AI capabilities for […]

Best Ollama Models 2025: Performance Comparison Guide

19 juin 2025 à 03:09
Top Picks for Best Ollama Models 2025 A comprehensive technical analysis of the most powerful local language models available through Ollama, including benchmarks, implementation guides, and optimization strategies Introduction to Ollama’s 2025 Ecosystem The landscape of local language model deployment has dramatically evolved in 2025, with Ollama establishing itself as the de facto standard for […]

Why Docker Chose OCI Artifacts for AI Model Packaging

Par : Emily Casey
18 juin 2025 à 13:32

As AI development accelerates, developers need tools that let them move fast without having to reinvent their workflows. Docker Model Runner introduces a new specification for packaging large language models (LLMs) as OCI artifacts — a format developers already know and trust. It brings model sharing into the same workflows used for containers, with support for OCI registries like Docker Hub.

By using OCI artifacts, teams can skip custom toolchains and work with models the same way they do with container images. In this post, we’ll share why we chose OCI artifacts, how the format works, and what it unlocks for GenAI developers.

Why OCI artifacts?

One of Docker’s goals is to make genAI application development accessible to a larger community of developers. We can do this by helping models become first-class citizens within the cloud native ecosystem. 

When models are packaged as OCI artifacts, developers can get started with AI development without the need to learn, vet, and adopt a new distribution toolchain. Instead, developers can discover new models on Hub and distribute variants publicly or privately via existing OCI registries, just like they do with container images today! For teams using Docker Hub, enterprise features like Registry Access Management (RAM) provide policy-based controls and guardrails to help enforce secure, consistent access.

Packaging models as OCI artifacts also paves the way for deeper integration between inference runners like Docker Model Runner and existing tools like containerd and Kubernetes.

Understanding OCI images and artifacts

Many of these advantages apply equally to OCI images and OCI artifacts. To understand why images can be a less optimal fit for LLMs and why a custom artifact specification conveys additional advantages, it helps to first revisit the components of an OCI image and its generic cousin, the OCI artifact.

What are OCI images?

OCI images are a standardized format for container images, defined by the Open Container Initiative (OCI). They package everything needed to run a container: metadata, configuration, and filesystem layers.

An OCI image is composed of three main components:

  • An image manifest – a JSON file containing references to an image configuration and a set of filesystem layers.
  • An image configuration – a JSON file containing the layer ordering and OCI runtime configuration.
  • One or more layers – TAR archives (typically compressed), containing filesystem changesets that, applied in order, produce a container root filesystem.

Below is an example manifest from the busybox image:

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:7b4721e214600044496305a20ca3902677e572127d4d976ed0e54da0137c243a",
    "size": 477
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:189fdd1508372905e80cc3edcdb56cdc4fa216aebef6f332dd3cba6e300238ea",
      "size": 1844697
    }
  ],
  "annotations": {
    "org.opencontainers.image.url": "https://github.com/docker-library/busybox",
    "org.opencontainers.image.version": "1.37.0-glibc"
  }
}

Because the image manifest contains content-addressable references to all image components, the hash of the manifest file, otherwise known as the image digest, can be used to uniquely identify an image.

What are OCI artifacts?

OCI artifacts offer a way to extend the OCI image format to support distributing content beyond container images. They follow the same structure: a manifest, a config file, and one or more layers. 

The artifact guidance in the OCI image specifications describes how this same basic structure (manifest + config + layers) can be used to distribute other types of content.

The artifact type is designated by the config file’s media type. For example, in the manifest below config.mediaType is set to application/vnd.cncf.helm.config.v1+json. This indicates to registries and other tooling that the artifact is a Helm chart and should be parsed accordingly.

{
  "schemaVersion": 2,
  "config": {
    "mediaType": "application/vnd.cncf.helm.config.v1+json",
    "digest": "sha256:8ec7c0f2f6860037c19b54c3cfbab48d9b4b21b485a93d87b64690fdb68c2111",
    "size": 117
  },
  "layers": [
    {
      "mediaType": "application/vnd.cncf.helm.chart.content.v1.tar+gzip",
      "digest": "sha256:1b251d38cfe948dfc0a5745b7af5ca574ecb61e52aed10b19039db39af6e1617",
      "size": 2487
    }
  ]
}

In an OCI artifact, layers may be of any media type and are not restricted to filesystem changesets. Whoever defines the artifact type defines the supported layer types and determines how the contents should be used and interpreted.

Using container images vs. custom artifact types

With this background in mind, while we could have packaged LLMs as container images, defining a custom type has some important advantages:

  1. A custom artifact type allows us to define a domain-specific config schema. Programmatic access to key metadata provides a support structure for an ecosystem of useful tools specifically tailored to AI use-cases.
  2. A custom artifact type allows us to package content in formats other than compressed TAR archives, thus avoiding performance issues that arise when LLMs are packaged as image layers. For more details on how model layers are different and why it matters, see the Layers section below.
  3. A custom type ensures that models are packaged and distributed separately from inference engines. This separation is important because it allows users to consume the variant of the inference engine optimized for their system without requiring every model to be packaged in combination with every engine.
  4. A custom artifact type frees us from the expectations that typically accompany a container image. Standalone models are not executable without an inference engine. Packaging as a custom type makes clear that they are not independently runnable, thus avoiding confusion and unexpected errors.

Docker Model Artifacts

Now that we understand the high-level goals, let’s dig deeper into the details of the format.

Media Types

The model specification defines the following media types:

  • application/vnd.docker.ai.model.config.v0.1+json – identifies a model config JSON file. This value in config.mediaType in a manifest identifies an artifact as a Docker model with config file adhering to v0.1 of the specification.
  • application/vnd.docker.ai.gguf.v3 – indicates that a layer contains a model packaged as a GGUF file.
  • application/vnd.docker.ai.license – indicates that a layer contains a plain text software license file.

Expect more media types to be defined in the future as we add runtime configuration, add support for new features like projectors and LoRA adaptors, and expand the supported packaging formats for model files.

Manifest

A model manifest is formatted like an image manifest and distinguished by the config.MediaType. The following example manifest, taken from the ai/gemma3, references a model config JSON and two layers, one containing a GGUF file and the other containing the model’s license.

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.docker.ai.model.config.v0.1+json",
    "size": 372,
    "digest": "sha256:22273fd2f4e6dbaf5b5dae5c5e1064ca7d0ff8877d308eb0faf0e6569be41539"
  },
  "layers": [
    {
      "mediaType": "application/vnd.docker.ai.gguf.v3",
      "size": 2489757856,
      "digest": "sha256:09b370de51ad3bde8c3aea3559a769a59e7772e813667ddbafc96ab2dc1adaa7"
    },
    {
      "mediaType": "application/vnd.docker.ai.license",
      "size": 8346,
      "digest": "sha256:a4b03d96571f0ad98b1253bb134944e508a4e9b9de328909bdc90e3f960823e5"
    }
  ]
}


Model ID

The manifest digest uniquely identifies the model and is used by Docker Model Runner as the model ID.

Model Config JSON

The model configuration is a JSON file that surfaces important metadata about the model, such as size, parameter count, quantization, as well as metadata about the artifact provenance (like the creation timestamp).
The following example comes from the ai/gemma model on Dockerhub:

{
  "config": {
    "format": "gguf",
    "quantization": "IQ2_XXS/Q4_K_M",
    "parameters": "3.88 B",
    "architecture": "gemma3",
    "size": "2.31 GiB"
  },
  "descriptor": {
    "created": "2025-03-26T09:57:32.086694+01:00"
  },
  "rootfs": {
    "type": "rootfs",
    "diff_ids": [
      "sha256:09b370de51ad3bde8c3aea3559a769a59e7772e813667ddbafc96ab2dc1adaa7",
      "sha256:a4b03d96571f0ad98b1253bb134944e508a4e9b9de328909bdc90e3f960823e5"
    ]
  }
}

By defining a domain-specific configuration schema, we allow tools to access and use model metadata cheaply — by fetching and parsing a small JSON file — only fetching the model itself when needed.

For example, a registry frontend like Docker Hub can directly surface this data to users who can, in turn, use it to compare models or select based on system capabilities and requirements. Tooling might use this data to estimate memory requirements for a given model. It could then assist in the selection process by suggesting the best variant that is compatible with the available resources.

Layers

Layers in a model artifact differ from layers within an OCI image in two important respects.

Unlike an image layer, where compression is recommended, model layers are always uncompressed. Because models are large, high-entropy files, compressing them provides a negligible reduction in size, while (un)compressing is time and compute-intensive.

In contrast to a layer in an OCI image, which contains multiple files in an archive, each “layer” in a model artifact must contain a single raw file. This allows runtimes like Docker Model Runner to reduce disk usage on the client machine by storing a single uncompressed copy of the model. This file can then be directly memory mapped by the inference engine at runtime.

The lack of file names, hierarchy, and metadata (e.g. modification time) ensures that identical model files always result in identical reusable layer blobs. This prevents unnecessary duplication, which is particularly important when working with LLMs, given the file size.

You may have noticed that these “layers” are not really filesystem layers at all. They are files, but they do not specify a filesystem. So, how does this work at runtime? When Docker Model runner runs a model, instead of finding the GGUF file by name in a model filesystem, the desired file is identified by its media type (application/vnd.docker.ai.gguf.v3) and fetched from the model store. For more information on the Model Runner architecture, please see the architecture overview in this accompanying blog post.

Distribution

Like OCI images and other OCI artifacts, Docker model artifacts are distributed via registries like Dockerhub, Artifactory, or Azure Container Registry that comply with the OCI distribution specification.

Discovery

Docker Hub

The Docker Hub Gen AI catalog aids in the discovery of popular models. These models are packaged in the format described here and are compatible with Docker Model Runner and any other runtime that supports the OCI specification.

Hugging Face

If you are accustomed to exploring models on Hugging Face, there’s good news! Hugging Face now supports on-demand conversion to the Docker Model Artifact format when you pull from Hugging Face with docker model pull.

What’s Next?

Hopefully, you now have a better understanding of the Docker OCI Model format and how it supports our goal of making AI app development more accessible to developers via familiar workflows and commands. But this version of the artifact format is just the beginning! In the future, you can expect the enhancements to the packaging format to bring this level of accessibility and flexibility to a broader range of use cases. Future versions will support:

  • Additional runtime configuration options like templates, context size, and default parameters. This will allow users to configure models for specific use cases and distribute that config alongside the model, as a single immutable artifact.
  • LoRA adapters, allowing users to extend existing model artifacts with use-case-specific fine-tuning.
  • Multi-modal projectors, enabling users to package multi-modal such as language-and-vision models using LLaVA-style projectors.
  • Model index files that provide a set of models with different parameter count and quantizations, allowing runtimes the best option for the available resources.

In addition to adding features, we are committed to fostering an open ecosystem. Expect:

  • Deeper integrations into containerd for a more native runtime experience.
  • Efforts to harmonize with ModelPack and other model packaging standards to improve interoperability.

These advancements show our ongoing commitment to making the OCI artifact a versatile and flexible way to package and run AI models, delivering the same ease and reliability developers already expect from Docker.

Learn more

Behind the scenes: How we designed Docker Model Runner and what’s next

Par : Jacob Howard
18 juin 2025 à 13:30

The last few years have made it clear that AI models will continue to be a fundamental component of many applications. The catch is that they’re also a fundamentally different type of component, with complex software and hardware requirements that don’t (yet) fit neatly into the constraints of container-oriented development lifecycles and architectures. To help address this problem, Docker launched the Docker Model Runner with Docker Desktop 4.40. Since then, we’ve been working aggressively to expand Docker Model Runner with additional OS and hardware support, deeper integration with popular Docker tools, and improvements to both performance and usability.
For those interested in Docker Model Runner and its future, we offer a behind-the-scenes look at its design, development, and roadmap.

Note: Docker Model Runner is really two components: the model runner and the model distribution specification. In this article, we’ll be covering the former, but be sure to check out the companion blog post by Emily Casey for the equally important distribution side of the story.

Design goals

Docker Model Runner’s primary design goal was to allow users to run AI models locally and to access them from both containers and host processes. While that’s simple enough to articulate, it still leaves an enormous design space in which to find a solution. Fortunately, we had some additional constraints: we were a small engineering team, and we had some ambitious timelines. Most importantly, we didn’t want to compromise on UX, even if we couldn’t deliver it all at once. In the end, this motivated design decisions that have so far allowed us to deliver a viable solution while leaving plenty of room for future improvement.





Multiple backends

One thing we knew early on was that we weren’t going to write our own inference engine (Docker’s wheelhouse is containerized development, not low-level inference engines). We’re also big proponents of open-source, and there were just so many great existing solutions! There’s llama.cpp, vLLM, MLX, ONNX, and PyTorch, just to name a few.

Of course, being spoiled for choice can also be a curse — which to choose? The obvious answer was: as many as possible, but not all at once.

We decided to go with llama.cpp for our initial implementation, but we intentionally designed our APIs with an additional, optional path component (the {name} in /engines/{name}) to allow users to take advantage of multiple future backends. We also designed interfaces and stubbed out implementations for other backends to enforce good development hygiene and to avoid becoming tethered to one “initial” implementation.

OpenAI API compatibility

The second design choice we had to make was how to expose inference to consumers in containers. While there was also a fair amount of choice in the inference API space, we found that the OpenAI API standard seemed to offer the best initial tooling compatibility. We were also motivated by the fact that several teams inside Docker were already using this API for various real-world products. While we may support additional APIs in the future, we’ve so far found that this API surface is sufficient for most applications. One gap that we know exists is full compatibility with this API surface, which is something we’re working on iteratively.

This decision also drove our choice of llama.cpp as our initial backend. The llama.cpp project already offered a turnkey option for OpenAI API compatibility through its server implementation. While we had to make some small modifications (e.g. Unix domain socket support), this offered us the fastest path to a solution. We’ve also started contributing these small patches upstream, and we hope to expand our contributions to these projects in the future.

First-class citizenship for models in the Docker API

While the OpenAI API standard was the most ubiquitous option amongst existing tooling, we also knew that we wanted models to be first-class citizens in the Docker Engine API. Models have a fundamentally different execution lifecycle than the processes that typically make up the ENTRYPOINTs of containers, and thus, they don’t fit well under the standard /containers endpoints of the Docker Engine API. However, much like containers, images, networks, and volumes, models are such a fundamental component that they really deserve their own API resource type. This motivated the addition of a set of /models endpoints, closely modeled after the /images endpoints, but separate for reasons that are best discussed in the distribution blog post.

GPU acceleration

Another critical design goal was support for GPU acceleration of inference operations. Even the smallest useful models are extremely computationally demanding, while more sophisticated models (such as those with tool-calling capabilities) would be a stretch to fit onto local hardware at all. GPU support was going to be non-negotiable for a useful experience.

Unfortunately, passing GPUs across the VM boundary in Docker Desktop, especially in a way that would be reliable across platforms and offer a usable computation API inside containers, was going to be either impossible or very flaky.

As a compromise, we decided to run inference operations outside of the Docker Desktop VM and simply proxy API calls from the VM to the host. While there are some risks with this approach, we are working on initiatives to mitigate these with containerd-hosted sandboxing on macOS and Windows. Moreover, with Docker-provided models and application-provided prompts, the risk is somewhat lower, especially given that inference consists primarily of numerical operations. We assess the risk in Docker Desktop to be about on par with accessing host-side services via host.docker.internal (something already enabled by default).

However, agents that drive tool usage with model output can cause more significant side effects, and that’s something we needed to address. Fortunately, using the Docker MCP Toolkit, we’re able to perform tool invocation inside ephemeral containers, offering reliable encapsulation of the side effects that models might drive. This hybrid approach allows us to offer the best possible local performance with relative peace of mind when using tools.

Outside the context of Docker Desktop, for example, in Docker CE, we’re in a significantly better position due to the lack of a VM boundary (or at least a very transparent VM boundary in the case of a hypervisor) between the host hardware and containers. When running in standalone mode in Docker CE, the Docker Model Runner will have direct access to host hardware (e.g. via the NVIDIA Container Toolkit) and will run inference operations within a container.

Modularity, iteration, and open-sourcing

As previously mentioned, the Docker Model Runner team is relatively small, which meant that we couldn’t rely on a monolithic architecture if we wanted to effectively parallelize the development work for Docker Model Runner. Moreover, we had an early and overarching directive: open-source as much as possible.

We decided on three high-level components around which we could organize development work: the model runner, the model distribution tooling, and the model CLI plugin.

Breaking up these components allowed us to divide work more effectively, iterate faster, and define clean API boundaries between different concerns. While there have been some tricky dependency hurdles (in particular when integrating with closed-source components), we’ve found that the modular approach has facilitated faster incremental changes and support for new platforms.

The High-Level Architecture

At a high level, the Docker Model Runner architecture is composed of the three components mentioned above (the runner, the distribution code, and the CLI), but there are also some interesting sub-components within each:

DMR_architecture@3x_Resized

Figure 1: Docker Model Runner high-level architecture

How these components are packaged and hosted (and how they interact) also depends on the platform where they’re deployed. In each case it looks slightly different. Sometimes they run on the host, sometimes they run in a VM, sometimes they run in a container, but the overall architecture looks the same.

Model storage and client

The core architectural component is the model store. This component, provided by the model distribution code, is where the actual model tensor files are stored. These files are stored differently (and separately) from images because (1) they’re high-entropy and not particularly compressible and (2) the inference engine needs direct access to the files so that it can do things like mapping them into its virtual address space via mmap(). For more information, see the accompanying model distribution blog post.

The model distribution code also provides the model distribution client. This component performs operations (such as pulling models) using the model distribution protocol against OCI registries.

Model runner

Built on top of the model store is the model runner. The model runner maps inbound inference API requests (e.g. /v1/chat/completions or /v1/embeddings requests) to processes hosting pairs of inference engines and models. It includes scheduler, loader, and runner components that coordinate the loading of models in and out of memory so that concurrent requests can be serviced, even if models can’t be loaded simultaneously (e.g. due to resource constraints). This makes the execution lifecycle of models different from that of containers, with engines and models operating as ephemeral processes (mostly hidden from users) that can be terminated and unloaded from memory as necessary (or when idle). A different backend process is run for each combination of engine (e.g. llama.cpp) and model (e.g. ai/qwen3:8B-Q4_K_M) as required by inference API requests (though multiple requests targeting the same pair will reuse the same runner and backend processes if possible).

The runner also includes an installer service that can dynamically download backend binaries and libraries, allowing users to selectively enable features (such as CUDA support) that might require downloading hundreds of MBs of dependencies.

Finally, the model runner serves as the central server for all Docker Model Runner APIs, including the /models APIs (which it routes to the model distribution code) and the /engines APIs (which it routes to its scheduler). This API server will always opt to hold in-flight requests until the resources (primarily RAM or VRAM) are available to service them, rather than returning something like a 503 response. This is critical for a number of usage patterns, such multiple agents running with different models or concurrent requests for both embedding and completion.

Model CLI

The primary user-facing component of the Docker Model Runner architecture is the model CLI. This component is a standard Docker CLI plugin that offers an interface very similar to the docker image command. While the lifecycle of model execution is different from that of containers, the concepts (such as pushing, pulling, and running) should be familiar enough to existing Docker users.

The model CLI communicates with the model runner’s APIs to perform almost all of its operations (though the transport for that communication varies by platform). The model CLI is context-aware, allowing it to determine if it’s talking to a Docker Desktop model runner, Docker CE model runner, or a model runner on some custom platform. Because we’re using the standard Docker CLI plugin framework, we get all of the standard Docker Context functionality for free, making this detection much easier.

API design and routing

As previously mentioned, the Docker Model Runner comprises two sets of APIs: the Docker-style APIs and the OpenAI-compatible APIs. The Docker-style APIs (modeled after the /image APIs) include the following endpoints:

  • POST /models/create (Model pulling)
  • GET /models (Model listing)
  • GET /models/{namespace}/{name} (Model metadata)
  • DELETE /models/{namespace}/{name} (Model deletion)

The bodies for these requests look very similar to their image analogs. There’s no documentation at the moment, but you can get a glimpse of the format by looking at their corresponding Go types.

In contrast, the OpenAI endpoints follow a different but still RESTful convention:

  • GET /engines/{engine}/v1/models (OpenAI-format model listing)
  • GET /engines/{engine}/v1/models/{namespace}/{name} (OpenAI-format model metadata)
  • POST /engines/{engine}/v1/chat/completions (Chat completions)
  • POST /engines/{engine}/v1/completions (Chat completions (legacy endpoint))
  • POST /engines/{engine}/v1/embeddings (Create embeddings)

At this point in time, only one {engine} value is supported (llama.cpp), and it can also be omitted to use the default (llama.cpp) engine.

We make these APIs available on several different endpoints:


First, in Docker Desktop, they’re available on the Docker socket (/var/run/docker.sock), both inside and outside containers. This is in service of our design goal of having models as a first-class citizen in the Docker Engine API. At the moment, these endpoints are prefixed with a /exp/vDD4.40 path (to avoid dependencies on APIs that may evolve during development), but we’ll likely remove this prefix in the next few releases since these APIs have now mostly stabilized and will evolve in a backward-compatible way.

Second, also in Docker Desktop, we make the APIs available on a special model-runner.docker.internal endpoint that’s accessible just from containers (though not currently from ECI containers, because we want to have inference sandboxing implemented first). This TCP-based endpoint exposes just the /models and /engines API endpoints (not the whole Docker API) and is designed to serve existing tooling (which likely can’t access APIs via a Unix domain socket). No /exp/vDD4.40 prefix is used in this case.

Finally, in both Docker Desktop and Docker CE, we make the /models and /engines API endpoints available on a host TCP endpoint (localhost:12434, by default, again without any /exp/vDD4.40 prefix). In Docker Desktop this is optional and not enabled by default. In Docker CE, it’s a critical component of how the API endpoints are accessed, because we currently lack the integration to add endpoints to Docker CE’s /var/run/docker.sock or to inject a custom model-runner.docker.internal hostname, so we advise using the standard 172.17.0.1 host gateway address to access this localhost-exposed port (e.g. setting your OpenAI API base URL to http://172.17.0.1:12434/engines/v1). Hopefully we’ll be able to unify this across Docker platforms in the near future (see our roadmap below).

First up: Docker Desktop

The natural first step for Docker Model Runner was integration into Docker Desktop. In Docker Desktop, we have more direct control over integration with the Docker Engine, as well as existing processes that we can use to host the model runner components. In this case, the model runner and model distribution components live in the Docker Desktop host backend process (the com.docker.backend process you may have seen running) and we use special middleware and networking magic to route requests on /var/run/docker.sock and model-runner.docker.internal to the model runner’s API server. Since the individual inference backend processes run as subprocesses of com.docker.backend, there’s no risk of a crash in Docker Desktop if, for example, an inference backend is killed by an Out Of Memory (OOM) error.

We started initially with support for macOS on Apple Silicon, because it provided the most uniform platform for developing the model runner functionality, but we implemented most of the functionality along the way to build and test for all Docker Desktop platforms. This made it significantly easier to port to Windows on AMD64 and ARM64 platforms, as well as the GPU variations that we found there.

The one complexity with Windows was the larger size of the supporting library dependencies for the GPU-based backends. It wouldn’t have been feasible (or tolerated) if we added another 500 MB – 1 GB to the Docker Desktop for Windows installer, so we decided to default to a CPU-based backend in Docker Desktop for Windows with optional support for the GPU backend. This was the primary motivating factor for the dynamic installer component of the model runner (in addition to our desire for incremental updates to different backends).

This all sounds like a very well-planned exercise, and we did indeed start with a three-component design and strictly enforced API boundaries, but in truth we started with the model runner service code as a sub-package of the Docker Desktop source code. This made it much easier to iterate quickly, especially as we were exploring the architecture for the different services. Fortunately, by sticking to a relatively strict isolation policy for the code, and enforcing clean dependencies through APIs and interfaces, we were able to easily extract the code (kudos to the excellent git-filter-repo tool) into a separate repository for the purposes of open-sourcing.

Next stop: Docker CE

Aside from Docker’s penchant for open-sourcing, one of the main reasons that we wanted to make the Docker Model Runner source code publicly available was to support integration into Docker CE. Our goal was to package the docker model command in the same way as docker buildx and docker compose.

The trick with Docker CE is that we wanted to ship Docker Model Runner as a “vanilla” Docker CLI plugin (i.e. without any special privileges or API access), which meant that we didn’t have a backend process that could host the model runner service. However, in the Docker CE case, the boundary between host hardware and container processes is much less disruptive, meaning that we could actually run Docker Model Runner in a container and simply make any accelerator hardware available to it directly. So, much like a standalone BuildKit builder container, we run the Docker Model Runner as a standalone container in Docker CE, with a special named volume for model storage (meaning you can uninstall the runner without having to re-pull models). This “installation” is performed by the model CLI automatically (and when necessary) by pulling the docker/model-runner image and starting a container. Explicit configuration for the runner can also be specified using the docker model install-runner command. If you want, you can also remove the model runner (and optionally the model storage) using docker model uninstall-runner.

This unfortunately leads to one small compromise with the UX: we don’t currently support the model runner APIs on /var/run/docker.sock or on the special model-runner.docker.internal URL. Instead, the model runner API server listens on the host system’s loopback interface at localhost:12434 (by default), which is available inside most containers at 172.17.0.1:12434. If desired, users can also make this available on model-runner.docker.internal:12434 by utilizing something like –add-host=model-runner.docker.internal:host-gateway when running docker run or docker create commands. This can also be achieved by using the extra_hosts key in a Compose YAML file. We have plans to make this more ergonomic in future releases.

The road ahead…

The status quo is Docker Model Runner support in Docker Desktop on macOS and Windows and support for Docker CE on Linux (including WSL2), but that’s definitely not the end of the story. Over the next few months, we have a number of initiatives planned that we think will reshape the user experience, performance, and security of Docker Model Runner.

Additional GUI and CLI functionality

The most visible functionality coming out over the next few months will be in the model CLI and the “Models” tab in the Docker Desktop dashboard. Expect to see new commands (such as df, ps, and unload) that will provide more direct support for monitoring and controlling model execution. Also, expect to see new and expanded layouts and functionality in the Models tab.

Expanded OpenAI API support

A less-visible but equally important aspect of the Docker Model Runner user experience is our compatibility with the OpenAI API. There are dozens of endpoints and parameters to support (and we already support many), so we will work to expand API surface compatibility with a focus on practical use cases and prioritization of compatibility with existing tools.

containerd and Moby integration

One of the longer-term initiatives that we’re looking at is integration with containerd. containerd already provides a modular runtime system that allows for task execution coordinated with storage. We believe this is the right way forward and that it will allow us to better codify the relationship between model storage, model execution, and model execution sandboxing.

In combination with the containerd work, we would also like tighter integration with the Moby project. While our existing Docker CE integration offers a viable and performant solution, we believe that better ergonomics could be achieved with more direct integration. In particular, niceties like support for model-runner.docker.internal DNS resolution in Docker CE are on our radar. Perhaps the biggest win from this tighter integration would be to expose Docker Model Runner APIs on the Docker socket and to include the API endpoints (e.g. /models) in the official Docker Engine API documentation.

Kubernetes

One of the product goals for Docker Model Runner was a consistent experience from development inner loop to production, and Kubernetes is inarguably a part of that path. The existing Docker Model Runner images that we’re using for Docker CE will also work within a Kubernetes cluster, and we’re currently developing instructions to set up a Docker Model Runner instance in a Kubernetes cluster. The big difference with Kubernetes is the variety of cluster and application architectures in use, so we’ll likely end up with different “recipes” for how to configure the Docker Model Runner in different scenarios.

vLLM

One of the things we’ve heard from a number of customers is that vLLM forms a core component of their production stack. This was also the first alternate backend that we stubbed out in the model runner repository, and the time has come to start poking at an implementation.

Even more to come…

Finally, there are some bits that we just can’t talk about yet, but they will fundamentally shift the way that developers interact with models. Be sure to tune-in to Docker’s sessions at WeAreDevelopers from July 9–11 for some exciting announcements around AI-related initiatives at Docker.

Learn more

Understanding the n8 app and Its Solutions

Par : Tanvir Kour
18 juin 2025 à 06:44
In today’s digital world, we use dozens of different apps and services every day. Email, Slack, Google Sheets, databases, social media, CRM systems – the list goes on. While each tool serves its purpose, getting them to work together smoothly can be a nightmare. Enter n8n (pronounced “n-eight-n”), a powerful workflow automation platform that connects […]

LM Studio vs Ollama: Picking the Right Tool for Local LLM Use

Par : Tanvir Kour
16 juin 2025 à 11:31
LM Studio prioritizes ease of use with a polished GUI ideal for beginners, while Ollama offers greater flexibility and control through its developer-friendly command-line interface and REST API. Choose LM Studio if you want a plug-and-play experience with visual controls, or Ollama if you prefer command-line power and deeper customization options. The landscape of local […]

How to Build, Run, and Package AI Models Locally with Docker Model Runner

12 juin 2025 à 16:00

Introduction

As a Senior DevOps Engineer and Docker Captain, I’ve helped build AI systems for everything from retail personalization to medical imaging. One truth stands out: AI capabilities are core to modern infrastructure.

This guide will show you how to run and package local AI models with Docker Model Runner — a lightweight, developer-friendly tool for working with AI models pulled from Docker Hub or Hugging Face. You’ll learn how to run models in the CLI or via API, publish your own model artifacts, and do it all without setting up Python environments or web servers.

What is AI in Development?

Artificial Intelligence (AI) refers to systems that mimic human intelligence, including:

  • Making decisions via machine learning
  • Understanding language through NLP
  • Recognizing images with computer vision
  • Learning from new data automatically

Common Types of AI in Development:

  • Machine Learning (ML): Learns from structured and unstructured data
  • Deep Learning: Neural networks for pattern recognition
  • Natural Language Processing (NLP): Understands/generates human language
  • Computer Vision: Recognizes and interprets images

Why Package and Run Your Own AI Model?

Local model packaging and execution offer full control over your AI workflows. Instead of relying on external APIs, you can run models directly on your machine — unlocking:

  • Faster inference with local compute (no latency from API calls)
  • Greater privacy by keeping data and prompts on your own hardware
  • Customization through packaging and versioning your own models
  • Seamless CI/CD integration with tools like Docker and GitHub Actions
  • Offline capabilities for edge use cases or constrained environments

Platforms like Docker and Hugging Face make cutting-edge AI models instantly accessible without building from scratch. Running them locally means lower latency, better privacy, and faster iteration.

Real-World Use Cases for AI

  • Chatbots & Virtual Assistants: Automate support (e.g., ChatGPT, Alexa)
  • Generative AI: Create text, art, music (e.g., Midjourney, Lensa)
  • Dev Tools: Autocomplete and debug code (e.g., GitHub Copilot)
  • Retail Intelligence: Recommend products based on behavior
  • Medical Imaging: Analyze scans for faster diagnosis

How to Package and Run AI Models Locally with Docker Model Runner

Prerequisites:

Step 0 — Enable Docker Model Runner

Open Docker Desktop

Go to Settings → Features in development

Under the Experimental features tab, enable Access experimental features

Click Apply and restart

Quit and reopen Docker Desktop to ensure changes take effect

Reopen Settings → Features in development

Switch to the Beta tab and check Enable Docker Model Runner

(Optional) Enable host-side TCP support to access the API from localhost

Once enabled, you can use the docker model CLI and manage models in the Models tab.

Screenshot of Docker Desktop’s Features in development tab with Docker Model Runner and Dev Environments enabled.

Screenshot of Docker Desktop’s Features in development tab with Docker Model Runner and Dev Environments enabled.

Step 1: Pull a Model

From Docker Hub:

docker model pull ai/smollm2

Or from Hugging Face (GGUF format):

docker model pull hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF

Note: Only GGUF models are supported. GGUF (GPT-style General Use Format) is a lightweight binary file format designed for efficient local inference, especially with CPU-optimized runtimes like llama.cpp. It includes the model weights, tokenizer, and metadata all in one place, making it ideal for packaging and distributing LLMs in containerized environments.

Step 2: Tag and Push to Local Registry (Optional)

If you want to push models to a private or local registry:

Tag model with your registry’s address:

docker model tag hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF localhost:5000/foobar

Run a local Docker registry:

docker run -d -p 6000:5000 --name registry registry:2

Push the model to the local registry:

docker model push localhost:6000/foobar

Check your local models with:

docker model list

Step 3: Run the Model

Run a prompt (one-shot)

docker model run ai/smollm2 "What is Docker?"

Interactive chat mode

docker model run ai/smollm2

Note: Models are loaded into memory on demand and unloaded after 5 minutes of inactivity.

Step 4: Test via OpenAI-Compatible API

To call the model from the host:

  1. Enable TCP host access for Model Runner (via Docker Desktop GUI or CLI)
Screenshot of Docker Desktop’s Features in development tab showing host-side TCP support enabled for Docker Model Runner.

Screenshot of Docker Desktop’s Features in development tab showing host-side TCP support enabled for Docker Model Runner.

docker desktop enable model-runner --tcp 12434
  1. Send a prompt using the OpenAI-compatible chat endpoint:
curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Tell me about the fall of Rome."}
    ]
  }'

Note: No API key required — this runs locally and securely on your machine.

Step 5: Package Your Own Model

You can package your own pre-trained GGUF model as a Docker-compatible artifact if you already have a .gguf file — such as one downloaded from Hugging Face or converted using tools like llama.cpp.

Note: This guide assumes you already have a .gguf model file. It does not cover how to train or convert models to GGUF.

docker model package \
  --gguf "$(pwd)/model.gguf" \
  --license "$(pwd)/LICENSE.txt" \
  --push registry.example.com/ai/custom-llm:v1

This is ideal for custom-trained or private models. You can now pull it like any other model:

docker model pull registry.example.com/ai/custom-llm:v1

Step 6: Optimize & Iterate

  • Use docker model logs to monitor model usage and debug issues
  • Set up CI/CD to automate pulls, scans, and packaging
  • Track model lineage and training versions to ensure consistency
  • Use semantic versioning (:v1, :2025-05, etc.) instead of latest when packaging custom models
  • Only one model can be loaded at a time; requesting a new model will unload the previous one.

Compose Integration (Optional)

Docker Compose v2.35+ (included in Docker Desktop 4.41+) introduces support for AI model services using a new provider.type: model. You can define models directly in your compose.yml and reference them in app services using depends_on.

During docker compose up, Docker Model Runner automatically pulls the model and starts it on the host system, then injects connection details into dependent services using environment variables such as MY_MODEL_URL and MY_MODEL_MODEL, where MY_MODEL matches the name of the model service.

This enables seamless multi-container AI applications — with zero extra glue code. Learn more.

Navigating AI Development Challenges

  • Latency: Use quantized GGUF models
  • Security: Never run unknown models; validate sources and attach licenses
  • Compliance: Mask PII, respect data consent
  • Costs: Run locally to avoid cloud compute bills

Best Practices

  • Prefer GGUF models for optimal CPU inference
  • Use the --license flag when packaging custom models to ensure compliance
  • Use versioned tags (e.g., :v1, :2025-05) instead of latest
  • Monitor model logs using docker model logs
  • Validate model sources before pulling or packaging
  • Only pull models from trusted sources (e.g., Docker Hub’s ai/ namespace or verified Hugging Face repos).
  • Review the license and usage terms for each model before packaging or deploying.

The Road Ahead

  • Support for Retrieval-Augmented Generation (RAG)
  • Expanded multimodal support (text + images, video, audio)
  • LLMs as services in Docker Compose (Requires Docker Compose v2.35+)
  • More granular Model Dashboard features in Docker Desktop
  • Secure packaging and deployment pipelines for private AI models

Docker Model Runner lets DevOps teams treat models like any other artifact — pulled, tagged, versioned, tested, and deployed.

Final Thoughts

You don’t need a GPU cluster or external API to build AI apps. Learn more and explore everything you can do with Docker Model Runner:

  • Pull prebuilt models from Docker Hub or Hugging Face
  • Run them locally using the CLI, API, or Docker Desktop’s Model tab
  • Package and push your own models as OCI artifacts
  • Integrate with your CI/CD pipelines securely

You can also find other helpful information to get started at: 

You’re not just deploying containers — you’re delivering intelligence.

Learn more

Publishing AI models to Docker Hub

Par : Kevin Wittek
11 juin 2025 à 12:16

When we first released Docker Model Runner, it came with built-in support for running AI models published and maintained by Docker on Docker Hub. This made it simple to pull a model like llama3.2 or gemma3 and start using it locally with familiar Docker-style commands.

Model Runner now supports three new commands: tag, push, and package. These enable you to share models with your team, your organization, or the wider community. Whether you’re managing your own fine-tuned models or curating a set of open-source models, Model Runner now lets you publish them to Docker Hub or any other OCI Artifact compatible Container Registry.  For teams using Docker Hub, enterprise features like Registry Access Management (RAM) provide policy-based controls and guardrails to help enforce secure, consistent access.

Tagging and pushing to Docker Hub

Let’s start by republishing an existing model from Docker Hub under your own namespace.

# Step 1: Pull the model from Docker Hub
$ docker model pull ai/smollm2

# Step 2: Tag it for your own organization
$ docker model tag ai/smollm2 myorg/smollm2

# Step 3: Push it to Docker Hub
$ docker model push myorg/smollm2

That’s it! Your model is now available at myorg/smollm2 and ready to be consumed using Model Runner by anyone with access.

Pushing to other container registries

Model Runner supports other container registries beyond Docker Hub, including GitHub Container Registry (GHCR).

# Step 1: Tag for GHCR
$ docker model tag ai/smollm2 ghcr.io/myorg/smollm2

# Step 2: Push to GHCR
$ docker model push ghcr.io/myorg/smollm2

Authentication and permissions work just like they do with regular Docker images in the context of GHCR, so you can leverage your existing workflow for managing registry credentials.

Packaging a custom GGUF file

Want to publish your own model file? You can use the package command to wrap a .gguf file into a Docker-compatible OCI artifact and directly push it into a Container Registry, such as Docker Hub.

# Step 1: Download a model, e.g. from HuggingFace
$ curl -L -o model.gguf https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf

# Step 2: Package and push it
$ docker model package --gguf "$(pwd)/model.gguf" --push myorg/mistral-7b-v0.1:Q4_K_M

You’ve now turned a raw model file in GGUF format into a portable, versioned, and sharable artifact that works seamlessly with docker model run.

Conclusion

We’ve seen how easy it is to publish your own models using Docker Model Runner’s new tag, push, and package commands. These additions bring the familiar Docker developer experience to the world of AI model sharing. Teams and enterprises using Docker Hub can securely manage access and control for their models, just like with container images, making it easier to scale GenAI applications across teams.

Stay tuned for more improvements to Model Runner that will make packaging and running models even more powerful and flexible.

Learn more

What is Agentic AI?

11 juin 2025 à 09:37
So you’ve probably heard the buzz about “Agentic AI” floating around tech circles lately, right? Maybe you’re wondering if it’s just another fancy buzzword or if there’s actually something revolutionary happening here. Well, let me tell you – this isn’t just hype. We’re looking at what might be the biggest shift in how AI works […]

Docker Desktop 4.42: Native IPv6, Built-In MCP, and Better Model Packaging

10 juin 2025 à 16:35

Docker Desktop 4.42 introduces powerful new capabilities that enhance network flexibility, improve security, and deepen AI toolchain integration, all while reducing setup friction. With native IPv6 support, a fully integrated MCP Toolkit, and major upgrades to Docker Model Runner and our AI agent Gordon, this release continues our commitment to helping developers move faster, ship smarter, and build securely across any environment. Whether you’re managing enterprise-grade networks or experimenting with agentic workflows, Docker Desktop 4.42 brings the tools you need right into your development workflows. 

2400x1260_4.42-rectangle-docker-desktop-release

IPv6 support 

Docker Desktop now provides IPv6 networking capabilities with customization options to better support diverse network environments. You can now choose between dual IPv4/IPv6 (default), IPv4-only, or IPv6-only networking modes to align with your organization’s network requirements. The new intelligent DNS resolution behavior automatically detects your host’s network stack and filters unsupported record types, preventing connectivity timeouts in IPv4-only or IPv6-only environments. 

These ipv6 settings are available in Docker Desktop Settings > Resources > Network section and can be enforced across teams using Settings Management, making Docker Desktop more reliable in complex enterprise network configurations including IPv6-only deployments.

Further documentation here.

Screenshot of Docker Desktop IPv6 settings

Figure 1: Docker Desktop IPv6 settings

Docker MCP Toolkit integrated into Docker Desktop

Last month, we launched the Docker MCP Catalog and Toolkit to help developers easily discover MCP servers and securely connect them to their favorite clients and agentic apps. We’re humbled by the incredible support from the community. User growth is up by over 50%, and we’ve crossed 1 million pulls! Now, we’re excited to share that the MCP Toolkit is built right into Docker Desktop, no separate extension required.

You can now access more than 100 MCP servers, including GitHub, MongoDB, Hashicorp, and more, directly from Docker Desktop – just enable the servers you need, configure them, and connect to clients like Claude Desktop, Cursor, Continue.dev, or Docker’s AI agent Gordon.

Unlike typical setups that run MCP servers via npx or uvx processes with broad access to the host system, Docker Desktop runs these servers inside isolated containers with well-defined security boundaries. All container images are cryptographically signed, with proper isolation of secrets and configuration data. 

Screenshot of the MCP Toolkit tab on Docker Desktop, showing a list of downloadable and connected clients.

Figure 2: Docker MCP Toolkit is now integrated natively into Docker Desktop

To meet developers where they are, we’re bringing Docker MCP support to the CLI, using the same command structure you’re already familiar with. With the new docker mcp commands, you can launch, configure, and manage MCP servers directly from the terminal. The CLI plugin offers comprehensive functionality, including catalog management, client connection setup, and secret management.

Screenshot of the available Docker MCP CLI commands, including catalog, client, config, and more.

Figure 3:  Docker MCP CLI commands.

Docker AI Agent Gordon Now Supports MCP Toolkit Integration

In this release, we’ve upgraded Gordon, Docker’s AI agent, with direct integration to the MCP Toolkit in Docker Desktop. To enable it, open Gordon, click the “Tools” button, and toggle on the “MCP” Toolkit option. Once activated, the MCP Toolkit tab will display tools available from any MCP servers you’ve configured.

Screenshot of Gordon working with MCP Toolkit

Figure 4: Docker’s AI Agent Gordon now integrates with Docker’s MCP Toolkit, bringing 100+ MCP servers

This integration gives you immediate access to 100+ MCP servers with no extra setup, letting you experiment with AI capabilities directly in your Docker workflow. Gordon now acts as a bridge between Docker’s native tooling and the broader AI ecosystem, letting you leverage specialized tools for everything from screenshot capture to data analysis and API interactions – all from a consistent, unified interface.

Screenshot of Gordon calling Github

Figure 5: Docker’s AI Agent Gordon uses the GitHub MCP server to pull issues and suggest solutions.

Finally, we’ve also improved the Dockerize feature with expanded support for Java, Kotlin, Gradle, and Maven projects. These improvements make it easier to containerize a wider range of applications with minimal configuration. With expanded containerization capabilities and integrated access to the MCP Toolkit, Gordon is more powerful than ever. It streamlines container workflows, reduces repetitive tasks, and gives you access to specialized tools, so you can stay focused on building, shipping, and running your applications efficiently.

Docker Model Runner adds Qualcomm support, Docker Engine Integration, and UX Upgrades

Staying true to our philosophy of giving developers more flexibility and meeting them where they are, the latest version of Docker Model Runner adds broader OS support, deeper integration with popular Docker tools, and improvements in both performance and usability.

In addition to supporting Apple Silicon and Windows systems with NVIDIA GPUs, Docker Model Runner now works on Windows devices with Qualcomm chipsets. Under the hood, we’ve upgraded our inference engine to use the latest version of llama.cpp, bringing significantly enhanced tool calling capabilities to your AI applications.Docker Model Runner can now be installed directly in Docker Engine Community Edition across multiple Linux distributions supported by Docker Engine. This integration is particularly valuable for developers looking to incorporate AI capabilities into their CI/CD pipelines and automated testing workflows. To get started, check out our documentation for the setup guide.

Get Up and Running with Models Faster

The Docker Model Runner user experience has been upgraded with expanded GUI functionality in Docker Desktop. All of these UI enhancements are designed to help you get started with Model Runner quickly and build applications faster. A dedicated interface now includes three new tabs that simplify model discovery, management, and streamline troubleshooting workflows. Additionally, Docker Desktop’s updated GUI introduces a more intuitive onboarding experience with streamlined “two-click” actions.

After clicking on the Model tab, you’ll see three new sub-tabs. The first, labeled “Local,” displays a set of models in various sizes that you can quickly pull. Once a model is pulled, you can launch a chat interface to test and experiment with it immediately.

Screenshot of the Models menu within Docker Desktop, along with suggested models.

Figure 6: Access a set of models of various sizes to get quickly started in Models menu of Docker Desktop

The second tab ”Docker Hub” offers a comprehensive view for browsing and pulling models from Docker Hub’s AI Catalog, making it easy to get started directly within Docker Desktop, without switching contexts.

Screenshot of the Docker Hub tab within the Docker Desktop Models menu.

Figure 7: A shortcut to the Model catalog from Docker Hub in Models menu of Docker Desktop

The third tab “Logs” offers real-time access to the inference engine’s log tail, giving developers immediate visibility into model execution status and debugging information directly within the Docker Desktop interface.

model debug

Figure 8: Gain visibility into model execution status and debugging information in Docker Desktop

Model Packaging Made Simple via CLI

As part of the Docker Model CLI, the most significant enhancement is the introduction of the docker model package command. This new command enables developers to package their models from GGUF format into OCI-compliant artifacts, fundamentally transforming how AI models are distributed and shared. It enables seamless publishing to both public and private and OCI-compatible repositories such as Docker Hub and establishes a standardized, secure workflow for model distribution, using the same trusted Docker tools developers already rely on. See our docs for more details. 

Conclusion 

From intelligent networking enhancements to seamless AI integrations, Docker Desktop 4.42 makes it easier than ever to build with confidence. With native support for IPv6, in-app access to 100+ MCP servers, and expanded platform compatibility for Docker Model Runner, this release is all about meeting developers where they are and equipping them with the tools to take their work further. Update to the latest version today and unlock everything Docker Desktop 4.42 has to offer.

Learn more

What is the Difference Between Generative AI and Agentic AI? A Complete Guide

4 juin 2025 à 18:00
As artificial intelligence continues to transform industries and reshape how we work, two key terms have emerged that often confuse both technical professionals and business leaders: generative AI and agentic AI. While these technologies may seem similar on the surface, they serve fundamentally different purposes and operate in distinct ways. Understanding the difference between generative […]

What is Agentic AI? A Deep Dive into MCP and the Modern Agent Ecosystem

4 juin 2025 à 03:39
The artificial intelligence landscape is undergoing a fundamental transformation. While traditional AI systems excel at responding to prompts and generating content, a new paradigm is emerging: Agentic AI. These systems don’t just respond—they reason, plan, and act autonomously to achieve complex objectives. At the heart of this revolution lies groundbreaking infrastructure like the Model Context […]

How to Make an AI Chatbot from Scratch using Docker Model Runner

Par : Harsh Manvar
3 juin 2025 à 18:40

Today, we’ll show you how to build a fully functional Generative AI chatbot using Docker Model Runner and powerful observability tools, including Prometheus, Grafana, and Jaeger. We’ll walk you through the common challenges developers face when building AI-powered applications, demonstrate how Docker Model Runner solves these pain points, and then guide you step-by-step through building a production-ready chatbot with comprehensive monitoring and metrics.

By the end of this guide, you’ll know how to make an AI chatbot and run it locally. You’ll also learn how to set up real-time monitoring insights, streaming responses, and a modern React interface — all orchestrated through familiar Docker workflows.

The current challenges with GenAI development

Generative AI (GenAI) is revolutionizing software development, but creating AI-powered applications comes with significant challenges. First, the current AI landscape is fragmented — developers must piece together various libraries, frameworks, and platforms that weren’t designed to work together. Second, running large language models efficiently requires specialized hardware configurations that vary across platforms, while AI model execution remains disconnected from standard container workflows. This forces teams to maintain separate environments for their application code and AI models.

Third, without standardized methods for storing, versioning, and serving models, development teams struggle with inconsistent deployment practices. Meanwhile, relying on cloud-based AI services creates financial strain through unpredictable costs that scale with usage. Additionally, sending data to external AI services introduces privacy and security risks, especially for applications handling sensitive information.

These challenges combine to create a frustrating developer experience that hinders experimentation and slows innovation precisely when businesses need to accelerate their AI adoption. Docker Model Runner addresses these pain points by providing a streamlined solution for running AI models locally, right within your existing Docker workflow.

How Docker is solving these challenges

Docker Model Runner offers a revolutionary approach to GenAI development by integrating AI model execution directly into familiar container workflows. 

dmr-genai-comparison

Figure 1: Comparison diagram showing complex multi-step traditional GenAI setup versus simplified Docker Model Runner single-command workflow

Many developers successfully use containerized AI models, benefiting from integrated workflows, cost control, and data privacy. Docker Model Runner builds on these strengths by making it even easier and more efficient to work with models. By running models natively on your host machine while maintaining the familiar Docker interface, Model Runner delivers.

  • Simplified Model Execution: Run AI models locally with a simple Docker CLI command, no complex setup required.
  • Hardware Acceleration: Direct access to GPU resources without containerization overhead
  • Integrated Workflow: Seamless integration with existing Docker tools and container development practices
  • Standardized Packaging: Models are distributed as OCI artifacts through the same registries you already use
  • Cost Control: Eliminate unpredictable API costs by running models locally
  • Data Privacy: Keep sensitive data within your infrastructure with no external API calls

This approach fundamentally changes how developers can build and test AI-powered applications, making local development faster, more secure, and dramatically more efficient.

How to create an AI chatbot with Docker

In this guide, we’ll build a comprehensive GenAI application that showcases how to create a fully-featured chat interface powered by Docker Model Runner, complete with advanced observability tools to monitor and optimize your AI models.

Project overview

The project is a complete Generative AI interface that demonstrates how to:

  1. Create a responsive React/TypeScript chat UI with streaming responses
  2. Build a Go backend server that integrates with Docker Model Runner
  3. Implement comprehensive observability with metrics, logging, and tracing
  4. Monitor AI model performance with real-time metrics

Architecture

The application consists of these main components:

  1. The frontend sends chat messages to the backend API
  2. The backend formats the messages and sends them to the Model Runner
  3. The LLM processes the input and generates a response
  4. The backend streams the tokens back to the frontend as they’re generated
  5. The frontend displays the incoming tokens in real-time
  6. Observability components collect metrics, logs, and traces throughout the process

arch

Figure 2: Architecture diagram showing data flow between frontend, backend, Model Runner, and observability tools like Prometheus, Grafana, and Jaeger.

Project structure

The project has the following structure:

tree -L 2
.
├── Dockerfile
├── README-model-runner.md
├── README.md
├── backend.env
├── compose.yaml
├── frontend
..
├── go.mod
├── go.sum
├── grafana
│   └── provisioning
├── main.go
├── main_branch_update.md
├── observability
│   └── README.md
├── pkg
│   ├── health
│   ├── logger
│   ├── metrics
│   ├── middleware
│   └── tracing
├── prometheus
│   └── prometheus.yml
├── refs
│   └── heads
..


21 directories, 33 files

We’ll examine the key files and understand how they work together throughout this guide.

Prerequisites

Before we begin, make sure you have:

  • Docker Desktop (version 4.40 or newer) 
  • Docker Model Runner enabled
  • At least 16GB of RAM for running AI models efficiently
  • Familiarity with Go (for backend development)
  • Familiarity with React and TypeScript (for frontend development)

Getting started

To run the application:

  1. Clone the repository: 
git clone 
https://github.com/dockersamples/genai-model-runner-metrics

cd genai-model-runner-metrics


  1. Enable Docker Model Runner in Docker Desktop:
  • Go to Settings > Features in Development > Beta tab
  • Enable “Docker Model Runner”
  • Select “Apply and restart”
enable-dmr

Figure 3: Screenshot of Docker Desktop Beta Features settings panel with Docker AI, Docker Model Runner, and TCP support enabled.

  1. Download the model

For this demo, we’ll use Llama 3.2, but you can substitute any model of your choice:

docker model pull ai/llama3.2:1B-Q8_0


Just like viewing containers, you can manage your downloaded AI models directly in Docker Dashboard under the Models section. Here you can see model details, storage usage, and manage your local AI model library.

model-ui

Figure 4: View of Docker Dashboard showing locally downloaded AI models with details like size, parameters, and quantization.

  1. Start the application: 
docker compose up -d --build
list-of-containers

Figure 5: List of active running containers in Docker Dashboard, including Jaeger, Prometheus, backend, frontend, and genai-model-runner-metrics.

  1. Open your browser and navigate to the frontend URL at http://localhost:3000 . You’ll be greeted with a modern chat interface (see screenshot) featuring: 
  • Clean, responsive design with dark/light mode toggle
  • Message input area ready for your first prompt
  • Model information displayed in the footer
expand

Figure 6: GenAI chatbot interface showing live metrics panel with input/output tokens, response time, and error rate.

  1. Click on Expand to view the metrics like:
  • Input tokens
  • Output tokens
  • Total Requests
  • Average Response Time
  • Error Rate
metrics

Figure 7: Expanded metrics view with input and output tokens, detailed chat prompt, and response generated by Llama 3.2 model.

Grafana allows you to visualize metrics through customizable dashboards. Click on View Detailed Dashboard to open up Grafana dashboard.

chatbot

Figure 8: Chat interface showing metrics dashboard with prompt and response plus option to view detailed metrics in Grafana.

Log in with the default credentials (enter “admin” as user and password) to explore pre-configured AI performance dashboards (see screenshot below) showing real-time metrics like tokens per second, memory usage, and model performance. 

Select Add your first data source. Choose Prometheus as a data source. Enter “http://prometheus:9090” as Prometheus Server URL. Scroll down to the end of the site and click “Save and test”. By now, you should see “Successfully queried the Prometheus API” as an acknowledgement. Select Dashboard and click Re-import for all these dashboards.

By now, you should have a Prometheus 2.0 Stats dashboard up and running.

grafana

Figure 9: Grafana dashboard with multiple graph panels monitoring GenAI chatbot performance, displaying time-series charts for memory consumption, processing speeds, and application health

Prometheus allows you to collect and store time-series metrics data. Open the Prometheus query interface http://localhost:9091 and start typing “genai” in the query box to explore all available AI metrics (as shown in the screenshot below). You’ll see dozens of automatically collected metrics, including tokens per second, latency measurements, and llama.cpp-specific performance data. 

prometheus

Figure 10: Prometheus web interface showing dropdown of available GenAI metrics including genai_app_active_requests and genai_app_token_latency

Jaeger provides a visual exploration of request flows and performance bottlenecks. You can access it via http://localhost:16686

Implementation details

Let’s explore how the key components of the project work:

  1. Frontend implementation

The React frontend provides a clean, responsive chat interface built with TypeScript and modern React patterns. The core App.tsx component manages two essential pieces of state: dark mode preferences for user experience and model metadata fetched from the backend’s health endpoint. 

When the component mounts, the useEffect hook automatically retrieves information about the currently running AI model. It displays details like the model name directly in the footer to give users transparency about which LLM is powering their conversations.

// Essential App.tsx structure
function App() {
  const [darkMode, setDarkMode] = useState(false);
  const [modelInfo, setModelInfo] = useState<ModelMetadata | null>(null);

  // Fetch model info from backend
  useEffect(() => {
    fetch('http://localhost:8080/health')
      .then(res => res.json())
      .then(data => setModelInfo(data.model_info));
  }, []);

  return (
    <div className="min-h-screen bg-white dark:bg-gray-900">
      <Header toggleDarkMode={() => setDarkMode(!darkMode)} />
      <ChatBox />
      <footer>
        Powered by Docker Model Runner running {modelInfo?.model}
      </footer>
    </div>
  );
}

The main App component orchestrates the overall layout while delegating specific functionality to specialized components like Header for navigation controls and ChatBox for the actual conversation interface. This separation of concerns makes the codebase maintainable while the automatic model info fetching demonstrates how the frontend seamlessly integrates with the Docker Model Runner through the Go backend’s API, creating a unified user experience that abstracts away the complexity of local AI model execution.

  1. Backend implementation: Integration with Model Runner

The core of this application is a Go backend that communicates with Docker Model Runner. Let’s examine the key parts of our main.go file:

client := openai.NewClient(
    option.WithBaseURL(baseURL),
    option.WithAPIKey(apiKey),
)

This demonstrates how we leverage Docker Model Runner’s OpenAI-compatible API. The Model Runner exposes endpoints that match OpenAI’s API structure, allowing us to use standard clients. Depending on your connection method, baseURL is set to either:

  • http://model-runner.docker.internal/engines/llama.cpp/v1/ (for Docker socket)
  • http://host.docker.internal:12434/engines/llama.cpp/v1/ (for TCP)

How metrics flow from host to containers

One key architectural detail worth understanding: llama.cpp runs natively on your host (via Docker Model Runner), while Prometheus and Grafana run in containers. Here’s how they communicate:

The Backend as Metrics Bridge:

  • Connects to llama.cpp via Model Runner API (http://localhost:12434)
  • Collects performance data from each API call (response times, token counts)
  • Calculates metrics like tokens per second and memory usage
  • Exposes all metrics in Prometheus format at http://backend:9090/metrics
  • Enables containerized Prometheus to scrape metrics without host access

This hybrid architecture gives you the performance benefits of native model execution with the convenience of containerized observability.

LLama.cpp metrics integration

The project provides detailed real-time metrics specifically for llama.cpp models:

Metric

Description

Implementation in Code

Tokens per Second

Measure of model generation speed

LlamaCppTokensPerSecond in metrics.go

Context Window Size

Maximum context length in tokens

LlamaCppContextSize in metrics.go

Prompt Evaluation Time

Time spent processing input prompt

LlamaCppPromptEvalTime in metrics.go

Memory per Token

Memory efficiency measurement

LlamaCppMemoryPerToken in metrics.go

Thread Utilization

Number of CPU threads used

LlamaCppThreadsUsed in metrics.go

Batch Size

Token processing batch size

LlamaCppBatchSize in metrics.go

One of the most powerful features is our detailed metrics collection for llama.cpp models. These metrics help optimize model performance and identify bottlenecks in your inference pipeline.

// LlamaCpp metrics
llamacppContextSize = promautoFactory.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "genai_app_llamacpp_context_size",
        Help: "Context window size in tokens for llama.cpp models",
    },
    []string{"model"},
)

llamacppTokensPerSecond = promautoFactory.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "genai_app_llamacpp_tokens_per_second",
        Help: "Tokens generated per second",
    },
    []string{"model"},
)

// More metrics definitions...


These metrics are collected, processed, and exposed both for Prometheus scraping and for real-time display in the front end. This gives us unprecedented visibility into how the llama.cpp inference engine is performing.

Chat implementation with streaming

The chat endpoint implements streaming for real-time token generation:


// Set up streaming with a proper SSE format
w.Header().Set("Content-Type", "text/event-stream")
w.Header().Set("Cache-Control", "no-cache")
w.Header().Set("Connection", "keep-alive")

// Stream each chunk as it arrives
if len(chunk.Choices) > 0 && chunk.Choices[0].Delta.Content != "" {
    outputTokens++
    _, err := fmt.Fprintf(w, "%s", chunk.Choices[0].Delta.Content)
    if err != nil {
        log.Printf("Error writing to stream: %v", err)
        return
    }
    w.(http.Flusher).Flush()
}

This streaming implementation ensures that tokens appear in real-time in the user interface, providing a smooth and responsive chat experience. You can also measure key performance metrics like time to first token and tokens per second.

Performance measurement

You can measure various performance aspects of the model:

// Record first token time
if firstTokenTime.IsZero() && len(chunk.Choices) > 0 && 
chunk.Choices[0].Delta.Content != "" {
    firstTokenTime = time.Now()
    
    // For llama.cpp, record prompt evaluation time
    if strings.Contains(strings.ToLower(model), "llama") || 
       strings.Contains(apiBaseURL, "llama.cpp") {
        promptEvalTime := firstTokenTime.Sub(promptEvalStartTime)
        llamacppPromptEvalTime.WithLabelValues(model).Observe(promptEvalTime.Seconds())
    }
}

// Calculate tokens per second for llama.cpp metrics
if strings.Contains(strings.ToLower(model), "llama") || 
   strings.Contains(apiBaseURL, "llama.cpp") {
    totalTime := time.Since(firstTokenTime).Seconds()
    if totalTime > 0 && outputTokens > 0 {
        tokensPerSecond := float64(outputTokens) / totalTime
        llamacppTokensPerSecond.WithLabelValues(model).Set(tokensPerSecond)
    }
}

These measurements help us understand the model’s performance characteristics and optimize the user experience.

Metrics collection

The metrics.go file is a core component of our observability stack for the Docker Model Runner-based chatbot. This file defines a comprehensive set of Prometheus metrics that allow us to monitor both the application performance and the underlying llama.cpp model behavior.

Core metrics architecture

The file establishes a collection of Prometheus metric types:

  • Counters: For tracking cumulative values (like request counts, token counts)
  • Gauges: For tracking values that can increase and decrease (like active requests)
  • Histograms: For measuring distributions of values (like latencies)

Each metric is created using the promauto factory, which automatically registers metrics with Prometheus.

Categories of metrics

The metrics can be divided into three main categories:

1. HTTP and application metrics

// RequestCounter counts total HTTP requests
RequestCounter = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Name: "genai_app_http_requests_total",
        Help: "Total number of HTTP requests",
    },
    []string{"method", "endpoint", "status"},
)

// RequestDuration measures HTTP request durations
RequestDuration = promauto.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "genai_app_http_request_duration_seconds",
        Help:    "HTTP request duration in seconds",
        Buckets: prometheus.DefBuckets,
    },
    []string{"method", "endpoint"},
)

These metrics monitor the HTTP server performance, tracking request counts, durations, and error rates. The metrics are labelled with dimensions like method, endpoint, and status to enable detailed analysis.

2. Model performance metrics

// ChatTokensCounter counts tokens in chat requests and responses
ChatTokensCounter = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Name: "genai_app_chat_tokens_total",
        Help: "Total number of tokens processed in chat",
    },
    []string{"direction", "model"},
)

// ModelLatency measures model response time
ModelLatency = promauto.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "genai_app_model_latency_seconds",
        Help:    "Model response time in seconds",
        Buckets: []float64{0.1, 0.5, 1, 2, 5, 10, 20, 30, 60},
    },
    []string{"model", "operation"},
)

These metrics track the LLM usage patterns and performance, including token counts (both input and output) and overall latency. The FirstTokenLatency metric is particularly important as it measures the time to get the first token from the model, which is a critical user experience factor.

3. llama.cpp specific metrics

// LlamaCppContextSize measures the context window size
LlamaCppContextSize = promauto.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "genai_app_llamacpp_context_size",
        Help: "Context window size in tokens for llama.cpp models",
    },
    []string{"model"},
)

// LlamaCppTokensPerSecond measures generation speed
LlamaCppTokensPerSecond = promauto.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "genai_app_llamacpp_tokens_per_second",
        Help: "Tokens generated per second",
    },
    []string{"model"},
)

These metrics capture detailed performance characteristics specific to the llama.cpp inference engine used by Docker Model Runner. They include:

1. Context Size: 

It represents the token window size used by the model, typically ranging from 2048 to 8192 tokens. The optimization goal is balancing memory usage against conversation quality.  When memory usage becomes problematic, reduce context size to 2048 tokens for faster processing

2. Prompt Evaluation Time

It measures the time spent processing input before generating tokens, essentially your time-to-first-token latency with a target of under 2 seconds. The optimization focus is minimizing user wait time for the initial response. If evaluation time exceeds 3 seconds, reduce context size or implement prompt compression techniques.

3. Tokens Per Second

It measures the time spent processing input before generating tokens, essentially your time-to-first-token latency with a target of under 2 seconds. The optimization focus is minimizing user wait time for the initial response. If evaluation time exceeds 3 seconds, reduce context size or implement prompt compression techniques. 

4. Tokens Per Second

It indicates generation speed, with a target of 8+ TPS for good user experience. This metric requires balancing response speed with model quality. When TPS drops below 5, switch to more aggressive quantization (Q4 instead of Q8) or use a smaller model variant. 

5. Memory Per Token

It tracks RAM consumption per generated token, with optimization aimed at preventing out-of-memory crashes and optimizing resource usage. When memory consumption exceeds 100MB per token, implement aggressive conversation pruning to reduce memory pressure. If memory usage grows over time during extended conversations, add automatic conversation resets after a set number of exchanges. 

6. Threads Used

It monitors the number of CPU cores actively processing model operations, with the goal of maximizing throughput without overwhelming the system. If thread utilization falls below 50% of available cores, increase the thread count for better performance. 

7. Batch Size

It controls how many tokens are processed simultaneously, requiring optimization based on your specific use case balancing latency versus throughput. For real-time chat applications, use smaller batches of 32-64 tokens to minimize latency and provide faster response times.

In nutshell, these metrics are crucial for understanding and optimizing llama.cpp performance characteristics, which directly affect the user experience of the chatbot.

Docker Compose: LLM as a first-class service

With Docker Model Runner integration, Compose makes AI model deployment as simple as any other service. One docker-compose.yml file defines your entire AI application:

  • Your AI models (via Docker Model Runner)
  • Application backend and frontend
  • Observability stack (Prometheus, Grafana, Jaeger)
  • All networking and dependencies

The most innovative aspect is the llm service using Docker’s model provider, which simplifies model deployment by directly integrating with Docker Model Runner without requiring complex configuration. This composition creates a complete, scalable AI application stack with comprehensive observability.

  llm:
    provider:
      type: model
      options:
        model: ${LLM_MODEL_NAME:-ai/llama3.2:1B-Q8_0}


​​This configuration tells Docker Compose to treat an AI model as a standard service in your application stack, just like a database or web server. 

  • The provider syntax is Docker’s new way of handling AI models natively. Instead of building containers or pulling images, Docker automatically manages the entire model-serving infrastructure for you. 
  • The model: ${LLM_MODEL_NAME:-ai/llama3.2:1B-Q8_0} line uses an environment variable with a fallback, meaning it will use whatever model you specify in LLM_MODEL_NAME, or default to Llama 3.2 1B if nothing is set.

Docker Compose: One command to run your entire stack

Why is this revolutionary? Before this, deploying an LLM required dozens of lines of complex configuration – custom Dockerfiles, GPU device mappings, volume mounts for model files, health checks, and intricate startup commands.

Now, those four lines replace all of that complexity. Docker handles downloading the model, configuring the inference engine, setting up GPU access, and exposing the API endpoints automatically. Your other services can connect to the LLM using simple service names, making AI models as easy to use as any other infrastructure component. This transforms AI from a specialized deployment challenge into standard infrastructure-as-code.

Here’s the full compose.yml file that orchestrates the entire application:

 services:
  backend:
    env_file: 'backend.env'
    build:
      context: .
      target: backend
    ports:
      - '8080:8080'
      - '9090:9090'  # Metrics port
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock  # Add Docker socket access
    healthcheck:
      test: ['CMD', 'wget', '-qO-', 'http://localhost:8080/health']
      interval: 3s
      timeout: 3s
      retries: 3
    networks:
      - app-network
    depends_on:
      - llm

  frontend:
    build:
      context: ./frontend
    ports:
      - '3000:3000'
    depends_on:
      backend:
        condition: service_healthy
    networks:
      - app-network

  prometheus:
    image: prom/prometheus:v2.45.0
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    ports:
      - '9091:9090'
    networks:
      - app-network

  grafana:
    image: grafana/grafana:10.1.0
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_DOMAIN=localhost
    ports:
      - '3001:3000'
    depends_on:
      - prometheus
    networks:
      - app-network

  jaeger:
    image: jaegertracing/all-in-one:1.46
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411
    ports:
      - '16686:16686'  # UI
      - '4317:4317'    # OTLP gRPC
      - '4318:4318'    # OTLP HTTP
    networks:
      - app-network

  # New LLM service using Docker Compose's model provider
  llm:
    provider:
      type: model
      options:
        model: ${LLM_MODEL_NAME:-ai/llama3.2:1B-Q8_0}

volumes:
  grafana-data:

networks:
  app-network:
    driver: bridge

This compose.yml defines a complete microservices architecture for the application with integrated observability tools and Model Runner support:

backend

  • Go-based API server with Docker socket access for container management
  • Implements health checks and exposes both API (8080) and metrics (9090) ports

frontend

  • React-based user interface for an interactive chat experience
  • Waits for backend health before starting to ensure system reliability

prometheus

  • Time-series metrics database for collecting and storing performance data
  • Configured with custom settings for monitoring application behavior

grafana

  • Data visualization platform for metrics with persistent dashboard storage
  • Pre-configured with admin access and connected to the Prometheus data source

jaeger

  • Distributed tracing system for visualizing request flows across services
  • Supports multiple protocols (gRPC/HTTP) with UI on port 16686

How Docker Model Runner integration works

The project integrates with Docker Model Runner through the following mechanisms:

  1. Connection Configuration:
    • Using internal DNS: http://model-runner.docker.internal/engines/llama.cpp/v1/
    • Using TCP via host-side support: localhost:12434
  2. Docker’s Host Networking:
    • The extra_hosts configuration maps host.docker.internal to the host’s gateway IP
  3. Environment Variables:
    • BASE_URL: URL for the model runner
    • MODEL: Model identifier (e.g., ai/llama3.2:1B-Q8_0)
  4. API Communication:
    • The backend formats messages and sends them to Docker Model Runner
    • It then streams tokens back to the frontend in real-time

Why this approach excels

Building GenAI applications with Docker Model Runner and comprehensive observability offers several advantages:

  • Privacy and Security: All data stays on your local infrastructure
  • Cost Control: No per-token or per-request API charges
  • Performance Insights: Deep visibility into model behavior and efficiency
  • Developer Experience: Familiar Docker-based workflow with powerful monitoring
  • Flexibility: Easy to experiment with different models and configurations

Conclusion

The genai-model-runner-metrics project demonstrates a powerful approach to building AI-powered applications with Docker Model Runner while maintaining visibility into performance characteristics. By combining local model execution with comprehensive metrics, you get the best of both worlds: the privacy and cost benefits of local execution with the observability needed for production applications.

Whether you’re building a customer support bot, a content generation tool, or a specialized AI assistant, this architecture provides the foundation for reliable, observable, and efficient AI applications. The metrics-driven approach ensures you can continuously monitor and optimize your application, leading to better user experiences and more efficient resource utilization.

Ready to get started? Clone the repository, fire up Docker Desktop, and experience the future of AI development — your own local, metrics-driven GenAI application is just a docker compose up away!

Learn more

❌
❌