
How to Make an AI Chatbot from Scratch using Docker Model Runner

By: Harsh Manvar
June 3, 2025 at 18:40

Today, we’ll show you how to build a fully functional Generative AI chatbot using Docker Model Runner and powerful observability tools, including Prometheus, Grafana, and Jaeger. We’ll walk you through the common challenges developers face when building AI-powered applications, demonstrate how Docker Model Runner solves these pain points, and then guide you step-by-step through building a production-ready chatbot with comprehensive monitoring and metrics.

By the end of this guide, you’ll know how to make an AI chatbot and run it locally. You’ll also learn how to set up real-time monitoring insights, streaming responses, and a modern React interface — all orchestrated through familiar Docker workflows.

The current challenges with GenAI development

Generative AI (GenAI) is revolutionizing software development, but creating AI-powered applications comes with significant challenges. First, the current AI landscape is fragmented — developers must piece together various libraries, frameworks, and platforms that weren’t designed to work together. Second, running large language models efficiently requires specialized hardware configurations that vary across platforms, while AI model execution remains disconnected from standard container workflows. This forces teams to maintain separate environments for their application code and AI models.

Third, without standardized methods for storing, versioning, and serving models, development teams struggle with inconsistent deployment practices. Meanwhile, relying on cloud-based AI services creates financial strain through unpredictable costs that scale with usage. Additionally, sending data to external AI services introduces privacy and security risks, especially for applications handling sensitive information.

These challenges combine to create a frustrating developer experience that hinders experimentation and slows innovation precisely when businesses need to accelerate their AI adoption. Docker Model Runner addresses these pain points by providing a streamlined solution for running AI models locally, right within your existing Docker workflow.

How Docker is solving these challenges

Docker Model Runner offers a revolutionary approach to GenAI development by integrating AI model execution directly into familiar container workflows. 


Figure 1: Comparison diagram showing complex multi-step traditional GenAI setup versus simplified Docker Model Runner single-command workflow

Many developers successfully use containerized AI models, benefiting from integrated workflows, cost control, and data privacy. Docker Model Runner builds on these strengths by making it even easier and more efficient to work with models. By running models natively on your host machine while maintaining the familiar Docker interface, Model Runner delivers:

  • Simplified Model Execution: Run AI models locally with a simple Docker CLI command, no complex setup required.
  • Hardware Acceleration: Direct access to GPU resources without containerization overhead
  • Integrated Workflow: Seamless integration with existing Docker tools and container development practices
  • Standardized Packaging: Models are distributed as OCI artifacts through the same registries you already use
  • Cost Control: Eliminate unpredictable API costs by running models locally
  • Data Privacy: Keep sensitive data within your infrastructure with no external API calls

This approach fundamentally changes how developers can build and test AI-powered applications, making local development faster, more secure, and dramatically more efficient.

How to create an AI chatbot with Docker

In this guide, we’ll build a comprehensive GenAI application that showcases how to create a fully-featured chat interface powered by Docker Model Runner, complete with advanced observability tools to monitor and optimize your AI models.

Project overview

The project is a complete Generative AI interface that demonstrates how to:

  1. Create a responsive React/TypeScript chat UI with streaming responses
  2. Build a Go backend server that integrates with Docker Model Runner
  3. Implement comprehensive observability with metrics, logging, and tracing
  4. Monitor AI model performance with real-time metrics

Architecture

The application is built from a React frontend, a Go backend, the LLM served by Docker Model Runner, and an observability stack. A chat request flows through these components as follows:

  1. The frontend sends chat messages to the backend API
  2. The backend formats the messages and sends them to the Model Runner
  3. The LLM processes the input and generates a response
  4. The backend streams the tokens back to the frontend as they’re generated
  5. The frontend displays the incoming tokens in real-time
  6. Observability components collect metrics, logs, and traces throughout the process


Figure 2: Architecture diagram showing data flow between frontend, backend, Model Runner, and observability tools like Prometheus, Grafana, and Jaeger.

Project structure

The project has the following structure:

tree -L 2
.
├── Dockerfile
├── README-model-runner.md
├── README.md
├── backend.env
├── compose.yaml
├── frontend
..
├── go.mod
├── go.sum
├── grafana
│   └── provisioning
├── main.go
├── main_branch_update.md
├── observability
│   └── README.md
├── pkg
│   ├── health
│   ├── logger
│   ├── metrics
│   ├── middleware
│   └── tracing
├── prometheus
│   └── prometheus.yml
├── refs
│   └── heads
..


21 directories, 33 files

We’ll examine the key files and understand how they work together throughout this guide.

Prerequisites

Before we begin, make sure you have:

  • Docker Desktop (version 4.40 or newer) 
  • Docker Model Runner enabled
  • At least 16GB of RAM for running AI models efficiently
  • Familiarity with Go (for backend development)
  • Familiarity with React and TypeScript (for frontend development)

Getting started

To run the application:

  1. Clone the repository:

git clone https://github.com/dockersamples/genai-model-runner-metrics

cd genai-model-runner-metrics


  2. Enable Docker Model Runner in Docker Desktop:
  • Go to Settings > Features in Development > Beta tab
  • Enable “Docker Model Runner”
  • Select “Apply and restart”

Figure 3: Screenshot of Docker Desktop Beta Features settings panel with Docker AI, Docker Model Runner, and TCP support enabled.

  3. Download the model

For this demo, we’ll use Llama 3.2, but you can substitute any model of your choice:

docker model pull ai/llama3.2:1B-Q8_0


Just like viewing containers, you can manage your downloaded AI models directly in Docker Dashboard under the Models section. Here you can see model details, storage usage, and manage your local AI model library.


Figure 4: View of Docker Dashboard showing locally downloaded AI models with details like size, parameters, and quantization.

  4. Start the application:
docker compose up -d --build

Figure 5: List of active running containers in Docker Dashboard, including Jaeger, Prometheus, backend, frontend, and genai-model-runner-metrics.

  5. Open your browser and navigate to the frontend URL at http://localhost:3000. You’ll be greeted with a modern chat interface (see screenshot) featuring:
  • Clean, responsive design with dark/light mode toggle
  • Message input area ready for your first prompt
  • Model information displayed in the footer

Figure 6: GenAI chatbot interface showing live metrics panel with input/output tokens, response time, and error rate.

  6. Click on Expand to view metrics like:
  • Input tokens
  • Output tokens
  • Total Requests
  • Average Response Time
  • Error Rate

Figure 7: Expanded metrics view with input and output tokens, detailed chat prompt, and response generated by Llama 3.2 model.

Grafana allows you to visualize metrics through customizable dashboards. Click on View Detailed Dashboard to open the Grafana dashboard.


Figure 8: Chat interface showing metrics dashboard with prompt and response plus option to view detailed metrics in Grafana.

Log in with the default credentials (enter “admin” as user and password) to explore pre-configured AI performance dashboards (see screenshot below) showing real-time metrics like tokens per second, memory usage, and model performance. 

Select Add your first data source and choose Prometheus. Enter “http://prometheus:9090” as the Prometheus server URL, scroll to the bottom of the page, and click “Save and test”. You should see “Successfully queried the Prometheus API” as confirmation. Then select Dashboards and click Re-import for each of the listed dashboards.

By now, you should have a Prometheus 2.0 Stats dashboard up and running.


Figure 9: Grafana dashboard with multiple graph panels monitoring GenAI chatbot performance, displaying time-series charts for memory consumption, processing speeds, and application health

Prometheus allows you to collect and store time-series metrics data. Open the Prometheus query interface at http://localhost:9091 and start typing “genai” in the query box to explore all available AI metrics (as shown in the screenshot below). You’ll see dozens of automatically collected metrics, including tokens per second, latency measurements, and llama.cpp-specific performance data.


Figure 10: Prometheus web interface showing dropdown of available GenAI metrics including genai_app_active_requests and genai_app_token_latency
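
For example, the following query over the chat token counter (defined later in metrics.go) plots the per-model output-token rate. It is shown purely as an illustration and can be adapted to any of the genai_app_* metrics:

rate(genai_app_chat_tokens_total{direction="output"}[1m])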

Jaeger provides a visual exploration of request flows and performance bottlenecks. You can access it via http://localhost:16686.

Implementation details

Let’s explore how the key components of the project work:

  1. Frontend implementation

The React frontend provides a clean, responsive chat interface built with TypeScript and modern React patterns. The core App.tsx component manages two essential pieces of state: dark mode preferences for user experience and model metadata fetched from the backend’s health endpoint. 

When the component mounts, the useEffect hook automatically retrieves information about the currently running AI model. It displays details like the model name directly in the footer to give users transparency about which LLM is powering their conversations.

// Essential App.tsx structure
function App() {
  const [darkMode, setDarkMode] = useState(false);
  const [modelInfo, setModelInfo] = useState<ModelMetadata | null>(null);

  // Fetch model info from backend
  useEffect(() => {
    fetch('http://localhost:8080/health')
      .then(res => res.json())
      .then(data => setModelInfo(data.model_info));
  }, []);

  return (
    <div className="min-h-screen bg-white dark:bg-gray-900">
      <Header toggleDarkMode={() => setDarkMode(!darkMode)} />
      <ChatBox />
      <footer>
        Powered by Docker Model Runner running {modelInfo?.model}
      </footer>
    </div>
  );
}

The main App component orchestrates the overall layout while delegating specific functionality to specialized components like Header for navigation controls and ChatBox for the actual conversation interface. This separation of concerns keeps the codebase maintainable, and the automatic model info fetch shows how the frontend integrates with Docker Model Runner through the Go backend’s API, creating a unified user experience that hides the complexity of local AI model execution.

  2. Backend implementation: Integration with Model Runner

The core of this application is a Go backend that communicates with Docker Model Runner. Let’s examine the key parts of our main.go file:

client := openai.NewClient(
    option.WithBaseURL(baseURL),
    option.WithAPIKey(apiKey),
)

This demonstrates how we leverage Docker Model Runner’s OpenAI-compatible API. The Model Runner exposes endpoints that match OpenAI’s API structure, allowing us to use standard clients. Depending on your connection method, baseURL is set to either:

  • http://model-runner.docker.internal/engines/llama.cpp/v1/ (for Docker socket)
  • http://host.docker.internal:12434/engines/llama.cpp/v1/ (for TCP)
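
To make the request format concrete, here is a minimal sketch that calls the same OpenAI-compatible chat endpoint using only the Go standard library. The project itself uses the openai client shown above; the prompt, the non-streaming call, and the host-side base URL are illustrative assumptions:

// Minimal sketch: POST a chat completion to the Model Runner's OpenAI-compatible
// endpoint from the host. Inside a container you would use one of the base URLs
// listed above instead of localhost.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	baseURL := "http://localhost:12434/engines/llama.cpp/v1"

	payload, _ := json.Marshal(map[string]any{
		"model": "ai/llama3.2:1B-Q8_0",
		"messages": []map[string]string{
			{"role": "user", "content": "Say hello in one sentence."},
		},
	})

	resp, err := http.Post(baseURL+"/chat/completions", "application/json", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out struct {
		Choices []struct {
			Message struct {
				Content string `json:"content"`
			} `json:"message"`
		} `json:"choices"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	if len(out.Choices) > 0 {
		fmt.Println(out.Choices[0].Message.Content)
	}
}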

How metrics flow from host to containers

One key architectural detail worth understanding: llama.cpp runs natively on your host (via Docker Model Runner), while Prometheus and Grafana run in containers. Here’s how they communicate:

The Backend as Metrics Bridge:

  • Connects to llama.cpp via Model Runner API (http://localhost:12434)
  • Collects performance data from each API call (response times, token counts)
  • Calculates metrics like tokens per second and memory usage
  • Exposes all metrics in Prometheus format at http://backend:9090/metrics
  • Enables containerized Prometheus to scrape metrics without host access

This hybrid architecture gives you the performance benefits of native model execution with the convenience of containerized observability.
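
As a reference point, here is a minimal sketch of how a metrics bridge like this exposes its data with the Prometheus Go client. The real wiring lives in main.go and pkg/metrics; only the port is taken from the compose file:

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Metrics created with promauto register themselves with the default
	// registry, so exposing them is a one-liner.
	http.Handle("/metrics", promhttp.Handler())

	// Containerized Prometheus scrapes this as backend:9090/metrics.
	log.Fatal(http.ListenAndServe(":9090", nil))
}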

LLama.cpp metrics integration

The project provides detailed real-time metrics specifically for llama.cpp models:

  • Tokens per Second: Measure of model generation speed (LlamaCppTokensPerSecond in metrics.go)
  • Context Window Size: Maximum context length in tokens (LlamaCppContextSize in metrics.go)
  • Prompt Evaluation Time: Time spent processing the input prompt (LlamaCppPromptEvalTime in metrics.go)
  • Memory per Token: Memory efficiency measurement (LlamaCppMemoryPerToken in metrics.go)
  • Thread Utilization: Number of CPU threads used (LlamaCppThreadsUsed in metrics.go)
  • Batch Size: Token processing batch size (LlamaCppBatchSize in metrics.go)

One of the most powerful features is our detailed metrics collection for llama.cpp models. These metrics help optimize model performance and identify bottlenecks in your inference pipeline.

// LlamaCpp metrics
llamacppContextSize = promautoFactory.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "genai_app_llamacpp_context_size",
        Help: "Context window size in tokens for llama.cpp models",
    },
    []string{"model"},
)

llamacppTokensPerSecond = promautoFactory.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "genai_app_llamacpp_tokens_per_second",
        Help: "Tokens generated per second",
    },
    []string{"model"},
)

// More metrics definitions...


These metrics are collected, processed, and exposed both for Prometheus scraping and for real-time display in the front end. This gives us unprecedented visibility into how the llama.cpp inference engine is performing.

Chat implementation with streaming

The chat endpoint implements streaming for real-time token generation:


// Set up streaming with a proper SSE format
w.Header().Set("Content-Type", "text/event-stream")
w.Header().Set("Cache-Control", "no-cache")
w.Header().Set("Connection", "keep-alive")

// Stream each chunk as it arrives
if len(chunk.Choices) > 0 && chunk.Choices[0].Delta.Content != "" {
    outputTokens++
    _, err := fmt.Fprintf(w, "%s", chunk.Choices[0].Delta.Content)
    if err != nil {
        log.Printf("Error writing to stream: %v", err)
        return
    }
    w.(http.Flusher).Flush()
}

This streaming implementation ensures that tokens appear in real-time in the user interface, providing a smooth and responsive chat experience. You can also measure key performance metrics like time to first token and tokens per second.

Performance measurement

You can measure various performance aspects of the model:

// Record first token time
if firstTokenTime.IsZero() && len(chunk.Choices) > 0 && 
chunk.Choices[0].Delta.Content != "" {
    firstTokenTime = time.Now()
    
    // For llama.cpp, record prompt evaluation time
    if strings.Contains(strings.ToLower(model), "llama") || 
       strings.Contains(apiBaseURL, "llama.cpp") {
        promptEvalTime := firstTokenTime.Sub(promptEvalStartTime)
        llamacppPromptEvalTime.WithLabelValues(model).Observe(promptEvalTime.Seconds())
    }
}

// Calculate tokens per second for llama.cpp metrics
if strings.Contains(strings.ToLower(model), "llama") || 
   strings.Contains(apiBaseURL, "llama.cpp") {
    totalTime := time.Since(firstTokenTime).Seconds()
    if totalTime > 0 && outputTokens > 0 {
        tokensPerSecond := float64(outputTokens) / totalTime
        llamacppTokensPerSecond.WithLabelValues(model).Set(tokensPerSecond)
    }
}

These measurements help us understand the model’s performance characteristics and optimize the user experience.

Metrics collection

The metrics.go file is a core component of our observability stack for the Docker Model Runner-based chatbot. This file defines a comprehensive set of Prometheus metrics that allow us to monitor both the application performance and the underlying llama.cpp model behavior.

Core metrics architecture

The file establishes a collection of Prometheus metric types:

  • Counters: For tracking cumulative values (like request counts, token counts)
  • Gauges: For tracking values that can increase and decrease (like active requests)
  • Histograms: For measuring distributions of values (like latencies)

Each metric is created using the promauto factory, which automatically registers metrics with Prometheus.

Categories of metrics

The metrics can be divided into three main categories:

1. HTTP and application metrics

// RequestCounter counts total HTTP requests
RequestCounter = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Name: "genai_app_http_requests_total",
        Help: "Total number of HTTP requests",
    },
    []string{"method", "endpoint", "status"},
)

// RequestDuration measures HTTP request durations
RequestDuration = promauto.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "genai_app_http_request_duration_seconds",
        Help:    "HTTP request duration in seconds",
        Buckets: prometheus.DefBuckets,
    },
    []string{"method", "endpoint"},
)

These metrics monitor the HTTP server performance, tracking request counts, durations, and error rates. The metrics are labelled with dimensions like method, endpoint, and status to enable detailed analysis.
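
To illustrate how these two metrics might be fed, here is a hypothetical middleware sketch; the project keeps its real version in pkg/middleware, and the statusRecorder helper here is an assumption (it also needs the net/http, strconv, and time imports):

// Hypothetical instrumentation middleware, assumed to live alongside the
// RequestCounter and RequestDuration definitions above.
func Instrument(endpoint string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}

		next.ServeHTTP(rec, r)

		RequestDuration.WithLabelValues(r.Method, endpoint).Observe(time.Since(start).Seconds())
		RequestCounter.WithLabelValues(r.Method, endpoint, strconv.Itoa(rec.status)).Inc()
	})
}

// statusRecorder captures the status code written by the handler so it can be
// reported as the "status" label.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}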

2. Model performance metrics

// ChatTokensCounter counts tokens in chat requests and responses
ChatTokensCounter = promauto.NewCounterVec(
    prometheus.CounterOpts{
        Name: "genai_app_chat_tokens_total",
        Help: "Total number of tokens processed in chat",
    },
    []string{"direction", "model"},
)

// ModelLatency measures model response time
ModelLatency = promauto.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "genai_app_model_latency_seconds",
        Help:    "Model response time in seconds",
        Buckets: []float64{0.1, 0.5, 1, 2, 5, 10, 20, 30, 60},
    },
    []string{"model", "operation"},
)

These metrics track the LLM usage patterns and performance, including token counts (both input and output) and overall latency. The FirstTokenLatency metric is particularly important as it measures the time to get the first token from the model, which is a critical user experience factor.
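
For context, recording these metrics inside the chat handler would look roughly like the fragment below. The label values ("input", "output", "chat") follow the label names defined above; this is a sketch, not code copied from main.go:

start := time.Now()

// ... call Docker Model Runner and stream the response ...

ModelLatency.WithLabelValues(model, "chat").Observe(time.Since(start).Seconds())
ChatTokensCounter.WithLabelValues("input", model).Add(float64(inputTokens))
ChatTokensCounter.WithLabelValues("output", model).Add(float64(outputTokens))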

3. llama.cpp specific metrics

// LlamaCppContextSize measures the context window size
LlamaCppContextSize = promauto.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "genai_app_llamacpp_context_size",
        Help: "Context window size in tokens for llama.cpp models",
    },
    []string{"model"},
)

// LlamaCppTokensPerSecond measures generation speed
LlamaCppTokensPerSecond = promauto.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "genai_app_llamacpp_tokens_per_second",
        Help: "Tokens generated per second",
    },
    []string{"model"},
)

These metrics capture detailed performance characteristics specific to the llama.cpp inference engine used by Docker Model Runner. They include:

1. Context Size: 

It represents the token window size used by the model, typically ranging from 2048 to 8192 tokens. The optimization goal is balancing memory usage against conversation quality. When memory usage becomes problematic, reduce the context size to 2048 tokens for faster processing.

2. Prompt Evaluation Time

It measures the time spent processing input before generating tokens, essentially your time-to-first-token latency with a target of under 2 seconds. The optimization focus is minimizing user wait time for the initial response. If evaluation time exceeds 3 seconds, reduce context size or implement prompt compression techniques.

3. Tokens Per Second

It indicates generation speed, with a target of 8+ TPS for a good user experience. This metric requires balancing response speed with model quality. When TPS drops below 5, switch to more aggressive quantization (Q4 instead of Q8) or use a smaller model variant.

4. Memory Per Token

It tracks RAM consumption per generated token, with optimization aimed at preventing out-of-memory crashes and optimizing resource usage. When memory consumption exceeds 100MB per token, implement aggressive conversation pruning to reduce memory pressure. If memory usage grows over time during extended conversations, add automatic conversation resets after a set number of exchanges.

5. Threads Used

It monitors the number of CPU cores actively processing model operations, with the goal of maximizing throughput without overwhelming the system. If thread utilization falls below 50% of available cores, increase the thread count for better performance.

6. Batch Size

It controls how many tokens are processed simultaneously, requiring optimization based on your specific use case balancing latency versus throughput. For real-time chat applications, use smaller batches of 32-64 tokens to minimize latency and provide faster response times.

In a nutshell, these metrics are crucial for understanding and optimizing llama.cpp performance characteristics, which directly affect the user experience of the chatbot.
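
If you want to act on these rules of thumb in code, a hypothetical helper could turn them into log warnings. The thresholds mirror the guidance above; the function itself is illustrative and not part of the project:

// checkLlamaCppHealth logs a warning whenever a llama.cpp metric drifts past
// the thresholds discussed in this section.
func checkLlamaCppHealth(tokensPerSecond, promptEvalSeconds, memPerTokenMB float64, threadsUsed, availableCores int) {
	if tokensPerSecond < 5 {
		log.Printf("warning: %.1f tokens/s; consider Q4 quantization or a smaller model", tokensPerSecond)
	}
	if promptEvalSeconds > 3 {
		log.Printf("warning: prompt evaluation took %.1fs; reduce context size or compress prompts", promptEvalSeconds)
	}
	if memPerTokenMB > 100 {
		log.Printf("warning: %.0f MB per token; prune conversation history or reset long chats", memPerTokenMB)
	}
	if availableCores > 0 && threadsUsed < availableCores/2 {
		log.Printf("warning: only %d of %d cores in use; increase the thread count", threadsUsed, availableCores)
	}
}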

Docker Compose: LLM as a first-class service

With Docker Model Runner integration, Compose makes AI model deployment as simple as any other service. One compose.yaml file defines your entire AI application:

  • Your AI models (via Docker Model Runner)
  • Application backend and frontend
  • Observability stack (Prometheus, Grafana, Jaeger)
  • All networking and dependencies

The most innovative aspect is the llm service using Docker’s model provider, which simplifies model deployment by directly integrating with Docker Model Runner without requiring complex configuration. This composition creates a complete, scalable AI application stack with comprehensive observability.

  llm:
    provider:
      type: model
      options:
        model: ${LLM_MODEL_NAME:-ai/llama3.2:1B-Q8_0}


This configuration tells Docker Compose to treat an AI model as a standard service in your application stack, just like a database or web server.

  • The provider syntax is Docker’s new way of handling AI models natively. Instead of building containers or pulling images, Docker automatically manages the entire model-serving infrastructure for you. 
  • The model: ${LLM_MODEL_NAME:-ai/llama3.2:1B-Q8_0} line uses an environment variable with a fallback, meaning it will use whatever model you specify in LLM_MODEL_NAME, or default to Llama 3.2 1B if nothing is set.

Docker Compose: One command to run your entire stack

Why is this revolutionary? Before this, deploying an LLM required dozens of lines of complex configuration – custom Dockerfiles, GPU device mappings, volume mounts for model files, health checks, and intricate startup commands.

Now, those four lines replace all of that complexity. Docker handles downloading the model, configuring the inference engine, setting up GPU access, and exposing the API endpoints automatically. Your other services can connect to the LLM using simple service names, making AI models as easy to use as any other infrastructure component. This transforms AI from a specialized deployment challenge into standard infrastructure-as-code.

Here’s the full compose.yaml file that orchestrates the entire application:

services:
  backend:
    env_file: 'backend.env'
    build:
      context: .
      target: backend
    ports:
      - '8080:8080'
      - '9090:9090'  # Metrics port
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock  # Add Docker socket access
    healthcheck:
      test: ['CMD', 'wget', '-qO-', 'http://localhost:8080/health']
      interval: 3s
      timeout: 3s
      retries: 3
    networks:
      - app-network
    depends_on:
      - llm

  frontend:
    build:
      context: ./frontend
    ports:
      - '3000:3000'
    depends_on:
      backend:
        condition: service_healthy
    networks:
      - app-network

  prometheus:
    image: prom/prometheus:v2.45.0
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    ports:
      - '9091:9090'
    networks:
      - app-network

  grafana:
    image: grafana/grafana:10.1.0
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_DOMAIN=localhost
    ports:
      - '3001:3000'
    depends_on:
      - prometheus
    networks:
      - app-network

  jaeger:
    image: jaegertracing/all-in-one:1.46
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411
    ports:
      - '16686:16686'  # UI
      - '4317:4317'    # OTLP gRPC
      - '4318:4318'    # OTLP HTTP
    networks:
      - app-network

  # New LLM service using Docker Compose's model provider
  llm:
    provider:
      type: model
      options:
        model: ${LLM_MODEL_NAME:-ai/llama3.2:1B-Q8_0}

volumes:
  grafana-data:

networks:
  app-network:
    driver: bridge

This compose.yaml defines a complete microservices architecture for the application with integrated observability tools and Model Runner support:

backend

  • Go-based API server with Docker socket access for container management
  • Implements health checks and exposes both API (8080) and metrics (9090) ports

frontend

  • React-based user interface for an interactive chat experience
  • Waits for backend health before starting to ensure system reliability

prometheus

  • Time-series metrics database for collecting and storing performance data
  • Configured with custom settings for monitoring application behavior
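
The prometheus.yml mounted above needs a scrape job pointing at the backend's metrics port. A minimal configuration of that shape looks like the following; the repository's actual file may define different jobs or intervals:

global:
  scrape_interval: 5s

scrape_configs:
  - job_name: 'genai-backend'
    static_configs:
      - targets: ['backend:9090']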

grafana

  • Data visualization platform for metrics with persistent dashboard storage
  • Pre-configured with admin access and connected to the Prometheus data source

jaeger

  • Distributed tracing system for visualizing request flows across services
  • Supports multiple protocols (gRPC/HTTP) with UI on port 16686

How Docker Model Runner integration works

The project integrates with Docker Model Runner through the following mechanisms:

  1. Connection Configuration:
    • Using internal DNS: http://model-runner.docker.internal/engines/llama.cpp/v1/
    • Using TCP via host-side support: localhost:12434
  2. Docker’s Host Networking:
    • The extra_hosts configuration maps host.docker.internal to the host’s gateway IP
  3. Environment Variables:
    • BASE_URL: URL for the model runner
    • MODEL: Model identifier (e.g., ai/llama3.2:1B-Q8_0)
  4. API Communication:
    • The backend formats messages and sends them to Docker Model Runner
    • It then streams tokens back to the frontend in real-time
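
A small sketch of how the backend might resolve these settings at startup; the variable names match the list above, while the fallback values are assumptions:

// Read Model Runner settings from the environment, falling back to the
// Docker-socket URL and default model named in this article.
baseURL := os.Getenv("BASE_URL")
if baseURL == "" {
	baseURL = "http://model-runner.docker.internal/engines/llama.cpp/v1/"
}

model := os.Getenv("MODEL")
if model == "" {
	model = "ai/llama3.2:1B-Q8_0"
}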

Why this approach excels

Building GenAI applications with Docker Model Runner and comprehensive observability offers several advantages:

  • Privacy and Security: All data stays on your local infrastructure
  • Cost Control: No per-token or per-request API charges
  • Performance Insights: Deep visibility into model behavior and efficiency
  • Developer Experience: Familiar Docker-based workflow with powerful monitoring
  • Flexibility: Easy to experiment with different models and configurations

Conclusion

The genai-model-runner-metrics project demonstrates a powerful approach to building AI-powered applications with Docker Model Runner while maintaining visibility into performance characteristics. By combining local model execution with comprehensive metrics, you get the best of both worlds: the privacy and cost benefits of local execution with the observability needed for production applications.

Whether you’re building a customer support bot, a content generation tool, or a specialized AI assistant, this architecture provides the foundation for reliable, observable, and efficient AI applications. The metrics-driven approach ensures you can continuously monitor and optimize your application, leading to better user experiences and more efficient resource utilization.

Ready to get started? Clone the repository, fire up Docker Desktop, and experience the future of AI development — your own local, metrics-driven GenAI application is just a docker compose up away!


Leveraging Docker with TensorFlow Models & TensorFlow.js for a Snake AI Game

By: Harsh Manvar
March 18, 2025 at 16:15

The emergence of containerization has brought about a significant transformation in software development and deployment by providing a consistent and scalable environment on many platforms. For developers trying to optimize their operations, Docker in particular has emerged as the preferred technology. By using containers as the foundation, developers gain the same key benefits — portability, scalability, efficiency, and security — that they rely on for other workloads, seamlessly extending them to ML and AI applications. In this article, we’ll use a real-world example involving a Snake AI game to examine how TensorFlow.js can be used with Docker to run AI/ML in a web browser.

Why Docker for TensorFlow.js conversion?

With the help of TensorFlow.js, a robust toolkit, machine learning models can be executed in a web browser, opening up a plethora of possibilities for applications such as interactive demonstrations and real-time inference. By enclosing the conversion process inside a container, Docker offers a sophisticated solution that guarantees consistency and user-friendliness throughout.

The Snake AI neural network game

The Snake AI game brings a modern twist to the classic Snake game by integrating artificial intelligence that learns and improves its gameplay over time. In this version, you can either play manually using arrow keys or let the AI take control. 

The AI running with TensorFlow continuously improves by making strategic movements to maximize the score while avoiding collisions. The game runs in a browser using TensorFlow.js, allowing you to test different trained models and observe how the AI adapts to various challenges. 

Whether you’re playing for fun or experimenting with AI models, this game is a great way to explore the intersection of gaming and machine learning. In our approach, we’ve used the neural network to play the traditional Snake game.

Before we dive into the AI in this Snake game, let’s understand the basics of a neural network.

What is a neural network?

A neural network is a type of machine learning model inspired by the way the human brain works. It’s made up of layers of nodes (or “neurons”), where each node takes some inputs, processes them, and passes an output to the next layer.

Key components of a neural network:

  • Input layer: Receives the raw data (in our game, the snake’s surroundings).
  • Hidden layers: Process the input and extract patterns.
  • Output layer: Gives the final prediction

Figure 1: Key components of a neural network

Imagine each neuron as a small decision-maker. The more neurons and layers, the better the network can recognize patterns.

Types of neural networks

There are several types of neural networks, each suited for different tasks!

  1. Feedforward Neural Networks (FNNs):
  • The simplest type, where data flows in one direction, from input to output.
  • Great for tasks like classification, regression, and pattern recognition.
  2. Convolutional Neural Networks (CNNs):
  • Designed to work with image data.
  • Uses filters to detect spatial patterns, like edges or textures.
  3. Recurrent Neural Networks (RNNs):
  • Good for sequence prediction (e.g., stock prices, text generation).
  • Remembers previous inputs, allowing it to handle time-series data.
  4. Long Short-Term Memory Networks (LSTMs):
  • A specialized type of RNN that can learn long-term dependencies.
  5. Generative Adversarial Networks (GANs):
  • Used for generating new data, such as creating images or deepfakes.

When to use each type:

  • CNNs: Image classification, object detection, facial recognition.
  • RNNs/LSTMs: Language models, stock market prediction, time-series data.
  • FNNs: Games like Snake, where the task is to predict the next action based on the current state.

How does the game work?

The snake game offers two ways to play:

Manual mode:

  • You control the snake using your keyboard’s arrow keys.
  • The goal is to eat the fruit (red square) without hitting the walls or yourself.
  • Every time you eat a fruit, the snake grows longer, and your score increases.
  • If the snake crashes, the game ends, and you can restart to try again.

AI mode:

  • The game plays itself using a neural network (a type of AI brain) built with TensorFlow.js.
  • The AI looks at the snake’s surroundings (walls, fruit location, and snake body) and predicts the best move: left, forward, or right.
  • After each game, the AI learns from its mistakes and becomes smarter over time.
  • With enough training, the AI can avoid crashing and improve its score.

Figure 2: The classic Nokia snake game with an AI twist

Getting started

Let’s go through how this game is built step by step. You’ll first need to install Docker to run the game in a web browser. Here’s a summary of the steps. 

  • Clone the repository
  • Install Docker Desktop
  • Create a Dockerfile
  • Build the Docker image
  • Run the container
  • Access the game using the web browser

Cloning the repository

git clone https://github.com/dockersamples/snake-game-tensorflow-docker

Install Docker Desktop

Prerequisites:

  • A supported version of Mac, Linux, or Windows
  • At least 4 GB RAM

Download Docker Desktop from the official Docker website. Select the version appropriate for your system (Apple Silicon or Intel chip for Mac users, Windows, or your Linux distribution).


Figure 3: Download Docker Desktop from the Docker website for your machine

Quick run

After installing Docker Desktop, execute the following command in your terminal to run the pre-built Docker image. It pulls the image, starts a new container running the snake-game:v1 image, and exposes port 8080 on the host machine.

Run the following command to bring up the application:

docker compose up

Next, open the browser and go to http://localhost:8080 to see the output of the snake game and start your first game.

Why use Docker to run the snake game?

  • No need to install Nginx on your machine — Docker handles it.
  • The game runs the same way on any system that supports Docker.
  • You can easily share your game as a Docker image, and others can run it with a single command.

The game logic

The index.html file acts as the foundation of the game, defining the layout and structure of the webpage. It fetches the TensorFlow.js library, which powers the AI, along with script.js for gameplay logic and ai.js for AI-based movements. The game UI is simple yet functional, featuring a mode selector that lets players switch between manual control (using arrow keys) and AI mode. The scoreboard dynamically updates the score, high score, and generation count when the AI is training. The game itself runs on an HTML <canvas> element, making it highly interactive. As we move forward, we’ll explore how the JavaScript files bring this game to life!

File : index.html

The HTML file sets up the structure of the game, like the game canvas and control buttons. It also fetches the TensorFlow.js library, which the code later uses to train the snake.

  • Canvas: Where the game is drawn.
  • Mode Selector: Lets you switch between manual and AI gameplay.
  • TensorFlow.js: The library that powers the AI brain!

File : script.js

This file handles everything in the game—drawing the board, moving the snake, placing the fruit, and keeping score.

const canvas = document.getElementById('gameCanvas');
const ctx = canvas.getContext('2d');

let snake = [{ x: 5, y: 5 }];
let fruit = { x: 10, y: 10 };
let direction = { x: 1, y: 0 };
let score = 0;
  • Snake Position: Where the snake starts.
  • Fruit Position: Where the apple is.
  • Direction: Which way the snake is moving.

The game loop

The game loop keeps the game running, updating the snake’s position, checking for collisions, and handling the score.

function gameLoopManual() {
    const head = { x: snake[0].x + direction.x, y: snake[0].y + direction.y };

    if (head.x === fruit.x && head.y === fruit.y) {
        score++;
        fruit = placeFruit();
    } else {
        snake.pop();
    }
    snake.unshift(head);
}
  • Moving the Snake: Adds a new head to the snake and removes the tail (unless it eats an apple).
  • Collision: If the head hits the wall or its own body, the game ends.

Switching between modes

document.getElementById('mode').addEventListener('change', function() {
    gameMode = this.value;
});
  • Manual Mode: Use arrow keys to control the snake.
  • AI Mode: The neural network predicts the next move.

Game over and restart

function gameOver() {
    clearInterval(gameInterval);
    alert('Game Over');
}

function resetGame() {
    score = 0;
    snake = [{ x: 5, y: 5 }];
    fruit = placeFruit();
}
  • Game Over: Stops the game when the snake crashes.
  • Reset: Resets the game back to the beginning.

Training the AI

File : ai.js

This file creates and trains the neural network — the AI brain that learns how to play Snake!

var movementOptions = ['left', 'forward', 'right'];

const neuralNet = tf.sequential();
neuralNet.add(tf.layers.dense({units: 256, inputShape: [5]}));
neuralNet.add(tf.layers.dense({units: 512}));
neuralNet.add(tf.layers.dense({units: 256}));
neuralNet.add(tf.layers.dense({units: 3}));
  • Neural Network: This is a simulated brain with four layers of neurons.
  • Input: Information about the game state (walls, fruit position, snake body).
  • Output: One of three choices: turn left, go forward, or turn right.
const optAdam = tf.train.adam(.001);
neuralNet.compile({
  optimizer: optAdam,
  loss: 'meanSquaredError'
});
  • Optimizer: Helps the brain learn efficiently by adjusting its weights.
  • Loss Function: Measures how wrong the AI’s predictions are, helping it improve.

Every time the snake plays a game, it remembers its moves and trains itself.

async function trainNeuralNet(moveRecord) {
  for (var i = 0; i < moveRecord.length; i++) {
     const expected = tf.oneHot(tf.tensor1d([deriveExpectedMove(moveRecord[i])], 'int32'), 3).cast('float32');
     posArr = tf.tensor2d([moveRecord[i]]);
     await neuralNet.fit(posArr, expected, { batchSize: 3, epochs: 1 });
     expected.dispose();
     posArr.dispose();
  }
}

After each game, the AI looks at what happened, adjusts its internal connections, and tries to improve for the next game.

The movementOptions array defines the possible movement directions for the snake: ‘left’, ‘forward’, and ‘right’.

The model is compiled with an Adam optimizer using a learning rate of 0.001 and a mean squared error loss function. The trainNeuralNet function is defined to train the neural network using a given moveRecord array. It iterates over the moveRecord array, creates one-hot encoded tensors for the expected movement, and trains the model using the TensorFlow.js fit method.

Predicting the next move

When playing, the AI predicts what the best move should be.

function computePrediction(input) {
  let inputs = tf.tensor2d([input]);
  const outputs = neuralNet.predict(inputs);
  return movementOptions[outputs.argMax(1).dataSync()[0]];
}

The computePrediction function makes predictions using the trained neural network. It takes an input array, creates a tensor from the input, predicts the movement using the neural network, and returns the movement option based on the predicted output.

The code demonstrates the creation of a neural network model, training it with a given move record, and making predictions using the trained model. This approach can enhance the snake AI game’s performance and intelligence by learning from its history and making informed decisions.

File : Dockerfile

FROM nginx:latest
COPY . /usr/share/nginx/html

FROM nginx:latest

  • This tells Docker to use the latest version of Nginx as the base image.
  • Nginx is a web server that’s great for serving static files like HTML, CSS, and JavaScript (which is what the Snake game is made of).
  • Instead of creating a server from scratch, this line saves time by using an existing, reliable Nginx setup.

COPY . /usr/share/nginx/html

  • This line copies everything from your current project directory (the one with the Snake game files: index.html, script.js, ai.js, etc.) into the Nginx web server’s default folder for serving web content.
  • /usr/share/nginx/html is the default folder where Nginx looks for web files to display.

Development setup

Here’s how you can set up the development environment for the Snake game using Docker!

To make the development smoother, you can use Docker Volumes to avoid rebuilding the Docker image every time you change the game files. 

Run the command from the folder where Snake-AI-TensorFlow-Docker code exists:

docker run -it --rm -d -p 8080:80 --name web -v ./:/usr/share/nginx/html nginx

If you hit an error like this:

docker: Error response from daemon: Mounts denied: 
The path /Users/harsh/Downloads/Snake-AI-TensorFlow-Docker is not shared from the host and is not known to Docker.
You can configure shared paths from Docker -> Preferences... -> Resources -> File Sharing.
See https://docs.docker.com/desktop/settings/mac/#file-sharing for more info.

Open Docker Desktop, go to Settings -> Resources -> File Sharing -> Select the location where you cloned the repository code, and click on Apply & restart.


Figure 4: Using Docker Desktop to set up a development environment for the AI snake game

Run the command again; you won’t face any errors now:

docker run -it --rm -d -p 8080:80 --name web -v ./:/usr/share/nginx/html nginx

Check if the container is running with the following command:

harsh@Harshs-MacBook-Air snake-game-tensorflow-docker % docker ps
CONTAINER ID   IMAGE     COMMAND                  CREATED         STATUS         PORTS                  NAMES
c47e2711b2db   nginx     "/docker-entrypoint.…"   3 seconds ago   Up 2 seconds   0.0.0.0:8080->80/tcp   web

Open the browser and go to http://localhost:8080, and you’ll see the snake game output. This setup is perfect for development because it keeps everything fast and dynamic.


Figure 5: Accessing the snake game via a web browser.

Changes in the code are reflected right away in the browser without rebuilding the container. The -v ./:/usr/share/nginx/html flag is the magic part: it mounts your local directory (Snake-AI-TensorFlow-Docker) into the Nginx HTML directory (/usr/share/nginx/html) inside the container.

Any changes you make to the HTML, CSS, or JavaScript files in Snake-AI-TensorFlow-Docker immediately show up in the running app without needing to rebuild the container.

Conclusion 

In conclusion, building a Snake AI game using TensorFlow.js and Docker demonstrates how seamlessly ML and AI can be integrated into interactive web applications. Through this project, we’ve not only explored the fundamentals of reinforcement learning but also seen firsthand how Docker can simplify the development and deployment process.

By containerizing the application, Docker ensures a consistent environment across different systems, eliminating the common “it works on my machine” problem. This consistency makes it easier to collaborate with others, deploy to production, and manage dependencies without worrying about version mismatches or local configuration issues, whether you’re building web applications, machine learning projects, or AI applications.

LLM Everywhere: Docker for Local and Hugging Face Hosting

By: Harsh Manvar
November 9, 2023 at 16:13

This post is written in collaboration with Docker Captain Harsh Manvar.

Hugging Face has become a powerhouse in the field of machine learning (ML). Their large collection of pretrained models and user-friendly interfaces have entirely changed how we approach AI/ML deployment and spaces. If you’re interested in looking deeper into the integration of Docker and Hugging Face models, a comprehensive guide can be found in the article “Build Machine Learning Apps with Hugging Face’s Docker Spaces.”

The Large Language Model (LLM) — a marvel of language generation — is an astounding invention. In this article, we’ll look at how to use the Hugging Face hosted Llama model in a Docker context, opening up new opportunities for natural language processing (NLP) enthusiasts and researchers.


Introduction to Hugging Face and LLMs

Hugging Face (HF) provides a comprehensive platform for training, fine-tuning, and deploying ML models. LLMs, in turn, are state-of-the-art models capable of performing tasks like text generation, completion, and classification.

Leveraging Docker for ML

The robust Docker containerization technology makes it easier to package, distribute, and operate programs. It guarantees that ML models operate consistently across various contexts by enclosing them within Docker containers. Reproducibility is ensured, and the age-old “it works on my machine” issue is resolved.

Type of formats

For the majority of models on Hugging Face, two options are available. 

  • GPTQ (usually 4-bit or 8-bit, GPU only)
  • GGML (usually 4-, 5-, 8-bit, CPU/GPU hybrid)

GGML and GPTQ are two well-known quantized model formats; quantization can be applied either during or after training. By reducing model weights to a lower precision, these formats minimize model size and computational needs.

HF models load on the GPU, which performs inference significantly more quickly than the CPU. Generally, the model is huge, and you also need a lot of VRAM. In this article, we will utilize the GGML model, which operates well on CPU and is probably faster if you don’t have a good GPU.

We will also be using transformers and ctransformers in this demonstration, so let’s first understand those: 

  • transformers: Modern pretrained models can be downloaded and trained with ease thanks to transformers’ APIs and tools. By using pretrained models, you can cut down on the time and resources needed to train a model from scratch, as well as your computational expenses and carbon footprint.
  • ctransformers: Python bindings for the transformer models developed in C/C++ with the GGML library.

Request Llama model access

We will utilize the Meta Llama model. Sign up on Hugging Face and request access to the model.

Create Hugging Face token

To create an Access token that will be used in the future, go to your Hugging Face profile settings and select Access Token from the left-hand sidebar (Figure 1). Save the value of the created Access Token.

Figure 1. Generating an Access Token in the Hugging Face profile settings.

Setting up Docker environment

Before exploring the realm of the LLM, we must first configure our Docker environment. Install Docker first, following the instructions on the official Docker website based on your operating system. After installation, execute the following command to confirm your setup:

docker --version

Quick demo

The following command runs a container with the Hugging Face harsh-manvar-llama-2-7b-chat-test:latest image and exposes port 7860 from the container to the host machine. It will also set the environment variable HUGGING_FACE_HUB_TOKEN to the value you provided.

docker run -it -p 7860:7860 --platform=linux/amd64 \
    -e HUGGING_FACE_HUB_TOKEN="YOUR_VALUE_HERE" \
    registry.hf.space/harsh-manvar-llama-2-7b-chat-test:latest python app.py
  • The -it flag tells Docker to run the container in interactive mode and to attach a terminal to it. This will allow you to interact with the container and its processes.
  • The -p flag tells Docker to expose port 7860 from the container to the host machine. This means that you will be able to access the container’s web server from the host machine on port 7860.
  • The --platform=linux/amd64 flag tells Docker to run the container on a Linux machine with an AMD64 architecture.
  • The -e HUGGING_FACE_HUB_TOKEN="YOUR_VALUE_HERE" flag tells Docker to set the environment variable HUGGING_FACE_HUB_TOKEN to the value you provided. This is required for accessing Hugging Face models from the container.

The app.py script is the Python script that you want to run in the container. This will start the container and open a terminal to it. You can then interact with the container and its processes in the terminal. To exit the container, press Ctrl+C.

Accessing the landing page

To access the container’s web server, open a web browser and navigate to http://localhost:7860. You should see the landing page for your Hugging Face model (Figure 2).


Figure 2. The landing page of the local Docker LLM, which greets you with a short welcome message.

Getting started

Cloning the project

To get started, you can clone or download the existing Hugging Face space/repository.

git clone https://huggingface.co/spaces/harsh-manvar/llama-2-7b-chat-test

File: requirements.txt

A requirements.txt file is a text file that lists the Python packages and modules that a project needs to run. It is used to manage the project’s dependencies and to ensure that all developers working on the project are using the same versions of the required packages.

The following Python packages are required to run the Hugging Face llama-2-13b-chat model. Note that this model is large, and it may take some time to download and install. You may also need to increase the memory allocated to your Python process to run the model.

gradio==3.37.0
protobuf==3.20.3
scipy==1.11.1
torch==2.0.1
sentencepiece==0.1.99
transformers==4.31.0
ctransformers==0.2.27

File: Dockerfile

FROM python:3.9
RUN useradd -m -u 1000 user
WORKDIR /code
COPY ./requirements.txt /code/requirements.txt
RUN pip install --upgrade pip
RUN pip install --no-cache-dir --upgrade -r /code/requirements.txt
USER user
COPY --link --chown=1000 ./ /code

The following section provides a breakdown of the Dockerfile. The first line tells Docker to use the official Python 3.9 image as the base image for our image:

FROM python:3.9

The following line creates a new user named user with the user ID 1000. The -m flag tells Docker to create a home directory for the user.

RUN useradd -m -u 1000 user

Next, this line sets the working directory for the container to /code.

WORKDIR /code

The next lines copy the requirements file from the current directory to /code in the container, upgrade the pip package manager, and install the Python dependencies.

COPY ./requirements.txt /code/requirements.txt
RUN pip install --upgrade pip
RUN pip install --no-cache-dir --upgrade -r /code/requirements.txt

This line sets the default user for the container to user.

USER user

The following line copies the contents of the current directory to /code in the container. The --link flag tells Docker to copy the files into their own independent image layer, which can improve build performance and caching. The --chown=1000 flag tells Docker to change the ownership of the copied files to the user user.

COPY --link --chown=1000 ./ /code

Once you have built the Docker image, you can run it using the docker run command. This will start a new container running the Python 3.9 image with the non-root user user. You can then interact with the container using the terminal.

File: app.py

The Python code shows how to use Gradio to create a demo for a text-generation model trained using transformers. The code allows users to input a text prompt and generate a continuation of the text.

Gradio is a Python library that allows you to create and share interactive machine learning demos with ease. It provides a simple and intuitive interface for creating and deploying demos, and it supports a wide range of machine learning frameworks and libraries, including transformers.

This Python script is a Gradio demo for a text chatbot. It uses a pretrained text generation model to generate responses to user input. We’ll break down the file and look at each of the sections.

The following line imports the Iterator type from the typing module. This type is used to represent a sequence of values that can be iterated over. The next line imports the gradio library as well.

from typing import Iterator
import gradio as gr

The following line imports the logging module from the transformers library, which is a popular machine learning library for natural language processing.

from transformers.utils import logging
from model import get_input_token_length, run

Next, this line imports the get_input_token_length() and run() functions from the model module. These functions are used to calculate the input token length of a text and generate text using a pretrained text generation model, respectively. The next two lines configure the logging module to print information-level messages and to use the transformers logger.

from model import get_input_token_length, run

logging.set_verbosity_info()
logger = logging.get_logger("transformers")

The following lines define some constants that are used throughout the code. Also, the lines define the text that is displayed in the Gradio demo.

DEFAULT_SYSTEM_PROMPT = """"""
MAX_MAX_NEW_TOKENS = 2048
DEFAULT_MAX_NEW_TOKENS = 1024
MAX_INPUT_TOKEN_LENGTH = 4000

DESCRIPTION = """"""

LICENSE = """"""

This line logs an information-level message indicating that the code is starting. The function that follows clears the textbox and saves the input message to the saved_input state variable.

logger.info("Starting")
def clear_and_save_textbox(message: str) -> tuple[str, str]:
    return '', message

The following function displays the input message in the chatbot and adds the message to the chat history.

def display_input(message: str,
                   history: list[tuple[str, str]]) -> list[tuple[str, str]]:
  history.append((message, ''))
  logger.info("display_input=%s",message)             
  return history

This function deletes the previous response from the chat history and returns the updated chat history and the previous response.

def delete_prev_fn(
    history: list[tuple[str, str]]) -> tuple[list[tuple[str, str]], str]:
  try:
    message, _ = history.pop()
  except IndexError:
    message = ''
  return history, message or ''

The following function generates text using the pre-trained text generation model and the given parameters. It returns an iterator that yields a list of tuples, where each tuple contains the input message and the generated response.

def generate(
    message: str,
    history_with_input: list[tuple[str, str]],
    system_prompt: str,
    max_new_tokens: int,
    temperature: float,
    top_p: float,
    top_k: int,
) -> Iterator[list[tuple[str, str]]]:
  #logger.info("message=%s",message)
  if max_new_tokens > MAX_MAX_NEW_TOKENS:
    raise ValueError

  history = history_with_input[:-1]
  generator = run(message, history, system_prompt, max_new_tokens, temperature, top_p, top_k)
  try:
    first_response = next(generator)
    yield history + [(message, first_response)]
  except StopIteration:
    yield history + [(message, '')]
  for response in generator:
    yield history + [(message, response)]

The following function runs the generator to completion for an example message and returns an empty string (to clear the textbox) along with the final chat history.

def process_example(message: str) -> tuple[str, list[tuple[str, str]]]:
  generator = generate(message, [], DEFAULT_SYSTEM_PROMPT, 1024, 1, 0.95, 50)
  for x in generator:
    pass
  return '', x

Here’s the complete Python code:

from typing import Iterator
import gradio as gr

from transformers.utils import logging
from model import get_input_token_length, run

logging.set_verbosity_info()
logger = logging.get_logger("transformers")

DEFAULT_SYSTEM_PROMPT = """"""
MAX_MAX_NEW_TOKENS = 2048
DEFAULT_MAX_NEW_TOKENS = 1024
MAX_INPUT_TOKEN_LENGTH = 4000

DESCRIPTION = """"""

LICENSE = """"""

logger.info("Starting")
def clear_and_save_textbox(message: str) -> tuple[str, str]:
    return '', message


def display_input(message: str,
                  history: list[tuple[str, str]]) -> list[tuple[str, str]]:
    history.append((message, ''))
    logger.info("display_input=%s",message)             
    return history


def delete_prev_fn(
        history: list[tuple[str, str]]) -> tuple[list[tuple[str, str]], str]:
    try:
        message, _ = history.pop()
    except IndexError:
        message = ''
    return history, message or ''


def generate(
    message: str,
    history_with_input: list[tuple[str, str]],
    system_prompt: str,
    max_new_tokens: int,
    temperature: float,
    top_p: float,
    top_k: int,
) -> Iterator[list[tuple[str, str]]]:
    #logger.info("message=%s",message)
    if max_new_tokens > MAX_MAX_NEW_TOKENS:
        raise ValueError

    history = history_with_input[:-1]
    generator = run(message, history, system_prompt, max_new_tokens, temperature, top_p, top_k)
    try:
        first_response = next(generator)
        yield history + [(message, first_response)]
    except StopIteration:
        yield history + [(message, '')]
    for response in generator:
        yield history + [(message, response)]


def process_example(message: str) -> tuple[str, list[tuple[str, str]]]:
    generator = generate(message, [], DEFAULT_SYSTEM_PROMPT, 1024, 1, 0.95, 50)
    for x in generator:
        pass
    return '', x


def check_input_token_length(message: str, chat_history: list[tuple[str, str]], system_prompt: str) -> None:
    #logger.info("check_input_token_length=%s",message)
    input_token_length = get_input_token_length(message, chat_history, system_prompt)
    #logger.info("input_token_length",input_token_length)
    #logger.info("MAX_INPUT_TOKEN_LENGTH",MAX_INPUT_TOKEN_LENGTH)
    if input_token_length > MAX_INPUT_TOKEN_LENGTH:
        logger.info("Inside IF condition")
        raise gr.Error(f'The accumulated input is too long ({input_token_length} > {MAX_INPUT_TOKEN_LENGTH}). Clear your chat history and try again.')
    #logger.info("End of check_input_token_length function")


with gr.Blocks(css='style.css') as demo:
    gr.Markdown(DESCRIPTION)
    gr.DuplicateButton(value='Duplicate Space for private use',
                       elem_id='duplicate-button')

    with gr.Group():
        chatbot = gr.Chatbot(label='Chatbot')
        with gr.Row():
            textbox = gr.Textbox(
                container=False,
                show_label=False,
                placeholder='Type a message...',
                scale=10,
            )
            submit_button = gr.Button('Submit',
                                      variant='primary',
                                      scale=1,
                                      min_width=0)
    with gr.Row():
        retry_button = gr.Button('Retry', variant='secondary')
        undo_button = gr.Button('Undo', variant='secondary')
        clear_button = gr.Button('Clear', variant='secondary')

    saved_input = gr.State()

    with gr.Accordion(label='Advanced options', open=False):
        system_prompt = gr.Textbox(label='System prompt',
                                   value=DEFAULT_SYSTEM_PROMPT,
                                   lines=6)
        max_new_tokens = gr.Slider(
            label='Max new tokens',
            minimum=1,
            maximum=MAX_MAX_NEW_TOKENS,
            step=1,
            value=DEFAULT_MAX_NEW_TOKENS,
        )
        temperature = gr.Slider(
            label='Temperature',
            minimum=0.1,
            maximum=4.0,
            step=0.1,
            value=1.0,
        )
        top_p = gr.Slider(
            label='Top-p (nucleus sampling)',
            minimum=0.05,
            maximum=1.0,
            step=0.05,
            value=0.95,
        )
        top_k = gr.Slider(
            label='Top-k',
            minimum=1,
            maximum=1000,
            step=1,
            value=50,
        )

    gr.Markdown(LICENSE)

    textbox.submit(
        fn=clear_and_save_textbox,
        inputs=textbox,
        outputs=[textbox, saved_input],
        api_name=False,
        queue=False,
    ).then(
        fn=display_input,
        inputs=[saved_input, chatbot],
        outputs=chatbot,
        api_name=False,
        queue=False,
    ).then(
        fn=check_input_token_length,
        inputs=[saved_input, chatbot, system_prompt],
        api_name=False,
        queue=False,
    ).success(
        fn=generate,
        inputs=[
            saved_input,
            chatbot,
            system_prompt,
            max_new_tokens,
            temperature,
            top_p,
            top_k,
        ],
        outputs=chatbot,
        api_name=False,
    )

    button_event_preprocess = submit_button.click(
        fn=clear_and_save_textbox,
        inputs=textbox,
        outputs=[textbox, saved_input],
        api_name=False,
        queue=False,
    ).then(
        fn=display_input,
        inputs=[saved_input, chatbot],
        outputs=chatbot,
        api_name=False,
        queue=False,
    ).then(
        fn=check_input_token_length,
        inputs=[saved_input, chatbot, system_prompt],
        api_name=False,
        queue=False,
    ).success(
        fn=generate,
        inputs=[
            saved_input,
            chatbot,
            system_prompt,
            max_new_tokens,
            temperature,
            top_p,
            top_k,
        ],
        outputs=chatbot,
        api_name=False,
    )

    retry_button.click(
        fn=delete_prev_fn,
        inputs=chatbot,
        outputs=[chatbot, saved_input],
        api_name=False,
        queue=False,
    ).then(
        fn=display_input,
        inputs=[saved_input, chatbot],
        outputs=chatbot,
        api_name=False,
        queue=False,
    ).then(
        fn=generate,
        inputs=[
            saved_input,
            chatbot,
            system_prompt,
            max_new_tokens,
            temperature,
            top_p,
            top_k,
        ],
        outputs=chatbot,
        api_name=False,
    )

    undo_button.click(
        fn=delete_prev_fn,
        inputs=chatbot,
        outputs=[chatbot, saved_input],
        api_name=False,
        queue=False,
    ).then(
        fn=lambda x: x,
        inputs=[saved_input],
        outputs=textbox,
        api_name=False,
        queue=False,
    )

    clear_button.click(
        fn=lambda: ([], ''),
        outputs=[chatbot, saved_input],
        queue=False,
        api_name=False,
    )

demo.queue(max_size=20).launch(share=False, server_name="0.0.0.0")

The check_input_token_length and generate functions comprise the main part of the code. The generate function is responsible for generating a response given a message, a history of previous messages, and various generation parameters (a short sampling sketch after this list makes them concrete), including:

  • max_new_tokens: An integer that caps how many tokens the model is allowed to produce for the response.
  • temperature: A float that controls how random the generated output is. Higher values (like 1.0) make the output more random, while lower values (like 0.2) make it more predictable.
  • top_p: A float between 0 and 1 that controls nucleus sampling. It sets a cutoff on the cumulative probability of candidate tokens: lower values keep only the most likely tokens, while higher values allow more variety.
  • top_k: An integer giving the number of highest-probability tokens considered at each step. A smaller value produces more focused output, while a larger value allows more variety.
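
To make these parameters concrete, here is a small, self-contained Python sketch of how temperature, top-k, and top-p interact on a toy next-token distribution. The token names and logit values are made up for illustration; this is not the model's actual sampling code.

import math
import random

# Toy next-token distribution (token -> logit); values are purely illustrative.
logits = {"docker": 2.0, "container": 1.5, "image": 1.0, "banana": -1.0}

def sample(logits, temperature=1.0, top_k=3, top_p=0.95):
    # Temperature scales the logits before softmax: lower -> more deterministic.
    probs = {t: math.exp(l / temperature) for t, l in logits.items()}
    total = sum(probs.values())
    probs = {t: p / total for t, p in probs.items()}

    # top_k keeps only the k most likely tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # top_p (nucleus sampling) keeps the smallest prefix of ranked tokens
    # whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize over the surviving candidates and sample one token.
    tokens, weights = zip(*kept)
    return random.choices(tokens, weights=weights, k=1)[0]

print(sample(logits, temperature=0.7, top_k=3, top_p=0.9))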

app.py handles the UI components and runs the API server. It is where you initialize the application and its configuration.

File: model.py

This Python script implements the chatbot: it uses an LLM to generate responses to user input. The script takes the following steps to generate a response:

  • It creates a prompt for the LLM by combining the user input, the chat history, and the system prompt.
  • It calculates the input token length of the prompt.
  • It generates a response using the LLM and the following parameters:
    • max_new_tokens: Maximum number of new tokens to generate.
    • temperature: Temperature to use when generating the response. A higher temperature produces more creative and varied responses, but they may also be less coherent.
    • top_p: Controls the nucleus sampling used to generate the response. A lower top_p value keeps only the most likely tokens and yields more focused responses, while a higher value allows more creative and varied responses.
    • top_k: The number of highest-probability tokens considered when generating the response. A lower top_k value yields more predictable and consistent responses, while a higher value allows more creative and varied responses.

The main function of the TextIteratorStreamer class is to store print-ready text in a queue. This queue can then be used by a downstream application as an iterator to access the generated text in a non-blocking way.
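
To see this pattern in isolation, here is a minimal pure-Python sketch of the same idea: a producer thread pushes text chunks into a queue while the consumer iterates over them without waiting for the full result. It stands in for TextIteratorStreamer only conceptually and is not part of the application code.

from queue import Queue
from threading import Thread

def produce(q: Queue) -> None:
    # Simulates a model emitting text chunks one at a time.
    for chunk in ["Hello", ", ", "world", "!"]:
        q.put(chunk)
    q.put(None)  # Sentinel: generation is finished.

def stream(q: Queue):
    # Yields chunks as soon as they arrive, similar to iterating a streamer.
    while (chunk := q.get()) is not None:
        yield chunk

q = Queue()
Thread(target=produce, args=(q,)).start()
for text in stream(q):
    print(text, end="", flush=True)
print()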

from threading import Thread
from typing import Iterator

#import torch
from transformers.utils import logging
from ctransformers import AutoModelForCausalLM 
from transformers import TextIteratorStreamer, AutoTokenizer

logging.set_verbosity_info()
logger = logging.get_logger("transformers")

config = {"max_new_tokens": 256, "repetition_penalty": 1.1, 
          "temperature": 0.1, "stream": True}
model_id = "TheBloke/Llama-2-7B-Chat-GGML"
device = "cpu"


model = AutoModelForCausalLM.from_pretrained(model_id, model_type="llama", lib="avx2", hf=True)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

def get_prompt(message: str, chat_history: list[tuple[str, str]],
               system_prompt: str) -> str:
    #logger.info("get_prompt chat_history=%s",chat_history)
    #logger.info("get_prompt system_prompt=%s",system_prompt)
    texts = [f'<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n']
    #logger.info("texts=%s",texts)
    do_strip = False
    for user_input, response in chat_history:
        user_input = user_input.strip() if do_strip else user_input
        do_strip = True
        texts.append(f'{user_input} [/INST] {response.strip()} </s><s>[INST] ')
    message = message.strip() if do_strip else message
    #logger.info("get_prompt message=%s",message)
    texts.append(f'{message} [/INST]')
    #logger.info("get_prompt final texts=%s",texts)
    return ''.join(texts)

def get_input_token_length(message: str, chat_history: list[tuple[str, str]], system_prompt: str) -> int:
    #logger.info("get_input_token_length=%s",message)
    prompt = get_prompt(message, chat_history, system_prompt)
    #logger.info("prompt=%s",prompt)
    input_ids = tokenizer([prompt], return_tensors='np', add_special_tokens=False)['input_ids']
    #logger.info("input_ids=%s",input_ids)
    return input_ids.shape[-1]

def run(message: str,
        chat_history: list[tuple[str, str]],
        system_prompt: str,
        max_new_tokens: int = 1024,
        temperature: float = 0.8,
        top_p: float = 0.95,
        top_k: int = 50) -> Iterator[str]:
    prompt = get_prompt(message, chat_history, system_prompt)
    inputs = tokenizer([prompt], return_tensors='pt', add_special_tokens=False).to(device)

    streamer = TextIteratorStreamer(tokenizer,
                                    timeout=15.,
                                    skip_prompt=True,
                                    skip_special_tokens=True)
    generate_kwargs = dict(
        inputs,
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=top_p,
        top_k=top_k,
        temperature=temperature,
        num_beams=1,
    )
    t = Thread(target=model.generate, kwargs=generate_kwargs)
    t.start()

    outputs = []
    for text in streamer:
        outputs.append(text)
        yield "".join(outputs)
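
To make the prompt template concrete, here is a small usage sketch of get_prompt() with made-up values. Note that importing model also loads the GGML model and tokenizer at import time, so this assumes those weights can be downloaded.

from model import get_prompt  # Importing model also loads the LLM and tokenizer.

prompt = get_prompt(
    message="What is Docker?",
    chat_history=[("Hi", "Hello! How can I help?")],
    system_prompt="You are a helpful assistant.",
)
print(prompt)
# Produces a single Llama 2 chat-formatted string:
# <s>[INST] <<SYS>>
# You are a helpful assistant.
# <</SYS>>
#
# Hi [/INST] Hello! How can I help? </s><s>[INST] What is Docker? [/INST]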

To import the necessary modules and libraries for text generation with transformers, we can use the following code:

from transformers import AutoTokenizer, AutoModelForCausalLM

This will import the necessary modules for tokenizing and generating text with transformers.

To define the model to import, we can use:

model_id = "TheBloke/Llama-2-7B-Chat-GGML"

This step defines the model ID as TheBloke/Llama-2-7B-Chat-GGML, a GGML-quantized build of Meta's Llama 2 7B Chat model.

Once you have imported the necessary modules and libraries and defined the model to import, you can load the tokenizer and model using the following code:

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

This will load the tokenizer and model from the Hugging Face Hub. The tokenizer's job is to prepare the model's inputs, and a matching tokenizer is available in the library for each model. In model.py above, the GGML weights are loaded through ctransformers, while the tokenizer comes from the base meta-llama/Llama-2-7b-chat-hf checkpoint.

You need to set the variables and values in config for max_new_tokens, temperature, repetition_penalty, and stream (a hedged sketch showing one way to apply these settings follows the list):

  • max_new_tokens: The maximum number of new tokens to generate, not counting the tokens in the prompt.
  • temperature: Scales the probabilities of the next tokens; lower values make the output more deterministic.
  • repetition_penalty: Penalizes repeated tokens; 1.0 means no penalty.
  • stream: Whether to generate the response text in a streaming manner or in a single batch.
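
As a hedged sketch of one way to apply these settings, the config dictionary can be passed as keyword arguments when loading the GGML model with ctransformers (without the hf=True wrapper used in model.py). This assumes ctransformers accepts these generation settings as keyword arguments; treat it as illustrative rather than part of the application code.

from ctransformers import AutoModelForCausalLM

config = {"max_new_tokens": 256, "repetition_penalty": 1.1,
          "temperature": 0.1, "stream": True}

# Load the GGML model directly and apply the generation settings as defaults.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GGML", model_type="llama", **config)

# With stream=True the call yields text chunks as they are generated.
for chunk in llm("What is Docker?", stream=True):
    print(chunk, end="", flush=True)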

You can also create a Hugging Face Space and commit the files to it to host the application and test it directly.

Building the image

The following command builds a Docker image for the Llama 2 7B Chat model on the linux/amd64 platform. The image will be tagged with the name local-llm:v1.

docker buildx build --platform=linux/amd64 -t local-llm:v1 .

Running the container

The following command will start a new container from the local-llm:v1 Docker image and expose port 7860 on the host machine. The -e HUGGING_FACE_HUB_TOKEN="YOUR_VALUE_HERE" environment variable sets the Hugging Face Hub token, which is required to download the Llama 2 chat model files from the Hugging Face Hub.

docker run -it -p 7860:7860 --platform=linux/amd64 -e HUGGING_FACE_HUB_TOKEN="YOUR_VALUE_HERE" local-llm:v1 python app.py

Next, open the browser and go to http://localhost:7860 to see local LLM Docker container output (Figure 3).

Screenshot of LLM Docker container output, showing Chatbot saying: "Hello! I'm just an AI, so I don't have feelings or emotions like a human does. However, I'm here to help..."
Figure 3. Local LLM Docker container output.

You can also view containers via the Docker Desktop (Figure 4).

Screenshot showing view of containers in the Docker Desktop.
Figure 4. Monitoring containers with Docker Desktop.

Conclusion

Deploying the LLM GGML model locally with Docker is a convenient and effective way to use natural language processing. Dockerizing the model makes it easy to move it between different environments and ensures that it will run consistently. Testing the model in a browser provides a user-friendly interface and allows you to quickly evaluate its performance.

This setup gives you more control over your infrastructure and data and makes it easier to deploy advanced language models for a variety of applications. It is a significant step forward in the deployment of large language models.

Learn more

Accelerating Machine Learning with TensorFlow.js: Using Pretrained Models and Docker

By Harsh Manvar
August 17, 2023 at 13:26

In the rapidly evolving era of machine learning (ML) and artificial intelligence (AI), TensorFlow has emerged as a leading framework for developing and implementing sophisticated models. With the introduction of TensorFlow.js, TensorFlow’s capability is boosted for JavaScript developers. 

TensorFlow.js is a JavaScript machine learning toolkit that facilitates the creation of ML models and their immediate use in browsers or Node.js apps. TensorFlow.js has expanded the capability of TensorFlow into the realm of actual web development. A remarkable aspect of TensorFlow.js is its ability to utilize pretrained models, which opens up a wide range of possibilities for developers. 

In this article, we will explore the concept of pretrained models in TensorFlow.js and Docker and delve into the potential applications and benefits. 


Understanding pretrained models

Pretrained models are a powerful tool for developers because they allow you to use ML models without having to train them yourself. This approach can save a lot of time and effort, and it can also be more accurate than training your own model from scratch.

A pretrained model is an ML model that has been professionally trained on a large volume of data. Because these models have been trained on complex patterns and representations, they are incredibly effective and precise in carrying out specific tasks. Developers may save a lot of time and computing resources by using pretrained models because they can avoid having to train a model from scratch.

Types of pretrained models available

There is a wide range of potential applications for pretrained models in TensorFlow.js. 

For example, developers could use them to:

  • Build image classification models that can identify objects in images.
  • Build natural language processing (NLP) models that can understand and respond to text.
  • Build speech recognition models that can transcribe audio into text.
  • Build recommendation systems that can suggest products or services to users.

TensorFlow.js and pretrained models

Developers can easily include pretrained models in their web applications using TensorFlow.js. With TensorFlow.js, you can benefit from robust machine learning algorithms without needing to be an expert in model deployment or training. The library offers a wide variety of pretrained models, including those for audio analysis, picture identification, natural language processing, and more (Figure 1).

Illustration of different types of pretrained models, including vision, text, sound, body, and other.
Figure 1: Available pretrained model types for TensorFlow.

How does it work?

The module allows for the direct loading of models in TensorFlow SavedModel or Keras Model formats. Once the model has been loaded, developers can use its features by invoking certain methods made available by the model API. Figure 2 shows the steps involved for training, distribution, and deployment.

Flow chart illustration showing flow of data within the training, distribution, and deployment steps for a pretrained model.
Figure 2: TensorFlow.js model API for a pretrained image classification model.

Training

The training section shows the steps involved in training a machine learning model. The first step is to collect data. This data is then preprocessed, which means that it is cleaned and prepared for training. The data is then fed to a machine learning algorithm, which trains the model.

  • Preprocess data: This is the process of cleaning and preparing data for training. This includes tasks such as removing noise, correcting errors, and normalizing the data.
  • TF Hub: TF Hub is a repository of pretrained ML models. These models can be used to speed up the training process or to improve the accuracy of a model.
  • tf.keras: tf.keras is a high-level API for building and training machine learning models. It is built on top of TensorFlow, which is a low-level library for machine learning.
  • Estimator: An estimator is a type of model in TensorFlow. It is a simplified way to build and train ML models.

Distribution

Distribution is the process of making a machine learning model available to users. This can be done by packaging the model in a format that can be easily shared, or by deploying the model to a production environment.

The distribution section shows the steps involved in distributing a machine learning model. The first step is to package the model. This means that the model is converted into a format that can be easily shared. The model is then distributed to users, who can then use it to make predictions.

Deployment

The deployment section shows the steps involved in deploying a machine learning model. The first step is to choose a framework. A framework is a set of tools that makes it easier to build and deploy machine learning models. The model is then converted into a format that can be used by the framework. The model is then deployed to a production environment, where it can be used to make predictions.

Benefits of using pretrained models

There are several pretrained models available in TensorFlow.js that can be utilized immediately in any project and offer the following notable advantages:

  • Savings in time and resources: Building an ML model from scratch might take a lot of time and resources. Developers can skip this phase and use a model that has already learned from lengthy training by using pretrained models. The time and resources needed to implement machine learning solutions are significantly decreased as a result.
  • State-of-the-art performance: Pretrained models are typically trained on huge datasets and refined by specialists, producing models that give state-of-the-art performance across a range of applications. Developers can benefit from these models’ high accuracy and reliability by incorporating them into TensorFlow.js, even if they lack a deep understanding of machine learning.
  • Accessibility: TensorFlow.js makes pretrained models powerful for web developers, allowing them to quickly and easily integrate cutting-edge machine learning capabilities into their projects. This accessibility creates new opportunities for developing cutting-edge web-based solutions that make use of machine learning’s capabilities.
  • Transfer learning: Pretrained models can also serve as a foundation for further training. Using a smaller, domain-specific dataset, developers can fine-tune a pretrained model. Transfer learning enables models to adapt quickly to particular use cases, making this method especially helpful when data is scarce.

Why is containerizing TensorFlow.js important?

Containerizing TensorFlow.js brings several important benefits to the development and deployment process of machine learning applications. Here are five key reasons why containerizing TensorFlow.js is important:

  1. Docker provides a consistent and portable environment for running applications. By containerizing TensorFlow.js, you can package the application, its dependencies, and runtime environment into a self-contained unit. This approach allows you to deploy the containerized TensorFlow.js application across different environments, such as development machines, staging servers, and production clusters, with minimal configuration or compatibility issues.
  2. Docker simplifies the management of dependencies for TensorFlow.js. By encapsulating all the required libraries, packages, and configurations within the container, you can avoid conflicts with other system dependencies and ensure that the application has access to the specific versions of libraries it needs. This containerization eliminates the need for manual installation and configuration of dependencies on different systems, making the deployment process more streamlined and reliable.
  3. Docker ensures the reproducibility of your TensorFlow.js application’s environment. By defining the exact dependencies, libraries, and configurations within the container, you can guarantee that the application will run consistently across different deployments.
  4. Docker enables seamless scalability of the TensorFlow.js application. With containers, you can easily replicate and distribute instances of the application across multiple nodes or servers, allowing you to handle high volumes of user requests.
  5. Docker provides isolation between the application and the host system and between different containers running on the same host. This isolation ensures that the application’s dependencies and runtime environment do not interfere with the host system or other applications. It also allows for easy management of dependencies and versioning, preventing conflicts and ensuring a clean and isolated environment in which the application can operate.

Building a fully functional ML face-detection demo app

By combining the power of TensorFlow.js and Docker, developers can create a fully functional machine learning (ML) face-detection demo app. Once the app is deployed, the TensorFlow.js model can recognize faces in real-time by leveraging the camera. However, with a minor code change, it’s possible for developers to build an app that allows users to upload images or videos to be detected.

In this tutorial, you’ll learn how to build a fully functional face-detection demo app using TensorFlow.js and Docker. Figure 3 shows the file system architecture for this setup. Let’s get started.

Prerequisite

The following key components are essential to complete this walkthrough:

 Illustration of file system architecture for Docker Compose setup, including local folder, preinstalled TensorFlow dependencies, app folder, and terminal.
Figure 3: File system architecture for Docker Compose development setup.

Deploying an ML face-detection app is a simple process involving the following steps:

  • Clone the repository. 
  • Set up the required configuration files. 
  • Initialize TensorFlow.js.
  • Train and run the model. 
  • Bring up the face-detection app. 

We’ll explain each of these steps below.

Quick demo

If you’re in a hurry, you can bring up the complete app by running the following command:

docker run -p 1234:1234 harshmanvar/face-detection-tensorjs:slim-v1

Open URL in browser: http://localhost:1234.

Screenshot of the face-detection demo app running in the browser.
Figure 4: URL opened in a browser.

Getting started

Cloning the project

To get started, you can clone the repository by running the following command:

git clone https://github.com/dockersamples/face-detection-tensorjs

We are utilizing the MediaPipe Face Detector demo for this demonstration. You first create a detector by choosing one of the models from SupportedModels, including MediaPipeFaceDetector.

For example:

const model = faceDetection.SupportedModels.MediaPipeFaceDetector;
const detectorConfig = {
  runtime: 'mediapipe', // or 'tfjs'
}
const detector = await faceDetection.createDetector(model, detectorConfig);

Then you can use the detector to detect faces:

const faces = await detector.estimateFaces(image);

File: index.html:

<!DOCTYPE html>
<html>

<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1.0, user-scalable=no">
  <style>
    body {
      margin: 0;
    }

    #stats {
      position: relative;
      width: 100%;
      height: 80px;
    }

    #main {
      position: relative;
      margin: 0;
    }

    #canvas-wrapper {
      position: relative;
    }
  </style>
</head>

<body>
  <div id="stats"></div>
  <div id="main">
    <div class="container">
      <div class="canvas-wrapper">
        <canvas id="output"></canvas>
        <video id="video" playsinline style="
          -webkit-transform: scaleX(-1);
          transform: scaleX(-1);
          visibility: hidden;
          width: auto;
          height: auto;
          ">
        </video>
      </div>
    </div>
  </div>
  </div>
</body>
<script src="https://cdnjs.cloudflare.com/ajax/libs/dat-gui/0.7.6/dat.gui.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/stats.js/r16/Stats.min.js"></script>
<script src="src/index.js"></script>

</html>

The web application’s principal entry point is the index.html file. It includes the video element needed to display the real-time video stream from the user’s webcam and the basic HTML page structure. The relevant JavaScript scripts for the facial detection capabilities are also imported.

File: src/index.js:

import '@tensorflow/tfjs-backend-webgl';
import '@tensorflow/tfjs-backend-webgpu';

import * as tfjsWasm from '@tensorflow/tfjs-backend-wasm';

tfjsWasm.setWasmPaths(
    `https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-backend-wasm@${
        tfjsWasm.version_wasm}/dist/`);

import * as faceDetection from '@tensorflow-models/face-detection';

import {Camera} from './camera';
import {setupDatGui} from './option_panel';
import {STATE, createDetector} from './shared/params';
import {setupStats} from './shared/stats_panel';
import {setBackendAndEnvFlags} from './shared/util';

let detector, camera, stats;
let startInferenceTime, numInferences = 0;
let inferenceTimeSum = 0, lastPanelUpdate = 0;
let rafId;

async function checkGuiUpdate() {
  if (STATE.isTargetFPSChanged || STATE.isSizeOptionChanged) {
    camera = await Camera.setupCamera(STATE.camera);
    STATE.isTargetFPSChanged = false;
    STATE.isSizeOptionChanged = false;
  }

  if (STATE.isModelChanged || STATE.isFlagChanged || STATE.isBackendChanged) {
    STATE.isModelChanged = true;

    window.cancelAnimationFrame(rafId);

    if (detector != null) {
      detector.dispose();
    }

    if (STATE.isFlagChanged || STATE.isBackendChanged) {
      await setBackendAndEnvFlags(STATE.flags, STATE.backend);
    }

    try {
      detector = await createDetector(STATE.model);
    } catch (error) {
      detector = null;
      alert(error);
    }

    STATE.isFlagChanged = false;
    STATE.isBackendChanged = false;
    STATE.isModelChanged = false;
  }
}

function beginEstimateFaceStats() {
  startInferenceTime = (performance || Date).now();
}

function endEstimateFaceStats() {
  const endInferenceTime = (performance || Date).now();
  inferenceTimeSum += endInferenceTime - startInferenceTime;
  ++numInferences;

  const panelUpdateMilliseconds = 1000;
  if (endInferenceTime - lastPanelUpdate >= panelUpdateMilliseconds) {
    const averageInferenceTime = inferenceTimeSum / numInferences;
    inferenceTimeSum = 0;
    numInferences = 0;
    stats.customFpsPanel.update(
        1000.0 / averageInferenceTime, 120);
    lastPanelUpdate = endInferenceTime;
  }
}

async function renderResult() {
  if (camera.video.readyState < 2) {
    await new Promise((resolve) => {
      camera.video.onloadeddata = () => {
        resolve(video);
      };
    });
  }

  let faces = null;

  if (detector != null) {
  
    beginEstimateFaceStats();

    try {
      faces =
          await detector.estimateFaces(camera.video, {flipHorizontal: false});
    } catch (error) {
      detector.dispose();
      detector = null;
      alert(error);
    }

    endEstimateFaceStats();
  }

  camera.drawCtx();
  if (faces && faces.length > 0 && !STATE.isModelChanged) {
    camera.drawResults(
        faces, STATE.modelConfig.boundingBox, STATE.modelConfig.keypoints);
  }
}

async function renderPrediction() {
  await checkGuiUpdate();

  if (!STATE.isModelChanged) {
    await renderResult();
  }

  rafId = requestAnimationFrame(renderPrediction);
};

async function app() {
  const urlParams = new URLSearchParams(window.location.search);

  await setupDatGui(urlParams);

  stats = setupStats();

  camera = await Camera.setupCamera(STATE.camera);

  await setBackendAndEnvFlags(STATE.flags, STATE.backend);

  detector = await createDetector();

  renderPrediction();
};

app();

index.js is the JavaScript file that conducts the facial detection logic. It loads TensorFlow.js and runs real-time face detection on the video stream using the pretrained face-detection model. The file manages access to the camera, processes the video frames, and draws bounding boxes around the faces recognized in the video feed.

File: src/camera.js:

import {VIDEO_SIZE} from './shared/params';
import {drawResults, isMobile} from './shared/util';

export class Camera {
  constructor() {
    this.video = document.getElementById('video');
    this.canvas = document.getElementById('output');
    this.ctx = this.canvas.getContext('2d');
  }
   
  static async setupCamera(cameraParam) {
    if (!navigator.mediaDevices || !navigator.mediaDevices.getUserMedia) {
      throw new Error(
          'Browser API navigator.mediaDevices.getUserMedia not available');
    }

    const {targetFPS, sizeOption} = cameraParam;
    const $size = VIDEO_SIZE[sizeOption];
    const videoConfig = {
      'audio': false,
      'video': {
        facingMode: 'user',
        width: isMobile() ? VIDEO_SIZE['360 X 270'].width : $size.width,
        height: isMobile() ? VIDEO_SIZE['360 X 270'].height : $size.height,
        frameRate: {
          ideal: targetFPS,
        },
      },
    };

    const stream = await navigator.mediaDevices.getUserMedia(videoConfig);

    const camera = new Camera();
    camera.video.srcObject = stream;

    await new Promise((resolve) => {
      camera.video.onloadedmetadata = () => {
        resolve(video);
      };
    });

    camera.video.play();

    const videoWidth = camera.video.videoWidth;
    const videoHeight = camera.video.videoHeight;
    // Must set below two lines, otherwise video element doesn't show.
    camera.video.width = videoWidth;
    camera.video.height = videoHeight;

    camera.canvas.width = videoWidth;
    camera.canvas.height = videoHeight;
    const canvasContainer = document.querySelector('.canvas-wrapper');
    canvasContainer.style = `width: ${videoWidth}px; height: ${videoHeight}px`;
    camera.ctx.translate(camera.video.videoWidth, 0);
    camera.ctx.scale(-1, 1);

    return camera;
  }

  drawCtx() {
    this.ctx.drawImage(
        this.video, 0, 0, this.video.videoWidth, this.video.videoHeight);
  }

  drawResults(faces, boundingBox, keypoints) {
    drawResults(this.ctx, faces, boundingBox, keypoints);
  }
}

The configuration for the camera’s width, audio, and other setup-related items is managed in camera.js.

File: .babelrc:

The .babelrc file is used to configure Babel, a JavaScript compiler, specifying presets and plugins that define the transformations to be applied during code transpilation.

File: src/shared:

shared % tree
.
├── option_panel.js
├── params.js
├── stats_panel.js
└── util.js

1 directory, 4 files

The files in the src/shared folder provide the parameters, helper checks, and utility functions needed to set up the camera and run the demo.

Defining services using a Compose file

Here’s how our services appear within a Docker Compose file:

services:
  tensorjs:
    #build: .
    image: harshmanvar/face-detection-tensorjs:v2
    ports:
      - 1234:1234
    volumes:
      - ./:/app
      - /app/node_modules
    command: watch

Your sample application has the following parts:

  • The tensorjs service is based on the harshmanvar/face-detection-tensorjs:v2 image.
  • This image contains the necessary dependencies and code to run a face detection system using TensorFlow.js.
  • It exposes port 1234, which serves the TensorFlow.js web app to the browser.
  • The volume ./:/app sets up a volume mount, linking the current directory (represented by ./) on the host machine to the /app directory within the container. This allows you to share files and code between your host machine and the container.
  • The command: watch entry specifies the command to run within the container. It runs the project’s watch script, so the app is rebuilt whenever the mounted source files change.

Building the image

It’s time to build the development image and install the dependencies to launch the face-detection model.

docker build -t tensor-development:v1 .

Running the container

docker run -p 1234:1234 -v $(pwd):/app -v /app/node_modules tensor-development:v1 watch

Bringing up the container services

You can launch the application by running the following command:

docker compose up -d

Then, use the docker compose ps command to confirm that your stack is running correctly. Your terminal will produce the following output:

docker compose ps
NAME            IMAGE            COMMAND      SERVICE             STATUS                       PORTS
tensorflow    tensorjs:v2     "yarn watch"      tensorjs          Up 48 seconds       0.0.0.0:1234->1234/tcp

Viewing the containers via Docker Dashboard

You can also leverage the Docker Dashboard to view your container’s ID and easily access or manage your application container (Figure 5).

Screenshot of Docker Dashboard showing list of containers.
Figure 5: Viewing containers in the Docker Dashboard.

Conclusion

Well done! You have acquired the knowledge to utilize a pre-trained machine learning model with JavaScript for a web application, all thanks to TensorFlow.js. In this article, we have demonstrated how Docker Compose lets you quickly create and deploy a fully functional ML face-detection demo app, using just one YAML file.

With this newfound expertise, you can now take this guide as a foundation to build even more sophisticated applications with just a few additional steps. The possibilities are endless, and your ML journey has just begun!

Learn more

Conversational AI Made Easy: Developing an ML FAQ Model Demo from Scratch Using Rasa and Docker

By Harsh Manvar
July 6, 2023 at 13:52

In today’s fast-paced digital era, conversational AI chatbots have emerged as a game-changer in delivering efficient and personalized user interactions. These artificially intelligent virtual assistants are designed to mimic human conversations, providing users with quick and relevant responses to queries.

A crucial aspect of building successful chatbots is the ability to handle frequently asked questions (FAQs) seamlessly. FAQs form a significant portion of user queries in various domains, such as customer support, e-commerce, and information retrieval. Being able to provide accurate and prompt answers to common questions not only improves user satisfaction but also frees up human agents to focus on more complex tasks. 

In this article, we’ll look at how to use the open source Rasa framework along with Docker to build and deploy a containerized, conversational AI chatbot.

Dark purple background with Docker and Rasa logos

Meet Rasa 

To tackle the challenge of FAQ handling, developers turn to sophisticated technologies like Rasa, an open source conversational AI framework. Rasa offers a comprehensive set of tools and libraries that empower developers to create intelligent chatbots with natural language understanding (NLU) capabilities. With Rasa, you can build chatbots that understand user intents, extract relevant information, and provide contextual responses based on the conversation flow.

Rasa allows developers to build and deploy conversational AI chatbots and provides a flexible architecture and powerful NLU capabilities (Figure 1).

 Illustration of Rasa architecture, showing connections between Rasa SDK,  Dialogue Policies, NLU Pipeline, Agent, Input/output channels, etc.
Figure 1: Overview of Rasa.

Rasa is a popular choice for building conversational AI applications, including chatbots and virtual assistants, for several reasons. For example, Rasa is an open source framework, which means it is freely available for developers to use, modify, and contribute to. It provides a flexible and customizable architecture that gives developers full control over the chatbot’s behavior and capabilities. 

Rasa’s NLU capabilities allow you to extract intent and entity information from user messages, enabling the chatbot to understand and respond appropriately. Rasa supports different language models and machine learning (ML) algorithms for accurate and context-aware language understanding. 

Rasa also incorporates ML techniques to train and improve the chatbot’s performance over time. You can train the model using your own training data and refine it through iterative feedback loops, resulting in a chatbot that becomes more accurate and effective with each interaction. 

Additionally, Rasa can scale to handle large volumes of conversations and can be extended with custom actions, APIs, and external services. This capability allows you to integrate additional functionalities, such as database access, external API calls, and business logic, into your chatbot.

Why containerizing Rasa is important

Containerizing Rasa brings several important benefits to the development and deployment process of conversational AI chatbots. Here are four key reasons why containerizing Rasa is important:

1. Docker provides a consistent and portable environment for running applications.

By containerizing Rasa, you can package the chatbot application, its dependencies, and runtime environment into a self-contained unit. This approach allows you to deploy the containerized Rasa chatbot across different environments, such as development machines, staging servers, and production clusters, with minimal configuration or compatibility issues. 

Docker simplifies the management of dependencies for the Rasa chatbot. By encapsulating all the required libraries, packages, and configurations within the container, you can avoid conflicts with other system dependencies and ensure that the chatbot has access to the specific versions of libraries it needs. This containerization eliminates the need for manual installation and configuration of dependencies on different systems, making the deployment process more streamlined and reliable.

2. Docker ensures the reproducibility of your Rasa chatbot’s environment.

By defining the exact dependencies, libraries, and configurations within the container, you can guarantee that the chatbot will run consistently across different deployments. 

3. Docker enables seamless scalability of the Rasa chatbot.

With containers, you can easily replicate and distribute instances of the chatbot across multiple nodes or servers, allowing you to handle high volumes of user interactions. 

4. Docker provides isolation between the chatbot and the host system and between different containers running on the same host.

This isolation ensures that the chatbot’s dependencies and runtime environment do not interfere with the host system or other applications. It also allows for easy management of dependencies and versioning, preventing conflicts and ensuring a clean and isolated environment in which the chatbot can operate.

Building an ML FAQ model demo application

By combining the power of Rasa and Docker, developers can create an ML FAQ model demo that excels in handling frequently asked questions. The demo can be trained on a dataset of common queries and their corresponding answers, allowing the chatbot to understand and respond to similar questions with high accuracy.

In this tutorial, you’ll learn how to build an ML FAQ model demo using Rasa and Docker. You’ll set up a development environment to train the model and then deploy the model using Docker. You will also see how to integrate a WebChat UI frontend for easy user interaction. Let’s jump in.

Getting started

The following key components are essential to completing this walkthrough:

 Illustration showing connections between Local folder, app folder, preinstalled Rasa dependencies, etc.
Figure 2: Docker container with pre-installed Rasa dependencies and Volume mount point.

Deploying an ML FAQ demo app is a simple process involving the following steps:

  • Clone the repository
  • Set up the configuration files. 
  • Initialize Rasa.
  • Train and run the model. 
  • Bring up the WebChat UI app. 

We’ll explain each of these steps below.

Cloning the project

To get started, you can clone the repository by running the following command:

git clone https://github.com/dockersamples/docker-ml-faq-rasa

docker-ml-faq-rasa % tree -L 2
.
├── Dockerfile-webchat
├── README.md
├── actions
│   ├── __init__.py
│   ├── __pycache__
│   └── actions.py
├── config.yml
├── credentials.yml
├── data
│   ├── nlu.yml
│   ├── rules.yml
│   └── stories.yml
├── docker-compose.yaml
├── domain.yml
├── endpoints.yml
├── index.html
├── models
│   ├── 20230618-194810-chill-idea.tar.gz
│   └── 20230619-082740-upbeat-step.tar.gz
└── tests
    └── test_stories.yml

6 directories, 16 files

Before we move to the next step, let’s look at each of the files one by one.

File: domain.yml

This file describes the chatbot’s domain and includes crucial data such as intents, entities, actions, and responses. It outlines the structure of the conversation, including the user’s inputs and the templates for the bot’s responses.

version: "3.1"

intents:
  - greet
  - goodbye
  - affirm
  - deny
  - mood_great
  - mood_unhappy
  - bot_challenge

responses:
  utter_greet:
  - text: "Hey! How are you?"

  utter_cheer_up:
  - text: "Here is something to cheer you up:"
    image: "https://i.imgur.com/nGF1K8f.jpg"

  utter_did_that_help:
  - text: "Did that help you?"

  utter_happy:
  - text: "Great, carry on!"

  utter_goodbye:
  - text: "Bye"

  utter_iamabot:
  - text: "I am a bot, powered by Rasa."

session_config:
  session_expiration_time: 60
  carry_over_slots_to_new_session: true

As shown previously, this configuration file includes intents, which represent the different types of user inputs the bot can understand. It also includes responses, which are the bot’s predefined messages for various situations. For example, the bot can greet the user, provide a cheer-up image, ask if the previous response helped, express happiness, say goodbye, or mention that it’s a bot powered by Rasa.

The session configuration sets the expiration time for a session (in this case, 60 seconds) and specifies whether the bot should carry over slots (data) from a previous session to a new session.

File: nlu.yml

The NLU training data are defined in this file. It includes input samples and the intents and entities that go with them. The NLU model, which connects user inputs to the right actions, is trained using this data.

version: "3.1"

nlu:
- intent: greet
  examples: |
    - hey
    - hello
    - hi
    - hello there
    - good morning
    - good evening
    - moin
    - hey there
    - let's go
    - hey dude
    - goodmorning
    - goodevening
    - good afternoon

- intent: goodbye
  examples: |
    - cu
    - good by
    - cee you later
    - good night
    - bye
    - goodbye
    - have a nice day
    - see you around
    - bye bye
    - see you later

- intent: affirm
  examples: |
    - yes
    - y
    - indeed
    - of course
    - that sounds good
    - correct

- intent: deny
  examples: |
    - no
    - n
    - never
    - I don't think so
    - don't like that
    - no way
    - not really

- intent: mood_great
  examples: |
    - perfect
    - great
    - amazing
    - feeling like a king
    - wonderful
    - I am feeling very good
    - I am great
    - I am amazing
    - I am going to save the world
    - super stoked
    - extremely good
    - so so perfect
    - so good
    - so perfect

- intent: mood_unhappy
  examples: |
    - my day was horrible
    - I am sad
    - I don't feel very well
    - I am disappointed
    - super sad
    - I'm so sad
    - sad
    - very sad
    - unhappy
    - not good
    - not very good
    - extremly sad
    - so saad
    - so sad

- intent: bot_challenge
  examples: |
    - are you a bot?
    - are you a human?
    - am I talking to a bot?
    - am I talking to a human?

This configuration file defines several intents, which represent different types of user inputs that the chatbot can recognize. Each intent has a list of examples, which are example phrases or sentences that users might type or say to express that particular intent.

You can customize and expand upon this configuration by adding more intents and examples that are relevant to your chatbot’s domain and use cases.

File: stories.yml

This file is used to define the training stories, which serve as examples of user-chatbot interactions. A series of user inputs, bot answers, and the accompanying intents and entities make up each story.

version: "3.1"

stories:

- story: happy path
  steps:
  - intent: greet
  - action: utter_greet
  - intent: mood_great
  - action: utter_happy

- story: sad path 1
  steps:
  - intent: greet
  - action: utter_greet
  - intent: mood_unhappy
  - action: utter_cheer_up
  - action: utter_did_that_help
  - intent: affirm
  - action: utter_happy

- story: sad path 2
  steps:
  - intent: greet
  - action: utter_greet
  - intent: mood_unhappy
  - action: utter_cheer_up
  - action: utter_did_that_help
  - intent: deny
  - action: utter_goodbye

The stories.yml file you provided contains a few training stories for a Rasa chatbot. These stories represent different conversation paths between the user and the chatbot. Each story consists of a series of steps, where each step corresponds to an intent or an action.

Here’s a breakdown of steps for the training stories in the file:

Story: happy path

  • User greets with an intent: greet
  • Bot responds with an action: utter_greet
  • User expresses a positive mood with an intent: mood_great
  • Bot acknowledges the positive mood with an action: utter_happy

Story: sad path 1

  • User greets with an intent: greet
  • Bot responds with an action: utter_greet
  • User expresses an unhappy mood with an intent: mood_unhappy
  • Bot tries to cheer the user up with an action: utter_cheer_up
  • Bot asks if the previous response helped with an action: utter_did_that_help
  • User confirms that it helped with an intent: affirm
  • Bot acknowledges the confirmation with an action: utter_happy

Story: sad path 2

  • User greets with an intent: greet
  • Bot responds with an action: utter_greet
  • User expresses an unhappy mood with an intent: mood_unhappy
  • Bot tries to cheer the user up with an action: utter_cheer_up
  • Bot asks if the previous response helped with an action: utter_did_that_help
  • User denies that it helped with an intent: deny
  • Bot says goodbye with an action: utter_goodbye

These training stories are used to train the Rasa chatbot on different conversation paths and to teach it how to respond appropriately to user inputs based on their intents.

File: config.yml

The configuration parameters for your Rasa project are contained in this file.

# The config recipe.
# https://rasa.com/docs/rasa/model-configuration/
recipe: default.v1

# The assistant project unique identifier
# This default value must be replaced with a unique assistant name within your deployment
assistant_id: placeholder_default

# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: en

pipeline:
# # No configuration for the NLU pipeline was provided. The following default pipeline was used to train your model.
# # If you'd like to customize it, uncomment and adjust the pipeline.
# # See https://rasa.com/docs/rasa/tuning-your-model for more information.
#   - name: WhitespaceTokenizer
#   - name: RegexFeaturizer
#   - name: LexicalSyntacticFeaturizer
#   - name: CountVectorsFeaturizer
#   - name: CountVectorsFeaturizer
#     analyzer: char_wb
#     min_ngram: 1
#     max_ngram: 4
#   - name: DIETClassifier
#     epochs: 100
#     constrain_similarities: true
#   - name: EntitySynonymMapper
#   - name: ResponseSelector
#     epochs: 100
#     constrain_similarities: true
#   - name: FallbackClassifier
#     threshold: 0.3
#     ambiguity_threshold: 0.1

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
# # No configuration for policies was provided. The following default policies were used to train your model.
# # If you'd like to customize them, uncomment and adjust the policies.
# # See https://rasa.com/docs/rasa/policies for more information.
#   - name: MemoizationPolicy
#   - name: RulePolicy
#   - name: UnexpecTEDIntentPolicy
#     max_history: 5
#     epochs: 100
#   - name: TEDPolicy
#     max_history: 5
#     epochs: 100
#     constrain_similarities: true

The configuration parameters for your Rasa project are contained in this file. Here is a breakdown of the configuration file:

1. Assistant ID:

  • Assistant_id: placeholder_default
  • This placeholder value should be replaced with a unique identifier for your assistant.

2. Rasa NLU configuration:

  • Language: en
    • Specifies the language used for natural language understanding.
  • Pipeline:
    • Defines the pipeline of components used for NLU processing.
    • The pipeline is currently commented out, and the default pipeline is used.
    • The default pipeline includes various components like tokenizers, featurizers, classifiers, and response selectors.
    • If you want to customize the pipeline, you can uncomment the lines and adjust the pipeline configuration.

3. Rasa core configuration:

  • Policies:
    • Specifies the policies used for dialogue management.
    • The policies are currently commented out, and the default policies are used.
    • The default policies include memoization, rule-based, and TED (Transformer Embedding Dialogue) policies.
    • If you want to customize the policies, you can uncomment the lines and adjust the policy configuration.

File: actions.py

The custom actions that your chatbot can execute are contained in this file. Retrieving data from an API, communicating with a database, or doing any other unique business logic are all examples of actions.

# This files contains your custom actions which can be used to run
# custom Python code.
#
# See this guide on how to implement these action:
# https://rasa.com/docs/rasa/custom-actions


# This is a simple example for a custom action which utters "Hello World!"

# from typing import Any, Text, Dict, List
#
# from rasa_sdk import Action, Tracker
# from rasa_sdk.executor import CollectingDispatcher
#
#
# class ActionHelloWorld(Action):
#
#     def name(self) -> Text:
#         return "action_hello_world"
#
#     def run(self, dispatcher: CollectingDispatcher,
#             tracker: Tracker,
#             domain: Dict[Text, Any]) -> List[Dict[Text, Any]]:
#
#         dispatcher.utter_message(text="Hello World!")
#
#         return []

Explanation of the code:

  • The ActionHelloWorld class extends the Action class provided by the rasa_sdk.
  • The name method defines the name of the custom action, which in this case is “action_hello_world”.
  • The run method is where the logic for the custom action is implemented.
  • Within the run method, the dispatcher object is used to send a message back to the user. In this example, the message sent is “Hello World!”.
  • The return [] statement indicates that the custom action has completed its execution.
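
As an illustration, here is a hedged sketch of what an enabled custom action could look like once the template is uncommented and adapted. The action name action_faq_docker and its reply text are hypothetical; the Action, Tracker, and CollectingDispatcher API mirrors the commented template above, and the action would still need to be registered in domain.yml.

from typing import Any, Dict, List, Text

from rasa_sdk import Action, Tracker
from rasa_sdk.executor import CollectingDispatcher


class ActionFaqDocker(Action):
    """Hypothetical custom action that answers a simple FAQ."""

    def name(self) -> Text:
        # Must match the action name listed in domain.yml.
        return "action_faq_docker"

    def run(self, dispatcher: CollectingDispatcher,
            tracker: Tracker,
            domain: Dict[Text, Any]) -> List[Dict[Text, Any]]:
        # Send a canned answer back to the user.
        dispatcher.utter_message(text="Docker packages applications into portable containers.")
        return []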

File: endpoints.yml

The endpoints for your chatbot are specified in this file, including any external services or the webhook URL for any custom actions.  

Initializing Rasa

This command initializes a new Rasa project in the current directory ($(pwd)):

docker run  -p 5005:5005 -v $(pwd):/app rasa/rasa:3.5.2 init --no-prompt

It sets up the basic directory structure and creates essential files for a Rasa project, such as config.yml, domain.yml, and data/nlu.yml. The -p flag maps port 5005 inside the container to the same port on the host, allowing you to access the Rasa server. rasa/rasa:3.5.2 refers to the Docker image for the specific version of Rasa you want to use.

Training the model

The following command trains a Rasa model using the data and configuration specified in the project directory:

docker run -v $(pwd):/app rasa/rasa:3.5.2 train --domain domain.yml --data data --out models

The -v flag mounts the current directory ($(pwd)) inside the container, allowing access to the project files. The --domain domain.yml flag specifies the domain configuration file, --data data points to the directory containing the training data, and --out models specifies the output directory where the trained model will be saved.

Running the model

This command runs the trained Rasa model in interactive mode, enabling you to test the chatbot’s responses:

docker run -it -p 5005:5005 -v $(pwd):/app rasa/rasa:3.5.2 shell

The command loads the trained model from the models directory in the current project directory ($(pwd)). The chatbot will be accessible in the terminal, allowing you to have interactive conversations and see the model’s responses.

Verify Rasa is running:

curl localhost:5005
Hello from Rasa: 3.5.2

Now you can send the message and test your model with curl:

curl --location 'http://localhost:5005/webhooks/rest/webhook' \
--header 'Content-Type: application/json' \
--data '{
  "sender": "Test person",
  "message": "how are you?"
}'

Running WebChat

The following command deploys the trained Rasa model as a server accessible via a WebChat UI:

docker run -p 5005:5005 -v $(pwd):/app rasa/rasa:3.5.2 run -m models --enable-api --cors "*" --debug

The -p flag maps port 5005 inside the container to the same port on the host, making the Rasa server accessible. The -m models flag specifies the directory containing the trained model. The --enable-api flag enables the Rasa API, allowing external applications to interact with the chatbot. The --cors "*" flag enables cross-origin resource sharing (CORS) to handle requests from different domains. The --debug flag enables debug mode for enhanced logging and troubleshooting.

Next, start the WebChat UI container, which serves the chat interface on port 8080:

docker run -p 8080:80 harshmanvar/docker-ml-faq-rasa:webchat

Open http://localhost:8080 in the browser (Figure 3).

Figure 3: The WebChat UI showing a sample chatbot conversation in the browser.

Defining services using a Compose file

Here’s how our services appear within a Docker Compose file:

services:
  rasa:
    image: rasa/rasa:3.5.2
    ports:
      - 5005:5005
    volumes:
      - ./:/app
    command: run -m models --enable-api --cors "*" --debug
  webchat:
    image: harshmanvar/docker-ml-faq-rasa:webchat
    build:
      context: .
      dockerfile: Dockerfile-webchat
    ports:
      - 8080:80

Your sample application has the following parts:

  • The rasa service is based on the rasa/rasa:3.5.2 image. 
  • It exposes port 5005 to communicate with the Rasa API. 
  • The current directory (./) is mounted as a volume inside the container, allowing the Rasa project files to be accessible. 
  • The command run -m models --enable-api --cors "*" --debug starts the Rasa server with the specified options.
  • The webchat service is based on the harshmanvar/docker-ml-faq-rasa:webchat image. It builds the image using the Dockerfile-webchat file located in the current context (.). Port 8080 on the host is mapped to port 80 inside the container to access the webchat interface.

You can clone the repository or download the docker-compose.yml file directly from GitHub.

Bringing up the container services

You can start the WebChat application by running the following command:

docker compose up -d --build

Then, use the docker compose ps command to confirm that your stack is running properly. Your terminal will produce the following output:

docker compose ps
NAME                            IMAGE                                  COMMAND                  SERVICE             CREATED             STATUS              PORTS
docker-ml-faq-rassa-rasa-1      harshmanvar/docker-ml-faq-rasa:3.5.2   "rasa run -m models …"   rasa                6 seconds ago       Up 5 seconds        0.0.0.0:5005->5005/tcp
docker-ml-faq-rassa-webchat-1   docker-ml-faq-rassa-webchat            "/docker-entrypoint.…"   webchat             6 seconds ago       Up 5 seconds        0.0.0.0:8080->80/tcp

Viewing the containers via Docker Dashboard

You can also leverage the Docker Dashboard to view your container’s ID and easily access or manage your application (Figure 4):

Figure 4: Viewing containers with the Docker Dashboard.

Conclusion

Congratulations! You’ve learned how to containerize a Rasa application with Docker. With a single YAML file, we’ve demonstrated how Docker Compose helps you quickly build and deploy an ML FAQ Demo Model app in seconds. With just a few extra steps, you can apply this tutorial while building applications with even greater complexity. Happy developing.

Check out Rasa on Docker Hub.
