
Run Gemma 3 with Docker Model Runner: Fully Local GenAI Developer Experience

The landscape of generative AI development is evolving rapidly but comes with significant challenges. API usage costs can quickly add up, especially during development. Privacy concerns arise when sensitive data must be sent to external services. And relying on external APIs can introduce connectivity issues and latency.

Enter Gemma 3 and Docker Model Runner, a powerful combination that brings state-of-the-art language models to your local environment, addressing these challenges head-on.

In this blog post, we’ll explore how to run Gemma 3 locally using Docker Model Runner. We’ll also walk through a practical case study: a Comment Processing System that analyzes user feedback about a fictional AI assistant named Jarvis.

The power of local GenAI development

Before diving into the implementation, let’s look at why local GenAI development is becoming increasingly important:

  1. Cost efficiency: With no per-token or per-request charges, you can experiment freely without worrying about usage fees.
  2. Data privacy: Sensitive data stays within your environment, with no third-party exposure.
  3. Reduced network latency: Eliminates reliance on external APIs and enables offline use.
  4. Full control: Run the model on your terms, with no intermediaries and full transparency.

Setting up Docker Model Runner with Gemma 3

Docker Model Runner provides an OpenAI-compatible API for running models locally. It is included in Docker Desktop for macOS starting with version 4.40.0.

Here’s how to set it up with Gemma 3:

docker desktop enable model-runner --tcp 12434
docker model pull ai/gemma3

Once setup is complete, the OpenAI-compatible API provided by the Model Runner is available at: http://localhost:12434/engines/v1
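
Before writing any application code, you can sanity-check the endpoint. Assuming the Model Runner exposes the usual OpenAI-style /models route (and Node 18+ for the built-in fetch), a quick check might look like this:

// check-model-runner.mjs — quick sanity check, not part of the project code
const res = await fetch("http://localhost:12434/engines/v1/models");
console.log(await res.json()); // should list ai/gemma3 once the pull has completed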

Case study: Comment processing system

To demonstrate the power of local GenAI development, we’ve built a Comment Processing System that leverages Gemma 3 for multiple NLP tasks. This system:

  • Generates synthetic user comments about a fictional AI assistant
  • Categorizes comments as positive, negative, or neutral
  • Clusters similar comments together using embeddings
  • Identifies potential product features from the comments
  • Generates contextually appropriate responses

All tasks are performed locally with no external API calls.

Implementation details

Configuring the OpenAI SDK to use local models

To make this work, we configure the OpenAI SDK to point to the Docker Model Runner:

// config.js

export default {
  // Model configuration
  openai: {
    baseURL: "http://localhost:12434/engines/v1", // Base URL for Docker Model Runner
    apiKey: "ignored", // placeholder; the local endpoint doesn't require a real API key
    model: "ai/gemma3",
    commentGeneration: { // Each task has its own configuration; for example, comment generation uses a higher temperature than categorization for more creative output
      temperature: 0.3,
      max_tokens: 250,
      n: 1,
    },
    embedding: {
      model: "ai/mxbai-embed-large", // Model for generating embeddings
    },
  },
  // ... other configuration options
};

import OpenAI from 'openai';
import config from './config.js';

// Initialize OpenAI client with local endpoint
const client = new OpenAI({
  baseURL: config.openai.baseURL,
  apiKey: config.openai.apiKey,
});
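
The same client also covers the clustering step: the embedding model declared in config.js can be called through the SDK’s embeddings endpoint. Here is a minimal sketch; the helper name is an assumption, not the project’s actual code:

// Sketch: request one embedding vector per comment, for clustering
async function embedComments(comments) {
  const response = await client.embeddings.create({
    model: config.openai.embedding.model, // ai/mxbai-embed-large
    input: comments,                      // array of comment strings
  });
  return response.data.map((item) => item.embedding);
}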

Task-specific configuration

One key benefit of running models locally is the ability to experiment freely with different configurations for each task without worrying about API costs or rate limits.

In our case:

  • Synthetic comment generation uses a higher temperature for creativity.
  • Categorization uses a lower temperature and a 10-token limit for consistency.
  • Clustering allows up to 20 tokens to improve semantic richness in embeddings.

This flexibility lets us iterate quickly, tune for performance, and tailor the model’s behavior to each use case.
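
As an illustration, a categorization call under such a constrained configuration could look like the sketch below; the function name and exact values are assumptions, only the pattern matters:

// Sketch: classify one comment with a low temperature and a tight token budget
async function categorizeComment(commentText) {
  const response = await client.chat.completions.create({
    model: config.openai.model,
    messages: [
      { role: "system", content: "Classify the following user comment as positive, negative, or neutral. Reply with a single word." },
      { role: "user", content: commentText },
    ],
    temperature: 0.1, // low temperature for consistent labels
    max_tokens: 10,   // a one-word category needs very few tokens
  });
  return response.choices[0].message.content.trim().toLowerCase();
}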

Generating synthetic comments

To simulate user feedback, we use Gemma 3’s ability to follow detailed, context-aware prompts.

/**
 * Create a prompt for comment generation
 * @param {string} type - Type of comment (positive, negative, neutral)
 * @param {string} topic - Topic of the comment
 * @returns {string} - Prompt for OpenAI
 */
function createPromptForCommentGeneration(type, topic) {
  let sentiment = '';
  
  switch (type) {
    case 'positive':
      sentiment = 'positive and appreciative';
      break;
    case 'negative':
      sentiment = 'negative and critical';
      break;
    case 'neutral':
      sentiment = 'neutral and balanced';
      break;
    default:
      sentiment = 'general';
  }
  
  return `Generate a realistic ${sentiment} user comment about an AI assistant called Jarvis, focusing on its ${topic}.
  
The comment should sound natural, as if written by a real user who has been using Jarvis.
Keep the comment concise (1-3 sentences) and focused on the specific topic.
Do not include ratings (like "5/5 stars") or formatting.
Just return the comment text without any additional context or explanation.`;
}
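
The prompt is then sent through the configured client. A sketch of that wiring, reusing the commentGeneration settings from config.js (the function name is an assumption):

// Sketch: generate one synthetic comment from the prompt above
async function generateComment(type, topic) {
  const response = await client.chat.completions.create({
    model: config.openai.model,
    messages: [{ role: "user", content: createPromptForCommentGeneration(type, topic) }],
    ...config.openai.commentGeneration, // temperature, max_tokens, n
  });
  return response.choices[0].message.content.trim();
}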

Examples:

"Honestly, Jarvis is just a lot of empty promises. It keeps suggesting irrelevant articles and failing to actually understand my requests for help with my work – it’s not helpful at all."

"Jarvis is seriously impressive – the speed at which it responds is incredible! I’ve never used an AI assistant that’s so quick and efficient, it’s a game changer.

The ability to produce realistic feedback on demand is incredibly useful for simulating user data with zero API cost.

Generating contextual responses

We also use Gemma 3 to simulate polite, on-brand support responses to user comments. Here’s the prompt logic:

const response = await client.chat.completions.create({
    model: config.openai.model,
    messages: [
      {
        role: "system",
        content: `You are a customer support representative for an AI assistant called Jarvis. Your task is to generate polite, helpful responses to user comments.

Guidelines for responses:
1. Be empathetic and acknowledge the user's feedback
2. Thank the user for their input
3. If the comment is positive, express appreciation
4. If the comment is negative, apologize for the inconvenience and assure them you're working on improvements
5. If the comment is neutral, acknowledge their observation
6. If relevant, mention that their feedback will be considered for future updates
7. Keep responses concise (2-4 sentences) and professional
8. Do not make specific promises about feature implementation or timelines
9. Sign the response as "The Jarvis Team"`
      },
      {
        role: "user",
        content: `User comment: "${comment.text}"
Comment category: ${comment.category || 'unknown'}

${featuresContext}

Generate a polite, helpful response to this user comment.`
      }
    ],
    temperature: 0.7,
    max_tokens: 200
  });

Examples:

For a positive comment:

Thank you so much for your positive feedback regarding Jarvis’s interface! We’re thrilled to hear you find it clean and intuitive – that’s exactly what we’re aiming for. We appreciate you pointing out your desire for more visual customization options, and your feedback will definitely be considered as we continue to develop Jarvis.

The Jarvis Team

For a negative comment:

Thank you for your feedback – we appreciate you taking the time to share your observations about Jarvis. We sincerely apologize for the glitches and freezes you’ve experienced; we understand how frustrating that can be. Your input is valuable, and we’re actively working on improvements to enhance Jarvis’s reliability and accuracy. 

The Jarvis Team

This approach ensures a consistent, human-like support experience generated entirely locally.

Extracting product features from user feedback

Beyond generating and responding to comments, we also use Gemma 3 to analyze user feedback and identify actionable insights. This helps simulate the role of a product analyst, surfacing recurring themes, user pain points, and opportunities for improvement.

Here, we provide a prompt instructing the model to identify up to three potential features or improvements based on a set of user comments. 

/**
 * Extract features from comments
 * @param {string} commentsText - Text of comments
 * @returns {Promise<Array>} - Array of identified features
 */
async function extractFeaturesFromComments(commentsText) {
  const response = await client.chat.completions.create({
    model: config.openai.model,
    messages: [
      {
        role: "system",
        content: `You are a product analyst for an AI assistant called Jarvis. Your task is to identify potential product features or improvements based on user comments.
        
For each set of comments, identify up to 3 potential features or improvements that could address the user feedback.

For each feature, provide:
1. A short name (2-5 words)
2. A brief description (1-2 sentences)
3. The type of feature (New Feature, Improvement, Bug Fix)
4. Priority (High, Medium, Low)

Format your response as a JSON array of features, with each feature having the fields: name, description, type, and priority.`
      },
      {
        role: "user",
        content: `Here are some user comments about Jarvis. Identify potential features or improvements based on these comments:

${commentsText}`
      }
    ],
    response_format: { type: "json_object" },
    temperature: 0.5
  });
  
  try {
    const result = JSON.parse(response.choices[0].message.content);
    return result.features || [];
  } catch (error) {
    console.error('Error parsing feature identification response:', error);
    return [];
  }
}

Here’s an example of what the model might return:

"features": [
    {
      "name": "Enhanced Visual Customization",
      "description": "Allows users to personalize the Jarvis interface with more themes, icon styles, and display options to improve visual appeal and user preference.",
      "type": "Improvement",
      "priority": "Medium",
      "clusters": [
        "1"
      ]
    },

And just like everything else in this project, it’s generated locally with no external services.

Conclusion

By combining Gemma 3 with Docker Model Runner, we’ve unlocked a local GenAI workflow that’s fast, private, cost-effective, and fully under our control. In building our Comment Processing System, we experienced firsthand the benefits of this approach:

  • Rapid iteration without worrying about API costs or rate limits
  • Flexibility to test different configurations for each task
  • Offline development with no dependency on external services
  • Significant cost savings during development

And this is just one example of what’s possible. Whether you’re prototyping a new AI product, building internal tools, or exploring advanced NLP use cases, running models locally puts you in the driver’s seat.

As open-source models and local tooling continue to evolve, the barrier to entry for building powerful AI systems keeps getting lower.

Don’t just consume AI; develop, shape, and own the process.

Try it yourself: clone the repository and start experimenting today.

How to Run Hugging Face Models Programmatically Using Ollama and Testcontainers

Hugging Face now hosts more than 700,000 models, with the number continuously rising. It has become the premier repository for AI/ML models, catering to both general and highly specialized needs.

As the adoption of AI/ML models accelerates, more application developers are eager to integrate them into their projects. However, the entry barrier remains high due to the complexity of setup and lack of developer-friendly tools. Imagine if deploying an AI/ML model could be as straightforward as spinning up a database. Intrigued? Keep reading to find out how.


Introduction to Ollama and Testcontainers

Recently, Ollama announced support for running models from Hugging Face. This development is exciting because it brings the rich ecosystem of AI/ML components from Hugging Face to Ollama end users, who are often developers. 

Testcontainers libraries already provide an Ollama module, making it straightforward to spin up a container with Ollama without needing to know the details of how to run Ollama using Docker:

import org.testcontainers.ollama.OllamaContainer; 

var ollama = new OllamaContainer("ollama/ollama:0.1.44"); 
ollama.start();

These lines of code are all that is needed to have Ollama running inside a Docker container effortlessly.

Running models in Ollama

By default, Ollama does not include any models, so you need to download the one you want to use. With Testcontainers, this step is straightforward using the execInContainer API:

ollama.execInContainer("ollama", "pull", "moondream");

At this point, you have the moondream model ready to be used via the Ollama API. 

Excited to try it out? Hold on for a bit. This model is running in a container, so what happens if the container dies? Will you need to spin up a new container and pull the model again? Ideally not, as these models can be quite large.

Thankfully, Testcontainers makes this scenario easy to handle by providing an API to commit a container image programmatically:

public void createImage(String imageName) {
    var ollama = new OllamaContainer("ollama/ollama:0.1.44");
    ollama.start();
    try {
        // Pull the model once, then bake it into a reusable image
        ollama.execInContainer("ollama", "pull", "moondream");
    } catch (IOException | InterruptedException e) {
        throw new ContainerFetchException(e.getMessage());
    }
    ollama.commitToImage(imageName);
}

This code creates an image from the container with the model included. In subsequent runs, you can create a container from that image, and the model will already be present. Here’s the pattern:

var imageName = "tc-ollama-moondream";
var ollama = new OllamaContainer(DockerImageName.parse(imageName)
        .asCompatibleSubstituteFor("ollama/ollama:0.1.44"));
try {
    ollama.start();
} catch (ContainerFetchException ex) {
    // If image doesn't exist, create it. Subsequent runs will reuse the image.
    createImage(imageName);
    ollama.start();
}

Now, you have a model ready to be used, and because it is running in Ollama, you can interact with its API:

var image = getImageInBase64("/whale.jpeg");
String response = given()
        .baseUri(ollama.getEndpoint())
        .header(new Header("Content-Type", "application/json"))
        .body(new CompletionRequest("moondream:latest", "Describe the image.", Collections.singletonList(image), false))
        .post("/api/generate")
        .getBody().as(CompletionResponse.class).response();

System.out.println("Response from LLM " + response);

Using Hugging Face models

The previous example demonstrated using a model already provided by Ollama. However, with the ability to use Hugging Face models in Ollama, your available model options have now expanded by thousands. 

To use a model from Hugging Face in Ollama, you need a GGUF file for the model. Currently, there are 20,647 models available in GGUF format. How cool is that?

The steps to run a Hugging Face model in Ollama are straightforward, but we’ve simplified the process further by scripting it into a custom OllamaHuggingFaceContainer. Note that this custom container is not part of the default library, so you can copy and paste the implementation of OllamaHuggingFaceContainer and customize it to suit your needs.

To run a Hugging Face model, do the following:

public void createImage(String imageName, String repository, String model) {
    var hfModel = new OllamaHuggingFaceContainer.HuggingFaceModel(repository, model);
    var huggingFaceContainer = new OllamaHuggingFaceContainer(hfModel);
    huggingFaceContainer.start();
    huggingFaceContainer.commitToImage(imageName);
}

By providing the repository name and the model file as shown, you can run Hugging Face models in Ollama via Testcontainers. 
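
For example, reusing the repository and GGUF file that appear later in this post, a call could look like this (the image name is just an illustrative choice):

createImage("tc-ollama-tinyllama",
        "DavidAU/DistiLabelOrca-TinyLLama-1.1B-Q8_0-GGUF",
        "distilabelorca-tinyllama-1.1b.Q8_0.gguf");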

You can find an example using an embedding model and an example using a chat model on GitHub.

Customize your container

One key strength of Testcontainers is its flexibility in customizing container setups to fit specific project needs, encapsulating complex setup steps into manageable containers.

For example, you can create a custom container tailored to your requirements. Here’s an example of TinyLlama, a specialized container for spinning up the DavidAU/DistiLabelOrca-TinyLLama-1.1B-Q8_0-GGUF model from Hugging Face:

public class TinyLlama extends OllamaContainer {

    private final String imageName;

    public TinyLlama(String imageName) {
        super(DockerImageName.parse(imageName)
                .asCompatibleSubstituteFor("ollama/ollama:0.1.44"));
        this.imageName = imageName;
    }

    public void createImage(String imageName) {
        var ollama = new OllamaContainer("ollama/ollama:0.1.44");
        ollama.start();
        try {
            ollama.execInContainer("apt-get", "update");
            ollama.execInContainer("apt-get", "upgrade", "-y");
            ollama.execInContainer("apt-get", "install", "-y", "python3-pip");
            ollama.execInContainer("pip", "install", "huggingface-hub");
            ollama.execInContainer(
                    "huggingface-cli",
                    "download",
                    "DavidAU/DistiLabelOrca-TinyLLama-1.1B-Q8_0-GGUF",
                    "distilabelorca-tinyllama-1.1b.Q8_0.gguf",
                    "--local-dir",
                    "."
            );
            ollama.execInContainer(
                    "sh",
                    "-c",
                    String.format("echo '%s' > Modelfile", "FROM distilabelorca-tinyllama-1.1b.Q8_0.gguf")
            );
            ollama.execInContainer("ollama", "create", "distilabelorca-tinyllama-1.1b.Q8_0.gguf", "-f", "Modelfile");
            ollama.execInContainer("rm", "distilabelorca-tinyllama-1.1b.Q8_0.gguf");
            ollama.commitToImage(imageName);
        } catch (IOException | InterruptedException e) {
            throw new ContainerFetchException(e.getMessage());
        }
    }

    public String getModelName() {
        return "distilabelorca-tinyllama-1.1b.Q8_0.gguf";
    }

    @Override
    public void start() {
        try {
            super.start();
        } catch (ContainerFetchException ex) {
            // If image doesn't exist, create it. Subsequent runs will reuse the image.
            createImage(imageName);
            super.start();
        }
    }
}

Once defined, you can easily instantiate and utilize your custom container in your application:

var tinyLlama = new TinyLlama("example");
tinyLlama.start();
String response = given()
        .baseUri(tinyLlama.getEndpoint())
        .header(new Header("Content-Type", "application/json"))
        .body(new CompletionRequest(tinyLlama.getModelName() + ":latest", List.of(new Message("user", "What is the capital of France?")), false))
        .post("/api/chat")
        .getBody().as(ChatResponse.class).message.content;
System.out.println("Response from LLM " + response);

Note how all the implementation details are hidden behind the TinyLlama class: the end user doesn’t need to know how to install the model into Ollama, what GGUF is, or that huggingface-cli is installed with pip install huggingface-hub.

Advantages of this approach

  • Programmatic access: Developers gain seamless programmatic access to the Hugging Face ecosystem.
  • Reproducible configuration: All configuration, from setup to lifecycle management, is codified, ensuring reproducibility across team members and CI environments.
  • Familiar workflows: By using containers, developers familiar with containerization can easily integrate AI/ML models, making the process more accessible.
  • Automated setups: Provides a straightforward clone-and-run experience for developers.

This approach leverages the strengths of both Hugging Face and Ollama, supported by the automation and encapsulation provided by the Testcontainers module, making powerful AI tools more accessible and manageable for developers across different ecosystems.

Conclusion

Integrating AI models into applications need not be a daunting task. By leveraging Ollama and Testcontainers, developers can seamlessly incorporate Hugging Face models into their projects with minimal effort. This approach not only simplifies development environment setup but also ensures reproducibility and ease of use. With the ability to programmatically manage models and containerize them for consistent environments, developers can focus on building innovative solutions without getting bogged down by complex setup procedures.

The combination of Ollama’s support for Hugging Face models and Testcontainers’ robust container management capabilities provides a powerful toolkit for modern AI development. As AI continues to evolve and expand, these tools will play a crucial role in making advanced models accessible and manageable for developers across various fields. So, dive in, experiment with different models, and unlock the potential of AI in your applications today.


A Promising Methodology for Testing GenAI Applications in Java

In the vast universe of programming, the era of generative artificial intelligence (GenAI) has marked a turning point, opening up a plethora of possibilities for developers.

Tools such as LangChain4j and Spring AI have democratized access to the creation of GenAI applications in Java, allowing Java developers to dive into this fascinating world. With LangChain4j, for instance, setting up and interacting with large language models (LLMs) has become exceptionally straightforward. Consider the following Java code snippet:

public static void main(String[] args) {
    var llm = OpenAiChatModel.builder()
            .apiKey("demo")
            .modelName("gpt-3.5-turbo")
            .build();
    System.out.println(llm.generate("Hello, how are you?"));
}

This example illustrates how a developer can quickly instantiate an LLM within a Java application. By simply configuring the model with an API key and specifying the model name, developers can begin generating text responses immediately. This accessibility is pivotal for fostering innovation and exploration within the Java community. More than that, we have a wide range of models that can be run locally, and various vector databases for storing embeddings and performing semantic searches, among other technological marvels.

Despite this progress, however, we are faced with a persistent challenge: the difficulty of testing applications that incorporate artificial intelligence. This aspect seems to be a field where there is still much to explore and develop.

In this article, I will share a methodology that I find promising for testing GenAI applications.


Project overview

The example project focuses on an application that provides an API for interacting with two AI agents capable of answering questions. 

An AI agent is a software entity designed to perform tasks autonomously, using artificial intelligence to simulate human-like interactions and responses. 

In this project, one agent uses direct knowledge already contained within the LLM, while the other leverages internal documentation to enrich the LLM through retrieval-augmented generation (RAG). This approach allows the agents to provide precise and contextually relevant answers based on the input they receive.

I prefer to omit the technical details about RAG, as ample information is available elsewhere. I’ll simply note that this example employs a particular variant of RAG, which simplifies the traditional process of generating and storing embeddings for information retrieval.

Instead of dividing documents into chunks and creating embeddings of those chunks, in this project we use an LLM to generate a summary of each document. The embedding is generated from that summary.

When the user writes a question, an embedding of the question will be generated and a semantic search will be performed against the embeddings of the summaries. If a match is found, the user’s message will be augmented with the original document.

This way, there’s no need to deal with the configuration of document chunks, worry about setting the number of chunks to retrieve, or worry about whether the way of augmenting the user’s message makes sense. If there is a document that talks about what the user is asking, it will be included in the message sent to the LLM.
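
To make the flow concrete, here is a compact sketch of the idea. Every type in it (Llm, Embedder, Doc, SummaryRag) is a stand-in invented for illustration; the actual project presumably wires this through LangChain4j and its vector store:

import java.util.ArrayList;
import java.util.List;

// Illustrative stand-ins, not the project's classes
interface Llm { String generate(String prompt); }
interface Embedder { float[] embed(String text); }
record Doc(String text) {}

class SummaryRag {
    private record Entry(float[] summaryVector, Doc doc) {}

    private final Llm llm;
    private final Embedder embedder;
    private final List<Entry> store = new ArrayList<>();
    private final double minScore;

    SummaryRag(Llm llm, Embedder embedder, double minScore) {
        this.llm = llm;
        this.embedder = embedder;
        this.minScore = minScore;
    }

    void ingest(Doc doc) {
        // Summarize the whole document and embed the summary instead of chunking it
        String summary = llm.generate("Summarize this document:\n" + doc.text());
        store.add(new Entry(embedder.embed(summary), doc));
    }

    String answer(String question) {
        float[] queryVector = embedder.embed(question);
        Entry best = null;
        double bestScore = -1;
        for (Entry entry : store) {
            double score = cosine(entry.summaryVector(), queryVector);
            if (score > bestScore) { bestScore = score; best = entry; }
        }
        // If a summary matches well enough, augment the question with the full original document
        String prompt = (best != null && bestScore >= minScore)
                ? question + "\n\nContext:\n" + best.doc().text()
                : question;
        return llm.generate(prompt);
    }

    private static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}

The point of the sketch is the shape of the flow: summarize, embed the summary, search against the summary embeddings, and augment the question with the full original document when a match clears the threshold.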

Technical stack

The project is developed in Java and utilizes a Spring Boot application with Testcontainers and LangChain4j.

For setting up the project, I followed the steps outlined in Local Development Environment with Testcontainers and Spring Boot Application Testing and Development with Testcontainers.

I also use Testcontainers Desktop to facilitate database access, verify the generated embeddings, and review the container logs.

The challenge of testing

The real challenge arises when trying to test the responses generated by language models. Traditionally, we could settle for verifying that the response includes certain keywords, which is insufficient and prone to errors.

static String question = "How I can install Testcontainers Desktop?";
@Test
    void verifyRaggedAgentSucceedToAnswerHowToInstallTCD() {
        String answer  = restTemplate.getForObject("/chat/rag?question={question}", ChatController.ChatResponse.class, question).message();
        assertThat(answer).contains("https://testcontainers.com/desktop/");
    }

This approach is not only fragile but also lacks the ability to assess the relevance or coherence of the response.

An alternative is to employ cosine similarity to compare the embeddings of a “reference” response and the actual response, providing a more semantic form of evaluation. 

This method measures the similarity between two vectors/embeddings by calculating the cosine of the angle between them. If both vectors point in the same direction, it means the “reference” response is semantically the same as the actual response.
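
A possible shape for such a helper, embedding both texts and comparing the resulting vectors (a sketch of the idea; the project’s getCosineSimilarity, which takes only the two strings, presumably has the embedding model injected):

import dev.langchain4j.model.embedding.EmbeddingModel;

// Sketch: embed both texts, then compute the cosine of the angle between the two vectors
static double getCosineSimilarity(EmbeddingModel embeddingModel, String reference, String answer) {
    float[] a = embeddingModel.embed(reference).content().vector();
    float[] b = embeddingModel.embed(answer).content().vector();
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}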

static String question = "How I can install Testcontainers Desktop?";
static String reference = """
        - Answer must indicate to download Testcontainers Desktop from https://testcontainers.com/desktop/
        - Answer must indicate to use brew to install Testcontainers Desktop in MacOS
        - Answer must be less than 5 sentences
        """;

@Test
void verifyRaggedAgentSucceedToAnswerHowToInstallTCD() {
    String answer = restTemplate.getForObject("/chat/rag?question={question}", ChatController.ChatResponse.class, question).message();
    double cosineSimilarity = getCosineSimilarity(reference, answer);
    assertThat(cosineSimilarity).isGreaterThan(0.8);
}

However, this method introduces the problem of selecting an appropriate threshold to determine the acceptability of the response, in addition to the opacity of the evaluation process.

Toward a more effective method

The real problem here arises from the fact that answers provided by the LLM are in natural language and non-deterministic. Because of this, using current testing methods to verify them is difficult, as these methods are better suited to testing predictable values. 

However, we already have a great tool for understanding non-deterministic answers in natural language: LLMs themselves. Thus, the key may lie in using one LLM to evaluate the adequacy of responses generated by another LLM. 

This proposal involves defining detailed validation criteria and using an LLM as a “Validator Agent” to determine whether the responses meet the specified requirements. This approach can be applied to validate answers to specific questions, drawing on both general knowledge and specialized information.

By incorporating detailed instructions and examples, the Validator Agent can provide accurate and justified evaluations, offering clarity on why a response is considered correct or incorrect.

static String question = "How I can install Testcontainers Desktop?";
static String reference = """
        - Answer must indicate to download Testcontainers Desktop from https://testcontainers.com/desktop/
        - Answer must indicate to use brew to install Testcontainers Desktop in MacOS
        - Answer must be less than 5 sentences
        """;

@Test
void verifyStraightAgentFailsToAnswerHowToInstallTCD() {
    String answer = restTemplate.getForObject("/chat/straight?question={question}", ChatController.ChatResponse.class, question).message();
    ValidatorAgent.ValidatorResponse validate = validatorAgent.validate(question, answer, reference);
    assertThat(validate.response()).isEqualTo("no");
}

@Test
void verifyRaggedAgentSucceedToAnswerHowToInstallTCD() {
    String answer = restTemplate.getForObject("/chat/rag?question={question}", ChatController.ChatResponse.class, question).message();
    ValidatorAgent.ValidatorResponse validate = validatorAgent.validate(question, answer, reference);
    assertThat(validate.response()).isEqualTo("yes");
}

We can even test more complex responses where the LLM should suggest a better alternative to the user’s question.

static String question = "How I can find the random port of a Testcontainer to connect to it?";
    static String reference = """
            - Answer must not mention using getMappedPort() method to find the random port of a Testcontainer
            - Answer must mention that you don't need to find the random port of a Testcontainer to connect to it
            - Answer must indicate that you can use the Testcontainers Desktop app to configure fixed port
            - Answer must be less than 5 sentences
            """;

    @Test
    void verifyRaggedAgentSucceedToAnswerHowToDebugWithTCD() {
        String answer  = restTemplate.getForObject("/chat/rag?question={question}", ChatController.ChatResponse.class, question).message();
        ValidatorAgent.ValidatorResponse validate = validatorAgent.validate(question, answer, reference);
        assertThat(validate.response()).isEqualTo("yes");
    }

Validator Agent

The configuration for the Validator Agent doesn’t differ from that of other agents. It is built using the LangChain4j AI Service and a list of specific instructions:

public interface ValidatorAgent {
    @SystemMessage("""
                ### Instructions
                You are a strict validator.
                You will be provided with a question, an answer, and a reference.
                Your task is to validate whether the answer is correct for the given question, based on the reference.
                
                Follow these instructions:
                - Respond only 'yes', 'no' or 'unsure' and always include the reason for your response
                - Respond with 'yes' if the answer is correct
                - Respond with 'no' if the answer is incorrect
                - If you are unsure, simply respond with 'unsure'
                - Respond with 'no' if the answer is not clear or concise
                - Respond with 'no' if the answer is not based on the reference
                
                Your response must be a json object with the following structure:
                {
                    "response": "yes",
                    "reason": "The answer is correct because it is based on the reference provided."
                }
                
                ### Example
                Question: Is Madrid the capital of Spain?
                Answer: No, it's Barcelona.
                Reference: The capital of Spain is Madrid
                ###
                Response: {
                    "response": "no",
                    "reason": "The answer is incorrect because the reference states that the capital of Spain is Madrid."
                }
                """)
    @UserMessage("""
            ###
            Question: {{question}}
            ###
            Answer: {{answer}}
            ###
            Reference: {{reference}}
            ###
            """)
    ValidatorResponse validate(@V("question") String question, @V("answer") String answer, @V("reference") String reference);

    record ValidatorResponse(String response, String reason) {}
}

As you can see, I’m using Few-Shot Prompting to guide the LLM on the expected responses. I also request a JSON format for responses to facilitate parsing them into objects, and I specify that the reason for the answer must be included, to better understand the basis of its verdict.
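
For reference, an interface like this is typically turned into a working agent with a single AiServices call. A minimal sketch, reusing the model setup from the introduction (the project itself presumably configures this through Spring):

import dev.langchain4j.model.openai.OpenAiChatModel;
import dev.langchain4j.service.AiServices;

// Build a chat model and bind it to the ValidatorAgent interface defined above
var validatorModel = OpenAiChatModel.builder()
        .apiKey("demo")
        .modelName("gpt-3.5-turbo")
        .build();

ValidatorAgent validatorAgent = AiServices.create(ValidatorAgent.class, validatorModel);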

Conclusion

The evolution of GenAI applications brings with it the challenge of developing testing methods that can effectively evaluate the complexity and subtlety of responses generated by advanced artificial intelligences. 

The proposal to use an LLM as a Validator Agent represents a promising approach, paving the way towards a new era of software development and evaluation in the field of artificial intelligence. Over time, we hope to see more innovations that allow us to overcome the current challenges and maximize the potential of these transformative technologies.
