# Getting Started with Mixtral 8x7B

> Mixtral 8x7B is Mistral AI's new state-of-the-art LLM. Using a Mixture of Experts (MoE) architecture, the LLM is able to exceed GPT-3.5 performance at a fraction of the computational cost. Here we learn how to deploy Mixtral as an AI agent.

James Briggs · 2023-12-20

**Mixtral 8x7B** from Mistral AI is the first open-weight model to achieve better than GPT-3.5 performance. From our experimentation, we view this as the first step towards broadly applied open-weight LLMs in the industry.

In this walkthrough, we'll see how to set up and deploy Mixtral, the prompt format required, and how it performs when being used as an AI agent.

As a bit of a spoiler, Mixtral is the first open-weight LLM that is truly _very_ good — we say this considering the following key points:

1. Benchmarks show it to perform better than GPT-3.5.

2. Our testing shows Mixtral to be the first open-weight model we can reliably use as an agent.

3. Thanks to its MoE architecture, it is _very_ fast given its size. If you can afford to run it on 2x A100s, latency is good enough for chatbot use cases.

[Video](https://www.youtube.com/watch?v=aCRvIPpFyEI)


With that in mind, Mixtral is still a mixture of eight experts. Because the experts share the attention layers, the total is ~47B parameters rather than the 8 × 7B = 56B the name suggests, but we still need plenty of space to store the model. The amount of space required decreases with quantized versions of the model, such as the [GGUF quantized models from TheBloke](https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF).
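To get a rough feel for the storage requirement, the back-of-the-envelope sketch below multiplies a parameter count by the bytes used per parameter. It assumes the 46.7B total that Mistral AI reports for Mixtral 8x7B (lower than 8 × 7B because the experts share the attention layers) and ignores activations, the KV cache, and the per-block overhead that real quantization formats add, so treat the numbers as lower bounds. The function name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed just to hold the weights, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# assumption: total parameter count reported by Mistral AI for Mixtral 8x7B
N_PARAMS = 46.7e9

# nominal bit widths; real quantized files are somewhat larger than this
for label, bits in [("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(N_PARAMS, bits):.0f} GB")
```

At `bfloat16` the weights alone come to roughly 93 GB, more than a single 80 GB GPU can hold, which is consistent with the two-GPU setup used in this walkthrough.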

## Finding Somewhere to Run Mixtral

Unless you have two A100s or H100s lying around, you'll need to find a service to run Mixtral. We'll demonstrate how to use [RunPod here](https://youtu.be/aCRvIPpFyEI?t=50). We found this to be one of the easier and cheaper compute providers to set up with Mixtral.

* First, you'll need to [sign up for an account on RunPod](https://www.runpod.io/console/signup?utm_term=runpod&utm_campaign=Serverless+GPU+-+US+%2B+UK+-+Search+%2B+Website+Traffic+-+Dec.+18&utm_source=adwords&utm_medium=ppc&hsa_acc=4558579452&hsa_cam=20876437187&hsa_grp=155563324223&hsa_ad=685147843670&hsa_src=g&hsa_tgt=kwd-461794446387&hsa_kw=runpod&hsa_mt=e&hsa_net=adwords&hsa_ver=3&gclid=CjwKCAiAvoqsBhB9EiwA9XTWGWRrIRJgj3IB3PbCsG44_m53vfCyFuPGXdcB_bvAJNWxyNtoMxjovRoCqJwQAvD_BwE).

* Navigate to **Home** > click **Start Building**.

* Set up a GPU instance; you can use 2xA100 or 2xH100.

* Customize the deployment to use:

  * _Container Size: 120GB_

  * _Disk Volume: 600GB_

* Make sure _Jupyter Notebook_ is checked and click deploy!

Once deployed, click on the instance and click _Open Jupyter Server_. This will take you to a JupyterLab instance running on the container. From there you can upload [this notebook](https://github.com/pinecone-io/examples/blob/master/learn/generation/llm-field-guide/mistral-ai/mixtral-8x7b/00-mixtral-8x7b-agent.ipynb) and follow along.

## Installing Prerequisites

There are a few prerequisites. To run Mixtral 8x7B we need `transformers` and `accelerate`. We also install `duckduckgo_search` to use in our agent testing later.

```text
!pip install -qU \
    transformers==4.36.1 \
    accelerate==0.25.0 \
    duckduckgo_search==4.1.0
```

## Download and Initialize Mixtral

Once we have installed our prerequisites we're ready to download and initialize Mixtral. We'll use the instruct fine-tuned model hosted on the Hugging Face hub.

```
from torch import bfloat16
import transformers

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=bfloat16,
    device_map='auto'
)
model.eval()
```

As with all LLMs/transformer models we need to initialize a `tokenizer` that will take our plain text input and transform it into lists of tokens to be consumed by the first layer of the LLM/transformer. We initialize the tokenizer like so:

```
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
```

Now we set up a `text-generation` pipeline using `transformers`. There are a lot of generation parameters we can adjust here; we'd recommend leaving them as they are for now and returning to them if you feel your generated outputs need improvement.

```
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=False,  # if using langchain set True
    task="text-generation",
    # we pass generation parameters here too; note that temperature, top_p,
    # and top_k only take effect when sampling is enabled (do_sample=True)
    temperature=0.1,  # 'randomness' of outputs: must be > 0, lower is more deterministic
    top_p=0.15,  # sample only from the top tokens whose probabilities add up to 15%
    top_k=0,  # 0 disables top-k filtering, so we rely on top_p
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # if output begins repeating increase
)
```
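To make the `top_p` setting above more concrete, here is a simplified, pure-Python illustration of nucleus (top-p) filtering over a toy next-token distribution. It is a sketch of the idea only, not the `transformers` implementation, and the function name `top_p_filter` and the example probabilities are invented for illustration.

```python
from typing import Dict

def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize the kept probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}

# toy distribution over three candidate next tokens (made-up numbers)
toy_probs = {"the": 0.5, "a": 0.3, "an": 0.2}

narrow = top_p_filter(toy_probs, 0.15)  # only the single most likely token survives
wide = top_p_filter(toy_probs, 0.7)     # the two most likely tokens survive
```

With a value as low as 0.15, a single confident token is often enough to fill the nucleus, so sampling behaves almost greedily. That suits an agent that must emit well-formed JSON.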

To generate text we call the `generate_text` pipeline:

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/03a075a508336b0ff1733872da40909253a00d0a.ipynb)


### Instruction Format

We can see a very generic generated output here. There are two primary reasons for that:

1. We haven't provided any instructions to the model.

2. We have not used the recommended instruction format.

The instruction format for Mixtral 8x7B looks like this:

```text
<s> [INST] Some instructions [/INST] Primer text [generated output] </s>
```

We would put our instructions to the model in place of `"Some instructions"` and place a primer like `"Assistant: "` in place of `"Primer text"`. The `<s>` and `</s>` are special tokens used by Mixtral to signify the **B**eginning **O**f **S**tring (BOS) and **E**nd **O**f **S**tring (EOS), i.e. the beginning and end of our text. The `[INST]` and `[/INST]` strings tell the model that anything between those two strings is an _instruction_ that the model should follow.

We can add some follow-up instructions like so:

```text
<s> [INST] Some instructions [/INST] Primer text [generated output] </s> [INST] Further instructions [/INST]
```
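The two templates above can be generated by a small helper. This is a minimal sketch that reproduces the spacing shown in the templates; the name `build_prompt` and its arguments are assumptions made for illustration, not part of any Mistral AI or `transformers` API.

```python
from typing import List, Tuple

def build_prompt(turns: List[Tuple[str, str]], next_instruction: str) -> str:
    """Build a prompt from completed (instruction, response) turns plus
    the next instruction that the model should respond to."""
    prompt = "<s>"
    for instruction, response in turns:
        # every completed turn is closed with the EOS token
        prompt += f" [INST] {instruction} [/INST] {response} </s>"
    # the final instruction is left open so the model generates the response
    prompt += f" [INST] {next_instruction} [/INST]"
    return prompt

single_turn = build_prompt([], "Some instructions")
follow_up = build_prompt(
    [("Some instructions", "Primer text [generated output]")],
    "Further instructions",
)
```

Here `single_turn` is the prompt for a first request, and `follow_up` matches the multi-turn template shown above.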

  
Let's begin by adding some _instructions_ first; we'll add instruction formatting later. In these instructions, we want to set up the guidelines for an agent that can use two tools (calculator and search) and also return answers to the user. The agent selects one of these three options via a JSON output format containing `"tool_name"`, which specifies the tool to use (one of [`Calculator`, `Search`, `Final Answer`]), and `"input"`, which specifies the input to the chosen tool.

````
agent_template = """
You are a helpful AI assistant, you are an agent capable of using a variety of tools to answer a question. Here are a few of the tools available to you:

- Calculator: the calculator should be used whenever you need to perform a calculation, no matter how simple. It uses Python so make sure to write complete Python code required to perform the calculation required and make sure the Python returns your answer to the `output` variable.
- Search: the search tool should be used whenever you need to find information. It can be used to find information about everything
- Final Answer: the final answer tool must be used to respond to the user. You must use this when you have decided on an answer.

To use these tools you must always respond in JSON format containing `"tool_name"` and `"input"` key-value pairs. For example, to answer the question, "what is the square root of 51?" you must use the calculator tool like so:

```json
{
    "tool_name": "Calculator",
    "input": "from math import sqrt; output = sqrt(51)"
}
```

Or to answer the question "who is the current president of the USA?" you must respond:

```json
{
    "tool_name": "Search",
    "input": "current president of USA"
}
```

Remember, even when answering to the user, you must still use this JSON format! If you'd like to ask how the user is doing you must write:

```json
{
    "tool_name": "Final Answer",
    "input": "How are you today?"
}
```

Let's get started. The users query is as follows.

User: Hi there, I'm stuck on a math problem, can you help? My question is what is the square root of 512 multiplied by 7?

A: ```json
{
    "tool_name": """
````

Using these instructions we get great performance:

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/775f31ef4614c0e2a737992b08224a5c5b3603d4.ipynb)


Before continuing, let's add the recommended instruction formatting. We'll do this via a function called `instruction_format` that will consume a `sys_message` (i.e. instructions) and a user's `query` and output the string with the required tokens.

````
def instruction_format(sys_message: str, query: str):
    # note: don't append "</s>" to the end, the model must continue the text
    return f'<s> [INST] {sys_message} [/INST]\nUser: {query}\nAssistant: ```json\n{{\n"tool_name": '

sys_msg = """You are a helpful AI assistant, you are an agent capable of using a variety of tools to answer a question. Here are a few of the tools available to you:

- Calculator: the calculator should be used whenever you need to perform a calculation, no matter how simple. It uses Python so make sure to write complete Python code required to perform the calculation required and make sure the Python returns your answer to the `output` variable.
- Search: the search tool should be used whenever you need to find information. It can be used to find information about everything
- Final Answer: the final answer tool must be used to respond to the user. You must use this when you have decided on an answer.

To use these tools you must always respond in JSON format containing `"tool_name"` and `"input"` key-value pairs. For example, to answer the question, "what is the square root of 51?" you must use the calculator tool like so:

```json
{
    "tool_name": "Calculator",
    "input": "from math import sqrt; output = sqrt(51)"
}
```

Or to answer the question "who is the current president of the USA?" you must respond:

```json
{
    "tool_name": "Search",
    "input": "current president of USA"
}
```

Remember, even when answering to the user, you must still use this JSON format! If you'd like to ask how the user is doing you must write:

```json
{
    "tool_name": "Final Answer",
    "input": "How are you today?"
}
```

Let's get started. The users query is as follows.
"""
query = "Hi there, I'm stuck on a math problem, can you help? My question is what is the square root of 512 multiplied by 7?"

input_prompt = instruction_format(sys_msg, query)
````

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/4cd3b793d27a0f16082b7c0dba3b2d2e1af4bf8a.ipynb)


[Colab File](https://cdn.sanity.io/files/vr8gru94/production/29d8879e018c150544c98492efe22e85d514e9c6.ipynb)


Using the formatted instructions makes no difference in this example, but as this is the format the instruction-tuned model was fine-tuned on, we would expect to see slightly better or at least more reliable results.

We need to parse the action string into a dictionary and run the Python code provided.

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/64d454cdd937aaa6eca02403584357d97c3d5845.ipynb)
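The parsing step can be sketched as follows. This is a minimal sketch under stated assumptions rather than the notebook's exact code: it assumes the generated text continues the JSON object that the prompt opened with `{\n"tool_name": `, and the name `format_output` simply mirrors the helper that `run` calls later in this walkthrough. The `generated` string is a hypothetical model output used only for illustration.

```python
import json

# assumption: the prompt ended with this prefix, so the generated text starts
# part-way through the JSON object and the prefix must be added back
JSON_PREFIX = '{\n"tool_name": '

def format_output(text: str) -> dict:
    """Parse the model's generated text into an action dictionary."""
    full_json_str = JSON_PREFIX + text.strip()
    # raw_decode parses the first JSON value and ignores anything after it,
    # such as a closing code fence or extra commentary from the model
    action, _ = json.JSONDecoder().raw_decode(full_json_str)
    return action

# hypothetical generated text, for illustration only
generated = ' "Calculator",\n    "input": "from math import sqrt; output = sqrt(512) * 7"\n}\n'
action = format_output(generated)

# run the generated Python in its own namespace and read back `output`
namespace = {}
exec(action["input"], namespace)
result = namespace["output"]
```

Executing the code in a dedicated `namespace` dictionary keeps the tool's variables separate from our own and makes it explicit where `output` comes from. Because `exec` runs whatever the model wrote, this is only appropriate in an isolated environment.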


Now we can add this answer to our original prompt:

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/e5d1925ef77626e0691862ea23fb94375005b566.ipynb)


Then we feed the original prompt with this additional context from the tool back into Mixtral to get our final output. As before we will convert the final output into a dictionary and _if_ `Final Answer` is provided we return it to the user.

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/2bb00b46b610439c151d8cc28aa9e654ccdc46aa.ipynb)


This final output gives us the correct answer. We can formalize all of our tool-selection logic into a single `use_tool` function.

```
from duckduckgo_search import DDGS

def use_tool(action: dict):
    tool_name = action["tool_name"]
    if tool_name == "Final Answer":
        return "Assistant: "+action["input"]
    elif tool_name == "Calculator":
        # run the generated code in its own namespace so that `output` can be
        # read back reliably (exec() inside a function cannot create local
        # variables); only exec model-written code in an isolated environment
        namespace = {}
        exec(action["input"], namespace)
        return f"Tool Output: {namespace['output']}"
    elif tool_name == "Search":
        contexts = []
        with DDGS() as ddgs:
            results = ddgs.text(
                action["input"],
                region="wt-wt", safesearch="on",
                max_results=3
            )
            for r in results:
                contexts.append(r['body'])
        info = "\n---\n".join(contexts)
        return f"Tool Output: {info}"
    else:
        # otherwise just assume final answer
        return "Assistant: "+action["input"]
```

Here we're also adding the `Search` tool, which will use DuckDuckGo to search the web for information. With that, we can now ask questions that require up-to-date general world knowledge, such as questions about world leaders.

```
query = "who is the current prime minister of the UK?"

input_prompt = instruction_format(sys_msg, query)
```

Let's define a `run` function to handle a single prompt, tool selection, and final action loop.

```
def run(query: str):
    res = generate_text(query)
    action_dict = format_output(res[0]["generated_text"])
    response = use_tool(action_dict)
    full_text = f"{query}{res[0]['generated_text']}\n{response}"
    return response, full_text
```

Now we can run this with our question about who the current Prime Minister of the UK is — given the dynamic nature of this position in recent years this is a hard question for an out-of-date LLM to answer.

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/d3be5847ea69fdcb7371a8ed030be5f13c0caff2.ipynb)


We're not handling the logic of iterating through multiple agent steps yet — so we must run the next step manually.

[Colab File](https://cdn.sanity.io/files/vr8gru94/production/c70dac39e97a9e2fd6828a68580517cf222e016c.ipynb)


From this, we get a great response and perfect tool usage from Mixtral!
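The manual second step could be wrapped in a simple loop. The sketch below is one possible approach, not the notebook's implementation: `run_agent`, `max_steps`, and `NEXT_STEP_CUE` are hypothetical names, `step` stands in for the `run` function defined earlier (anything that takes a prompt and returns a `(response, full_text)` pair), and we assume that final answers start with `"Assistant: "` as they do in `use_tool`.

```python
from typing import Callable, Tuple

FENCE = "`" * 3  # built programmatically to avoid a literal code fence here
# assumption: the same cue that ends the prompt built by instruction_format
NEXT_STEP_CUE = f'\nAssistant: {FENCE}json\n{{\n"tool_name": '

def run_agent(
    prompt: str,
    step: Callable[[str], Tuple[str, str]],
    max_steps: int = 5,
) -> str:
    """Call `step` repeatedly until a final answer is returned."""
    for _ in range(max_steps):
        response, full_text = step(prompt)
        if response.startswith("Assistant: "):
            # the Final Answer tool was used, so we are done
            return response
        # otherwise keep the tool output in context and cue the next action
        prompt = full_text + NEXT_STEP_CUE
    # stop after max_steps so a confused agent cannot loop forever
    return "Assistant: Sorry, I could not reach a final answer."
```

In the walkthrough's setting this would be called as `run_agent(input_prompt, run)`. The step limit is a deliberate design choice: every iteration is a full generation pass over a growing prompt, so an unbounded loop would be slow and expensive.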