RAG: Retrieval Augmented Generation

Nixiesearch supports RAG-style question answering over fully local LLMs.

To use RAG queries, you need to explicitly define which LLMs you plan to use at query time in the inference section of the config file:

inference:
  embedding:
    # Used for semantic retrieval
    e5-small:
      model: nixiesearch/e5-small-v2-onnx
      prompt:
        doc: "passage: "
        query: "query: "
  completion:
    # Used for summarization
    qwen2:
      provider: llamacpp
      # Warning: this is a very small demo model;
      # for production use, consider something bigger.
      model: Qwen/Qwen2-0.5B-Instruct-GGUF
      file: qwen2-0_5b-instruct-q4_0.gguf
      prompt: qwen2

schema:
  movies:
    fields:
      title:
        type: text
        search: 
          type: semantic
          model: e5-small
        suggest: true
      overview:
        type: text
        search:
          type: semantic
          model: e5-small
        suggest: true

Where:

  • model: a Huggingface model handle in the namespace/model-name format.
  • prompt: a prompt format: either one of the pre-defined ones like qwen2 and llama3, or a raw prompt with {user} and {system} placeholders.
  • name: the name of this model, which you reference in RAG search requests.
  • system (optional): a system prompt for the model.
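
For example, a completion model that combines the built-in llama3 prompt with a custom system prompt could look like this (the model handle and file name are illustrative placeholders, not a tested configuration):

inference:
  completion:
    llama3:
      provider: llamacpp
      # Placeholder handle and file: substitute a real GGUF build
      # of a llama3-family instruct model here.
      model: some-namespace/llama3-instruct-GGUF
      file: llama3-instruct-q4_0.gguf
      prompt: llama3
      system: "You are a helpful assistant summarizing search results."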

Supported prompts

The qwen2 prompt is an alias for the following raw prompt:

<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant\n

The more elaborate llama3 prompt is an alias for this raw prompt:

<|start_header_id|>system<|end_header_id|>

{system}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

You can always define your own prompt:

inference:
  completion:
    # Used for summarization
    mistral7b:
      model: TheBloke/Mistral-7B-Instruct-v0.2-GGUF
      prompt: "[INST] {user} [/INST]"
      name: mistral7b

Sending requests

For RAG requests, Nixiesearch supports plain REST and Server-Sent Events (SSE) for streaming responses:

  • REST: much simpler to implement, but blocks until the full RAG response is generated.
  • SSE: can stream each generated token as it is produced, but is more complex to set up.

The request format is the same for both protocols:

{
  "query": {
    "multi_match": {
      "fields": ["title", "description"],
      "query": "what is pizza"
    }
  },
  "fields": ["title", "description"],
  "rag": {
    "prompt": "Summarize search results for a query 'what is pizza'",
    "model": "qwen2",
    "stream": false
  }
}

The rag field has the following options:

  • stream (boolean, optional, default false): whether to stream the response over SSE or block until the complete response is generated.
  • prompt (string, required): the main instruction for the LLM.
  • model (string, required): a model name from the inference.completion section of the config file.
  • fields (string[], optional): a list of fields from the search result documents to include in the LLM prompt. By default, all stored fields from the response are used.
  • topDocs (int, optional): how many top-N documents to include in the prompt. Defaults to the top 10; more documents mean a longer context and higher latency.
  • maxDocLength (int, optional): truncate each document in the prompt to its first N tokens. Defaults to 128.
  • maxResponseLength (int, optional): the maximum number of tokens the LLM can generate. Defaults to 64.
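
For example, a request that trims the prompt context and allows a longer answer could look like this (the values are illustrative):

{
  "query": {
    "multi_match": {
      "fields": ["title", "description"],
      "query": "what is pizza"
    }
  },
  "fields": ["title", "description"],
  "rag": {
    "prompt": "Summarize search results for a query 'what is pizza'",
    "model": "qwen2",
    "fields": ["title"],
    "topDocs": 5,
    "maxDocLength": 256,
    "maxResponseLength": 128
  }
}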

REST responses

The complete text of the LLM response can be found in the response field:

$> cat rag.json

{
  "query": {
    "multi_match": {
      "fields": ["title"],
      "query": "matrix"
    }
  },
  "fields": ["title"],
  "rag": {
    "prompt": "Summarize search results for a query 'matrix'",
    "model": "qwen2"
  }
}

$> curl -v -XPOST -d @rag.json http://localhost:8080/movies/_search

{
  "took": 3,
  "hits": [
    {
      "_id": "604",
      "title": "The Matrix Reloaded",
      "_score": 0.016666668
    },
    {
      "_id": "605",
      "title": "The Matrix Revolutions",
      "_score": 0.016393442
    }
  ],
  "aggs": {},
  "response": "The following is a list of search results for the query 'matrix'. It includes the following:\n\n- The matrix is the first film in the \"Matrix\" franchise."
}
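
If you only need the generated text, you can extract it with jq (assuming the jq CLI is installed):

$> curl -s -XPOST -d @rag.json http://localhost:8080/movies/_search | jq -r .response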

Streaming responses

The main REST search endpoint /<index_name>/_search can also function as an SSE endpoint.

$> cat rag.json

{
  "query": {
    "multi_match": {
      "fields": ["title"],
      "query": "matrix"
    }
  },
  "fields": ["title"],
  "rag": {
    "prompt": "Summarize search results for a query 'matrix'",
    "model": "qwen2",
    "stream" true
  }
}

$> curl -v -XPOST -d @rag.json http://localhost:8080/movies/_search

< HTTP/1.1 200 OK
< Date: Fri, 13 Sep 2024 16:29:11 GMT
< Connection: keep-alive
< Content-Type: text/event-stream
< Transfer-Encoding: chunked
< 
event: results
data: {"took":3,"hits":["... skipped ..."],"aggs":{},"ts":1726246416275}

event: rag
data: {"token":"Summary","ts":1726246417457,"took":1178,"last":false}

event: rag
data: {"token":":","ts":1726246417469,"took":12,"last":false}

event: rag
data: {"token":" Searches","ts":1726246417494,"took":24,"last":false}

event: rag
data: {"token":" for","ts":1726246417511,"took":18,"last":false}

event: rag
data: {"token":" '","ts":1726246417526,"took":15,"last":false}

event: rag
data: {"token":"matrix","ts":1726246417543,"took":17,"last":true}

The SSE response consists of two frame types:

  • results: a regular search response, identical to the non-streaming one.
  • rag: a sequence of per-token events generated live.

results frame

A results frame has the following structure:

{
  "took": 112,
  "hits": [
    {
      "_id": "604",
      "title": "The Matrix Reloaded",
      "_score": 0.016666668
    },
    {
      "_id": "605",
      "title": "The Matrix Revolutions",
      "_score": 0.016393442
    }
  ],
  "ts":1722354191905
}

Note

Unlike the REST response, the results frame does not contain a response field: the response is streamed token by token in the rag frames!

rag frame

A rag frame is a small frame that always follows the results frame:

{
  "token": " Matrix",
  "ts": 1722354192184,
  "took": 20,
  "last": false
}

  • token (required, string): the next generated LLM token.
  • ts (required, long): the generation timestamp.
  • took (required, long): how many milliseconds the underlying LLM spent generating this token.
  • last (required, bool): is this the last token in the response stream?

Assembling frames together

  • The results frame with the search results always comes first.
  • If a rag field was present in the search request, the server then starts streaming RAG response tokens.
  • When the server finishes generating the RAG response, it sets the last: true flag on the final token to communicate that, as the client sketch below illustrates.
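
A minimal client sketch in Python, assuming the requests library is available. The SSE parsing here is simplified and assumes one data: line per event, which matches the responses shown above:

import json

import requests  # third-party HTTP client: pip install requests

request = {
    "query": {"multi_match": {"fields": ["title"], "query": "matrix"}},
    "fields": ["title"],
    "rag": {
        "prompt": "Summarize search results for a query 'matrix'",
        "model": "qwen2",
        "stream": True,
    },
}

tokens = []
with requests.post("http://localhost:8080/movies/_search", json=request, stream=True) as resp:
    resp.raise_for_status()
    event = None
    for line in resp.iter_lines(decode_unicode=True):
        if line.startswith("event:"):
            # Remember the frame type for the data line that follows.
            event = line.split(":", 1)[1].strip()
        elif line.startswith("data:"):
            payload = json.loads(line.split(":", 1)[1])
            if event == "results":
                # First frame: the regular search response.
                print("got", len(payload["hits"]), "hits")
            elif event == "rag":
                tokens.append(payload["token"])
                if payload["last"]:
                    break

# Tokens already carry their leading whitespace, so plain
# concatenation reconstructs the full response text.
print("".join(tokens))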