RAG: Retrieval Augmented Generation¶
Nixiesearch supports RAG-style question answering over fully local LLMs.
To use RAG queries, you need to explicitly define in the config file which LLMs you plan to use at query time:
schema:
  movies:
    rag:
      models:
        - handle: Qwen/Qwen2-0.5B-Instruct-GGUF?file=qwen2-0_5b-instruct-q4_0.gguf
          prompt: qwen2
          name: qwen2
    fields:
      title:
        type: text
        search: semantic
        suggest: true
      overview:
        type: text
        search: semantic
        suggest: true
Where:
- `handle`: a Hugging Face model handle in the format `namespace/model-name`. It may optionally include a `?file=` specifier when the model repo contains multiple GGUF files. By default Nixiesearch picks the lexicographically first file.
- `prompt`: a prompt format, either one of the pre-defined ones such as `qwen2` and `llama3`, or a raw prompt with `{user}` and `{system}` placeholders.
- `name`: the name of this model, which you will reference in RAG search requests.
- `system` (optional): a system prompt for the model.
Supported prompts¶
The `qwen2` prompt is an alias for the following raw prompt:
<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant\n
The more extended `llama3` prompt is an alias for the following raw prompt:
<|start_header_id|>system<|end_header_id|>
{system}<|eot_id|><|start_header_id|>user<|end_header_id|>
{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
You can always define your own prompt:
schema:
  movies:
    rag:
      models:
        - handle: TheBloke/Mistral-7B-Instruct-v0.2-GGUF
          prompt: "[INST] {user} [/INST]"
          name: mistral7b
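To make the placeholder mechanics concrete, here is a minimal Python sketch of how a raw prompt template could be filled in. It is an illustration only, not Nixiesearch's actual implementation: `render_prompt` is a hypothetical helper, and it assumes that `\n` in the raw prompts above stands for a newline character.

```python
import re

# Raw templates copied from the examples above.
QWEN2 = "<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant\n"
MISTRAL = "[INST] {user} [/INST]"


def render_prompt(template: str, user: str, system: str = "") -> str:
    """Hypothetical helper: fill the {user} and {system} placeholders."""
    values = {"user": user, "system": system}
    # A single regex pass, so placeholder-like text inside the substituted
    # values is never expanded a second time.
    return re.sub(r"\{(user|system)\}", lambda m: values[m.group(1)], template)


print(render_prompt(MISTRAL, "what is pizza"))
# [INST] what is pizza [/INST]
```

A template without a `{system}` placeholder, such as the two above, simply ignores the system text in this sketch.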
Sending requests¶
For RAG requests, Nixiesearch supports REST and WebSocket protocols:
- REST: much simpler to implement, but blocks until the full RAG response is generated.
- WebSocket: can stream each generated response token, but is more complex to implement.
The request format is the same for both protocols:
{
  "query": {
    "multi_match": {
      "fields": ["title", "description"],
      "query": "what is pizza"
    }
  },
  "fields": ["title", "description"],
  "rag": {
    "prompt": "Summarize search results for a query 'what is pizza'",
    "model": "qwen2"
  }
}
The `rag` field has the following options:
- `prompt` (string, required): the main instruction for the LLM.
- `model` (string, required): a model name from the `rag.models` index mapping section.
- `fields` (string[], optional): a list of fields from the search result documents to embed into the LLM prompt. By default, all stored fields from the response are used.
- `topDocs` (int, optional): how many top-N documents to embed into the prompt. Defaults to the top 10. More documents mean a longer context and higher latency.
- `maxDocLength` (int, optional): limits each document in the prompt to its first N tokens. Defaults to 128.
- `maxResponseLength` (int, optional): the maximum number of tokens the LLM can generate. Defaults to 64.
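The options above can be combined in a request body as in the following Python sketch. `build_rag_request` is a hypothetical helper written for this example, not part of any Nixiesearch client. The last lines estimate an upper bound on the number of document tokens placed into the prompt, using the defaults listed above (10 documents, 128 tokens each) when an option is not set.

```python
import json

# Optional keys of the "rag" section, as listed above.
OPTIONAL_RAG_KEYS = {"fields", "topDocs", "maxDocLength", "maxResponseLength"}


def build_rag_request(query_text, search_fields, prompt, model, **rag_options):
    """Hypothetical helper: assemble a search request with a "rag" section."""
    unknown = set(rag_options) - OPTIONAL_RAG_KEYS
    if unknown:
        raise ValueError(f"unknown rag options: {sorted(unknown)}")
    return {
        "query": {"multi_match": {"fields": search_fields, "query": query_text}},
        "fields": search_fields,
        "rag": {"prompt": prompt, "model": model, **rag_options},
    }


request = build_rag_request(
    "what is pizza",
    ["title", "description"],
    prompt="Summarize search results for a query 'what is pizza'",
    model="qwen2",
    topDocs=5,
    maxDocLength=64,
)
print(json.dumps(request, indent=2))

# Upper bound on document tokens embedded into the prompt: topDocs * maxDocLength.
rag = request["rag"]
budget = rag.get("topDocs", 10) * rag.get("maxDocLength", 128)
print(budget)  # 320
```

Lowering `topDocs` or `maxDocLength` shrinks this bound, which is the main lever for trading answer context against latency.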
REST responses¶
The complete text of the LLM response is returned in the `response` field:
$> cat rag.json
{
  "query": {
    "multi_match": {
      "fields": ["title"],
      "query": "matrix"
    }
  },
  "fields": ["title"],
  "rag": {
    "prompt": "Summarize search results for a query 'matrix'",
    "model": "qwen2"
  }
}
$> curl -v -XPOST -d @rag.json http://localhost:8080/movies/_search
{
  "took": 3,
  "hits": [
    {
      "_id": "604",
      "title": "The Matrix Reloaded",
      "_score": 0.016666668
    },
    {
      "_id": "605",
      "title": "The Matrix Revolutions",
      "_score": 0.016393442
    }
  ],
  "aggs": {},
  "response": "The following is a list of search results for the query 'matrix'. It includes the following:\n\n- The matrix is the first film in the \"Matrix\" franchise."
}
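The same REST call can be made from Python using only the standard library. This is a rough sketch: the helper names are hypothetical, and the `Content-Type: application/json` header is an assumption of this example (the `curl` command above does not set it explicitly).

```python
import json
import urllib.request


def make_search_request(base_url, index, body):
    """Build (but do not send) a POST request to /<index_name>/_search."""
    return urllib.request.Request(
        f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption, see above
        method="POST",
    )


def rag_answer(base_url, index, body, timeout=60.0):
    """Send the request and return the generated text from the "response" field."""
    request = make_search_request(base_url, index, body)
    # Blocks until the full RAG response has been generated.
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["response"]


body = {
    "query": {"multi_match": {"fields": ["title"], "query": "matrix"}},
    "fields": ["title"],
    "rag": {
        "prompt": "Summarize search results for a query 'matrix'",
        "model": "qwen2",
    },
}
# With a server running locally:
# print(rag_answer("http://localhost:8080", "movies", body))
```

Because the REST call blocks for the whole generation, a generous client timeout is a sensible choice here; the 60 seconds used above is an arbitrary value.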
Websocket responses¶
The main REST search endpoint `/<index_name>/_search` can also function as a WebSocket endpoint. With a simple Python-based WebSocket client you can send requests to this endpoint:
# Requires the websocket-client package (pip install websocket-client).
import json

import websocket

# Log the WebSocket handshake and frames for debugging.
websocket.enableTrace(True)


def on_message(wsapp, message):
    # Each message is a single JSON frame: either "results" or "rag".
    print(message)


def on_open(wsapp):
    # Send the search request as soon as the connection is established.
    query = {
        "id": "some-unique-request-id",
        "query": {
            "multi_match": {
                "fields": ["title"],
                "query": "matrix"
            }
        },
        "fields": ["title"],
        "rag": {
            "prompt": "Summarize search results for a query matrix",
            "model": "qwen2",
            "maxResponseLength": 10
        }
    }
    wsapp.send(json.dumps(query))


endpoint = "ws://localhost:8080/movies/_search"
ws = websocket.WebSocketApp(endpoint, on_message=on_message, on_open=on_open)
ws.run_forever()
A WebSocket response can be one of the following two frame types:
- `results`: the regular search results JSON payload, as for the REST endpoint.
- `rag`: a follow-up sequence of RAG responses, one frame per token, emitted while LLM inference is in progress.
results frame¶
A `results` frame has the following structure:
{
  "results": {
    "id": "some-unique-request-id",
    "took": 112,
    "hits": [
      {
        "_id": "604",
        "title": "The Matrix Reloaded",
        "_score": 0.016666668
      },
      {
        "_id": "605",
        "title": "The Matrix Revolutions",
        "_score": 0.016393442
      }
    ],
    "ts": 1722354191905
  }
}
- `results` (required, Response): a search results payload.
- `results.id` (optional, String): the parent search request id.
As WebSocket is an asynchronous protocol, in practice you can send multiple parallel search requests to the same WebSocket endpoint and receive multiple interleaved response streams within a single channel. The request/response `id` field can be used to distinguish between these response streams.
Note that unlike in the REST response, the `results.response` field is missing from the response payload: it is streamed token by token in the `rag` frames.
rag frame¶
A `rag` frame is a tiny frame that always follows the `results` frame:
{
  "rag": {
    "id": "some-unique-request-id",
    "token": " Matrix",
    "ts": 1722354192184,
    "took": 20,
    "last": false
  }
}
- `rag.id` (optional, string): the parent request id.
- `rag.token` (required, string): the next generated LLM token.
- `rag.ts` (required, long): the generation timestamp.
- `rag.took` (required, long): how many milliseconds the underlying LLM spent generating this token.
- `rag.last` (required, bool): whether this is the last token in the response stream.
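When several requests share one WebSocket connection, the `id` field is what lets a client separate the interleaved frames again. The sketch below shows one possible way to do that in Python. `demultiplex` is a hypothetical helper, and the frames in the example are made-up data that only follow the frame shapes documented above.

```python
import json
from collections import defaultdict


def demultiplex(raw_frames):
    """Hypothetical helper: group interleaved frames by their request id."""
    streams = defaultdict(lambda: {"hits": None, "tokens": [], "done": False})
    for raw in raw_frames:
        frame = json.loads(raw)
        if "results" in frame:
            payload = frame["results"]
            streams[payload.get("id")]["hits"] = payload["hits"]
        elif "rag" in frame:
            payload = frame["rag"]
            stream = streams[payload.get("id")]
            stream["tokens"].append(payload["token"])
            stream["done"] = payload["last"]
    # Join the tokens of every stream into the generated text.
    return {rid: {**s, "text": "".join(s["tokens"])} for rid, s in streams.items()}


# Made-up frames for two parallel requests, "a" and "b".
frames = [
    '{"results":{"id":"a","took":5,"hits":[],"ts":1}}',
    '{"results":{"id":"b","took":6,"hits":[],"ts":2}}',
    '{"rag":{"id":"a","token":"The","ts":3,"took":10,"last":false}}',
    '{"rag":{"id":"b","token":"Pizza","ts":4,"took":11,"last":false}}',
    '{"rag":{"id":"a","token":" Matrix","ts":5,"took":9,"last":false}}',
    '{"rag":{"id":"b","token":"","ts":6,"took":1,"last":true}}',
    '{"rag":{"id":"a","token":"","ts":7,"took":1,"last":true}}',
]
streams = demultiplex(frames)
print(streams["a"]["text"])  # The Matrix
print(streams["b"]["text"])  # Pizza
```

Since `id` is optional, frames without it end up under the `None` key in this sketch, so sending a unique id with every request is the safer choice when requests run in parallel.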
Assembling frames together¶
A series of `results` and `rag` frames can be combined into a single stream in the following way:
{"results":{"took":5,"hits":[{"_id":"604","title":"The Matrix Reloaded","_score":0.016666668},{"_id":"605","title":"The Matrix Revolutions","_score":0.016393442},{"_id":"157353","title":"Transcendence","_score":0.016129032},{"_id":"19995","title":"Avatar","_score":0.015873017},{"_id":"1538","title":"Collateral","_score":0.015625},{"_id":"264660","title":"Ex Machina","_score":0.015384615},{"_id":"1858","title":"Transformers","_score":0.015151516},{"_id":"1949","title":"Zodiac","_score":0.014925373},{"_id":"10681","title":"WALL·E","_score":0.014705882},{"_id":"766507","title":"Prey","_score":0.014492754}],"aggs":{},"id":"test1","ts":1722355042186}}
{"rag":{"id":"test1","token":"The","ts":1722355042502,"took":313,"last":false}}
{"rag":{"id":"test1","token":" Matrix","ts":1722355042518,"took":17,"last":false}}
{"rag":{"id":"test1","token":" Reload","ts":1722355042600,"took":82,"last":false}}
{"rag":{"id":"test1","token":"ed","ts":1722355042687,"took":86,"last":false}}
{"rag":{"id":"test1","token":":","ts":1722355042707,"took":21,"last":false}}
{"rag":{"id":"test1","token":" A","ts":1722355042759,"took":52,"last":false}}
{"rag":{"id":"test1","token":" novel","ts":1722355042778,"took":18,"last":false}}
{"rag":{"id":"test1","token":" adaptation","ts":1722355042801,"took":22,"last":false}}
{"rag":{"id":"test1","token":" of","ts":1722355042815,"took":16,"last":false}}
{"rag":{"id":"test1","token":" the","ts":1722355042833,"took":18,"last":false}}
{"rag":{"id":"test1","token":"","ts":1722355042834,"took":1,"last":true}}
- The `results` frame with search results is always the first one.
- If a `request.rag` field was present in the search request, the server will start streaming RAG response tokens.
- When the server finishes generating the RAG response, it sets the `last: true` flag to signal the end of the stream.
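The rules above translate into a small client-side fold over the frames of one request. The following Python sketch is one way to write it: `assemble_stream` is a hypothetical helper, and the frames are an abbreviated subset of the stream shown above (one hit, three `rag` frames).

```python
import json


def assemble_stream(lines):
    """Fold one request's frames into (hits, answer text, generation millis)."""
    hits, tokens, millis = [], [], 0
    for line in lines:
        frame = json.loads(line)
        if "results" in frame:
            # Always the first frame: the regular search results.
            hits = frame["results"]["hits"]
        elif "rag" in frame:
            rag = frame["rag"]
            tokens.append(rag["token"])
            millis += rag["took"]
            if rag["last"]:
                break  # last: true marks the end of the stream
    return hits, "".join(tokens), millis


frames = [
    '{"results":{"took":5,"hits":[{"_id":"604","title":"The Matrix Reloaded","_score":0.016666668}],"aggs":{},"id":"test1","ts":1722355042186}}',
    '{"rag":{"id":"test1","token":"The","ts":1722355042502,"took":313,"last":false}}',
    '{"rag":{"id":"test1","token":" Matrix","ts":1722355042518,"took":17,"last":false}}',
    '{"rag":{"id":"test1","token":"","ts":1722355042834,"took":1,"last":true}}',
]
hits, answer, millis = assemble_stream(frames)
print(answer)  # The Matrix
print(millis)  # 331
```

Summing `took` gives the total time the LLM spent generating tokens; in the sample stream the first token is by far the slowest one (313 ms versus roughly 20 to 90 ms for later tokens).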