LLM chat completions inference

Completion models take an input prompt and generate a response. Internally they are used by the RAG search endpoint to summarise search results into a single answer, but you can also call them directly through the /inference/completion/ REST endpoint.

Configuring a model

To use an LLM for completion inference or for RAG search, you need to define it explicitly in the inference.completion section of the config file:

inference:
  completion:
    your-model-name:
      provider: llamacpp
      model: Qwen/Qwen2-0.5B-Instruct-GGUF
      file: qwen2-0_5b-instruct-q4_0.gguf
      prompt: qwen2

Fields:

  • provider: required, string. As of v0.3.0, only llamacpp is supported. Other SaaS providers like OpenAI, Cohere, mxb and Google are on the roadmap.
  • model: required, string. A Huggingface handle, or an HTTP/Local/S3 URL for the model. See model URL reference for more details on how to load your model.
  • file: optional, string. A file name within the model repository, used when it contains multiple files, which is typical for quantized models.
  • prompt: required, string. A prompt format used for the LLM. See Supported LLM prompts for more details.
  • options: optional, object. A dict of llamacpp-specific options.

See the inference.completion section in the config file reference for more details on advanced provider options.

Sending requests

After you have configured your completion model, it becomes available for inference on the /inference/completion/<your-model-name> REST endpoint:

curl -d '{"prompt": "what is 2+2? answer as haiku", "max_tokens": 32}' http://localhost:8080/inference/completion/your-model-name

A full request payload looks like this:

{
  "prompt": "what is 2+2? answer as haiku",
  "max_tokens": 32,
  "stream": false
}

Fields:

  • prompt: required, string. A prompt to process. Before the actual inference, the prompt text is pre-processed using the prompt template of the particular model.
  • max_tokens: required, int. The maximum number of tokens to generate. Treat it as a safety net for cases when the model cannot stop generating.
  • stream: optional, boolean, default false. Should the response be in a streaming format? If true, the server responds with a sequence of Server-Sent Events. See the Streaming responses section for more details.

Non-streaming responses

A non-streaming regular HTTP response looks like this:

{
  "output": "Two\nOne, one\nSumming up\nEqual to\n2",
  "took": 191
}

Response fields:

  • output: required, string. Generated answer for the prompt.
  • took: required, int. Number of milliseconds spent processing the request.
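As a minimal sketch, the same request can be sent from Python with the requests library; the host, port and model name are the ones used in the examples above and may differ in your setup:

import requests

# Send a non-streaming completion request; host, port and model name
# match the examples above and may differ in your setup.
response = requests.post(
    "http://localhost:8080/inference/completion/your-model-name",
    json={"prompt": "what is 2+2? answer as haiku", "max_tokens": 32, "stream": False},
)
response.raise_for_status()
result = response.json()
print(result["output"])  # the generated answer
print(result["took"])    # milliseconds spent processing the request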

Streaming responses

When a completion request has the "stream": true flag set, Nixiesearch emits a Server-Sent Event for each generated token. This can be used to build ChatGPT-style interfaces where the response appears in real time.

So for a completion request:

{
  "prompt":"what is 2+2? answer short", 
  "max_tokens": 32, 
  "stream": true
}

The server will respond with an SSE payload carrying the Content-Type: text/event-stream header:

curl -v -d '{"prompt":"what is 2+2? answer short", "max_tokens": 32, "stream": true}'\
   http://localhost:8080/inference/completion/qwen2

< HTTP/1.1 200 OK
< Date: Fri, 13 Sep 2024 16:29:11 GMT
< Connection: keep-alive
< Content-Type: text/event-stream
< Transfer-Encoding: chunked
< 
event: generate
data: {"token":"2","took":34,"last":false}

event: generate
data: {"token":"+","took":11,"last":false}

event: generate
data: {"token":"2","took":11,"last":false}

event: generate
data: {"token":" =","took":14,"last":false}

event: generate
data: {"token":" ","took":16,"last":false}

event: generate
data: {"token":"4","took":14,"last":false}

event: generate
data: {"token":"","took":13,"last":false}

event: generate
data: {"token":"","took":1,"last":true}

Each SSE frame has the following payload:

{
  "token": "wow",
  "took": 10,
  "last": false
}

Fields:

  • token: required, string. The next generated token. To get the full string, concatenate all tokens together (see the sketch below).
  • took: required, int. Number of milliseconds spent generating this token. The first token also accounts for prompt processing time, so expect it to be noticeably larger.
  • last: required, boolean. Is this the last token? SSE has no notion of stream end, so use this field to detect that the stream has finished.
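
As a rough sketch of a streaming client, the frames above can be consumed from Python with the requests library and reassembled into the full answer; the endpoint and model name are assumed to match the earlier examples:

import json
import requests

# Stream tokens over SSE and concatenate them into the full answer.
# The endpoint and model name follow the earlier examples.
with requests.post(
    "http://localhost:8080/inference/completion/your-model-name",
    json={"prompt": "what is 2+2? answer short", "max_tokens": 32, "stream": True},
    stream=True,
) as response:
    response.raise_for_status()
    tokens = []
    for line in response.iter_lines(decode_unicode=True):
        # Each "event: generate" frame carries its JSON payload on a "data: ..." line.
        if not line or not line.startswith("data:"):
            continue
        frame = json.loads(line[len("data:"):].strip())
        tokens.append(frame["token"])
        if frame["last"]:
            break

print("".join(tokens))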