ML model inference

Nixiesearch also exposes embedding and LLM chat completion APIs. These APIs are used internally for semantic search and RAG tasks, but you can also access them directly. Typical use cases:

  • adding an extra safety filter on top of RAG responses by prompting an LLM with a question like does {answer} answer the question {question}?.
  • detecting LLM hallucinations by embedding both the question and the RAG answer and computing the cosine similarity between them: a question and a proper answer should be close to each other in the embedding space (see the sketch after this list).
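
As an illustration, here is a minimal Python sketch of that embedding-based hallucination check. The endpoint path /inference/embedding/<model-name>, the request and response shapes, the host/port, and the 0.8 threshold are all assumptions made for the example, not the documented API contract; see the supported endpoints below for the actual schema.

# Hypothetical sketch of the hallucination check described above.
# The endpoint path and JSON shapes are assumptions, not the documented API.
import requests

NIXIESEARCH = "http://localhost:8080"  # assumed default host/port
MODEL = "my-embedding-model"           # <model-name> from the config

def embed(texts: list[str]) -> list[list[float]]:
    # Assumed request body: a list of input strings.
    resp = requests.post(
        f"{NIXIESEARCH}/inference/embedding/{MODEL}",
        json={"input": texts},
    )
    resp.raise_for_status()
    # Assumed response body: {"output": [{"embedding": [...]}, ...]}
    return [row["embedding"] for row in resp.json()["output"]]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

question = "What is the capital of France?"
answer = "The capital of France is Paris."
q_emb, a_emb = embed([question, answer])

# Low question-answer similarity hints at a hallucinated answer; the 0.8
# threshold is an arbitrary example value that you would tune per model.
if cosine(q_emb, a_emb) < 0.8:
    print("possible hallucination: similarity too low")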

Adding models for inference

To use an embedding or chat completion model for inference, you need to explicitly define it in the inference section of the config file:

inference:
  embedding:
    <model-name>:
      provider: onnx
      model: nixiesearch/e5-small-v2-onnx
      prompt:
        query: "query: "
        doc: "passage: "
  completion:
    <model-name>:
      provider: llamacpp
      model: Qwen/Qwen2-0.5B-Instruct-GGUF
      file: qwen2-0_5b-instruct-q4_0.gguf
      prompt: qwen2

The inference section (and its embedding/completion sub-sections) is optional and is not needed if you only do lexical search. See the full config file reference for all configuration options and the list of supported LLM and embedding models.
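
Once a completion model is defined, you can call it directly, for example to implement the RAG safety-filter use case from above. The following Python sketch assumes an endpoint like /inference/completion/<model-name> and a simple prompt/output JSON shape; the exact path and schema are assumptions, so check the API reference for the real contract.

# Hypothetical sketch of calling a configured completion model.
# Endpoint path and JSON fields are assumptions, not the documented API.
import requests

NIXIESEARCH = "http://localhost:8080"  # assumed default host/port
MODEL = "my-chat-model"                # <model-name> from the config

question = "What is the capital of France?"
answer = "The capital of France is Paris."

resp = requests.post(
    f"{NIXIESEARCH}/inference/completion/{MODEL}",
    json={
        # Assumed fields: a single prompt string and a token budget.
        "prompt": f"Does '{answer}' answer the question '{question}'? Reply yes or no.",
        "max_tokens": 8,
    },
)
resp.raise_for_status()
# Assumed response body: {"output": "..."}
print(resp.json()["output"])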

Note

Inference for both embedding and completion models can also be done on GPUs.

Supported endpoints

Nixiesearch supports the following inference endpoints: