Text embedding inference

Text embeddings map text inputs to numerical vectors in such a way that the embeddings of a query and of a relevant document are close in terms of cosine similarity. Multiple embedding providers are supported:

Local inference with Sentence-Transformers and ONNX - Any SBERT-compatible embedding model with local inference using ONNX runtime.
OpenAI - Using external OpenAI-compatible APIs to compute embeddings.
Cohere - Using Cohere v2/embed APIs for embeddings.
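To make the "close in terms of cosine similarity" idea concrete, here is a minimal Python sketch. The function name and the toy 3-dimensional vectors are made up for illustration; real embeddings come from the configured model and have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors, not real model output.
query = [1.0, 0.0, 1.0]
relevant_doc = [2.0, 0.0, 2.0]   # points in the same direction as the query
unrelated_doc = [0.0, 1.0, 0.0]  # orthogonal to the query

score_relevant = cosine_similarity(query, relevant_doc)
score_unrelated = cosine_similarity(query, unrelated_doc)
```

A higher score means the two texts are closer in the embedding space, so `score_relevant` ends up larger than `score_unrelated` here.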

Configuration file

To use a text embedding model, whether for search or for inference only, configure the model in the inference.embedding section of the config file:

inference:
  embedding:
    your-model-name:
      model: intfloat/e5-small-v2

You can also use OpenAI or any other API-based embedding provider:

inference:
  embedding:
    your-model-name:
      provider: openai
      model: text-embedding-3-small

A full configuration looks like this:

inference:
  embedding:
    your-model-name:
      provider: onnx
      model: intfloat/e5-small-v2
      cache:
        memory:
          max_size: 32768
      prompt: # (1)
        query: "query: "
        doc: "passage: "

(1) The prompt block is optional: Nixiesearch can correctly guess the prompt format for all the supported models.

Fields:

provider: optional, string. As of v0.5.0, only the onnx, openai and cohere providers are supported. Default: auto-detected.
model: required, string. A Hugging Face handle, or an HTTP/Local/S3 URL for the model. See model URL reference for more details on how to load your model.
prompt: optional. Document and query prefixes for asymmetrical models. Default: auto-detected.
cache: optional. Computing embeddings is a latency-heavy operation, so embeddings of frequently seen strings can be cached. See Embedding caching for more details.

See the inference.embedding config file reference for all advanced options of the ONNX provider.

Nixiesearch supports the following set of models:

any sentence-transformers compatible embedding model in the ONNX format. See the list of supported pre-converted models Nixiesearch already has, or check out the guide on how to convert your own model.
Version 0.3.0 supported only the ONNX provider for embedding inference. The OpenAI and Cohere providers are available as of v0.5.0 (see the provider field above); mxb and Google providers were listed on the roadmap.

Note

Many embedding models (such as E5, BGE and GTE) require a specific prompt prefix for documents and queries to reach optimal predictive performance. Please consult the model documentation for the expected format.
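Conceptually, a prompt prefix is a string prepended to the input text before it is embedded. The Python sketch below illustrates that idea with the E5-style prefixes from the configuration example above. `PROMPT` and `apply_prompt` are hypothetical names, not part of Nixiesearch, and passing raw inputs through without a prefix is an assumption of this sketch rather than documented behaviour.

```python
# Hypothetical prefix table mirroring the `prompt` block of the config example.
PROMPT = {"query": "query: ", "document": "passage: "}

def apply_prompt(text, input_type="raw", prompt=PROMPT):
    """Prepend the prefix configured for this input type, if there is one."""
    # Assumption: "raw" (or any type without a prefix) is passed through unchanged.
    return prompt.get(input_type, "") + text

prefixed_query = apply_prompt("what is love?", "query")
prefixed_doc = apply_prompt("baby don't hurt me no more", "document")
unprefixed = apply_prompt("hello")
```

With an asymmetrical model, the same text embedded as a query and as a document therefore reaches the model as two different strings, which is why the two embeddings differ.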

Sending requests

After your model is configured in the inference.embedding section of the config file, you can send requests to the Nixiesearch endpoint /inference/embedding/your-model-name:

curl -v -d '{"input": [{"text": "hello"}]}' http://localhost:8080/inference/embedding/your-model-name

A full request payload looks like this:

{
  "input": [
    {"text": "what is love?", "type": "query"},
    {"text": "baby don't hurt me no more", "type": "document"}
  ]
}

input: list of objects, required. One or more texts to compute embeddings for.
input.text: string, required. The text to embed.
input.type: string, optional, default raw. The type of the input: query, document or raw. Some asymmetrical embedding models, such as E5, GTE and BGE, produce different embeddings for queries and documents.
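The same request can be assembled from Python using only the standard library. This is a sketch under a few assumptions: `build_embedding_request` is a hypothetical helper, the host and model name are the placeholders used in the curl example, and the Content-Type header is added here as a common convention, not something the curl example sets.

```python
import json
import urllib.request

def build_embedding_request(inputs, model="your-model-name",
                            base_url="http://localhost:8080"):
    """Build a POST request for a list of (text, type) pairs."""
    payload = {
        "input": [{"text": text, "type": input_type} for text, input_type in inputs]
    }
    return urllib.request.Request(
        url=f"{base_url}/inference/embedding/{model}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption, see above
        method="POST",
    )

request = build_embedding_request([
    ("what is love?", "query"),
    ("baby don't hurt me no more", "document"),
])

# Sending it requires a running Nixiesearch instance:
# with urllib.request.urlopen(request) as response:
#     result = json.loads(response.read())
```

Building the request separately from sending it keeps the payload easy to inspect before anything goes over the network.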

Note

When you embed many documents at once, Nixiesearch batches them internally according to the inference model configuration. It is OK to send large chunks of documents via the inference API: they are split into internal batches for better performance.

Embedding responses

A typical embedding response looks like this:

{
  "output": [
    {
      "embedding": [
        -0.43652993,
        0.21856548,
        0.011309982
      ]
    }
  ],
  "took": 4
}

Fields:

output: required, list of objects. Contains the embeddings in the same order as the inputs in the request.
output.embedding: required, list of numbers. The embedding vector. Its dimensionality matches the configured embedding model.
took: required, int. Number of milliseconds spent processing the request.
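Because the output preserves the request order, a client can pair each input text with its vector by position. A minimal Python sketch, using the sample response above as a literal string; `parse_embeddings` is a hypothetical helper name, and the input text is a stand-in.

```python
import json

def parse_embeddings(response_body):
    """Return (list of embedding vectors, processing time in ms)."""
    response = json.loads(response_body)
    vectors = [item["embedding"] for item in response["output"]]
    return vectors, response["took"]

# The sample response from this section, as the raw body a client would receive.
sample_body = """
{
  "output": [
    {"embedding": [-0.43652993, 0.21856548, 0.011309982]}
  ],
  "took": 4
}
"""

texts = ["hello"]  # the inputs that were sent, in request order
vectors, took_ms = parse_embeddings(sample_body)

# Pair every input with its embedding by position.
embedding_by_text = dict(zip(texts, vectors))
```

A real response carries one vector per input, each as long as the model's embedding dimensionality; the three-element vector here is only the shortened sample.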