
Text embedding inference

Text embeddings map text inputs to numerical vectors in such a way that the embeddings of a query and a relevant document are close to each other in terms of cosine similarity.
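To make "close in terms of cosine similarity" concrete, here is a minimal sketch in plain Python. The vectors are toy three-dimensional values made up for illustration; real embeddings have as many dimensions as the configured model produces.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, invented for this example only.
query = [1.0, 0.0, 1.0]
relevant_doc = [2.0, 0.0, 2.0]   # points in the same direction as the query
unrelated_doc = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query, relevant_doc))
print(cosine_similarity(query, unrelated_doc))
```

A score near 1 means the two vectors point in nearly the same direction, while a score near 0 means they are unrelated.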

Configuration file

To use a text embedding model for search or for inference only, configure the model in the inference.embedding section of the config file:

inference:
  embedding:
    your-model-name:
      provider: onnx
      model: nixiesearch/e5-small-v2-onnx
      prompt:
        query: "query: "
        doc: "passage: "


  • provider: required, string. As of v0.3.0, only the onnx provider is supported.
  • model: required, string. A Hugging Face handle, or an HTTP, local, or S3 URL for the model. See the model URL reference for more details on how to load your model.
  • prompt: optional. Document and query prefixes for asymmetrical models.

See inference.embedding config file reference for all advanced options of the ONNX provider.

Nixiesearch supports the following set of models:


Many embedding models (like E5, BGE and GTE) require a specific prompt prefix for documents and queries to achieve optimal predictive performance. Please consult the model documentation for the expected format.
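As a rough illustration of what a prompt prefix does, the sketch below prepends the prefix for a given input type to the text before it would be embedded. The helper apply_prompt and the PROMPTS mapping are hypothetical names used only for this example, and the prefixes are the E5 ones from the configuration sample above. It assumes the prefix is plain string concatenation, which may not match how every model or provider handles prompts.

```python
# Prefixes taken from the E5 configuration sample; other models may expect different ones.
PROMPTS = {"query": "query: ", "document": "passage: "}


def apply_prompt(text, input_type="raw", prompts=PROMPTS):
    """Prepend the prefix for the input type; unknown or raw types get no prefix."""
    return prompts.get(input_type, "") + text


print(apply_prompt("what is love?", "query"))
print(apply_prompt("baby don't hurt me no more", "document"))
print(apply_prompt("hello"))
```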

Sending requests

After your model is configured in the inference.embedding section of the config file, you can send requests to the Nixiesearch endpoint /inference/embedding/your-model-name:

curl -v -d '{"input": [{"text": "hello"}]}' http://localhost:8080/inference/embedding/your-model-name

A full request payload looks like this:

{
  "input": [
    {"text": "what is love?", "type": "query"},
    {"text": "baby don't hurt me no more", "type": "document"}
  ]
}

  • input: list of objects, required. One or more texts to compute embeddings for.
  • input.text: string, required. The text to embed.
  • input.type: string, optional, default raw. The type of the input: query, document, or raw. Some asymmetrical embedding models like E5/GTE/BGE produce different embeddings for queries and documents.
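The same request can be sent from code. The following is a minimal sketch using only the Python standard library. The helper names build_payload and embed are hypothetical, the host, port and model name are the placeholders from the curl example, and sending a Content-Type: application/json header is an assumption rather than a documented requirement.

```python
import json
import urllib.request


def build_payload(texts, input_type=None):
    """Build the request body; "type" is omitted when no input type is given."""
    items = []
    for text in texts:
        item = {"text": text}
        if input_type is not None:
            item["type"] = input_type
        items.append(item)
    return {"input": items}


def embed(texts, input_type=None,
          base_url="http://localhost:8080", model="your-model-name"):
    """POST the texts to the embedding endpoint and return the parsed JSON response."""
    url = f"{base_url}/inference/embedding/{model}"
    body = json.dumps(build_payload(texts, input_type)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},  # assumed, not confirmed by the docs
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Shows the JSON body that would be sent; no server is needed for this line.
print(json.dumps(build_payload(["what is love?"], "query")))
```

Calling embed(["what is love?"], "query") requires a running Nixiesearch instance with the model configured under the given name.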


When you embed many documents at once, Nixiesearch internally batches them together according to the inference model configuration. It is OK to send large chunks of documents via the inference API: they will be split into internal batches for better performance.
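The general idea of splitting a large input into fixed-size batches can be sketched as follows. This illustrates the technique only and is not Nixiesearch's actual implementation; the function name and the batch_size parameter are hypothetical.

```python
def split_into_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


docs = ["doc-a", "doc-b", "doc-c", "doc-d", "doc-e"]
for batch in split_into_batches(docs, batch_size=2):
    print(batch)
```

Because the chunks are consecutive, concatenating the per-batch results restores the original document order.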

Embedding responses

A typical embedding response looks like this:

{
  "output": [
    {
      "embedding": [
        ...
      ]
    }
  ],
  "took": 4
}


  • output: required, list of objects. Contains document embeddings in the same order as in the request.
  • output.embedding: required, list of numbers. A vector of document embedding values. The dimensionality matches the configured embedding model.
  • took: required, int. Number of milliseconds spent processing the request.
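Since the embeddings come back in request order, they can be matched to the input texts by position. The sketch below assumes the response shape described above. The helper pair_with_inputs is a hypothetical name, and the two-dimensional vectors in sample_response are made-up values, not real model output.

```python
def pair_with_inputs(texts, response):
    """Return (text, embedding) pairs, relying on output order matching input order."""
    outputs = response["output"]
    if len(outputs) != len(texts):
        raise ValueError("response does not contain one embedding per input text")
    return [(text, item["embedding"]) for text, item in zip(texts, outputs)]


# Hand-written sample with invented values, shaped like an embedding response.
sample_response = {
    "output": [
        {"embedding": [0.1, 0.2]},
        {"embedding": [0.3, 0.4]},
    ],
    "took": 4,
}

texts = ["what is love?", "baby don't hurt me no more"]
for text, embedding in pair_with_inputs(texts, sample_response):
    print(text, len(embedding))
```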