Sentence Transformers
Supported embedding models¶
Nixiesearch supports any sentence-transformers-compatible model in the ONNX format, as long as:

- there is an ONNX model provided in the repo (e.g. a `model.onnx` file),
- input tensor shapes are supported,
- Nixiesearch can correctly guess the query and document prompt format (like the E5 family of models, which requires `query:` and `passage:` prefixes),
- the embedding pooling method is supported: `CLS` or `mean`.

The models listed in the table below are tested to work well with Nixiesearch.
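The two supported pooling methods can be sketched in NumPy terms (an illustrative sketch, not Nixiesearch code): `CLS` takes the embedding of the first token, while `mean` averages the embeddings of all non-padding tokens.

```python
import numpy as np

# Sketch of the two pooling methods Nixiesearch supports, applied to the
# ONNX encoder output for a single text.
# token_embeddings: (seq_len, dim) last_hidden_state;
# attention_mask:   (seq_len,) with 1 for real tokens, 0 for padding.

def cls_pool(token_embeddings: np.ndarray) -> np.ndarray:
    # CLS pooling: the embedding of the first ([CLS]) token.
    return token_embeddings[0]

def mean_pool(token_embeddings: np.ndarray,
              attention_mask: np.ndarray) -> np.ndarray:
    # Mean pooling: average token embeddings, ignoring padding positions.
    mask = attention_mask[:, None].astype(token_embeddings.dtype)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()
```

Which method a given model expects is fixed at training time, which is why using the wrong one silently degrades retrieval quality.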
Note

Nixiesearch can automatically guess the proper prompt format and pooling method for all the models in the supported list table below. You can override this behavior in the model configuration section with the `pooling` and `prompt` parameters.
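For example, an override for an unlisted model might look like the following hypothetical snippet (the exact keys and accepted values are described in the embedding model configuration section; the prefix strings shown are E5-style and are an assumption, not taken from this page):

```yaml
inference:
  embedding:
    my-model:
      model: /path/to/model/dir
      pooling: mean            # override the guessed pooling: CLS or mean
      prompt:
        query: "query: "       # prefix applied to search queries (assumed E5-style)
        doc: "passage: "       # prefix applied to indexed documents
```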
List of supported models¶
Name | Size (params) | Seqlen (tokens) | Dimensions | Prompt | Pooling |
---|---|---|---|---|---|
sentence-transformers/all-MiniLM-L6-v2 | 22M | 512 | 384 | not needed | mean |
sentence-transformers/all-MiniLM-L12-v2 | 33M | 512 | 384 | not needed | mean |
sentence-transformers/all-mpnet-base-v2 | 109M | 384 | 384 | not needed | mean |
intfloat/e5-small | 33M | 512 | 384 | query+doc | mean |
intfloat/e5-base | 109M | 512 | 768 | query+doc | mean |
intfloat/e5-large | 335M | 512 | 1024 | query+doc | mean |
intfloat/e5-small-v2 | 33M | 512 | 384 | query+doc | mean |
intfloat/e5-base-v2 | 109M | 512 | 768 | query+doc | mean |
intfloat/e5-large-v2 | 335M | 512 | 1024 | query+doc | mean |
intfloat/multilingual-e5-small | 118M | 512 | 384 | auto | mean |
intfloat/multilingual-e5-base | 278M | 512 | 768 | auto | mean |
intfloat/multilingual-e5-large | 560M | 512 | 1024 | auto | mean |
Alibaba-NLP/gte-base-en-v1.5 | 137M | 8192 | 768 | not needed | CLS |
Alibaba-NLP/gte-large-en-v1.5 | 434M | 8192 | 1024 | not needed | CLS |
Alibaba-NLP/gte-modernbert-base | 149M | 8192 | 768 | not needed | CLS |
Snowflake/snowflake-arctic-embed-s | 33M | 512 | 384 | query | CLS |
Snowflake/snowflake-arctic-embed-xs | 22M | 512 | 384 | query | CLS |
Snowflake/snowflake-arctic-embed-m | 109M | 512 | 768 | query | CLS |
Snowflake/snowflake-arctic-embed-m-v1.5 | 109M | 512 | 768 | query | CLS |
Snowflake/snowflake-arctic-embed-m-v2.0 | 109M | 512 | 768 | query | CLS |
Snowflake/snowflake-arctic-embed-l | 335M | 512 | 1024 | query | CLS |
Snowflake/snowflake-arctic-embed-l-v2.0 | 568M | 512 | 1024 | query | CLS |
BAAI/bge-small-en-v1.5 | 33M | 512 | 384 | query | mean |
BAAI/bge-base-en-v1.5 | 109M | 512 | 768 | query | mean |
BAAI/bge-large-en-v1.5 | 335M | 512 | 1024 | query | mean |
BAAI/bge-small-zh-v1.5 | 33M | 512 | 384 | query | mean |
BAAI/bge-base-zh-v1.5 | 109M | 512 | 768 | query | mean |
BAAI/bge-large-zh-v1.5 | 326M | 512 | 1024 | query | mean |
BAAI/bge-m3 | 560M | 8192 | 1024 | not needed | mean |
WhereIsAI/UAE-Large-V1 | 335M | 512 | 1024 | query | mean |
mixedbread-ai/mxbai-embed-large-v1 | 335M | 512 | 1024 | query | CLS |
jinaai/jina-embeddings-v3 | 572M | 8192 | 1024 | query+doc | mean |
NovaSearch/stella_en_400M_v5 | 435M | 4096 | 8192 | query | mean |
If a model is not listed in this table but has an ONNX file available, it will most likely still work, but you may need to set the `prompt` and `pooling` parameters based on the model's documentation. See the embedding model configuration section for more details.
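The Prompt column above can be read as in the following plain-Python sketch of the assumed behavior. The prefix strings are the E5-style ones and are an assumption; other families (e.g. BGE, Arctic) use their own query instructions, and the `auto` mode (where Nixiesearch picks the format itself) is omitted here.

```python
# Sketch of how the "Prompt" column maps to text prefixes applied before
# embedding. Prefix strings are assumed E5-style, for illustration only.
PREFIXES = {
    "not needed": {"query": "", "doc": ""},                   # symmetric models
    "query":      {"query": "query: ", "doc": ""},            # queries only
    "query+doc":  {"query": "query: ", "doc": "passage: "},   # both sides
}

def format_text(text: str, role: str, prompt_mode: str) -> str:
    """role is 'query' or 'doc'; prompt_mode is a value from the Prompt column."""
    return PREFIXES[prompt_mode][role] + text
```

The point of the prefixes is that asymmetric models embed queries and documents into different regions of the same space, so feeding unprefixed text to such a model degrades retrieval quality.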
Model handles¶
Nixiesearch supports loading models directly from Huggingface by their handle (e.g. `sentence-transformers/all-MiniLM-L6-v2`) and from a local directory.

You can reference any HF model handle in the `inference` block, for example:

```yaml
inference:
  embedding:
    e5-small:
      model: sentence-transformers/all-MiniLM-L6-v2
```

To load a model from a local directory, pass its path instead:

```yaml
inference:
  embedding:
    your-model:
      model: /path/to/model/dir
```
Optionally, you can define which particular ONNX file to load, for example the QInt8-quantized one:

```yaml
inference:
  embedding:
    # Used for semantic retrieval
    e5-small:
      model: nixiesearch/e5-small-v2-onnx
      file: model_opt2_QInt8.onnx
```
Converting your own model¶
You can use the nixiesearch/onnx-convert tool to convert your own model:

```shell
python convert.py --model_id intfloat/multilingual-e5-base --optimize 2 --quantize QInt8
```
The conversion produces output like this:

```text
Conversion config: ConversionArguments(model_id='intfloat/multilingual-e5-base', quantize='QInt8', output_parent_dir='./models/', task='sentence-similarity', opset=None, device='cpu', skip_validation=False, per_channel=True, reduce_range=True, optimize=2)
Exporting model to ONNX
Framework not specified. Using pt to export to ONNX.
Using the export variant default. Available variants are:
    - default: The default ONNX variant.
Using framework PyTorch: 2.1.0+cu121
Overriding 1 configuration item(s)
    - use_cache -> False
Post-processing the exported models...
Deduplicating shared (tied) weights...
Validating ONNX model models/intfloat/multilingual-e5-base/model.onnx...
    -[✓] ONNX model output names match reference model (last_hidden_state)
    - Validating ONNX Model output "last_hidden_state":
        -[✓] (2, 16, 768) matches (2, 16, 768)
        -[✓] all values close (atol: 0.0001)
The ONNX export succeeded and the exported model was saved at: models/intfloat/multilingual-e5-base
Export done
Processing model file ./models/intfloat/multilingual-e5-base/model.onnx
ONNX model loaded
Optimizing model with level=2
Optimization done, quantizing to QInt8
```