
Language support

Nixiesearch language support differs between lexical and semantic search:

* for lexical search, all Lucene analyzers are supported out of the box (see the full list below);
* for semantic search, language support depends on the underlying embedding model. See the notes on language support for semantic search below.

Language can be set in the index mapping for text-like fields:

schema:
  your-index-name:
    fields:
      title:
        type: text
        search: semantic
        language: en # use an ISO-639-1 language code

If language is not defined, a special default analyzer is used: no language-specific transformations are applied, only ICU tokenization.
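To illustrate the fallback described above, here is a minimal sketch of a mapping with two text fields: one with an explicit language and one without. The `sku` field name and the `search: lexical` value are assumptions made for this example and are not confirmed by this page, so check them against the configuration reference for your Nixiesearch version.

```yaml
schema:
  your-index-name:
    fields:
      title:
        type: text
        search: lexical # assumed value for the lexical search method
        language: de # German, code taken from the language table
      sku: # hypothetical field name, for illustration only
        type: text
        search: lexical # assumed value, as above
        # no language set: the default analyzer (ICU tokenization only) applies
```

Omitting `language` can be a reasonable choice for fields such as identifiers or mixed-language text, where stemming for one specific language could hurt more than it helps.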

For lexical search, Nixiesearch supports all languages provided by the Apache Lucene library:

Language ISO 639-1 code
Generic default
English en
Arabic ar
Bulgarian bg
Bengali bn
Brazilian Portuguese br
Catalan ca
Simplified Chinese zh
Czech cz
Danish da
German de
Greek el
Spanish es
Estonian et
Basque eu
Persian fa
Finnish fi
French fr
Irish ga
Hindi hi
Hungarian hu
Armenian hy
Indonesian id
Italian it
Lithuanian lt
Latvian lv
Dutch nl
Norwegian no
Portuguese pt
Romanian ro
Russian ru
Serbian sr
Swedish sv
Thai th
Turkish tr
Japanese ja
Polish po
Korean kr
Tamil ta
Ukrainian ua

If your language is not included in the list, please file a GitHub issue: https://github.com/nixiesearch/nixiesearch/issues

Language support for semantic search depends entirely on the embedding model used.

If your target language is English, you can choose almost any model from the MTEB leaderboard. For multilingual use, the intfloat/multilingual-e5-large model is recommended.