Nixiesearch: batteries included search engine
What is Nixiesearch?
Nixiesearch is a modern search engine that runs on S3-compatible storage. We built it after dealing with the headaches of running large Elastic/OpenSearch clusters (here's the blog post full of pain), and here’s why it’s awesome:
- Powered by Apache Lucene: You get support for 39 languages, facets, advanced filters, autocomplete suggestions, and the familiar sorting features you’re used to.
- Decoupled S3-based storage and compute: There's nothing to break. You get risk-free backups, upgrades, schema changes and auto-scaling, all on a stateless index stored in S3.
- Pull indexing: Supports both offline and online incremental indexing using an Apache Spark based ETL process. No more POSTing JSON blobs to a production cluster (and overloading it).
- No state inside the cluster: All changes (settings, indexes, etc.) are just config updates, which makes blue-green deployments of index changes a breeze.
- AI batteries included: Embedding and LLM inference, with first-class RAG API support.
Search is never easy, but Nixiesearch has your back. It takes care of the toughest parts—like reindexing, capacity planning, and maintenance—so you can save time (and your sanity).
Note
Want to learn more? Go straight to the quickstart and check out the live demo.
What Nixiesearch is not
- Nixiesearch is not a database, and was never meant to be. Nixiesearch is a search index for consumer-facing apps to find the top-N most relevant documents for a query. For analytical cases, consider using good old SQL with ClickHouse or Snowflake.
- Not a tool to search for logs. Log search is about throughput, and Nixiesearch is about relevance. If you plan to use Nixiesearch as a log storage system, please don't: consider ELK or Quickwit as better alternatives.
The difference
Our elasticsearch cluster has been a pain in the ass since day one with the main fix always "just double the size of the server" to the point where our ES cluster ended up costing more than our entire AWS bill pre-ES [HN source]
When your search cluster goes red yet again because you accidentally sent the wrong JSON to the wrong REST endpoint, you can just write your own S3-based search engine like the big guys do:
- Uber: Lucene: Uber’s Search Platform Version Upgrade.
- Amazon: E-Commerce search at scale on Apache Lucene.
- Doordash: Introducing DoorDash’s in-house search engine.
Nixiesearch was inspired by these search engines, but is open-source. Decoupling search and storage makes ops simpler. Making your search configuration immutable makes it even simpler.
How is it different from popular search engines?
- vs Elastic: Embedding inference, hybrid search and reranking are free and open-source. For ES these are part of the proprietary cloud.
- vs OpenSearch: While OpenSearch can use S3-based segment replication, Nixiesearch can also offload cluster state to S3.
- vs Qdrant and Weaviate: Not a sidecar search engine to handle just vector search. Autocomplete, facets, RAG and embedding inference out of the box.
Try it out
Get the sample MSRD: Movie Search Ranking Dataset:
curl -L -o movies.jsonl https://nixiesearch.ai/data/movies.jsonl
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   162  100   162    0     0   3636      0 --:--:-- --:--:-- --:--:--  3681
100 32085  100 32085    0     0   226k      0 --:--:-- --:--:-- --:--:--  226k
Create an index mapping for the movies index in a file config.yml:
inference:
  embedding:
    e5-small:
      model: intfloat/e5-small-v2
schema:
  movies: # index name
    fields:
      title: # field name
        type: text
        search:
          type: hybrid
          model: e5-small
        language: en # language is needed for lexical search
        suggest: true
      overview:
        type: text
        search:
          type: hybrid
          model: e5-small
        language: en
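In the mapping above, each field's search.model points at a name declared under inference.embedding (here, e5-small). A minimal Python sketch of that cross-reference check follows. The helper undeclared_models is hypothetical and not part of Nixiesearch, and the config is written as a plain dict that mirrors config.yml so no YAML library is needed.

```python
def undeclared_models(config: dict) -> list[tuple[str, str, str]]:
    """Return (index, field, model) for every field whose search.model
    is not declared under inference.embedding.

    Hypothetical helper for illustration; not a Nixiesearch API.
    """
    declared = set(config.get("inference", {}).get("embedding", {}))
    missing = []
    for index_name, index in config.get("schema", {}).items():
        for field_name, field in index.get("fields", {}).items():
            model = field.get("search", {}).get("model")
            if model is not None and model not in declared:
                missing.append((index_name, field_name, model))
    return missing


# The same structure as config.yml, expressed as a dict.
config = {
    "inference": {"embedding": {"e5-small": {"model": "intfloat/e5-small-v2"}}},
    "schema": {
        "movies": {
            "fields": {
                "title": {
                    "type": "text",
                    "search": {"type": "hybrid", "model": "e5-small"},
                    "language": "en",
                    "suggest": True,
                },
                "overview": {
                    "type": "text",
                    "search": {"type": "hybrid", "model": "e5-small"},
                    "language": "en",
                },
            }
        }
    },
}

if __name__ == "__main__":
    print(undeclared_models(config))
```

An empty result means every hybrid field refers to a declared embedding model. Renaming e5-small in only one place would show up as a (index, field, model) tuple.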
Run the Nixiesearch docker container:
docker run -itp 8080:8080 -v .:/data nixiesearch/nixiesearch:latest standalone -c /data/config.yml
a.nixiesearch.index.sync.LocalIndex$ - Local index movies opened
ai.nixiesearch.index.Searcher$ - opening index movies
a.n.main.subcommands.StandaloneMode$ - ███╗ ██╗██╗██╗ ██╗██╗███████╗███████╗███████╗ █████╗ ██████╗ ██████╗██╗ ██╗
a.n.main.subcommands.StandaloneMode$ - ████╗ ██║██║╚██╗██╔╝██║██╔════╝██╔════╝██╔════╝██╔══██╗██╔══██╗██╔════╝██║ ██║
a.n.main.subcommands.StandaloneMode$ - ██╔██╗ ██║██║ ╚███╔╝ ██║█████╗ ███████╗█████╗ ███████║██████╔╝██║ ███████║
a.n.main.subcommands.StandaloneMode$ - ██║╚██╗██║██║ ██╔██╗ ██║██╔══╝ ╚════██║██╔══╝ ██╔══██║██╔══██╗██║ ██╔══██║
a.n.main.subcommands.StandaloneMode$ - ██║ ╚████║██║██╔╝ ██╗██║███████╗███████║███████╗██║ ██║██║ ██║╚██████╗██║ ██║
a.n.main.subcommands.StandaloneMode$ - ╚═╝ ╚═══╝╚═╝╚═╝ ╚═╝╚═╝╚══════╝╚══════╝╚══════╝╚═╝ ╚═╝╚═╝ ╚═╝ ╚═════╝╚═╝ ╚═╝
a.n.main.subcommands.StandaloneMode$ -
o.h.ember.server.EmberServerBuilder - Ember-Server service bound to address: [::]:8080
Build an index for a hybrid search:
curl -XPUT -d @movies.jsonl http://localhost:8080/movies/_index
{"result":"created","took":8256}
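The same upload can be sketched in Python with only the standard library. The endpoint, port and HTTP method are taken from the curl command above. The function names are made up for this example, and sending the file bytes verbatim is an assumption about what the endpoint accepts.

```python
import urllib.request


def build_index_request(
    jsonl_path: str,
    base_url: str = "http://localhost:8080",
    index: str = "movies",
) -> urllib.request.Request:
    """Build a PUT request that uploads a JSONL file to /<index>/_index."""
    with open(jsonl_path, "rb") as f:
        payload = f.read()  # file bytes are sent as-is, one JSON document per line
    return urllib.request.Request(
        url=f"{base_url}/{index}/_index",
        data=payload,
        method="PUT",
    )


def index_file(jsonl_path: str) -> str:
    """Send the request to a running server and return the raw response body."""
    with urllib.request.urlopen(build_index_request(jsonl_path)) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Needs the standalone container from the previous step to be running.
    print(index_file("movies.jsonl"))
```

Building the request is split from sending it so the URL and payload can be inspected without a running server.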
Send the search query:
curl -XPOST -d '{"query": {"match": {"title":"matrix"}},"fields": ["title"], "size":3}' \
  http://localhost:8080/movies/_search
{
  "took": 1,
  "hits": [
    {
      "_id": "605",
      "title": "The Matrix Revolutions",
      "_score": 0.016666668
    },
    {
      "_id": "604",
      "title": "The Matrix Reloaded",
      "_score": 0.016393442
    },
    {
      "_id": "624860",
      "title": "The Matrix Resurrections",
      "_score": 0.016129032
    }
  ],
  "aggs": {},
  "ts": 1722441735886
}
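For calling the search endpoint from application code, here is a minimal Python sketch using only the standard library. The request body and the response fields (hits, _id, title, _score) are copied from the example above. The helper names build_search_body, parse_hits and search are hypothetical and not part of any Nixiesearch client.

```python
import json
import urllib.request


def build_search_body(field: str, text: str, fields: list[str], size: int) -> dict:
    """Build the same match query as the curl example."""
    return {"query": {"match": {field: text}}, "fields": fields, "size": size}


def parse_hits(response_text: str) -> list[tuple[str, str, float]]:
    """Extract (_id, title, _score) from a search response body."""
    response = json.loads(response_text)
    return [
        (hit["_id"], hit.get("title", ""), hit["_score"])
        for hit in response["hits"]
    ]


def search(
    body: dict,
    base_url: str = "http://localhost:8080",
    index: str = "movies",
) -> list[tuple[str, str, float]]:
    """POST the query to /<index>/_search on a running server."""
    req = urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return parse_hits(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Needs the standalone container with the movies index to be running.
    for doc_id, title, score in search(build_search_body("title", "matrix", ["title"], 3)):
        print(doc_id, title, score)
```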
You can also open http://localhost:8080/_ui in your web browser for a basic web UI.
For more details, see the complete Quickstart guide.
License
This project is released under the Apache 2.0 license, as specified in the License file.