Update Meilisearch's parts about Replication and Sharding support #454
Kerollmops wants to merge 7 commits into typesense:master
Conversation
- That said, Meilisearch is a relatively new project, and while it's not suited for serious production use-cases today,
- the project has a good team & momentum behind it. We're eager to see how the project evolves.
+ That said, Meilisearch is a relatively new project. The project has a good team & momentum behind it. We're eager to see how the project evolves.
We could probably remove the whole sentence.
<td>❌</td>
<td>❌</td>
<td>❌</td>
<td>✅<br /><br />Can optionally use a GPU to generate embeddings</td>
This is already what's written on the v3 page
Do you have an article in your docs that describes how this specifically solves for High Availability? Sharding and high availability are distinct things, and having sharding doesn't imply being highly available. For example, what happens when a shard goes down? Anyway, I'll reserve my questions for after I read your docs. I checked your public docs and couldn't find it described there, so if you have a direct link, I can read through it.
Hello @jasonbosco 👋 Thank you very much for taking the time to review my PR.
We haven't updated our public documentation yet, but we have internal documentation on how to set up full or unbalanced replication in the changelog for v1.37, the version that introduces replicated sharding. Note that Meilisearch supports both sharding and replication, handled via a virtual-naming approach that stays flexible for any situation. Have a nice day 🌵
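For readers who haven't set up a cluster before, here is a minimal sketch of how instances might be wired together through the experimental `/network` route (the route and its `self`/`remotes` fields exist as an experimental Meilisearch feature; the host names, keys, and the exact shape used by the v1.37 sharding setup are assumptions for illustration only):

```ts
// Hypothetical sketch: registering the other instances of a cluster on one node.
// Host names and API keys below are placeholders, not real values.
const MEILI_URL = "http://ms-0.example.com:7700"; // hypothetical host
const ADMIN_KEY = "MEILI_ADMIN_API_KEY";          // placeholder admin key

async function registerRemotes(): Promise<void> {
  const response = await fetch(`${MEILI_URL}/network`, {
    method: "PATCH",
    headers: {
      Authorization: `Bearer ${ADMIN_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      // The instance's own name in the cluster, plus every remote it knows about.
      self: "ms-0",
      remotes: {
        "ms-0": { url: "http://ms-0.example.com:7700", searchApiKey: "key-0" },
        "ms-1": { url: "http://ms-1.example.com:7700", searchApiKey: "key-1" },
        "ms-2": { url: "http://ms-2.example.com:7700", searchApiKey: "key-2" },
      },
    }),
  });
  if (!response.ok) throw new Error(`network update failed: ${response.status}`);
}
```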
Hey @jasonbosco 👋 We finally merged the documentation about Meilisearch's new sharding feature. As you can see, Meilisearch supports full sharding without replication, full replication, or a mix of sharding and replication. The engine handles these different configurations using a virtual-naming approach. I'll let you read through our documentation.
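To make the virtual-naming idea concrete, here is a purely illustrative mapping (not the documented schema; the shard names and structure are hypothetical) of virtual shard names to the remotes that host them, showing how the same mechanism can express the three topologies mentioned above:

```ts
// Illustrative data shapes only; the real Meilisearch configuration keys may differ.
type Topology = Record<string, string[]>; // virtual shard name -> remotes hosting it

// Sharding only: each shard lives on exactly one remote, no redundancy.
const shardingOnly: Topology = {
  "movies-shard-0": ["ms-0"],
  "movies-shard-1": ["ms-1"],
  "movies-shard-2": ["ms-2"],
};

// Replication only: a single shard copied to every remote.
const replicationOnly: Topology = {
  "movies-shard-0": ["ms-0", "ms-1", "ms-2"],
};

// Mixed: two shards, each replicated on two of the three remotes.
const mixed: Topology = {
  "movies-shard-0": ["ms-0", "ms-1"],
  "movies-shard-1": ["ms-1", "ms-2"],
};
```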
@Kerollmops Thank you for sharing the link to the docs. After reading through it, I still don't think Meilisearch's replication architecture qualifies as high availability or fault tolerance in the standard distributed-systems sense.

Just to align on terminology: high availability / fault tolerance typically means there is no single point of failure and the system continues to operate correctly (serving both reads and writes) despite a node failure, with automatic failover and no operator intervention required. Based on your docs, Meilisearch's current implementation doesn't seem to meet that bar.

On the write side, your docs say:

So there's a single leader for writes, and if that node goes down - say at 3am - writes are down until someone manually promotes a new leader. That's a single point of failure for the write path and requires operator intervention to recover, regardless of the number of shards or replicas, even when fully replicated.

Even on the read side, your docs say:

Based on that, my understanding is that during a distributed search, each shard is assigned to a specific remote. If that remote is down, the query does not retry against another replica that holds the same shard. The implication: say there's a Meilisearch cluster with a shard fully replicated across multiple nodes, it's again 3am, one node goes down, and the query planner happens to pick that node for that shard; the request may return partial or failed results, even though another replica is healthy. From the application's point of view, this breaks the expectation that replicas provide seamless failover.

Putting both together, your current replication mechanism does not provide fault tolerance when a node is down. It only gives you data redundancy for existing data; without automatic replica fallback on reads and automatic leader election for writes, the system goes into a degraded state where some searches may return partial results and writes become unavailable if even one node (especially the leader) fails. So this design is still not fault tolerant and does not provide high availability for use in a production environment. Let me know if I've misunderstood any behavior you've mentioned in the docs.
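To make the read-side gap concrete, here is a rough sketch (generic client-side logic, not anything Meilisearch ships or documents) of the replica fallback an application would have to implement itself under the behavior described above: try the assigned remote first, then any other replica known to hold the same shard. Only the standard `/indexes/{index}/search` route is assumed; the hosts and keys are placeholders.

```ts
// Generic client-side replica fallback sketch; none of this is built into the engine.
interface Replica {
  url: string;        // e.g. "http://ms-1.example.com:7700" (placeholder)
  searchApiKey: string;
}

async function searchWithFallback(
  replicas: Replica[], // every remote known to hold the same shard, assigned remote first
  index: string,
  query: string,
): Promise<unknown> {
  let lastError: unknown;
  for (const replica of replicas) {
    try {
      const response = await fetch(`${replica.url}/indexes/${index}/search`, {
        method: "POST",
        headers: {
          Authorization: `Bearer ${replica.searchApiKey}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({ q: query }),
      });
      if (response.ok) return await response.json();
      lastError = new Error(`HTTP ${response.status} from ${replica.url}`);
    } catch (err) {
      lastError = err; // node unreachable, try the next replica
    }
  }
  throw lastError; // every replica failed
}
```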
Yellow, the Typesense team 👋
Change Summary
I have updated your comparison table to reflect our recent releases and, most importantly, the release of geo-replicated sharding. Users can now configure a cluster and define how to replicate and shard their documents. We use virtual naming to let users scale horizontally, define a replication-only setup, or do both simultaneously.
I also updated the parts discussing the impact of available RAM on performance.
Thank you for your time, and have a great weekend 🌵
PR Checklist