index_manager.create is non-atomic; concurrent first-writes per kind can race the on-disk index creation #22
Labels
No labels
prio_critical
prio_low
type_bug
type_contact
type_issue
type_lead
type_question
type_story
type_task
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
lhumina_code/hero_indexer#22
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Problem
The
IndexManager::createpath onhero_indexer_server(which leads totantivy::Index::create_in_dirafterstd::fs::create_dir_all) is not atomic with respect to concurrent first-write requests for the same kind:Concurrent first-write requests for the same kind:
path.exists()→ false → spawn_blocking →create_dir_all→ beginsIndex::create_in_dir.create_in_dirfinishes writingmeta.json).path.exists()→ true (A created it, step 2) → returnsDatabaseExistsearly.DatabaseExistsas benign ("kind already initialized, proceed to write") and issues adb.select+ write — butmeta.jsondoesn't exist yet (A hasn't finished step 3).db.selecterrors withFileDoesNotExist(meta.json).This is a server-side TOCTOU between the dir-exists check and the index-meta-written observation. Two parallel calls to
create_in_diragainst the same path can also corrupt the index state.Reproduction
Spawn N concurrent first-write requests for the same fresh kind. Some fraction will hit
FileDoesNotExist(meta.json)on subsequent operations. In practice this happens during initial backfill of a fresh deployment when many entity kinds get their first write simultaneously, or under cold-start load when an admin reindex RPC fans out across many kinds.Workaround (downstream)
Serialize per-kind first-write client-side. Hold a mutex across the entire (cache-check + db.create + cache-insert) sequence so a single flight per kind reaches the server. Pseudocode:
This is what downstream consumers have shipped to paper over the issue.
Suggested fix (server-side)
Make
IndexManager::createatomic. Two options:RwLock<HashMap<String, ...>>keyed by kind name across the whole create operation; concurrent requests for the same kind block on it until the first completes.create_or_getinstead ofcreate. Single idempotent operation that returns the existing index if creation is in flight (or completed), OR completes creation if not started. Removes the client-side round-trip and the TOCTOU window entirely.Option 2 is preferred because it eliminates the need for clients to do their own caching at all.
Impact
Currently every consumer that does parallel first-writes per kind must implement the client-side serialization workaround. Without it, backfill flows fail intermittently in a way that's hard to diagnose (the error surfaces on a later
db.select, not on the racingdb.create).Not blocking — workaround is small. Filing as ecosystem-correctness signal so the next consumer doesn't independently rediscover this.