Duplicate IPv6 bridge address 4a0:6976:8fa7:efc:1::1 on br-timur and br-ashraf — needs root-cause analysis #164

Open
opened 2026-04-29 03:32:56 +00:00 by sameh-farouk · 0 comments

Observed

On the dev box (138.201.206.39, observed 2026-04-29), the same per-user mycelium bridge IP is currently configured on two different bridge interfaces:

26: br-timur: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 state DOWN
    inet6 4a0:6976:8fa7:efc:1::1/64 scope global
28: br-ashraf: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 state DOWN
    inet6 4a0:6976:8fa7:efc:1::1/64 scope global

4a0:6976:8fa7:efc:1::1/64 (slot 1 of the per-user mycelium /64 ladder) is allocated to BOTH timur and ashraf.

Possible causes (none confirmed)

  1. mycelium_alloc_prefix64 allocator collision — the per-user prefix allocator in tools/modules/installers/multiuser.nu could return the same slot to two multi_user_add calls if its "find next free slot" logic has a fallback on RPC failure (or a race when two adds run concurrently).
  2. multi_user_del cleanup gap — when a user is removed, the bridge interface might be torn down but the IP address left attached (e.g., to a residual netns or another bridge), then a fresh multi_user_add reuses the slot, creating the duplicate.
  3. Manual ops residue — someone configured the duplicate via ip addr add outside the normal lifecycle.

I've not isolated which of these is the root cause. Filing as observation + hypothesis so the maintainer can pick the right diagnostic.
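Hypothesis 1 is a classic check-then-act race: "find next free slot" reads state, picks a slot, then writes the claim, and two concurrent `multi_user_add` runs can interleave between the read and the write. A toy bash sketch of the pattern (the registry file and slot format are invented here; the real allocator lives in `multiuser.nu`) showing how serializing the critical section with `flock(1)` would close it:

```shell
#!/usr/bin/env bash
# Toy model of hypothesis 1. Everything here (registry file, slot format)
# is invented for illustration -- it is NOT the real multiuser.nu logic.
reg=$(mktemp)

# Unsafe: read the registry, pick the next slot, write it back -- with a
# deliberate delay to widen the race window. Two concurrent calls can
# both read the same line count and claim the same slot.
alloc_unsafe() {
  local next=$(( $(wc -l < "$reg") + 1 ))
  sleep 0.2
  echo "$1 slot$next" >> "$reg"
}

# Safe: identical logic, but the whole read-pick-write section holds an
# exclusive flock on a lock file, so concurrent calls serialize.
alloc_safe() {
  (
    flock -x 9
    local next=$(( $(wc -l < "$reg") + 1 ))
    sleep 0.2
    echo "$1 slot$next" >> "$reg"
  ) 9>>"$reg.lock"
}

alloc_safe timur &
alloc_safe ashraf &
wait
cat "$reg"   # two distinct slots, in whichever order the lock was won
```

If the Nushell allocator's read and write are not under one lock (or one atomic RPC), the same interleaving applies regardless of language.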

Diagnostic next steps

  • bridge link show to see what's plumbed into each bridge
  • Cross-reference allocation history: when were timur and ashraf created? Check journalctl, shell history, and ~/hero/cfg/multi_user_*.log (if any exist)
  • Read mycelium_alloc_prefix64 and check whether its slot-iteration logic can ever return a value that's still attached at the OS level (i.e., does it consult ip -6 addr or only an internal registry?)
  • Check multi_user_del — does it ip addr del before ip link del?
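Most of the checks above can be scripted so the duplicate is caught mechanically rather than by eyeballing `ip` output. A sketch (the parsing is mine; only the standard `ip -6 addr show` output format is assumed) that flags any global-scope IPv6 address configured on more than one interface:

```shell
#!/usr/bin/env bash
# Flag any global-scope IPv6 address that appears on more than one
# interface. Assumes only the standard `ip -6 addr show` output format.
find_dup_inet6() {
  awk '
    /^[0-9]+: /                     { sub(/:$/, "", $2); iface = $2 }  # "26: br-timur: <...>"
    $1 == "inet6" && /scope global/ { print $2, iface }                # "addr/plen iface"
  ' | sort | awk '
    $1 == prev { print "DUPLICATE", $1, "on", previf, "and", $2 }
    { prev = $1; previf = $2 }
  '
}

if command -v ip >/dev/null; then
  ip -6 addr show | find_dup_inet6
fi
```

On this box it should print exactly the br-timur/br-ashraf collision from the report; a clean box prints nothing, which makes it usable as a post-`multi_user_add` sanity check.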

Why this matters

  • Routing ambiguity: outbound packets to :1::1 could egress via either bridge
  • Bind conflicts: similar in spirit to the livekit EADDRINUSE issue (lhumina_code/hero_livekit#31), though that was Pion auto-enumeration, not a literal duplicate
  • User isolation breakdown if traffic intended for one user's mycelium namespace ends up on another's
  • Both bridges are currently DOWN/NO-CARRIER, so no active traffic is hitting this — but a peer reconnection on either side will surface it
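The routing-ambiguity point is directly observable: asking the kernel for the route to the duplicated address shows which of the two bridges it would actually pick right now (read-only, safe to run; the address is the one from the report):

```shell
# Which interface would the kernel route the duplicated address through?
# With both bridges DOWN/NO-CARRIER this may just report the destination
# as unreachable -- re-run after either bridge comes up.
ip -6 route get 4a0:6976:8fa7:efc:1::1 2>&1 || true
```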

Workaround (until root cause is found)

Manual cleanup:

# Pick the user who actually owns slot 1 (check ~/hero/cfg/hero_cfg.toml on each user)
# Then remove the duplicate from the wrong bridge:
sudo ip -6 addr del 4a0:6976:8fa7:efc:1::1/64 dev <wrong-bridge>

This just patches the symptom — the underlying allocator/cleanup bug will hit again on the next multi_user_add / multi_user_del cycle.
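Longer term, the allocator could defend itself against both the collision and the stale-cleanup hypotheses by treating the kernel as the source of truth: before handing out a slot, verify that its address is not already configured anywhere. A bash sketch (the function names and slot-to-address mapping are mine, inferred from this box's observed ladder; the real allocator is `mycelium_alloc_prefix64` in `multiuser.nu`):

```shell
#!/usr/bin/env bash
# Hypothetical pre-assignment guard: a slot is only free if its /64
# gateway address is absent from every interface on the host. Prefix and
# slot layout are inferred from the report, not read from multiuser.nu.
slot_addr() {
  printf '4a0:6976:8fa7:efc:%s::1' "$1"
}

slot_is_free() {
  # `ip -6 addr show to <addr>/64` prints nothing when no interface
  # carries an address inside that /64.
  [ -z "$(ip -6 addr show to "$(slot_addr "$1")/64" 2>/dev/null)" ]
}

# Walk slots until one is free at the OS level, not just in the registry.
for slot in 1 2 3 4; do
  if slot_is_free "$slot"; then
    echo "next free slot: $slot"
    break
  fi
done
```

A check like this would have refused slot 1 for the second user even if the internal registry had already forgotten it.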

Reference
lhumina_code/hero_skills#164