Duplicate IPv6 bridge address 4a0:6976:8fa7:efc:1::1 on br-timur and br-ashraf — needs root-cause analysis #164

Open
opened 2026-04-29 03:32:56 +00:00 by sameh-farouk · 0 comments

Observed

On the dev box (138.201.206.39, observed 2026-04-29), the same per-user mycelium bridge IP is currently configured on two different bridge interfaces:

26: br-timur: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 state DOWN
    inet6 4a0:6976:8fa7:efc:1::1/64 scope global
28: br-ashraf: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 state DOWN
    inet6 4a0:6976:8fa7:efc:1::1/64 scope global

4a0:6976:8fa7:efc:1::1/64 (slot 1 of the per-user mycelium /64 ladder) is allocated to BOTH timur and ashraf.

Possible causes (none confirmed)

  1. mycelium_alloc_prefix64 allocator collision — the per-user prefix allocator in tools/modules/installers/multiuser.nu could return the same slot to two multi_user_add calls if its "find next free slot" logic has a fallback on RPC failure (or a race when two adds run concurrently).
  2. multi_user_del cleanup gap — when a user is removed, the bridge interface might be torn down but the IP address left attached (e.g., to a residual netns or another bridge), then a fresh multi_user_add reuses the slot, creating the duplicate.
  3. Manual ops residue — someone configured the duplicate via ip addr add outside the normal lifecycle.

I've not isolated which of these is the root cause. Filing as observation + hypothesis so the maintainer can pick the right diagnostic.
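Hypothesis 1 is a classic check-then-act race: "find next free slot" reads state, picks a slot, then writes the claim, and two concurrent `multi_user_add` runs can interleave between the read and the write. A toy bash sketch of the pattern (the registry file and slot format are invented here; the real allocator lives in `multiuser.nu`) showing how serializing the critical section with `flock(1)` would close it:

```shell
#!/usr/bin/env bash
# Toy model of hypothesis 1. Everything here (registry file, slot format)
# is invented for illustration -- it is NOT the real multiuser.nu logic.
reg=$(mktemp)

# Unsafe: read the registry, pick the next slot, write it back -- with a
# deliberate delay to widen the race window. Two concurrent calls can
# both read the same line count and claim the same slot.
alloc_unsafe() {
  local next=$(( $(wc -l < "$reg") + 1 ))
  sleep 0.2
  echo "$1 slot$next" >> "$reg"
}

# Safe: identical logic, but the whole read-pick-write section holds an
# exclusive flock on a lock file, so concurrent calls serialize.
alloc_safe() {
  (
    flock -x 9
    local next=$(( $(wc -l < "$reg") + 1 ))
    sleep 0.2
    echo "$1 slot$next" >> "$reg"
  ) 9>>"$reg.lock"
}

alloc_safe timur &
alloc_safe ashraf &
wait
cat "$reg"   # two distinct slots, in whichever order the lock was won
```

If the Nushell allocator's read and write are not under one lock (or one atomic RPC), the same interleaving applies regardless of language.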

Diagnostic next steps

  • bridge link show to see what's plumbed into each bridge
  • Cross-reference allocation history: when were timur and ashraf created? Check journalctl, shell history, and ~/hero/cfg/multi_user_*.log (if any exist)
  • Read mycelium_alloc_prefix64 and check whether its slot-iteration logic can ever return a value that's still attached at the OS level (i.e., does it consult ip -6 addr or only an internal registry?)
  • Check multi_user_del — does it ip addr del before ip link del?
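Most of the checks above can be scripted so the duplicate is caught mechanically rather than by eyeballing `ip` output. A sketch (the parsing is mine; only the standard `ip -6 addr show` output format is assumed) that flags any global-scope IPv6 address configured on more than one interface:

```shell
#!/usr/bin/env bash
# Flag any global-scope IPv6 address that appears on more than one
# interface. Assumes only the standard `ip -6 addr show` output format.
find_dup_inet6() {
  awk '
    /^[0-9]+: /                     { sub(/:$/, "", $2); iface = $2 }  # "26: br-timur: <...>"
    $1 == "inet6" && /scope global/ { print $2, iface }                # "addr/plen iface"
  ' | sort | awk '
    $1 == prev { print "DUPLICATE", $1, "on", previf, "and", $2 }
    { prev = $1; previf = $2 }
  '
}

if command -v ip >/dev/null; then
  ip -6 addr show | find_dup_inet6
fi
```

On this box it should print exactly the br-timur/br-ashraf collision from the report; a clean box prints nothing, which makes it usable as a post-`multi_user_add` sanity check.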

Why this matters

  • Routing ambiguity: outbound packets to :1::1 could egress via either bridge
  • Bind conflicts: similar in spirit to the livekit EADDRINUSE issue (lhumina_code/hero_livekit#31), though that was Pion auto-enumeration, not a literal duplicate
  • User isolation breakdown if traffic intended for one user's mycelium namespace ends up on another's
  • Both bridges are currently DOWN/NO-CARRIER, so no active traffic is hitting this — but a peer reconnection on either side will surface it
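The routing-ambiguity point is directly observable: asking the kernel for the route to the duplicated address shows which of the two bridges it would actually pick right now (read-only, safe to run; the address is the one from the report):

```shell
# Which interface would the kernel route the duplicated address through?
# With both bridges DOWN/NO-CARRIER this may just report the destination
# as unreachable -- re-run after either bridge comes up.
ip -6 route get 4a0:6976:8fa7:efc:1::1 2>&1 || true
```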

Workaround (until root cause is found)

Manual cleanup:

# Pick the user who actually owns slot 1 (check ~/hero/cfg/hero_cfg.toml on each user)
# Then remove the duplicate from the wrong bridge:
sudo ip -6 addr del 4a0:6976:8fa7:efc:1::1/64 dev <wrong-bridge>

This just patches the symptom — the underlying allocator/cleanup bug will hit again on the next multi_user_add / multi_user_del cycle.
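Longer term, the allocator could defend itself against both the collision and the stale-cleanup hypotheses by treating the kernel as the source of truth: before handing out a slot, verify that its address is not already configured anywhere. A bash sketch (the function names and slot-to-address mapping are mine, inferred from this box's observed ladder; the real allocator is `mycelium_alloc_prefix64` in `multiuser.nu`):

```shell
#!/usr/bin/env bash
# Hypothetical pre-assignment guard: a slot is only free if its /64
# gateway address is absent from every interface on the host. Prefix and
# slot layout are inferred from the report, not read from multiuser.nu.
slot_addr() {
  printf '4a0:6976:8fa7:efc:%s::1' "$1"
}

slot_is_free() {
  # `ip -6 addr show to <addr>/64` prints nothing when no interface
  # carries an address inside that /64.
  [ -z "$(ip -6 addr show to "$(slot_addr "$1")/64" 2>/dev/null)" ]
}

# Walk slots until one is free at the OS level, not just in the registry.
for slot in 1 2 3 4; do
  if slot_is_free "$slot"; then
    echo "next free slot: $slot"
    break
  fi
done
```

A check like this would have refused slot 1 for the second user even if the internal registry had already forgotten it.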

Reference
lhumina_code/hero_skills#164