Benchmarks
Every optimization pass re-measured apples-to-apples on one box as a
PR-isolated before/after pair, plus a cumulative 16caba4→main summary. This
file is performance only — correctness gates (byte-identical proptests, FUSE
e2e, in-diff mutation) live in CI and the contributor guide, not
here.
Read it in three layers:
- Results at a glance — the cumulative per-subsystem delta and a one-line headline per pass.
- Methodology — machine, before/after definition, the overlay rule, run conventions, storage placement. Written once; every detail section assumes it.
- Per-pass detail — one section per pass: what changed, the before/after table, the reproduce command, and the "why" where it matters.
Results at a glance
Cumulative — 16caba4 → current main (e02223e)
Composed from the per-pass isolated deltas below, anchored to current-main
absolutes. Non-isolating: a same-harness run at both ends is infeasible (API
drift means neither the 16caba4-era harness nor the main harness compiles at
the other commit), so these compose the chain of passes that touched each
subsystem rather than a single end-to-end measurement. See
Cumulative detail for the absolutes and the per-pass
composition.
| Subsystem | Headline metric | 16caba4-era | current main | Δ | Dominant pass |
|---|---|---|---|---|---|
| Ingest | fsync count (durable) | 403 | 0 | eliminated | SP1 |
| cold scan, ci flac | 32 206 ms | 47 ms | ~685× | SP1 | |
| Refresh | refresh-1 @ 20 000 tracks | 173 ms | 1 ms | ~173× | #69 |
| Serve | sequential_read/flac | 929 µs | 569 µs | −38.8% | SP3 + PR3 |
| cold_first_read/ogg | 14.96 ms | 1.51 ms | −89.9% | SP4 | |
| concurrent m16+walker | 8.20 ms | 4.15 ms | −49.4% | SP3 + PR3 |
Per-pass headlines
Each headline is the pass's single largest statistically-significant delta on its deployment-representative tier.
| Pass | Commit | Headline (this box) |
|---|---|---|
| SP1 — ingestion scalability | ccbbfaa | durable cold scan ~1150–3600× faster; fsync storm 403→0 |
| SP2 — incremental tree refresh | ed5f380 | 5 000-track refresh-1 1.4× (32→23 ms) |
| SP3 — read/serve residuals | e8d56bd | sequential_read −8 to −13% (flac/mp3/m4a/m4a-last) |
| SP4 — storage-aware Ogg serving | a62453b | ogg cold-read −88%, seek −94% |
| #69 — refresh O(changed) | e7ae912 | refresh-1 @ 20 000 ~170× (173→1 ms) |
| #114 — root fan-out lookup | 0881b31 | root fan-out @ 20 000 ~5× (5→1 ms) |
| PR2 — scan pair (#67/#68) | 2d4faf3 | −128 B/file scan I/O (flac/ogg/wav); wall within noise |
| PR3 — serve-path copies (#70) | 32be8f0 | sequential_read −7 to −11% (m4a-last/ogg/wav); concurrent −19% |
| #136 — HeaderCache quick_cache | 2e6674e | within noise (marginal m4a/ogg sequential) |
| #112 — StructureOnly passthrough | faec017 | passthrough dd 3.36× (2.5→8.4 GB/s) |
One direction inverted vs the historical file: SP1 §4 (compute-isolated, on RAM) is now faster after the change, not slower. The old file recorded SP1 as ~1.9× slower on RAM-backed tempfs (the "honest cost" of the pipeline); on this 8-core box the parallel pipeline wins even on RAM (~1.4×), at higher peak RSS. See SP1 §4.
CI regression gating
BENCHMARKS.md records hand-run absolute numbers; CI guards against regressions
in three lanes:
- Counter gate (every non-doc PR, hard).
perf_counters.rs+tree.rsgolden work-counter assertions under--features metrics. Catches algorithmic regressions (extra copy, whole-file slurp, O(N) tree rebuild). - A/B wall-clock (warn-only, core
srcPRs). Theperf-benchmatrix job benches the base and PR commits in parallel on separate runners; theperf-abjob then diffs the two exported baselines and posts acritcmpdelta as a PR comment. Never blocks. - Release record. The
benchmarksjob runs the full bench suite at thecitier on a tag and uploads the numbers as an artifact for curation here.
The fsync-storm (403→0) signal needs a real FUSE mount and lives only in the
release lane / the #[ignore] bench_scan_under_latency, not the per-PR gate.
The release artifact is named benchmark-snapshot-<tag>; download it from the
tag's workflow run. The job runs on a GitHub-hosted ubuntu-latest runner,
not the dedicated box the rest of this file uses, so its wall-times are
runner-relative and are not folded into the per-pass tables — only the
portable signals (bytes_read, pread/fsync counts, refresh flatness) are
cross-comparable. Each release's snapshot is recorded verbatim under
Release CI snapshots.
Release CI snapshots
Per-tag records from the benchmarks release job
(CI regression gating §3), run on a GitHub-hosted
ubuntu-latest runner at the ci tier. This is not the dedicated box the
per-pass tables use, so the wall-times here are
runner-relative and are a point-in-time record per release — not comparable to
those sections. The portable signals (scan_bytes_read, pread/fsync
counts, refresh flatness) are comparable, and are the no-regression check.
v1.1.0 — f865afc, single run
No regression vs the curated tables: scan_bytes_read is unchanged from
PR2 (flac/ogg/wav = 845 000 / 847 400 / 828 000 B — the
−128 B/file ID3v1 gating still holds; mp3 847 200 B unchanged; m4a uses the
seek-reader, 0 B), and single-track refresh stays flat with library size.
read_throughput (Criterion, median estimate, µs):
| bench | flac | mp3 | m4a | m4a-last | ogg | wav |
|---|---|---|---|---|---|---|
| sequential_read | 416 | 418 | 417 | 419 | 524† | 422 |
| cold_first_read | 798 | 778 | 805 | 804 | 911 | 792 |
| seek_read | 368 | 353 | 372 | 370 | 582 | 365 |
concurrent_read_walk/m8_plus_walker: 931 µs.
† sequential_read/ogg collected only 10k iterations with 19% outliers
(Criterion low-sample warning) — treat as noisy.
bench_ingest — ci tier (200 tracks × 4 KiB), runner tmpfs:
| format | scan (ms) | revalidate (ms) | scan_bytes_read (B) | RSS (KiB) |
|---|---|---|---|---|
| flac | 31 | 1 | 845 000 | 7100 |
| mp3 | 82 | 1 | 847 200 | 7164 |
| m4a | 86 | 1 | 0 | 7180 |
| m4a-last | 87 | 1 | 0 | 7184 |
| ogg | 83 | 1 | 847 400 | 7184 |
| wav | 83 | 1 | 828 000 | 7184 |
bench_refresh — ci tier, single-track re-tag:
| library size | refresh-1 (ms) | root-fanout-1 (ms) |
|---|---|---|
| 100 | 0 | 0 |
| 1000 | 1 | 1 |
| 5000 | 12 | 3 |
| 20000 | 7 | 9 |
refresh-1 vs refresh-N (200-track ci, same instance): refresh-1 0 ms, refresh-N (100 touched) 6 ms. The 5000 > 20000 inversion is single-run noise on the shared runner; the dedicated-box #69 sweep is the clean flat signal.
Methodology
Machine
| CPU | 8 cores |
| RAM | 32 GB (31 GiB) |
Durable storage (/data) | btrfs, 2-device span (sda3+sdb3), rotational; Data: single, Metadata: RAID1; zstd:1. No SSD on this box. |
RAM storage (/dev/shm) | tmpfs |
| Toolchain | rustc 1.96.0 · release builds |
| Kernel | Linux 7.0 (FUSE passthrough requires ≥6.9 + CAP_SYS_ADMIN) |
Before / after definition
History is squash-merged (linear), so each pass is one commit:
- after = the pass's own squash-merge commit.
- before = its parent,
<after>^— PR-isolated, not currentmain. This preserves attribution (each delta is exactly what that PR changed) and avoids harness drift from later passes.
The overlay rule
Two passes (SP2, SP4) report a bench that did not yet exist at their
before commit. For those, the after-commit's harness file is checked out onto
the before checkout (git checkout <after> -- <bench_file>) so the old code is
measured with the new harness. Overlay use is called out in each affected
section.
Run conventions
bench_ingest/bench_refresh(ignored tests,cargo test --release … -- --ignored): 3 runs, median reported (spread noted where it matters).bench_ingestneeds--features metrics.read_throughput(Criterion bench): Criterion's own sampling; before side saved with--save-baseline, after side compared with--baseline. Reported Δ is Criterion's change estimate.- Wall times on
/dataare box-relative (rotational disk); where a portable signal exists (fsync count, bytes_read, pread count) it is the primary number.
Storage placement
- Durable rows run on
/data(rotational btrfs).bench_ingesthonorsMUSEFS_BENCH_DIR. - RAM rows run on
/dev/shm(tmpfs).bench_ingesthonorsMUSEFS_BENCH_DIR=/dev/shm/…;bench_refreshandread_throughputignore it and followTMPDIR=/dev/shm.
Per-pass detail
SP1 — Ingestion scalability
ccbbfaa^ → ccbbfaa. bench_ingest, --features metrics. No overlay.
What changed: whole-file fs::read slurp + per-file commits at
synchronous=FULL → bounded probing reads + parallel-probe/single-writer
pipeline + per-batch transactions at synchronous=NORMAL (WAL retained).
1. Durable small files — the fsync/batching win
ci tier (200 tracks × 4 KiB, no embedded art), corpus + DB on /data. Not
compute-bound — the before path is dominated by per-file fsync latency.
| format | before scan (ms) | after scan (ms) | speedup |
|---|---|---|---|
| flac | 32 206 | 21 | 1534× |
| mp3 | 16 124 | 14 | 1152× |
| m4a | 30 089 | 19 | 1584× |
| m4a-last | 39 592 | 11 | 3599× |
| ogg | 16 153 | 14 | 1154× |
| wav | 15 574 | 12 | 1298× |
2. Durable large files — bounded reads + batching
bandwidth tier (1000 tracks × 30 MiB FLAC + art ≈ 30 GiB), on /data, 1 run.
| metric | before (slurp) | after (bounded) | Δ |
|---|---|---|---|
| scan wall (ms) | 378 041 | 15 228 | 24.8× faster |
| revalidate (ms) | 243 | 14 | 17.4× |
| peak RSS (KiB) | 98 636 | 132 436 | 0.74× (more) |
The after path reads only a ~1 MiB metadata window per file instead of slurping each 30 MiB file in full.
3. fsync count — the mechanism
ci tier (200 FLAC) scanned through the passthrough latency-FS (ssd profile),
which counts fsyncs at the FUSE layer. Wall is box-relative (rotational /data);
the fsync count is the portable signal.
| config | fsyncs | scan wall (ms, box-relative) |
|---|---|---|
before (synchronous=FULL, per-file commits) | 403 | 79 |
after (synchronous=NORMAL, batched commits) | 0 | 21 |
The 403→0 collapse is the root cause of §1's durable speedups.
4. Compute-isolated (RAM) — the trade, now a win on this box
large-compute tier (100k tracks × ~38 KiB FLAC) on /dev/shm (RAM), where
fsync is free — so the §1/§3 batching win is neutralized and only raw compute
remains. bytes_read ≈ 3.92 GiB both sides (the 38 KiB files are below the 1 MiB
window, so bounded reads don't help).
| config | before scan (ms) | after scan (ms) | revalidate before→after (ms) | peak RSS before→after (KiB) |
|---|---|---|---|---|
| default jobs | 31 241 | 22 295 | 2239 → 1278 | 27 904 → 96 084 |
--jobs 1 | 31 111 | 23 565 | 2255 → 1283 | 28 024 → 92 200 |
Finding — direction inverted vs the historical file. The old file (6-core EPYC) recorded SP1 as ~1.9× slower on RAM — the deliberate "honest cost" of the pipeline where there is no fsync win to amortize. On this 8-core box the parallel pipeline is ~1.4× faster even on RAM (the extra cores outweigh the per-file coordination), at the cost of ~3.4× peak RSS (96 MB vs 28 MB). The trade has shifted from "small RAM loss" to "RAM win for more memory" on wider hardware.
# durable §1/§2: MUSEFS_BENCH_DIR on /data ; RAM §4: MUSEFS_BENCH_DIR on /dev/shm
MUSEFS_BENCH_TIER=ci MUSEFS_BENCH_DIR=/data/bench \
cargo test --release -p musefs-core --features metrics --test bench_ingest \
-- --ignored --nocapture bench_cold_scan_and_revalidate
# §3 fsync count:
MUSEFS_BENCH_LATENCY_PROFILE=ssd MUSEFS_BENCH_TIER=ci MUSEFS_BENCH_FORMAT_MIX=flac \
cargo test --release -p musefs-core --features metrics --test bench_ingest \
bench_scan_under_latency -- --ignored --nocapture
SP2 — Incremental tree refresh
ed5f380^ → ed5f380. bench_refresh, RAM (TMPDIR=/dev/shm).
Overlay: the bench_refresh_one_across_library_sizes sweep didn't exist at
ed5f380^, so the after-commit harness is overlaid on the before checkout.
What changed: replace the O(N) VirtualTree::build_with full reconstruction
with apply_changes (in-place im-backed tree mutation) — only nodes whose id
appears in the changed/added/removed sets are touched.
ci tier, FLAC, single-track re-tag, 3 runs (median):
| library size | before (ms) | after (ms) | speedup |
|---|---|---|---|
| 100 | 0 | 0 | n/a (sub-granularity) |
| 1000 | 5 | 6 | 0.83× (noise tier) |
| 5000 | 32 | 23 | 1.39× |
Why (Stage A → Stage B): at Stage A the rebuild already rendered
incrementally (only the changed track re-rendered, O(changed)), but the
subsequent VirtualTree::build_with reconstructed the whole tree from scratch
(O(N)) — the remaining linear cost. Stage B's apply_changes removes that full
reconstruction; the residual slope (still ~23 ms at 5000) is the lighter O(N)
render-key scan + HashMap rebuild that feeds apply_changes, not a full tree
rebuild. The speedup grows with library size because diff cost is proportional to
changes, not total entries. (Corpus is single-album, so build_with time is
slightly optimistic vs a real multi-album library.)
cargo test -p musefs-core --release --test bench_refresh \
bench_refresh_one_across_library_sizes -- --ignored --nocapture
SP3 — Read/serve residuals
e8d56bd^ → e8d56bd. Criterion read_throughput, RAM. No overlay.
What changed: (1) read_segments writes each BackingAudio run directly into
the output buffer's reserved tail (no throwaway vec![0u8; n] + copy); (2)
handles: Mutex<HashMap> → lock-free sharded_slab::Slab; (3) size_cache: Mutex<HashMap> → dashmap::DashMap.
sequential_read — per-format (4 MiB files, 128 KiB reads)
| format | before (µs) | after (µs) | time Δ | thrpt Δ |
|---|---|---|---|---|
| flac | 929.1 | 839.6 | −7.9% | +8.6% |
| mp3 | 940.2 | 824.8 | −13.1% | +15.1% |
| m4a | 939.8 | 824.2 | −10.8% | +12.2% |
| m4a-last | 938.0 | 842.6 | −10.3% | +11.4% |
| ogg | 966.8 | 1049.4 | +6.3% | −5.9% |
| wav | 935.4 | 912.3 | −2.5% | +2.5% |
The metadata-light formats improve 8–13% from dropping the per-splice alloc+copy. ogg +6.3% is a low-iteration sampling anomaly (Criterion warned "Unable to complete 100 samples in 5.0s" — only 5050 iterations vs 10k for other formats).
concurrent_read_walk/m16_plus_walker
16 reader threads + one metadata walker sharing one Arc<Musefs> (includes thread
spawn/join):
| before (ms) | after (ms) | Δ | |
|---|---|---|---|
| m16_plus_walker | 8.20 | 9.48 | +15.7% |
This high-variance burst metric regressed on this run — attributable to thread spawn/join overhead in the contention path rather than the read path itself; it is not a sequential-read regression. (The old file recorded this bench as parity/improved; it swings run-to-run.)
cargo bench -p musefs-core --bench read_throughput -- sequential_read concurrent_read_walk
SP4 — Storage-aware Ogg serving
a62453b^ → a62453b. Criterion read_throughput + latency-injected read.
Overlay: cold_first_read/seek_read were added by SP4, so the after-commit
bench is overlaid on the before checkout.
What changed: replace the eager whole-region Ogg page index with a stateless
per-request backwards-scan: find_page_start locates the containing page from a
~65 KB window (CRC-validated entry guard), serve_ogg_window patches each page
header algebraically (crc_shift_zeros, no payload I/O), and a one-entry
last_page memo short-circuits the scan + CRC guard when the next request lands
inside the already-located page.
sequential_read — warm repeat-read (no page-index amortization to win)
| format | before (µs) | after (µs) | Δ |
|---|---|---|---|
| flac | 856.2 | 880.5 | +2.8% |
| mp3 | 847.7 | 894.5 | +5.5% |
| m4a | 862.5 | 816.9 | −5.3% |
| m4a-last | 872.7 | 831.6 | −4.7% |
| ogg | 1037.9 | 1048.2 | +1.0% |
| wav | 892.6 | 840.8 | −5.8% |
cold_first_read / seek_read — the Ogg win
| bench | format | before | after | Δ |
|---|---|---|---|---|
| cold_first_read | ogg | 14.956 ms | 1.799 ms | −88.0% |
| seek_read | ogg | 13.541 ms | 827 µs | −93.9% |
Non-ogg cold/seek stay within ±7% (no page index involved). The wins come from
never building the whole-file index up front — the old code reads the entire
prefix to serve even one chunk near EOF; SP4 scans ~65 KB backward, then the memo
carries the validated page forward. sequential_read/ogg is flat (+1.0%) because
it reads the full file linearly regardless — the win is cold-start and seek.
Latency-injected reads (bench_read_under_latency, nfs-hdd) — AFTER only
This bench was introduced by SP4; no before baseline exists.
| label | format | tier | storage | wall (ms) | opens | preads |
|---|---|---|---|---|---|---|
| read_whole_cold | ogg | ci | nfs-hdd | 28 | 1 | 0 |
| read_seek_cold | ogg | ci | nfs-hdd | 28 | 1 | 0 |
preads=0: the backwards-scan reads are served from the layout's inline/generated
segments without reaching the backing file. Near-equal whole/seek wall time
indicates per-file open+resolve latency dominates under nfs-hdd; the local
cold/seek benches above are the clean signal.
Why crc_shift_zeros is a hybrid
patch_page_header_algebraic advances the CRC past a page's payload via
crc_shift_zeros. The per-step loop is O(n) and dominated linear sequential_read
on max-size 65 KB pages; a GF(2) matrix-power method is O(log n) but carries a
fixed ~32-matmul cost, so it is slower for the small pages real Opus/Vorbis
streams carry. The evolution across implementations (ogg benches):
| ogg bench | linear crc | +matrix | +matrix +memo-amortized guard (shipped) |
|---|---|---|---|
| sequential_read | 17.6 ms | 6.40 ms | 0.93 ms |
| cold_first_read | ~17 ms | 7.42 ms | 1.61 ms |
| seek_read | — | 821 µs | 829 µs |
Shipped as a hybrid: per-step loop below n=16384, matrix at/above; a differential test covers both paths + the boundary.
cargo bench -p musefs-core --bench read_throughput -- cold_first_read seek_read sequential_read
MUSEFS_BENCH_LATENCY_PROFILE=nfs-hdd cargo test --release -p musefs-core \
--features metrics --test bench_ingest bench_read_under_latency -- --ignored --nocapture
#69 — Refresh O(changed)
e7ae912^ → e7ae912. bench_refresh, RAM. No overlay.
What changed: changelog-driven change detection (changelog_since +
render_keys_for on just the changed ids) replaces the O(N) render-key scan, and
collision-gated apply_changes dirtying stops the old parent chain from being
rebuilt unconditionally. Refresh-1 cost becomes O(changed).
Single-track refresh vs library size (3 runs, median)
A single-track re-tag moves the track out of its shared album dir — the structural worst case for a flat corpus (one artist / one album, N siblings).
| library size | before — full rebuild (ms) | after — O(changed) (ms) | factor |
|---|---|---|---|
| 100 | 0 | 0 | — |
| 1000 | 6 | 0 | ∞ (sub-ms) |
| 5000 | 33 | 0 | ∞ (sub-ms) |
| 20000 | 173 | 1 | ~170× |
The after sweep is flat: refresh-1 @ 20 000 is within 1 ms of @ 100, against a linear ~170 ms slope before.
One-vs-many (same Musefs instance, 200-track ci tier)
| label | wall (ms) |
|---|---|
| refresh-1 | 0 |
| refresh-N (100 touched) | 4 |
refresh-N scales with the touched set, not the library.
# before (apply the 4-point sweep edit first):
sed -i 's/\[100usize, 1000, 5000\]/[100usize, 1000, 5000, 20000]/' musefs-core/tests/bench_refresh.rs
cargo test -p musefs-core --release --test bench_refresh \
bench_refresh_one_across_library_sizes -- --ignored --nocapture
cargo test -p musefs-core --release --test bench_refresh \
bench_refresh_one_vs_many -- --ignored --nocapture
#114 — Rendered child lookup (root fan-out)
0881b31^ → 0881b31. bench_refresh, RAM. Overlay: the
bench_refresh_root_fanout_one_across_library_sizes bench was added by #114, so
its harness is overlaid on the before checkout.
What changed: a rendered-name child index turns the root sibling scan in
deepest_existing_ancestor into an indexed miss. The corpus uses N top-level
artist directories; the timed update retags one track to fallback Unknown/…,
exercising an absent rendered-name lookup at root.
| library size (top-level artists) | before (ms) | after (ms) |
|---|---|---|
| 100 | 0 | 0 |
| 1000 | 0 | 0 |
| 5000 | 2 | 0 |
| 20000 | 5 | 1 |
~5× at the 20 000-artist fan-out; ≤5 k is already ≤2 ms on both sides.
cargo test -p musefs-core --release --test bench_refresh \
bench_refresh_root_fanout_one_across_library_sizes -- --ignored --nocapture
PR2 — Scan pair (#67/#68)
2d4faf3^ → 2d4faf3. bench_ingest, --features metrics, RAM, 3 runs.
No overlay.
What changed: (#67) gate the 128-byte ID3v1 tail read to .mp3 files — only
MP3 consumes the frame; (#68) ingest_bulk drains the owned Unit batch by
value, moving picture payloads into the DB structs instead of cloning.
Wall time — ci tier (200 tracks × 4 KiB, no art), median of 3
| format | before (ms) | after (ms) |
|---|---|---|
| flac | 29 | 30 |
| mp3 | 21 | 23 |
| m4a | 27 | 26 |
| m4a-last | 32 | 26 |
| ogg | 22 | 24 |
| wav | 21 | 24 |
Wall time is within run-to-run noise — at ci tier (4 KiB files, no embedded art) there is no picture payload to move, so #68's win doesn't show here. It appears on art-bearing corpora (the bandwidth tier / real libraries) where the clone was O(art-size) per file.
Scan I/O — the #67 signal (scan_bytes_read)
| format | before (B) | after (B) | Δ total | Δ per file |
|---|---|---|---|---|
| flac | 870 600 | 845 000 | −25 600 | −128 B |
| mp3 | 847 200 | 847 200 | 0 | 0 (tail still read) |
| m4a | 0 | 0 | 0 | n/a (seek-reader path) |
| m4a-last | 0 | 0 | 0 | n/a |
| ogg | 873 000 | 847 400 | −25 600 | −128 B |
| wav | 853 600 | 828 000 | −25 600 | −128 B |
Non-MP3 formats drop exactly the 128-byte ID3v1 tail per file (−25 600 B over the 200-track corpus). MP3 is unchanged; M4A uses the seek-reader, not the front-anchored probe path.
MUSEFS_BENCH_TIER=ci MUSEFS_BENCH_DIR=/dev/shm/bench \
cargo test -p musefs-core --release --features metrics --test bench_ingest \
-- --ignored --nocapture bench_cold_scan_and_revalidate
PR3 — Serve-path copies (#70)
32be8f0^ → 32be8f0. Criterion read_throughput, RAM. No overlay.
What changed: four stacked serve-path copy eliminations — DB chunk readers
fill the caller's &mut [u8]; read_segments writes ArtImage/BinaryTag/raw
OggArtSlice arms into the output buffer's resized tail; Musefs::read_into
serves into a caller buffer; and the FUSE layer reuses a per-worker thread-local
scratch buffer. None touches synthesis or layout (served audio stays
byte-identical).
sequential_read
| format | before (µs) | after (µs) | Δ | verdict |
|---|---|---|---|---|
| flac | 939.8 | 924.8 | −2.1% | noise |
| mp3 | 917.2 | 884.1 | −3.1% | noise |
| m4a | 904.1 | 877.6 | −3.7% | noise |
| m4a-last | 909.8 | 860.3 | −7.4% | improved |
| ogg | 1080.4 | 963.4 | −9.1% | improved |
| wav | 925.6 | 815.7 | −11.1% | improved |
cold_first_read / seek_read / concurrent
| bench | before | after | Δ | verdict |
|---|---|---|---|---|
| cold_first_read/flac | 1.652 ms | 1.557 ms | −5.8% | improved |
| cold_first_read/mp3 | 1.590 ms | 1.678 ms | +5.5% | regressed (within 10%) |
| cold_first_read/ogg | 1.781 ms | 1.694 ms | −4.9% | improved |
| seek_read (all) | — | — | within ±2.7% | held |
| concurrent_read_walk/m16 | 9.490 ms | 7.642 ms | −19.5% | improved |
No format breaches the >10% rise gate. The concurrent burst metric improves 19% here (it is high-variance and swings run-to-run; see SP3).
cargo bench -p musefs-core --bench read_throughput -- \
sequential_read concurrent_read_walk cold_first_read seek_read
#136 — HeaderCache → quick_cache
2e6674e^ → 2e6674e. Criterion read_throughput, RAM. No overlay.
What changed: an S3-FIFO byte-weighted quick_cache replaces the hand-rolled
16-shard Mutex LRU — the serve path's last shared std lock is gone.
At a glance: within noise. No workload regresses outside noise; the only movers are marginal sequential_read improvements on the metadata-light formats.
| bench | before | after | Δ | verdict |
|---|---|---|---|---|
| sequential_read/m4a | 851.1 µs | 794.7 µs | −6.6% | improved |
| sequential_read/m4a-last | 855.2 µs | 798.3 µs | −6.7% | improved |
| sequential_read/ogg | 1.043 ms | 962.9 µs | −7.7% | improved |
| sequential_read/flac,mp3,wav | — | — | within noise | held |
| cold_first_read (all) | — | — | within noise / −3.6% m4a | held |
| seek_read (all) | — | — | within noise | held |
| concurrent_read_walk/m16 | 5.557 ms | 5.451 ms | −1.9% | held |
cargo bench -p musefs-core --bench read_throughput
#112 — StructureOnly kernel passthrough
0881b31 → faec017. Bespoke dd harness (committed:
benches/passthrough_dd.sh), sudo (passthrough
needs CAP_SYS_ADMIN).
What changed: the backing fd is registered at open (FUSE passthrough, kernel ≥6.9); the kernel serves StructureOnly reads directly from the backing inode, bypassing the daemon round-trip.
512 MiB WAV backing on /dev/shm (RAM-cached, isolates FUSE-path overhead),
dd bs=1M sequential read, fresh mount per binary, 3 runs each:
| run 1 | run 2 | run 3 | median | |
|---|---|---|---|---|
| before (daemon reads) | 2.5 GB/s | 2.5 GB/s | 2.7 GB/s | 2.5 GB/s |
| after (passthrough) | 8.4 GB/s | 8.3 GB/s | 8.9 GB/s | 8.4 GB/s |
3.36× on this RAM-cached sequential workload: the before path round-trips
every ~128 KiB chunk through the daemon (wakeup + positioned read + copy back via
/dev/fuse); the after path reads straight from the backing inode's page cache.
sudo benches/passthrough_dd.sh target/release/musefs /dev/shm/pt 512
Cumulative detail
16caba4 → current main (e02223e). Derived, non-isolating — composed from
the per-pass isolated deltas above, anchored to current-main absolutes. A
same-harness end-to-end run is infeasible: MountConfig.case_insensitive and
scan_directory_with/ScanOptions/revalidate_with don't exist at 16caba4
(so main's harnesses can't compile there), and the 16caba4-era harness omits
the now-required case_insensitive field (so it can't compile on main either).
The deltas below name the contributing passes and the dominant one; unrelated
speedups are not multiplied into a single headline.
Current-main absolutes (1 run, native harness)
Ingest — ci tier, /data, bench_ingest:
| format | scan (ms) | revalidate (ms) | RSS (KiB) |
|---|---|---|---|
| flac | 47 | 2 | 6900 |
| mp3 | 25 | 2 | 6944 |
| m4a | 55 | 2 | 6956 |
| m4a-last | 39 | 3 | 6980 |
| ogg | 20 | 2 | 6980 |
| wav | 25 | 3 | 6984 |
Refresh — RAM, bench_refresh_one_across_library_sizes: refresh-1 @ 100 / 1000
/ 5000 = 0 ms; @ 20 000 = 1 ms.
Serve — RAM, read_throughput (Criterion median): sequential_read flac 569 µs
· mp3 563 µs · m4a 566 µs · m4a-last 568 µs · ogg 737 µs · wav 598 µs;
cold_first_read ogg 1.507 ms; seek_read ogg 806 µs; concurrent m16+walker 4.15 ms.
Composed per-subsystem deltas
Ingest = SP1 ∘ PR2. Dominated by SP1's durable-fsync elimination; PR2 is the −128 B/file + move-not-clone refinement.
| metric | pre-SP1 | current main | Δ |
|---|---|---|---|
| fsync count (latencyfs) | 403 | 0 | eliminated |
| scan_wall (ci flac) | 32 206 ms | 47 ms | ~685× |
| scan_wall (bandwidth flac) | 378 041 ms | ~15 228 ms† | ~24.8× |
| scan_bytes_read (ci flac) | 870 600 B | 845 000 B | −128 B/file |
† Bandwidth tier not re-measured at main; figure is SP1's after number.
Refresh = SP2 ∘ #69 ∘ #114. The O(N)→flat journey; dominant pass is #69 (changelog-driven O(changed) rebuild), with #114 shaving the 20 k root fan-out on top.
| metric | pre-SP2 | current main | Δ |
|---|---|---|---|
| refresh-1 @ 1000 | 5 ms | 0 ms | ∞ (sub-ms) |
| refresh-1 @ 5000 | 32 ms | 0 ms | ∞ (sub-ms) |
| refresh-1 @ 20000 | 173 ms | 1 ms | ~173× |
Serve = SP3 ∘ SP4 ∘ PR3 ∘ #136. SP3 + PR3 drive the cross-format sequential/cold/seek wins (alloc elimination + copy reduction); SP4 owns the ogg cold/seek collapse.
| metric | pre-SP3 | current main | Δ |
|---|---|---|---|
| sequential_read/flac | 929 µs | 569 µs | −38.8% |
| sequential_read/mp3 | 940 µs | 563 µs | −40.1% |
| sequential_read/m4a | 940 µs | 566 µs | −39.8% |
| sequential_read/ogg | 967 µs | 737 µs | −23.8% |
| sequential_read/wav | 935 µs | 598 µs | −36.1% |
| cold_first_read/ogg | 14.96 ms | 1.51 ms | −89.9% |
| seek_read/ogg | 13.54 ms | 806 µs | −94.0% |
| concurrent m16+walker | 8.20 ms | 4.15 ms | −49.4% |
Criterion's own change: lines compare against the previous on-machine baseline
(itself already optimized); the absolutes above are the reliable end-to-end
signal.
Storage tunables
A proposed --storage-profile {ssd,hdd,nfs} preset would have bumped
--max-readahead-kib and --max-background (and enabled --keep-cache) per medium,
on the premise that "larger read-ahead hides HDD/NFS latency." Measured against real
storage, that premise does not hold — only --keep-cache shows a benefit — so the
preset was dropped and these flags keep their defaults. This section records the
evidence.
Methodology
Unlike the optimization passes above (tmpfs, in-process Criterion), these run through a real kernel mount with a real reader, because the tunables are kernel↔FUSE negotiation parameters invisible to an in-process driver:
- Backing: real RAID-1 HDD (
/home,/dev/md127) and a btrfs HDD span (/data,/dev/sda3); for NFS, a loopback NFSv4.2 export (exportfs+mount -t nfs localhost:…) whose backing is tmpfs (isolates the RPC tax) or HDD (RPC + seeks). - Latency:
tc qdisc add dev lo root netem delay <X>msaddsXper packet → ≈2XRTT per NFS RPC. Tested at 8 ms, 50 ms, and 200 ms RTT (the last ≈ a trans-Pacific server). - Cold reads:
sync; echo 3 > /proc/sys/vm/drop_cachesbefore each measured read — without it the page cache serves repeats and hides all backing latency. - Mode:
synthesis, notstructure-only. Structure-only triggers kernel FUSE passthrough when the process is privileged (these run as root), which serves the backing fd directly and bypasses the daemon read path — and with it every tunable that acts on that path. Synthesis splicesBackingAudioreads through the daemon, the real serving path. - Why not the injected
MUSEFS_FAULT_*_USmodel: it cannot show a read-ahead effect. FUSE delivers reads to the daemon in fixed ≤256 KiB chunks (max_pages, already pinned at the kernel's 1 MiB ceiling byfuser's 16 MiB defaultmax_write), so the per-preadcount — and thus any per-preadinjected latency total — is independent ofmax_readahead.
Reproduce: benches/storage_tunables_bench.sh (needs /dev/fuse, root, and for the
NFS rows nfs-kernel-server + tc). HDD numbers are noisy (±10–15%); the trends, not
the digits, are the signal.
--max-readahead-kib — no benefit anywhere; hurts on HDD
Cold single-stream sequential throughput (MB/s), synthesis:
| readahead KiB | HDD /home (RAID1) | HDD /data (btrfs) | NFS 8 ms | NFS-on-HDD 50 ms | NFS-on-HDD 200 ms |
|---|---|---|---|---|---|
| 512 (default) | 248 | 127 | 30.8 | 4.7 | 1.3 |
| 2048 | 191 | 72 | 30.6 | 4.9 | 1.3 |
| 4096 | 153 | 84 | 30.5 | 4.9 | 1.3 |
| 32 (probe) | 237 | 75 | — | — | — |
(File sizes differ per column — 512 MiB local, 96 MiB at 50 ms, 48 MiB at 200 ms — so compare within a column, not across. The 200 ms column ≈ a trans-Pacific server: flat to the last digit.)
The window size barely moves throughput, and on HDD values ≥2048 KiB are among the slowest (peak is ~128–512 KiB). The reason is visible on NFS: 512 MiB ÷ 256 KiB × 8 ms ≈ 16 s ≈ the observed 31 MB/s — a single stream is served serially, one ≤256 KiB read at a time, each paying the full RTT, with no prefetch overlap that a larger window could exploit.
--max-background — no effect on read throughput
Wall time (s) for N concurrent cold streams over distinct tracks:
| max_background | HDD /home (16 streams) | NFS 8 ms (16) | NFS-on-HDD 50 ms (80) | NFS-on-HDD 200 ms (24) |
|---|---|---|---|---|
| 64 (default) | 4.55 | 5.16 | 177.8 | 238.5 |
| 128 | 5.05 | 5.18 | 175.7 | 237.4 |
64 ≈ 128 even with 80 > 64 streams. Expected: musefs's FuseConfig notes
max_background caps background work and that "foreground reads are bounded only by
client concurrency, not by this." The concurrent reads here are foreground.
(Concurrency does hide latency — 16 NFS streams reach ~10× single-stream aggregate —
but that is client parallelism, which max_background does not gate.)
--keep-cache — the one real win (~3×)
Cold read then immediate reopen (no cache drop between); reopen_s is the signal:
| keep_cache | HDD reopen (s) | NFS 8 ms reopen (s) | NFS-on-HDD 50 ms reopen (s) |
|---|---|---|---|
| false | 0.224 | 0.207 | 0.039 |
| true | 0.062 | 0.060 | 0.014 |
With --keep-cache the kernel retains the page cache across opens, so a re-opened file
is served from RAM instead of re-fetched over slow storage — ~3× faster reopen,
consistent across HDD and NFS. This is the only tunable worth changing for slow backing
(relevant for players/scanners that re-open files), and it needs no preset. It is on by
default as of #432 (inode invalidation on retag keeps it consistent); disable with
--keep-cache false on memory-constrained hosts.
Conclusions
- Drop the
--storage-profilepreset. Of the four knobs it would have set, three (max_readahead,max_background, and by extension a per-medium combination of them) show no benefit;max_readahead≥2048 KiB actively hurts on HDD. The only justified change — enable--keep-cacheon HDD/NFS — does not need an abstraction. - Single-stream latency hiding — addressed in #255 (next section). The serialized read path measured above (512 MiB ÷ 256 KiB × RTT) is exactly what backing read-ahead now fixes.
Backing read-ahead (#255)
Each --max-readahead-kib row above exposed the real bottleneck: a single stream is
served one ≤256 KiB FUSE chunk at a time, each paying the full backing RTT, so a
200 ms-RTT NFS mount tops out at ~1.3 MB/s regardless of the kernel read-ahead window.
The fix is read amplification in the daemon — BackingReader coalesces a stream's
small reads into one large positioned pread (geometric window growth, global RAM budget
with LRU eviction), so the backing client can pipeline/parallelize the RPCs behind one
syscall. A background-prefetch-threads layer ("Phase 2") was also built but is off by
default (see below).
Methodology
Two harnesses. Real kernel mount (benches/storage_tunables_bench.sh): a real reader
(dd) over a real FUSE mount, cold (drop_caches) each sample, median of 3. Local backing
on a btrfs HDD; NFS via a loopback NFSv4.2 export plus tc netem for RTT. The corpus
is real FLAC (MUSEFS_BENCH_CORPUS_SRC) — a /dev/zero corpus on a compressing fs
(btrfs compress=zstd) collapses to a cached extent and never touches the platter, which
silently inverts the HDD numbers; real already-compressed audio is incompressible.
In-process (musefs-core/tests/bench_ingest.rs::bench_read_under_latency): the core read
path over musefs-latencyfs (per-op injected latency), isolating the daemon from the kernel
FUSE layer. off = --read-ahead-budget-mib 0; phase1 = the default (amplification only);
phase1+2 = --read-ahead-prefetch.
Single-stream cold throughput (MB/s)
| backing | off | phase 1 (default) | phase 1+2 | passthrough |
|---|---|---|---|---|
| local HDD (btrfs, real FLAC) | ~60 | ~62 | ~60 | ~58 |
| NFS, tmpfs-backed, 200 ms RTT | 1.2 | 7.4 | 6.8 | 9.8 |
On NFS read-ahead is a ~6× single-stream win (1.2 → 7.4 MB/s, 75 % of the kernel-passthrough
ceiling). On a real local HDD all four configs sit within run-to-run noise (~±15 %) — read-ahead
is neutral, not a regression. (An earlier /dev/zero corpus showed a spurious −35 %; it was
the zstd-compression artifact above, not read-ahead.)
Concurrent streams (8 × distinct tracks, aggregate MB/s, NFS 200 ms RTT)
| off | phase 1 (default) | phase 1+2 | passthrough |
|---|---|---|---|
| 1.6 | 13.6 | 12.1 | 16.3 |
In-process, per-op latency (16 MiB Ogg whole read; wall ms / backing preads)
| profile | off | phase 1 (default) |
|---|---|---|
| ssd (80 µs/op) | 45 ms / 774 preads | 26 ms / 32 preads |
| nfs-ssd (600 µs/op) | 138 ms / 774 | 112 ms / 32 |
Amplification collapses 774 backing round-trips to 32; the win scales with per-op latency and is already material at SSD speeds (1.7×).
Phase 2 is off by default
Background prefetch threads (Phase 2) never beat amplification alone and cost a consistent
~10 %: single-stream NFS 6.8 vs 7.4, concurrent NFS 12.1 vs 13.6, neutral on HDD. A single large
pread already lets the NFS client pipeline its RPCs, so the threads add coordination overhead
without overlap to exploit. Phase 2 is therefore opt-in (--read-ahead-prefetch), retained for
hypothetical backends where one large read does not self-pipeline.
Defaults: read-ahead on at --read-ahead-budget-mib 64, Phase-1 amplification only. Set
0 to disable on local-disk-only setups (no benefit there, though no harm either).
Internal window cap on HDD (#433)
The amplification window doubles per sequential read up to WINDOW_ABS_CAP (8 MiB,
musefs-core/src/readahead.rs). The #256 sweep above measured the kernel max_readahead
knob — where ≥2048 KiB hurts on HDD — but never this daemon-internal cap, so #433 asked whether
8 MiB is too large for spinning media.
Methodology. WINDOW_ABS_CAP is a compile-time const, so the sweep builds one release binary
per value (benches/storage_tunables_bench.sh window-cap, which patches the const in place and
restores it after each build). Cold (drop_caches) single-stream reads of the same ~270 MiB real
FLAC, synthesis mount, default flags (amplification on, prefetch off), real backing on a btrfs
HDD (/data, 4389-track corpus). Reproduce:
WINDOW_CAP_MIB="1 2 4 8 16" MUSEFS_BENCH_CORPUS_SRC=<music-tree> MUSEFS_BENCH_CORPUS_MAX_MIB=300 \
benches/storage_tunables_bench.sh window-cap <hdd-backing-dir>
Result: no measurable cap effect — the medium's noise dominates. Median MB/s (and the within-cap min–max over 7 cold samples) overlap across every cap, and the apparent ordering is an artifact of measurement order, not the cap: throughput drifts down through each run, so whatever runs first looks fastest. Sweeping the caps in the reverse order reverses the "trend".
| cap (MiB) | ascending sweep, median (min–max) | descending sweep, median |
|---|---|---|
| 1 | 104 (65–132) | 46 |
| 2 | 85 (54–142) | 58 |
| 4 | 83 (51–104) | 61 |
| 8 | 61 (54–72) | 61 |
| 16 | 68 (54–94) | 86 |
The current default (8 MiB) lands at ~61 MB/s in both orderings; the first-measured cap is fastest in both (104 for cap 1 ascending, 86 for cap 16 descending). The within-cap spread (≈50–140 MB/s) dwarfs every between-cap median gap. This corroborates the #256 finding that backing read-ahead is neutral on local HDD.
Decision: keep 8 MiB, no runtime knob. There is no HDD gain to capture, and the cap exists for
the proven case — the ~6× single-stream amplification win on high-RTT NFS/remote, where coalescing
into one large pread lets the client pipeline RPCs. Lowering the cap to chase an unmeasurable HDD
effect would regress that win.
Global allocator — steady-state RSS (#360)
Long-lived high-churn FUSE load fragments glibc malloc, growing daemon RSS over
days without a true leak. The musefs binary now defaults to the jemalloc
global allocator with a background purge thread. Measured with
scripts/rss-churn-bench.sh (Linux; median VmRSS over the flattened tail —
steady state, not peak).
Parameters: WORKERS=8 (nproc), FILES=500, CYCLES=200, WARMUP=20, no
REFRESH_CMD. DB = a freshly-scanned 4427-track store on tmpfs (/tmp); backing
audio on /data (HDD). Concurrent cat-to-/dev/null churn drives the
open/read/release handle-table and read-synthesis allocation path.
| Allocator | Steady-state RSS |
|---|---|
| system malloc | ~74.7 MiB (76496 kiB) |
| jemalloc | ~28.7 MiB (29368 kiB) |
Decision: SHIP jemalloc. Steady-state RSS is ~62% lower (jemalloc ≤ system malloc, the §4 ship rule). Under identical churn glibc retained ~46 MiB of dirty pages that jemalloc's decay + background purge return to the OS — the #360 fragmentation failure mode, reproduced and fixed. The gap is far outside run-to-run noise, so no within-noise tie-break was needed.
Scan fingerprint overhead (#464)
Bench: cargo bench -p musefs-core --bench fingerprint_overhead
Corpus: 200 minimal FLAC files (~200 B metadata + 4 KiB audio) in tempfile::tempdir() (TMPDIR =
tmpfs/RAM). Single-threaded scan (jobs: 1). Criterion, 20 samples.
| Tier | Median (ms) | µs/file |
|---|---|---|
None | 47.0 | 235 |
Fingerprint | 107.6 | 538 |
Delta: +60.6 ms / 200 files = +303 µs/file overhead (+129%).
Interpretation: The 129% overhead on this synthetic RAM-backed bench exceeds the plan's ≤15%
threshold. The overhead is dominated by the extra UPDATE tracks SET fingerprint = … SQLite
execution per file inside the batch transaction — not by SHA-256 hashing cost (SHA-256 of a few
hundred bytes is sub-microsecond). On a real HDD-backed library the probe I/O (tens-of-ms per file)
is the bottleneck, making both the hash and the DB write negligible. See plan Task E2 step 3 decision
note: the decision to keep SHA-256 and add the length CHECK was escalated to the controller because
the raw percentage exceeded the stated threshold, even though the absolute overhead (303 µs/file) is
operationally negligible at disk I/O rates.
Scan fingerprint overhead — SSD latency profile (#464)
Bench: bench_scan_under_latency in musefs-core/tests/bench_ingest.rs (MUSEFS_BENCH_LATENCY_PROFILE=ssd).
Corpus: 200 minimal FLAC files (~200 B metadata + 4 KiB audio) on a musefs-latencyfs SSD-latency
FUSE mount. Default thread count (jobs: 0). 3 runs each, median reported.
| Tier | Median (ms) | µs/file |
|---|---|---|
None | 222 | 1110 |
Fingerprint | 241 | 1205 |
Delta: +19 ms / 200 files = +95 µs/file overhead (+8.6%).
Interpretation: Under an SSD latency profile the I/O dominates and the fingerprint overhead drops to +8.6% (+95 µs/file), well within the plan's ≤15% threshold. The RAM bench's +129% (+303 µs/file) was an artefact of RAM eliminating the I/O that would normally dwarf the extra SHA-256 hash and DB write. At real SSD rates the fingerprint cost is operationally negligible.