Benchmarks

Every optimization pass re-measured apples-to-apples on one box as a PR-isolated before/after pair, plus a cumulative 16caba4main summary. This file is performance only — correctness gates (byte-identical proptests, FUSE e2e, in-diff mutation) live in CI and the contributor guide, not here.

Read it in three layers:

  1. Results at a glance — the cumulative per-subsystem delta and a one-line headline per pass.
  2. Methodology — machine, before/after definition, the overlay rule, run conventions, storage placement. Written once; every detail section assumes it.
  3. Per-pass detail — one section per pass: what changed, the before/after table, the reproduce command, and the "why" where it matters.

Results at a glance

Cumulative — 16caba4 → current main (e02223e)

Composed from the per-pass isolated deltas below, anchored to current-main absolutes. Non-isolating: a same-harness run at both ends is infeasible (API drift means neither the 16caba4-era harness nor the main harness compiles at the other commit), so these compose the chain of passes that touched each subsystem rather than a single end-to-end measurement. See Cumulative detail for the absolutes and the per-pass composition.

SubsystemHeadline metric16caba4-eracurrent mainΔDominant pass
Ingestfsync count (durable)4030eliminatedSP1
cold scan, ci flac32 206 ms47 ms~685×SP1
Refreshrefresh-1 @ 20 000 tracks173 ms1 ms~173×#69
Servesequential_read/flac929 µs569 µs−38.8%SP3 + PR3
cold_first_read/ogg14.96 ms1.51 ms−89.9%SP4
concurrent m16+walker8.20 ms4.15 ms−49.4%SP3 + PR3

Per-pass headlines

Each headline is the pass's single largest statistically-significant delta on its deployment-representative tier.

PassCommitHeadline (this box)
SP1 — ingestion scalabilityccbbfaadurable cold scan ~1150–3600× faster; fsync storm 403→0
SP2 — incremental tree refreshed5f3805 000-track refresh-1 1.4× (32→23 ms)
SP3 — read/serve residualse8d56bdsequential_read −8 to −13% (flac/mp3/m4a/m4a-last)
SP4 — storage-aware Ogg servinga62453bogg cold-read −88%, seek −94%
#69 — refresh O(changed)e7ae912refresh-1 @ 20 000 ~170× (173→1 ms)
#114 — root fan-out lookup0881b31root fan-out @ 20 000 ~5× (5→1 ms)
PR2 — scan pair (#67/#68)2d4faf3−128 B/file scan I/O (flac/ogg/wav); wall within noise
PR3 — serve-path copies (#70)32be8f0sequential_read −7 to −11% (m4a-last/ogg/wav); concurrent −19%
#136 — HeaderCache quick_cache2e6674ewithin noise (marginal m4a/ogg sequential)
#112 — StructureOnly passthroughfaec017passthrough dd 3.36× (2.5→8.4 GB/s)

One direction inverted vs the historical file: SP1 §4 (compute-isolated, on RAM) is now faster after the change, not slower. The old file recorded SP1 as ~1.9× slower on RAM-backed tempfs (the "honest cost" of the pipeline); on this 8-core box the parallel pipeline wins even on RAM (~1.4×), at higher peak RSS. See SP1 §4.


CI regression gating

BENCHMARKS.md records hand-run absolute numbers; CI guards against regressions in three lanes:

  1. Counter gate (every non-doc PR, hard). perf_counters.rs + tree.rs golden work-counter assertions under --features metrics. Catches algorithmic regressions (extra copy, whole-file slurp, O(N) tree rebuild).
  2. A/B wall-clock (warn-only, core src PRs). The perf-bench matrix job benches the base and PR commits in parallel on separate runners; the perf-ab job then diffs the two exported baselines and posts a critcmp delta as a PR comment. Never blocks.
  3. Release record. The benchmarks job runs the full bench suite at the ci tier on a tag and uploads the numbers as an artifact for curation here.

The fsync-storm (403→0) signal needs a real FUSE mount and lives only in the release lane / the #[ignore] bench_scan_under_latency, not the per-PR gate.

The release artifact is named benchmark-snapshot-<tag>; download it from the tag's workflow run. The job runs on a GitHub-hosted ubuntu-latest runner, not the dedicated box the rest of this file uses, so its wall-times are runner-relative and are not folded into the per-pass tables — only the portable signals (bytes_read, pread/fsync counts, refresh flatness) are cross-comparable. Each release's snapshot is recorded verbatim under Release CI snapshots.

Release CI snapshots

Per-tag records from the benchmarks release job (CI regression gating §3), run on a GitHub-hosted ubuntu-latest runner at the ci tier. This is not the dedicated box the per-pass tables use, so the wall-times here are runner-relative and are a point-in-time record per release — not comparable to those sections. The portable signals (scan_bytes_read, pread/fsync counts, refresh flatness) are comparable, and are the no-regression check.

v1.1.0 — f865afc, single run

No regression vs the curated tables: scan_bytes_read is unchanged from PR2 (flac/ogg/wav = 845 000 / 847 400 / 828 000 B — the −128 B/file ID3v1 gating still holds; mp3 847 200 B unchanged; m4a uses the seek-reader, 0 B), and single-track refresh stays flat with library size.

read_throughput (Criterion, median estimate, µs):

benchflacmp3m4am4a-lastoggwav
sequential_read416418417419524†422
cold_first_read798778805804911792
seek_read368353372370582365

concurrent_read_walk/m8_plus_walker: 931 µs.

sequential_read/ogg collected only 10k iterations with 19% outliers (Criterion low-sample warning) — treat as noisy.

bench_ingest — ci tier (200 tracks × 4 KiB), runner tmpfs:

formatscan (ms)revalidate (ms)scan_bytes_read (B)RSS (KiB)
flac311845 0007100
mp3821847 2007164
m4a86107180
m4a-last87107184
ogg831847 4007184
wav831828 0007184

bench_refresh — ci tier, single-track re-tag:

library sizerefresh-1 (ms)root-fanout-1 (ms)
10000
100011
5000123
2000079

refresh-1 vs refresh-N (200-track ci, same instance): refresh-1 0 ms, refresh-N (100 touched) 6 ms. The 5000 > 20000 inversion is single-run noise on the shared runner; the dedicated-box #69 sweep is the clean flat signal.

Methodology

Machine

CPU8 cores
RAM32 GB (31 GiB)
Durable storage (/data)btrfs, 2-device span (sda3+sdb3), rotational; Data: single, Metadata: RAID1; zstd:1. No SSD on this box.
RAM storage (/dev/shm)tmpfs
Toolchainrustc 1.96.0 · release builds
KernelLinux 7.0 (FUSE passthrough requires ≥6.9 + CAP_SYS_ADMIN)

Before / after definition

History is squash-merged (linear), so each pass is one commit:

  • after = the pass's own squash-merge commit.
  • before = its parent, <after>^PR-isolated, not current main. This preserves attribution (each delta is exactly what that PR changed) and avoids harness drift from later passes.

The overlay rule

Two passes (SP2, SP4) report a bench that did not yet exist at their before commit. For those, the after-commit's harness file is checked out onto the before checkout (git checkout <after> -- <bench_file>) so the old code is measured with the new harness. Overlay use is called out in each affected section.

Run conventions

  • bench_ingest / bench_refresh (ignored tests, cargo test --release … -- --ignored): 3 runs, median reported (spread noted where it matters). bench_ingest needs --features metrics.
  • read_throughput (Criterion bench): Criterion's own sampling; before side saved with --save-baseline, after side compared with --baseline. Reported Δ is Criterion's change estimate.
  • Wall times on /data are box-relative (rotational disk); where a portable signal exists (fsync count, bytes_read, pread count) it is the primary number.

Storage placement

  • Durable rows run on /data (rotational btrfs). bench_ingest honors MUSEFS_BENCH_DIR.
  • RAM rows run on /dev/shm (tmpfs). bench_ingest honors MUSEFS_BENCH_DIR=/dev/shm/…; bench_refresh and read_throughput ignore it and follow TMPDIR=/dev/shm.

Per-pass detail

SP1 — Ingestion scalability

ccbbfaa^ccbbfaa. bench_ingest, --features metrics. No overlay.

What changed: whole-file fs::read slurp + per-file commits at synchronous=FULL → bounded probing reads + parallel-probe/single-writer pipeline + per-batch transactions at synchronous=NORMAL (WAL retained).

1. Durable small files — the fsync/batching win

ci tier (200 tracks × 4 KiB, no embedded art), corpus + DB on /data. Not compute-bound — the before path is dominated by per-file fsync latency.

formatbefore scan (ms)after scan (ms)speedup
flac32 206211534×
mp316 124141152×
m4a30 089191584×
m4a-last39 592113599×
ogg16 153141154×
wav15 574121298×

2. Durable large files — bounded reads + batching

bandwidth tier (1000 tracks × 30 MiB FLAC + art ≈ 30 GiB), on /data, 1 run.

metricbefore (slurp)after (bounded)Δ
scan wall (ms)378 04115 22824.8× faster
revalidate (ms)2431417.4×
peak RSS (KiB)98 636132 4360.74× (more)

The after path reads only a ~1 MiB metadata window per file instead of slurping each 30 MiB file in full.

3. fsync count — the mechanism

ci tier (200 FLAC) scanned through the passthrough latency-FS (ssd profile), which counts fsyncs at the FUSE layer. Wall is box-relative (rotational /data); the fsync count is the portable signal.

configfsyncsscan wall (ms, box-relative)
before (synchronous=FULL, per-file commits)40379
after (synchronous=NORMAL, batched commits)021

The 403→0 collapse is the root cause of §1's durable speedups.

4. Compute-isolated (RAM) — the trade, now a win on this box

large-compute tier (100k tracks × ~38 KiB FLAC) on /dev/shm (RAM), where fsync is free — so the §1/§3 batching win is neutralized and only raw compute remains. bytes_read ≈ 3.92 GiB both sides (the 38 KiB files are below the 1 MiB window, so bounded reads don't help).

configbefore scan (ms)after scan (ms)revalidate before→after (ms)peak RSS before→after (KiB)
default jobs31 24122 2952239 → 127827 904 → 96 084
--jobs 131 11123 5652255 → 128328 024 → 92 200

Finding — direction inverted vs the historical file. The old file (6-core EPYC) recorded SP1 as ~1.9× slower on RAM — the deliberate "honest cost" of the pipeline where there is no fsync win to amortize. On this 8-core box the parallel pipeline is ~1.4× faster even on RAM (the extra cores outweigh the per-file coordination), at the cost of ~3.4× peak RSS (96 MB vs 28 MB). The trade has shifted from "small RAM loss" to "RAM win for more memory" on wider hardware.

# durable §1/§2: MUSEFS_BENCH_DIR on /data ; RAM §4: MUSEFS_BENCH_DIR on /dev/shm
MUSEFS_BENCH_TIER=ci MUSEFS_BENCH_DIR=/data/bench \
  cargo test --release -p musefs-core --features metrics --test bench_ingest \
  -- --ignored --nocapture bench_cold_scan_and_revalidate
# §3 fsync count:
MUSEFS_BENCH_LATENCY_PROFILE=ssd MUSEFS_BENCH_TIER=ci MUSEFS_BENCH_FORMAT_MIX=flac \
  cargo test --release -p musefs-core --features metrics --test bench_ingest \
  bench_scan_under_latency -- --ignored --nocapture

SP2 — Incremental tree refresh

ed5f380^ed5f380. bench_refresh, RAM (TMPDIR=/dev/shm). Overlay: the bench_refresh_one_across_library_sizes sweep didn't exist at ed5f380^, so the after-commit harness is overlaid on the before checkout.

What changed: replace the O(N) VirtualTree::build_with full reconstruction with apply_changes (in-place im-backed tree mutation) — only nodes whose id appears in the changed/added/removed sets are touched.

ci tier, FLAC, single-track re-tag, 3 runs (median):

library sizebefore (ms)after (ms)speedup
10000n/a (sub-granularity)
1000560.83× (noise tier)
500032231.39×

Why (Stage A → Stage B): at Stage A the rebuild already rendered incrementally (only the changed track re-rendered, O(changed)), but the subsequent VirtualTree::build_with reconstructed the whole tree from scratch (O(N)) — the remaining linear cost. Stage B's apply_changes removes that full reconstruction; the residual slope (still ~23 ms at 5000) is the lighter O(N) render-key scan + HashMap rebuild that feeds apply_changes, not a full tree rebuild. The speedup grows with library size because diff cost is proportional to changes, not total entries. (Corpus is single-album, so build_with time is slightly optimistic vs a real multi-album library.)

cargo test -p musefs-core --release --test bench_refresh \
  bench_refresh_one_across_library_sizes -- --ignored --nocapture

SP3 — Read/serve residuals

e8d56bd^e8d56bd. Criterion read_throughput, RAM. No overlay.

What changed: (1) read_segments writes each BackingAudio run directly into the output buffer's reserved tail (no throwaway vec![0u8; n] + copy); (2) handles: Mutex<HashMap> → lock-free sharded_slab::Slab; (3) size_cache: Mutex<HashMap>dashmap::DashMap.

sequential_read — per-format (4 MiB files, 128 KiB reads)

formatbefore (µs)after (µs)time Δthrpt Δ
flac929.1839.6−7.9%+8.6%
mp3940.2824.8−13.1%+15.1%
m4a939.8824.2−10.8%+12.2%
m4a-last938.0842.6−10.3%+11.4%
ogg966.81049.4+6.3%−5.9%
wav935.4912.3−2.5%+2.5%

The metadata-light formats improve 8–13% from dropping the per-splice alloc+copy. ogg +6.3% is a low-iteration sampling anomaly (Criterion warned "Unable to complete 100 samples in 5.0s" — only 5050 iterations vs 10k for other formats).

concurrent_read_walk/m16_plus_walker

16 reader threads + one metadata walker sharing one Arc<Musefs> (includes thread spawn/join):

before (ms)after (ms)Δ
m16_plus_walker8.209.48+15.7%

This high-variance burst metric regressed on this run — attributable to thread spawn/join overhead in the contention path rather than the read path itself; it is not a sequential-read regression. (The old file recorded this bench as parity/improved; it swings run-to-run.)

cargo bench -p musefs-core --bench read_throughput -- sequential_read concurrent_read_walk

SP4 — Storage-aware Ogg serving

a62453b^a62453b. Criterion read_throughput + latency-injected read. Overlay: cold_first_read/seek_read were added by SP4, so the after-commit bench is overlaid on the before checkout.

What changed: replace the eager whole-region Ogg page index with a stateless per-request backwards-scan: find_page_start locates the containing page from a ~65 KB window (CRC-validated entry guard), serve_ogg_window patches each page header algebraically (crc_shift_zeros, no payload I/O), and a one-entry last_page memo short-circuits the scan + CRC guard when the next request lands inside the already-located page.

sequential_read — warm repeat-read (no page-index amortization to win)

formatbefore (µs)after (µs)Δ
flac856.2880.5+2.8%
mp3847.7894.5+5.5%
m4a862.5816.9−5.3%
m4a-last872.7831.6−4.7%
ogg1037.91048.2+1.0%
wav892.6840.8−5.8%

cold_first_read / seek_read — the Ogg win

benchformatbeforeafterΔ
cold_first_readogg14.956 ms1.799 ms−88.0%
seek_readogg13.541 ms827 µs−93.9%

Non-ogg cold/seek stay within ±7% (no page index involved). The wins come from never building the whole-file index up front — the old code reads the entire prefix to serve even one chunk near EOF; SP4 scans ~65 KB backward, then the memo carries the validated page forward. sequential_read/ogg is flat (+1.0%) because it reads the full file linearly regardless — the win is cold-start and seek.

Latency-injected reads (bench_read_under_latency, nfs-hdd) — AFTER only

This bench was introduced by SP4; no before baseline exists.

labelformattierstoragewall (ms)openspreads
read_whole_coldoggcinfs-hdd2810
read_seek_coldoggcinfs-hdd2810

preads=0: the backwards-scan reads are served from the layout's inline/generated segments without reaching the backing file. Near-equal whole/seek wall time indicates per-file open+resolve latency dominates under nfs-hdd; the local cold/seek benches above are the clean signal.

Why crc_shift_zeros is a hybrid

patch_page_header_algebraic advances the CRC past a page's payload via crc_shift_zeros. The per-step loop is O(n) and dominated linear sequential_read on max-size 65 KB pages; a GF(2) matrix-power method is O(log n) but carries a fixed ~32-matmul cost, so it is slower for the small pages real Opus/Vorbis streams carry. The evolution across implementations (ogg benches):

ogg benchlinear crc+matrix+matrix +memo-amortized guard (shipped)
sequential_read17.6 ms6.40 ms0.93 ms
cold_first_read~17 ms7.42 ms1.61 ms
seek_read821 µs829 µs

Shipped as a hybrid: per-step loop below n=16384, matrix at/above; a differential test covers both paths + the boundary.

cargo bench -p musefs-core --bench read_throughput -- cold_first_read seek_read sequential_read
MUSEFS_BENCH_LATENCY_PROFILE=nfs-hdd cargo test --release -p musefs-core \
  --features metrics --test bench_ingest bench_read_under_latency -- --ignored --nocapture

#69 — Refresh O(changed)

e7ae912^e7ae912. bench_refresh, RAM. No overlay.

What changed: changelog-driven change detection (changelog_since + render_keys_for on just the changed ids) replaces the O(N) render-key scan, and collision-gated apply_changes dirtying stops the old parent chain from being rebuilt unconditionally. Refresh-1 cost becomes O(changed).

Single-track refresh vs library size (3 runs, median)

A single-track re-tag moves the track out of its shared album dir — the structural worst case for a flat corpus (one artist / one album, N siblings).

library sizebefore — full rebuild (ms)after — O(changed) (ms)factor
10000
100060∞ (sub-ms)
5000330∞ (sub-ms)
200001731~170×

The after sweep is flat: refresh-1 @ 20 000 is within 1 ms of @ 100, against a linear ~170 ms slope before.

One-vs-many (same Musefs instance, 200-track ci tier)

labelwall (ms)
refresh-10
refresh-N (100 touched)4

refresh-N scales with the touched set, not the library.

# before (apply the 4-point sweep edit first):
sed -i 's/\[100usize, 1000, 5000\]/[100usize, 1000, 5000, 20000]/' musefs-core/tests/bench_refresh.rs
cargo test -p musefs-core --release --test bench_refresh \
  bench_refresh_one_across_library_sizes -- --ignored --nocapture
cargo test -p musefs-core --release --test bench_refresh \
  bench_refresh_one_vs_many -- --ignored --nocapture

#114 — Rendered child lookup (root fan-out)

0881b31^0881b31. bench_refresh, RAM. Overlay: the bench_refresh_root_fanout_one_across_library_sizes bench was added by #114, so its harness is overlaid on the before checkout.

What changed: a rendered-name child index turns the root sibling scan in deepest_existing_ancestor into an indexed miss. The corpus uses N top-level artist directories; the timed update retags one track to fallback Unknown/…, exercising an absent rendered-name lookup at root.

library size (top-level artists)before (ms)after (ms)
10000
100000
500020
2000051

~5× at the 20 000-artist fan-out; ≤5 k is already ≤2 ms on both sides.

cargo test -p musefs-core --release --test bench_refresh \
  bench_refresh_root_fanout_one_across_library_sizes -- --ignored --nocapture

PR2 — Scan pair (#67/#68)

2d4faf3^2d4faf3. bench_ingest, --features metrics, RAM, 3 runs. No overlay.

What changed: (#67) gate the 128-byte ID3v1 tail read to .mp3 files — only MP3 consumes the frame; (#68) ingest_bulk drains the owned Unit batch by value, moving picture payloads into the DB structs instead of cloning.

Wall time — ci tier (200 tracks × 4 KiB, no art), median of 3

formatbefore (ms)after (ms)
flac2930
mp32123
m4a2726
m4a-last3226
ogg2224
wav2124

Wall time is within run-to-run noise — at ci tier (4 KiB files, no embedded art) there is no picture payload to move, so #68's win doesn't show here. It appears on art-bearing corpora (the bandwidth tier / real libraries) where the clone was O(art-size) per file.

Scan I/O — the #67 signal (scan_bytes_read)

formatbefore (B)after (B)Δ totalΔ per file
flac870 600845 000−25 600−128 B
mp3847 200847 20000 (tail still read)
m4a000n/a (seek-reader path)
m4a-last000n/a
ogg873 000847 400−25 600−128 B
wav853 600828 000−25 600−128 B

Non-MP3 formats drop exactly the 128-byte ID3v1 tail per file (−25 600 B over the 200-track corpus). MP3 is unchanged; M4A uses the seek-reader, not the front-anchored probe path.

MUSEFS_BENCH_TIER=ci MUSEFS_BENCH_DIR=/dev/shm/bench \
  cargo test -p musefs-core --release --features metrics --test bench_ingest \
  -- --ignored --nocapture bench_cold_scan_and_revalidate

PR3 — Serve-path copies (#70)

32be8f0^32be8f0. Criterion read_throughput, RAM. No overlay.

What changed: four stacked serve-path copy eliminations — DB chunk readers fill the caller's &mut [u8]; read_segments writes ArtImage/BinaryTag/raw OggArtSlice arms into the output buffer's resized tail; Musefs::read_into serves into a caller buffer; and the FUSE layer reuses a per-worker thread-local scratch buffer. None touches synthesis or layout (served audio stays byte-identical).

sequential_read

formatbefore (µs)after (µs)Δverdict
flac939.8924.8−2.1%noise
mp3917.2884.1−3.1%noise
m4a904.1877.6−3.7%noise
m4a-last909.8860.3−7.4%improved
ogg1080.4963.4−9.1%improved
wav925.6815.7−11.1%improved

cold_first_read / seek_read / concurrent

benchbeforeafterΔverdict
cold_first_read/flac1.652 ms1.557 ms−5.8%improved
cold_first_read/mp31.590 ms1.678 ms+5.5%regressed (within 10%)
cold_first_read/ogg1.781 ms1.694 ms−4.9%improved
seek_read (all)within ±2.7%held
concurrent_read_walk/m169.490 ms7.642 ms−19.5%improved

No format breaches the >10% rise gate. The concurrent burst metric improves 19% here (it is high-variance and swings run-to-run; see SP3).

cargo bench -p musefs-core --bench read_throughput -- \
  sequential_read concurrent_read_walk cold_first_read seek_read

#136 — HeaderCache → quick_cache

2e6674e^2e6674e. Criterion read_throughput, RAM. No overlay.

What changed: an S3-FIFO byte-weighted quick_cache replaces the hand-rolled 16-shard Mutex LRU — the serve path's last shared std lock is gone.

At a glance: within noise. No workload regresses outside noise; the only movers are marginal sequential_read improvements on the metadata-light formats.

benchbeforeafterΔverdict
sequential_read/m4a851.1 µs794.7 µs−6.6%improved
sequential_read/m4a-last855.2 µs798.3 µs−6.7%improved
sequential_read/ogg1.043 ms962.9 µs−7.7%improved
sequential_read/flac,mp3,wavwithin noiseheld
cold_first_read (all)within noise / −3.6% m4aheld
seek_read (all)within noiseheld
concurrent_read_walk/m165.557 ms5.451 ms−1.9%held
cargo bench -p musefs-core --bench read_throughput

#112 — StructureOnly kernel passthrough

0881b31faec017. Bespoke dd harness (committed: benches/passthrough_dd.sh), sudo (passthrough needs CAP_SYS_ADMIN).

What changed: the backing fd is registered at open (FUSE passthrough, kernel ≥6.9); the kernel serves StructureOnly reads directly from the backing inode, bypassing the daemon round-trip.

512 MiB WAV backing on /dev/shm (RAM-cached, isolates FUSE-path overhead), dd bs=1M sequential read, fresh mount per binary, 3 runs each:

run 1run 2run 3median
before (daemon reads)2.5 GB/s2.5 GB/s2.7 GB/s2.5 GB/s
after (passthrough)8.4 GB/s8.3 GB/s8.9 GB/s8.4 GB/s

3.36× on this RAM-cached sequential workload: the before path round-trips every ~128 KiB chunk through the daemon (wakeup + positioned read + copy back via /dev/fuse); the after path reads straight from the backing inode's page cache.

sudo benches/passthrough_dd.sh target/release/musefs /dev/shm/pt 512

Cumulative detail

16caba4 → current main (e02223e). Derived, non-isolating — composed from the per-pass isolated deltas above, anchored to current-main absolutes. A same-harness end-to-end run is infeasible: MountConfig.case_insensitive and scan_directory_with/ScanOptions/revalidate_with don't exist at 16caba4 (so main's harnesses can't compile there), and the 16caba4-era harness omits the now-required case_insensitive field (so it can't compile on main either). The deltas below name the contributing passes and the dominant one; unrelated speedups are not multiplied into a single headline.

Current-main absolutes (1 run, native harness)

Ingest — ci tier, /data, bench_ingest:

formatscan (ms)revalidate (ms)RSS (KiB)
flac4726900
mp32526944
m4a5526956
m4a-last3936980
ogg2026980
wav2536984

Refresh — RAM, bench_refresh_one_across_library_sizes: refresh-1 @ 100 / 1000 / 5000 = 0 ms; @ 20 000 = 1 ms.

Serve — RAM, read_throughput (Criterion median): sequential_read flac 569 µs · mp3 563 µs · m4a 566 µs · m4a-last 568 µs · ogg 737 µs · wav 598 µs; cold_first_read ogg 1.507 ms; seek_read ogg 806 µs; concurrent m16+walker 4.15 ms.

Composed per-subsystem deltas

Ingest = SP1 ∘ PR2. Dominated by SP1's durable-fsync elimination; PR2 is the −128 B/file + move-not-clone refinement.

metricpre-SP1current mainΔ
fsync count (latencyfs)4030eliminated
scan_wall (ci flac)32 206 ms47 ms~685×
scan_wall (bandwidth flac)378 041 ms~15 228 ms†~24.8×
scan_bytes_read (ci flac)870 600 B845 000 B−128 B/file

† Bandwidth tier not re-measured at main; figure is SP1's after number.

Refresh = SP2 ∘ #69 ∘ #114. The O(N)→flat journey; dominant pass is #69 (changelog-driven O(changed) rebuild), with #114 shaving the 20 k root fan-out on top.

metricpre-SP2current mainΔ
refresh-1 @ 10005 ms0 ms∞ (sub-ms)
refresh-1 @ 500032 ms0 ms∞ (sub-ms)
refresh-1 @ 20000173 ms1 ms~173×

Serve = SP3 ∘ SP4 ∘ PR3 ∘ #136. SP3 + PR3 drive the cross-format sequential/cold/seek wins (alloc elimination + copy reduction); SP4 owns the ogg cold/seek collapse.

metricpre-SP3current mainΔ
sequential_read/flac929 µs569 µs−38.8%
sequential_read/mp3940 µs563 µs−40.1%
sequential_read/m4a940 µs566 µs−39.8%
sequential_read/ogg967 µs737 µs−23.8%
sequential_read/wav935 µs598 µs−36.1%
cold_first_read/ogg14.96 ms1.51 ms−89.9%
seek_read/ogg13.54 ms806 µs−94.0%
concurrent m16+walker8.20 ms4.15 ms−49.4%

Criterion's own change: lines compare against the previous on-machine baseline (itself already optimized); the absolutes above are the reliable end-to-end signal.


Storage tunables

A proposed --storage-profile {ssd,hdd,nfs} preset would have bumped --max-readahead-kib and --max-background (and enabled --keep-cache) per medium, on the premise that "larger read-ahead hides HDD/NFS latency." Measured against real storage, that premise does not hold — only --keep-cache shows a benefit — so the preset was dropped and these flags keep their defaults. This section records the evidence.

Methodology

Unlike the optimization passes above (tmpfs, in-process Criterion), these run through a real kernel mount with a real reader, because the tunables are kernel↔FUSE negotiation parameters invisible to an in-process driver:

  • Backing: real RAID-1 HDD (/home, /dev/md127) and a btrfs HDD span (/data, /dev/sda3); for NFS, a loopback NFSv4.2 export (exportfs + mount -t nfs localhost:…) whose backing is tmpfs (isolates the RPC tax) or HDD (RPC + seeks).
  • Latency: tc qdisc add dev lo root netem delay <X>ms adds X per packet → ≈2X RTT per NFS RPC. Tested at 8 ms, 50 ms, and 200 ms RTT (the last ≈ a trans-Pacific server).
  • Cold reads: sync; echo 3 > /proc/sys/vm/drop_caches before each measured read — without it the page cache serves repeats and hides all backing latency.
  • Mode: synthesis, not structure-only. Structure-only triggers kernel FUSE passthrough when the process is privileged (these run as root), which serves the backing fd directly and bypasses the daemon read path — and with it every tunable that acts on that path. Synthesis splices BackingAudio reads through the daemon, the real serving path.
  • Why not the injected MUSEFS_FAULT_*_US model: it cannot show a read-ahead effect. FUSE delivers reads to the daemon in fixed ≤256 KiB chunks (max_pages, already pinned at the kernel's 1 MiB ceiling by fuser's 16 MiB default max_write), so the per-pread count — and thus any per-pread injected latency total — is independent of max_readahead.

Reproduce: benches/storage_tunables_bench.sh (needs /dev/fuse, root, and for the NFS rows nfs-kernel-server + tc). HDD numbers are noisy (±10–15%); the trends, not the digits, are the signal.

--max-readahead-kib — no benefit anywhere; hurts on HDD

Cold single-stream sequential throughput (MB/s), synthesis:

readahead KiBHDD /home (RAID1)HDD /data (btrfs)NFS 8 msNFS-on-HDD 50 msNFS-on-HDD 200 ms
512 (default)24812730.84.71.3
20481917230.64.91.3
40961538430.54.91.3
32 (probe)23775

(File sizes differ per column — 512 MiB local, 96 MiB at 50 ms, 48 MiB at 200 ms — so compare within a column, not across. The 200 ms column ≈ a trans-Pacific server: flat to the last digit.)

The window size barely moves throughput, and on HDD values ≥2048 KiB are among the slowest (peak is ~128–512 KiB). The reason is visible on NFS: 512 MiB ÷ 256 KiB × 8 ms ≈ 16 s ≈ the observed 31 MB/s — a single stream is served serially, one ≤256 KiB read at a time, each paying the full RTT, with no prefetch overlap that a larger window could exploit.

--max-background — no effect on read throughput

Wall time (s) for N concurrent cold streams over distinct tracks:

max_backgroundHDD /home (16 streams)NFS 8 ms (16)NFS-on-HDD 50 ms (80)NFS-on-HDD 200 ms (24)
64 (default)4.555.16177.8238.5
1285.055.18175.7237.4

64 ≈ 128 even with 80 > 64 streams. Expected: musefs's FuseConfig notes max_background caps background work and that "foreground reads are bounded only by client concurrency, not by this." The concurrent reads here are foreground. (Concurrency does hide latency — 16 NFS streams reach ~10× single-stream aggregate — but that is client parallelism, which max_background does not gate.)

--keep-cache — the one real win (~3×)

Cold read then immediate reopen (no cache drop between); reopen_s is the signal:

keep_cacheHDD reopen (s)NFS 8 ms reopen (s)NFS-on-HDD 50 ms reopen (s)
false0.2240.2070.039
true0.0620.0600.014

With --keep-cache the kernel retains the page cache across opens, so a re-opened file is served from RAM instead of re-fetched over slow storage — ~3× faster reopen, consistent across HDD and NFS. This is the only tunable worth changing for slow backing (relevant for players/scanners that re-open files), and it needs no preset. It is on by default as of #432 (inode invalidation on retag keeps it consistent); disable with --keep-cache false on memory-constrained hosts.

Conclusions

  • Drop the --storage-profile preset. Of the four knobs it would have set, three (max_readahead, max_background, and by extension a per-medium combination of them) show no benefit; max_readahead ≥2048 KiB actively hurts on HDD. The only justified change — enable --keep-cache on HDD/NFS — does not need an abstraction.
  • Single-stream latency hiding — addressed in #255 (next section). The serialized read path measured above (512 MiB ÷ 256 KiB × RTT) is exactly what backing read-ahead now fixes.

Backing read-ahead (#255)

Each --max-readahead-kib row above exposed the real bottleneck: a single stream is served one ≤256 KiB FUSE chunk at a time, each paying the full backing RTT, so a 200 ms-RTT NFS mount tops out at ~1.3 MB/s regardless of the kernel read-ahead window. The fix is read amplification in the daemonBackingReader coalesces a stream's small reads into one large positioned pread (geometric window growth, global RAM budget with LRU eviction), so the backing client can pipeline/parallelize the RPCs behind one syscall. A background-prefetch-threads layer ("Phase 2") was also built but is off by default (see below).

Methodology

Two harnesses. Real kernel mount (benches/storage_tunables_bench.sh): a real reader (dd) over a real FUSE mount, cold (drop_caches) each sample, median of 3. Local backing on a btrfs HDD; NFS via a loopback NFSv4.2 export plus tc netem for RTT. The corpus is real FLAC (MUSEFS_BENCH_CORPUS_SRC) — a /dev/zero corpus on a compressing fs (btrfs compress=zstd) collapses to a cached extent and never touches the platter, which silently inverts the HDD numbers; real already-compressed audio is incompressible. In-process (musefs-core/tests/bench_ingest.rs::bench_read_under_latency): the core read path over musefs-latencyfs (per-op injected latency), isolating the daemon from the kernel FUSE layer. off = --read-ahead-budget-mib 0; phase1 = the default (amplification only); phase1+2 = --read-ahead-prefetch.

Single-stream cold throughput (MB/s)

backingoffphase 1 (default)phase 1+2passthrough
local HDD (btrfs, real FLAC)~60~62~60~58
NFS, tmpfs-backed, 200 ms RTT1.27.46.89.8

On NFS read-ahead is a ~6× single-stream win (1.2 → 7.4 MB/s, 75 % of the kernel-passthrough ceiling). On a real local HDD all four configs sit within run-to-run noise (~±15 %) — read-ahead is neutral, not a regression. (An earlier /dev/zero corpus showed a spurious −35 %; it was the zstd-compression artifact above, not read-ahead.)

Concurrent streams (8 × distinct tracks, aggregate MB/s, NFS 200 ms RTT)

offphase 1 (default)phase 1+2passthrough
1.613.612.116.3

In-process, per-op latency (16 MiB Ogg whole read; wall ms / backing preads)

profileoffphase 1 (default)
ssd (80 µs/op)45 ms / 774 preads26 ms / 32 preads
nfs-ssd (600 µs/op)138 ms / 774112 ms / 32

Amplification collapses 774 backing round-trips to 32; the win scales with per-op latency and is already material at SSD speeds (1.7×).

Phase 2 is off by default

Background prefetch threads (Phase 2) never beat amplification alone and cost a consistent ~10 %: single-stream NFS 6.8 vs 7.4, concurrent NFS 12.1 vs 13.6, neutral on HDD. A single large pread already lets the NFS client pipeline its RPCs, so the threads add coordination overhead without overlap to exploit. Phase 2 is therefore opt-in (--read-ahead-prefetch), retained for hypothetical backends where one large read does not self-pipeline.

Defaults: read-ahead on at --read-ahead-budget-mib 64, Phase-1 amplification only. Set 0 to disable on local-disk-only setups (no benefit there, though no harm either).

Internal window cap on HDD (#433)

The amplification window doubles per sequential read up to WINDOW_ABS_CAP (8 MiB, musefs-core/src/readahead.rs). The #256 sweep above measured the kernel max_readahead knob — where ≥2048 KiB hurts on HDD — but never this daemon-internal cap, so #433 asked whether 8 MiB is too large for spinning media.

Methodology. WINDOW_ABS_CAP is a compile-time const, so the sweep builds one release binary per value (benches/storage_tunables_bench.sh window-cap, which patches the const in place and restores it after each build). Cold (drop_caches) single-stream reads of the same ~270 MiB real FLAC, synthesis mount, default flags (amplification on, prefetch off), real backing on a btrfs HDD (/data, 4389-track corpus). Reproduce:

WINDOW_CAP_MIB="1 2 4 8 16" MUSEFS_BENCH_CORPUS_SRC=<music-tree> MUSEFS_BENCH_CORPUS_MAX_MIB=300 \
  benches/storage_tunables_bench.sh window-cap <hdd-backing-dir>

Result: no measurable cap effect — the medium's noise dominates. Median MB/s (and the within-cap min–max over 7 cold samples) overlap across every cap, and the apparent ordering is an artifact of measurement order, not the cap: throughput drifts down through each run, so whatever runs first looks fastest. Sweeping the caps in the reverse order reverses the "trend".

cap (MiB)ascending sweep, median (min–max)descending sweep, median
1104 (65–132)46
285 (54–142)58
483 (51–104)61
861 (54–72)61
1668 (54–94)86

The current default (8 MiB) lands at ~61 MB/s in both orderings; the first-measured cap is fastest in both (104 for cap 1 ascending, 86 for cap 16 descending). The within-cap spread (≈50–140 MB/s) dwarfs every between-cap median gap. This corroborates the #256 finding that backing read-ahead is neutral on local HDD.

Decision: keep 8 MiB, no runtime knob. There is no HDD gain to capture, and the cap exists for the proven case — the ~6× single-stream amplification win on high-RTT NFS/remote, where coalescing into one large pread lets the client pipeline RPCs. Lowering the cap to chase an unmeasurable HDD effect would regress that win.


Global allocator — steady-state RSS (#360)

Long-lived high-churn FUSE load fragments glibc malloc, growing daemon RSS over days without a true leak. The musefs binary now defaults to the jemalloc global allocator with a background purge thread. Measured with scripts/rss-churn-bench.sh (Linux; median VmRSS over the flattened tail — steady state, not peak).

Parameters: WORKERS=8 (nproc), FILES=500, CYCLES=200, WARMUP=20, no REFRESH_CMD. DB = a freshly-scanned 4427-track store on tmpfs (/tmp); backing audio on /data (HDD). Concurrent cat-to-/dev/null churn drives the open/read/release handle-table and read-synthesis allocation path.

AllocatorSteady-state RSS
system malloc~74.7 MiB (76496 kiB)
jemalloc~28.7 MiB (29368 kiB)

Decision: SHIP jemalloc. Steady-state RSS is ~62% lower (jemalloc ≤ system malloc, the §4 ship rule). Under identical churn glibc retained ~46 MiB of dirty pages that jemalloc's decay + background purge return to the OS — the #360 fragmentation failure mode, reproduced and fixed. The gap is far outside run-to-run noise, so no within-noise tie-break was needed.


Scan fingerprint overhead (#464)

Bench: cargo bench -p musefs-core --bench fingerprint_overhead Corpus: 200 minimal FLAC files (~200 B metadata + 4 KiB audio) in tempfile::tempdir() (TMPDIR = tmpfs/RAM). Single-threaded scan (jobs: 1). Criterion, 20 samples.

TierMedian (ms)µs/file
None47.0235
Fingerprint107.6538

Delta: +60.6 ms / 200 files = +303 µs/file overhead (+129%).

Interpretation: The 129% overhead on this synthetic RAM-backed bench exceeds the plan's ≤15% threshold. The overhead is dominated by the extra UPDATE tracks SET fingerprint = … SQLite execution per file inside the batch transaction — not by SHA-256 hashing cost (SHA-256 of a few hundred bytes is sub-microsecond). On a real HDD-backed library the probe I/O (tens-of-ms per file) is the bottleneck, making both the hash and the DB write negligible. See plan Task E2 step 3 decision note: the decision to keep SHA-256 and add the length CHECK was escalated to the controller because the raw percentage exceeded the stated threshold, even though the absolute overhead (303 µs/file) is operationally negligible at disk I/O rates.


Scan fingerprint overhead — SSD latency profile (#464)

Bench: bench_scan_under_latency in musefs-core/tests/bench_ingest.rs (MUSEFS_BENCH_LATENCY_PROFILE=ssd). Corpus: 200 minimal FLAC files (~200 B metadata + 4 KiB audio) on a musefs-latencyfs SSD-latency FUSE mount. Default thread count (jobs: 0). 3 runs each, median reported.

TierMedian (ms)µs/file
None2221110
Fingerprint2411205

Delta: +19 ms / 200 files = +95 µs/file overhead (+8.6%).

Interpretation: Under an SSD latency profile the I/O dominates and the fingerprint overhead drops to +8.6% (+95 µs/file), well within the plan's ≤15% threshold. The RAM bench's +129% (+303 µs/file) was an artefact of RAM eliminating the I/O that would normally dwarf the extra SHA-256 hash and DB write. At real SSD rates the fingerprint cost is operationally negligible.