Systematic exploration of k-way merge input space to expose performance characteristics across all algorithm variants (LinearScan, HeapBased, LoserTree).
Key insight: Top candidates design data-driven benchmarks that systematically vary independent dimensions rather than ad-hoc “typical” cases.
For k-way merge CollatingIterator, the critical dimensions are:

- **k (number of input iterators)**: drives asymptotic complexity, O(Nk) vs O(N log k)
- **N (total element count)**: drives total work
- **Distribution (how elements are split across iterators)**: affects heap/tree balance
- **Pattern (value ordering within and across iterators)**: affects comparison patterns and cache behavior
- **Iterator exhaustion**: tests end-game behavior
| ID | k | N | Distribution | Pattern | Purpose |
|---|---|---|---|---|---|
| baseline-k2 | 2 | 10K | Uniform | Random | Minimal merge baseline |
| baseline-k3 | 3 | 10K | Uniform | Random | Classic 3-way merge |
| baseline-k5 | 5 | 10K | Uniform | Random | Small k sweet spot |
Prediction: All algorithms competitive, linear scan may win on small k due to cache locality.
| ID | k | N | Distribution | Pattern | Purpose |
|---|---|---|---|---|---|
| small-k3-uniform | 3 | 100K | Uniform | Random | Linear scan favorable |
| small-k5-uniform | 5 | 100K | Uniform | Random | Linear scan still good |
| small-k8-uniform | 8 | 100K | Uniform | Random | Crossover point? |
| small-k3-sequential | 3 | 100K | Uniform | Sequential | Cache-friendly for linear scan |
Prediction: Linear scan competitive or wins due to perfect cache locality, no tree overhead.
| ID | k | N | Distribution | Pattern | Purpose |
|---|---|---|---|---|---|
| medium-k10-uniform | 10 | 100K | Uniform | Random | Heap/tree start to win |
| medium-k50-uniform | 50 | 100K | Uniform | Random | Clear heap/tree advantage |
| medium-k100-uniform | 100 | 100K | Uniform | Random | Log k dominates |
| medium-k50-skewed | 50 | 100K | Skewed | Random | Tests unbalanced heaps |
Prediction: Heap and loser tree significantly faster than linear scan. Loser tree ~10-20% faster than heap (log k vs 2 log k comparisons).
| ID | k | N | Distribution | Pattern | Purpose |
|---|---|---|---|---|---|
| large-k100-uniform | 100 | 1M | Uniform | Random | Loser tree pulls ahead |
| large-k500-uniform | 500 | 1M | Uniform | Random | Comparison count critical |
| large-k1000-uniform | 1000 | 1M | Uniform | Random | Maximum k tested |
Prediction: Loser tree clearly fastest (half the comparisons of heap). Linear scan impractically slow.
| ID | k | N | Distribution | Pattern | Purpose |
|---|---|---|---|---|---|
| edge-k1-single | 1 | 100K | N/A | Random | Degenerate passthrough |
| edge-k10-empty | 10 | 0 | All empty | N/A | Handle empty gracefully |
| edge-k100-tiny | 100 | 1K | Uniform | Random | Overhead dominates |
| edge-k10-onesided | 10 | 100K | Single-dominant | Random | 99% in one iterator |
Prediction: for k=1, all variants are equal (pure passthrough); with all-empty inputs, all are equal (trivial); the overhead-dominated case is unpredictable; the one-sided case favors simpler algorithms.
| ID | k | N | Distribution | Pattern | Purpose |
|---|---|---|---|---|---|
| adversarial-alternating | 10 | 100K | Uniform | Alternating | Maximize comparisons |
| adversarial-clustered | 50 | 100K | Uniform | Clustered | Poor cache locality |
| adversarial-sequential-reverse | 10 | 100K | Uniform | Reverse sequential | Worst case for some algorithms |
Prediction: Adversarial cases stress algorithms differently. Alternating maximizes comparison cost (favors loser tree). Clustered stresses cache (favors linear scan’s locality).
| ID | k | N | Distribution | Pattern | Description |
|---|---|---|---|---|---|
| realistic-log-merge | 20 | 500K | Power-law | Random | Simulates merging log segments (few large, many small) |
| realistic-db-index | 50 | 1M | Uniform | Clustered | Simulates database index merge |
| realistic-timeseries | 100 | 1M | Uniform | Sequential | Simulates timeseries data merge |
Prediction: Realistic cases favor heap/loser tree. Real workloads typically have k ≥ 10.
Total: 24 test cases
Plus, JMH parameterization will cross some dimensions for a total of ~50-100 benchmark combinations.
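The cross-product can be wired up with JMH `@Param` annotations along the dimensions above. This skeleton is an illustrative sketch only (it requires the JMH dependency, and the class and field names are assumptions, not from the source):

```java
// Illustrative JMH skeleton; MergeBenchmark, setUp, and data are assumed names.
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class MergeBenchmark {
    @Param({"2", "5", "10", "50", "100", "1000"})
    int k;                      // number of input iterators

    @Param({"10000", "100000", "1000000"})
    int n;                      // total element count

    @Param({"UNIFORM", "SKEWED", "SEQUENTIAL", "CLUSTERED"})
    String pattern;             // data-generation strategy

    int[][] data;

    @Setup(Level.Trial)
    public void setUp() {
        // data = TestDataGenerator.generate(k, n, pattern, 42);  // fixed seed
    }
}
```

With these particular parameter lists the cross-product is 6 × 3 × 4 = 72 combinations, inside the stated ~50-100 window; in practice some combinations (e.g. k=1000 with N=10K) would likely be excluded.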
**Uniform distribution**

- k iterators, N total elements
- Each iterator: floor(N/k) or ceil(N/k) elements
- Values: random integers in [0, 1000000), sorted per iterator
- Seed: fixed (42) for reproducibility
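A minimal generator sketch for the uniform construction above; the class and method names (`UniformGen`, `uniform`) are illustrative, not from the source:

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of the uniform generator: N elements split as evenly as possible
// across k sorted arrays of random values in [0, 1000000).
class UniformGen {
    static int[][] uniform(int k, int n, long seed) {
        Random rnd = new Random(seed);          // fixed seed (42) for reproducibility
        int[][] out = new int[k][];
        int base = n / k, extra = n % k;        // first (n % k) iterators get one extra element
        for (int i = 0; i < k; i++) {
            int len = base + (i < extra ? 1 : 0);
            int[] a = new int[len];
            for (int j = 0; j < len; j++) a[j] = rnd.nextInt(1_000_000);
            Arrays.sort(a);                     // each iterator's input must be sorted
            out[i] = a;
        }
        return out;
    }
}
```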
**Skewed distribution**

- k iterators, N total elements
- Iterator 0: 80% of elements
- Remaining k-1 iterators: 20% of elements divided equally
- Values: random integers in [0, 1000000), sorted per iterator
- Seed: fixed (42) for reproducibility
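A sketch of the skewed generator (assumes k ≥ 2; the name `SkewedGen` is illustrative):

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of the skewed generator: iterator 0 holds 80% of the elements,
// the remaining 20% is split as evenly as possible across iterators 1..k-1.
class SkewedGen {
    static int[][] skewed(int k, int n, long seed) {
        Random rnd = new Random(seed);
        int[][] out = new int[k][];
        int big = (int) (n * 0.8);              // 80% goes to iterator 0
        int rest = n - big;                      // remaining 20% for k-1 iterators
        for (int i = 0; i < k; i++) {
            int len = (i == 0)
                    ? big
                    : rest / (k - 1) + (i <= rest % (k - 1) ? 1 : 0);
            int[] a = new int[len];
            for (int j = 0; j < len; j++) a[j] = rnd.nextInt(1_000_000);
            Arrays.sort(a);
            out[i] = a;
        }
        return out;
    }
}
```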
**Sequential pattern**

- k iterators, N total elements
- Iterator i gets values: i, i+k, i+2k, i+3k, ...
- Already sorted, no randomness needed
- Example (k=3, N=12): iter[0] = [0, 3, 6, 9], iter[1] = [1, 4, 7, 10], iter[2] = [2, 5, 8, 11]
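The round-robin construction is deterministic and needs no sorting. A sketch (`SequentialGen` is an illustrative name):

```java
// Sketch of the sequential pattern: iterator i holds i, i+k, i+2k, ...
// so every merge step draws from a different iterator.
class SequentialGen {
    static int[][] sequential(int k, int n) {
        int[][] out = new int[k][];
        for (int i = 0; i < k; i++) {
            int len = n / k + (i < n % k ? 1 : 0);  // iterator i gets every k-th value from i
            int[] a = new int[len];
            for (int j = 0; j < len; j++) a[j] = i + j * k;
            out[i] = a;                              // already sorted by construction
        }
        return out;
    }
}
```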
**Alternating pattern**

- Same as sequential, but designed to maximize comparisons
- Forces the algorithm to compare across all k iterators at each step
**Clustered pattern**

- k iterators, N total elements
- Divide the value space [0, 1000000) into k ranges
- Iterator i gets random values from range i only
- Results in poor cache locality when iterators interleave
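A sketch of the clustered generator under the range-partition scheme above (the name `ClusteredGen` is illustrative; for simplicity it assumes k divides the value space evenly enough):

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of the clustered pattern: the value space [0, 1000000) is cut into
// k equal bands, and iterator i draws only from band i.
class ClusteredGen {
    static int[][] clustered(int k, int n, long seed) {
        Random rnd = new Random(seed);
        int band = 1_000_000 / k;               // width of each iterator's value band
        int[][] out = new int[k][];
        for (int i = 0; i < k; i++) {
            int len = n / k + (i < n % k ? 1 : 0);
            int[] a = new int[len];
            int lo = i * band;
            for (int j = 0; j < len; j++) a[j] = lo + rnd.nextInt(band); // confined to band i
            Arrays.sort(a);
            out[i] = a;
        }
        return out;
    }
}
```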
Based on Stage 3 theoretical analysis:
| Algorithm | Small k (≤8) | Medium k (10-100) | Large k (≥100) |
|---|---|---|---|
| LinearScan | Best (cache locality) | 5-10× slower | 50-100× slower |
| HeapBased | Good | Best or tied | Good |
| LoserTree | Good | Best or tied | Best (2× faster than heap) |
Crossover points to identify:

1. When does linear scan lose to heap? (Expected: k=8-10)
2. When does loser tree beat heap? (Expected: k=50-100)
3. Do cache effects ever make linear scan win for k>8? (Investigate)
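One way to probe these crossovers empirically, independent of wall-clock noise, is to count comparisons directly. This is a hedged sketch of a linear-scan merge with a comparison counter (`CrossoverProbe` and `linearScanMerge` are illustrative stand-ins, not the actual variant implementations):

```java
// Sketch: a linear-scan k-way merge that counts element comparisons.
// Running the same inputs through heap and loser-tree variants with the same
// counter would expose the O(Nk) vs O(N log k) crossover directly.
class CrossoverProbe {
    static long comparisons;                    // reset before each measurement

    static int[] linearScanMerge(int[][] inputs) {
        int k = inputs.length, total = 0;
        int[] pos = new int[k];                 // next-unconsumed index per iterator
        for (int[] a : inputs) total += a.length;
        int[] out = new int[total];
        for (int o = 0; o < total; o++) {
            int best = -1;
            for (int i = 0; i < k; i++) {       // scan all k heads each step: O(Nk) overall
                if (pos[i] == inputs[i].length) continue;
                if (best == -1) { best = i; continue; }
                comparisons++;
                if (inputs[i][pos[i]] < inputs[best][pos[best]]) best = i;
            }
            out[o] = inputs[best][pos[best]++];
        }
        return out;
    }
}
```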
Benchmarks validate Stage 3 predictions if:

1. ✓ Linear scan competitive for k ≤ 8
2. ✓ Heap/loser tree 5-10× faster than linear scan for k=50
3. ✓ Loser tree 1.5-2× faster than heap for k ≥ 100
4. ✓ Results within ±30% of theoretical predictions (accounting for constant factors)
- TestDataGenerator.java with all patterns
- @Param for dimensions
- performance_interpretation skill
- reporting_visualization skill