October 2025
From Stage 1 and Stage 2A: “Which algorithms (if any) achieve the \(\Omega(N \log k)\) lower bound?”
We know from Stage 2A that \(\Omega(N \log k)\) comparisons are required. Now we explore which algorithms achieve this bound.
Sources consulted:
Knuth TAOCP Vol 3 §5.4.1: K-way merge using heap, tournament tree, loser tree
CLRS Ch 6, Problem 6-2: Binary heaps, d-ary heap variants
Wikipedia: K-way merge algorithm overview
Grafana Labs blog (2024): Practical loser tree implementations
Frigo et al. (1999): Cache-oblivious funnelsort
WebSearch: Modern surveys and Stack Overflow discussions
Algorithms discovered in literature:
Linear scan (naive baseline)
Binary min-heap / priority queue
Winner tournament tree
Loser tournament tree (Knuth's preferred variant; used in production)
D-ary heap (d=4 common in practice)
Pairwise merge (divide-and-conquer)
Collect-and-sort (ignores pre-sorted property)
Cache-oblivious funnelsort
Without assuming the answer, we analyze eight natural approaches:
Source: First principles (naive baseline)
next():
\(\text{minVal} \gets \infty\); \(\text{minIdx} \gets -1\)
for each iterator \(i\): \(\text{val} \gets \text{iterators}[i].\texttt{peek()}\); if \(\text{val} < \text{minVal}\): \(\text{minVal} \gets \text{val}\), \(\text{minIdx} \gets i\)
\(\text{iterators}[\text{minIdx}].\texttt{next()}\)
return \(\text{minVal}\)
Time Complexity:
Per next(): \(O(k)\) (scan all iterators)
Total: \(O(Nk)\) for \(N\) elements
Space Complexity: \(O(k)\) (iterator references only)
Comparison to Lower Bound: \[O(Nk) \gg \Omega(N \log k) \quad \text{since } \log k \ll k \text{ for large } k\]
Achieves bound? NO - Asymptotically sub-optimal whenever \(k\) grows beyond a constant
When Competitive: For small \(k\) (say \(k \leq 8\)), the simplicity and cache locality may make this competitive despite poor asymptotics.
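The scan above can be sketched directly in Python. The generator name and the use of `next(it, None)` in place of the peek()/next() interface are illustrative choices, not from the sources consulted:

```python
from typing import Iterator, List, Optional

def linear_scan_merge(iterators: List[Iterator[int]]) -> Iterator[int]:
    """Lazy k-way merge by scanning every iterator head: O(k) per element."""
    heads: List[Optional[int]] = [next(it, None) for it in iterators]
    while True:
        min_idx = -1
        for i, v in enumerate(heads):  # the O(k) scan over all heads
            if v is not None and (min_idx == -1 or v < heads[min_idx]):
                min_idx = i
        if min_idx == -1:
            return  # every iterator exhausted
        yield heads[min_idx]
        heads[min_idx] = next(iterators[min_idx], None)  # refill that slot only
```

Note that the per-element work is a tight branch-predictable loop over a small array, which is why this stays competitive for small \(k\).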
Source: Standard (CLRS Ch 6, TAOCP Vol 3 §5.4.1)
Maintain a priority queue (min-heap) containing:
One element from each non-empty iterator
Each entry: (element value, source iterator)
Operations:
extractMin(): Remove minimum from queue
Refill: Advance source iterator, insert next element
Initialize:
\(H \gets \text{empty min-heap}\)
for each non-empty iterator \(I_i\): \(H.\texttt{insert}((I_i.\texttt{next}(), I_i))\)
next():
\((val, iter) \gets H.\texttt{extractMin}()\)
if \(iter\) is non-empty: \(H.\texttt{insert}((iter.\texttt{next}(), iter))\)
return \(val\)
Heap Operations:
insert(): \(O(\log k)\) (sift-up)
extractMin(): \(O(\log k)\) (sift-down)
Heap size: At most \(k\) elements
Time Complexity:
Initialization: \(O(k \log k)\) (or \(O(k)\) with heapify)
Per next(): \(O(\log k)\)
Total: \(O(k \log k + N \log k) = O(N \log k)\)
Space Complexity: \(O(k)\) (heap storage)
Comparison to Lower Bound: \(O(N \log k) = \Omega(N \log k)\)
Achieves bound? YES ✓ - Asymptotically optimal!
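A minimal Python sketch of this scheme (the generator name is illustrative; tie-breaking by source index is our choice to keep tuple comparisons well defined). Python's standard library implements the same strategy as `heapq.merge`:

```python
import heapq
from typing import Iterator, List

def heap_merge(iterators: List[Iterator[int]]) -> Iterator[int]:
    """Lazy k-way merge with a min-heap: O(k) init, O(log k) per next()."""
    heap = []
    for i, it in enumerate(iterators):
        first = next(it, None)
        if first is not None:
            heap.append((first, i))  # (value, source index); index breaks ties
    heapq.heapify(heap)              # O(k) initialization
    while heap:
        val, i = heapq.heappop(heap)         # extractMin: O(log k)
        yield val
        nxt = next(iterators[i], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))   # refill from the same source
```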
Invariant 1 (Heap Invariant). At the start of each next() call:
\(H\) contains at most one element from each non-exhausted iterator
Each element in \(H\) is the next unconsumed element from its source iterator
All elements output so far are \(\leq\) all elements in \(H\)
Correctness Proof. Initialization: Insert first element from each iterator. By definition, first elements are minimums of their iterators. Invariant holds.
Maintenance:
Extract minimum \(v\) from \(H\)
By heap property: \(v \leq\) all other elements in \(H\)
By invariant (2): Elements in \(H\) are minimums of their iterators
Therefore: \(v \leq\) all remaining elements globally
Refill from source iterator: New element is next minimum from that iterator
Invariant maintained
Termination: \(H\) becomes empty iff all iterators are exhausted. All \(N\) elements are extracted in sorted order. \(\square\)
Source: TAOCP Vol 3 §5.4.1 “tournament sort”
Build a complete binary tree with:
\(k\) leaves (one per iterator)
Each internal node: minimum of its children (“winner”)
Root: global minimum
Time Complexity:
Per next(): Extract root, refill leaf, propagate up
Propagation cost: \(O(\log k)\) (tree height)
Total: \(O(N \log k)\)
Space Complexity: \(O(k)\) (tree nodes)
Comparison to Lower Bound: \(O(N \log k) = \Omega(N \log k)\)
Achieves bound? YES ✓ - Asymptotically optimal!
Comparison to Heap: Same asymptotic complexity. Constant factors differ (see Stage 3).
Source: TAOCP Vol 3 §5.4.1 (Knuth’s preferred variant), Grafana Labs production use (2024)
A variant of tournament tree where:
Internal nodes store the loser of each comparison
Winner propagates up to next level
Root pointer indicates overall winner
Key difference from winner tree: During refill, only need to compare against losers on path to root, not recompute entire subtree.
Time Complexity:
Per next(): Read winner, refill leaf, compare against losers up the path
Comparisons: Exactly \(\lceil \log_2 k \rceil\) (one per level)
Total: \(O(N \log k)\)
Space Complexity: \(O(k)\) (tree nodes + root pointer)
Comparison to Lower Bound: \(O(N \log k) = \Omega(N \log k)\)
Achieves bound? YES ✓ - Asymptotically optimal!
Advantages over Winner Tree:
Simpler refill logic (only compare with losers, not siblings)
Fewer branches in code
Better for expensive comparison operations
Used in production (Grafana Loki log aggregation system)
Disadvantages:
More complex to implement than binary heap
Pointer-based structure (worse cache behavior than heap)
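For concreteness, here is a compact loser-tree sketch (class and method names are illustrative; this version is array-backed rather than pointer-based, and it assumes finite numeric elements so `float("inf")` can serve as the exhaustion sentinel):

```python
from typing import Iterator, List

INF = float("inf")  # sentinel; assumes real elements are finite numbers

class LoserTree:
    """Array-based tree of losers: one comparison per level on each refill."""

    def __init__(self, iterators: List[Iterator[float]]):
        self.its = iterators
        self.k = len(iterators)
        self.heads = [next(it, INF) for it in iterators]  # current head per source
        self.tree = [-1] * self.k  # tree[1..k-1]: loser indices; tree[0]: winner
        for i in range(self.k):
            self._replay(i)

    def _replay(self, i: int) -> None:
        """Push leaf i's new value toward the root, swapping with stored losers."""
        node, winner = (i + self.k) // 2, i
        while node > 0:
            if self.tree[node] == -1:        # empty slot: only during the build
                self.tree[node] = winner
                return
            if self.heads[self.tree[node]] < self.heads[winner]:
                winner, self.tree[node] = self.tree[node], winner
            node //= 2
        self.tree[0] = winner

    def next(self):
        w = self.tree[0]
        if self.heads[w] == INF:
            return None                      # all sources exhausted
        val = self.heads[w]
        self.heads[w] = next(self.its[w], INF)  # refill the winning leaf
        self._replay(w)
        return val
```

The refill path visits only the \(\lceil \log_2 k \rceil\) nodes between the winning leaf and the root, comparing against each stored loser once.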
Source: CLRS Problem 6-2, generalization of binary heap
Generalization of binary heap where each node has \(d\) children instead of 2.
Operations:
extractMin(): \((d-1)\log_d k\) comparisons (\(d-1\) comparisons to find the smallest of \(d\) children per level)
Tree height: \(\log_d k\)
Time Complexity:
Per next(): \(O(d \log_d k) = O(d \frac{\log k}{\log d})\)
For fixed \(d\): \(O(\log k)\)
Total: \(O(N \log k)\)
Space Complexity: \(O(k)\) (array storage)
Trade-off:
Larger \(d\) \(\rightarrow\) shallower tree (fewer levels)
But: more comparisons per level (\(d-1\) instead of 2)
Optimal \(d\) depends on cache behavior and comparison cost
Practical Choice: \(d=4\) commonly used (.NET PriorityQueue uses quaternary heap)
Achieves bound? YES ✓ - Still \(O(N \log k)\)
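A sketch of the d-ary sift-down that drives extractMin (the function name and the `d=4` default are illustrative):

```python
from typing import List

def sift_down(heap: List[int], i: int, d: int = 4) -> None:
    """Restore the min-heap property below index i in an array-backed d-ary heap.

    Each level costs up to d-1 comparisons to pick the smallest child plus one
    against the current node, but the tree is only log_d(n) levels deep.
    """
    n = len(heap)
    while True:
        first = d * i + 1          # children of i occupy [d*i+1, d*i+d]
        if first >= n:
            return                 # i is a leaf
        smallest = first
        for c in range(first + 1, min(first + d, n)):
            if heap[c] < heap[smallest]:
                smallest = c       # d-1 comparisons among the children
        if heap[smallest] < heap[i]:
            heap[i], heap[smallest] = heap[smallest], heap[i]
            i = smallest           # continue one level down
        else:
            return
```

Because children sit contiguously in the array, one level's comparisons typically touch a single cache line for small \(d\), which is the cache argument behind \(d=4\).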
Source: Divide-and-conquer mergesort pattern
Recursively merge iterators in pairs: \[((I_0 \oplus I_1) \oplus (I_2 \oplus I_3)) \oplus \ldots\]
Problem: Violates lazy evaluation requirement!
Pairwise merging forces materialization of intermediate results. For example, \((I_0 \oplus I_1)\) must complete before merging with \((I_2 \oplus I_3)\).
Time Complexity: \(O(N \log k)\) if we materialize intermediates.
Space Complexity: \(O(N)\) (must store intermediate merged sequences).
Achieves bound? NO - Violates problem constraints (lazy evaluation). Space requirement \(O(N) \gg O(k)\).
When Useful: Parallel merging (divide-and-conquer parallelizes naturally). Different problem domain.
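The materialization problem is visible in a direct sketch (function names illustrative): each round builds whole intermediate lists before the next round can begin, so nothing is emitted lazily and peak space is \(O(N)\).

```python
from typing import List

def merge_two(a: List[int], b: List[int]) -> List[int]:
    """Standard two-way merge; materializes its entire output."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def pairwise_merge(runs: List[List[int]]) -> List[int]:
    """Merge adjacent pairs round by round: O(N log k) time, O(N) space."""
    while len(runs) > 1:
        nxt = [merge_two(runs[i], runs[i + 1])
               for i in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:
            nxt.append(runs[-1])  # odd run carries over to the next round
        runs = nxt
    return runs[0] if runs else []
```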
Source: First principles (ignores pre-sorted property)
Consume all \(k\) iterators into array \(A\)
Sort \(A\)
Return a sequential iterator over \(A\)
Time Complexity: \(O(N \log N)\)
Space Complexity: \(O(N)\) (must store all elements)
Comparison to Lower Bound: \[O(N \log N) \gg \Omega(N \log k) \quad \text{when } k \ll N\]
Achieves bound? NO - Sub-optimal. Doesn’t exploit pre-sorted property of inputs. Violates lazy evaluation and space constraints.
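A one-line sketch makes the contrast with the lazy approaches clear (function name illustrative):

```python
from itertools import chain
from typing import Iterator, List

def collect_and_sort(iterators: List[Iterator[int]]) -> Iterator[int]:
    """Drain every input eagerly, then sort: O(N log N) time, O(N) space.

    No output is available until all inputs are fully consumed, and the
    pre-sorted structure of each input is thrown away.
    """
    return iter(sorted(chain.from_iterable(iterators)))
```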
Source: Frigo, Leiserson, Prokop, and Ramachandran (1999)
A cache-oblivious sorting algorithm designed for external memory:
Recursive k-way merge with “funnels”
Adapts to memory hierarchy without knowing cache parameters
Optimal I/O complexity in external memory model
Time Complexity: \(O(N \log N)\) comparisons (for full sort, not just k-way merge)
I/O Complexity: \(O(\frac{N}{B} \log_{M/B} \frac{N}{B})\) where \(B\) = block size, \(M\) = memory size
Relevance to Our Problem:
Designed for full sorting, not k-way merge of pre-sorted sequences
Cache-oblivious property interesting but not directly applicable
Complexity doesn’t improve on simple k-way merge for our constraints
Achieves bound? NO - Different problem domain (external memory sorting vs lazy k-way merge)
Note: Interesting for Stage 7 related work discussion
| Algorithm | Time | Space | Lazy? | Optimal? | Source |
|---|---|---|---|---|---|
| Linear Scan | \(O(Nk)\) | \(O(k)\) | Yes | No | Baseline |
| Binary Heap | \(O(N \log k)\) | \(O(k)\) | Yes | Yes | CLRS/TAOCP |
| Winner Tree | \(O(N \log k)\) | \(O(k)\) | Yes | Yes | TAOCP §5.4.1 |
| Loser Tree | \(O(N \log k)\) | \(O(k)\) | Yes | Yes | TAOCP §5.4.1 |
| D-ary Heap | \(O(N \log k)\) | \(O(k)\) | Yes | Yes | CLRS Prob 6-2 |
| Pairwise | \(O(N \log k)\) | \(O(N)\) | No | No | Divide-conquer |
| Collect-Sort | \(O(N \log N)\) | \(O(N)\) | No | No | Naive |
| Funnelsort | \(O(N \log N)\) | Varies | No | No | Frigo et al. |
For the four optimal candidates:
| Algorithm | Comparisons per next() | Memory Pattern |
|---|---|---|
| Binary Heap | \(2 \log_2 k\) | Array (excellent cache) |
| Winner Tree | \(\log_2 k\) | Pointer-based (poor cache) |
| Loser Tree | \(\log_2 k\) | Pointer-based (poor cache) |
| D-ary Heap (d=4) | \(3 \log_4 k \approx 1.5 \log_2 k\) | Array (excellent cache) |
Observations:
Winner/loser trees: Fewer comparisons but worse memory access patterns
Heaps: More comparisons but better cache locality
Modern processors: Memory access often dominates comparison cost
D-ary heap: Best of both worlds? (fewer comparisons + array storage)
Discovery: Four algorithm families achieve \(\Omega(N \log k)\) lower bound:
Binary min-heap (standard, simple, good cache behavior)
Winner tournament tree (fewer comparisons, pointer-based)
Loser tournament tree (Knuth's preferred variant, production-proven, simpler refill)
D-ary heap (trade-off parameter \(d\), d=4 common)
Linear scan competitive for \(k \leq 8\) despite poor asymptotics.
Open Question: Which is better in practice? (heap vs loser tree vs d-ary heap)
This constant factor analysis is deferred to Stage 3: Design Selection.