Collating Iterator: Formal Specification

Research Artifact

October 2025

Problem Statement

Design and implement CollatingIterator<T>, an iterator that merges \(k\) sorted input iterators into a single sorted output stream, where \(T\) implements Comparable<T>.

Inputs

Type: \(\texttt{Collection<Iterator<T>>}\) where \(T \texttt{ extends Comparable<T>}\)
Constraints:
- Each input iterator yields elements in non-decreasing order per compareTo()
- \(k \geq 0\) iterators (empty collection yields empty output)
- Each iterator is single-pass, forward-only
- No iterator in collection is null

Outputs

Type: \(\texttt{Iterator<T>}\) implementing standard Java Iterator interface
Guarantees:
- Elements yielded in globally non-decreasing order per compareTo()
- All elements from all input iterators appear exactly once in output
- Lazy evaluation: no element retrieved until next() called
- Output exhausted \(\Leftrightarrow\) all input iterators exhausted

Invariants

Let \(I = \{I_0, I_1, \ldots, I_{k-1}\}\) be the set of input iterators and \(O\) be the sequence of elements yielded so far.

Sorted Output Prefix: Elements yielded so far form non-decreasing sequence \[\forall i < j \text{ where both in } O: O[i] \leq O[j]\]
Conservation: Total elements yielded equals elements consumed from inputs \[|O| = \sum_{i=0}^{k-1} |\texttt{consumed}(I_i)|\]
Completeness: Output exhausted if and only if all inputs exhausted \[\neg \texttt{hasNext()} \Leftrightarrow \forall i \in [0, k): \neg I_i.\texttt{hasNext()}\]
Uniqueness: Each element from each input appears exactly once in output \[\texttt{multiset}(O \cup \texttt{remaining}) = \bigcup_{i=0}^{k-1} \texttt{multiset}(I_i)\] where \(\texttt{remaining}\) is the multiset of elements not yet yielded.
Laziness: No element consumed from any input until output element requested \[\texttt{consumed}(I_i) \text{ monotonically increases only when } \texttt{next()} \text{ called}\]

Assumptions

Total Order: compareTo() defines a total order (transitive, antisymmetric, complete)
Consistency: compareTo() is consistent with equals() (though duplicates across streams allowed)
No Exceptions: Input iterators do not throw exceptions during normal traversal
No Concurrency: No concurrent modification of underlying collections during iteration
Idempotent hasNext(): Multiple calls to hasNext() without intervening next() return same result
Non-null Elements: \(T\) elements are non-null (or nulls handled consistently by comparator)

Contract

Preconditions

Each iterator \(I_i\) is sorted: \(\forall j < j': I_i[j] \leq I_i[j']\)
No iterator in input collection is null
Input collection itself is non-null
Elements are mutually comparable (no ClassCastException)

Postconditions

Output contains exactly the multiset union of all input elements: \[\texttt{multiset}(\texttt{output}) = \bigcup_{i=0}^{k-1} \texttt{multiset}(I_i)\]
Output is sorted: \(\forall j < j': \texttt{output}[j] \leq \texttt{output}[j']\)
All input iterators exhausted when output exhausted: \[\neg \texttt{hasNext()} \Leftrightarrow \forall i: \neg I_i.\texttt{hasNext()}\]

Exceptions

NullPointerException: if input collection is null or contains null iterator
ClassCastException: if elements are not mutually comparable
NoSuchElementException: if next() called when hasNext() returns false
UnsupportedOperationException: if remove() called (optional operation)

Research Questions

This specification defines the problem. The following questions guide the analysis and design phases:

Theoretical Complexity

Lower Bound: What is the minimum number of comparisons required for any comparison-based k-way merge algorithm?
Time Complexity: What is the optimal asymptotic time complexity achievable? Express in terms of \(N = \sum_{i=0}^{k-1} |I_i|\) (total elements) and \(k\) (number of iterators).
Space Complexity: What is the minimum auxiliary space required? Can we achieve better than \(O(N)\)?
Per-Operation Costs: What are the costs of hasNext() and next() in an optimal implementation?

Design Alternatives

What data structures can support k-way merge?
What are the trade-offs between alternatives (comparisons, memory access patterns, implementation complexity)?
Are there crossover points where one approach becomes preferable (e.g., based on \(k\) size)?

Constant Factors (“Can We Do Better?”)

Given an asymptotically optimal algorithm:

How many comparisons per element in practice (not just big-O)?
What are the cache behavior characteristics?
What are the constant factor differences between theoretically equivalent approaches?
Can we improve constants without changing asymptotic complexity?

Implementation Considerations

How do constant factors vary across languages/platforms (Java, C++, Rust)?
What are the practical performance characteristics for typical \(k\) values (2, 10, 100, 1000)?
When do naive approaches (e.g., linear scan) outperform sophisticated algorithms?

Goal

The analysis phase should answer these questions rigorously, comparing alternatives and justifying design choices with both asymptotic analysis and empirical measurement.