LeetCode Reference: LeetCode 295 — Find Median from Data Stream (Hard) — An intuitive, ground-up guide for the rest of us.

When designing systems or solving algorithmic challenges, we often encounter scenarios where we need to find the median of a dynamically growing dataset. A live analytics dashboard tracking request latencies, system loads, or financial transactions is a classic example: data points arrive continuously, and we must report the median latency or median price at any moment.

In this guide, we will build the optimal solution for finding the median of a data stream from the ground up, starting from simple approaches and refining our strategy step-by-step to arrive at the optimal dual-heap pattern.


Understanding the Problem

The objective is to implement a data structure or algorithm that supports two operations:

  1. Add: Insert a new integer from the data stream.
  2. Find Median: Return the median of all elements seen so far.

The median is the middle value in an ordered list of numbers.

  • If the number of elements is odd, the median is the exact middle element.
  • If the number of elements is even, the median is the average of the two middle elements.

For example, given:

  • Initial numbers loaded: [5, 2, 8]
  • Incoming stream additions: [3, 10, 4]

After each addition, we must report the median of the cumulative dataset:

  1. Add 3 $\to$ dataset is [5, 2, 8, 3]. Sorted: [2, 3, 5, 8]. Median = (3 + 5) / 2 = 4.0.
  2. Add 10 $\to$ dataset is [5, 2, 8, 3, 10]. Sorted: [2, 3, 5, 8, 10]. Median = 5.
  3. Add 4 $\to$ dataset is [5, 2, 8, 3, 10, 4]. Sorted: [2, 3, 4, 5, 8, 10]. Median = (4 + 5) / 2 = 4.5.

Resulting medians: [4.0, 5.0, 4.5].


Step 1: The Brute Force Approach (Sort on Demand)

The most direct way to solve this is to collect all incoming numbers in a standard dynamic array (a list in Python) and sort the list every time a median is requested.

class BruteForceMedianFinder:
    def __init__(self):
        self.store = []

    def addNum(self, num: int) -> None:
        self.store.append(num)

    def findMedian(self) -> float:
        self.store.sort()
        n = len(self.store)
        if n % 2 == 1:
            return float(self.store[n // 2])
        else:
            return (self.store[n // 2 - 1] + self.store[n // 2]) / 2.0

Complexity

  • Add: $O(1)$ time. Appending to a dynamic array is amortized constant time.
  • Find Median: $O(N \log N)$ time. Sorting an array of size $N$ takes $O(N \log N)$ operations.
  • Space: $O(N)$ to store the numbers.

Why it Fails

If our stream contains $100,000$ numbers and we query the median after each number, sorting a list of that size repeatedly will cause a Time Limit Exceeded (TLE) error. We are doing redundant work by discarding the sorted order of the previous step and re-sorting everything.


Step 2: Binary Insertion (Keep the Array Sorted)

We can optimize the lookup time by keeping the array sorted as we insert numbers. Instead of sorting from scratch when finding the median, we search for the correct index for each incoming number and insert it in-place.

In Python, the bisect module implements binary search operations, and bisect.insort inserts an item into a sorted list while maintaining its sorted order.

import bisect

class SortedArrayMedianFinder:
    def __init__(self):
        self.store = []

    def addNum(self, num: int) -> None:
        # Binary search finds insertion point in O(log N)
        # Insertion in array takes O(N) due to element shifting
        bisect.insort(self.store, num)

    def findMedian(self) -> float:
        n = len(self.store)
        if n % 2 == 1:
            return float(self.store[n // 2])
        else:
            return (self.store[n // 2 - 1] + self.store[n // 2]) / 2.0

Complexity

  • Add: $O(N)$ time. Finding the correct index takes $O(\log N)$ time via binary search, but inserting an element into the middle of a dynamic array requires shifting all subsequent elements, which takes $O(N)$ time.
  • Find Median: $O(1)$ time. Since the array is always sorted, we access the middle elements directly by index.
  • Space: $O(N)$.

The Bottleneck

While we reduced the median lookup to $O(1)$, the addition operation is still linear ($O(N)$). For a stream of $N$ numbers, the total time complexity remains $O(N^2)$, which is still too slow for large data streams.


Step 3: The Intuition Shift (Divide and Conquer)

To achieve logarithmic insertion ($O(\log N)$) and constant median lookup ($O(1)$), we must realize that we do not need the entire array to be sorted.

We only need to know:

  1. The largest element of the smaller half of the dataset.
  2. The smallest element of the larger half of the dataset.

If we partition our data stream into two halves:

  • Left (Smaller Half): Contains the smaller half of all numbers.
  • Right (Larger Half): Contains the larger half of all numbers.

We can fetch the median by querying the boundary values between the two halves:

  • If the total count is odd, the median is the root of the larger half (or smaller half, depending on which holds the extra element).
  • If the total count is even, the median is the average of the largest value in the Left half and the smallest value in the Right half.
       [Smaller Half (Left)]              [Larger Half (Right)]
     [ 2, 3 ]   <-- max is 3           min is 5 -->   [ 5, 8 ]
                    \                               /
                     \---- Median is (3 + 5)/2 ----/

To implement this efficiently:

  • The Left half needs to quickly give us its maximum value. This is a Max-Heap.
  • The Right half needs to quickly give us its minimum value. This is a Min-Heap.

Both heaps support insertion in $O(\log N)$ time and retrieval of their root element in $O(1)$ time.


Step 4: Implementing the Invariant Rules

To ensure our structure yields the correct median at all times, we must maintain two invariants during insertions:

  1. Ordering Invariant: Every element in the Left (Max-Heap) must be less than or equal to every element in the Right (Min-Heap). $$\max(\text{Left}) \le \min(\text{Right})$$
  2. Size Invariant: The sizes of the two heaps must differ by at most 1. $$|\text{len(Left)} - \text{len(Right)}| \le 1$$

Insertion Workflow

When a new number num arrives:

  1. Route: Compare num with the maximum of the Left half.
    • If num <= max(Left), it belongs in the Left half.
    • Otherwise, it belongs in the Right half.
  2. Rebalance: If the size of one heap exceeds the other by more than 1, pop the root of the larger heap and push it into the other heap to restore balance.

Python Implementation: The Negation Trick (Standard / Interview-Safe)

Python’s standard heapq module only implements Min-Heaps. To implement a Max-Heap, we negate the values when inserting them and negate them again when retrieving them.

Below is the implementation that solves the stream variation of the problem. It accepts an initial list of numbers, loads them, and then processes a stream of additions, calculating the median after each addition.

import heapq

class Solution:
    def medianStream(self, nums: list[int], adds: list[int]) -> list[float]:
        # smaller_maxh stores negative values to simulate a Max-Heap
        smaller_maxh = []
        larger_minh = []

        def push_to_heaps(num: int):
            # 1. Decide which half to insert into
            if not smaller_maxh or num <= -smaller_maxh[0]:
                heapq.heappush(smaller_maxh, -num)
            else:
                heapq.heappush(larger_minh, num)
            
            # 2. Rebalance sizes so they differ by at most 1
            if len(smaller_maxh) > len(larger_minh) + 1:
                val = -heapq.heappop(smaller_maxh)
                heapq.heappush(larger_minh, val)
            elif len(larger_minh) > len(smaller_maxh) + 1:
                val = heapq.heappop(larger_minh)
                heapq.heappush(smaller_maxh, -val)

        # Pre-load starting elements (no median recorded yet)
        for num in nums:
            push_to_heaps(num)

        result = []
        # Process streaming additions
        for num in adds:
            push_to_heaps(num)

            # Calculate and append current median
            if len(smaller_maxh) > len(larger_minh):
                result.append(float(-smaller_maxh[0]))
            elif len(larger_minh) > len(smaller_maxh):
                result.append(float(larger_minh[0]))
            else:
                median = (-smaller_maxh[0] + larger_minh[0]) / 2.0
                result.append(median)

        return result

Modern Python 3.14+ Implementation (Native Max-Heap API)

In Python 3.14, the heapq module introduces native max-heap support via heappush_max and heappop_max. This eliminates the need for the negation trick, making the code much easier to read.

from heapq import heappush, heappop, heappush_max, heappop_max

class Solution314:
    def medianStream(self, nums: list[int], adds: list[int]) -> list[float]:
        smaller_maxh = []  # Managed as max-heap natively
        larger_minh = []   # Managed as min-heap

        def fix_heapq():
            if len(smaller_maxh) > len(larger_minh) + 1:
                heappush(larger_minh, heappop_max(smaller_maxh))
            elif len(larger_minh) > len(smaller_maxh) + 1:
                heappush_max(smaller_maxh, heappop(larger_minh))

        def is_to_smaller(num: int) -> bool:
            if not smaller_maxh:
                return True
            # smaller_maxh[0] is natively the maximum element
            return num <= smaller_maxh[0]

        def push_to_heaps(num: int):
            if is_to_smaller(num):
                heappush_max(smaller_maxh, num)
            else:
                heappush(larger_minh, num)
            fix_heapq()

        for num in nums:
            push_to_heaps(num)

        result = []
        for num in adds:
            push_to_heaps(num)

            if len(smaller_maxh) > len(larger_minh):
                result.append(float(smaller_maxh[0]))
            elif len(larger_minh) > len(smaller_maxh):
                result.append(float(larger_minh[0]))
            else:
                result.append((smaller_maxh[0] + larger_minh[0]) / 2.0)

        return result

Note: In interviews, it is recommended to use the negation trick unless you are certain the evaluation environment runs Python 3.14 or later.


Step-by-Step Execution Walkthrough

Let us trace how the dual-heap state transitions with the example input:

  • Starting array: nums = [5, 2, 8]
  • Stream additions: adds = [3, 10, 4]

Phase 1: Pre-loading nums

  1. Insert 5:
    • smaller_maxh is empty. Push -5 into smaller_maxh.
    • Heaps: smaller_maxh = [-5] (real: [5]), larger_minh = [].
  2. Insert 2:
    • 2 <= 5 (max of Left). Push -2 into smaller_maxh.
    • Heaps: smaller_maxh = [-5, -2] (real: [5, 2]), larger_minh = [].
    • Rebalance: len(Left) = 2, len(Right) = 0. Since 2 > 0 + 1, we pop from Left (-5) and push to Right (5).
    • Heaps: smaller_maxh = [-2] (real: [2]), larger_minh = [5].
  3. Insert 8:
    • 8 > 2 (max of Left). Push 8 into larger_minh.
    • Heaps: smaller_maxh = [-2] (real: [2]), larger_minh = [5, 8].
    • Rebalance: len(Right) = 2, len(Left) = 1. Balanced.

Phase 2: Processing adds and reporting medians

  1. Stream Add 3:
    • 3 > 2 (max of Left). Push 3 into larger_minh.
    • Heaps: smaller_maxh = [-2], larger_minh = [3, 8, 5].
    • Rebalance: len(Right) = 3, len(Left) = 1. Since 3 > 1 + 1, we pop from Right (3) and push to Left (-3).
    • Heaps: smaller_maxh = [-3, -2] (real: [3, 2]), larger_minh = [5, 8].
    • Median: Sizes are equal ($2$ and $2$). Median is (3 + 5) / 2.0 = 4.0.
  2. Stream Add 10:
    • 10 > 3 (max of Left). Push 10 into larger_minh.
    • Heaps: smaller_maxh = [-3, -2], larger_minh = [5, 8, 10].
    • Rebalance: len(Right) = 3, len(Left) = 2. Balanced.
    • Median: Right heap is larger. Median is larger_minh[0] = 5.0.
  3. Stream Add 4:
    • 4 > 3 (max of Left). Push 4 into larger_minh.
    • Heaps: smaller_maxh = [-3, -2], larger_minh = [4, 8, 10, 5].
    • Rebalance: len(Right) = 4, len(Left) = 2. Since 4 > 2 + 1, we pop from Right (4) and push to Left (-4).
    • Heaps: smaller_maxh = [-4, -3, -2] (real: [4, 3, 2]), larger_minh = [5, 8, 10].
    • Median: Sizes are equal ($3$ and $3$). Median is (4 + 5) / 2.0 = 4.5.

Complexity Analysis

  • Time Complexity:
    • Addition: $O(\log N)$. Standard binary heaps support insertion and removal of the root element in logarithmic time.
    • Median Lookup: $O(1)$. We only query the roots of the heaps, which takes constant time.
    • Overall Stream: $O((M + A) \log(M + A))$ where $M$ is the size of the initial dataset and $A$ is the number of streaming additions.
  • Space Complexity: $O(M + A)$ to store the data points in the heaps.

Common Pitfalls in Interviews

  1. Checking sizes without routing first: A common error is trying to keep the heaps balanced without routing the new element to the correct heap first. Always insert the element into the correct half first, and then rebalance.
  2. Empty Heap Crashes: When checking if a number belongs in the Left half (num <= -smaller_maxh[0]), always check if smaller_maxh is empty first. Attempting to access index 0 of an empty list raises an IndexError.
  3. Double Rebalancing: You only need to move at most one element during an insertion step. Do not use while loops for rebalancing; if/elif blocks are sufficient because we insert only one element at a time.
  4. Sign Errors: If using the negation trick, remember to add negative signs when pushing to or popping from the simulated Max-Heap, and negate it when accessing the top element.