T. S. Jayram, Satyen Kale and Erik Vee
Proceedings of the 18th ACM-SIAM Symposium on Discrete Algorithms (SODA), 2007

We study the problem of computing aggregation operators on probabilistic data in an I/O-efficient manner. Algorithms for aggregation operators such as SUM, COUNT, AVG, and MIN/MAX are crucial to applications on probabilistic databases. To analyze the I/O requirements of our algorithms, we introduce a generalization of the classical data stream model, called probabilistic streams, that handles probabilistic data. Whereas the algorithms for SUM and COUNT turn out to be simple, the problem is harder for both AVG and MIN/MAX. Although data stream algorithms typically use randomness, all of the algorithms we present are deterministic.
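For intuition (the notation here is introduced for illustration, under a simplified possible-worlds view in which element \(i\) is present independently with probability \(p_i\) and carries value \(a_i\); the probabilistic-stream model of the paper is more general), linearity of expectation gives
\[
\mathbb{E}[\mathrm{SUM}] = \sum_{i=1}^{n} p_i\, a_i, \qquad \mathbb{E}[\mathrm{COUNT}] = \sum_{i=1}^{n} p_i,
\]
so the expected SUM and COUNT can each be maintained by a single running sum in one pass with constant space. AVG = SUM/COUNT admits no such decomposition, since the expectation of a ratio does not factor, which is one way to see why AVG requires more work.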

For MIN and MAX, we obtain efficient one-pass data stream algorithms that estimate each of these quantities to relative accuracy \((1 + \epsilon)\), using constant update time per element and \(O((\log R)/\epsilon)\) space, where each element has a value between \(1\) and \(R\). For AVG, we present a new data stream algorithm that estimates its value to relative accuracy \((1 + \epsilon)\) in \(O(\log n)\) passes over the data, where \(n\) is the number of elements in the stream, using \(O((\log^2 n)/\epsilon)\) space and \(O((\log n)/\epsilon)\) update time per element. On the other hand, we prove a space lower bound of \(\Omega(n)\) for any deterministic one-pass data stream algorithm that computes AVG exactly. Complementing this result, we also present an exact deterministic algorithm that runs in \(O(n \log^2 n)\) time and uses \(O(n)\) space (thus removing the data-streaming restriction), improving dramatically on the previous \(O(n^3)\)-time algorithm. Our algorithms for AVG involve a novel technique based on generating functions and numerical integration, which may be of independent interest.
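As a concrete illustration of the MIN case, the following is a minimal one-pass sketch consistent with the stated bounds, under the simplified model above (each element present independently with probability \(p_i\), value \(a_i \in [1, R]\), and an empty outcome contributing \(0\)); it is a plausible bucketing scheme, not the paper's algorithm verbatim. Values are rounded down to powers of \((1+\epsilon)\), and each bucket tracks the probability that none of its elements is present, giving \(O(1)\) update time per element and \(O((\log R)/\epsilon)\) space.

```python
import math

def expected_min_estimate(stream, R, eps):
    """One-pass (1 + eps)-relative estimate of the expected MIN of a
    probabilistic stream.

    Hedged sketch (not the paper's algorithm): each stream element is a
    pair (a, p) meaning the element takes value a in [1, R] with
    probability p and is absent otherwise, independently; an empty
    outcome contributes 0 to the expectation.
    """
    # One bucket per power of (1 + eps) in [1, R]: O((log R)/eps) space.
    num_buckets = int(math.ceil(math.log(R, 1 + eps))) + 1
    # absent[b] = probability that no element bucketed into b is present.
    absent = [1.0] * num_buckets

    for a, p in stream:                            # single pass
        b = int(math.floor(math.log(a, 1 + eps)))  # bucket of value a
        absent[b] *= (1.0 - p)                     # O(1) update per element

    # Pr[min falls in bucket b]
    #   = Pr[all buckets below b are empty] * (1 - Pr[bucket b is empty]).
    estimate = 0.0
    prefix_empty = 1.0                             # Pr[buckets 0..b-1 empty]
    for b in range(num_buckets):
        bucket_value = (1 + eps) ** b              # bucket's lower endpoint
        estimate += bucket_value * prefix_empty * (1.0 - absent[b])
        prefix_empty *= absent[b]
    return estimate

# Example: three uncertain readings (exact expected MIN is 4.1).
print(expected_min_estimate([(8, 0.5), (3, 0.9), (20, 1.0)], R=100, eps=0.01))
```

For AVG, one elementary identity that matches the generating-function-and-integration description (stated as an assumption about the approach, not a claim from the abstract) is \(1/c = \int_0^1 y^{c-1}\,dy\) for integers \(c \ge 1\). It allows the expected value of SUM/COUNT to be written as \(\int_0^1 \mathbb{E}[\mathrm{SUM} \cdot y^{\mathrm{COUNT}-1}]\,dy\); under independence the integrand behaves like a generating function in \(y\) that can be computed element by element in a streaming pass, after which the integral is approximated by numerical integration.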

Finally, we present an experimental evaluation showing that our algorithms, coupled with additional heuristics, perform very well on large data sets.