Teaching     Home
Dartmouth Logo
Computer Science
Dartmouth College

Computer Science 35/135
Data Stream Algorithms
Amit Chakrabarti


Fall 2015

Where: Sudikoff 115

When: 2A hour: TTh 14:00-15:50, X-hr W 16:15-17:20

Who: Instructor: Amit Chakrabarti, Sudikoff 107, Office hours: Mon 9:30-10:30 or by appointment

What: This course studies algorithms that process massive amounts of data; so massive that they will not fit in a computer's storage. As we shall see, this forces one to rethink even very basic algorithmic problems, such as counting the number of distinct elements, selection, and sorting. We shall study techniques to summarize such large amounts of data into succinct "sketches" that nevertheless retain important and useful information. We shall introduce the necessary mathematical tools along the way. Most techniques we shall study come from research in the last decade or so, although a few date as far back as the 1970s. The techniques we shall study have been applied successfully to web data compression, approximate query processing in databases, network measurement and signal processing. The course itself will focus on the underlying techniques rather than the specific applications.

Prerequisites: An undergraduate-level course in Algorithms (such as our CS 31)
  or else: A strong mathematics background and permission of the instructor.

Lecture Plan: The following is a rough outline, and will very likely be changed once the course gets going.

I. Basic Algorithms: Estimating Statistics
1. Sep 17 Data stream model; Our first algorithm: the majority problem.
2. Sep 22 The distinct-elements problem; Randomized algorithms; Approximation.
3. Sep 24 More modern results on distinct-elements.
4. Sep 26 Count Sketch; Count-Min Sketch; majority revisited; The heavy-hitters problem.
5. Sep 29 Frequency moments; The AMS algorithm; Improving a Basic Estimator.
6. Oct 1 The amazing AMS second moment (F2) algorithm; Dimension reduction; Johnson-Lindenstrauss Lemma.
7. Oct 6 Stable distributions; Estimating L1 distance; Indyk's algorithm and Lp norms, p ∈ (0,2).
8. Oct 8 The rest of the Lp norms; Sketch using precision sampling.
9. Oct 13 Proof of the Precision Sampling Lemma.
10. Oct 15 The median and selection problems; Random order versus adversarial order.
II. More Advanced Algorithms: Graphs, Geometry, Sequences
11. Oct 20 Geometric problems; Coresets; The min-enclosing-ball problem.
12. Oct 22 Metric streams; Clustering; k-center, k-median, k-means.
13. Oct 29 Finding a good k-center clustering; Doubling algorithm; Guha's algorithm.
14. Nov 3 Triangle counting; Mininum Spanning Trees.
15. Nov 5 Maximum Matchings.
III. Lower Bounds
16. Nov 10 Communication complexity; The index, disjointness and gap-hamming problems.
17. Nov 12 Some reductions from communication to streaming problems.
18. Nov 17 First set of student presentations
19. Nov 20 (starting 11:30am) Second set of student presentations

Textbooks: There is no set textbook for this course, but we have some evolving in-house course notes, thanks to the scribing efforts of students from the previous offering of the course. A good fraction of the material we shall study only resides in research papers. There is a good survey by Muthukrishnan, somewhat dated at this point, but still very useful for the basics.

Homework Sets: We will have four homework sets in total.

Here is a description of the grading principles for these homework sets. Please also note all other applicable policies, including how the honor principle applies to this course.

Class Presentation: In lieu of a final exam, students are required to read up on an advanced topic in data stream algorithms (usually represented by a research paper), and give a short, 30-minute presentation on its main ideas. Below is a list of suggested papers for such a presentation.