Algorithms and Data Structures for Massive Datasets Table of Contents
Return to Algorithms and Data Structures for Massive Datasets
contents
front matter
preface
acknowledgments
about this book
about the authors
about the cover illustration
1 Introduction
1.1 An example
An example: How to solve it
How to solve it, take two: A book walkthrough
1.2 The structure of this book
1.3 What makes this book different and whom it is for
1.4 Why is massive data so challenging for today’s systems?
The CPU memory performance gap
Memory hierarchy
Latency vs. bandwidth
What about distributed systems?
1.5 Designing algorithms with hardware in mind
Part 1 Hash-based sketches
2 Review of hash tables and modern hashing
2.1 Ubiquitous hashing
2.2 A crash course on data structures
2.3 Usage scenarios in modern systems
Deduplication in backup/storage solutions
Plagiarism detection with MOSS and Rabin–Karp fingerprinting
2.4 O(1)—What's the big deal?
2.5 Collision resolution: Theory vs. practice
2.6 Usage scenario: How Python’s dict does it
2.7 MurmurHash
2.8 Hash tables for distributed systems: Consistent hashing
A typical hashing problem
Hashring
Lookup
Adding a new node/resource
Removing a node
Consistent hashing scenario: Chord
Consistent hashing: Programming exercises
3 Approximate membership: Bloom and quotient filters
3.1 How it works
Insert
Lookup
3.2 Use cases
Bloom filters in networks: Squid
Bitcoin mobile app
3.3 A simple implementation
3.4 Configuring a Bloom filter
Playing with Bloom filters: Mini experiments
3.5 A bit of theory
Can we do better?
3.6 Bloom filter adaptations and alternatives
3.7 Quotient filter
Quotienting
Understanding metadata bits
Inserting into a quotient filter: An example
Python code for lookup
Resizing and merging
False positive rate and space considerations
3.8 Comparison between Bloom filters and quotient filters
4 Frequency estimation and count-min sketch
4.1 Majority element
General heavy hitters
4.2 Count-min sketch: How it works
Update
Estimate
4.3 Use cases
Top-k restless sleepers
Scaling the distributional similarity of words
4.4 Error vs. space in count-min sketch
4.5 A simple implementation of count-min sketch
Exercises
Intuition behind the formula: Math bit
4.6 Range queries with count-min sketch
Dyadic intervals
Update phase
Estimate phase
Computing dyadic intervals
5 Cardinality estimation and HyperLogLog
5.1 Counting distinct items in databases
5.2 HyperLogLog incremental design
The first cut: Probabilistic counting
Stochastic averaging, or “when life gives you lemons”
LogLog
HyperLogLog: Stochastic averaging with harmonic mean
5.3 Use case: Catching worms with HLL
5.4 But how does it work? A mini experiment
The effect of the number of buckets (m)
5.5 Use case: Aggregation using HyperLogLog
Part 2 Real-time analytics
6 Streaming data: Bringing everything together
6.1 Streaming data system: A meta example
Bloom-join
Deduplication
Load balancing and tracking the network traffic
6.2 Practical constraints and concepts in data streams
In real time
Small time and small space
Concept shifts and concept drifts
Sliding window model
6.3 Math bit: Sampling and estimation
Biased sampling strategy
Estimation from a representative sample
7 Sampling from data streams
7.1 Sampling from a landmark stream
Bernoulli sampling
Reservoir sampling
Biased reservoir sampling
7.2 Sampling from a sliding window
Chain sampling
Priority sampling
7.3 Sampling algorithms comparison
Simulation setup: Algorithms and data
8 Approximate quantiles on data streams
8.1 Exact quantiles
8.2 Approximate quantiles
Additive error
Relative error
Relative error in the data domain
8.3 T-digest: How it works
Digest
Scale functions
Merging t-digests
Space bounds for t-digest
8.4 Q-digest
Constructing a q-digest from scratch
Merging q-digests
Error and space considerations in q-digests
Quantile queries with q-digests
8.5 Simulation code and results
Part 3 Data structures for databases and external memory algorithms
9 Introducing the external memory model
9.1 External memory model: The preliminaries
9.2 Example 1: Finding a minimum
Use case: Minimum median income
9.3 Example 2: Binary search
Bioinformatics use case
Runtime analysis
9.4 Optimal searching
9.5 Example 3: Merging K sorted lists
Merging time/date logs
External memory model: Simple or simplistic?
9.6 What’s next
10 Data structures for databases: B-trees, Bε-trees, and LSM-trees
10.1 How indexing works
10.2 Data structures in this chapter
10.3 B-trees
B-tree balancing
Lookup
Insert
Delete
B+-trees
How operations on a B+-tree are different
Use case: B-trees in MySQL (and many other places)
10.4 Math bit: Why are B-tree lookups optimal in external memory?
Why B-tree inserts/deletes are not optimal in external memory
10.5 Bε-trees
Bε-tree: How it works
Buffering mechanics
Inserts and deletes
Lookups
Cost analysis
Bε-tree: The spectrum of data structures
Use case: Bε-trees in TokuDB
Make haste slowly, the I/O way
10.6 Log-structured merge-trees (LSM-trees)
The LSM-tree: How it works
LSM-tree cost analysis
Use case: LSM-trees in Cassandra
11 External memory sorting
11.1 Sorting use cases
Robot motion planning
Cancer genomics
11.2 Challenges of sorting in external memory: An example
Two-way merge-sort in external memory
11.3 External memory merge-sort (M/B-way merge-sort)
Searching and sorting in RAM vs. external memory
11.4 What about external quick-sort?
External memory two-way quick-sort
Toward external memory multiway quick-sort
Finding enough pivots
Finding good enough pivots
Putting it all back together
11.5 Math bit: Why is external memory merge-sort optimal?
11.6 Wrapping up
references
index