https://DevOpsCloud.io -- Cloud Monk Losang Jinpa

Algorithms and Data Structures for Massive Datasets Table of Contents

Return to Algorithms and Data Structures for Massive Datasets

contents

front matter

preface

acknowledgments

about this book

about the authors

about the cover illustration

1 Introduction

1.1 An example

An example: How to solve it

How to solve it, take two: A book walkthrough

1.2 The structure of this book

1.3 What makes this book different and whom it is for

1.4 Why is massive data so challenging for today’s systems?

The CPU memory performance gap

Memory hierarchy

Latency vs. bandwidth

What about distributed systems?

1.5 Designing algorithms with hardware in mind

Part 1 Hash-based sketches

2 Review of hash tables and modern hashing

2.1 Ubiquitous hashing

2.2 A crash course on data structures

2.3 Usage scenarios in modern systems

Deduplication in backup/storage solutions

Plagiarism detection with MOSS and Rabin–Karp fingerprinting

2.4 O(1)—What's the big deal?

2.5 Collision resolution: Theory vs. practice

2.6 Usage scenario: How Python’s dict does it

2.7 MurmurHash

2.8 Hash tables for distributed systems: Consistent hashing

A typical hashing problem

Hashring

Lookup

Adding a new node/resource

Removing a node

Consistent hashing scenario: Chord

Consistent hashing: Programming exercises

3 Approximate membership: Bloom and quotient filters

3.1 How it works

Insert

Lookup

3.2 Use cases

Bloom filters in networks: Squid

Bitcoin mobile app

3.3 A simple implementation

3.4 Configuring a Bloom filter

Playing with Bloom filters: Mini experiments

3.5 A bit of theory

Can we do better?

3.6 Bloom filter adaptations and alternatives

3.7 Quotient filter

Quotienting

Understanding metadata bits

Inserting into a quotient filter: An example

Python code for lookup

Resizing and merging

False positive rate and space considerations

3.8 Comparison between Bloom filters and quotient filters

4 Frequency estimation and count-min sketch

4.1 Majority element

General heavy hitters

4.2 Count-min sketch: How it works

Update

Estimate

4.3 Use cases

Top-k restless sleepers

Scaling the distributional similarity of words

4.4 Error vs. space in count-min sketch

4.5 A simple implementation of count-min sketch

Exercises

Intuition behind the formula: Math bit

4.6 Range queries with count-min sketch

Dyadic intervals

Update phase

Estimate phase

Computing dyadic intervals

5 Cardinality estimation and HyperLogLog

5.1 Counting distinct items in databases

5.2 HyperLogLog incremental design

The first cut: Probabilistic counting

Stochastic averaging, or “when life gives you lemons”

LogLog

HyperLogLog: Stochastic averaging with harmonic mean

5.3 Use case: Catching worms with HLL

5.4 But how does it work? A mini experiment

The effect of the number of buckets (m)

5.5 Use case: Aggregation using HyperLogLog

Part 2 Real-time analytics

6 Streaming data: Bringing everything together

6.1 Streaming data system: A meta example

Bloom-join

Deduplication

Load balancing and tracking the network traffic

6.2 Practical constraints and concepts in data streams

In real time

Small time and small space

Concept shifts and concept drifts

Sliding window model

6.3 Math bit: Sampling and estimation

Biased sampling strategy

Estimation from a representative sample

7 Sampling from data streams

7.1 Sampling from a landmark stream

Bernoulli sampling

Reservoir sampling

Biased reservoir sampling

7.2 Sampling from a sliding window

Chain sampling

Priority sampling

7.3 Sampling algorithms comparison

Simulation setup: Algorithms and data

8 Approximate quantiles on data streams

8.1 Exact quantiles

8.2 Approximate quantiles

Additive error

Relative error

Relative error in the data domain

8.3 T-digest: How it works

Digest

Scale functions

Merging t-digests

Space bounds for t-digest

8.4 Q-digest

Constructing a q-digest from scratch

Merging q-digests

Error and space considerations in q-digests

Quantile queries with q-digests

8.5 Simulation code and results

Part 3 Data structures for databases and external memory algorithms

9 Introducing the external memory model

9.1 External memory model: The preliminaries

9.2 Example 1: Finding a minimum

Use case: Minimum median income

9.3 Example 2: Binary search

Bioinformatics use case

Runtime analysis

9.4 Optimal searching

9.5 Example 3: Merging K sorted lists

Merging time/date logs

External memory model: Simple or simplistic?

9.6 What’s next

10 Data structures for databases: B-trees, Bε-trees, and LSM-trees

10.1 How indexing works

10.2 Data structures in this chapter

10.3 B-trees

B-tree balancing

Lookup

Insert

Delete

B+-trees

How operations on a B+-tree are different

Use case: B-trees in MySQL (and many other places)

10.4 Math bit: Why are B-tree lookups optimal in external memory?

Why B-tree inserts/deletes are not optimal in external memory

10.5 Bε-trees

Bε-tree: How it works

Buffering mechanics

Inserts and deletes

Lookups

Cost analysis

Bε-tree: The spectrum of data structures

Use case: Bε-trees in TokuDB

Make haste slowly, the I/O way

10.6 Log-structured merge-trees (LSM-trees)

The LSM-tree: How it works

LSM-tree cost analysis

Use case: LSM-trees in Cassandra

11 External memory sorting

11.1 Sorting use cases

Robot motion planning

Cancer genomics

11.2 Challenges of sorting in external memory: An example

Two-way merge-sort in external memory

11.3 External memory merge-sort (M/B-way merge-sort)

Searching and sorting in RAM vs. external memory

11.4 What about external quick-sort?

External memory two-way quick-sort

Toward external memory multiway quick-sort

Finding enough pivots

Finding good enough pivots

Putting it all back together

11.5 Math bit: Why is external memory merge-sort optimal?

11.6 Wrapping up

references

index