https://DevOpsCloud.io -- Cloud Monk Losang Jinpa

Algorithms and Data Structures for Massive Datasets Index

Return to Algorithms and Data Structures for Massive Datasets

index

Numerics

2d keys 219 – 220, 224 – 225

A

acceptance-rejection method 147

additive error 172 – 173

adversary argument 231

algorithms

comment data example 3 – 8

comment data as stream 7

comment data in database 8

how to solve 4 – 8

designing with hardware in mind 12 – 14

sampling from data streams 163 – 166

structure and purpose of book 8 – 10

analysis tier 122, 125

anchor slot 66

approximate membership

Bloom filters 50 – 52

quotient filters 63 – 72

approximate quantiles 168 – 193

additive error 172 – 173

exact quantiles 169 – 171

q-digest 184 – 189

constructing from scratch 184 – 186

error and space considerations in 188

merging 186 – 187

quantile queries with 188 – 189

relative error 173 – 174

simulation code and results 189 – 192

t-digest 174 – 184

digestion 175 – 176

merging 180 – 183

scale functions 177 – 180

space bounds for 183 – 184

arrival index intervals 175

(article-id → keyword_frequency) dictionaries 4

(article-id → keyword_frequency) hash tables 7

Art of Computer Programming, Volume 2, The (Knuth) 145

B

B+-trees 227 – 229

balanced binary search trees 22

bandwidth, latency vs. 12

Bernoulli sampling 7, 143 – 146

Bε-trees

buffering mechanics 233 – 234

cost analysis 236 – 238

deletes 236

I/Os 240

inserts 236

lookups 236

overview 233

spectrum of data structures 238

use case 238 – 239

BFPRT (Blum, Floyd, Pratt, Rivest, and Tarjan) 171

BFs (Bloom filters) 6, 50 – 52, 127

adaptations and alternatives 62 – 63

better false positive rates 61 – 62

configuring 56 – 59

inserting items into 51

lookups 51 – 52

quotient filters compared to 72 – 74

successful lookups 74

uniform random inserts 73

uniform random lookups 73 – 74

simple implementation of 55 – 56

theory 59 – 62

use cases 53 – 54

Bitcoin mobile app 54

Squid 53 – 54

biased parameter 165

biased reservoir sampling 151

biased sampling strategy 136

binary search 204 – 207

minimum median income 204 – 206

runtime analysis 206 – 207

bioinformatics 204 – 206

bitarray library 55

Bitcoin mobile app 54

BLOCK_SIZE_ELEMENTS 210 – 211

Bloom filters see BFs (Bloom filters)

Bloom-joins 126 – 128

Blum, Floyd, Pratt, Rivest, and Tarjan (BFPRT) 171

brokers 129

B-trees 219 – 230

balancing 220 – 221

deletes 224 – 227, 231 – 232

inserts 221 – 224, 231 – 232

lookups 221, 230 – 232

use case 229 – 230

bucket_occupied bit 66 – 67, 70

buffer_in list 210

buildFingerTables(self) method 46

C

cache digests 53

cancer genomics 250 – 251

cardinality estimation 98 – 118

aggregation using HLL 114 – 116

counting distinct items in databases 99 – 100

incremental design 100 – 109

use cases for HLL 109 – 110, 114 – 116

Cassandra 245

centroid 175

chain sampling 156 – 160

Chernoff bounds 60

Chord 44 – 47

chordLookup(self,hashValue) method 47

ChunkStash [1] 24

close operation 204

clustered index 216

clusters 66

CMS (count-min sketch) 75 – 97

error vs. space in 88

estimate operation 80 – 81

general heavy-hitters problem 78 – 79

majority problem 76 – 79

range queries with 91 – 97

computing dyadic intervals 95 – 97

dyadic intervals 91 – 92

estimate phase 94 – 95

update phase 93 – 94

simple implementation of 88 – 91

update operation 80

use cases 82 – 87

scaling distributional similarity of words 85 – 87

top-k restless sleepers 82 – 85

collision resolution 29 – 32

comment data example 3 – 8

comment data as stream 7

comment data in database 8

how to solve 4 – 8

(comment-id → frequency) dictionary 4

(comment-id → frequency) hash table 6 – 7

compression parameter 184

concept drift 133

concept shifts 133

consistent hashing 34 – 47

adding new nodes/resources 39 – 41

Chord 44 – 47

hashring 36 – 38

lookup 38 – 39

removing nodes 41 – 44

typical problem 35

constant density 169

constant-time operations 28 – 29, 52

convex hull 249

CountMinSketch class 89

count(v) 185

crawlers 136

cuckoo hashing 32

CurrentSample object 165

D

DAM (disk-access model) 197 – 214

binary search 204 – 207

runtime analysis 206 – 207

use case 204 – 206

finding minimum 201 – 204

merging K sorted lists 209 – 213

optimal searching 207 – 209

overview 199 – 201

simple vs. simplistic 213

data intensive, meaning of 3

data stream data (DSD) objects 163

data stream task (DST) class 163, 165

data structures

comment data 3 – 8

as stream 7

in database 8

solving 4 – 8

overview 22 – 23

structure and purpose of book 8 – 10

deduplication 24, 128 – 129

in backup/storage solutions 24 – 26

delete operation 62

dictionaries 22

dict key-value dictionary 32

dict library 4

dict type 46

digest 174

distance method 38

DISTINCT keyword 99

distributed systems

hash tables for 34 – 47

adding new nodes/resources 39 – 41

Chord 44 – 47

hashring 36 – 38

lookup 38 – 39

removing nodes 41 – 44

typical problem 35

massive datasets and 12

distributional similarity 85

dkeys 224

DSC_Sample() function 164

DSC_Sample class 165

DSD (data stream data) objects 163

DSD_Gaussians() function 163

DSD_ReadCSV class 164

DST (data stream task) class 163, 165

dyadic intervals

computing 95 – 97

overview 91 – 92

E

ε-approximate φ quantile 172

empirical relative errors 175

estimate operation 79 – 81, 83

estimate query 87

estimation

count-min sketch 80 – 81, 94 – 95

streaming data 135, 139 – 141

event stream 7

external-memory algorithms 2, 8

external memory sorting 248 – 265

challenges of 251 – 255

external memory merge-sort 255 – 257

external quick-sort 258 – 263

multiway 259 – 260

pivots 260 – 262

two-way 258 – 259

use cases 249 – 251

cancer genomics 250 – 251

robot motion planning 249 – 250

external quick-sort 258 – 263

multiway 259 – 260

pivots 260 – 262

two-way 258 – 259

F

file_names list 210

file_processed list 211

files_loc list 210

fingerprint 63

fingerprint variable 65

fingerTable attribute 46

frequency estimation 75 – 97

error vs. space in count-min sketch 88

estimate operation 80 – 81

general heavy-hitters problem 78 – 79

majority problem 76 – 79

range queries with count-min sketch 91 – 97

simple implementation of count-min sketch 88 – 91

update operation 80

use cases for count-min sketch 82 – 87

fully merged t-digest 179

G

general heavy-hitters problem 78 – 79

gripping power 170

H

hash128 function 34

hash64 function 34

hash-based sketching data structures 8

hashing 19 – 47

collision resolution 29 – 32

consistent hashing 34 – 47

adding new nodes/resources 39 – 41

Chord 44 – 47

hashring 36 – 38

lookup 38 – 39

removing nodes 41 – 44

typical problem 35

constant-time operations 28 – 29

MurmurHash 33 – 34

ubiquitous nature of 20 – 22

usage scenarios 24 – 28

deduplication 24 – 26

plagiarism detection 26 – 28

Python dict key-value dictionary 32 – 33

HashMap library 4

hashring 36 – 38

HashRing class 36 – 38, 46

hashValue attribute 37

h-bit hash 64

HDFS query processor (HQP) 128

HLL1[1..m] HyperLogLog 115

HLL2[1..m] HyperLogLog 115

HLL (HyperLogLog) 98 – 118

aggregation using 114 – 116

counting distinct items in databases 99 – 100

effect of number of buckets 113 – 114

incremental design 100 – 109

error and space considerations in 109

LogLog 105 – 106

probabilistic counting 101 – 103

stochastic averaging 103 – 104

stochastic averaging with harmonic mean 106 – 109

use cases 109 – 110, 114 – 116

HLL_UNION[1..m] HyperLogLog 115

Horvitz-Thompson type estimator 140

HQP (HDFS query processor) 128

HyperLogLog data structure 7

I

inclusion probability 150 – 151

indexing 216 – 218

inserts

Bloom filters 51, 73

B-trees 221 – 224, 231 – 232

quotient filters 66 – 69, 73

inverse probability integral transform 145

inverted index 26

is_shifted bit 66

ith bucket 103

K

keys fingers 45

K-mers 204

krandom bits 74

k-wise independent hash functions 31

L

landmark stream,sampling from 143 – 156

Bernoulli sampling 143 – 146

biased reservoir sampling 151 – 156

reservoir sampling 146 – 151

latency, bandwidth vs. 12

leveling merge policy 243

light node 54

likes attribute 7

limited working storage 133

linear probing 29

linked lists 22

load balancing 130 – 132

load shedding 125

LogLog 105 – 106

error and space considerations in 105 – 106

super LogLog 106

lookupNode method 39, 41, 45

lookups

Bloom filters 51 – 52, 73 – 74

B-trees 221, 230 – 232

hash tables 38 – 39

quotient filters 69 – 71, 73 – 74

LSM-trees (log-structured merge-trees) 240 – 245

cost analysis 245

overview 241 – 244

use case 245

M

M/B-way merge-sort (external memory merge-sort) 255 – 257

majority problem 76 – 79

map library 4

marked attribute 95

massive datasets 1 – 15

challenges of 10 – 12

CPU memory performance gap 10 – 11

distributed systems 12

latency vs. bandwidth 12

memory hierarchy 11 – 12

comment data 3 – 8

as stream 7

in database 8

solving 4 – 8

structure and purpose of book 8 – 10

median of medians algorithm (Blum, Floyd, Pratt, Rivest, and Tarjan) (BFPRT) 171

memory

CPU memory performance gap 10 – 11

memory hierarchy 11 – 12

mergeability 183

message-queuing tier 122

min-heap 83

min variable 202

mmh3 MurmurHash wrapper 88

mmh3 package 33

MOSS (Measure of Software Similarity) 26 – 28

moveResources helper method 39

Murmur 33

MurmurHash 33 – 34

MySQL 229 – 230

N

network traffic tracking 130 – 132

Node class 37, 46

nodes

adding 39 – 41

removing 41 – 44

O

old cluster 182

O(log n) nodes 45

one pass 132

online aggregation 143

open addressing 29

open operation 204

P

PERTURB_SHIFT constant 33

perturb variable 33

pivot 219

plagiarism detection 26 – 28

PMI (pointwise mutual information) 85

Poisson sampling 146

priority sampling 160 – 163

PRNG (pseudo-random number generator algorithm) 144

probabilistic counting 101 – 103

probability of residing at N 151

probability p 60

product_id attribute 99

Psaltis, Andrew G. 122

Python

dict key-value dictionary 32 – 33

quotient filter lookups 69 – 71

Q

q-digest 184 – 189

constructing from scratch 184 – 186

error and space considerations in 188

merging 186 – 187

quantile queries with 188 – 189

quantile 170

query processing time 133

quotient 64

QuotientFilter class 69

quotient filters 63 – 72

Bloom filters compared to 72 – 74

successful lookups 74

uniform random inserts 73

uniform random lookups 73 – 74

false positive rate and space considerations 72

inserting items into 66 – 69

metadata bits 65 – 66

Python code for lookups 69 – 71

quotienting 64 – 65

resizing and merging 71 – 72

storing 70 – 71

quotienting 64

R

Rabin-Karp fingerprinting 26 – 28

range queries

with count-min sketch 91 – 97

computing dyadic intervals 95 – 97

dyadic intervals 91 – 92

estimate phase 94 – 95

update phase 93 – 94

read_block function 202

readline() function 207

read operation 204

read optimized 8

read-write optimized 8

real-time analytics 132

relative error

in data domain 174

overview 173 – 174

remainder 64

representative sampling 136

reservoir sampling

biased 151 – 156

overview 146 – 151

residing probabilities 152

resources dictionary 38, 41

restriction rule 106

robot motion planning 249 – 250

rolling hashes 26

run_continued bit 66, 70

runs 65, 242

S

sampling from data streams 135 – 167

algorithms comparison 163 – 166

biased strategy 136 – 139

from landmark stream 143 – 156

Bernoulli sampling 143 – 146

biased reservoir sampling 151 – 156

reservoir sampling 146 – 151

from sliding window 156 – 163

chain sampling 156 – 160

priority sampling 160 – 163

seed parameter 33

seek() function 204, 207

seek operation 204

SELECT operation 99

session_id attribute 99

signed parameter 33

simple random sample (SRS) 136, 160

simplified payment verification (SPV) 54

sketch 174

sliding window model 133 – 135

sampling from 156 – 163

chain sampling 156 – 160

priority sampling 160 – 163

Slot class 69

small space 132

small time 133

sort() function 248

sorted arrays 22

sorting and selection problem 171

spatial locality 12

speed of aging factor 153

SPV (simplified payment verification) 54

squeezing 148 – 149

Squid 53 – 54

SRS (simple random sample) 136, 160

SSTs (string-sorted tables) 49, 245

stochastic averaging

overview 103 – 104

with harmonic mean 106 – 109

streaming data 121 – 141

approximate quantiles on 168 – 193

additive error 172 – 173

exact quantiles 169 – 171

q-digest 184 – 189

relative error 173 – 174

relative error in data domain 174

simulation code and results 189 – 192

t-digest 174 – 184

estimation 135, 139 – 141

meta example 126 – 132

Bloom-joins 126 – 128

deduplication 128 – 129

load balancing and tracking network traffic 130 – 132

practical constraints and concepts 132 – 135

concept shifts and concept drifts 133

real-time analytics 132

sliding window model 133 – 135

small time and small space 133

sampling 135 – 167

algorithms comparison 163 – 166

biased strategy 136 – 139

landmark stream 143 – 156

sliding window 156 – 163

Streaming Data (Psaltis) 122

stream package 156, 163 – 164

succinct data structures 6

sudden bursts 130

summary 174

T

t-digest 174 – 184

digestion 175 – 176

merging 180 – 183

scale functions 177 – 180

space bounds for 183 – 184

tdigest library 189

tell() function 207

tiering merge policy 244

time/date logs, merging 210 – 213

external memory version 210 – 213

RAM version 210

timestamp attribute 99

TokuDB 238 – 239

top-k PMIs 86

top-k queries 76

top-k restless sleepers 82 – 83, 85

top-k trending queries 75

truncation rule 106

two-way merge-sort 252

U

unclustered index 216

unsorted array 22

update operation 80, 93 – 94

(user-id, amount) pair 82 – 84

(user-id,sensor-id) pair 82

user_ip_address attribute 99

V

visit_duration attribute 99

W

weight 175

weighted Bloom filter 63

(word, context), pair 87

write amplification effect 244

write operation 204

write optimized 8

writes 240