Practical Monitoring - Effective Strategies for the Real World by Mike Julian

Table of Contents

I. Monitoring Principles

1. Monitoring Anti-Patterns

Avoid Cargo-Culting Tools

Sometimes, You Really Do Have to Build It

The Single Pane of Glass Is a Myth

OS Metrics Aren’t Very Useful — for Alerting

Collect Your Metrics More Often

Anti-Pattern #4: Using Monitoring as a Crutch

Anti-Pattern #5: Manual Configuration

Wrap-Up

2. Monitoring Design Patterns

Pattern #1: Composable Monitoring

The Components of a Monitoring Service

Pattern #2: Monitor from the User Perspective

Pattern #3: Buy, Not Build

It’s Cheaper

You’re (Probably) Not an Expert at Architecting These Tools

SaaS Allows You to Focus on the Company’s Product

No, Really, SaaS Is Actually Better

Pattern #4: Continual Improvement

Wrap-Up

3. Alerts, On-Call, and Incident Management

What Makes a Good Alert?

Stop Using Email for Alerts

Write Runbooks

Arbitrary Static Thresholds Aren’t the Only Way

Delete and Tune Alerts

Use Maintenance Periods

Attempt Automated Self-Healing First

On-Call

Fixing False Alarms

Cutting Down on Needless Firefighting

Building a Better On-Call Rotation

Incident Management

Postmortems

Wrap-Up

4. Statistics Primer

Before Statistics in Systems Operations

Math to the Rescue!

Statistics Isn’t Magic

Mean and Average

Median

Seasonality

Quantiles

Standard Deviation

Wrap-Up

II. Monitoring Tactics

5. Monitoring the Business

Business KPIs

Two Real-World Examples

Yelp

Reddit

Tying Business KPIs to Technical Metrics

My App Doesn’t Have Those Metrics!

Finding Your Company’s Business KPIs

Wrap-Up

6. Frontend Monitoring

The Cost of a Slow App

Two Approaches to Frontend Monitoring

Document Object Model (DOM)

Frontend Performance Metrics

OK, That’s Great, but How Do I Use This?

Logging

Synthetic Monitoring

Wrap-Up

7. Application Monitoring

Instrumenting Your Apps with Metrics

How It Works Under the Hood

Monitoring Build and Release Pipelines

Health Endpoint Pattern

Application Logging

Wait a Minute…Should I Have a Metric or a Log Entry?

What Should I Be Logging?

Write to Disk or Write to Network?

Serverless / Function-as-a-Service

Monitoring Microservice Architectures

Wrap-Up

8. Server Monitoring

Standard OS Metrics

CPU

Memory

Network

Disk

Load

SSL Certificates

SNMP

Web Servers

Database Servers

Load Balancers

Message Queues

Caching

DNS

NTP

Miscellaneous Corporate Infrastructure

DHCP

SMTP

Monitoring Scheduled Jobs

Logging

Collection

Storage

Analysis

Wrap-Up

9. Network Monitoring

The Pains of SNMP

What Is SNMP?

How Does It Work?

A Word on Security

How Do I Use SNMP?

Interface Metrics

Interface and Logging

Recap

Configuration Tracking

Voice and Video

Routing

Spanning Tree Protocol (STP)

Chassis CPU and Memory

Hardware

Flow Monitoring

Capacity Planning

Working Backward

Forecasting

Wrap-up

10. Security Monitoring

Monitoring and Compliance

User, Command, and Filesystem Auditing

Setting Up auditd

auditd and Remote Logs

Host Intrusion Detection System (HIDS)

rkhunter

Network Intrusion Detection System (NIDS)

Wrap-Up

11. Conducting a Monitoring Assessment

Business KPIs

Frontend Monitoring

Application and Server Monitoring

Security Monitoring

Alerting

Wrap-Up

A. An Example Runbook: Demo App

Demo App

Metadata

Escalation Procedure

External Dependencies

Internal Dependencies

Tech Stack

Metrics and Logs

Alerts

B. Availability Chart

Index

A

assessment example, Alerting

defining, What Makes a Good Alert?

desensitization to, Delete and Tune Alerts

email for, Stop Using Email for Alerts

flapping detection, Before Statistics in Systems Operations

maintenance period use for, Use Maintenance Periods

with Nagios, Before Statistics in Systems Operations

Amazon, The Cost of a Slow App

anti-patterns, Monitoring Anti-Patterns-Wrap-Up

checkbox monitoring, Anti-Pattern #3: Checkbox Monitoring-Collect Your Metrics More Often

manual configuration, Anti-Pattern #5: Manual Configuration

monitoring as a crutch, Anti-Pattern #4: Using Monitoring as a Crutch

monitoring-as-a-job, Anti-Pattern #2: Monitoring-as-a-Job

tool obsession, Anti-Pattern #1: Tool Obsession-The Single Pane of Glass Is a Myth

build and release pipeline monitoring, Monitoring Build and Release Pipelines-Monitoring Build and Release Pipelines

health endpoint patterns, Health Endpoint Pattern-Health Endpoint Pattern

instrumenting with metrics, Instrumenting Your Apps with Metrics-How It Works Under the Hood

logging, Application Logging-Write to Disk or Write to Network?

metrics versus log entries, Wait a Minute…Should I Have a Metric or a Log Entry?

microservice architectures, Monitoring Microservice Architectures-Monitoring Microservice Architectures

serverless platforms, Serverless / Function-as-a-Service

application performance monitoring (APM) tools (see APM tools)

arbitrary static thresholds, Arbitrary Static Thresholds Aren’t the Only Way

arithmetic mean, Mean and Average-Mean and Average

audisp-remote, auditd and Remote Logs

auditd, User, Command, and Filesystem Auditing-auditd and Remote Logs

auto-healing, Attempt Automated Self-Healing First

automation importance, Anti-Pattern #5: Manual Configuration

availability chart, Availability Chart

availability reporting, Analytics and Reporting-Analytics and Reporting

average, Mean and Average-Mean and Average

B

bad habits (see anti-patterns)

bandwidth, Interface Metrics, Interface Metrics

BGP routing, Routing

blackbox monitoring, Two Approaches to Frontend Monitoring

buffers, Memory

build and release pipeline monitoring, Monitoring Build and Release Pipelines-Monitoring Build and Release Pipelines

burn rate, Business KPIs

business KPIs (see KPIs (key performance indicators))

C

caches/caching, Memory, Caching

canary endpoint monitoring, Health Endpoint Pattern

capacity planning, Capacity Planning

churn rate, Business KPIs

cloud infrastructures, Monitoring Is Multiple Complex Problems Under One Name

cloud versus traditional architectures, Anti-Pattern #5: Manual Configuration

communication liaison, Incident Management

compliance, Monitoring and Compliance

composable monitoring, Pattern #1: Composable Monitoring-Alerting

alerting, Alerting

analytics and reporting, Analytics and Reporting-Analytics and Reporting

data collection, Data collection-Logs

data storage, Data storage-Data storage

visualization, Visualization-Visualization

configuration tracking, Configuration Tracking

console statement, Logging

consumption rate, Message Queues

continual improvement, Pattern #4: Continual Improvement

cost considerations, It’s Cheaper, No, Really, SaaS Is Actually Better

cost of goods sold (COGS), Business KPIs

cost per customer, Business KPIs

counters, Metrics

CPU usage, CPU, CPU and Memory

customer acquisition cost (CAC), Business KPIs

customer churn, Business KPIs

customer lifetime value (LTV), Business KPIs

D

daily active users (DAU), Business KPIs

dashboards, Visualization

data collection, Data collection-Logs

data storage, Data storage-Data storage

data visualization, Visualization-Visualization

database server performance, Database Servers-Database Servers

design patterns, Monitoring Design Patterns-Wrap-Up

buying tools versus building, Pattern #3: Buy, Not Build-No, Really, SaaS Is Actually Better

composable monitoring, Pattern #1: Composable Monitoring-Alerting

continual improvement, Pattern #4: Continual Improvement

monitoring from user perspective, Pattern #2: Monitor from the User Perspective, Monitoring the Business

DHCP, DHCP

disk performance, Disk-Disk

distributed tracing, Monitoring Microservice Architectures-Monitoring Microservice Architectures

DNS servers, DNS

DOM (Document Object Model), Document Object Model (DOM)

E

email alerts, Stop Using Email for Alerts

errors, Interface Metrics, Interface Metrics

Etsy, Monitoring Build and Release Pipelines

evicted items, Caching

F

false alarms, Fixing False Alarms

firefighting mode, Cutting Down on Needless Firefighting

flapping detection, Before Statistics in Systems Operations

flow monitoring, Flow Monitoring-Flow Monitoring

follow-the-sun (FTS) rotations, Building a Better On-Call Rotation

forecasting, Forecasting

frontend monitoring, Frontend Monitoring-Wrap-Up

assessment example, Frontend Monitoring

defining, Frontend Monitoring

logging, Logging

Navigation Timing API, Navigation Timing API-Navigation Timing API

performance importance, The Cost of a Slow App-The Cost of a Slow App

Real User Monitoring (RUM), Two Approaches to Frontend Monitoring

speed index, Speed Index

synthetic monitoring, Two Approaches to Frontend Monitoring, Synthetic Monitoring

function-as-a-service, Serverless / Function-as-a-Service

G

gauges, Metrics

Google Analytics, Two Approaches to Frontend Monitoring, OK, That’s Great, but How Do I Use This?

gross profit margin, Business KPIs

H

habits, bad (see anti-patterns)

health endpoint pattern monitoring, Health Endpoint Pattern-Health Endpoint Pattern

hit/miss ratio, Caching

host intrusion detection system (HIDS), Host Intrusion Detection System (HIDS)-rkhunter

I

incident commander (IC), Incident Management

incident management, Incident Management-Incident Management

IOPS (I/O per Second), Disk, Database Servers

iostat, Disk

IPFIX, Flow Monitoring

J

J-Flow, Flow Monitoring

JavaScript, Document Object Model (DOM)

jitter, Interface Metrics

K

keepalives, Web Servers

KPIs (key performance indicators), Business KPIs-Wrap-Up

determining, Finding Your Company’s Business KPIs-Finding Your Company’s Business KPIs, Business KPIs-Business KPIs

Reddit example, Reddit-Tying Business KPIs to Technical Metrics

tying to technical metrics, Tying Business KPIs to Technical Metrics-My App Doesn’t Have Those Metrics!

Yelp example, Yelp-Yelp

L

latency, Interface Metrics

line graphs, Visualization

load, Load

load balancers, Load Balancers

log analysis, Analysis

log collection, Logs-Logs, Logging

log entries, Application Logging-Write to Disk or Write to Network?

log levels, What Should I Be Logging?

log storage, Data storage, Storage, auditd and Remote Logs

logging, Logging

LTV (lifetime value), Business KPIs

M

maintenance periods, Use Maintenance Periods

manual configuration, Anti-Pattern #5: Manual Configuration

mean, Mean and Average-Mean and Average

median, Median

memory usage, CPU and Memory

memory used, Memory-Memory

message queues, Message Queues

metrics

bandwidth, Interface Metrics, Interface Metrics

CPU usage, CPU

disk performance, Disk-Disk

errors, Interface Metrics, Interface Metrics

jitter, Interface Metrics

latency, Interface Metrics

load, Load

memory used, Memory-Memory

network performance, Network

SNMP (Simple Network Management Protocol), Interface Metrics-Interface Metrics

standard OS, Standard OS Metrics-Load

throughput, Interface Metrics, Interface Metrics

versus log entries, Wait a Minute…Should I Have a Metric or a Log Entry?

metrics collection, Metrics

metrics collection frequency, Collect Your Metrics More Often

metrics storage, Data storage

MIBs (management information base files), How Does It Work?

microservice architectures, Monitoring Microservice Architectures-Monitoring Microservice Architectures

monitoring

reasons for ineffectiveness of, Anti-Pattern #3: Checkbox Monitoring-Collect Your Metrics More Often

monitoring assessment example, Conducting a Monitoring Assessment-Wrap-Up

monitoring service components, The Components of a Monitoring Service-Alerting (see also monitoring components)

monthly active users (MAU), Business KPIs

monthly recurring revenue, Business KPIs

N

Nagios, Pattern #1: Composable Monitoring, Arbitrary Static Thresholds Aren’t the Only Way

alerting with, Before Statistics in Systems Operations

statistics in, Math to the Rescue!

Navigation Timing API, Navigation Timing API-Navigation Timing API

NetFlow, Flow Monitoring

network intrusion detection system (NIDS), Network Intrusion Detection System (NIDS)-Network Intrusion Detection System (NIDS)

network monitoring, Network Monitoring-Wrap-up

capacity planning, Capacity Planning

configuration tracking, Configuration Tracking

CPU and memory usage, CPU and Memory

device chassis, Chassis

flow monitoring, Flow Monitoring-Flow Monitoring

hardware, Hardware

routing protocols, Routing

SNMP (see SNMP (Simple Network Management Protocol))

spanning tree protocol (STP), Spanning Tree Protocol (STP)

voice and video performance, Voice and Video

network performance, Network

network taps, Network Intrusion Detection System (NIDS)

normal distributions, Standard Deviation

NPS (net promoter score), Business KPIs

number of paying customers, Business KPIs

O

Observability Teams, Anti-Pattern #2: Monitoring-as-a-Job

Observer Effect, The, Monitoring Is Multiple Complex Problems Under One Name

OIDs (object identifiers), How Does It Work?

on-call, On-Call-Building a Better On-Call Rotation

compensation, Building a Better On-Call Rotation

rotations for, Building a Better On-Call Rotation-Building a Better On-Call Rotation

tools for, Building a Better On-Call Rotation

OOMKiller, Memory

OS metrics alerts, OS Metrics Aren’t Very Useful — for Alerting

OSPF routing, Routing

overreliance on monitoring, Anti-Pattern #4: Using Monitoring as a Crutch

P

page load times, The Cost of a Slow App-The Cost of a Slow App

percentiles, Quantiles

persistent connections, Web Servers

pie charts, Visualization

Pinterest, The Cost of a Slow App

postmortems, Postmortems

protocol changes, Spanning Tree Protocol (STP)

pull model of data collection, Data collection

push model of data collection, Data collection

Q

QoS (quality of service) monitoring, Voice and Video

qps (queries per second), Database Servers, DNS

quantiles, Quantiles

queue length, Message Queues

R

real user monitoring (RUM), Two Approaches to Frontend Monitoring

Reddit, Reddit-Tying Business KPIs to Technical Metrics

reporting and analytics, Analytics and Reporting-Analytics and Reporting

req/sec (requests per second), Web Servers

return codes, Health Endpoint Pattern

revenue per customer, Business KPIs

rkhunter, rkhunter-rkhunter

root bridge changes, Spanning Tree Protocol (STP)

rootkits, Host Intrusion Detection System (HIDS)-rkhunter

routing protocols, Routing

rsyslog, auditd and Remote Logs

run rate, Business KPIs

runbooks

abuse of, Anti-Pattern #5: Manual Configuration

example, An Example Runbook: Demo App-Alerts

linking alerts to, Write Runbooks

S

SaaS services, Pattern #3: Buy, Not Build-No, Really, SaaS Is Actually Better

scheduled jobs, Monitoring Scheduled Jobs-Monitoring Scheduled Jobs

scribe, Incident Management

seasonality, Seasonality

security information and event management (SIEM) system, Network Intrusion Detection System (NIDS)

security monitoring, Security Monitoring-Wrap-Up

assessment example, Security Monitoring

auditing users, commands, and filesystems, User, Command, and Filesystem Auditing-auditd and Remote Logs

compliance, Monitoring and Compliance

host intrusion detection system (HIDS), Host Intrusion Detection System (HIDS)-rkhunter

network intrusion detection system (NIDS), Network Intrusion Detection System (NIDS)-Network Intrusion Detection System (NIDS)

server monitoring, Server Monitoring-Wrap-Up

assessment example, Application and Server Monitoring-Application and Server Monitoring

caching, Caching

database servers, Database Servers-Database Servers

DHCP, DHCP

DNS, DNS

load balancer metrics, Load Balancers

log analysis, Analysis

log collection, Logging

log storage, Storage

message queues, Message Queues

NTP servers, NTP

scheduled jobs, Monitoring Scheduled Jobs-Monitoring Scheduled Jobs

SMTP, SMTP

SNMP, SNMP (see also SNMP (Simple Network Management Protocol))

SSL certificates, SSL Certificates

standard OS metrics, Standard OS Metrics-Load

web server performance, Web Servers-Web Servers

serverless platforms, Serverless / Function-as-a-Service

severity levels, What Should I Be Logging?

sFlow, Flow Monitoring

Shopzilla, The Cost of a Slow App

SLA (service-level availability), Analytics and Reporting-Analytics and Reporting

slaves, Database Servers, DNS

smoothing, Mean and Average

SNMP (Simple Network Management Protocol), SNMP, The Pains of SNMP-Recap

background, What Is SNMP?

codec in use, Voice and Video

command line use, How Do I Use SNMP?-That’s great, Mike. But where’s the list of OIDs I should monitor?

interface and logging, Interface and Logging

interface metrics, Interface Metrics-Interface Metrics

securing, A Word on Security

traps, How Does It Work?

versions, How Does It Work?

spanning tree protocol (STP), Spanning Tree Protocol (STP)

SPAs (single-page apps), Frontend Monitoring

speed index, Speed Index

SSL certificates, SSL Certificates

standard deviation, Standard Deviation-Standard Deviation

statistics, Statistics Primer-Wrap-Up

mean and average, Mean and Average-Mean and Average

median, Median

quantiles/percentiles, Quantiles

seasonality, Seasonality

standard deviation, Standard Deviation-Standard Deviation

StatsD, Instrumenting Your Apps with Metrics-How It Works Under the Hood, Serverless / Function-as-a-Service

status endpoint monitoring, Health Endpoint Pattern

strip charts, Visualization

structured logs, Logs-Logs, Application Logging

subject matter experts (SMEs), Incident Management

synthetic monitoring, Two Approaches to Frontend Monitoring, Synthetic Monitoring

syslog forwarding, Collection

syslogd/syslog-ng, auditd and Remote Logs

systems resiliency and stability, Cutting Down on Needless Firefighting

T

TCP versus UDP, Collection

throughput, Interface Metrics, Interface Metrics

tools, Anti-Pattern #1: Tool Obsession-The Single Pane of Glass Is a Myth

building, Sometimes, You Really Do Have to Build It

buying versus building, Pattern #3: Buy, Not Build-No, Really, SaaS Is Actually Better

cargo-culting tools, Avoid Cargo-Culting Tools

choosing, Monitoring Is Multiple Complex Problems Under One Name-Avoid Cargo-Culting Tools

cost considerations, It’s Cheaper, No, Really, SaaS Is Actually Better

mapping to dashboards, The Single Pane of Glass Is a Myth

observation tools, Monitoring Is Multiple Complex Problems Under One Name

standardization of, Monitoring Is Multiple Complex Problems Under One Name

tool creep, Monitoring Is Multiple Complex Problems Under One Name

tool fragmentation, Monitoring Is Multiple Complex Problems Under One Name

total addressable market (TAM), Business KPIs

traditional versus cloud architectures, Anti-Pattern #5: Manual Configuration

TSDB (time series database), Data storage

U

UDP versus TCP, Collection

unstructured logs, Logs-Logs, Application Logging

user perspective in monitoring, Pattern #2: Monitor from the User Perspective, Monitoring the Business

V

visualization of data, Visualization-Visualization

voice and video performance, Voice and Video

W

web server performance, Web Servers-Web Servers

WebpageTest.org, Two Approaches to Frontend Monitoring, Speed Index, Synthetic Monitoring

weekly active users (WAU), Business KPIs

whitebox monitoring, Two Approaches to Frontend Monitoring

Y

Yelp, Yelp-Yelp

Z
