Statistics for Data Science, Written for Software Developers

Outcome focus: Defined a practical estimate-review workflow that helps software developers report effect size, confidence intervals, p-values, sampling bias, and classification metrics without treating statistics as glossary trivia.

The dashboard said the feature won.

The rollout still made the product worse.

The first time I saw that kind of failure, the model or experiment was not obviously broken. The chart had a higher conversion rate. The p-value was below the threshold. The metric owner was excited. The implementation was clean. What failed was the statistical review around the decision: the effect was tiny, the sample was biased toward power users, the uncertainty interval barely cleared zero, and the result was being interpreted as if it applied to the whole customer base.

Software developers do not need to become statisticians before working on data products, ranking systems, A/B platforms, ML features, or analytics pipelines.

But we do need enough statistics to know when a number is answering the wrong question.

A 2020 introductory post on statistics for data science has a useful beginner table of contents: population, sample, variables, central tendency, variability, relationships, probability distributions, probability, and accuracy. That outline is a reasonable starting map. The problem is that a glossary-style map is not enough for engineering decisions, and some common beginner definitions are too loose:

a statistical model is not the same thing as a population parameter;
mode is not only relevant for discrete data, although it is often most natural there;
normal distributions do not always have mean 0 and standard deviation 1; that is the standard normal;
a continuous distribution does not mean every outcome is equally likely; that is a uniform distribution;
CDF means cumulative distribution function, not cumulative density function;
R-squared is not limited to simple linear regression, although careless interpretation gets worse as models get more complex;
accuracy is a metric, not a guarantee that the model supports the decision.

Those corrections are not academic nitpicks. They are the difference between shipping a decision tool and shipping confidence theater.

The useful mental model for a developer is simple:

Statistical work turns observed samples into qualified decisions about a larger population or process.

That is the stack: sample, estimate, uncertainty, bias, decision.

Everything else in this post hangs on that.

The Minimum Operating Vocabulary

Start with the words that show up in code reviews, metric reviews, and model cards.

A population is the group, process, event stream, or system behavior you want to reason about. For a product experiment, it might be all eligible users. For a service metric, it might be all requests in production. For an ML model, it might be future prediction requests, not merely the CSV you trained on.

A sample is the subset you actually observe. NIST's engineering handbook section on populations and sampling describes the practical reason sampling exists: it is often impossible or impractical to measure everyone, so we use a sample to make inferences about a target population. NIST also warns that facts about a sample are not automatically facts about the population; adequacy depends on representativeness, sample size, population variability, and desired precision.

A variable is a measured attribute. In software terms, it is a column, event property, feature, label, log field, or derived metric. Variables have types:

Type	Example	Statistical implication
Numeric continuous	latency, revenue, temperature	mean and standard deviation may be meaningful; distributions matter
Numeric discrete	count of retries, number of purchases	Poisson/binomial-style thinking may fit
Categorical nominal	browser, plan type, region	compare proportions or encode carefully
Categorical ordinal	severity level, satisfaction bucket	order exists, spacing may not
Text/image/audio	review text, screenshot, support call	needs representation before most statistical modeling

A parameter is the true population value you usually do not know: the real conversion rate, mean latency, defect rate, or regression coefficient in the process you care about.

A statistic is the value computed from the sample: sample conversion rate, sample mean, sample standard deviation, sample correlation, validation precision.

An estimate is a statistic used as a guess for a parameter.

That last word matters most for software teams. A dashboard number is usually an estimate. A model metric is usually an estimate. A launch-readiness claim is usually an estimate. Treating estimates as facts is how teams overfit their decision-making to the last slice of data.

Descriptive Statistics Are Debugging Tools

Descriptive statistics summarize what you observed. They do not prove causality. They do not certify future behavior. They help you debug the data before you trust the analysis.

Measures of central tendency answer "where is the data centered?"

Mean: the arithmetic average. Sensitive to outliers.
Median: the middle value after sorting. More robust to outliers.
Mode: the most frequent value. Useful for categories, counts, and repeated numeric values.

Measures of variability answer "how spread out is the data?"

Range: max minus min. Fast, but dominated by extremes.
Variance: average squared distance from the mean (the sample version divides by n - 1 rather than n).
Standard deviation: square root of variance, back in the original unit.
Interquartile range: spread between the 25th and 75th percentiles.
Z-score: how many standard deviations a point sits from the mean.

For a software developer, the mistake is reaching for the mean too quickly.

Mean API latency can look stable while the p95 is getting worse. Mean revenue per user can rise because one enterprise customer upgraded. Mean queue time can look fine while a small segment waits far too long. Descriptive statistics should include shape, not only center.

descriptive-review.py

import pandas as pd
 
# Assumes `events` is a DataFrame with "plan" and "latency_ms" columns, loaded elsewhere.
summary = (
    events
    .groupby("plan")["latency_ms"]
    .agg(
        n="count",
        mean="mean",
        median="median",
        p90=lambda s: s.quantile(0.90),
        p95=lambda s: s.quantile(0.95),
        std="std",
        min="min",
        max="max",
    )
    .sort_values("p95", ascending=False)
)
 
print(summary)

That tiny table catches more production reality than a single average.
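To make the point concrete without a production dataset, here is a minimal stdlib-only sketch. The two latency lists are invented for illustration, and `p95` is a hypothetical helper, not part of any library.

```python
from statistics import mean, median, quantiles

# Hypothetical latency samples in ms (invented for illustration).
# "after" is faster for most requests but has a much worse tail.
before = [100] * 90 + [200] * 10
after = [90] * 90 + [290] * 10


def p95(values):
    # quantiles(n=100) returns the 99 percentile cut points; index 94 is p95.
    return quantiles(values, n=100, method="inclusive")[94]


for name, values in (("before", before), ("after", after)):
    print(
        f"{name}: mean={mean(values):.0f} "
        f"median={median(values):.0f} p95={p95(values):.0f}"
    )
```

Both lists have a mean of 110 ms. The median improves from 100 to 90 while the p95 goes from 200 to 290. A dashboard that only tracks the mean would report "no change."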

Inference Starts When You Leave the Sample

Statistical inference begins when you use observed data to say something about the unobserved population or future process.

NIST describes statistical tests as mechanisms for making quantitative decisions about a process: the test asks whether there is enough evidence to reject a null hypothesis. A confidence interval, in NIST's wording, addresses how well a sample statistic estimates the underlying population value by giving a range likely to contain the parameter of interest.

For developers, the practical sequence is:

Define the population.
Inspect the sample.
Compute the estimate.
Quantify precision.
Check bias.
Decide whether the effect is large enough to matter.

The estimate by itself is not the decision.

Suppose an onboarding experiment changes conversion from 5.00% to 5.25%.

That is a 5% relative lift, but only a 0.25 percentage-point absolute lift. Those are both true. They sound very different. The relative number is useful for comparing proportional improvement. The absolute number is useful for capacity planning, revenue modeling, and deciding whether the change is worth rollout risk.

experiment-estimate.py

from math import sqrt
from statistics import NormalDist
 
control = {"n": 100_000, "converted": 5_000}
treatment = {"n": 100_000, "converted": 5_250}
 
p_control = control["converted"] / control["n"]
p_treatment = treatment["converted"] / treatment["n"]
 
effect = p_treatment - p_control
relative_lift = effect / p_control
 
# Unpooled standard error: used for the confidence interval on the difference.
se = sqrt(
    p_treatment * (1 - p_treatment) / treatment["n"]
    + p_control * (1 - p_control) / control["n"]
)
 
ci_low = effect - 1.96 * se
ci_high = effect + 1.96 * se
 
# Pooled rate under the null hypothesis of no difference: used for the z-test.
pooled = (
    control["converted"] + treatment["converted"]
) / (
    control["n"] + treatment["n"]
)
 
pooled_se = sqrt(
    pooled * (1 - pooled) * (1 / treatment["n"] + 1 / control["n"])
)
 
z = effect / pooled_se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
 
print(f"control rate: {p_control:.3%}")
print(f"treatment rate: {p_treatment:.3%}")
print(f"absolute effect: {effect * 100:.3f} pp")
print(f"relative lift: {relative_lift:.1%}")
print(f"95% CI: [{ci_low * 100:.3f} pp, {ci_high * 100:.3f} pp]")
print(f"p-value: {p_value:.4f}")

Expected output:

control rate: 5.000%
treatment rate: 5.250%
absolute effect: 0.250 pp
relative lift: 5.0%
95% CI: [0.057 pp, 0.443 pp]
p-value: 0.0112

That result is not "the feature improves conversion by 5%."

A more honest read is: in this sample, treatment conversion was 0.25 percentage points higher; under the assumptions of the test, a 95% confidence interval for the absolute difference is about 0.057 to 0.443 percentage points; the p-value is about 0.011.

Then the engineering and product questions begin:

Is a 0.057 percentage-point lower-bound effect worth the operational complexity?
Was the sample representative of the users who will receive the feature?
Did the experiment run long enough to include weekday/weekend behavior?
Did assignment happen at the right unit: user, account, household, request, device?
Did telemetry drop events differently between groups?
Does the feature hurt a secondary metric?

Statistics does not remove judgment. It forces the judgment onto the page.

Effect Size Goes First

There is a trio worth reporting on every result:

effect size,
confidence interval or standard error,
p-value.

That is the order I would teach.

The effect size is the magnitude of the thing you care about. It should be reported in a unit the audience can interpret:

Question	Good effect-size language
Did conversion improve?	absolute percentage-point difference and relative lift
Did average latency improve?	milliseconds saved, plus percent change
Did a model catch more fraud?	recall increase at the same review capacity
Did a support classifier reduce workload?	tickets avoided per 10,000 tickets
Did a treatment affect a numeric score?	mean difference and standardized difference

Standardized effect sizes help when raw units are hard to compare across studies. UCLA's statistical consulting notes on effect size and power analysis describe Cohen's d as a difference in means divided by a standard deviation, with several variants depending on how that standard deviation is estimated. In software product work, I reach for standardized effect sizes only after I can state the raw effect in product language.

cohens-d.py

from math import sqrt
 
def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    pooled_sd = sqrt(
        ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2)
        / (n_a + n_b - 2)
    )
    return (mean_b - mean_a) / pooled_sd
 
print(cohens_d(
    mean_a=71.2, sd_a=14.5, n_a=800,
    mean_b=73.0, sd_b=14.1, n_b=820,
))

That prints roughly 0.126.

Depending on the domain, that could be meaningful or trivial. Cohen's conventional labels can be a starting vocabulary, but they should never replace product context. A small standardized effect on fraud loss could matter. A larger standardized effect on a vanity metric might not.

For binary outcomes, raw and ratio effects are often more useful:

risk difference = p_treatment - p_control
risk ratio      = p_treatment / p_control
odds            = p / (1 - p)
odds ratio      = odds_treatment / odds_control
log odds ratio  = log(odds_ratio)

Log odds ratios show up in logistic regression because logistic models are linear on the log-odds scale. UCLA's notes on interpreting odds ratios in logistic regression explain the practical interpretation: a coefficient is a change in log odds; exponentiating it gives an odds ratio.

The warning: odds ratios are easy to overstate when the outcome is common. A 2x odds ratio is not automatically a 2x probability. Convert back to probabilities when talking to non-statistical audiences.
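A small sketch shows how far the two can drift apart. It uses only the odds definitions listed above; the helper names and the two baselines are hypothetical, chosen to contrast a rare outcome with a common one.

```python
def odds(p):
    return p / (1 - p)


def prob_from_odds(o):
    return o / (1 + o)


def apply_odds_ratio(baseline_p, odds_ratio):
    """Probability after multiplying the baseline odds by odds_ratio."""
    return prob_from_odds(odds(baseline_p) * odds_ratio)


# Hypothetical baselines: one rare outcome, one common outcome.
for baseline in (0.01, 0.50):
    new_p = apply_odds_ratio(baseline, odds_ratio=2.0)
    print(
        f"baseline {baseline:.1%} -> {new_p:.1%} "
        f"(risk ratio {new_p / baseline:.2f})"
    )
```

At a 1% baseline, a 2x odds ratio is close to a 2x probability (risk ratio about 1.98). At a 50% baseline, the same odds ratio moves the probability from 50% to about 66.7%, a risk ratio of only 1.33.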

Precision Is Not the Same as Importance

Precision in inference means "how tightly estimated is the effect?"

A confidence interval is one way to communicate precision. NIST's confidence-interval explanation gives the frequentist interpretation: if the same population were sampled repeatedly and intervals were built the same way, about 95% of those intervals would bracket the true parameter for a 95% confidence procedure.

That is not the same as saying there is a 95% probability that this exact interval contains the parameter. In frequentist framing, the parameter is fixed and the interval is produced by a random sampling process.
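The repeated-sampling idea is easier to trust after simulating it. The sketch below is an illustration under stated assumptions, not a proof: it picks a true rate that we only know because we are simulating, draws many samples, builds a simple normal-approximation interval for each, and counts how often the interval brackets the truth. The constants are arbitrary.

```python
import random
from math import sqrt

random.seed(7)

TRUE_RATE = 0.05   # the "parameter"; known here only because we simulate
N = 2_000          # users per simulated sample
TRIALS = 1_000     # number of repeated samples

covered = 0
for _ in range(TRIALS):
    conversions = sum(random.random() < TRUE_RATE for _ in range(N))
    p_hat = conversions / N
    se = sqrt(p_hat * (1 - p_hat) / N)
    low, high = p_hat - 1.96 * se, p_hat + 1.96 * se
    # The interval moves from sample to sample; TRUE_RATE does not.
    covered += low <= TRUE_RATE <= high

coverage = covered / TRIALS
print(f"intervals containing the true rate: {coverage:.1%}")
```

The printed share should land near 95%, not exactly on it. Each individual interval either contains the true rate or it does not; the 95% describes the procedure.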

Software teams do not need to debate philosophy in every metric review. They do need to stop reporting a point estimate without uncertainty.

Bad:

Treatment increased conversion by 5%.

Better:

Treatment was +0.25 percentage points versus control.
95% CI: +0.057 to +0.443 percentage points.
The lower bound still clears our +0.05 percentage-point launch threshold,
but only barely, so we should inspect segment effects before full rollout.

The second version tells a reviewer how fragile the decision is.

Standard error is the same family of idea. It describes how much an estimate would vary across repeated samples. Confidence intervals often come from an estimate plus or minus a critical value times standard error:

estimate +/- critical_value * standard_error
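As a rough sketch of where the pieces come from, the example below builds a 95% interval for a mean with the standard library. The durations are simulated, so the numbers are invented. It uses a normal critical value, which is a reasonable assumption for a few hundred observations; for small samples a t critical value would give a wider interval.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

random.seed(11)

# Hypothetical checkout durations in seconds (simulated for illustration).
durations = [random.gauss(43, 6) for _ in range(400)]

estimate = mean(durations)
standard_error = stdev(durations) / sqrt(len(durations))

# Two-sided 95% interval: leave 2.5% in each tail of the standard normal.
critical_value = NormalDist().inv_cdf(0.975)

low = estimate - critical_value * standard_error
high = estimate + critical_value * standard_error

print(f"critical value: {critical_value:.2f}")
print(f"mean: {estimate:.1f}s, 95% CI: [{low:.1f}, {high:.1f}]")
```

The critical value prints as 1.96, which is the constant hard-coded in the experiment example earlier.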

For many engineering decisions, bootstrapping is also useful because it can estimate uncertainty without deriving a custom formula for every metric.

bootstrap-ci.py

import numpy as np
 
rng = np.random.default_rng(42)
 
def bootstrap_ci(values, metric, rounds=5_000, alpha=0.05):
    values = np.asarray(values)
    estimates = []
 
    for _ in range(rounds):
        sample = rng.choice(values, size=len(values), replace=True)
        estimates.append(metric(sample))
 
    low, high = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return low, high
 
# Assumes `events` is a DataFrame with a numeric "latency_ms" column, loaded elsewhere.
p95_low, p95_high = bootstrap_ci(
    values=events["latency_ms"],
    metric=lambda sample: np.quantile(sample, 0.95),
)
 
print(p95_low, p95_high)

Bootstrap intervals are not magic either. They inherit sample bias and can behave poorly with tiny samples, dependent observations, or weird tail behavior. They are still a practical tool for making uncertainty visible.

The P-Value Is Smaller Than People Make It

The American Statistical Association's 2016 p-value statement should be required reading for anyone shipping experiment platforms. The ASA's six principles include that p-values can indicate incompatibility between data and a specified model; they do not measure the probability that the hypothesis is true; decisions should not be based only on a p-value threshold; full reporting and transparency are required; p-values do not measure effect size or importance; and a p-value alone is not a good measure of evidence.

For developers, translate that into code-review language:

p < 0.05 means "review the result," not "merge the decision."

A tiny p-value can come from a huge sample and a useless effect. A non-significant p-value can hide a meaningful effect in an underpowered analysis. Repeatedly trying segments until one turns significant is not discovery; it is a false-positive generator unless the analysis plan accounts for multiple comparisons.
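The segment-hunting problem is plain arithmetic. This sketch assumes the tests are independent and that no segment has a real effect, which is a simplification; both function names are hypothetical. Bonferroni is one conservative correction among several.

```python
def family_wise_error_rate(tests, alpha=0.05):
    """Chance of at least one false positive across independent null tests."""
    return 1 - (1 - alpha) ** tests


def bonferroni_threshold(tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / tests


for tests in (1, 5, 20):
    print(
        f"{tests:>2} tests: P(any false positive) = "
        f"{family_wise_error_rate(tests):.1%}, "
        f"Bonferroni threshold = {bonferroni_threshold(tests):.4f}"
    )
```

With 20 segments and nothing real going on, the chance that at least one segment shows p < 0.05 is about 64%. Under a Bonferroni correction, each segment would need p < 0.0025 instead.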

Use p-values as one diagnostic:

Is the observed result surprising under the null model?
Was the hypothesis defined before looking?
How many tests did we run?
Are assumptions reasonable?
What is the effect size?
What does the confidence interval allow?
What would we decide if the p-value were 0.049 versus 0.051?

The last question is a good smell test. If the decision flips entirely on the third decimal place, the process is too brittle.

Probability Is the Runtime Model of Uncertainty

Probability answers how likely something is under a model.

Basic probability terms show up everywhere in data science:

Event: something that can happen, like "user converts."
Random variable: a variable whose value depends on chance or an uncertain process.
Conditional probability: P(A | B), the probability of A given B.
Bayes' theorem: a rule for updating probability after evidence.

Bayes' theorem is often written:

P(A | B) = P(B | A) * P(A) / P(B)

The developer version:

posterior = likelihood * prior / evidence

In classification systems, the base rate often dominates intuition.

Suppose only 1% of accounts are fraudulent. A model with high sensitivity can still produce many false positives if the false-positive rate is not tiny. That is why precision, recall, and prevalence need to be discussed together.

base-rate-check.py

prevalence = 0.01
sensitivity = 0.90
specificity = 0.95
 
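# Joint probabilities: the share of all accounts that land in each cell.
# (The true positive *rate* is sensitivity itself, so these are named as shares.)
flagged_and_fraud = prevalence * sensitivity
flagged_not_fraud = (1 - prevalence) * (1 - specificity)
 
precision = flagged_and_fraud / (flagged_and_fraud + flagged_not_fraud)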
 
print(f"precision: {precision:.1%}")

Expected output:

precision: 15.4%

A test can catch 90% of fraud cases and still produce a review queue where most flagged accounts are not fraud. That is not a contradiction. It is base-rate math.

Distributions Are Assumption Objects

Distributions describe how values are expected to appear.

They are not decorations for textbook chapters. They encode assumptions about the data-generating process.

The NIST gallery of distributions is a useful reference because it separates continuous distributions, such as normal, uniform, t, exponential, and lognormal, from discrete distributions, such as binomial and Poisson.

Use distributions as questions:

Distribution	Software example	What to check
Normal	measurement error around a stable process	symmetric-ish errors, no heavy tail surprise
t	small-sample mean inference with estimated variance	heavier tails than normal
Uniform	random assignment, randomized IDs	equal probability across a range
Binomial	conversions out of `n` independent eligible users	fixed trials, two outcomes, stable probability
Poisson	count of events in a time window	event rate, independence, overdispersion
Lognormal	durations, spend, file sizes	right skew, multiplicative process

NIST's normal distribution page defines the general normal density with location parameter mu and scale parameter sigma, and notes that mu = 0 and sigma = 1 is the standard normal. Its t distribution page notes that the t distribution has heavier tails and approaches normal as degrees of freedom grow. Its binomial and Poisson pages are equally useful anchors: binomial for successes in fixed trials, Poisson for event counts in an interval.

The beginner mistake is choosing a distribution because it is familiar.

Latency is often not normal. Revenue is often not normal. Counts with bursts are often not Poisson. A conversion outcome may be binomial at the user level but not at the request level if the same user appears repeatedly. The model assumption has to match the unit of observation.
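One cheap check for the "counts with bursts" case: a Poisson distribution has variance equal to its mean, so the ratio of the two should sit near 1. The sketch below is a rough screen, not a formal test. The per-minute counts are invented and `dispersion_index` is a hypothetical helper.

```python
from statistics import mean, variance

# Hypothetical error counts per minute (invented for illustration).
steady = [3, 1, 5, 2, 6, 3, 4, 0, 7, 3, 2, 4]
bursty = [0, 0, 1, 0, 14, 12, 0, 1, 0, 0, 13, 1]


def dispersion_index(counts):
    # For Poisson counts the variance equals the mean, so this ratio
    # should be close to 1. Values far above 1 suggest overdispersion.
    return variance(counts) / mean(counts)


for name, counts in (("steady", steady), ("bursty", bursty)):
    print(
        f"{name}: mean={mean(counts):.2f} "
        f"dispersion={dispersion_index(counts):.2f}"
    )
```

Both series average between 3 and 4 events per minute, but the bursty one has a dispersion index several times larger. A Poisson-based alert threshold tuned on the mean would misjudge how often that series spikes.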

Relationships Are Not Causes

Covariance measures how two variables vary together, but its scale depends on the units.

Correlation normalizes covariance so the result sits between -1 and 1. It is easier to compare, but narrower than people think. Pearson correlation is about linear association. Spearman correlation is about monotonic rank association. Neither proves causation.
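The difference between Pearson and Spearman shows up on any relationship that is monotonic but not linear. This sketch implements both by hand so it runs on the standard library alone. The data is invented, and the `ranks` helper assumes there are no ties.

```python
from math import sqrt


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    # Rank from 1..n. Assumes no ties, which holds for the toy data below.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    # Spearman correlation is Pearson correlation computed on ranks.
    return pearson(ranks(xs), ranks(ys))


# Hypothetical data: cost doubles with each unit of load (monotonic, not linear).
load = list(range(1, 11))
cost = [2 ** x for x in load]

print(f"pearson:  {pearson(load, cost):.3f}")
print(f"spearman: {spearman(load, cost):.3f}")
```

Spearman comes out at 1 because cost always rises with load. Pearson is noticeably lower (about 0.80 on this data) because the relationship is not a straight line. Neither number says anything about why cost rises.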

For software teams, correlation is useful for:

spotting duplicate or redundant metrics;
detecting possible leakage;
understanding feature families;
checking linear-model assumptions;
building a conversation artifact.

Correlation is dangerous when it becomes a causal story.

If error rate rises when traffic rises, traffic may be stressing the system. Or a deployment may have changed both routing and logging. Or a bot attack may have changed request mix. Or the metric pipeline may be dropping successful responses. A correlation plot gives you the next debugging question, not the root cause.

R-squared has a similar communication trap. In regression, R-squared describes the proportion of variance explained relative to a baseline that always predicts the mean. Scikit-learn lists r2_score among regression metrics, alongside mean absolute error and mean squared error. A high R-squared can still be useless if the prediction errors are operationally unacceptable. A low R-squared can still be useful if the model ranks cases well enough for triage.
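Here is a minimal sketch of that trap, written by hand rather than with scikit-learn so the formula is visible. The delivery times are invented, and the helper names are hypothetical.

```python
def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    # 1.0 is perfect; 0.0 is no better than always predicting the mean.
    return 1 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical delivery times in minutes (invented for illustration).
actual = [10, 20, 30, 200, 400, 600]
predicted = [40, 50, 0, 170, 430, 570]

print(f"R-squared: {r_squared(actual, predicted):.3f}")
print(f"MAE: {mean_absolute_error(actual, predicted):.0f} minutes")
```

R-squared is about 0.98 because the targets span a wide range, yet every prediction is off by 30 minutes. For the 10-minute delivery, that is a 300% error.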

Do not let relationship metrics substitute for the decision metric.

Accuracy Is Usually Too Small a Word

The 2020 outline ends with accuracy terms: true positive, true negative, false positive, false negative, sensitivity, specificity, positive predictive value, negative predictive value.

That is the right doorway into ML evaluation.

Scikit-learn's model-evaluation docs define precision as the classifier's ability not to label negative samples as positive, and recall as the ability to find all positive samples. In binary classification:

precision = TP / (TP + FP)
recall    = TP / (TP + FN)

Recall is also called sensitivity. Precision is also called positive predictive value. Specificity is:

specificity = TN / (TN + FP)

Negative predictive value is:

NPV = TN / (TN + FN)

Accuracy is:

accuracy = (TP + TN) / (TP + FP + TN + FN)

Accuracy can be fine when classes are balanced and mistake costs are similar. It is a bad headline metric for rare events.
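The rare-event case takes only a few lines to demonstrate. This sketch uses invented labels with 1% positives and a "model" that never flags anything. It is a deliberately degenerate baseline, not a realistic classifier.

```python
# Hypothetical labels: 1% positives (for example, fraud), invented for illustration.
y_true = [1] * 10 + [0] * 990

# A "model" that never flags anything.
y_pred = [0] * len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)

print(f"accuracy: {accuracy:.1%}")  # 99.0%
print(f"recall: {recall:.1%}")      # 0.0%
```

A headline of "99% accurate" describes a system that catches no fraud at all.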

classification-review.py

from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score
 
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0]
 
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
 
# Cast NumPy scalars to plain floats so the printed dict stays readable.
specificity = float(tn / (tn + fp))
npv = float(tn / (tn + fn))
 
print({
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "specificity": specificity,
    "negative_predictive_value": npv,
})

Expected output (wrapped across lines here for readability):

{
  'accuracy': 0.75,
  'precision': 1.0,
  'recall': 0.3333333333333333,
  'specificity': 1.0,
  'negative_predictive_value': 0.7142857142857143
}

That classifier looks perfect on precision and specificity. It misses two-thirds of the positives. Whether that is acceptable depends on the job. Fraud review, disease screening, churn outreach, and content moderation all have different mistake costs.

The metric should reflect the operational cost of errors.

Sampling Bias Beats Fancy Modeling

Sampling bias means the sample does not represent the population you want to reason about.

Google's ML guide on data quality and interpretation puts this plainly: sampling biases such as survivorship bias, self-selection bias, and recall bias can skew data and lead to flawed conclusions. The same guide asks practitioners to inspect who collected data, how it was collected, what instruments or humans introduced errors, and what the data does not communicate.

Software systems create their own sampling bias:

telemetry only fires after login, so anonymous drop-off disappears;
mobile crash reports overrepresent users who opted into diagnostics;
model training data contains only users who survived previous business rules;
customer-support labels overrepresent users motivated to complain;
event logs double-count power users;
request-level samples pretend repeated requests are independent users;
warehouse backfills change historical definitions after the model was trained.
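The power-user and request-level items are easy to reproduce. The sketch below uses an invented event log where one user generates most of the rows, then computes the same conversion metric at two units of analysis. The variable names and the "converted if any row converted" rule are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical event log: (user_id, converted). Invented for illustration.
# One power user generates most of the rows.
rows = (
    [("power_user", True)] * 8
    + [("u1", False), ("u2", False), ("u3", True), ("u4", False)]
)

# Request-level rate: every row counts equally.
request_rate = sum(converted for _, converted in rows) / len(rows)

# User-level rate: each user counts once (converted if any row converted).
by_user = defaultdict(bool)
for user_id, converted in rows:
    by_user[user_id] = by_user[user_id] or converted
user_rate = sum(by_user.values()) / len(by_user)

print(f"request-level conversion: {request_rate:.0%}")  # 75%
print(f"user-level conversion: {user_rate:.0%}")        # 40%
```

Same log, same metric name, and a 35-point gap. Which number is right depends on the population the decision is about.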

Before asking whether an estimate is significant, ask whether the sample is pointed at the right population.

sample-bias-review.md

## Population
Who or what do we want this result to apply to?
 
## Sampling frame
Which rows could possibly appear in the dataset?
 
## Missing groups
Who cannot appear because of product flow, instrumentation, eligibility, opt-in,
retention, privacy, platform, language, device, geography, or time?
 
## Unit of analysis
Is one row a user, account, session, request, order, device, or event?
 
## Dependency
Can the same user or account appear many times?
 
## Timing
Does the sample include seasonality, weekday/weekend effects, releases, incidents,
campaigns, and policy changes?
 
## Label reality
Is the label a direct measurement, a proxy, a human judgment, or a downstream artifact?

I would rather ship a simple model with an honest sampling review than a complex model trained on a quietly biased dataset.

The Estimate Card

Here is the artifact I wish more software teams used in experiment reviews, model reviews, and analytics PRs.

estimate-card.md

# Estimate Card
 
## Decision
What decision will this estimate influence?
 
## Population
What population or future process should this generalize to?
 
## Sample
- Source:
- Time window:
- Unit of analysis:
- N:
- Inclusion/exclusion rules:
 
## Estimate
- Metric:
- Point estimate:
- Baseline:
- Absolute effect:
- Relative effect:
 
## Precision
- Standard error:
- Confidence interval:
- Method used:
 
## Statistical test
- Null hypothesis:
- Alternative hypothesis:
- P-value:
- Number of tests/segments reviewed:
 
## Bias review
- Missing groups:
- Overrepresented groups:
- Instrumentation risk:
- Label/proxy risk:
 
## Operational decision
- Minimum meaningful effect:
- False-positive cost:
- False-negative cost:
- Rollout/rollback condition:
- Follow-up measurement:

That template makes a team say the quiet parts out loud.

It also keeps statistics connected to engineering reality. If nobody can name the population, the metric is not ready. If nobody knows the unit of analysis, the standard error is suspect. If nobody can state a minimum meaningful effect, the p-value will become the decision. If nobody checked telemetry, the estimate may be measuring the logging system.
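If the card lives in code rather than a document, those checks can be enforced. The following is one possible sketch, not a standard tool: `EstimateCard` and its field names are hypothetical and cover only a subset of the template. The numbers reuse the onboarding experiment from earlier, expressed as proportions.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class EstimateCard:
    # Field names are illustrative; they mirror a subset of the template.
    decision: Optional[str] = None
    population: Optional[str] = None
    unit_of_analysis: Optional[str] = None
    point_estimate: Optional[float] = None
    ci_low: Optional[float] = None
    ci_high: Optional[float] = None
    minimum_meaningful_effect: Optional[float] = None

    def missing(self):
        """Names of fields nobody has filled in yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def clears_minimum(self):
        """True only when the whole interval is above the minimum effect."""
        if self.missing():
            raise ValueError(f"card incomplete: {', '.join(self.missing())}")
        return self.ci_low >= self.minimum_meaningful_effect


card = EstimateCard(
    decision="roll out new onboarding",
    population="all eligible new users",
    unit_of_analysis="user",
    point_estimate=0.0025,
    ci_low=0.00057,
    ci_high=0.00443,
)
print(card.missing())  # ['minimum_meaningful_effect']

card.minimum_meaningful_effect = 0.0005  # +0.05 percentage points
print(card.clears_minimum())  # True
```

The design choice is that an incomplete card raises instead of returning a verdict. That keeps a missing minimum meaningful effect from silently turning the p-value into the decision.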

What to Learn First

If you are a software developer moving deeper into data science, learn statistics in this order:

Population, sample, parameter, statistic, estimate.
Data types and units of analysis.
Mean, median, variance, standard deviation, percentiles, and z-scores.
Distributions as assumptions: normal, t, binomial, Poisson, uniform, lognormal.
Covariance, correlation, and why association is not causation.
Effect size in raw units before standardized units.
Standard error and confidence intervals.
P-values and their limits.
Sampling bias, survivorship bias, self-selection bias, and proxy labels.
Classification metrics: confusion matrix, precision, recall, specificity, NPV, F1, ROC, PR curves.
Regression metrics: MAE, RMSE, R-squared, residual plots.
Experiment design: randomization unit, power, multiple comparisons, guardrail metrics.

That sequence is intentionally decision-shaped. It starts with what the data represents, moves through estimates and uncertainty, then lands on model and experiment metrics.

The textbook order is often probability first. That is fine for a course. For developers doing applied work, the first habit should be asking what a number is allowed to mean.

The Practical Standard

Do not present a data-science result as a single number.

Present an estimate card.

Show the effect size prominently. Use units the audience can interpret. Add confidence intervals or standard errors so reviewers can see precision. Report the p-value when a hypothesis test is part of the analysis, but do not let it carry the decision alone. Name the population. Describe the sample. Call out sampling bias. Tie the metric to the operational cost of false positives and false negatives.

Most bad statistical decisions in software are not caused by developers failing to remember a formula.

They happen because the team turns an estimate into a fact too early.

Keep the estimate attached to its sample, uncertainty, bias, and decision. Your dashboards will get less flashy. Your launches will get safer.