AI/ML Explained: Time-Series Forecasting

Day 35 · 2026-06-21
For: engineers from a non-AI background · Level: Intermediate

ARIMA · AutoRegressive Integrated Moving Average

classic stats · stationarity
One-line analogy

ARIMA predicts the next value by "regressing onto its own history" — a linear mix of the last few observations (AR) plus the last few prediction errors (MA). The middle letter, I (differencing), is the key: it first turns the data from a "cumulative total" into "increments," exactly like using Prometheus's rate() to turn an ever-climbing counter into a per-second delta. A counter that only goes up can't be modeled directly; after differencing it becomes stationary and predictable.

What it solves + how it works

Many real series carry a trend (steady climb), but nearly all classic statistical models require the series to be "stationary" — mean and variance that don't drift over time. Why does non-stationarity break things? The model fits one fixed set of parameters over the entire history; if the mean keeps moving, those parameters are fitting an "average state" that doesn't actually exist, so extrapolation must distort. ARIMA(p,d,q) has three knobs, each doing one job:

  • AR(p): predict the present from a weighted sum of the last p observations, y_t = c + φ₁y_{t-1} + … + φ_p y_{t-p}, where φ is "how much weight each past value gets";
  • I(d): difference d times to remove trend. First-order differencing y'_t = y_t − y_{t-1} turns a trended non-stationary series into one that fluctuates around a fixed mean;
  • MA(q): correct using the last q prediction errors, absorbing the lagged effect of random shocks.
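
To make the I(d) step concrete, here is a minimal sketch, assuming a made-up counter that climbs by about 5 per step (the data and the half_means helper are illustrative, not taken from any real metric). It shows first-order differencing turning the ever-climbing series back into increments that hover around a fixed mean:

```python
import numpy as np

# Hypothetical data: a counter that climbs by roughly 5 per step.
rng = np.random.default_rng(0)
increments = 5.0 + rng.normal(0.0, 1.0, size=200)
counter = np.cumsum(increments)          # ever-climbing, non-stationary

diffed = np.diff(counter)                # y'_t = y_t - y_{t-1}

def half_means(x):
    """Mean of the first and second half: a crude check for a drifting mean."""
    mid = len(x) // 2
    return float(x[:mid].mean()), float(x[mid:].mean())

print("counter halves:", half_means(counter))   # far apart: the mean drifts
print("diffed halves: ", half_means(diffed))    # both close to 5
```

A real stationarity check would use a formal test rather than comparing two halves; the point here is only the before/after contrast.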

Intuition: difference to "flatten" the data, then linearly extrapolate from "past values + past errors." The difference between AR and MA is what they remember: AR remembers past actual values (trend momentum), MA remembers past prediction errors (the system hasn't finished digesting a sudden shock). They're complementary: AR alone misses the decay tail of a one-off disturbance, MA alone misses persistent momentum. The seasonal variant is SARIMA: it adds another AR/MA/differencing set spaced at the seasonal period, so it can model one repeating rhythm such as a 12-month cycle in monthly data. Overlapping rhythms like "within-day" plus "within-week" go beyond a single seasonal period and need other tools.
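
The AR half of that intuition can be sketched with plain least squares. This is a minimal illustration, assuming a simulated AR(2) series with made-up coefficients (c, phi1, phi2 are hypothetical); it is not how statsmodels estimates ARIMA, and it leaves out the MA part, whose errors are unobserved and need iterative estimation:

```python
import numpy as np

rng = np.random.default_rng(1)
c, phi1, phi2 = 1.0, 0.6, -0.3           # hypothetical "true" coefficients
n = 5000
y = np.zeros(n)
for t in range(2, n):
    y[t] = c + phi1 * y[t - 1] + phi2 * y[t - 2] + rng.normal()

# Design matrix: intercept, y_{t-1}, y_{t-2}; target: y_t
X = np.column_stack([np.ones(n - 2), y[1:-1], y[:-2]])
target = y[2:]
coef, *_ = np.linalg.lstsq(X, target, rcond=None)
c_hat, phi1_hat, phi2_hat = coef

# One-step forecast: a weighted sum of the last two observations
next_value = c_hat + phi1_hat * y[-1] + phi2_hat * y[-2]
print("estimated phi:", round(float(phi1_hat), 2), round(float(phi2_hat), 2))
print("next value forecast:", round(float(next_value), 2))
```

The estimated weights should land near the values used to generate the data, which is all "regressing onto its own history" means.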

Code
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# series: one column of values ordered by time (e.g. monthly server cost)
series = pd.read_csv("cost.csv", index_col="month", parse_dates=True)["usd"]

# order=(p,d,q): 2nd-order AR + 1 differencing + 2nd-order MA
model = ARIMA(series, order=(2, 1, 2))
res = model.fit()

# forecast 6 steps ahead, with a confidence interval
fc = res.get_forecast(steps=6)
print(fc.predicted_mean)            # point forecast
print(fc.conf_int(alpha=0.05))      # 95% interval, quantifies uncertainty
Pitfall + where you'd use it
"ARIMA captures long, complex patterns" is wrong. At heart it's a linear, short-memory model; a single seasonal cycle already needs SARIMA, and multiple seasonalities, holidays, and nonlinear jumps stump it (that's Prophet territory). And picking d wrong (over-differencing) injects artificial negative autocorrelation that the model then has to undo.
📌 BigCat scenario: forecasting a side project's monthly cloud cost or blog traffic — "weak trend + single variable" series — needs only a dozen lines of ARIMA. Don't reach for a deep model first: it's overkill and harder to explain.
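
The over-differencing pitfall is easy to see on synthetic data. A minimal sketch, assuming pure white noise (which is already stationary and needs d = 0); the lag1_autocorr helper is illustrative:

```python
import numpy as np

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1."""
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

rng = np.random.default_rng(2)
noise = rng.normal(size=20000)    # already stationary
over_diffed = np.diff(noise)      # differenced anyway (d = 1)

print("white noise:   ", round(lag1_autocorr(noise), 2))        # close to 0
print("over-differenced:", round(lag1_autocorr(over_diffed), 2))  # close to -0.5
```

Differencing a series that did not need it manufactures a strong negative lag-1 correlation out of nothing, which is why d should be the smallest value that removes the trend.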
Takeaway + question
💡 ARIMA = difference to flatten the data, then linearly extrapolate from past values and past errors. Simple, interpretable, robust on small data.
🤔 Which of your monitoring metrics are "counters" and which are "rates"? Which kind belongs directly in ARIMA?

Exponential Smoothing & Kalman Filter

recursive estimation · state space
One-line analogy

Exponential smoothing is just EWMA (exponentially weighted moving average), which you've met all over the backend: TCP RTT estimation, the Linux load average, load-balancer rolling averages, Prometheus's holt_winters() (double exponential smoothing). In one line: new estimate = α×new observation + (1−α)×old estimate, where older data decays exponentially in weight. The Kalman filter is its upgrade: it additionally tracks "how certain am I about the current estimate" (variance) and uses that to dynamically decide whether to trust the new observation or the old prediction more.

What it solves + how it works

Real data is noisy: you can't fully trust the latest point (jitter) nor cling to the old mean (lag). Exponential smoothing compromises with a fixed α; Holt-Winters extends it into three parallel smoothing equations that track level, trend, and season with independent coefficients, then sum them into a forecast. Expanding the formula shows where "exponential" comes from:

s_t = αy_t + α(1−α)y_{t-1} + α(1−α)²y_{t-2} + …
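
A quick numerical check of that expansion, as a minimal sketch. It assumes the recursion is seeded with the first observation (one common convention), so the expanded sum carries a leftover (1−α)^t weight on y_0; the function names are illustrative:

```python
import numpy as np

def ewma_recursive(y, alpha):
    """s_t = alpha * y_t + (1 - alpha) * s_{t-1}, seeded with s_0 = y_0."""
    s = y[0]
    out = [s]
    for obs in y[1:]:
        s = alpha * obs + (1 - alpha) * s
        out.append(s)
    return np.array(out)

def ewma_expanded(y, alpha):
    """Last smoothed value written as an explicit weighted sum."""
    t = len(y) - 1
    k = np.arange(t)                        # lags 0 .. t-1
    weights = alpha * (1 - alpha) ** k      # alpha * (1 - alpha)^k on y_{t-k}
    recent = y[t - k]                       # y_t, y_{t-1}, ..., y_1
    return float(np.dot(weights, recent) + (1 - alpha) ** t * y[0])

y = np.array([3.0, 5.0, 4.0, 8.0, 6.0])
print(ewma_recursive(y, 0.3)[-1], ewma_expanded(y, 0.3))   # the two agree
```

The recursive form needs O(1) memory per step, which is why it shows up in systems code; the expanded form only exists to explain where the weights come from.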

The Kalman filter goes further, looping through two stages each step:

Kalman filter: predict → update loop

① Predict: push the state forward with a model; uncertainty (variance) grows
   ↓ a new observation arrives (itself noisy)
② Update: fuse prediction and observation, weighted by the Kalman gain K
   large K → trust observation; small K → trust prediction (set by both variances)
   ↺ back to ①, recurse point by point

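The Kalman gain K is essentially an adaptive "trust dial," isomorphic to the adaptive backoff and dynamic TTLs you know; it is also like fusing readings from several unreliable replicas in a distributed system, weighted by each one's credibility (variance). Versus fixed-α smoothing, Kalman's edge is that K is recomputed at every step from two variances: the filter's current uncertainty about the state and the observation noise you declared. When the observation noise is small relative to the state uncertainty, K is large (fast follow); when the observation is the noisier of the two, K is small (hold steady and don't get dragged). K therefore starts high while the filter knows little and shrinks as it gains confidence, which a frozen α cannot do. One caveat: with constant noise settings K settles to a steady value, and a simple level-tracking Kalman filter then behaves like exponential smoothing with α = K. The cost: you must explicitly write "how the state evolves" and "how big the observation noise is," so modeling effort is higher.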
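
The predict → update loop fits in a few lines for the simplest case. This is a minimal sketch, assuming a scalar state that follows a random walk (a slowly drifting level) and made-up noise settings; kalman_1d and its parameters are illustrative, not a library API:

```python
import numpy as np

def kalman_1d(observations, process_var, obs_var, x0=0.0, p0=100.0):
    """Scalar Kalman filter for a slowly drifting level (random-walk state)."""
    x, p = x0, p0                  # state estimate and its variance
    estimates, gains = [], []
    for z in observations:
        # Predict: the level stays put on average, uncertainty grows
        p = p + process_var
        # Update: blend prediction and observation using the gain K
        k = p / (p + obs_var)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
        gains.append(k)
    return np.array(estimates), np.array(gains)

rng = np.random.default_rng(3)
true_level = 10.0
noisy = true_level + rng.normal(0.0, 2.0, size=500)

est, gains_noisy = kalman_1d(noisy, process_var=0.01, obs_var=4.0)
_, gains_clean = kalman_1d(noisy, process_var=0.01, obs_var=0.25)

print("final estimate:", round(float(est[-1]), 2))
print("gain, first vs last step:", round(float(gains_noisy[0]), 3), round(float(gains_noisy[-1]), 3))
print("steady gain with a cleaner sensor declared:", round(float(gains_clean[-1]), 3))
```

Note that the update line has the same shape as exponential smoothing with K in place of α; the only new machinery is the variance bookkeeping that decides K.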

Code
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# three-parameter exponential smoothing (Holt-Winters): trend + season (period=12)
model = ExponentialSmoothing(
    series,
    trend="add",            # additive trend
    seasonal="add",         # additive season
    seasonal_periods=12,        # 12 months per cycle
)
res = model.fit()              # learns α and other smoothing coefficients
print(res.forecast(6))         # forecast 6 periods ahead
Pitfall + where you'd use it
"Bigger α tracks more closely, so it's better" — wrong. Large α = oversensitive to noise (high variance); small α = sluggish (high bias). This is the classic bias-variance tradeoff — no universal value; tune to the signal-to-noise ratio.
📌 BigCat scenario: when tracking personal/household metrics (weight, daily focus time, a child's routine), exponential smoothing gives you a de-noised trend line — far more reliable for decisions than staring at the raw jitter. Don't let a single day's swing set the tempo.
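
The α tradeoff from the pitfall above can be measured directly. A minimal sketch, assuming two synthetic inputs (a flat signal with pure jitter, and a noise-free step from 0 to 10); the ewma helper and the 90% threshold are illustrative choices:

```python
import numpy as np

def ewma(y, alpha):
    """Simple exponential smoothing seeded with the first observation."""
    s = y[0]
    out = [s]
    for obs in y[1:]:
        s = alpha * obs + (1 - alpha) * s
        out.append(s)
    return np.array(out)

rng = np.random.default_rng(4)
noise = rng.normal(0.0, 1.0, size=5000)                    # flat signal, pure jitter
step = np.concatenate([np.zeros(50), np.full(50, 10.0)])   # clean jump at index 50

results = {}
for alpha in (0.1, 0.9):
    jitter = float(ewma(noise, alpha)[100:].std())          # noise that leaks through
    lag = int(np.argmax(ewma(step, alpha) >= 9.0)) - 50     # steps to reach 90% of the jump
    results[alpha] = (jitter, lag)
    print(f"alpha={alpha}: leftover jitter={jitter:.2f}, lag={lag} steps")
```

Small α suppresses jitter but takes many steps to catch a real change; large α reacts almost immediately but passes most of the noise through.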
Takeaway + question
💡 Exponential smoothing = memory with exponentially decaying weights; Kalman = smoothing with "uncertainty management" that auto-tunes trust.
🤔 Kalman's "fuse multiple signals weighted by variance" — is that the same math as your distributed reads "pick a source weighted by replica health"?

Prophet (Facebook / Meta)

additive model · interpretable
One-line analogy

Prophet decomposes a time series into independent components that add up: long-term trend + yearly cycle + weekly cycle + holiday effects + noise — like splitting a system into microservices, each owning one piece, each separately visualizable and configurable. It's a GAM (Generalized Additive Model), not a black box — you can read what each component is doing.

What it solves + how it works

ARIMA expects you to know how to tune p, d, q, which most business analysts don't. It is also fiddly with missing values, outliers, and multiple seasonalities (weekly + yearly at once). Prophet (Taylor & Letham 2018) was designed for "forecasting at scale with an analyst in the loop," turning the model into a few knobs people can understand and adjust directly:

Prophet additive decomposition

y(t) = g(t) + s(t) + h(t) + ε(t)   (trend + seasonality + holidays + noise)

g: piecewise linear/logistic trend, auto-detects changepoints
s: periodic component fit with a Fourier series
h: holidays/events you declare explicitly (school breaks, big sales)

Each term can be plotted on its own, and you can inject domain priors ("this day is a big sale, volume doubles"). Two design choices worth understanding: changepoints — Prophet scatters candidate trend turning points and uses regularization to zero out most weights, keeping only real inflections (a sparse-modeling idea that prevents trend overfitting); and additive vs multiplicative seasonality — if seasonal swings grow with the baseline (higher level, taller peaks), switch to multiplicative, otherwise additive systematically underestimates the peaks. This is why it caught on with business teams: not highest accuracy, but easy + interpretable + robust to missing values and outliers.
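
The "trend plus Fourier seasonality" idea can be sketched without Prophet at all. This is a minimal illustration, assuming a single straight-line trend, a weekly cycle, and ordinary least squares; Prophet itself fits its components with priors and changepoints, so treat fourier_features and the fit below as a stand-in, not its actual procedure:

```python
import numpy as np

def fourier_features(t, period, order):
    """sin/cos pairs for harmonics 1..order of the given period."""
    cols = []
    for k in range(1, order + 1):
        cols.append(np.sin(2 * np.pi * k * t / period))
        cols.append(np.cos(2 * np.pi * k * t / period))
    return np.column_stack(cols)

# Hypothetical daily series: 20 weeks of linear trend plus a weekly rhythm
t = np.arange(140, dtype=float)
trend = 2.0 + 0.5 * t
weekly = 3.0 * np.sin(2 * np.pi * t / 7) + 1.5 * np.cos(2 * np.pi * t / 7)
y = trend + weekly

X = np.column_stack([np.ones_like(t), t, fourier_features(t, period=7, order=2)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

g_hat = X[:, :2] @ coef[:2]      # recovered trend component g(t)
s_hat = X[:, 2:] @ coef[2:]      # recovered seasonal component s(t)
print(np.round(coef, 3))         # intercept, slope, then sin/cos weights
```

Because the model is additive, g_hat and s_hat can be plotted separately, which is the property that makes the decomposition readable.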

Code
from prophet import Prophet
import pandas as pd

# Prophet convention: two columns, ds=date, y=value
df = pd.read_csv("traffic.csv")        # has ds, y columns

m = Prophet(yearly_seasonality=True,
            weekly_seasonality=True)
m.add_country_holidays(country_name="CN")  # inject China holidays
m.fit(df)

future = m.make_future_dataframe(periods=30)  # next 30 days
forecast = m.predict(future)
m.plot_components(forecast)               # plot trend/season/holidays separately
Pitfall + where you'd use it
"Prophet is a forecasting miracle, better than ARIMA at everything" is wrong. Researchers have criticized its "good by default" reputation: on short series, series with no clear seasonality, or complex interactions with external regressors, it may not win. Its strength is ease and interpretability, not top accuracy. Honestly: it's often a strong baseline, not the endpoint.
📌 BigCat scenario: forecasting weekly/yearly traffic for a blog or product, feeding in school breaks and long holidays — Prophet gives an interpretable decomposition plot in a few lines, so you can see at a glance "is the rise trend, or a holiday effect?"
Takeaway + question
💡 Prophet = decompose a series into interpretable additive components, prioritizing "humans can read and adjust it" over "machine accuracy is highest."
🤔 When interpretability (you see every component) clashes with higher accuracy (but a black box), which matters more in your own decision scenarios?

Time-Series Foundation Models · TimesFM

foundation model · zero-shot
One-line analogy

Treat "predict the next value" as "predict the next token," exactly what an LLM does. TimesFM (Google 2024) is a decoder-only transformer, pretrained on roughly 100 billion real-world time-points, that then zero-shot forecasts on series it has never seen, without retraining for that series. The analogy: upgrading from "every project builds its own forecasting model" to "calling a shared, pretrained piece of infrastructure."

What it solves + how it works

ARIMA / Prophet must be fit per series and can't capture complex nonlinear long-range dependencies; earlier deep time-series models had to be trained per dataset, a high bar. The foundation-model route ports the LLM playbook over:

TimesFM: predict numbers like tokens

raw series → cut into non-overlapping patches → decoder-only Transformer → predict future patches

patch: cut continuous values into short segments (like ViT image patches), shortening the sequence
causal attention: only look at the past, autoregress forward

Patching (cutting continuous values into short non-overlapping segments, like ViT image patches or like merging characters into tokens) has two benefits: it shrinks the sequence to 1/patch_length of its original length, easing attention's O(n²) cost; and it makes the model learn on "local shapes (the small waveform inside a patch)" rather than single points, closer to how a human reads a chart. Alongside the point forecast, the released model also returns quantile forecasts, which give rough uncertainty bands without extra modeling. The paper (arXiv 2310.10688, ICML 2024) reports: after pretraining on roughly 100 billion real-world time-points, zero-shot performance can approach supervised models trained specifically on a single dataset. That is the substance behind the word "foundation model."
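
Patching itself is just a reshape. A minimal sketch, assuming a patch length of 32 as an illustrative choice and a made-up sine history; to_patches is a hypothetical helper, not part of the timesfm package:

```python
import numpy as np

def to_patches(series, patch_len):
    """Split a 1-D series into non-overlapping patches, dropping the oldest leftover points."""
    series = np.asarray(series, dtype=float)
    n_patches = len(series) // patch_len
    usable = series[len(series) - n_patches * patch_len:]   # keep the most recent points
    return usable.reshape(n_patches, patch_len)

history = np.sin(np.linspace(0, 20, 500))
patches = to_patches(history, patch_len=32)
print(history.shape, "->", patches.shape)                    # (500,) -> (15, 32)
print("attention pairs:", len(history) ** 2, "->", len(patches) ** 2)
```

Each row is one "token" for the transformer, so attention now compares 15 items instead of 500, which is where the quadratic saving comes from.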

Code
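# pip install timesfm  (Google's official package)
# note: foundation-model libraries iterate fast; defer to the official repo
import numpy as np
import timesfm

# placeholder history (an assumption for illustration): replace with your own 1-D array
history_values = np.sin(np.linspace(0, 20, 100))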

tfm = timesfm.TimesFm(
    hparams=timesfm.TimesFmHparams(horizon_len=24),
    checkpoint=timesfm.TimesFmCheckpoint(
        huggingface_repo_id="google/timesfm-1.0-200m"),
)

# zero-shot directly: give history, ask for the future, no per-series training
point_forecast, _ = tfm.forecast(
    inputs=[history_values],   # list; each entry is one history segment
    freq=[0],                  # 0=high freq (daily/hourly), 1=weekly/monthly, 2=low
)
print(point_forecast)
Pitfall + where you'd use it
"With foundation models, ARIMA / Prophet are obsolete" — wrong. Honestly: a big model is overkill for short series and small data, with higher cost and latency and worse interpretability; classic methods stay first choice when you need "few data points, explainability, or speed." Foundation models win on "many heterogeneous series, zero-shot, no patience for tuning."
📌 BigCat scenario: when you want forecasts for dozens of different metrics in one shot and don't want to tune each, use TimesFM's zero-shot as a fast baseline; if one series really matters, refine it with ARIMA / Prophet.
Takeaway + question
💡 Time-series foundation models = turn "predict the next segment of numbers" into "predict the next token," trading pretraining for zero-shot generality.
🤔 When a field gets a "pretrain + zero-shot" foundation model, do the existing specialized methods die out, or retreat to the "little data / need explanation" niche?

Deep Questions

1. ARIMA uses "differencing," Prophet uses a "trend component," Transformers use "autoregression" — how do their philosophies of handling trend fundamentally differ?
Three worldviews. ARIMA treats trend as interference to remove: difference it flat into a stationary series, then model the fluctuations. Trend isn't understood, it's "subtracted away." Prophet treats trend as a component to model explicitly and interpretably: fit it with piecewise linear/logistic functions and auto-find changepoints; trend is a first-class citizen you can plot alone. The Transformer presupposes no trend form at all: it throws trend, seasonality, and noise into one data distribution and lets massive pretraining make the model "have seen all kinds of trends" so it generalizes; trend is learned implicitly. In engineering terms: the first is "normalize it away then process," the second is "split into observable subsystems," the third is "learn end-to-end, no feature engineering." The further down that list you go, the less manual work is needed, but the more data the method demands and the harder it is to explain.
2. The Kalman filter's predict→update loop — which distributed/systems patterns is it isomorphic to?
At heart it's Bayesian updating: prior (prediction) + likelihood (observation) → posterior (updated estimate), recursing once per data point. Many things are isomorphic: (a) sensor fusion — multiple unreliable sensors weighted by variance, exactly what Kalman gain does; (b) adaptive backoff / dynamic TTL — adjust trust in the new signal vs the old state based on "how accurate were recent predictions"; (c) replica selection in distributed reads — pick a source weighted by replica health/latency variance, same mathematical skeleton; (d) more abstractly, it's the seed of online learning: O(1)-memory incremental state update without recomputing all history. Understand Kalman gain K = "how much to trust new data" and you'll recognize that dial across many system designs.
3. Can time-series foundation models' zero-shot really replace classic methods? When are classic methods still irreplaceable?
Not a full takeover soon — rather layered coexistence. The foundation-model sweet spot: many series, heterogeneous, no patience for per-series tuning, OK with a black box, GPU available. Where classic methods are irreplaceable: (a) tiny data — dozens of points, where a big model has nothing to work with and ARIMA/ETS is steadier; (b) need explanation — finance, medicine, anywhere you must tell a human "why it will rise," where Prophet's decomposition plots are unbeatable; (c) need fast/cheap — edge devices, high-frequency online prediction, where a few-line statistical model crushes a big model on latency and cost; (d) strong domain priors — when you know the holiday/promo rules exactly, injecting them explicitly beats letting the model guess. The pattern mirrors NLP: foundation models raise the floor of the "general baseline," but specialized methods keep a moat in niche scenarios.
4. Almost all classic time-series methods assume "stationarity." Is that as dangerous as assuming "load is stationary" in distributed systems? What happens when the assumption breaks?
Equally dangerous, and it fails similarly. The stationarity assumption = "the future's statistical regularities match the past's." The moment a structural break (regime change) hits — pandemic, policy, a product blowing up, a phase transition — the model is confidently wrong: it keeps extrapolating the old distribution with narrow confidence bands, exactly like capacity planning off historical means getting blown through by a traffic spike. The remedies mirror systems design: (a) monitor residuals/drift, watch forecast error like an SLO and alert + retrain past a threshold; (b) use locally adaptive models (Kalman, sliding windows) rather than one-shot global fits; (c) don't trust point forecasts alone — read the interval, treat uncertainty as a first-class citizen; (d) accept that some breaks are fundamentally unpredictable, and design for "detect fast + recover fast" instead of "prophesy precisely." Forecasting maturity is knowing when forecasting fails.
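
Remedy (a) can be sketched as a rolling error check. This is a minimal illustration, assuming a synthetic series with a level shift at t = 120 and a model that keeps forecasting the old level; drift_alerts, the window, and the threshold are hypothetical choices you would tune like any SLO:

```python
import numpy as np

def drift_alerts(actual, forecast, window, threshold):
    """Indices where the rolling mean absolute error exceeds the threshold."""
    errors = np.abs(np.asarray(actual) - np.asarray(forecast))
    alerts = []
    for i in range(window - 1, len(errors)):
        if errors[i - window + 1 : i + 1].mean() > threshold:
            alerts.append(i)
    return alerts

rng = np.random.default_rng(5)
forecast = np.full(200, 100.0)                     # model still extrapolates the old regime
actual = 100.0 + rng.normal(0.0, 2.0, size=200)
actual[120:] += 30.0                               # structural break at t = 120

alerts = drift_alerts(actual, forecast, window=10, threshold=8.0)
print("first alert at t =", alerts[0] if alerts else None)
```

The alert fires a few steps after the break, not before it: the design goal is fast detection and retraining, not predicting the break.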