ARIMA predicts the next value by "regressing onto its own history": a linear mix of the last few observations (AR) plus the last few prediction errors (MA). The middle letter, I (for "integrated," meaning differencing), is the key: it first turns the data from a "cumulative total" into "increments," much like using Prometheus's rate() to turn an ever-climbing counter into a per-second rate. A counter that only goes up can't be modeled directly; after differencing it is usually close enough to stationary to model and predict.
Many real series carry a trend (a steady climb), but most classic statistical models require the series to be "stationary": its mean and variance don't drift over time. Why does non-stationarity break things? The model fits one fixed set of parameters over the entire history; if the mean keeps moving, those parameters describe an "average state" that never actually existed, so extrapolation is distorted. ARIMA(p,d,q) has three knobs, each doing one job:
AR (p), autoregression: y_t = c + φ₁y_{t-1} + … + φ_p y_{t-p}, where φ is "how much weight each past value gets";
I (d), differencing: y'_t = y_t − y_{t-1} turns a trended non-stationary series into one that fluctuates around a fixed mean, and d is how many times the differencing is applied;
MA (q), moving average of errors: y_t = c + ε_t + θ₁ε_{t-1} + … + θ_q ε_{t-q}, where ε is a past prediction error and θ is the weight it gets.

Intuition: difference to "flatten" the data, then linearly extrapolate from "past values + past errors." The difference between AR and MA is what they remember: AR remembers past actual values (trend momentum), MA remembers past prediction errors (the system hasn't finished digesting a sudden shock). They're complementary: AR alone misses the decay tail of a one-off disturbance, MA alone misses persistent momentum. The seasonal variant is SARIMA, which adds another AR/MA/differencing set spaced at the seasonal period. It models one repeating rhythm (for example a weekly cycle in daily data); overlapping rhythms such as "within-day" plus "within-week" need extra machinery, for example Fourier terms as external regressors.
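The "difference, fit, re-integrate" loop described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not how statsmodels estimates ARIMA: it uses ordinary least squares for an AR(1) on the increments (roughly an ARIMA(1,1,0)), and the function names and the toy series are hypothetical.

```python
def difference(values):
    """First difference: y'_t = y_t - y_{t-1}."""
    return [b - a for a, b in zip(values, values[1:])]


def fit_ar1(values):
    """Least-squares fit of x_t = c + phi * x_{t-1} (illustrative, not MLE)."""
    x_prev, x_next = values[:-1], values[1:]
    n = len(x_prev)
    mean_prev = sum(x_prev) / n
    mean_next = sum(x_next) / n
    cov = sum((a - mean_prev) * (b - mean_next) for a, b in zip(x_prev, x_next))
    var = sum((a - mean_prev) ** 2 for a in x_prev)
    phi = cov / var if var else 0.0
    c = mean_next - phi * mean_prev
    return c, phi


def forecast_arima_110(values, steps):
    """Difference once, run AR(1) on the increments, then add them back up."""
    increments = difference(values)
    c, phi = fit_ar1(increments)
    level, last_inc = values[-1], increments[-1]
    forecast = []
    for _ in range(steps):
        last_inc = c + phi * last_inc   # AR step on the stationary increments
        level += last_inc               # "integrate": turn increments back into a level
        forecast.append(level)
    return forecast


# Toy counter-like series: climbs about 10 per step with a small repeating wobble.
series = [100 + 10 * t + (0, 1, -1)[t % 3] for t in range(20)]

print(difference(series)[:6])          # increments hover around 10 instead of climbing
print(forecast_arima_110(series, 3))   # forecasts keep climbing at roughly 10 per step
```

The raw series has no fixed mean, but its increments do, which is why the AR fit is done on the differenced values and the forecast is accumulated back onto the last observed level.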
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    # series: one column of values ordered by time (e.g. monthly server cost)
    series = pd.read_csv("cost.csv", index_col="month", parse_dates=True)["usd"]

    # order=(p,d,q): 2nd-order AR + 1 differencing + 2nd-order MA
    model = ARIMA(series, order=(2, 1, 2))
    res = model.fit()

    # forecast 6 steps ahead, with a confidence interval
    fc = res.get_forecast(steps=6)
    print(fc.predicted_mean)        # point forecast
    print(fc.conf_int(alpha=0.05))  # 95% interval, quantifies uncertainty
Exponential smoothing is essentially EWMA (exponentially weighted moving average), which you've met all over the backend: TCP RTT estimation, load-balancer rolling averages, Prometheus's holt_winters() smoothing function. In one line: new estimate = α×new observation + (1−α)×old estimate, where older data decays exponentially in weight. The Kalman filter is its upgrade: it additionally tracks "how certain am I about the current estimate" (variance) and uses that to dynamically decide whether to trust the new observation or the old prediction more.
Real data is noisy: you can't fully trust the latest point (jitter) nor cling to the old mean (lag). Exponential smoothing compromises with a fixed α; Holt-Winters extends it into three parallel smoothing equations that track level, trend, and season with independent coefficients, then sum them into a forecast. Expanding the formula shows where "exponential" comes from:
s_t = αy_t + α(1−α)y_{t-1} + α(1−α)²y_{t-2} + …
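As a quick check that the one-line recursion and the expanded sum are the same thing, here is a small plain-Python sketch. The function names and the latency numbers are made up for illustration, and seeding the estimate with the first observation is one common convention among several.

```python
def ewma(values, alpha):
    """Recursive form: s_t = alpha * y_t + (1 - alpha) * s_{t-1}, seeded with y_0."""
    estimate = values[0]
    for y in values[1:]:
        estimate = alpha * y + (1 - alpha) * estimate
    return estimate


def ewma_expanded(values, alpha):
    """Expanded form: weight alpha * (1 - alpha)^k on y_{t-k}, plus the seed's leftover weight."""
    n = len(values)
    total = (1 - alpha) ** (n - 1) * values[0]      # what remains of the seed
    for k in range(n - 1):
        total += alpha * (1 - alpha) ** k * values[n - 1 - k]
    return total


latencies_ms = [100.0, 102.0, 98.0, 150.0, 101.0]   # hypothetical samples, one spike
alpha = 0.3

# Weights on the newest, second newest, ... observation: a geometric (exponential) decay.
weights = [alpha * (1 - alpha) ** k for k in range(4)]

print(weights)
print(ewma(latencies_ms, alpha), ewma_expanded(latencies_ms, alpha))
```

Each step back in time multiplies the weight by (1−α), so a larger α forgets faster and follows the latest point more closely.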
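The Kalman filter goes further, looping through two stages each step: predict (push the previous estimate forward through the state model and let its uncertainty grow), then update (blend that prediction with the new observation, weighted by the Kalman gain K, and shrink the uncertainty accordingly).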
The Kalman gain K is essentially an adaptive "trust dial," isomorphic to the adaptive backoff and dynamic TTLs you know; it is also like fusing readings from several unreliable replicas in a distributed system, weighting each by its credibility (variance). Versus fixed-α smoothing, Kalman's edge is that K is recomputed every step from the ratio of the estimate's own uncertainty to the observation noise: when observations are clean relative to what the filter currently believes, K is high (fast follow); when they're noisy, K is low (hold steady and don't get dragged). Plain smoothing can't do this: once α is set, it's frozen. The cost: you must explicitly write down "how the state evolves" and "how big the observation noise is," so the modeling effort is higher.
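To make the "trust dial" concrete, here is a minimal scalar Kalman filter in plain Python. It is a sketch under simplifying assumptions: the hidden state is a single number that follows a random walk, the noise variances are supplied by hand, and the function name and sample readings are hypothetical.

```python
def kalman_1d(observations, process_var, obs_var, init_estimate, init_var):
    """Scalar Kalman filter for x_t = x_{t-1} + w, y_t = x_t + v.

    Returns a list of (estimate, gain) pairs, one per observation.
    """
    estimate, var = init_estimate, init_var
    history = []
    for y in observations:
        # Predict: the state is assumed unchanged, but uncertainty grows.
        var = var + process_var
        # Update: the gain is the share of trust given to the new observation.
        gain = var / (var + obs_var)
        estimate = estimate + gain * (y - estimate)
        var = (1 - gain) * var
        history.append((estimate, gain))
    return history


readings = [10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.0]   # hypothetical sensor values

clean = kalman_1d(readings, process_var=0.01, obs_var=0.1, init_estimate=0.0, init_var=100.0)
noisy = kalman_1d(readings, process_var=0.01, obs_var=10.0, init_estimate=0.0, init_var=100.0)

print([round(g, 3) for _, g in clean])   # starts near 1, then settles lower
print([round(g, 3) for _, g in noisy])   # lower throughout: observations are trusted less
```

With an uninformed start the first gain is close to 1, so the filter jumps to the first reading; as its own uncertainty shrinks, the gain drops. With fixed noise settings the gain eventually converges to a constant, at which point this particular filter behaves like exponential smoothing with α equal to that gain.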
    from statsmodels.tsa.holtwinters import ExponentialSmoothing

    # three-parameter exponential smoothing (Holt-Winters): trend + season (period=12)
    model = ExponentialSmoothing(
        series,
        trend="add",          # additive trend
        seasonal="add",       # additive season
        seasonal_periods=12,  # 12 months per cycle
    )
    res = model.fit()         # learns α and other smoothing coefficients
    print(res.forecast(6))    # forecast 6 periods ahead
Prophet decomposes a time series into independent components that add up: long-term trend + yearly cycle + weekly cycle + holiday effects + noise — like splitting a system into microservices, each owning one piece, each separately visualizable and configurable. It's a GAM (Generalized Additive Model), not a black box — you can read what each component is doing.
ARIMA expects you to know how to tune p, d, q, which business analysts usually don't. It's also fiddly with missing values, outliers, and multiple seasonalities (weekly + yearly at once). Prophet (Taylor & Letham 2018) was designed for "forecasting at scale with an analyst in the loop," turning the model into a few terms people can understand and adjust directly:
y(t) = g(t) + s(t) + h(t) + ε_t, where g is the trend, s the periodic seasonality, h the holiday effects, and ε the leftover noise.
Each term can be plotted on its own, and you can inject domain priors ("this day is a big sale, volume doubles"). Two design choices are worth understanding. Changepoints: Prophet scatters candidate trend turning points across the history and puts a sparse (Laplace) prior on their rate changes, which shrinks most of them to zero and keeps only real inflections (a sparse-modeling idea that prevents trend overfitting). Additive vs multiplicative seasonality: if seasonal swings grow with the baseline (higher level, taller peaks), switch to multiplicative, otherwise additive systematically underestimates the peaks. This is why it caught on with business teams: not highest accuracy, but easy + interpretable + robust to missing values and outliers.
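The additive-versus-multiplicative point can be checked with a toy calculation in plain Python. This is not Prophet code; the seasonal shape, the levels, and the variable names are invented for illustration, and the only assumption is that the true seasonal swing is a fixed percentage of the baseline.

```python
# Relative seasonal shape with period 4: +20% at the peak, -20% at the trough.
season = [0.0, 0.2, 0.0, -0.2]


def make_series(levels):
    """Series whose seasonal swing scales with the baseline level."""
    return [lvl * (1 + season[t % 4]) for t, lvl in enumerate(levels)]


early_levels = [100.0] * 8          # history observed while the baseline was 100
early = make_series(early_levels)
late_level = 200.0                  # the baseline has since doubled

# Additive view: the peak is a fixed absolute offset learned from history.
additive_offset = max(y - lvl for y, lvl in zip(early, early_levels))
# Multiplicative view: the peak is a fixed ratio learned from history.
multiplicative_ratio = max(y / lvl for y, lvl in zip(early, early_levels))

true_peak = late_level * (1 + max(season))
additive_peak = late_level + additive_offset
multiplicative_peak = late_level * multiplicative_ratio

print(true_peak, additive_peak, multiplicative_peak)
```

Because the offset was learned at the lower baseline, the additive forecast of the later peak falls short, while the ratio carries over unchanged. In Prophet itself this choice is the `seasonality_mode` argument of the `Prophet` constructor.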
    from prophet import Prophet
    import pandas as pd

    # Prophet convention: two columns, ds=date, y=value
    df = pd.read_csv("traffic.csv")  # has ds, y columns

    m = Prophet(yearly_seasonality=True, weekly_seasonality=True)
    m.add_country_holidays(country_name="CN")  # inject China holidays
    m.fit(df)

    future = m.make_future_dataframe(periods=30)  # next 30 days
    forecast = m.predict(future)
    m.plot_components(forecast)  # plot trend/season/holidays separately
Treat "predict the next value" as "predict the next token," which is exactly what an LLM does. TimesFM (Google 2024) is a decoder-only transformer, pretrained on a corpus on the order of 100 billion time-points, that then forecasts zero-shot on series it has never seen, without retraining for that series. The analogy: upgrading from "every project builds its own forecasting model" to "calling a shared, pretrained piece of infrastructure."
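ARIMA / Prophet must be fit per series and can't capture complex nonlinear long-range dependencies; earlier deep time-series models had to be trained per dataset, a high bar. The foundation-model route ports the LLM playbook over: cut the series into patches that play the role of tokens, pretrain a decoder-only transformer on a very large corpus, then forecast new series zero-shot.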
Patching (cutting continuous values into short non-overlapping segments, like ViT image patches or like merging characters into tokens) has two benefits: it shrinks the sequence length by a factor of the patch length, easing attention's O(n²) cost; and it makes the model learn from "local shapes" (the small waveform inside a patch) rather than single points, which is closer to how a human reads a chart. The model can also output quantile forecasts alongside the point forecast, giving uncertainty bands at little extra cost. The paper (arXiv 2310.10688, ICML 2024) reports that, after pretraining on a corpus on the order of 100 billion time-points, zero-shot performance can approach that of supervised models trained specifically on each dataset. That is the substance behind the term "foundation model."
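The effect of patching on sequence length is easy to see in a plain-Python sketch. The helper name is hypothetical, the patch length of 32 is an illustrative choice, and dropping the oldest leftover points is a simplification made for this sketch rather than a description of how TimesFM pads its input.

```python
def to_patches(values, patch_len):
    """Split a series into non-overlapping patches, keeping the most recent points.

    Simplifying assumption: leftover points at the old end are dropped.
    """
    usable = len(values) - len(values) % patch_len
    recent = values[len(values) - usable:]
    return [recent[i:i + patch_len] for i in range(0, usable, patch_len)]


history = list(range(100))            # 100 hypothetical time-points
patches = to_patches(history, patch_len=32)

# Self-attention compares every token with every other token, so cost grows as n^2.
pairs_per_point = len(history) ** 2   # attending over raw points
pairs_per_patch = len(patches) ** 2   # attending over patches

print(len(patches), [len(p) for p in patches])
print(pairs_per_point, pairs_per_patch)
```

Each patch becomes one "token" that carries a small local waveform, so the transformer sees far fewer positions and each position holds a shape instead of a single number.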
    # pip install timesfm (Google's official SDK)
    # note: foundation-model libraries iterate fast; defer to the official repo
    import numpy as np
    import timesfm

    tfm = timesfm.TimesFm(
        hparams=timesfm.TimesFmHparams(horizon_len=24),
        checkpoint=timesfm.TimesFmCheckpoint(
            huggingface_repo_id="google/timesfm-1.0-200m"),
    )

    # placeholder history for illustration only; substitute your own 1-D array of past values
    history_values = np.sin(np.arange(256) / 8.0)

    # zero-shot directly: give history, ask for the future, no per-series training
    point_forecast, _ = tfm.forecast(
        inputs=[history_values],  # list; each entry is one history segment
        freq=[0],                 # 0=high freq (up to daily), 1=weekly/monthly, 2=low (quarterly/yearly)
    )
    print(point_forecast)