AI/ML Deep Dive: Federated & Privacy Learning

Day 40 · 2026-06-26
For: engineers with coding experience, outside the AI field

Federated Learning

distributed training · data locality
One-line analogy

Traditional ML is "move data to compute"—pull everything into a central data center, then train. Federated learning flips this to "move compute to data"—the same idea as the data locality you know from MapReduce: don't move the data, push the code down. The difference is that the reduce stage here aggregates not data but the model gradients each node computed locally; raw data never leaves the user's device.

What problem it solves + how it works

Pain point: hospital records, phone keyboard logs, bank transactions—this data is both sensitive and scattered; law (GDPR) and commercial interest both forbid centralizing it. But without centralizing it you can't train a good model. The federated answer: the model goes to the data, not the data to the model.

The mechanism is FedAvg (Federated Averaging), proposed by McMahan et al. in 2016—a repeating loop: the server sends the current global model to a batch of clients → each client trains a few steps on its local data → it sends back only the weight update (not the data) → the server takes a data-size-weighted average to form the new global model.

The aggregation formula is plain: w ← Σ (n_k / n) · w_k. w_k is client k's trained local weights, n_k its sample count, n the total. Intuition: whoever has more data gets more say in the average—it's just a weighted reduce.

One FedAvg communication round:
Central server w
↓ push global model
Phone A Hospital B Bank C ← train locally, data stays put
↑ upload only update Δw
Server weight-averages by n_k → new w
↑ loop dozens–hundreds of rounds. Communication, not compute, is the bottleneck
Code example
import numpy as np

# FedAvg's core is really just this weighted-average fn (frameworks: Flower/FedML)
def fedavg(client_weights, client_sizes):
    # client_weights: list of each client's locally trained weight vectors
    # client_sizes:   each client's local sample count n_k
    total = sum(client_sizes)
    # Weight by sample count: more data → larger share
    global_w = sum(
        (n_k / total) * w_k
        for w_k, n_k in zip(client_weights, client_sizes)
    )
    return global_w  # the global model to push next round

# Simulate: 3 clients, raw data never leaves them
w = fedavg([np.array([1.,2]), np.array([3.,4]), np.array([2.,2])],
            [100, 50, 200])  # bank(200) has the most say
print(w)  # [1.857 2.286] — biased toward the data-rich client
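The aggregation function above covers only the server's reduce step. As a rough sketch of one full communication round (push, local training, weighted average), the simulation below assumes a tiny least-squares model and synthetic client data; `local_train`, `fedavg_round`, the learning rate, and the step counts are illustrative choices, not part of any framework's API.

```python
import numpy as np

def local_train(w, X, y, lr=0.1, steps=5):
    """A few gradient-descent steps of least-squares regression on one client's data."""
    w = w.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
        w -= lr * grad
    return w

def fedavg_round(global_w, clients):
    """One round: push global_w, train locally, average returned weights by n_k."""
    local_ws = [local_train(global_w, X, y) for X, y in clients]
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    return np.average(local_ws, axis=0, weights=sizes)

# Synthetic setup (assumption): three clients whose data share one true model.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for n_k in (100, 50, 200):
    X = rng.normal(size=(n_k, 2))
    y = X @ true_w + 0.01 * rng.normal(size=n_k)
    clients.append((X, y))          # (X, y) never leaves the client in this simulation

w = np.zeros(2)
for _ in range(30):                 # 30 communication rounds
    w = fedavg_round(w, clients)
print(w)                            # should land close to true_w
```

Only weight vectors cross the client boundary here; the server code never touches `X` or `y`. Real deployments add client sampling, Non-IID data, and dropped connections, none of which this sketch models.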
Common misconception + use case
Misconception: "data stays put = privacy is safe." Wrong. Sharing only gradients also leaks—research shows training samples can be reconstructed from uploaded gradients (gradient inversion / model inversion attacks). Federated learning only solves "data isn't centralized," not "is the uploaded content leaky"—so it almost always needs the three cryptographic/statistical tools below layered on top.
📌 Super-individual scenario: you want to fine-tune a local small model on private notes spread across devices (work laptop, home machine, phone) without pooling them. The federated mindset—"each device computes updates locally, only merge weights"—is the right mental model: treat it as "distributed training" over your personal data.
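To make the gradient-leak misconception concrete, here is a minimal sketch of the simplest case: one sample passing through a linear layer with a bias. Because the weight gradient is the outer product of the bias gradient and the input, the server can recover the input exactly by a division. The tiny model, the variable names, and the single-sample batch are assumptions chosen for illustration; attacks on deep networks and larger batches are harder and only approximate.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)            # the client's private input (one sample)
y = 1.5                           # its private label
W = rng.normal(size=(3, 4))       # first-layer weights, known to the server
b = rng.normal(size=3)            # first-layer bias, known to the server
v = rng.normal(size=3)            # fixed linear readout

# Client side: forward pass and squared-error loss, pred = v . (W x + b)
h = W @ x + b
pred = v @ h
d_h = 2 * (pred - y) * v          # dLoss/dh
grad_W = np.outer(d_h, x)         # dLoss/dW = d_h x^T
grad_b = d_h                      # dLoss/db = d_h

# Server side: sees only grad_W and grad_b, never x.
i = np.argmax(np.abs(grad_b))     # pick a row with a non-zero bias gradient
x_rec = grad_W[i] / grad_b[i]     # row i of grad_W is grad_b[i] * x
print(np.allclose(x_rec, x))
```

The uploaded "update" contained the raw input in full, which is why location-only protection is not enough.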
Takeaway + question
💡 Federated learning is the "move compute, not data" distributed paradigm; FedAvg is essentially a data-size-weighted reduce.
🤔 If a client's uploaded "update" is maliciously crafted (poisoning), does weighted averaging get contaminated? How would you design Byzantine fault tolerance on the aggregation side?

Differential Privacy

privacy math · quantifiable guarantee
One-line analogy

Differential privacy is like a database query's "controlled redaction": you can ask "what's the average salary," but the system mixes in carefully calibrated noise so you can't reverse-engineer Zhang's salary by diffing "with Zhang vs. without Zhang." The key: privacy no longer rests on "I feel it's safe" but on a guarantee quantifiable by a number ε and accumulable as a budget—like the rate-limit quota (token bucket) you know: each query spends budget, and when it's gone you can't query more.

What problem it solves + how it works

Pain point: traditional "anonymization" (drop names, drop IDs) keeps getting broken, because a few cross-referenced side fields can re-identify a person. DP gives a harder definition: a randomized algorithm M satisfies ε-differential privacy iff for any two datasets D, D′ differing by one record and any set of possible outputs S:

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]

Intuition: whether or not your record is included, the output distribution barely changes (bounded by the factor e^ε). So an attacker seeing the result can't tell whether you were in the dataset; your presence is "drowned out." The smaller ε (the privacy budget), the more alike the two distributions and the stronger the privacy, but the more noise and the lower the accuracy. This is an unavoidable dial between privacy and utility.

How to implement? The classic Laplace mechanism: add Laplace noise to the true answer, with magnitude = sensitivity / ε. Sensitivity = how much changing one record can shift the answer at most (a "count" query has sensitivity 1). Applied to deep learning, this is Abadi et al. 2016's DP-SGD: during training, first clip each sample's gradient (bounding single-sample influence = controlling sensitivity), then add Gaussian noise, and use the "moments accountant" to precisely tally the total ε spent across training.
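The DP-SGD recipe just described (clip each sample's gradient, then add Gaussian noise) can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: `dp_sgd_step` and its parameter names are hypothetical, per-sample gradients are passed in as a ready-made array, and the moments accountant that converts the noise level into a total ε is left out entirely.

```python
import numpy as np

def dp_sgd_step(w, per_sample_grads, lr, clip_norm, noise_multiplier, rng):
    # 1) Clip: rescale any per-sample gradient whose L2 norm exceeds clip_norm.
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    factors = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_sample_grads * factors
    # 2) Noise: add Gaussian noise with std = noise_multiplier * clip_norm to the sum.
    noisy_sum = clipped.sum(axis=0) + rng.normal(
        scale=noise_multiplier * clip_norm, size=w.shape)
    # 3) Average over the batch and take an ordinary gradient step.
    return w - lr * noisy_sum / len(per_sample_grads)

rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0],     # norm 5.0 -> scaled down to norm 1.0
                  [0.3, 0.4]])    # norm 0.5 -> left unchanged
w_new = dp_sgd_step(np.zeros(2), grads, lr=0.1, clip_norm=1.0,
                    noise_multiplier=1.1, rng=rng)
print(w_new)
```

Clipping is what bounds the sensitivity: after it, no single sample can move the summed gradient by more than `clip_norm`, so the noise scale can be calibrated against that bound.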

The ε dial's trade-off:
ε=0.1 strong privacy high noise, low accuracy
ε=1.0 balanced  common range
ε=8.0 weak privacy low noise, high accuracy
↑ no "free privacy": ε is a budget you must price explicitly
Code example
import numpy as np

# Laplace mechanism: a differentially private release of a "count" query
def private_count(true_count, epsilon, sensitivity=1.0):
    # Count query: adding/removing one record shifts result by at most 1
    scale = sensitivity / epsilon       # noise magnitude ∝ 1/ε
    noise = np.random.laplace(loc=0, scale=scale)
    return true_count + noise          # release the noised result

true = 1000  # truly 1000 patients have some disease
print(private_count(true, epsilon=0.1))   # strong privacy → big noise, maybe 987 or 1013
print(private_count(true, epsilon=2.0))   # weak privacy → small noise, ~1000.4

# For neural nets, the official lib Opacus (PyTorch) takes over in one line:
# from opacus import PrivacyEngine
# model, optim, loader = PrivacyEngine().make_private(...)  # auto clip + noise
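To tie the Laplace mechanism back to the ε-DP inequality, the check below compares the output densities for two neighboring counts (1000 and 999) and confirms their ratio never exceeds e^ε. The reasoning is the triangle inequality: the ratio equals exp((|z − 999| − |z − 1000|) / scale), which is at most exp(1 / scale) = e^ε when the sensitivity is 1. The helper `laplace_pdf`, the grid of candidate outputs, and ε = 0.5 are illustrative assumptions.

```python
import numpy as np

def laplace_pdf(z, mu, scale):
    """Density of the Laplace distribution centered at mu."""
    return np.exp(-np.abs(z - mu) / scale) / (2 * scale)

epsilon, sensitivity = 0.5, 1.0
scale = sensitivity / epsilon

z = np.linspace(900, 1100, 2001)          # candidate released values
p_with = laplace_pdf(z, 1000, scale)      # dataset D: true count is 1000
p_without = laplace_pdf(z, 999, scale)    # neighbor D': one record removed

max_ratio = np.max(p_with / p_without)
print(max_ratio, np.exp(epsilon))         # the first value should not exceed the second
```

The bound is tight: for any output at or above 1000 the ratio equals e^ε exactly, so a larger ε directly widens the gap an attacker can exploit.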
Common misconception + use case
Misconception: "any ε is fine, I added noise anyway." Wrong. ε has no universal safe threshold, and it accumulates: under basic composition, querying the same data 10 times at ε each costs 10·ε in total (composition theorem). Quietly setting ε in the tens provides almost no meaningful protection. It's a budget to be strictly accounted, not a one-time switch.
📌 Decision-support scenario: when evaluating a data product that claims to be "anonymized," ask "what's your ε, and how do you account for it?"—it instantly separates real privacy engineering from marketing. It's the habit of trading a vague "sense of safety" for an auditable number.
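The "budget, not a switch" idea can be sketched as a small ledger that charges every query and refuses once the total is spent. This assumes basic (additive) composition only; `PrivacyBudget` and this variant of `private_count` are hypothetical names for illustration, not an existing library's interface.

```python
import numpy as np

class PrivacyBudget:
    """Hypothetical ledger: tracks cumulative epsilon under basic composition."""
    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

def private_count(true_count, epsilon, budget, rng, sensitivity=1.0):
    budget.charge(epsilon)        # pay before answering; raises if over budget
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

budget = PrivacyBudget(total_epsilon=1.0)
rng = np.random.default_rng(0)
answers = [private_count(1000, 0.25, budget, rng) for _ in range(4)]

try:
    private_count(1000, 0.25, budget, rng)   # fifth query would exceed the budget
    refused = False
except RuntimeError:
    refused = True
print(budget.spent, refused)
```

Unlike a token bucket, nothing ever refills `spent`; the only way to answer more queries is to have allocated a smaller ε to the earlier ones.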
Takeaway + question
💡 DP turns "privacy" from an adjective into a quantifiable, accumulable, priceable mathematical guarantee ε; the core is "with or without your one record, the result is nearly the same."
🤔 Federated learning (concept ① today) can still leak from gradients alone, while DP-SGD adds noise to gradients. When combining them, should the noise be added on the client locally or at server aggregation? What are the trust assumptions of each?

Homomorphic Encryption

cryptography · compute on ciphertext
One-line analogy

Ordinary encryption is like locking data in a safe—to use it you must unlock it, and at that instant the plaintext is exposed to the server. Homomorphic encryption (HE) is a "safe with gloves": you can operate inside the box directly, never opening it. In your world: the server can run SUM / dot product directly on encrypted database fields, returns ciphertext, and only you (holding the private key) can open it—while the server never saw a single plaintext.

What problem it solves + how it works

Pain point: you want to use cloud compute or a third-party model on sensitive data (records, finances) but don't trust them. Uploading plaintext = leak; computing locally = no compute power. HE lets you "outsource the computation without handing over the data."

The math core is "preserving operation structure": the encryption function E satisfies E(a) ⊕ E(b) = E(a + b). That is, some operation ⊕ on ciphertext corresponds, after decryption, to addition on plaintext. A scheme supporting both ciphertext addition and multiplication, for arbitrary depth, is Fully Homomorphic Encryption (FHE)—first constructed by Craig Gentry in 2009 using ideal lattices, a decades-long cryptographic milestone.
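The property E(a) ⊕ E(b) = E(a + b) can be seen in a toy version of the Paillier cryptosystem, where ⊕ is multiplication of ciphertexts modulo n². Note the swap: Paillier is only additively homomorphic, so this is neither FHE nor CKKS, and the tiny primes below make it completely insecure. The function names are illustrative; the sketch needs Python 3.8+ for the modular inverse via `pow`.

```python
import math
import random

def keygen(p=293, q=433):
    # Toy primes for illustration only; real keys use primes of 1024+ bits.
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    mu = pow(lam, -1, n)          # inverse of lam mod n; valid for generator g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    n2 = n * n
    while True:                   # pick random r coprime to n
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def add_encrypted(n, c1, c2):
    # The "⊕" operation: multiplying ciphertexts adds the plaintexts.
    return (c1 * c2) % (n * n)

n, priv = keygen()
c_sum = add_encrypted(n, encrypt(n, 17), encrypt(n, 25))
print(decrypt(n, priv, c_sum))    # 42
```

Whoever runs `add_encrypted` needs only the public value `n`, never the private key, which is the whole point: the sum is computed by a party that cannot read either operand.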

Why was it impossible before? Each ciphertext operation accumulates noise, multiplication especially fast, and once noise exceeds a threshold it can no longer be decrypted. Gentry's breakthrough was bootstrapping: use homomorphic operations to "decrypt the ciphertext itself and re-encrypt," resetting the noise, enabling unlimited-depth computation. The cost is extreme slowness—the root reason HE isn't yet mainstream. In ML the more common scheme is CKKS (Cheon et al. 2017), supporting approximate real-number arithmetic, well suited to neural inference that tolerates small errors.

Homomorphic inference (data stays encrypted throughout):
You: plaintext x → encrypt → ciphertext E(x)
↓ upload
Cloud server: run model on E(x) → E(y) sees neither x nor y
↓ return ciphertext
You: decrypt E(y) with private key → plaintext y
↑ security cost = orders of magnitude slower; CKKS trades precision for practicality
Code example
import tenseal as ts  # OpenMined's HE library, wraps CKKS

# 1) Build a context = generate keys + set CKKS parameters
ctx = ts.context(ts.SCHEME_TYPE.CKKS, poly_modulus_degree=8192,
                  coeff_mod_bit_sizes=[60, 40, 40, 60])
ctx.global_scale = 2**40
ctx.generate_galois_keys()

# 2) Encrypt two vectors (think: your private features)
a = ts.ckks_vector(ctx, [1.0, 2.0, 3.0])
b = ts.ckks_vector(ctx, [4.0, 5.0, 6.0])

# 3) Dot product directly on ciphertext—server-side, never sees plaintext
enc_dot = a.dot(b)           # entirely ciphertext arithmetic
print(enc_dot.decrypt())     # ≈ [32.0] = 1*4+2*5+3*6, approximate but accurate enough
Common misconception + use case
Misconception: "with FHE, private computation is solved." Reality: FHE is thousands to tens of thousands of times slower than plaintext; running a full large-model inference is still impractical. So it's often downgraded—applying HE only to the most sensitive sliver (e.g., a final similarity match), with MPC / DP for the rest. Treat it as a "nuclear option," not a default switch.
📌 Cross-disciplinary scenario: HE is a pure example of "operating correctly on content without understanding it"—it resonates directly with your interest in "form vs. meaning" and "does computation require understanding" (the Chinese Room argument). A system that computes correctly yet "doesn't understand" is itself philosophical material about mind and computation.
Takeaway + question
💡 HE fully decouples "computing" from "seeing"—the server can compute but can't look, at a huge performance cost kept alive by bootstrapping.
🤔 If one day FHE were fast enough to run full model inference, the "the cloud must be trusted" assumption would vanish. How would that reshape your judgment of "data sovereignty" and AI business models?

Secure Multi-Party Computation

cryptography · secret sharing
One-line analogy

Secure multi-party computation (MPC) solves a seemingly paradoxical problem: several parties jointly compute a shared result, yet none reveals its own input. Analogy to the distributed quorum you know: data is split into "secret shares" held by each party, a single share is meaningless, and only combined per protocol does it yield a result—except MPC goes further: what's combined is the "computed result," not the raw data itself. The classic primer is Yao's "Millionaires' Problem": two millionaires want to know who is richer without revealing how much each has.

What problem it solves + how it works

Pain point: several hospitals want to jointly study a drug's efficacy, but none may see the others' records; several banks want joint fraud detection but are limited by competition and compliance. They need "collaborative computation, zero-trust sharing."

The easiest mechanism is additive secret sharing: split a secret number x into a sum of random numbers x = x₁ + x₂ + x₃ (mod p), each party getting just one xᵢ. Any single share is a uniform random number leaking zero information. If a second secret y is shared the same way, each party can add its own shares xᵢ + yᵢ independently, and summing those result shares recovers the true value of x + y, while no one ever saw the raw x or y. Multiplication is harder (it needs extra protocols such as Beaver triples), but the idea is the same.

In federated learning the most important application is Bonawitz et al. 2017's Secure Aggregation: clients use pairwise masks to "additively share" their gradients with the server, which can only recover the sum of all gradients—never any single client's gradient—and it tolerates clients dropping out midway. This exactly fills the hole in concept ① ("sharing only gradients still leaks").
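The pairwise-mask idea can be sketched as follows: every pair of clients agrees on a random mask, one adds it and the other subtracts it, so each upload looks random while the masks cancel in the sum. This is a heavily simplified illustration, not the full protocol: masks are drawn from one shared generator instead of being derived through key agreement, dropout recovery is omitted, and updates are assumed to be already quantized to integers mod P. `masked_updates` is a hypothetical helper name.

```python
import numpy as np

P = 2**31 - 1   # all arithmetic is mod P

def masked_updates(updates, rng):
    """Each pair (i, j) with i < j shares a random mask; i adds it, j subtracts it."""
    n = len(updates)
    masked = [u.astype(np.int64) % P for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            mask = rng.integers(0, P, size=updates[i].shape, dtype=np.int64)
            masked[i] = (masked[i] + mask) % P
            masked[j] = (masked[j] - mask) % P
    return masked

rng = np.random.default_rng(0)
updates = [np.array([3, 1, 4]), np.array([1, 5, 9]), np.array([2, 6, 5])]
masked = masked_updates(updates, rng)

total = sum(masked) % P           # what the server computes
print(total)                      # [ 6 12 18], the sum of the true updates
```

The server learns the aggregate and nothing else: each entry of `masked` on its own is offset by masks the server never sees.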

Additive secret sharing: compute x+y without revealing x, y
A holds x=5 split→ x₁=12 x₂=-7 (sum=5)
B holds y=3 split→ y₁=-4 y₂=7 (sum=3)
each party adds shares locally: node1: 12+(-4)=8   node2: -7+7=0
publicly merge: 8 + 0 = 8 = x+y ✓ no one saw 5 or 3
Code example
import random
P = 2**31 - 1  # a large prime; all arithmetic is mod P

def share(secret, n=3):
    # split into n shares: first n-1 random, last makes the sum = secret
    parts = [random.randrange(P) for _ in range(n - 1)]
    parts.append((secret - sum(parts)) % P)
    return parts  # any single share is uniform random, leaks zero info

# Two parties each split their secret, distribute to 3 compute nodes
sx, sy = share(5), share(3)  # Alice=5, Bob=3, not told to each other

# Each node adds only the shares it holds (local, no communication)
node_sums = [(a + b) % P for a, b in zip(sx, sy)]

# Publicly merge node results → recover x+y, but 5 and 3 never exposed
print(sum(node_sums) % P)  # 8 ✓
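Addition on shares needs no communication, but multiplication does. As a minimal sketch of the Beaver-triple approach for two parties, the code below assumes a trusted dealer hands out shares of a random triple (a, b, c = a·b) ahead of time; `reveal` and `beaver_multiply` are illustrative names, and a real protocol would generate the triple without any trusted party.

```python
import random

P = 2**31 - 1   # a large prime; all arithmetic is mod P

def share(secret, n=2):
    parts = [random.randrange(P) for _ in range(n - 1)]
    parts.append((secret - sum(parts)) % P)
    return parts

def reveal(shares):
    return sum(shares) % P

def beaver_multiply(x_sh, y_sh):
    # Offline phase (assumption: trusted dealer): share a random triple c = a * b.
    a, b = random.randrange(P), random.randrange(P)
    a_sh, b_sh, c_sh = share(a), share(b), share(a * b % P)
    # Online phase: open d = x - a and e = y - b; the random a, b mask x and y.
    d = reveal([(x - ai) % P for x, ai in zip(x_sh, a_sh)])
    e = reveal([(y - bi) % P for y, bi in zip(y_sh, b_sh)])
    # Each party computes its share locally; party 0 also adds the public d * e.
    z_sh = [(c + d * bi + e * ai) % P for c, ai, bi in zip(c_sh, a_sh, b_sh)]
    z_sh[0] = (z_sh[0] + d * e) % P
    return z_sh

print(reveal(beaver_multiply(share(5), share(3))))  # 15
```

Correctness follows from expanding c + d·b + e·a + d·e with d = x − a and e = y − b, which collapses to x·y. The only values ever opened are d and e, and both are masked by uniformly random numbers.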
Common misconception + use case
Misconception: "MPC is encrypted, so the output is safe too." Wrong. MPC only protects the process (no inputs leak mid-way), not the result itself. If the final "joint statistic" can itself re-identify someone (e.g., a rare disease with only one patient), it still leaks—so MPC is often layered with differential privacy: MPC guards the process, DP guards the result.
📌 Personal-project scenario: you and a friend want to compare investment returns without revealing principal or holdings—additive secret sharing computes the "average return" without exposing anyone's numbers. It's a general template for "collaborate without showing your cards," well beyond technical settings.
Takeaway + question
💡 MPC achieves "collaborative computation, zero-trust sharing" via secret sharing, but it only protects the process, not the output—result privacy still needs DP.
🤔 These four techniques (FL/DP/HE/MPC) protect different things: data location, individual identifiability, computation visibility, input confidentiality. Why do production systems usually combine them rather than pick one?

How They Combine

They aren't mutually exclusive substitutes but building blocks protecting different layers; production systems usually stack them:

Technique | Protects | Main cost | Trust assumption
Federated Learning | data not centralized (location) | communication, Non-IID | semi-honest server
Differential Privacy | individual unidentifiable (result) | accuracy drop | noise-adder trusted
Homomorphic Encryption | invisible during compute (process) | orders of magnitude slower | no need to trust compute
Secure MPC | each party's input secret (process) | many communication rounds | majority don't collude

Typical stack: Federated learning sets the frame (data stays put) → Secure Aggregation / MPC hides single gradients from the server → Differential privacy noises the model against individual re-inference → add HE for the most sensitive slivers if needed. In one line: FL guards location, MPC/HE guard process, DP guards result.

Deep Questions

1. Federated learning claims "data stays put," so why still layer DP / MPC? What does sharing only gradients actually leak?
A gradient is not a harmless summary of data but a high-fidelity signal of "how the data shapes the model." Gradient inversion research shows: with small batches and a known model architecture, you can approximately reconstruct original training samples from uploaded gradients—recognizable faces, readable text. Intuitively a gradient = partial derivative of loss w.r.t. parameters; it encodes "to fit this sample, which way should the model move," carrying far more information than expected. So federated learning only solves physical location (data not centrally stored), not information leakage. Two fixes: (a) Secure Aggregation / MPC lets the server see only "the sum of all gradients"—single-point inversion fails; (b) DP-SGD adds noise to gradients so inversion, even with access, reconstructs poorly. This is exactly why the three-layer defense must stack—any single layer has a gap.
2. The ε "privacy budget" accumulates. What does it share with the rate-limit quotas and distributed resource accounting you know?
Structurally identical: both are global accounting of a finite resource. DP's "composition theorem" says: k ε-DP queries on the same dataset cost at worst k·ε (advanced composition improves this to roughly √k·ε for small ε, at the price of a small failure probability δ, i.e. an (ε, δ)-DP guarantee). This maps almost one-to-one to token-bucket rate limiting: each query "spends budget," and when exhausted, service is refused. The difference is that the spent resource, "privacy," is non-renewable: the bucket never refills, and once spent it's lost forever. The engineering implications match distributed accounting: (a) you need a centralized budget ledger tracking cumulative ε per dataset; (b) guard against "same query re-skinned" bypassing the budget; (c) design a budget allocation strategy, giving scarce ε to the most valuable analyses first. Your quota/rate-limit experience transfers directly: just swap "QPS" for "privacy loss."
3. HE "computes but can't see," MPC "collaborates without showing cards"—where's the difference and trade-off?
Both decouple "computation" from "data visibility," but along opposite routes. HE is single-party encryption, outsourced compute: one person encrypts data and ships it to the cloud, which computes on ciphertext—compute is centralized, communication minimal (one round trip), but compute overhead explodes (thousands to tens of thousands of times slower), and usually only the data owner can decrypt. MPC is multi-party collaboration, shared secrets: data is split into shares across parties, computed jointly via many communication rounds—single operations far faster than HE, but rounds grow with computation depth, and security relies on "majority don't collude." Trade-offs: (a) single-party outsourcing, poor network, shallow compute (e.g., one encrypted similarity match) → HE; (b) peer parties, good network, lots of interaction (e.g., multi-institution joint modeling) → MPC; (c) modern systems often mix, picking the cheapest per segment by "multiplicative depth" and "communication cost." They're complementary, not one replacing the other.
4. All four techniques have a hard "privacy vs. utility / cost" trade-off. Is there "free privacy"?
No—and this isn't insufficient engineering effort, it's a fundamental information-theoretic constraint. Intuition: privacy's essence is "an attacker can't distinguish individuals from the output," while a useful result necessarily carries information about the input—to make the result useful it must reflect the data; to protect privacy it must be "insensitive" to any single record. The two goals are mathematically opposed. DP pays via "add noise → drop accuracy," HE/MPC via "cryptographic overhead → drop performance," federated learning via "don't centralize → communication cost + Non-IID slower convergence." It's a Pareto frontier: you can only pick a point between "privacy strength" and "utility/cost," not max both. So the real engineering question is never "do we want privacy" but "how much loss can this scenario tolerate for how strong a guarantee"—treat it as an explicitly priced decision, not a compliance checkbox to fudge. This is also why "what's your ε" is far more valuable than "we take privacy seriously."
5. Turning "privacy" from gut feeling into a quantifiable ε—what methodological lesson does that hold for other fields?
This is a methodological event worth savoring across disciplines. Before DP, "anonymous" and "privacy" were adjectives, legal talk; attack and defense talked past each other, unfalsifiable. Every "anonymization" scheme was broken by a new attack because there was no definition spelling out what "secure" means. Dwork et al.'s contribution wasn't just an algorithm but an adversarial, provable, composable mathematical definition of privacy (ε-DP), letting defenders prove guarantees and bounding attackers. This "formalize the fuzzy concept" move often marks a paradigm shift: cryptography defined "secure" as "indistinguishable from random," complexity theory defined "hard" via NP-hardness, information theory defined "information" as entropy. Two lessons: (a) when a field is stuck in "everyone disagrees, nothing falsifiable," what's missing is often not more schemes but a good definition; (b) a good definition is quantifiable, composable, adversarial (worst-case adversary). For the "hard concepts" you care about, such as consciousness and alignment, progress may likewise hinge on finding the formalization that makes people say "oh, so that's what it means."