Building a real-time stylized talking avatar — in the browser, on every platform

The shape of the project

One demo, three repositories

The visible artifact is a single static web page. Behind it sit three codebases that each solved a different hard problem: how to train the look, how to ship it in a browser, and how to make it fast enough for a phone.

2.45M

params in the shipping generator — distilled down from a 217M teacher

~2.1 ms

per-frame GPU inference in the in-house int8 engine

4.7 MB

FP16 model the browser downloads — once, then cached

0

frames leave the device: webcam, model and voice are all local

The pipeline, end to end

1. Any face (webcam / rendered head)

→

2. 478 landmarks (MediaPipe → "fancy" render)

→

3. Stylized face (pix2pix UNet on WebGPU)

Procrustes alignment to a canonical template sits between steps 2 and 3, so a face at any angle/scale maps into the framing the model was trained on.

🎓 The training repo

pytorch-CycleGAN-and-pix2pix (fork)

Builds the dataset (Stable Diffusion stylization + IC-Light relighting + LivePortrait pose/viseme synthesis), trains the unet_256 teacher, and distills it into a tiny mobile_unet_256 student designed to be quantization- and WebGPU-friendly.

🌐 The web app

RTUKChatBot — single static page, no build step

Webcam → MediaPipe → Procrustes → ONNX generator on WebGPU/WebNN, plus a procedural animation rig (blink, head-sway, emotion, lip-sync) and a local-first chat avatar (RAG + on-device LLM + TTS/STT). This is the portfolio piece next to this blog.

⚙️ The native engine

nativeGAN — int8 cGAN inference in Rust + wgpu

A from-scratch GPU inference engine that parses the ONNX graph and runs it as hand-written WGSL compute shaders — the same code on Windows native and in the browser. Its WGSL was ported back into the web app as the default mobile backend, beating onnxruntime-web on phones.

Repo 1 · pytorch-CycleGAN-and-pix2pix

Training the look — and distilling it small

The model maps a rendered landmark image → a GTA-IV-illustrated, cyberpunk-lit face. There were no off-the-shelf labels for that, so most of the work was manufacturing the training distribution, then compressing a big teacher into a phone-sized student without losing it.

1 · From "avatar→webcam" to "landmarks→face"

The task was re-framed early: instead of mapping a specific avatar to a specific face, the input became a rendered MediaPipe FaceMesh (468 points + iris = 478) drawn in a deliberately "fancy" style — filled face oval, eyes, lips, tessellation, contours, iris circles — and the target is the masked stylized face. Keying off landmarks makes the model robust to any face. To accept landmarks from any camera, a 3D Procrustes similarity transform maps detected points onto a canonical template, removing pose, scale and translation before the GAN ever sees them.
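A minimal sketch of that alignment step, assuming NumPy and the standard Umeyama/Kabsch least-squares solution. The function names `fit_similarity` and `align`, and the random stand-in template, are hypothetical and not taken from the repo:

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares s, R, t such that dst ~= s * R @ src + t (Umeyama)."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    sc, dc = src - mu_s, dst - mu_d
    # 3x3 cross-covariance between the centred point sets.
    cov = dc.T @ sc / len(src)
    U, S, Vt = np.linalg.svd(cov)
    # Flip the last axis if needed so R is a proper rotation (det = +1).
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0
    R = U @ D @ Vt
    var_s = (sc ** 2).sum() / len(src)
    s = (S * np.diag(D)).sum() / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t

def align(points, s, R, t):
    """Apply the similarity transform to an (N, 3) array of row vectors."""
    return s * points @ R.T + t

# Demo: a "detected" face is the template rotated, scaled and shifted.
rng = np.random.default_rng(0)
template = rng.normal(size=(478, 3))  # stand-in for the canonical template
angle = 0.7
Rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle),  np.cos(angle), 0.0],
               [0.0, 0.0, 1.0]])
detected = 2.5 * template @ Rz.T + np.array([10.0, -4.0, 3.0])
s, R, t = fit_similarity(detected, template)
aligned = align(detected, s, R, t)  # back in template framing
```

After the fit, `aligned` matches the template regardless of the pose, scale or position the face was detected at, which is the property the generator relies on.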

2 · Stylization: Stable Diffusion, stacked

The face targets were stylized into a GTA-IV illustrated look with SD 1.5 + ControlNet-canny + IP-Adapter-FaceID + a GTA LoRA, in a two-pass img2img refine so the LoRAs could stack without collapsing identity. (IP-Adapter-FaceID needs its negative+positive embeds stacked, torch.cat([neg, pos]) → shape [2,1,512] — a small detail that cost real time.)

3 · The relighting breakthrough — IC-Light FBC

Lighting LoRAs never produced the dramatic chiaroscuro the reference had. The fix was IC-Light — not a LoRA but a UNet replacement taking extra latent channels (8 for foreground-conditioned, 12 for foreground+background). Two findings made it usable for a dataset:

Text→image relighting destroys identity (the face is regenerated). An img2img init ≈ 0.5 preserves identity while still relighting.
Background-conditioned (FBC) with one fixed background image gives consistent lighting across every subject — exactly the property a coherent training set needs.

[Figure: GTA-stylized face before relighting] BEFORE: GTA stylization pass, flat daylight.

[Figure: same face after IC-Light cyberpunk relighting] AFTER: IC-Light FBC relight, consistent cyberpunk key + neon city.

4 · Synthesizing the data the recordings lacked

The ~275 real recordings barely covered head turns or speaking mouths. Rather than collect more, the missing distribution was synthesized with LivePortrait, driven parametrically — a pitch×yaw×roll grid of implicit-keypoint transforms with no driving video. Animating a few curated stylized heroes across the grid yields many identity-, style- and lighting-consistent frames; landmarks are re-derived per frame for clean pairs. The |yaw|=30° profiles were held out as a pose-extreme test split.
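The grid-plus-holdout idea can be sketched in a few lines. The angle values below are illustrative assumptions (only the |yaw| = 30° holdout comes from the text), and `pose_grid` is a hypothetical helper:

```python
from itertools import product

def pose_grid(pitches, yaws, rolls, holdout_abs_yaw=30):
    """Enumerate (pitch, yaw, roll) poses; extreme yaws go to the test split."""
    train, test = [], []
    for pitch, yaw, roll in product(pitches, yaws, rolls):
        split = test if abs(yaw) == holdout_abs_yaw else train
        split.append((pitch, yaw, roll))
    return train, test

# Hypothetical grid in degrees; the real step sizes are not stated in the post.
train, test = pose_grid(
    pitches=(-10, 0, 10),
    yaws=(-30, -15, 0, 15, 30),
    rolls=(-5, 0, 5),
)
print(len(train), len(test))
```

Each tuple would then parameterize one LivePortrait render of a hero, with landmarks re-derived from the rendered frame.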

[Figure: synthesized head-yaw sequence from LivePortrait] Parametric pose synthesis: one hero rotated yaw 0°→30° with no driving video. The student trained on these scored 59% lower LPIPS on held-out extreme poses than the previous version; it renders profiles it never saw in real data.

5 · Distillation — 217M teacher → 2.45M student

Real-time in-browser inference needs a tiny model. The teacher (unet_256, up to ngf=128, 217M params at 512px) is frozen and a small mobile_unet_256 student is trained to mimic it — no new labels, the targets are the teacher's outputs, loss is L1 + VGG perceptual. The student architecture is deliberately deployment-shaped: Upsample+Conv instead of ConvTranspose (no checkerboarding, INT8-friendly), post-activation [Conv, BN, ReLU], and FloatFunctional skip-concats so the graph fuses cleanly on export.

⚠ Hard-won: the GAN runs away in the decay phase

With an adversarial term, late in training the discriminator overpowers the generator (D_fake → 0.02, L1/VGG climb 25–40%). The model regresses, it doesn't plateau — so latest is often worse than an earlier epoch. Always save often and pick the best epoch; a GAN-free fine-tune (L1+VGG only) both fixed the regression and lowered the quality floor. That fine-tune is the shipping recipe.
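"Pick the best epoch" amounts to choosing the checkpoint with the lowest validation metric rather than the most recent one. A small sketch, with made-up metric values and a hypothetical `best_epoch` helper:

```python
def best_epoch(history):
    """history maps epoch -> validation metric (lower is better)."""
    return min(history, key=history.get)

# Illustrative numbers only: the metric improves, then regresses in the decay phase.
history = {100: 0.031, 150: 0.027, 200: 0.029, 250: 0.036}

chosen = best_epoch(history)
latest = max(history)
regression = history[latest] / history[chosen] - 1.0
print(chosen, latest, round(regression, 2))
```

This only works if checkpoints are saved frequently enough that a good epoch still exists on disk when the regression shows up.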

6 · The resolution lesson

The single biggest GAN insight: a UNet trained at 512 but run at 256 is badly blurry — and not because of the lower resolution itself. A fully-convolutional UNet is resolution-flexible but not resolution-invariant: its BatchNorm statistics, receptive-field scale and decoder are tuned to 512-pixel spacing, so at 256 it synthesizes low-frequency mush. Distilling natively at 256 removed the mismatch — measured ~5–6× sharper (Laplacian variance), beating even a clean mipmap of the 512 output. A later check found the all-256 pipeline landed within 3.8% of the 512-teacher pipeline at one-third the training cost: student capacity, not teacher resolution, is the ceiling.
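Laplacian variance is a simple sharpness proxy: apply a Laplacian filter and take the variance of the response. A dependency-free sketch, assuming a 4-neighbour kernel on a grayscale image stored as nested lists (the post does not say which kernel was used):

```python
def laplacian_variance(img):
    """Variance of the 4-neighbour Laplacian over the interior pixels."""
    h, w = len(img), len(img[0])
    vals = [
        img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1] - 4 * img[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

# A checkerboard is maximally "sharp"; a flat grey image has no detail at all.
checker = [[(x + y) % 2 for x in range(8)] for y in range(8)]
flat = [[0.5] * 8 for _ in range(8)]
print(laplacian_variance(checker), laplacian_variance(flat))
```

A "5–6× sharper" claim in this sense is a ratio of two such variances computed on comparable outputs.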

7 · Making the mouth read — v4 → v5 → v6

Speaking mouths were the last weak spot, because the pose synthesis had frozen expression. The fix arrived in three steps, and it's a clean demonstration of where a fix has to land:

v4: mouth data. Transplant viseme expression-deltas (LivePortrait open_lip/laugh, later real speech from talking.pkl, selected by farthest-point sampling) onto each hero. The teacher learned mouths almost perfectly (mouth-region LPIPS 0.0070) but the student barely moved (0.0411, against 0.0416 for v3): a distillation capacity limit, not a data limit.
v5: mouth-weighted loss. A pose-aligned mouth mask weights lip pixels 5× in the distillation L1/VGG → 0.0286 (−31% vs v3).
v6: GAN-free fine-tune. Warm-start from v5's best epoch, discriminator off → 0.0234 (−44% vs v3).
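The v5 step can be illustrated with a mouth-weighted L1 term. This sketch covers only the L1 part (not VGG), works on a single-channel image as nested lists, and normalizes by the total weight, which is an assumption rather than something the post specifies:

```python
def mouth_weighted_l1(student, teacher, mouth_mask, mouth_weight=5.0):
    """Mean absolute error where masked (mouth) pixels count mouth_weight times."""
    num = 0.0
    den = 0.0
    for s_row, t_row, m_row in zip(student, teacher, mouth_mask):
        for s, t, m in zip(s_row, t_row, m_row):
            w = mouth_weight if m else 1.0
            num += w * abs(s - t)
            den += w
    return num / den

student = [[0.0, 0.0], [0.0, 0.0]]
mask = [[1, 0], [0, 0]]  # top-left pixel is "mouth"

err_in_mouth = mouth_weighted_l1(student, [[1.0, 0.0], [0.0, 0.0]], mask)
err_elsewhere = mouth_weighted_l1(student, [[0.0, 1.0], [0.0, 0.0]], mask)
print(err_in_mouth, err_elsewhere)
```

The same-sized error costs five times more when it lands on the lips, which is what pushes a capacity-limited student to spend its parameters there.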

[Figure: viseme mouth-shape comparison across model variants] Speech-viseme quality across variants. Left: the landmark input (note the colour-coded mouth + brows). Then target, the shipping ngf=16 student, and wider candidates. The shipping student tracks distinct mouth shapes (bilabial closures, lip-teeth, vowels) at 2.45M params. A data fix lands in the teacher; loss-weighting is what carries it through distillation into a tiny student.

Model lineage

Checkpoint	Arch	Params	Change	Headline
landmarks2face	unet_256	54.4M	photoreal landmarks→face	LPIPS 0.019
landmarks2gtaface	unet_256	54.4M	+ GTA stylization (teacher)	LPIPS 0.034
…v2_mobile_distill	mobile_256	2.45M	first cyberpunk student	—
…v3_pix2pix	unet_256	217M	+ LivePortrait pose synth, 512px	—
…v3_native256	mobile_256	2.45M	256-native drop-in	pose 0.061
…v4_mobile_distill	mobile_256	2.45M	mouth data only	mouth 0.0411
…v5_mouthw	mobile_256	2.45M	+ mouth-weighted loss	mouth 0.0286
…v5_mouthw_ganfree (v6)	mobile_256	2.45M	+ GAN-free fine-tune	mouth 0.0234
…speech_ngf16_ganfree SHIP	mobile_256	2.45M	+ speech visemes, drop-in	speech 0.0190
…speech_ngf24_ganfree	mobile_256	5.51M	capacity upgrade (not drop-in)	speech 0.0164

LPIPS vs target, lower = better. Pose-extreme dropped v2 0.153 → ngf24 0.0467 (teacher ceiling 0.037); mouth-region v3 0.0416 → v6 0.0234 (teacher ceiling 0.0070).

Repo 2 · RTUKChatBot

Shipping it in the browser

A single static HTML page, no build step. The interesting work was getting a quantized CNN to run correctly and fast across three execution backends, keeping a per-frame loop allocation-free, and layering a believable animation rig on top.

Execution-provider strategy: WebNN → WebGPU → WASM

On startup the page detects which backends the browser actually exposes and tries them best-first:

WebNN (navigator.ml) maps ops to CoreML on Apple and DirectML on Windows. For a UNet-style CNN this is typically 3–10× faster than WebGPU because CoreML can route eligible ops to the Neural Engine.
WebGPU (navigator.gpu) is the broadly available GPU fallback, requested with powerPreference: "high-performance" so the browser prefers the discrete GPU where one exists.
WASM (multi-threaded) is where quantized models run — for a reason worth its own callout.

◆ The INT8 saga

ORT-web's WebGPU execution provider mishandles quantized UNets: the output saturates to blank white, across both QDQ and QOperator formats and both Conv and ConvTranspose op sets. The same INT8 model runs correctly and faster on the WASM EP (~441 ms vs ~1283 ms on the test rig). One more trap: int32 bias DequantizeLinear nodes carry an all-zero zero_point that the WebGPU EP rejects, so it has to be stripped at export time. Lesson: route INT8 to WASM; keep FP16 on WebGPU.
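The routing rule reduces to a small decision function. The page itself is JavaScript; the sketch below only models the decision logic, and it assumes INT8 models always go straight to WASM. `backend_order` and the string labels are hypothetical names:

```python
def backend_order(has_webnn, has_webgpu, quantized):
    """Return the execution providers to try, best first."""
    if quantized:
        # Rule from the INT8 saga: quantized models run on the WASM EP only.
        return ["wasm"]
    order = []
    if has_webnn:
        order.append("webnn")
    if has_webgpu:
        order.append("webgpu")
    order.append("wasm")  # always available as the last resort
    return order

print(backend_order(has_webnn=True, has_webgpu=True, quantized=False))
print(backend_order(has_webnn=False, has_webgpu=True, quantized=True))
```

In the browser the two capability flags would come from feature detection (whether `navigator.ml` and `navigator.gpu` exist), and a failed session creation would fall through to the next entry.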

The output upscaler

A 256² generator output benefits enormously from a good upscale. The chain went from bicubic to a 4-tap B-spline filter applied in linear light, followed by CAS-style adaptive sharpening, also in linear light: strong on skin, gentle near edges.
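Two pieces of that chain are easy to show in isolation: the sRGB/linear conversion and the four cubic B-spline tap weights. This is a 1D sketch under those standard definitions; the adaptive sharpening pass is omitted and `upsample_1d` is a hypothetical helper, not the app's shader:

```python
def srgb_to_linear(c):
    """Standard sRGB decoding for a channel value in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def linear_to_srgb(l):
    """Standard sRGB encoding for a linear-light value in [0, 1]."""
    return 12.92 * l if l <= 0.0031308 else 1.055 * l ** (1 / 2.4) - 0.055

def bspline_weights(t):
    """Uniform cubic B-spline weights for the 4 taps around offset t in [0, 1)."""
    return (
        (1 - t) ** 3 / 6,
        (3 * t**3 - 6 * t**2 + 4) / 6,
        (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6,
        t**3 / 6,
    )

def upsample_1d(samples, i, t):
    """Sample between samples[i] and samples[i + 1], filtering in linear light."""
    last = len(samples) - 1
    taps = [samples[min(max(i + k, 0), last)] for k in (-1, 0, 1, 2)]
    lin = sum(w * srgb_to_linear(v) for w, v in zip(bspline_weights(t), taps))
    return linear_to_srgb(lin)

print(bspline_weights(0.0))
print(upsample_1d([0.2, 0.4, 0.6, 0.8], 1, 0.5))
```

Filtering the decoded values rather than the sRGB-encoded ones avoids the darkening that gamma-space averaging produces around bright edges.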

Resolution, by platform

Once the resolution lesson was internalized, the app ships two natively-trained models: 256 (the mobile default — bandwidth + perf) and a higher-quality 512 (desktop default; ?res=512 opt-in on mobile). Both run at the resolution they were trained at — no mismatch blur.

[Figure: 256 vs 512 model output compared in-browser] 256 (mobile) vs 512 (desktop), in-browser. Same landmarks, two natively-trained models. The 512 model resolves more skin and hair detail; 256 stays the phone default because it's ~4× cheaper per frame and a smaller download.

Cross-origin isolation for threaded WASM

Multi-threaded WASM needs SharedArrayBuffer, which needs crossOriginIsolated == true. The page is served with Cross-Origin-Opener-Policy (COOP): same-origin and Cross-Origin-Embedder-Policy (COEP): credentialless. Choosing credentialless rather than require-corp was the key: MediaPipe's .task model is served without a Cross-Origin-Resource-Policy (CORP) header and would be blocked by require-corp, whereas credentialless allows no-credential cross-origin loads while still unlocking threads. Browsers without it (Safari < 17.4) degrade gracefully to single-threaded.
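For local testing, the two headers can be added with a few lines of standard-library Python. This is a sketch of a development server, not the production hosting setup; `IsolatedHandler` and `serve` are hypothetical names:

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

# The header pair described above.
ISOLATION_HEADERS = {
    "Cross-Origin-Opener-Policy": "same-origin",
    "Cross-Origin-Embedder-Policy": "credentialless",
}

class IsolatedHandler(SimpleHTTPRequestHandler):
    """Static file handler that adds the isolation headers to every response."""

    def end_headers(self):
        for name, value in ISOLATION_HEADERS.items():
            self.send_header(name, value)
        super().end_headers()

def serve(port=8000):
    """Serve the current directory until interrupted."""
    ThreadingHTTPServer(("127.0.0.1", port), IsolatedHandler).serve_forever()
```

Calling `serve()` from the site directory and checking `crossOriginIsolated` in the browser console is a quick way to confirm that threads will be available.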

The animation rig — additive, in canonical space

Beyond raw webcam tracking, the landmarks can be driven procedurally: automatic blink (~15/min), idle head-sway, segmented emotion presets, and lip-sync (from an audio file, the mic, or typed text spoken by TTS). Every channel is additive and applied in canonical template space after Procrustes alignment.

◆ Procrustes both ways

To animate in a clean canonical space and keep the subject's real head pose, the deltas are applied in template space, then the Procrustes transform is inverted back to the live camera frame. Writing the forward alignment as y = s·R·x + t (scale s, rotation R, translation t), inverse(forward(src)) = src, so the real pose/scale/position is recovered exactly, while a template-space delta d rides along, correctly rotated, as (1/s)·Rᵀ·d. Head-sway is a rigid rotation and must be the last 3D step; applied before alignment it would be cancelled out.
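The round trip can be checked numerically. The sketch assumes the forward alignment has the form y = s·R·x + t and uses plain lists for a single 3D point; all helper names and the example transform are hypothetical:

```python
import math

def mat_vec(M, v):
    return [sum(M[r][c] * v[c] for c in range(3)) for r in range(3)]

def transpose(M):
    return [list(col) for col in zip(*M)]

def forward(x, s, R, t):
    """Camera space -> template space: y = s * R * x + t."""
    return [s * a + b for a, b in zip(mat_vec(R, x), t)]

def inverse(y, s, R, t):
    """Template space -> camera space: x = (1/s) * R^T * (y - t)."""
    centred = [yi - ti for yi, ti in zip(y, t)]
    return [a / s for a in mat_vec(transpose(R), centred)]

# Example transform: scale 2, 30 degree rotation about z, some translation.
a = math.radians(30)
R = [[math.cos(a), -math.sin(a), 0.0],
     [math.sin(a),  math.cos(a), 0.0],
     [0.0, 0.0, 1.0]]
s, t = 2.0, [5.0, -1.0, 0.5]

x = [1.0, 2.0, 3.0]    # a landmark in the live camera frame
d = [0.0, -0.1, 0.0]   # an animation delta in template space (e.g. jaw drop)

y_animated = [yi + di for yi, di in zip(forward(x, s, R, t), d)]
x_animated = inverse(y_animated, s, R, t)

# Closed form: the original point plus the delta rotated and rescaled.
expected = [xi + ri / s for xi, ri in zip(x, mat_vec(transpose(R), d))]
```

`x_animated` and `expected` agree, which is the (1/s)·Rᵀ·d identity: the original point comes back untouched and only the delta is rotated and rescaled into camera space.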

The chat avatar

A later Chat tab turns the face into a local-first conversational avatar: RAG (BM25 + MiniLM embeddings) feeds an on-device LLM (WebLLM), the reply is spoken by TTS, and the speech audio simultaneously drives the mouth — each TTS WAV is fed to an analyser that maps RMS→jaw-open while the text schedules the viseme shapes. Speech-to-text is Whisper (or the native API). The whole thing is DOM-decoupled so it bolts onto the existing render loop without touching the original lab tool. Nothing is sent to a server.
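The RMS→jaw-open mapping can be sketched as a clamped linear ramp over one analysis window. The `floor` and `ceil` thresholds are hypothetical tuning values, not numbers from the app:

```python
import math

def rms(samples):
    """Root-mean-square level of one window of audio samples in [-1, 1]."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def jaw_open(samples, floor=0.02, ceil=0.30):
    """Map the window's RMS linearly from [floor, ceil] to jaw openness in [0, 1]."""
    level = (rms(samples) - floor) / (ceil - floor)
    return min(1.0, max(0.0, level))

silence = [0.0] * 256
loud = [1.0, -1.0] * 128
medium = [0.16, -0.16] * 128

print(jaw_open(silence), jaw_open(medium), jaw_open(loud))
```

Amplitude only says how far the jaw opens; which shape the lips take still comes from the text-scheduled visemes described above.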

The hard part · iOS & Android

Making it run on a phone

"Works on my desktop" is the easy 80%. Mobile browsers — iOS Safari especially — impose a tight memory budget, gate audio behind user gestures, and vary wildly in GPU throughput. Most of the platform-specific engineering lives here.

iOS Safari

HTTPS is mandatory for the camera. getUserMedia simply won't fire on iOS over plain HTTP — dev uses an ngrok/mkcert HTTPS tunnel.
Audio is gesture-gated — and the mouth depends on it. Safari only starts an AudioContext from inside a tap handler. The context was being created after the async reply, so it stayed suspended → no audio and no mouth movement (the visemes are driven by analysing the playing audio). Fixed by unlocking on a user gesture and adding a per-answer Play button; avatar audio plays via a plain HTMLAudioElement.
Tight memory budget → OOM-reloads. Per-frame tensor/typed-array allocations accumulate faster than GC reclaims and crash the tab. The fix was a disciplined hot loop (below) plus halving chat audio memory by single-decoding the spoken WAV.
Zero-model-memory voice by default. Piper's ~60 MB neural voice OOMs iOS, so mobile defaults to the browser's native Web Speech API (the OS synthesizes — zero model memory); the mouth is lip-synced from the reply text instead of the audio spectrum.
FAQ, not full LLM. iOS gets a lightweight retrieval FAQ rather than the desktop's on-device LLM, to stay inside the memory ceiling.
Small but real: stop iOS zooming when the chat input is focused; performance.memory is unavailable on Safari, so the live memory readout shows mem n/a there.

Android Chrome

WebGPU on Chrome 121+ is the baseline. Where present it drives the same path as desktop; the front camera and threaded WASM both work.
GPU throughput varies enormously across Android devices, which is what the device-tiered frame-rate system (described below) is for: a flagship and a budget phone shouldn't run the same cadence.
On-device debugging drives an attached handset over adb + Playwright connectOverCDP for real WebGPU/chat repro on actual hardware, not just an emulator.
performance.memory is exposed on Chrome/Android, so the live heap readout (and the leak-trend probe) work there — invaluable for catching the kind of per-frame leak that silently kills iOS.

◆ The allocation-disciplined hot loop (why iOS stopped crashing)

The inference loop is engineered to run indefinitely on a memory-constrained browser: output tensors are disposed every frame (ORT-web doesn't auto-free the backend GPU/WASM memory they hold); the input tensor is reused — one ort.Tensor over a persistent buffer, mutated in place; and the landmark canvas readback is zero-alloc, reading through a reused WebGL2 buffer instead of getImageData() (which allocates a 256 KB ImageData per call). A live memory readout made the original leak obvious; a headless probe now asserts the JS heap stays flat over time.
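One way such a flat-heap assertion could work is to fit a line through periodic heap samples and bound its slope. This is an assumption about the approach rather than the probe's actual code; `heap_slope`, `is_flat` and the threshold are hypothetical:

```python
def heap_slope(samples):
    """Least-squares slope of (seconds, heap_bytes) samples, in bytes per second."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_b = sum(b for _, b in samples) / n
    cov = sum((t - mean_t) * (b - mean_b) for t, b in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    return cov / var

def is_flat(samples, max_bytes_per_s=50_000):
    return heap_slope(samples) <= max_bytes_per_s

# One getImageData() per frame at 256x256 RGBA is 256 * 256 * 4 bytes (256 KiB).
per_frame = 256 * 256 * 4
leaking = [(t, 50_000_000 + per_frame * 20 * t) for t in range(5)]  # 20 fps, never freed
sawtooth = [(0, 50_000_000), (1, 52_000_000), (2, 50_000_000),
            (3, 52_000_000), (4, 50_000_000)]                        # normal GC churn

print(is_flat(leaking), is_flat(sawtooth))
```

A trend fit tolerates the ordinary garbage-collection sawtooth while still catching a steady per-frame leak within a few seconds of samples.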

Device-tiered frame rate & the temporal weave

Mobile runs on tiers — weak → 10 fps, mid → 20, strong → 40 — all on the single 256 model, defaulting to a fixed safe MID that real mid-range phones sustain. A measured auto-tune (measure real per-frame cost, then step down on sustained shortfall) is opt-in, because measuring during init can mis-read a capable device. A separate temporal weave path runs four distilled 128×128 generators round-robin (one per frame), weaving each into a 2×2 sub-pixel phase of a persistent 256 buffer — temporal supersampling at roughly ¼ the per-frame GAN cost.
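The weave itself is just strided writes into a persistent buffer. The sketch uses tiny 2×2 tiles in place of 128×128 generator outputs, and the phase-to-offset ordering is an assumption:

```python
def weave(buffer, tile, phase):
    """Write an N x N tile into the 2N x 2N buffer at sub-pixel phase 0..3."""
    dy, dx = divmod(phase, 2)  # phase 0 -> (0,0), 1 -> (0,1), 2 -> (1,0), 3 -> (1,1)
    for y, row in enumerate(tile):
        for x, value in enumerate(row):
            buffer[2 * y + dy][2 * x + dx] = value

N = 2  # stands in for 128
buffer = [[None] * (2 * N) for _ in range(2 * N)]

for frame in range(4):
    # Stand-in for running generator number (frame % 4); the tile is filled with its index.
    tile = [[frame] * N for _ in range(N)]
    weave(buffer, tile, frame % 4)

for row in buffer:
    print(row)
```

Each frame refreshes one quarter of the output pixels, and a 128×128 pass covers a quarter of the pixels of a 256×256 pass, which is where the roughly ¼ per-frame cost comes from.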

10 / 20 / 40

fps tiers — weak / mid / strong devices

¼ cost

temporal weave: four 128² nets vs one 256² net

native int8

the default mobile inference backend (next section)

~60 MB

saved per session by defaulting iOS to native TTS

⚠ The deploy trap that only bites previews

The R2 model bucket's CORS only allows the production origin, so *.pages.dev preview deploys can't fetch the GAN at all. The only way to test the real model path is to deploy to main — a gotcha worth knowing before burning an afternoon on a "broken" preview.

Repo 3 · nativeGAN

An in-house int8 engine — Rust + wgpu

onnxruntime-web is a big dependency, and its WebGPU EP couldn't run the quantized model. So the inference engine was rebuilt from scratch: parse the ONNX graph, run it as hand-written WGSL compute shaders, record the whole forward pass into a single command buffer, submit once per frame. The same engine runs on Windows native and in the browser.

[Figure: nativeGAN side-by-side input and int8 GAN output in the browser] Live int8 inference in Chrome via WebGPU. Left: input frame. Right: the GAN output, produced entirely by hand-written WGSL compute shaders, with no onnxruntime in the loop. Bit-exact against the ORT oracle at ~2.1 ms/frame.

"The native way" — bypass ORT, parse the ONNX into wgpu

An offline tool (export_weights.py) turns an int8 QDQ ONNX file into a flat model.json + weights.bin. The engine reads that and builds a GPU program: int8 convolution over packed activations/weights (a packed 4×int8 dot product), inputs pre-quantized once, liveness-based buffer reuse, and bind groups built once and recorded cheaply each frame. WGSL is shared verbatim between the native and web builds.
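The packed 4×int8 dot product can be emulated on the CPU to show what one shader step computes. This is an illustrative model, not the engine's WGSL; lane order is assumed little-endian and the helper names are hypothetical:

```python
import struct

def pack4(values):
    """Pack four signed 8-bit ints into one unsigned 32-bit word."""
    return struct.unpack("<I", struct.pack("<4b", *values))[0]

def unpack4(word):
    """Inverse of pack4: one 32-bit word back to four signed 8-bit lanes."""
    return struct.unpack("<4b", struct.pack("<I", word))

def dot4_i8_packed(a, b):
    """Signed dot product of two packed words, accumulated at full integer width."""
    return sum(x * y for x, y in zip(unpack4(a), unpack4(b)))

def dot_packed(acts, weights):
    """Accumulate a longer int8 dot product four lanes at a time."""
    return sum(dot4_i8_packed(a, w) for a, w in zip(acts, weights))

a = pack4([1, -2, 3, -4])
w = pack4([5, 6, -7, 8])
print(dot4_i8_packed(a, w))
```

Packing four values per 32-bit word quarters the memory traffic for activations and weights, and the products are accumulated in a wider integer, so the int8 arithmetic itself stays exact.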

[Figure: comparison of int8 quantization variants] Validating quantization variants. Each row is a subject; columns compare the float reference against int8 strategies. The engine is validated three ways: a pytest L0 that reproduces onnxruntime numerically, a GPU integration test vs ORT, and an L5 Playwright gate that screenshots real Chrome + WebGPU.

Milestone	What	State
M0	Offline int8 export + onnxruntime oracle	done
M1	wgpu int8 engine — bit-exact vs ORT, ~2.1 ms/frame	done
M2	Windows native video (Media Foundation) → live side-by-side	done
M3	Web / WebGPU — browser-validated via Playwright	done
MT	Temporal acceleration (N-phase ensemble, optical-flow reproject)	planned
M4	Mobile (iOS / Android via cargo-mobile2)	planned

◆ How it folds back into the web app

The engine's WGSL was ported to vanilla JS as scripts/native_gan.js and is now the default inference backend on mobile — fully GPU-resident, feeding the same bicubic display and temporal weave, and the fastest path there. Desktop keeps onnxruntime; ?backend=native / ?backend=ort force an A/B. Three repos, one shipping result.

If you take six things away

Key learnings

Train and infer at the same resolution

A fully-convolutional UNet is resolution-flexible, not resolution-invariant. Run a 512-trained net at 256 and you get low-frequency mush — fix it by distilling at the deployment resolution, not by upscaling.

Synthesize the distribution you're missing

Parametric LivePortrait poses and transplanted viseme deltas bought generalization the real recordings never could — head turns and speaking mouths the camera never captured.

A fix lands in the teacher; weighting carries it through

Adding mouth data fixed the teacher (0.0070) but not the 2.45M student (0.0411). A 5× mouth-weighted distillation loss is what pushed the capability into the tiny model.

Pick the execution provider at runtime, per device

WebNN→WebGPU→WASM, chosen from what the browser actually exposes. Quantized UNets are broken on ORT-web's WebGPU EP — they belong on WASM (correct and faster).

Mobile is a memory-and-gesture problem

iOS OOM-reloads on per-frame allocations and refuses audio outside a tap. Reuse every buffer, dispose tensors each frame, unlock audio on gesture, and default to a zero-model-memory voice.

Sometimes you write the runtime

When the off-the-shelf runtime can't run your model, parsing the ONNX into your own WGSL is a viable answer — and it became the fastest mobile path in the product.

Two further notes that cost real time. First, the RTX Blackwell (sm_120) needs a PyTorch cu128 build; cu121 builds fail silently, so an entire model version was trained on the wrong GPU before anyone noticed. Second, a tiny student is easily data-loading-bound: pin_memory + persistent_workers + an in-RAM cache cut epoch time from 36 s to 12 s.