Real-time generative avatar · runs entirely client-side
A webcam frame becomes face landmarks, the landmarks become a GTA-cyberpunk-styled face through a distilled pix2pix generator, and the whole forward pass runs on the GPU via WebGPU — no server, no upload, ~2 ms a frame. This is the engineering story behind it: training the model, distilling it small enough for a phone, and the platform work that made it run everywhere.
The shape of the project
The visible artifact is a single static web page. Behind it sit three codebases that each solved a different hard problem: how to train the look, how to ship it in a browser, and how to make it fast enough for a phone.
Procrustes alignment to a canonical template sits between steps 2 and 3, so a face at any angle/scale maps into the framing the model was trained on.
pytorch-CycleGAN-and-pix2pix (fork)
Builds the dataset (Stable Diffusion stylization + IC-Light relighting + LivePortrait pose/viseme synthesis), trains the unet_256 teacher, and distills it into a tiny mobile_unet_256 student designed to be quantization- and WebGPU-friendly.
RTUKChatBot — single static page, no build step
Webcam → MediaPipe → Procrustes → ONNX generator on WebGPU/WebNN, plus a procedural animation rig (blink, head-sway, emotion, lip-sync) and a local-first chat avatar (RAG + on-device LLM + TTS/STT). This is the portfolio piece next to this blog.
nativeGAN — int8 cGAN inference in Rust + wgpu
A from-scratch GPU inference engine that parses the ONNX graph and runs it as hand-written WGSL compute shaders — the same code on Windows native and in the browser. Its WGSL was ported back into the web app as the default mobile backend, beating onnxruntime-web on phones.
Repo 1 · pytorch-CycleGAN-and-pix2pix
The model maps a rendered landmark image → a GTA-IV-illustrated, cyberpunk-lit face. There were no off-the-shelf labels for that, so most of the work was manufacturing the training distribution, then compressing a big teacher into a phone-sized student without losing it.
The task was re-framed early: instead of mapping a specific avatar to a specific face, the input became a rendered MediaPipe FaceMesh (468 points + iris = 478) drawn in a deliberately "fancy" style — filled face oval, eyes, lips, tessellation, contours, iris circles — and the target is the masked stylized face. Keying off landmarks makes the model robust to any face. To accept landmarks from any camera, a 3D Procrustes similarity transform maps detected points onto a canonical template, removing pose, scale and translation before the GAN ever sees them.
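As a rough illustration of the alignment idea, here is a minimal sketch of a least-squares similarity fit. It is deliberately a 2D simplification: the pipeline described above uses a 3D transform, which is usually solved with an SVD (Kabsch/Umeyama style). The function names and the four-point template are hypothetical.

```python
def fit_similarity_2d(src, dst):
    """Least-squares similarity (scale, rotation, translation) mapping src -> dst.

    Points are complex numbers x + yj. Solves dst ~= a * src + b, where
    abs(a) is the scale and the phase of a is the rotation angle.
    """
    n = len(src)
    src_mean = sum(src) / n
    dst_mean = sum(dst) / n
    src_c = [p - src_mean for p in src]          # remove translation
    dst_c = [q - dst_mean for q in dst]
    denom = sum(abs(p) ** 2 for p in src_c)
    a = sum(p.conjugate() * q for p, q in zip(src_c, dst_c)) / denom
    b = dst_mean - a * src_mean
    return a, b


def apply_similarity(a, b, points):
    return [a * p + b for p in points]


# Hypothetical canonical template (a unit square) and a "detected" copy that
# has been rotated 90 degrees, scaled by 2 and shifted.
template = [0j, 1 + 0j, 1 + 1j, 0 + 1j]
detected = [2j * p + (3 + 4j) for p in template]

a, b = fit_similarity_2d(detected, template)
aligned = apply_similarity(a, b, detected)
```

After the fit, `aligned` coincides with `template`, which is the property the generator relies on: pose, scale and translation are gone before it sees the points.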
The face targets were stylized into a GTA-IV illustrated look with SD 1.5 + ControlNet-canny + IP-Adapter-FaceID + a GTA LoRA, in a two-pass img2img refine so the LoRAs could stack without collapsing identity. (IP-Adapter-FaceID needs its negative+positive embeds stacked, torch.cat([neg, pos]) → shape [2,1,512] — a small detail that cost real time.)
Lighting LoRAs never produced the dramatic chiaroscuro the reference had. The fix was IC-Light — not a LoRA but a UNet replacement taking extra latent channels (8 for foreground-conditioned, 12 for foreground+background). Two findings made it usable for a dataset:
The ~275 real recordings barely covered head turns or speaking mouths. Rather than collect more, the missing distribution was synthesized with LivePortrait, driven parametrically — a pitch×yaw×roll grid of implicit-keypoint transforms with no driving video. Animating a few curated stylized heroes across the grid yields many identity-, style- and lighting-consistent frames; landmarks are re-derived per frame for clean pairs. The |yaw|=30° profiles were held out as a pose-extreme test split.
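A minimal sketch of how such a parametric grid and its hold-out split could be enumerated. The grid values are made up for illustration; only the rule that |yaw| = 30° goes to the test split comes from the text, and `build_pose_splits` is a hypothetical helper, not part of LivePortrait.

```python
from itertools import product

# Hypothetical grid values in degrees.
PITCH = [-15, 0, 15]
YAW = [-30, -15, 0, 15, 30]
ROLL = [-10, 0, 10]


def build_pose_splits(pitch=PITCH, yaw=YAW, roll=ROLL, holdout_yaw=30):
    """Enumerate the pitch x yaw x roll grid and hold out the extreme yaw poses."""
    train, test = [], []
    for p, y, r in product(pitch, yaw, roll):
        (test if abs(y) == holdout_yaw else train).append((p, y, r))
    return train, test


train_poses, test_poses = build_pose_splits()
```

Each tuple would then parameterise one synthesized frame per hero, with landmarks re-derived from the rendered result.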
Real-time in-browser inference needs a tiny model. The teacher (unet_256, up to ngf=128, 217M params at 512px) is frozen and a small mobile_unet_256 student is trained to mimic it — no new labels, the targets are the teacher's outputs, loss is L1 + VGG perceptual. The student architecture is deliberately deployment-shaped: Upsample+Conv instead of ConvTranspose (no checkerboarding, INT8-friendly), post-activation [Conv, BN, ReLU], and FloatFunctional skip-concats so the graph fuses cleanly on export.
⚠ Hard-won: the GAN runs away in the decay phase
With an adversarial term, late in training the discriminator overpowers the generator (D_fake → 0.02, L1/VGG climb 25–40%). The model regresses rather than plateauing, so latest is often worse than an earlier epoch. Save checkpoints often and pick the best epoch; a GAN-free fine-tune (L1+VGG only) both fixed the regression and lowered the quality floor. That fine-tune is the shipping recipe.
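The "pick the best epoch, not the latest" rule is simple enough to sketch. The scores below are invented to mimic the improve-then-regress shape described above; `pick_best_epoch` is a hypothetical helper and assumes a lower-is-better validation metric such as LPIPS.

```python
def pick_best_epoch(val_scores):
    """Return (epoch, score) with the lowest validation score (lower = better)."""
    return min(val_scores.items(), key=lambda kv: kv[1])


# Made-up numbers: quality improves, then regresses in the decay phase.
val_lpips = {10: 0.052, 20: 0.041, 30: 0.036, 40: 0.044, 50: 0.049}

best_epoch, best_score = pick_best_epoch(val_lpips)
latest_epoch = max(val_lpips)
```

Here the latest checkpoint (epoch 50) is clearly worse than epoch 30, which is why frequent saves matter.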
The single biggest GAN insight: a UNet trained at 512 but run at 256 is badly blurry — and not because of the lower resolution itself. A fully-convolutional UNet is resolution-flexible but not resolution-invariant: its BatchNorm statistics, receptive-field scale and decoder are tuned to 512-pixel spacing, so at 256 it synthesizes low-frequency mush. Distilling natively at 256 removed the mismatch — measured ~5–6× sharper (Laplacian variance), beating even a clean mipmap of the 512 output. A later check found the all-256 pipeline landed within 3.8% of the 512-teacher pipeline at one-third the training cost: student capacity, not teacher resolution, is the ceiling.
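Laplacian variance, the sharpness measure quoted above, can be sketched in a few lines. This is a plain-Python version on nested lists, assuming a single-channel image and the 4-neighbour Laplacian kernel; the actual measurement may have used a different kernel or library. The box blur exists only to produce a blurry comparison image.

```python
def laplacian_variance(img):
    """Variance of the 4-neighbour Laplacian over interior pixels of a 2D list."""
    h, w = len(img), len(img[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1]
                   - 4 * img[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def box_blur(img):
    """3x3 box blur with edge clamping, used here only to make a blurry copy."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            out[y][x] = total / 9.0
    return out


# Synthetic test pattern: a checkerboard of 2x2 blocks, and a blurred copy.
sharp = [[float((x // 2 + y // 2) % 2) for x in range(16)] for y in range(16)]
blurry = box_blur(sharp)
```

A higher value means more high-frequency detail, so the ratio between two outputs gives a "how many times sharper" figure of the kind reported above.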
Speaking mouths were the last weak spot, because the pose synthesis had frozen expression. The fix arrived in three steps, and it's a clean demonstration of where a fix has to land:
open_lip/laugh, later talking.pkl real speech selected by farthest-point sampling) onto each hero. The teacher learned mouths almost perfectly (0.0070) but the student barely moved (0.0411): a distillation capacity limit, not a data limit.

| Checkpoint | Arch | Params | Change | Headline |
|---|---|---|---|---|
| landmarks2face | unet_256 | 54.4M | photoreal landmarks→face | LPIPS 0.019 |
| landmarks2gtaface | unet_256 | 54.4M | + GTA stylization (teacher) | LPIPS 0.034 |
| …v2_mobile_distill | mobile_256 | 2.45M | first cyberpunk student | — |
| …v3_pix2pix | unet_256 | 217M | + LivePortrait pose synth, 512px | — |
| …v3_native256 | mobile_256 | 2.45M | 256-native drop-in | pose 0.061 |
| …v4_mobile_distill | mobile_256 | 2.45M | mouth data only | mouth 0.0411 |
| …v5_mouthw | mobile_256 | 2.45M | + mouth-weighted loss | mouth 0.0286 |
| …v5_mouthw_ganfree (v6) | mobile_256 | 2.45M | + GAN-free fine-tune | mouth 0.0234 |
| …speech_ngf16_ganfree SHIP | mobile_256 | 2.45M | + speech visemes, drop-in | speech 0.0190 |
| …speech_ngf24_ganfree | mobile_256 | 5.51M | capacity upgrade (not drop-in) | speech 0.0164 |
LPIPS vs target, lower = better. Pose-extreme dropped v2 0.153 → ngf24 0.0467 (teacher ceiling 0.037); mouth-region v3 0.0416 → v6 0.0234 (teacher ceiling 0.0070).
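The mouth-weighted loss behind the v5_mouthw row can be sketched as a weighted L1 term. This covers only the L1 part of the L1 + VGG distillation loss. The 5× weight is the figure quoted elsewhere in this post; the binary mask and the normalisation by total weight are assumptions of this sketch.

```python
def mouth_weighted_l1(student, teacher, mouth_mask, mouth_weight=5.0):
    """Weighted mean absolute error; mouth pixels count mouth_weight times."""
    total, weight_sum = 0.0, 0.0
    for s_row, t_row, m_row in zip(student, teacher, mouth_mask):
        for s, t, m in zip(s_row, t_row, m_row):
            w = mouth_weight if m else 1.0
            total += w * abs(s - t)
            weight_sum += w
    return total / weight_sum


# Toy 2x2 example: the same 0.1 error placed inside or outside the mouth mask.
teacher = [[0.0, 0.0], [0.0, 0.0]]
mask = [[1, 0], [0, 0]]
err_in_mouth = mouth_weighted_l1([[0.1, 0.0], [0.0, 0.0]], teacher, mask)
err_elsewhere = mouth_weighted_l1([[0.0, 0.1], [0.0, 0.0]], teacher, mask)
```

The same error costs five times more when it lands on the mouth, which is what pushes a small student to spend its limited capacity there.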
Repo 2 · RTUKChatBot
A single static HTML page, no build step. The interesting work was getting a quantized CNN to run correctly and fast across three execution backends, keeping a per-frame loop allocation-free, and layering a believable animation rig on top.
On startup the page detects which backends the browser actually exposes and tries them best-first:
WebNN (navigator.ml) maps ops to CoreML on Apple and DirectML on Windows. For a UNet-style CNN this is typically 3–10× faster than WebGPU because CoreML can route eligible ops to the Neural Engine.
WebGPU (navigator.gpu) is the universal fallback, configured powerPreference:"high-performance" to pick the discrete GPU.
◆ The INT8 saga
ORT-web's WebGPU execution provider saturates quantized UNets — broken across QDQ/QOperator and Conv/ConvTranspose op sets (blank-white output). The same INT8 model runs correctly and faster on the WASM EP (~441 ms vs ~1283 ms on the test rig). One more trap: int32 bias DequantizeLinear nodes carry an all-zero zero_point that the WebGPU EP rejects — it has to be stripped at export time. Lesson: route INT8 to WASM; keep FP16 on WebGPU.
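The routing rule can be written down as a small decision function. This is a language-neutral sketch in Python, not the page's actual JavaScript; `choose_backend` and its string labels are hypothetical. Sending INT8 to WASM even when WebNN is present is an assumption, since the text only states that INT8 must stay off the WebGPU EP.

```python
def choose_backend(has_webnn, has_webgpu, precision):
    """Pick an execution backend label for a model of the given precision.

    precision is "fp16" or "int8". Float models try WebNN, then WebGPU,
    then WASM; int8 models always go to WASM in this sketch.
    """
    if precision not in ("fp16", "int8"):
        raise ValueError(f"unknown precision: {precision!r}")
    if precision == "int8":
        return "wasm"          # quantized UNets are broken on the WebGPU EP
    if has_webnn:
        return "webnn"
    if has_webgpu:
        return "webgpu"
    return "wasm"
```

In a browser the two booleans would come from feature detection on `navigator.ml` and `navigator.gpu`.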
A 256² generator output benefits enormously from a good upscale. The chain went bicubic → a B-spline 4-tap filter with linear-light filtering and CAS-style adaptive sharpening — sharpen in linear light, strong on skin, gentle near edges.
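Why linear light matters is easiest to see with one blend. The sketch below uses the standard sRGB transfer functions to compare a naive midpoint with a linear-light midpoint of black and white. It only illustrates the colour-space step, not the B-spline filter or the sharpening pass, and `mix_linear` is a hypothetical helper.

```python
def srgb_to_linear(c):
    """Decode one sRGB channel value in [0, 1] to linear light."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def linear_to_srgb(l):
    """Encode one linear-light channel value in [0, 1] back to sRGB."""
    return 12.92 * l if l <= 0.0031308 else 1.055 * l ** (1 / 2.4) - 0.055


def mix_linear(a, b, t=0.5):
    """Interpolate two sRGB values in linear light, then re-encode."""
    la, lb = srgb_to_linear(a), srgb_to_linear(b)
    return linear_to_srgb((1 - t) * la + t * lb)


naive_mid = (0.0 + 1.0) / 2          # blending the encoded values directly
linear_mid = mix_linear(0.0, 1.0)    # blending the actual light intensities
```

The linear-light midpoint encodes to roughly 0.735 rather than 0.5, so filtering on encoded values darkens edges and fine detail. The same reasoning applies to the sharpening step.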
Once the resolution lesson was internalized, the app ships two natively-trained models: 256 (the mobile default — bandwidth + perf) and a higher-quality 512 (desktop default; ?res=512 opt-in on mobile). Both run at the resolution they were trained at — no mismatch blur.
Multi-threaded WASM needs SharedArrayBuffer, which needs crossOriginIsolated == true. The headers set COOP: same-origin + COEP: credentialless. credentialless (rather than require-corp) was the key choice: MediaPipe's .task model is served without a CORP header and would be blocked by require-corp; credentialless allows no-credential cross-origin loads while still unlocking threads. Browsers without it (Safari < 17.4) degrade gracefully to single-threaded.
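For local testing, the same two headers can be attached with a few lines of standard-library Python. This is a hypothetical dev-server sketch, not the production hosting configuration, which is not shown here. `IsolatedHandler` and `serve` are names invented for the example.

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

# The two response headers that make crossOriginIsolated true.
ISOLATION_HEADERS = {
    "Cross-Origin-Opener-Policy": "same-origin",
    "Cross-Origin-Embedder-Policy": "credentialless",
}


class IsolatedHandler(SimpleHTTPRequestHandler):
    """Static file handler that adds the isolation headers to every response."""

    def end_headers(self):
        for name, value in ISOLATION_HEADERS.items():
            self.send_header(name, value)
        super().end_headers()


def serve(port=8000):
    """Serve the current directory on localhost (blocks until interrupted)."""
    ThreadingHTTPServer(("127.0.0.1", port), IsolatedHandler).serve_forever()
```

Calling `serve()` from the project directory and checking `crossOriginIsolated` in the browser console is a quick way to confirm threads are unlocked.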
Beyond raw webcam tracking, the landmarks can be driven procedurally: automatic blink (~15/min), idle head-sway, segmented emotion presets, and lip-sync (from an audio file, the mic, or typed text spoken by TTS). Every channel is additive and applied in canonical template space after Procrustes alignment.
◆ Procrustes both ways
To animate in a clean canonical space and keep the subject's real head pose, the deltas are applied in template space, then the Procrustes transform is inverted back to the live camera frame. Because inverse(forward(src)) = src, the real pose/scale/position is recovered exactly while template-space deltas ride along, correctly rotated, as (1/s)·Rᵀ·d. Head-sway is a rigid rotation and must be the last 3D step — applied before alignment it would be cancelled out.
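The round trip is easy to verify numerically. The sketch below assumes the forward transform has the form q = s·R·p + t with column vectors. The transform values, the landmark and the delta are all made up for illustration.

```python
import math


def mat_vec(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def transpose(m):
    return [[m[j][i] for j in range(3)] for i in range(3)]


def forward(p, s, R, t):
    """Camera space -> template space: s * R @ p + t."""
    rp = mat_vec(R, p)
    return [s * rp[i] + t[i] for i in range(3)]


def inverse(q, s, R, t):
    """Template space -> camera space: (1/s) * R^T @ (q - t)."""
    shifted = [q[i] - t[i] for i in range(3)]
    return [x / s for x in mat_vec(transpose(R), shifted)]


# Hypothetical alignment: 30 degree yaw, scale 2, small translation.
angle = math.radians(30)
R = [[math.cos(angle), 0.0, math.sin(angle)],
     [0.0, 1.0, 0.0],
     [-math.sin(angle), 0.0, math.cos(angle)]]
s, t = 2.0, [0.1, -0.2, 0.3]

src = [0.4, 0.5, 0.6]        # a landmark in the live camera frame
delta = [0.0, -0.05, 0.0]    # e.g. a jaw-open offset in template space

aligned = forward(src, s, R, t)
animated = [aligned[i] + delta[i] for i in range(3)]
back = inverse(animated, s, R, t)

# Closed form from the text: src + (1/s) * R^T @ delta
expected = [src[i] + x / s for i, x in enumerate(mat_vec(transpose(R), delta))]
```

`back` matches `expected` to floating-point precision, and with a zero delta the original landmark comes back unchanged.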
A later Chat tab turns the face into a local-first conversational avatar: RAG (BM25 + MiniLM embeddings) feeds an on-device LLM (WebLLM), the reply is spoken by TTS, and the speech audio simultaneously drives the mouth — each TTS WAV is fed to an analyser that maps RMS→jaw-open while the text schedules the viseme shapes. Speech-to-text is Whisper (or the native API). The whole thing is DOM-decoupled so it bolts onto the existing render loop without touching the original lab tool. Nothing is sent to a server.
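The RMS→jaw-open mapping might look roughly like this. It is a sketch only: the linear mapping, the `floor` and `ceiling` thresholds and the function names are assumptions, and the real page works on analyser output in the browser rather than on Python lists.

```python
import math


def rms(samples):
    """Root-mean-square level of one audio window (samples in [-1, 1])."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def jaw_open_from_rms(samples, floor=0.02, ceiling=0.3):
    """Map window RMS linearly from [floor, ceiling] to jaw-open in [0, 1]."""
    level = (rms(samples) - floor) / (ceiling - floor)
    return min(1.0, max(0.0, level))


# Synthetic windows: silence, a quiet tone and a loud tone (one full period each).
silence = [0.0] * 64
quiet = [0.2 * math.sin(2 * math.pi * k / 64) for k in range(64)]
loud = [0.5 * math.sin(2 * math.pi * k / 64) for k in range(64)]
```

The noise floor keeps the mouth closed during silence, and the clamp stops loud passages from over-opening the jaw. The viseme shapes scheduled from the text would then be layered on top of this opening amount.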
The hard part · iOS & Android
"Works on my desktop" is the easy 80%. Mobile browsers — iOS Safari especially — impose a tight memory budget, gate audio behind user gestures, and vary wildly in GPU throughput. Most of the platform-specific engineering lives here.
getUserMedia simply won't fire on iOS over plain HTTP, so dev uses an ngrok/mkcert HTTPS tunnel.
iOS only unlocks an AudioContext from inside a tap handler. The context was being created after the async reply, so it stayed suspended: no audio and no mouth movement (the visemes are driven by analysing the playing audio). Fixed by unlocking on a user gesture and adding a per-answer Play button; avatar audio plays via a plain HTMLAudioElement.
performance.memory is unavailable on Safari, so the live memory readout shows mem n/a there.
adb + Playwright connectOverCDP for real WebGPU/chat repro on actual hardware, not just an emulator.
performance.memory is exposed on Chrome/Android, so the live heap readout and the leak-trend probe both work there, which is invaluable for catching the kind of per-frame leak that silently kills iOS.
◆ The allocation-disciplined hot loop (why iOS stopped crashing)
The inference loop is engineered to run indefinitely on a memory-constrained browser: output tensors are disposed every frame (ORT-web doesn't auto-free the backend GPU/WASM memory they hold); the input tensor is reused — one ort.Tensor over a persistent buffer, mutated in place; and the landmark canvas readback is zero-alloc, reading through a reused WebGL2 buffer instead of getImageData() (which allocates a 256 KB ImageData per call). A live memory readout made the original leak obvious; a headless probe now asserts the JS heap stays flat over time.
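The arithmetic behind the 256 KB figure, and the reuse pattern itself, can be modelled in a few lines. Python stands in for the page's JavaScript here, and `FrameReader` is a hypothetical class that only illustrates "one persistent buffer, refilled in place".

```python
WIDTH = HEIGHT = 256
BYTES_PER_PIXEL = 4                                # RGBA, 8 bits per channel
FRAME_BYTES = WIDTH * HEIGHT * BYTES_PER_PIXEL     # 262144 bytes = 256 KiB


def garbage_per_second(fps, frame_bytes=FRAME_BYTES):
    """Bytes of short-lived allocation per second if every frame allocates anew."""
    return fps * frame_bytes


class FrameReader:
    """Owns one persistent buffer and refills it in place on every frame."""

    def __init__(self, size=FRAME_BYTES):
        self.buffer = bytearray(size)

    def read_into(self, source):
        if len(source) != len(self.buffer):
            raise ValueError("frame size changed")
        self.buffer[:] = source      # same-length slice assignment reuses the buffer
        return self.buffer


reader = FrameReader()
first = reader.read_into(bytes(FRAME_BYTES))
second = reader.read_into(b"\x01" * FRAME_BYTES)
```

At 20 fps a fresh allocation per frame would churn 5 MiB every second, while the reader above hands back the same object each time.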
Mobile runs on tiers — weak → 10 fps, mid → 20, strong → 40 — all on the single 256 model, defaulting to a fixed safe MID that real mid-range phones sustain. A measured auto-tune (measure real per-frame cost, then step down on sustained shortfall) is opt-in, because measuring during init can mis-read a capable device. A separate temporal weave path runs four distilled 128×128 generators round-robin (one per frame), weaving each into a 2×2 sub-pixel phase of a persistent 256 buffer — temporal supersampling at roughly ¼ the per-frame GAN cost.
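The weave can be sketched with a tiny buffer. Here N = 2 stands in for 128, and each fake generator output is filled with its own phase index so the interleaving is visible. The phase-to-offset ordering is an assumption; the text only says each generator owns one 2×2 sub-pixel phase.

```python
def weave(buffer, sub, phase):
    """Write an N x N sub-image into one 2x2 sub-pixel phase of a 2N x 2N buffer."""
    # phase 0..3 -> (dy, dx) = (0,0), (0,1), (1,0), (1,1); ordering assumed.
    dy, dx = divmod(phase, 2)
    for y, row in enumerate(sub):
        for x, value in enumerate(row):
            buffer[2 * y + dy][2 * x + dx] = value


N = 2                                              # stands in for 128
buffer = [[None] * (2 * N) for _ in range(2 * N)]  # persistent output buffer

for frame in range(4):
    phase = frame % 4                       # round-robin: one generator per frame
    sub = [[phase] * N for _ in range(N)]   # fake generator output, tagged by phase
    weave(buffer, sub, phase)
```

After four frames every pixel of the buffer has been refreshed exactly once, and each frame only paid for one quarter-resolution forward pass.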
⚠ The deploy trap that only bites previews
The R2 model bucket's CORS only allows the production origin, so *.pages.dev preview deploys can't fetch the GAN at all. The only way to test the real model path is to deploy to main — a gotcha worth knowing before burning an afternoon on a "broken" preview.
Repo 3 · nativeGAN
onnxruntime-web is a big dependency, and its WebGPU EP couldn't run the quantized model. So the inference engine was rebuilt from scratch: parse the ONNX graph, run it as hand-written WGSL compute shaders, record the whole forward pass into a single command buffer, submit once per frame. The same engine runs on Windows native and in the browser.
An offline tool (export_weights.py) turns an int8 QDQ ONNX file into a flat model.json + weights.bin. The engine reads that and builds a GPU program: int8 convolution over packed activations/weights (a packed 4×int8 dot product), inputs pre-quantized once, liveness-based buffer reuse, and bind groups built once and recorded cheaply each frame. WGSL is shared verbatim between the native and web builds.
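The packed 4×int8 dot product at the core of that convolution can be emulated on the CPU to show what the shader computes per word. WGSL offers a builtin of this shape (dot4I8Packed); whether nativeGAN calls it or unpacks manually is not stated, so treat this as an illustration. The byte order and helper names are assumptions.

```python
def pack4_i8(values):
    """Pack four signed 8-bit ints into one unsigned 32-bit word, lowest byte first."""
    if len(values) != 4 or not all(-128 <= v <= 127 for v in values):
        raise ValueError("need exactly four values in [-128, 127]")
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack_i8(word, i):
    """Extract byte i of a 32-bit word and sign-extend it."""
    b = (word >> (8 * i)) & 0xFF
    return b - 256 if b >= 128 else b


def dot4_i8_packed(a, b):
    """Dot product of two packed 4 x int8 words, accumulated as a plain integer."""
    return sum(unpack_i8(a, i) * unpack_i8(b, i) for i in range(4))


activations = pack4_i8([1, -2, 3, -4])
weights = pack4_i8([5, 6, -7, 8])
acc = dot4_i8_packed(activations, weights)   # 5 - 12 - 21 - 32
```

Packing four values per 32-bit word cuts memory traffic to a quarter of an unpacked layout, and the products accumulate into a wider integer so the sum cannot overflow int8.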
| Milestone | What | State |
|---|---|---|
| M0 | Offline int8 export + onnxruntime oracle | done |
| M1 | wgpu int8 engine — bit-exact vs ORT, ~2.1 ms/frame | done |
| M2 | Windows native video (Media Foundation) → live side-by-side | done |
| M3 | Web / WebGPU — browser-validated via Playwright | done |
| MT | Temporal acceleration (N-phase ensemble, optical-flow reproject) | planned |
| M4 | Mobile (iOS / Android via cargo-mobile2) | planned |
◆ How it folds back into the web app
The engine's WGSL was ported to vanilla JS as scripts/native_gan.js and is now the default inference backend on mobile — fully GPU-resident, feeding the same bicubic display and temporal weave, and the fastest path there. Desktop keeps onnxruntime; ?backend=native / ?backend=ort force an A/B. Three repos, one shipping result.
If you take six things away
A fully-convolutional UNet is resolution-flexible, not resolution-invariant. Run a 512-trained net at 256 and you get low-frequency mush — fix it by distilling at the deployment resolution, not by upscaling.
Parametric LivePortrait poses and transplanted viseme deltas bought generalization the real recordings never could — head turns and speaking mouths the camera never captured.
Adding mouth data fixed the teacher (0.0070) but not the 2.45M student (0.0411). A 5× mouth-weighted distillation loss is what pushed the capability into the tiny model.
WebNN→WebGPU→WASM, chosen from what the browser actually exposes. Quantized UNets are broken on ORT-web's WebGPU EP — they belong on WASM (correct and faster).
iOS OOM-reloads on per-frame allocations and refuses audio outside a tap. Reuse every buffer, dispose tensors each frame, unlock audio on gesture, and default to a zero-model-memory voice.
When the off-the-shelf runtime can't run your model, parsing the ONNX into your own WGSL is a viable answer — and it became the fastest mobile path in the product.
Two further notes that cost real time: the RTX Blackwell (sm_120) needs PyTorch cu128; cu121 builds fail silently, so an entire model version was trained on the wrong GPU before anyone noticed. And a tiny student is easily data-loading bound: pin_memory + persistent_workers + an in-RAM cache cut epoch time from 36 s to 12 s.