We introduce FiCA, a Feed-forward, instant Gaussian Codec Avatar generation pipeline that creates lifelike avatars from a single portrait image. Generating a photorealistic and drivable avatar from a single image is highly challenging, because one view provides limited visual information from which to infer the 3D appearance and geometry of a human head. To address this, we develop a novel system that combines human-centric vision foundation models with a diffusion model and is designed to fully exploit partial visual observations. Our proposed diffusion model learns a generative mapping from these partial observations to a complete and authentic 3D mesh reconstruction. Additionally, we introduce a feed-forward mesh refinement network that enhances the fidelity and identity preservation of the generated avatars, eliminating the need for person-specific test-time optimization. By leveraging a universal prior model that decodes a generated mesh into a set of 3D Gaussians, we generate a photorealistic 3D Gaussian avatar that can be driven with novel expressions in real time. Our experiments demonstrate that the avatars generated by our feed-forward approach faithfully represent diverse identities and surpass the visual quality of avatars produced by recent competing methods.
FiCA first uses fine-tuned Sapiens models to estimate per-pixel UV coordinates, vertex coordinates, and normals, and unwraps the input into partial RGB, visibility-mask, normal, and vertex-coordinate maps in UV space. Next, our UV-space diffusion model takes these partial UV maps, the input image's CLIP embedding, and random noise, and generates complete texture and geometry UV maps. The learned UV refinement network then takes the generated texture and geometry as input, conditioned on rich visual features of the rendered mesh image and the input image, and performs feed-forward texture and geometry refinement. Finally, the universal prior model takes expression codes, the mesh texture, and the geometry as inputs to generate Gaussian Codec Avatars that can be driven in real time.
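The unwrapping step in the first stage can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `unwrap_to_uv`, the `[0, 1]` UV range, the nearest-texel scatter, and the averaging of pixels that land on the same texel are all assumptions made for illustration. Given per-pixel UV predictions and a foreground mask, it scatters image values into a UV-space map and returns the visibility mask of the texels that received at least one pixel. The same routine could be applied to normals or vertex coordinates in place of RGB.

```python
import numpy as np


def unwrap_to_uv(image, uv, valid, uv_size=256):
    """Scatter per-pixel values into a partial UV map (hypothetical helper).

    image: (H, W, C) per-pixel values, e.g. RGB, normals, or vertex coordinates.
    uv:    (H, W, 2) predicted UV coordinates, assumed to lie in [0, 1].
    valid: (H, W) boolean mask of pixels that have a usable UV prediction.
    Returns (partial_map, visibility) with shapes (S, S, C) and (S, S).
    """
    channels = image.shape[-1]
    values = image[valid].astype(np.float64)  # (N, C)
    coords = uv[valid]                        # (N, 2)

    # Convert continuous UV coordinates to integer texel indices.
    cols = np.clip(np.round(coords[:, 0] * (uv_size - 1)), 0, uv_size - 1).astype(np.int64)
    rows = np.clip(np.round(coords[:, 1] * (uv_size - 1)), 0, uv_size - 1).astype(np.int64)
    flat = rows * uv_size + cols

    # Accumulate sums and hit counts per texel; np.add.at handles repeated indices.
    sums = np.zeros((uv_size * uv_size, channels))
    counts = np.zeros(uv_size * uv_size)
    np.add.at(sums, flat, values)
    np.add.at(counts, flat, 1.0)

    # Texels hit by at least one pixel are visible; the rest stay zero.
    visible = counts > 0
    partial = np.zeros_like(sums)
    partial[visible] = sums[visible] / counts[visible][:, None]
    return partial.reshape(uv_size, uv_size, channels), visible.reshape(uv_size, uv_size)
```

Texels left unobserved by this scatter are the regions that a generative model such as the UV-space diffusion model described above would need to complete.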
FiCA can generate high-quality Gaussian avatars from images casually captured on mobile phones. Avatars generated by FiCA can plausibly represent the subject, capturing details such as hairstyle and skin tones. Additionally, the generated avatars can be animated consistently across different identities while preserving user-specific facial details.
Given an input portrait image and a driving expression from the same person, we visualize the animation results of the avatars generated by each method. While the competing methods struggle to express detailed or extreme facial movements, FiCA shows superior avatar reconstruction quality.
As an exciting future application, FiCA can lift a stylized portrait image into 3D space within a few seconds. However, FiCA has never been trained on such stylized or synthetic images, so future research on domain generalization would be an interesting direction.
@article{youwang2026fica,
title = {FiCA: Feed-forward instant Gaussian Codec Avatars from a Single Portrait Image},
author = {Youwang, Kim and Yang, Zhengyu and Ge, Liuhao and Rong, Yu and Bagautdinov, Timur and Zhaoen, Su and Sopher, Nir and Popovi\'c, Jovan and Deng, Teng and Oh, Tae-Hyun and Cao, Chen},
journal = {arXiv preprint, xxxx.xxxxx},
year = {2026},
}