TryOnDiffusion: A Tale of Two UNets
Generates photorealistic visualizations of garments on individuals while accommodating significant body pose and shape changes
TryOnDiffusion operates on the principle of preserving garment details while accommodating significant changes in body pose and shape. Previous approaches tend to do one or the other: they either preserve garment detail but struggle with large pose and shape changes, or handle the desired pose and shape but lose garment detail. TryOnDiffusion unifies both objectives within a single network architecture.
TryOnDiffusion's Parallel-UNet architecture comprises two UNets – one dedicated to the person and the other to the garment. These UNets run in parallel, with the garment implicitly warped via a cross-attention mechanism to accommodate pose and body changes. This unified approach ensures that garment warping and person blending occur as a single process, rather than as separate tasks.
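The implicit warping idea can be illustrated with a minimal single-head cross-attention sketch: person-UNet features act as queries, garment-UNet features as keys and values, so garment detail is pulled toward the person's pose without an explicit warp field. This is a simplified illustration, not the paper's implementation; all names and shapes here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(person_feats, garment_feats):
    """Single-head cross-attention: person features are the queries,
    garment features the keys/values, so each person location gathers
    garment detail appropriate to its pose (an implicit warp)."""
    d = person_feats.shape[-1]
    scores = person_feats @ garment_feats.T / np.sqrt(d)  # (Np, Ng)
    weights = softmax(scores, axis=-1)                    # attention over garment locations
    return weights @ garment_feats                        # (Np, d) "warped" garment features

# toy flattened feature maps: 16 person tokens, 24 garment tokens, 8 channels
rng = np.random.default_rng(0)
person = rng.standard_normal((16, 8))
garment = rng.standard_normal((24, 8))
out = cross_attend(person, garment)
print(out.shape)  # (16, 8)
```

In the real model this exchange happens inside the UNet blocks at multiple scales, with learned query/key/value projections; the sketch omits those projections for brevity.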
The pipeline of TryOnDiffusion begins with preprocessing, where the target person and garment are segmented and pose is computed. These inputs are then fed into the 128×128 Parallel-UNet to create a try-on image, which is subsequently upscaled and refined using the 256×256 Parallel-UNet and standard super resolution diffusion techniques, resulting in a high-resolution visualization of the garment on the individual.
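The cascade above can be sketched as a few composed stages. The stage functions below are placeholders (the real stages are diffusion models), and the final upscaling factor assumes the 1024×1024 output resolution reported for the method; treat the specifics as illustrative.

```python
import numpy as np

def upscale(img, factor):
    """Nearest-neighbour upsampling as a stand-in for a diffusion
    super-resolution stage."""
    return img.repeat(factor, axis=0).repeat(factor, axis=1)

def parallel_unet(person_inputs, garment_inputs, size):
    """Placeholder for a Parallel-UNet stage: consumes segmented person,
    segmented garment, and computed poses; emits a try-on image at `size`."""
    return np.zeros((size, size, 3))

def tryon_pipeline(person_inputs, garment_inputs):
    # stage 1: base try-on at 128x128
    base = parallel_unet(person_inputs, garment_inputs, size=128)
    # stage 2: refine at 256x256, conditioned on the upscaled base result
    refined = parallel_unet(upscale(base, 2), garment_inputs, size=256)
    # stage 3: standard super-resolution diffusion (assumed 256 -> 1024)
    final = upscale(refined, 4)
    return base, refined, final

base, refined, final = tryon_pipeline(None, None)
print(base.shape, refined.shape, final.shape)
```

The point of the cascade is that the hard garment/person reasoning happens at low resolution, where attention is cheap, while later stages only add detail.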
TryOnDiffusion also incorporates pose embeddings, which are computed separately for the person and garment poses and fused into the UNets using an attention mechanism. Additionally, feature-wise linear modulation (FiLM) is employed across all scales to further enhance the synthesis process.
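FiLM itself is a simple operation: conditioning information (here, derived from the pose embeddings) produces a per-channel scale `gamma` and shift `beta` that modulate a feature map. A minimal sketch, with randomly generated stand-ins for the conditioning-derived parameters:

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift every channel of
    an HxWxC feature map with conditioning-derived parameters."""
    return gamma[None, None, :] * features + beta[None, None, :]

rng = np.random.default_rng(1)
feats = rng.standard_normal((8, 8, 4))  # toy HxWxC feature map
gamma = rng.standard_normal(4)          # per-channel scale (from conditioning)
beta = rng.standard_normal(4)           # per-channel shift (from conditioning)
mod = film(feats, gamma, beta)
print(mod.shape)  # (8, 8, 4)
```

In the full network, `gamma` and `beta` would come from a small learned projection of the conditioning signal, applied at every scale of the UNet.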