ZeroDiffusion: Clean zero terminal SNR models
Training 1.5 base model + Experimental inpainting model
https://huggingface.co/drhead/ZeroDiffusion

These models are intended to help researchers and finetuners who want to build zero terminal SNR models. Adapting a model to zero terminal SNR and v-prediction can take quite a while to converge, and some of these findings suggest that leaving the text encoder unfrozen during that period can have undesired effects, so this model should help people avoid those issues. The author also found that zero terminal SNR models have good potential for task-specific training, and hopes to see more people attempt it.
Training parameters:
- The model is resumed from Stable Diffusion 1.5's EMA weights, using v-prediction and a zero terminal SNR noise schedule.
- It is trained with min-SNR-gamma 5 loss weighting, which corrects the timestep loss imbalance and speeds up convergence.
- It is trained with input perturbation (IP) noise at gamma 0.1, for regularization and faster convergence.
- It is trained with multi-resolution bucketing, covering 5 aspect ratios with roughly the same pixel count as 512x512 images (640x384 being the tallest/widest bucket).
- Caption dropout of 0.1 was applied, similarly to Stable Diffusion 1.5's original training.
- EMA weights were trained with a decay value of 0.9999, initialized from the same SD1.5 EMA weights. The author recommends using the EMA weights for inference, as their outputs tend to be much higher quality.
- The text encoder was frozen during training.
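For reference, a zero terminal SNR schedule is typically obtained by rescaling an existing beta schedule so that sqrt(alpha_bar) reaches exactly zero at the final timestep, following Lin et al.'s "Common Diffusion Noise Schedules and Sample Steps Are Flawed". The model card itself contains no code; this is a minimal numpy sketch, starting from SD1.5's scaled_linear schedule:

```python
import numpy as np

def rescale_zero_terminal_snr(betas):
    """Rescale a beta schedule so the final timestep has exactly zero SNR.
    Works on sqrt(alpha_bar) so the schedule's overall shape is preserved
    while the endpoint is shifted and scaled to hit zero."""
    alphas = 1.0 - betas
    alphas_bar_sqrt = np.sqrt(np.cumprod(alphas))
    a0, aT = alphas_bar_sqrt[0], alphas_bar_sqrt[-1]
    # shift so the last value is 0, then rescale so the first is unchanged
    alphas_bar_sqrt = (alphas_bar_sqrt - aT) * a0 / (a0 - aT)
    alphas_bar = alphas_bar_sqrt ** 2
    # recover per-step alphas from the cumulative product
    alphas = np.concatenate([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
    return 1.0 - alphas

# SD1.5's "scaled_linear" beta schedule as the starting point
betas = np.linspace(0.00085 ** 0.5, 0.012 ** 0.5, 1000) ** 2
zsnr_betas = rescale_zero_terminal_snr(betas)
```

With the rescaled schedule, alpha_bar at the final timestep is exactly zero (the last beta becomes 1.0), which is why v-prediction is required: an epsilon objective is degenerate at a pure-noise timestep.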
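The min-SNR and IP noise terms enter the training step roughly as follows. This is a hedged numpy sketch of the standard formulas, not code from this training run: min(SNR, gamma) / (SNR + 1) is the usual min-SNR weight for v-prediction targets, and input perturbation adds extra Gaussian noise to the noise used to corrupt the latents:

```python
import numpy as np

def min_snr_v_weight(alphas_cumprod, timesteps, gamma=5.0):
    """Per-sample loss weight for v-prediction with min-SNR-gamma."""
    abar = alphas_cumprod[timesteps]
    snr = abar / (1.0 - abar)  # SNR(t) = alpha_bar / (1 - alpha_bar)
    return np.minimum(snr, gamma) / (snr + 1.0)

def perturbed_noise(noise, rng, ip_gamma=0.1):
    """Input perturbation: the latents are corrupted with this slightly
    perturbed noise, while the loss target is built from the clean noise."""
    return noise + ip_gamma * rng.standard_normal(noise.shape)
```

The gamma clamp keeps the many high-SNR (low-noise) timesteps from dominating the loss, which is the timestep bias correction the bullet above refers to.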
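The exact bucket list isn't given in the card, but a constant-pixel-count scheme matching the description (5 aspect ratios near 512x512's pixel count, topping out at 640x384 and 384x640, sides in multiples of 64) can be derived like this; the helper names are illustrative:

```python
def make_buckets(base=512, step=64, min_dim=384, max_dim=640):
    """Enumerate (width, height) buckets whose pixel count stays close
    to base*base, with both sides multiples of `step`."""
    target = base * base
    buckets = []
    for w in range(min_dim, max_dim + step, step):
        h = round(target / w / step) * step   # nearest multiple of step
        h = max(min_dim, min(max_dim, h))     # clamp to the allowed range
        buckets.append((w, h))
    return buckets

def nearest_bucket(width, height, buckets):
    """Assign an image to the bucket with the closest aspect ratio."""
    ar = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ar))
```

This yields 384x640, 448x576, 512x512, 576x448, and 640x384, all within about 6% of 512x512's pixel count.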
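The EMA bookkeeping is the usual exponential moving average over parameters. A minimal sketch (with decay 0.9999, the shadow weights move only 0.01% toward the online weights each step, which is why they are initialized from the same starting weights rather than from zero):

```python
import numpy as np

def ema_update(ema_params, params, decay=0.9999):
    """One EMA step over a dict of parameter arrays:
    ema <- decay * ema + (1 - decay) * online."""
    for name, p in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * p
```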
Training dataset:
- The dataset used is a subset of high-resolution images (defined as having the shortest side >= 1024px) from LAION Aesthetics v2 5+.
- No filtering was applied for pwatermark or punsafe, so the model may be biased towards unsafe or watermarked images. Including "watermark" in a negative prompt is highly effective at removing watermarks.
- LAION Aesthetics v2 5+ has a very strong bias towards product images, and this is likely to show up in outputs, especially given the caption dropout.

