Cartoon Hallucinations Detection: Pose-aware In Context Visual Learning
Detects visual hallucinations in cartoon character images by leveraging pose-aware in-context visual learning with VLMs
To address visual hallucinations in Text-to-Image (TTI) models, a method has been proposed to improve detection accuracy, particularly for non-photorealistic styles such as cartoon characters. This approach, pose-aware in-context visual learning (PA-ICVL), pairs pose information with RGB images in the in-context history supplied to a Vision-Language Model (VLM), enabling more precise identification of visual defects.
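To make the in-context setup concrete, the sketch below assembles labeled RGB/pose example pairs followed by a query pair into a single multimodal prompt. The message structure, field names, and the build_icvl_prompt helper are illustrative assumptions, not the paper's implementation.

```python
import base64
from pathlib import Path


def encode_image(path: str) -> str:
    """Base64-encode an image file for inclusion in a multimodal prompt."""
    return base64.b64encode(Path(path).read_bytes()).decode("utf-8")


def build_icvl_prompt(examples, query_rgb, query_pose):
    """Assemble an in-context prompt: each labeled example pairs an RGB
    image with its pose visualization, followed by the unlabeled query pair."""
    content = [{"type": "text",
                "text": "You will see cartoon character images paired with "
                        "their estimated poses. Decide whether each image "
                        "contains a visual hallucination (e.g. missing limbs "
                        "or an abnormal number of limbs)."}]
    for ex in examples:
        content += [
            {"type": "image", "data": encode_image(ex["rgb"])},
            {"type": "image", "data": encode_image(ex["pose"])},
            {"type": "text", "text": f"Label: {ex['label']}"},
        ]
    # Query pair: the VLM is asked to complete the label.
    content += [
        {"type": "image", "data": encode_image(query_rgb)},
        {"type": "image", "data": encode_image(query_pose)},
        {"type": "text", "text": "Label:"},
    ]
    return content
```

The returned list would then be passed to whatever multimodal chat interface the VLM exposes; the exact message schema differs between providers.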
By incorporating pose guidance from a fine-tuned pose estimator, the system significantly improves the VLM's ability to detect visual hallucinations in TTI-generated images. Unlike baselines that rely on RGB images alone, the added pose cue supports more accurate decisions and yields a marked improvement in detection performance.
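One simple way to surface the pose cue is to draw the estimated skeleton on top of the RGB image before handing it to the VLM. The sketch below assumes COCO-style keypoints given as (x, y, confidence) tuples and a 0.3 confidence threshold; both the skeleton definition and the rendering choice are assumptions for illustration.

```python
from PIL import Image, ImageDraw

# Assumed COCO-style keypoint indices connected into limbs.
SKELETON = [(5, 7), (7, 9), (6, 8), (8, 10),        # arms
            (11, 13), (13, 15), (12, 14), (14, 16),  # legs
            (5, 6), (11, 12), (5, 11), (6, 12)]      # torso


def render_pose_overlay(rgb_path, keypoints, out_path):
    """Draw estimated joints and limbs on the RGB image so the VLM
    receives an explicit pose cue alongside the raw pixels."""
    img = Image.open(rgb_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for a, b in SKELETON:
        xa, ya, ca = keypoints[a]
        xb, yb, cb = keypoints[b]
        if ca > 0.3 and cb > 0.3:  # skip low-confidence joints
            draw.line([(xa, ya), (xb, yb)], fill=(255, 0, 0), width=3)
    for x, y, c in keypoints:
        if c > 0.3:
            draw.ellipse([x - 4, y - 4, x + 4, y + 4], fill=(0, 255, 0))
    img.save(out_path)
```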
The pose estimator is fine-tuned on a dataset comprising 2D illustrations and 2D images rendered from 3D character models to improve pose estimation on non-photorealistic inputs. This training set of 2,400 images spans the animation, illustration, and cartoon domains, and the fine-tuned estimator achieves high PCKh scores, indicating robust pose estimation.
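For reference, PCKh counts a predicted keypoint as correct when it lies within a fraction of the head segment length of the ground truth. The snippet below is a minimal, simplified version of that standard metric (no visibility masking), not the paper's evaluation code.

```python
import numpy as np


def pckh(pred, gt, head_sizes, alpha=0.5):
    """PCKh@alpha: a predicted keypoint is correct when its distance to the
    ground truth is at most alpha * head segment length.

    pred, gt:    arrays of shape (N, K, 2) with keypoint coordinates
    head_sizes:  array of shape (N,) with per-image head segment lengths
    """
    dists = np.linalg.norm(pred - gt, axis=-1)   # (N, K) joint-wise errors
    thresh = alpha * head_sizes[:, None]         # (N, 1) per-image threshold
    return float((dists <= thresh).mean())
```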
Specific prompts guide both image generation and detection: the TTI prompts request 2D motion-frame, pixel-style characters, while the detection prompt defines hallucinations as anatomical anomalies such as missing limbs or an abnormal number of limbs. The VLM is then asked to classify images by human-anatomy abnormality, distinguishing correct images from those containing visual hallucinations, as sketched below.
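The templates below illustrate this two-prompt setup. Their wording, the {action} placeholder, and the parse_vlm_answer helper are hypothetical and not the paper's exact prompts.

```python
# Hypothetical TTI generation prompt; wording is illustrative only.
TTI_PROMPT = "a 2D motion frame, pixel style cartoon character, full body, {action}"

# Hypothetical detection instruction given to the VLM.
DETECTION_INSTRUCTION = (
    "A visual hallucination is an anatomical anomaly such as a missing limb "
    "or an abnormal number of limbs. Answer 'correct' if the character's "
    "anatomy is normal, otherwise answer 'hallucination'."
)


def parse_vlm_answer(answer: str) -> bool:
    """Map the VLM's free-form reply onto a binary hallucination label."""
    return "hallucination" in answer.strip().lower()
```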