ChatClothes: An AI-Powered Virtual Try-On System

Author: Zhang, Yuchao
Institution: Auckland University of Technology
Year: 2025
Contributor: Yan, Wei Qi
Date: 2025-11-25
Language: en
Type: Thesis
URI: http://hdl.handle.net/10292/20210
Access: OpenAccess

With the advancement of deep learning, latent diffusion models, and large language models (LLMs), virtual try-on (VTON) has emerged as a promising solution for personalized fashion experiences in online shopping, digital design, and augmented retail. This thesis proposes ChatClothes, a modular and multimodal VTON system that integrates controllable diffusion-based generation with dialogue-driven garment interaction. The system architecture is orchestrated by Dify, with ComfyUI managing the visual generation pipeline and Ollama hosting local LLMs. At its core, ChatClothes employs DeepSeek, a customized large language model that interprets natural language instructions and transforms them into structured prompts for image generation and interactive refinement. This prompt-based guidance enhances semantic alignment and enables intuitive user control beyond predefined attribute labels. To improve structural consistency and detail fidelity in image synthesis, this work applies Low-Rank Adaptation (LoRA) to fine-tune the original OOTDiffusion model. Without altering the backbone architecture, this strategy targets pose alignment, hand generation accuracy, and garment texture reconstruction. By integrating LoRA modules, the model achieves effective adaptation and fine-grained refinement even under limited training resources. To support garment classification, YOLO12n-LC, a lightweight variant based on YOLO12n, is developed to balance accuracy, speed, and model size. It achieves competitive performance across multiple clothing categories while remaining feasible for device-level deployment. A complete system workflow connects image preprocessing, language understanding, garment classification, image synthesis, and output evaluation.
Experiments on datasets such as DressCode and VITON-HD provide initial validation of the system in terms of realism, controllability, and structural preservation. This work presents a unified framework bridging vision-language interaction with diffusion-based generation, establishing a foundation for scalable, user-centered, and device-adaptable fashion AI systems applicable across e-commerce, AR fitting mirrors, personalization platforms, and automated outfit design.
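
The abstract states that LoRA adapts the OOTDiffusion model without altering its backbone. As a rough illustration of that general idea only, the sketch below wraps a frozen weight matrix W with a trainable low-rank update, so the layer computes W x + (alpha / r) B A x. It is not the thesis's training code: the class name `LoRALinear`, the helper `matvec`, and all sizes and values are hypothetical, and pure Python lists stand in for tensors.

```python
import random


def matvec(m, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoRALinear:
    """Hypothetical layer: frozen weight W plus a low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen backbone weight, never updated
        # A starts with small random values and B with zeros, so the
        # adapted layer initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        # Only A (rank x d_in) and B (d_out x rank) would be trained.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


if __name__ == "__main__":
    identity = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
    layer = LoRALinear(identity, rank=2, alpha=2.0)
    print("full weight parameters:", 8 * 8)
    print("trainable LoRA parameters:", layer.trainable_params())
```

With an 8 x 8 weight and rank 2, the adapter trains 32 values instead of 64, and the gap widens as the layer grows. This is the sense in which low-rank adaptation suits limited training resources.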