Novel Text-to-image Synthesis Models: Obj-SA-GAN and Swinv2-Imagen

Li, Ruijun
Li, Weihua
Degree name
Master of Computer and Information Sciences
Auckland University of Technology

Rapid progress in deep learning has considerably advanced text-to-image synthesis, and a large number of generative models based on deep learning algorithms have emerged in the field of image generation. In this thesis, I recognise the importance of text semantic layout and propose two novel generative models, Obj-SA-GAN and Swinv2-Imagen. The two models use different algorithms to mine text semantics, and both improve the quality of the generated images compared to their baselines.

In recent years, text-to-image synthesis techniques have made considerable breakthroughs, but the progress is restricted to simple scenes. Such techniques turn out to be ineffective when the text is complex and describes multiple objects. To address this challenging issue, I propose a novel text-to-image synthesis model called Object-driven Self-Attention Generative Adversarial Network (Obj-SA-GAN), in which self-attention mechanisms analyse information at different granularities at different stages, exploiting the text semantic information fully from coarse to fine. Complex datasets are used to evaluate the performance of the proposed model. The experimental results show that the proposed model outperforms the state-of-the-art methods, because Obj-SA-GAN improves text utilisation and provides a better understanding of complex scenes.

In addition, a number of studies have shown that diffusion models perform remarkably well in text-to-image synthesis tasks, opening new research opportunities for image generation. Google’s Imagen follows this research trend and outperforms DALL-E 2 as the best model for text-to-image generation. However, Imagen merely uses a T5 language model for text processing, which does not guarantee that the semantic information of the text is learned. Furthermore, the Efficient-Unet used by Imagen is not the best choice for image generation. To address these issues, I propose Swinv2-Imagen, a novel text-to-image diffusion model based on a Hierarchical Visual Transformer. In the proposed model, the feature vectors of entities and relationships are extracted and incorporated into the diffusion model, effectively improving the quality of generated images. On top of that, I also introduce a Swin-Transformer-based Unet architecture, called Swinv2-Unet, which can address the problems stemming from the convolution operations of CNNs. Extensive experiments are conducted to evaluate the performance of the proposed model using three real-world datasets, i.e., MSCOCO, CUB and MM-CelebA-HQ. The experimental results show that the proposed Swinv2-Imagen model outperforms several popular state-of-the-art methods.
