Ting-Chun Liu and Leon-Etienne Kühr
What occurs when machines learn from one another and engage in self-cannibalism within the generative process? Can an image model identify the happiest person or determine ethnicity from a random image? Most state-of-the-art text-to-image implementations rely on a number of limited datasets, models, and algorithms. These models, initially appearing as black boxes, reveal complex pipelines involving multiple linked models and algorithms upon closer examination. We engage artistic strategies like feedback, misuse, and hacking to crack the inner workings of image-generation models. This includes recursively confronting models with their output, deconstructing text-to-image pipelines, labelling images, and discovering unexpected correlations. During the talk, we will share our experiments on investigating Stable-Diffusion pipelines, manipulating aesthetic scoring in extensive public text-to-image datasets, revealing NSFW classification, and utilizing Contrastive Language-Image Pre-training (CLIP) to reveal biases and problematic correlations inherent in the daily use of these models.
The talk will be conducted by sharing various experiments we've done under the umbrella of generative AI models. We will begin with a general idea of how we, as artists/programmers, perceive these models and our research on the workflow of these constructs. Then, we will further elaborate on our exploration of the Stable Diffusion pipeline and datasets. Throughout our investigation, we discovered that some essential parts are all based on the same few datasets, models, and algorithms. This causes us to think that if we investigate deeper into some specific mechanisms, we might be able to reflect on the bigger picture of some political discourses surrounding generative AI models. We deconstructed the models into three steps essential to understanding how they worked: dataset, embedding, and diffusions. Our examples are primarily based on Stable-Diffusion, but some concepts are interchangeable in other generative models.
As datasets and machine-learning models grow in scale and complexity, understanding their nuances becomes challenging. Large datasets, like the one for training Stable Diffusion, are filtered using algorithms often employing machine learning. To "enhance" image generation, LAION's extensive dataset underwent filtering with an aesthetic prediction algorithm that uses machine learning to score the aesthetics of an image with a strong bias towards water-color and oil paintings. Besides the aesthetic scoring of images, images are also scored with a not safe-for-work classifier that outputs a probability of an image containing explicit content . This algorithm comes with its own discriminatory tendencies that we explore in the talk and furthermore asks how and by whom we want our datasets to be filtered and constructed.
Many generative models are built upon Contrastive Language-Image Pre-training (CLIP) and its open-source version, Open-CLIP, which stochastically relates images and texts. These models connect images and text, digitize text, and calculate distances between words and images. However, they heavily rely on a large number of text-image pairs during training, potentially introducing biases into the database. We conducted experiments involving various "false labelling" scenarios and identified correlations. For instance, we used faces from ThisPersonDoesNotExist to determine "happiness" faces, explored ethnicities and occupations on different looks, and analyzed stock images of culturally diverse food. The results often align with human predictions, but does that mean anything?
In the third part, we take a closer look at the image generation process, focusing on the Stable Diffusion pipeline. Generative AI models, like Stable Diffusion, have the ability not only to generate images from text descriptions but also to process existing images. Depending on the settings, they can reproduce input images with great accuracy. However, errors accumulate with each iteration when this AI reproduction is recursively used as input. We observed that images gradually transform into purple patterns or a limited set of mundane concepts depending on the parameters and settings. This raises questions about the models' tendencies to default to learned patterns.