Fine-Tuning a Vision Transformer (Swin-Tiny) for Detection and Classification of AI-generated Images
The notebooks in this repository focus primarily on fine-tuning a pre-trained vision transformer (Swin-Tiny), starting from a binary classification baseline: identifying whether an image was created by generative AI. The work here expands that baseline into a multiclass classification problem: identifying whether an image is authentic (human-generated) or was generated by one of several text-to-image AI generators (i.e., Stable Diffusion, Midjourney, and DALL-E).
The goal was to tackle the multiclass classification problem using three separate approaches to transfer learning:
- The first experiment used the model as a feature extractor. Extracted outputs were passed to a logistic regressor implemented in Scikit-learn (`LogisticRegressionCV`) to classify the images.
- The second experiment was fine-tuning with frozen layers. It involved freezing all of the parameters up until the final linear layer, and then adding our own linear layer that transformed the output dimensions and handed off to a softmax for the classification.
- The third experiment was selective fine-tuning: a natural extension of experiment 2 in which we froze every layer except the last block (specifically Stage 3, Block 1), which remained trainable. As with the previous experiment, we added a trainable linear layer with a softmax for classification.
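The freezing strategies above can be sketched in PyTorch. This is a minimal illustration, not the code from the notebooks: `ToyBackbone` is a hypothetical stand-in for Swin-Tiny that only mirrors its four-stage layout with depths (2, 2, 6, 2), and `build_classifier`, `count_trainable`, and the dimensions are assumptions made for the example. With a real pre-trained model the attribute names for stages and blocks depend on the library used.

```python
import torch
from torch import nn

NUM_CLASSES = 4  # authentic, Stable Diffusion, Midjourney, DALL-E


class ToyBackbone(nn.Module):
    """Hypothetical stand-in for Swin-Tiny: four stages with depths (2, 2, 6, 2)."""

    def __init__(self, in_dim=48, dim=32, depths=(2, 2, 6, 2)):
        super().__init__()
        self.embed = nn.Linear(in_dim, dim)
        self.stages = nn.ModuleList(
            [nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])
             for depth in depths]
        )
        self.out_dim = dim

    def forward(self, x):
        x = self.embed(x)
        for stage in self.stages:
            for block in stage:
                x = x + torch.relu(block(x))  # residual block
        return x


def build_classifier(backbone, unfreeze_last_block=False):
    """Freeze the backbone, optionally unfreeze its final block, add a new head."""
    for param in backbone.parameters():
        param.requires_grad = False
    if unfreeze_last_block:
        # Stage 3, Block 1 (0-indexed) is the last block of the last stage.
        for param in backbone.stages[3][1].parameters():
            param.requires_grad = True
    head = nn.Linear(backbone.out_dim, NUM_CLASSES)  # new layers train by default
    return nn.Sequential(backbone, head)


def count_trainable(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Experiment 2 style: only the new linear head is trainable.
frozen_model = build_classifier(ToyBackbone())
# Experiment 3 style: the head plus the last backbone block are trainable.
selective_model = build_classifier(ToyBackbone(), unfreeze_last_block=True)

batch = torch.randn(8, 48)  # random stand-in for a batch of image inputs

# Experiment 1 style: run the frozen backbone alone to get fixed features.
with torch.no_grad():
    features = frozen_model[0](batch)

# Softmax turns the head's logits into class probabilities.
probs = selective_model(batch).softmax(dim=-1)

print(count_trainable(frozen_model), count_trainable(selective_model))
```

For the feature-extractor approach, `features.numpy()` and the matching labels could be passed to `LogisticRegressionCV.fit`. When training the other two variants with `nn.CrossEntropyLoss`, the model should output raw logits, because that loss applies log-softmax internally.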
Read the full report here.