CNN vs Transformer
Introduction
In the PetFinder.my - Pawpularity Contest, participants were challenged to predict how likely a pet was to be adopted based on its image and metadata. Initially, Convolutional Neural Networks (CNNs) showed promising results in this competition. Over time, however, Transformer-based models emerged and surpassed CNNs by a significant margin. This raises the question: why did the Transformer outperform the CNN?
Why??
When it comes to adopting a pet, humans are often drawn to pets that look appealing. That appeal does not come from a cute face alone; it takes in the whole picture, including the tail, fur, surroundings, and more. We expect a deep learning model to weigh the same factors when making predictions. Let’s delve into what CNN and Transformer models focus on in an image before making their decisions.
From the animation above, we can observe that the CNN model (EfficientNet) mainly concentrates on the facial features of the image, such as the eyes, nose, and mouth. On the other hand, the Transformer model (ViT) pays attention to all the features of the image, including the tail, fur, surroundings, and more.
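Heatmaps like the ones in the animation are typically produced with saliency methods such as Grad-CAM. Below is a minimal sketch along those lines, not the code behind the animation: the ResNet-18 backbone, the choice of `layer4` as the target layer, and the random input are all placeholders for illustration.

```python
# A minimal Grad-CAM-style sketch (illustrative, not the post's actual code):
# it highlights which image regions a CNN's prediction depends on.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
acts, grads = {}, {}

def hook(module, inputs, output):
    acts["a"] = output                                  # feature maps
    output.register_hook(lambda g: grads.update(g=g))   # their gradients

model.layer4.register_forward_hook(hook)  # last conv block (an assumption)

img = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed pet photo
model(img).max().backward()        # backprop from the top-class logit

# Weight each channel by its average gradient, sum, ReLU, then upsample.
w = grads["g"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((w * acts["a"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=img.shape[2:], mode="bilinear", align_corners=False)
print(cam.shape)  # torch.Size([1, 1, 224, 224]) -- a heatmap over the input
```

Overlaying `cam` on the input image shows where the model is looking, which is exactly the kind of evidence the animation summarizes.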
How??
Now that we understand the “why,” let’s explore the “how” behind the Transformer’s success and the CNN’s limitations.
Convolutional Neural Network (CNN)
Code:
- train: PetFinder: CNN [TPU][Train] 🐶
- infer: PetFinder: CNN [TPU][Infer] 🐶
CNNs are limited by their local receptive field. During a convolution, a CNN can only perceive information within its `kernel_size`, which means it lacks context from regions outside that window, a limitation referred to as “local context.” Typically, the kernel size in a CNN model is set to `3x3` or `5x5`. The receptive field does expand as the model’s depth increases, thanks to the progressive reduction in resolution, but it grows only gradually, so the network remains “short-sighted”: no single layer ever relates two distant regions of the image directly.
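To make that concrete, here is a small back-of-the-envelope calculation (not from the competition notebooks) of the theoretical receptive field of stacked convolutions; `receptive_field` is a helper defined here purely for illustration.

```python
# The theoretical receptive field of stacked convolutions grows roughly
# linearly with depth, so a single output pixel still "sees" only a small
# window of the input image.

def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, applied in order."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the window by (k-1)*jump
        jump *= s              # stride compounds the spacing between taps
    return rf

# Five 3x3 convs with stride 1: each output pixel sees only an 11x11 patch.
print(receptive_field([(3, 1)] * 5))                       # -> 11
# Strided downsampling grows it faster, but it is still local.
print(receptive_field([(3, 2), (3, 1), (3, 2), (3, 1)]))   # -> 19
```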
Transformer
Code:
On the other hand, Transformers possess a superpower known as the global receptive field, which lets them perceive the entire picture when making a decision. They achieve this global awareness through a mechanism called `self-attention`, which dynamically weighs relevant information from every part of the image, enabling a better understanding of the scene as a whole.
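As a rough illustration (assumed, not taken from the competition notebooks), here is single-head self-attention over ViT-style patch tokens in PyTorch. The point to notice is the shape of the attention matrix: every patch scores every other patch, so a single layer already has a global receptive field.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_patches, dim = 196, 64             # e.g. a 14x14 grid of patch embeddings
x = torch.randn(1, num_patches, dim)   # (batch, tokens, dim)

# Learned projections for queries, keys, and values.
wq, wk, wv = (torch.nn.Linear(dim, dim) for _ in range(3))
q, k, v = wq(x), wk(x), wv(x)

# (1, 196, 196): each patch assigns a weight to all 196 patches at once,
# so one layer can relate any two regions of the image directly.
attn = F.softmax(q @ k.transpose(-2, -1) / dim ** 0.5, dim=-1)
out = attn @ v                         # every output token mixes global info

print(attn.shape)  # torch.Size([1, 196, 196])
```

Contrast this with the convolution above: the CNN needs many stacked layers to let distant pixels interact at all, while self-attention connects them in one step.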
Summary: CNNs are confined to a local-context view of an image, while Transformers can capture its broader context, including the appealing features that contribute to a pet’s adoptability.