Introduction

In the PetFinder.my - Pawpularity Contest on Kaggle, participants were challenged to predict how appealing a pet’s photo is (its “Pawpularity” score) from images and metadata, a signal for how likely the pet is to be adopted. Initially, Convolutional Neural Networks (CNNs) showed promising results in this competition. As the competition progressed, however, Transformer-based models emerged and surpassed CNNs by a significant margin. This raises the question: why did the Transformer outperform the CNN?

Why??

When it comes to adopting a pet, humans are often drawn to pets that look appealing. However, this appeal is not solely dependent on a cute face; it considers the entire picture, including the tail, fur, surroundings, and more. Similarly, we expect a deep learning model to consider these factors when making predictions. Let’s delve into what CNN and Transformer models focus on in an image before making their decisions.

Grad-CAM of CNN vs. attention map of Transformer.

From the animation above, we can observe that the CNN model (EfficientNet) concentrates mainly on the facial features of the image, such as the eyes, nose, and mouth. The Transformer model (ViT), on the other hand, pays attention to features across the entire image, including the tail, fur, surroundings, and more.
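
For anyone curious how such a visualization is produced, below is a minimal Grad-CAM sketch in PyTorch. This is an illustration under assumptions, not the code behind the animation: a pretrained ResNet-18 stands in for EfficientNet, and a random tensor stands in for a real pet photo.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Assumption: ResNet-18 as a stand-in for the competition's EfficientNet.
model = models.resnet18(weights="IMAGENET1K_V1").eval()
target_layer = model.layer4[-1]  # last convolutional block

activations, gradients = {}, {}
target_layer.register_forward_hook(
    lambda m, i, o: activations.update(value=o.detach())
)
target_layer.register_full_backward_hook(
    lambda m, gi, go: gradients.update(value=go[0].detach())
)

image = torch.randn(1, 3, 224, 224)    # placeholder for a pet photo
logits = model(image)
logits[0, logits.argmax()].backward()  # gradient of the top class

# Grad-CAM: weight each activation map by the mean of its gradients,
# then keep only the positive contributions.
weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear")
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # heatmap in [0, 1]
```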

How??

Now that we understand the “why,” let’s explore the “how” behind the Transformer’s success and the CNN’s limitations.

Convolutional Neural Network (CNN)

Code:
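
The original embedded code is not preserved in this post, so here is a minimal, hypothetical sketch of a CNN regressor built from the 3x3 convolutions discussed below. The layer widths and the single Pawpularity-style output head are illustrative assumptions, not the competition model.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """A toy CNN: each convolution sees only a 3x3 window at a time."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 3x3 local window
            nn.ReLU(),
            nn.MaxPool2d(2),                             # halve the resolution
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                     # global average pool
        )
        self.head = nn.Linear(64, 1)  # single Pawpularity-style score

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = TinyCNN()
scores = model(torch.randn(4, 3, 224, 224))  # dummy batch of pet photos
print(scores.shape)  # torch.Size([4, 1])
```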

CNNs are limited by their local receptive field. During the convolution operation, a CNN can only perceive information within its kernel_size window, which means it lacks the context of regions outside that area, a limitation referred to as “local context.” Typically, the kernel size in a CNN model is set to 3x3 or 5x5. While the receptive field of a CNN expands as the model gets deeper, helped along by pooling and strided convolutions that reduce the resolution, it still remains “short-sighted” compared to seeing the whole image at once.
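
To make the “short-sighted” claim concrete, here is a small, self-contained computation of the theoretical receptive field for a stack of 3x3 convolutions interleaved with 2x2 pooling (illustrative layer choices, not a specific model):

```python
def receptive_field(layers):
    """Receptive field of stacked layers given (kernel_size, stride) pairs."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the window
        jump *= stride             # strides compound the step size
    return rf

# conv3x3 -> pool2x2 -> conv3x3 -> pool2x2 -> conv3x3
layers = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]
print(receptive_field(layers))  # 18: still a small window on a 224x224 image
```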

Mechanism of CNN

Transformer

Code:
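
As above, the original embedded code is not preserved, so the following is a minimal sketch of ViT-style self-attention using PyTorch’s nn.MultiheadAttention. The patch count (196, as for a 224x224 image split into 16x16 patches) and the embedding size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Every patch attends to every other patch: a global receptive field."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, num_patches, dim); queries, keys, and values
        # all come from the same set of patches
        return self.attn(x, x, x, need_weights=True)

patches = torch.randn(1, 196, 64)  # 14x14 grid of patch embeddings
out, attn = SelfAttention()(patches)
print(attn.shape)  # (1, 196, 196): each patch attends to all 196 patches
```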

On the other hand, Transformers possess a superpower known as the global receptive field, which allows them to perceive the entire picture when making decisions. Transformers achieve this global awareness through a mechanism called self-attention, which dynamically focuses on relevant information from different parts of the image, enabling a better understanding of the scene.

Mechanism of Vision Transformer

Summary: CNNs are limited by their local-context perspective, while Transformers can capture the broader context of an image, including the appealing features that contribute to a pet’s adoptability.