Is Human Vision More like CNN or Vision Transformer?


Engineer the Mind

In Winter 2015, after coming back from grad school interviews in the States, I told my dad over hotpot that I was going to study cognitive science at Berkeley.

- “So, what is cognitive science?" he asked.
- “It is the study of the mind, uncovering algorithms that might underlie human reasoning, perception, and language." I tried my best to explain.
- “Cool… How is that different from artificial intelligence?" Dad πŸ€”.
- “Hmm… AI engineers solutions that work, but CogSci reverse-engineers how humans think back from the solutions?" 21-year-old me πŸ€“.
- “If AI works, does it matter if it works like the mind? Since the mind already works, does it matter if we can reverse-engineer it?" Dad 🧐.
- “The weather today is quite nice…" 21-year-old me πŸ₯΅.

Little did I know, nearly 10 years later as a machine learning engineer, I’d be repeating this conversation with recruiters, hiring managers, and curious colleagues, each asking my dad’s questions. My answers, and perhaps the field’s, have changed.

Marr: Purpose & Function

“Trying to understand perception by studying only neurons is like trying to understand bird flight by studying only feathers: It just cannot be done." — David Marr (1982), Vision, p. 27.

When I started my PhD in 2016, it was before the Transformer. ResNet (CVPR 2016) had just surpassed humans in image classification, while higher-level cognition was still dominated by Bayesian models (see Griffiths et al. 2010, Tenenbaum et al., 2011, and Lake et al., 2017, for reviews), like Computer Vision before AlexNet. Even Fei-Fei Li, the godmother of AI, began her career in Bayes (e.g., CVPR 2004).

Why Bayes? I hope my CogSci professors forgive me — I think Bayesian models are neat “pseudocode” to capture a learner’s inductive biases through priors and outcomes through posteriors, without worrying too much about the process in between.

This tradition of understanding the mind through its abstract function stems from the British neuroscientist David Marr. Tired of neuroscience’s obsession with identifying one specialized neuron after another (e.g., Barlow, 1953, “bug detector”; Gross, Rocha-Miranda, & Bender, 1972, “hand detector”; and the Cambridge joke, the apocryphal grandmother cell that supposedly activates when you see your grandma), Marr argued that to truly understand vision, we must step back and consider the purpose of vision and the problems it solves. This is the “computational” level of analysis, which laid the groundwork for modern computational cognitive science.

Marr provided a vivid example of how to understand an information-processing system through its purpose and function: How do we understand a cash register, which tells a customer how much to pay? Instead of examining each button on the machine, like neuroscientists did in the ’50s and ’60s, we can ask, what should a cash register compute? — Addition. Why addition and not, say, multiplication? — Because addition, unlike multiplication, meets the requirements for a successful transaction:

  • The rules of zero: If you buy nothing, you pay nothing; if you buy nothing along with something, you should pay for that something.
  • Commutativity: The order in which items are scanned shouldn’t affect the total.
  • Associativity: Grouping items into different piles shouldn’t affect the total.
  • Inverse: If you buy something and then return it, you should pay nothing.

Multiplication fails the rule of zero — if you buy nothing along with something, you’d pay nothing, since $0 \times \mathrm{something} = 0$. So, any merchant aiming to make a profit wouldn’t use a cash register that performs multiplication. Studying the buttons won’t help us understand the cash register at this level, Marr argued, just as finding the grandmother cell doesn’t bring us any closer to understanding vision.
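
Just for fun, here's a tiny Python check of these constraints (the price is arbitrary): addition passes the rule of zero, while multiplication fails it.

```python
import operator

# Toy check of Marr's cash-register constraints: does an operation respect the rule of zero?
def satisfies_rule_of_zero(op):
    something = 4.99  # an arbitrary price
    buy_nothing = op(0, 0) == 0                    # buy nothing, pay nothing
    buy_something = op(0, something) == something  # buy nothing plus something, pay for the something
    return buy_nothing and buy_something

print(satisfies_rule_of_zero(operator.add))  # True
print(satisfies_rule_of_zero(operator.mul))  # False: 0 * something = 0, so the merchant gets nothing
```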

Summarized below are levels at which we study the mind (artificial or natural). At the computational level, we define the constraints for a task (in domains such as vision, language, or reasoning) and identify a computation that satisfies these constraints. At the algorithmic level, we determine input/output representations, as well as the algorithm to perform the transformation. Finally, at the implementational level, we figure out how to physically implement these representations and algorithms, whether in the human brain or in a machine (like a CPU or a GPU).

Source: David Marr’s Vision, Chapter 1 The Philosophy and the Approach


Hinton: Mechanism & Pretraining

Marr did not prescribe specific architectures for modeling vision, yet his vision for vision somehow contributed to the rejection of a generation of vision papers using neural nets, as Geoff Hinton recounted in his conversation with Fei-Fei Li.

“It’s hard to imagine now, but around 2010 or 2011, the top Computer Vision people were really adamantly against neural nets — they were so against it that, for example, one of the main journals had a policy not to referee papers on neural nets at one point. Yann LeCun sent a paper to a conference where he had a neural net that was better at doing segmentation of pedestrians than the state of the art, and it was rejected. One of the reasons it was rejected was because one of the referees said this tells us nothing about vision — they had this view of how computer vision works, which is, you study the nature of the problem of vision, you formulate an algorithm that will solve it, you implement that algorithm, and you publish a paper." — Geoff Hinton (2023), talk @Radical Ventures, 29' 27''.

“This view of how computer vision works” clearly came from Marr, but I find it a bit sad that Marr’s call to zoom out from the nitty-gritty when first trying to understand an intelligent system was misread by some as a license to stay at that abstract level forever, never getting back down to business once the direction is clear.

And it wasn’t just vision — I remember Berkeley CogSci PhD students had to write seminar essays explaining why neural networks (dubbed “connectionism” in CogSci) weren’t as good a fit for higher-level cognition as Bayesian models. The recurring argument was that neural networks require too much data to train, and that it’s far harder to adjust weights in a neural net than to modify edges in a Bayes net. For instance, a human may misclassify a dolphin as a fish but can quickly correct the label to a mammal — at the time, it was hard to imagine how a neural network could perform this one-shot belief updating that is straightforward in a Bayes net.

Only after so many years can I admit — I never really understood the contention between Bayesian models and neural nets. First of all, they’re not even at the same level of analysis: the former describes the solution to a task, while the latter actually solves it. Moreover, just as it was underwhelming for Marr to see a collection of specialized neurons and call it “understanding vision,” it felt similarly underwhelming to draw a Bayes net describing how a cognitive task should be done and call it “understanding cognition,” without implementing the nitty-gritty details to build one. Years later, I heard my doubts spoken aloud by Hinton, in that same talk with Fei-Fei (he might as well have said MIT’s Josh Tenenbaum’s name out loud πŸ˜†).

“For a long time in cognitive science, the general opinion was that if you give neural nets enough training data, they can do complicated things, but they need an awful lot of training data — they need to see thousands of cats — and people are much more statistically efficient. What they were really doing was comparing what an MIT undergraduate can learn to do on a limited amount of data with what a neural net that starts with random weights can learn to do on a limited amount of data.

To make a fair comparison, you take a foundation model that is a neural net trained on lots and lots of data, give it a completely new task, and you ask how much data it needs to learn this completely new task — and you discover these things are statistically efficient and compare favorably with people in how much data they need."
— Geoff Hinton (2023), talk @Radical Ventures, 46' 19''.

When interviewing at Berkeley, I asked my student host why Bayesian models could magically explain how humans learn so much from so little, so quickly (Prof. Alison Gopnik’s catchphrase). I don’t remember her answer. Today, I realize that prior knowledge sharpens the probability density around certain hypotheses. If we allow pretraining on a Bayes net, we should similarly allow pretraining on a neural net.

Human vs. Computer Vision

Today’s cognitive science is much more receptive to neural nets — so much so that one might worry the best-performing model on machine learning benchmarks may just be viewed as the algorithm underlying the human mind. We need clever experimental designs and metrics to assess how well SOTA models align with human cognition. Tuli et al.’s (2021) CogSci paper, comparing CNNs and Vision Transformers (ViT) to human vision, is an early effort in this direction. Below, I review the key ideas behind CNN + ViT and the authors' methodology for measuring model-human alignment.

Convolutional Neural Network (CNN)

A convolutional neural network (CNN) is a fancier version of a feed-forward network (FNN): it first extracts features through convolutional and pooling layers, then feeds them to fully connected layers. But what is a convolution? And what features can it extract? These are the million-dollar questions.

What Is a Convolution?

In math, a convolution is an operation on two functions, $f$ and $g$, that creates a third function, $f * g$. That might sound a bit abstract. In his awesome video, 3Blue1Brown explains it with a classic dice example: Imagine two $N$-faced dice, each with an array of probabilities for landing on faces 1 to $N$. To find the probability of rolling a specific sum from the two dice, you use a convolution:

The probability of rolling a sum of 6 from 2 dice (source: 3Blue1Brown).


  1. Flip the second die so that its faces range from $N$ to 1, left to right;
  2. Slide the flipped die across the first one at offsets 1 to $N$; at each offset, every overlapping pair of faces adds up to the same sum;
  3. Finally, to get the probability of rolling each unique sum, add up the products of the probabilities from each overlapping pair of faces.

Below is a Python implementation of 1D array convolution (in case this comes up in a coding interview πŸ˜…), or you could simply call np.convolve on the two input arrays.

```python
def convolve(dice1, dice2):
    # Length of the convolved array is len(dice1) + len(dice2) - 1
    n1 = len(dice1)
    n2 = len(dice2)
    result = [0] * (n1 + n2 - 1)

    # Perform convolution
    for i in range(n1):
        for j in range(n2):
            # Index: a unique sum
            # Value: probability of this sum
            result[i + j] += dice1[i] * dice2[j]

    return result

# Example 1: Two fair dice
dice1 = [1/6] * 6
dice2 = [1/6] * 6
print(convolve(dice1, dice2))
# Expected output (probabilities for sums 2 to 12):
# [0.027777777777777776, 0.05555555555555555, 0.08333333333333333,
#  0.1111111111111111, 0.1388888888888889, 0.16666666666666669,
#  0.1388888888888889, 0.1111111111111111, 0.08333333333333333,
#  0.05555555555555555, 0.027777777777777776]

# Example 2: Two weighted dice
dice1 = [0.16, 0.21, 0.17, 0.16, 0.12, 0.18]
dice2 = [0.11, 0.22, 0.24, 0.10, 0.20, 0.13]
print(convolve(dice1, dice2))
# Expected output (probabilities for sums 2 to 12):
# [0.0176, 0.058300000000000005, 0.1033, 0.1214, 0.1422, 0.1644,
#  0.1457, 0.10930000000000001, 0.06280000000000001,
#  0.05159999999999999, 0.0234]
```

See the full results of convolving two 6-faced dice below, along with the formula.

Probabilities of rolling possible sums from 2 dice (source: 3Blue1Brown).


In a CNN, instead of convolving two 1D arrays of the same length, we convolve two 2D arrays of different dimensions — a larger image array and a smaller $k \times k$ kernel. (Strictly speaking, a mathematical convolution flips the kernel 180 degrees first, just as we flipped the second die; deep learning libraries skip the flip and compute a cross-correlation, which makes no practical difference since the kernel weights are learned anyway.)

What Features Can It Extract?

An example kernel for detecting horizontal edges (source: 3Blue1Brown).


Element values in the kernel determine what features it extracts. If all element values are equal and sum to 1, the kernel blurs the original image by taking a moving average of neighboring pixels (β€œbox blur”). If we instead allow some values in a kernel to be positive and others negative, the kernel picks up on variations in pixel values and can detect features such as vertical and horizontal edges. We can design different kernel values to detect different image features (more examples).

In the 1D example, we considered all possible offsets between two arrays. In a CNN, however, we only compute element-wise products where the kernel is fully aligned with the original image. If the original image has dimensions $m \times n$ (ignoring the color channel for now), the output array — or the “feature map” — of a $k \times k$ kernel will have dimensions $(m - k + 1) \times (n - k + 1)$. This is because the kernel slides horizontally $(n - k + 1)$ times and vertically $(m - k + 1)$ times across the image.

In a CNN, we only convolve fully aligned positions (source: DigitalOcean).

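To make the kernel and the output dimensions concrete, here's a small NumPy sketch of a β€œvalid” 2D convolution, using a made-up $28 \times 28$ image and a $3 \times 3$ Sobel-style kernel (positive values on top, negative values on the bottom, so it responds to horizontal edges):

```python
import numpy as np

def convolve2d_valid(image, kernel):
    # Flip the kernel 180 degrees, as in a true mathematical convolution
    kernel = np.flipud(np.fliplr(kernel))
    m, n = image.shape
    k = kernel.shape[0]
    # Only fully aligned positions: output is (m - k + 1) x (n - k + 1)
    out = np.zeros((m - k + 1, n - k + 1))
    for i in range(m - k + 1):
        for j in range(n - k + 1):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

image = np.random.rand(28, 28)                   # a toy 28x28 grayscale image
edge_kernel = np.array([[ 1,  2,  1],
                        [ 0,  0,  0],
                        [-1, -2, -1]])           # responds to horizontal edges
feature_map = convolve2d_valid(image, edge_kernel)
print(feature_map.shape)                         # (26, 26) = (28 - 3 + 1, 28 - 3 + 1)
```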

In practice, we usually add padding to keep the dimensions of each feature map at $m \times n$ instead of reducing it to $(m - k + 1) \times (n - k + 1)$. After convolution, we stack the $l$ feature maps into a tensor of size $l \times m \times n$, and then apply the ReLU activation function to each element in the tensor, setting negative values to zero.
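
Here's a quick PyTorch shape check (the sizes are made up): with kernel_size=3 and padding=1, each of the $l = 16$ feature maps keeps the original $m \times n$ spatial dimensions, and ReLU zeroes out the negative values.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 28, 28)   # one 3-channel 28x28 image
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
out = torch.relu(conv(x))       # negative values are set to zero
print(out.shape)                # torch.Size([1, 16, 28, 28]): 16 feature maps, same spatial size
```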

Apply max pooling after convolutional layers + ReLU (source: CS231n).


It’s customary to apply max pooling after ReLU, where we use a fixed-size window to downsample each individual feature map and take the maximum value in each window. We can use a hyperparameter “stride” to control how far the window moves across the feature map — with a stride of 2, we reduce the spatial dimensions by half.

Using a window of size $p \times p$ and a stride of $s$, we reduce the tensor dimension to:

$$l \times \left( \left\lfloor \frac{m - p}{s} \right\rfloor + 1 \right) \times \left( \left\lfloor \frac{n - p}{s} \right\rfloor + 1 \right).$$
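
We can verify this formula with another quick PyTorch check (again with made-up sizes): a $2 \times 2$ window with stride 2 takes a $16 \times 28 \times 28$ tensor down to $16 \times 14 \times 14$, since $\lfloor (28 - 2)/2 \rfloor + 1 = 14$.

```python
import torch
import torch.nn as nn

feature_maps = torch.randn(1, 16, 28, 28)   # l=16 feature maps of size 28x28
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(feature_maps).shape)             # torch.Size([1, 16, 14, 14])
```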

Finally, this dimension-reduced tensor is fed into a feed-forward network to perform the target task, such as image classification or object detection.

Putting Together a CNN

“There is no set way of formulating a CNN architecture. That being said, it would be idiotic to simply throw a few of layers together and expect it to work." — O’Shea and Nash (2015), An Introduction to CNN.

A common way to stack CNN layers (source: O’Shea and Nash, 2015).


To extract complex features at increasing levels of abstraction, we can stack multiple convolutional layers. A common approach is to place two convolutional layers before each pooling layer. The code below illustrates this pattern (source).

```python
import torch
import torch.nn as nn

class ConvNeuralNet(nn.Module):
    # Define the layers of the CNN and the order in which they are applied
    def __init__(self, num_classes):
        super(ConvNeuralNet, self).__init__()
        self.conv_layer1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3)
        self.conv_layer2 = nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3)
        self.max_pool1 = nn.MaxPool2d(kernel_size=2, stride=2)

        self.conv_layer3 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3)
        self.conv_layer4 = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3)
        self.max_pool2 = nn.MaxPool2d(kernel_size=2, stride=2)

        # 1600 = 64 * 5 * 5, the flattened size for 32x32 inputs (e.g., CIFAR-10)
        self.fc1 = nn.Linear(1600, 128)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(128, num_classes)

    # Progresses data across layers
    def forward(self, x):
        out = self.conv_layer1(x)
        out = self.conv_layer2(out)
        out = self.max_pool1(out)

        out = self.conv_layer3(out)
        out = self.conv_layer4(out)
        out = self.max_pool2(out)

        out = out.reshape(out.size(0), -1)

        out = self.fc1(out)
        out = self.relu1(out)
        out = self.fc2(out)
        return out
```
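
As a quick sanity check, we can push a dummy batch through the model. Note that the flattened size of 1600 only works out if the inputs are $32 \times 32$ RGB images (e.g., CIFAR-10): each unpadded $3 \times 3$ convolution trims 2 pixels and each pooling layer halves the size, so $32 \to 30 \to 28 \to 14 \to 12 \to 10 \to 5$, and $64 \times 5 \times 5 = 1600$.

```python
model = ConvNeuralNet(num_classes=10)
dummy_batch = torch.randn(8, 3, 32, 32)   # 8 hypothetical 32x32 RGB images
print(model(dummy_batch).shape)           # torch.Size([8, 10]): one logit per class
```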

Vision Transformer (ViT)

Some say the success of CNNs in Computer Vision is no coincidence (e.g., Yamins et al., 2014) — the primate ventral visual stream, starting from the primary visual cortex (V1), resembles a CNN in that it also uses local receptive fields (β€œkernels”) and pooling to extract features from visual inputs, with increasing levels of abstraction from one stage to the next.

Transformers, on the other hand, originated in Natural Language Processing (e.g., Vaswani et al., 2017) and bear no such biological resemblance to the visual cortex, nor do they enjoy inductive biases such as translation equivariance (shifting a pattern across the image shifts its feature map accordingly, so recognition doesn’t depend on where the pattern appears) and locality (nearby pixels are more strongly related to one another than distant ones), which are built into CNNs. Despite lacking these inductive biases, a standard Transformer pretrained on large datasets (14M-300M images) compares favorably with CNNs on many benchmarks.

An image doesn’t have discrete tokens like language does. To leverage the standard Transformer encoder (see this post for an NLP refresher), the Vision Transformer (ViT) authors split each image into fixed-size patches and treat each patch as a token. They then apply a linear projection to embed each flattened patch. To aid classification, a learnable [CLS] token is prepended to the sequence of patch embeddings. Positional embeddings are added to the patch embeddings to retain positional information before the sequence is fed into the multi-headed attention and MLP blocks.

Architecture of the Vision Transformer (source: Dosovitskiy et al., 2020).

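Here's a minimal sketch of that patch-embedding step (not the official implementation), assuming ViT-B/32-style sizes: a $224 \times 224$ RGB image, $32 \times 32$ patches, and embedding dimension 768.

```python
import torch
import torch.nn as nn

image_size, patch_size, embed_dim = 224, 32, 768
num_patches = (image_size // patch_size) ** 2        # 7 * 7 = 49 patches

x = torch.randn(1, 3, image_size, image_size)        # a batch with one image

# Split the image into non-overlapping patches and flatten each into 3 * 32 * 32 = 3072 values
patches = x.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.contiguous().view(1, 3, num_patches, patch_size * patch_size)
patches = patches.permute(0, 2, 1, 3).reshape(1, num_patches, -1)   # (1, 49, 3072)

# Linear projection of flattened patches, a learnable [CLS] token, and positional embeddings
projection = nn.Linear(3 * patch_size * patch_size, embed_dim)
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
pos_embedding = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

tokens = torch.cat([cls_token, projection(patches)], dim=1) + pos_embedding
print(tokens.shape)   # torch.Size([1, 50, 768]): ready for the Transformer encoder
```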

It’s fascinating that, with sufficient training, ViT learns position embeddings that are similar for patches in the same row or column, even though it lacks the CNN’s built-in bias that nearby patches should be related. This harkens back to Hinton’s comment that pretrained foundation models generalize as well as human learners, even though humans are endowed with even more inductive biases than CNNs.

Attention from output token to input (source: Dosovitskiy et al., 2020).


Model-Human Alignment

“First, two systems can differ in which stimuli they fail to classify correctly […] Second, while there is only one way to be right, there are many ways to be wrong — systems can also vary systematically in how they misclassify stimuli." — Tuli et al. (2021).

As far as engineers are concerned, whichever model performs better on the task at hand should be used. However, the overall accuracy doesn’t tell us which model behaves more like humans and, therefore, may be closer to the nature of human vision.

Why does the question in this blog post’s title even matter? Today, as an engineer, I’m not so sure anymore. If I were to channel my CogSci professors, they might say that understanding the algorithms behind human vision is key to improving human-machine alignment and interaction. In any case, it’s still a great exercise to think how we can measure and compare the alignment between models and human performance.

Shape Bias

If you mess up the texture of an image but keep the shapes intact, like the one below, a human would still be able to recognize the cat in it. This is the so-called “shape bias” in human vision (e.g., Kucker et al., 2019). By contrast, a CNN trained on ImageNet exhibits a texture bias (e.g., Geirhos et al., 2018).

Stylized ImageNet with transformed textures (source: Tuli et al., 2021).


Of course, one might ask: How can a CNN or ViT learn to rely on shapes if it has never been trained on texture-transformed cats still labeled as “cats”, for instance? The authors compared both ImageNet-trained models (CNN: ResNet-50; ViT: ViT-B/32) and models fine-tuned with augmented data (e.g., Gaussian blur, color distortions) to human performance. ViT demonstrated a stronger shift toward shape bias than CNN, though both models still fall far short of human-level shape bias.
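
For concreteness, here's a hedged sketch of how a shape-bias score is typically computed on cue-conflict images (in the spirit of Geirhos et al., 2018; the labels below are hypothetical): among trials where a system picks either the shape class or the texture class, what fraction goes to the shape class?

```python
def shape_bias(predictions, shape_labels, texture_labels):
    # Fraction of shape decisions among all shape-or-texture decisions; other errors are ignored
    shape_hits = sum(p == s for p, s in zip(predictions, shape_labels))
    texture_hits = sum(p == t for p, t in zip(predictions, texture_labels))
    return shape_hits / (shape_hits + texture_hits)

predictions    = ["cat", "elephant", "cat", "dog", "bird"]       # hypothetical model outputs
shape_labels   = ["cat", "cat",      "cat", "dog", "plane"]      # the shape in each image
texture_labels = ["elephant", "elephant", "dog", "cat", "bird"]  # the texture in each image
print(shape_bias(predictions, shape_labels, texture_labels))     # 0.6: 3 shape vs. 2 texture picks
```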

Shape bias shown by human vs. models (source: Tuli et al., 2021).


Error Consistency

We can also collect human judgments for all ImageNet classes and check whether the human-label confusion matrix (e.g., $P$) differs more from the CNN-label or the ViT-label confusion matrix (e.g., $Q$). We can use symmetric metrics to measure the distance between two probability distributions, such as the Jensen–Shannon divergence (JSD):

$$\text{JSD}(P \parallel Q) = \frac{1}{2} \left( \text{KL}(P \parallel M) + \text{KL}(Q \parallel M) \right),$$

where $M = \frac{1}{2}(P + Q)$ and $\text{KL}(P \parallel M)$ is the Kullback-Leibler divergence between distributions $P$ and $M$, defined as $\text{KL}(P \parallel M) = \sum_i P(i) \log \frac{P(i)}{M(i)}$.
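
Here's a small NumPy sketch of this computation (the error counts below are made up, and the authors' exact preprocessing may differ): normalize the error counts into probability distributions, then apply the formula above.

```python
import numpy as np

def kl(p, m):
    # KL(P || M) = sum_i P(i) * log(P(i) / M(i)); zero-probability entries contribute nothing
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / m[mask]))

def jsd(p_counts, q_counts):
    p = np.asarray(p_counts, dtype=float).flatten()
    q = np.asarray(q_counts, dtype=float).flatten()
    p, q = p / p.sum(), q / q.sum()   # normalize counts into probability distributions
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Class-wise JSD: 16-element error-count vectors, one entry per category
human_errors = np.array([12, 3, 7, 0, 5, 9, 2, 4, 6, 1, 8, 3, 2, 5, 7, 4])
model_errors = np.array([10, 6, 2, 1, 7, 8, 3, 5, 4, 2, 9, 2, 3, 6, 5, 5])
print(jsd(human_errors, model_errors))   # 0 would mean identical error distributions

# Inter-class JSD: flatten the full 16x16 confusion matrices before comparing
human_cm = np.random.randint(0, 20, size=(16, 16))
model_cm = np.random.randint(0, 20, size=(16, 16))
print(jsd(human_cm, model_cm))
```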

With 1,000 ImageNet classes, populating the $10^6$ cells of a full confusion matrix with human data is infeasible. The ImageNet classes were derived from the WordNet hierarchy — using that hierarchy, the authors identified 16 “entry-level” categories (e.g., airplane, bear, bird) that are hypernyms of the finer labels, reducing the matrix dimensions to $16 \times 16$.

JSD returns a non-negative scalar, where 0 indicates identical distributions and large numbers indicate large divergences. The authors computed two types of JSD:

  • Class-wise JSD: Collapse the $16 \times 16$ confusion matrix into a 16-vector, where each element represents the total count of errors made for that class πŸ‘‰ compute JSD between the human and the model (CNN or ViT) error vectors.
  • Inter-class JSD: Compare the full $16 \times 16$ confusion matrices, where each element represents the count of times one class is misclassified as another πŸ‘‰ compute JSD between the human and the model (CNN or ViT) confusion matrices to measure how similarly they confuse specific pairs of classes.

Cohen’s $\kappa$ is another measure of whether humans and models tend to make mistakes on the same images and how much this agreement differs from chance. However, Cohen’s $\kappa$ has the limitation of not considering which label is assigned when an error is made. Overall, ViT shows more consistency with human errors, especially after fine-tuning. Interestingly, fine-tuning made CNNs less consistent with human errors.
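
Here's a minimal sketch of that idea (assuming binary right/wrong vectors over the same set of images, and using scikit-learn's implementation of Cohen's $\kappa$; the data below are made up):

```python
from sklearn.metrics import cohen_kappa_score

human_correct = [1, 0, 1, 1, 0, 1, 0, 1]   # 1 = classified correctly, 0 = misclassified
model_correct = [1, 0, 1, 0, 0, 1, 1, 1]
print(cohen_kappa_score(human_correct, model_correct))  # ~0.47: above-chance agreement on which images are missed
```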

Error consistency between human vs. models (source: Tuli et al., 2021).


Cognitively Inspired AI

Looking back on my dad’s question in Winter 2015, I’d answer differently today.

- “If AI works, does it matter if it works like the mind? Since the mind already works, does it matter if we can reverse-engineer it?" Dad 🧐.

Back then, as a 21-year-old dead set on becoming a cognitive scientist, I wanted to believe that some special algorithms allow humans to learn from small amounts of data in a way that machines cannot. Now, as an engineer, I think that if pretraining on large datasets gets us human-level generalization and performance, perhaps it doesn’t matter if we can create a replica of the human mind — just like if we can build a plane that flies, does it matter if it flies the same way as a bird?

But I still believe in the value of Cognitively Inspired AI, especially in a world seemingly dominated by Transformers. In a recent talk publicizing her startup World Labs, Fei-Fei Li emphasized the importance of spatial intelligence. The 3D world doesn’t come with captions describing what’s happening — it follows the rules of physics. To navigate this world, we may need a new form of representation, one that goes beyond sequences, tokens, and attention. As for what that might be, I look forward to what emerges from Fei-Fei’s new venture. Before then, I have this hunch that the answer lies in some old CogSci literature from decades ago… We shall see.

References

Papers

  1. Are Convolutional Neural Networks or Transformers More Like Human Vision? (2021) by Tuli, Dasgupta, Grant, and Griffiths, CogSci.
  2. Vision Transformers Represent Relations Between Objects (2024) by Lepori et al., arXiv.
  3. ImageNet Classification with Deep Convolutional Neural Networks (2012) by Krizhevsky, Sutskever, and Hinton, NeurIPS.
  4. Deep Learning (2015) by LeCun, Bengio, and Hinton, Nature.
  5. An Introduction to Convolutional Neural Networks (2015) by O’Shea and Nash, arXiv.
  6. Performance-optimized Hierarchical Models Predict Neural Responses in Higher Visual Cortex (2014) by Yamins et al., PNAS.
  7. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2021) by Dosovitskiy et al., ICLR.
  8. Vision (1982) by Marr, MIT Press.
  9. Probabilistic Models of Cognition: Exploring Representations and Inductive Biases (2010) by Griffiths et al., Trends in Cognitive Sciences.
  10. How to Grow a Mind: Statistics, Structure, and Abstraction (2011) by Tenenbaum et al., Science.
  11. Building Machines that Learn and Think Like People (2017) by Lake et al., Behavioral and Brain Sciences.
  12. Levels of Analysis for Machine Learning (2020) by Hamrick, arXiv.
  13. Yuan’s Qualifying Exam Notes (2018), UC Berkeley.

Talks

  1. But What Is A Convolution? by 3Blue1Brown, YouTube.
  2. Geoffrey Hinton and Fei-Fei Li in Conversation, YouTube.
  3. Aerodynamics For Cognition by Griffiths, Edge.
  4. The Future of AI is Here by Fei-Fei Li on her startup World Labs, YouTube.