AlexNet and ImageNet: The Birth of Deep Learning

Today’s deep learning revolution traces back to the 30th of September, 2012. On this day, a Convolutional Neural Network (CNN) called AlexNet won the ImageNet 2012 challenge [1]. AlexNet didn’t just win; it dominated.

AlexNet was unlike the other competitors. This new model demonstrated unparalleled performance on the largest image dataset of the time, ImageNet. This event made AlexNet the first widely acknowledged, successful application of deep learning. It caught people’s attention with a 9.8 percentage point advantage over the nearest competitor [2].

The best ImageNet challenge results in 2010 and 2011, compared against all results in 2012, including AlexNet [2].

Until this point, deep learning was a nice idea that most deemed impractical. AlexNet showed that deep learning was more than a pipe dream, and its authors showed the world how to make it practical. Yet, the surge of deep learning that followed was not fueled solely by AlexNet. Indeed, without the huge ImageNet dataset, there would have been no AlexNet.

Number of “ML” papers on arXiv per year [3].

The future of AI was to be built on the foundations set by the ImageNet challenge and the novel solutions that enabled the synergy between ImageNet and AlexNet.


Fei-Fei Li, WordNet, and Mechanical Turks

In 2006, computer vision was an underfunded discipline that attracted little attention. Even so, many researchers were focused on building better models. Year after year saw progress, but it was slow.

Fei-Fei Li had just completed her Ph.D. in Computer Vision at Caltech [4] and started as a computer science professor at the University of Illinois Urbana-Champaign. During this time, Li noticed this focus on models and the resulting lack of focus on data.

Li thought that the key to better model performance could be bigger datasets that reflected the diversity of the real world.

During Li’s research into datasets, she learned about Professor Christiane Fellbaum, a co-developer of a dataset from the 1980s called WordNet. WordNet consisted of many English-language terms organized into an ontological structure [5].

Example of the ontological structure of WordNet [5].

In 2007, Li and Fellbaum met. Fellbaum discussed her current work on adding a reference image to each word in WordNet. This inspired an idea that would shift the world of computer vision into hyperdrive. Soon after, Li put together a team to build what would become the largest image dataset of its time: ImageNet [6].

The idea behind ImageNet is that a large ontology of images – based on WordNet – could be the key to developing advanced, content-based image retrieval and understanding [7].

Two years later, the first version of ImageNet was released with 12 million images structured and labeled in line with the WordNet ontology. If one person had annotated one image per minute and done nothing else (no sleeping or eating), the job would have taken 22 years and 10 months.
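The single-annotator estimate is easy to check. The short calculation below uses only the figures from the text (12 million images at one image per minute) and ignores leap years, which is our simplification.

```python
# Back-of-the-envelope check of the single-annotator estimate.
IMAGES = 12_000_000        # images in the first release, per the text
MINUTES_PER_IMAGE = 1      # annotation speed assumed in the text

total_minutes = IMAGES * MINUTES_PER_IMAGE
total_days = total_minutes / (60 * 24)             # working around the clock
years, remainder_days = divmod(total_days, 365)    # ignoring leap years
months = remainder_days / (365 / 12)               # average month length

print(f"{total_days:,.0f} days = {int(years)} years and {months:.0f} months")
# 8,333 days = 22 years and 10 months
```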

To do this in under two years, Li turned to Amazon Mechanical Turk, a crowdsourcing platform where anyone can hire people from around the globe to perform tasks cost-effectively.

The ImageNet team instructed “Turkers” to decide whether an image represents a given word (from the WordNet ontology) or not. Several measures were implemented to ensure accurate annotation, including having multiple Turker scores for each image-word pair [7].
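To make the idea of multiple votes per image-word pair concrete, here is a minimal sketch of label aggregation by simple majority vote. This is a simplification for illustration; the function name and the `min_agreement` threshold are our own assumptions, and the procedure described in [7] is more involved than a fixed majority rule.

```python
from collections import Counter

def aggregate_votes(votes, min_agreement=0.5):
    """Combine several yes/no votes for one image-word pair.

    Returns the winning vote if more than `min_agreement` of the annotators
    agree on it, otherwise None (the pair stays undecided).
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_agreement else None

# Three hypothetical annotators judge "does this image show a goldfish?"
print(aggregate_votes([True, True, False]))   # clear majority
print(aggregate_votes([True, False]))         # tie, so no decision
```

An undecided pair could then be sent to more annotators or discarded, which is one simple way to trade cost against label quality.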

On its release, ImageNet was the world’s largest publicly available dataset of labeled images. Yet, there was very little interest in the dataset. After it was presented as a poster at the CVPR conference, the team needed to find another way to stir interest.


ImageNet

When the paper detailing ImageNet was released in 2009, the dataset comprised 12 million images across 22,000 categories.

Example ontologies from WordNet used by ImageNet [7].

Because ImageNet used WordNet’s ontological structure, its images rolled up into ever more general categories.
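The "rolling up" can be pictured as following is-a links from a specific term to ever more general ones. The sketch below uses a tiny hand-written hierarchy; the entries are illustrative and are not taken from the real WordNet data.

```python
# Toy child -> parent ("is-a") links. Illustrative only, not real WordNet data.
HYPERNYM = {
    "Siberian husky": "working dog",
    "working dog": "dog",
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
}

def ancestors(term):
    """Return the chain of increasingly general categories above `term`."""
    chain = []
    while term in HYPERNYM:
        term = HYPERNYM[term]
        chain.append(term)
    return chain

print(" -> ".join(["Siberian husky"] + ancestors("Siberian husky")))
```

An image labeled with the most specific term therefore also counts as an example of every category above it.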

At the time, a few other image datasets also used an ontological structure like ImageNet’s. One of the better known of these was the Extra Sensory Perception (ESP) dataset, which used a similar “crowdsourcing” approach but via the “ESP game”. In this game, partners would try to match words to images, creating labels [8].

The subtrees for many terms were much larger and denser in ImageNet than in the public subset of ESP [7].

Although ESP collected a large amount of data, most of the dataset was not made public [8]. Compared with the 60K images that were, ImageNet offered much larger and denser coverage [7]. Additionally, ESP was found to be fundamentally flawed [9]. Beyond a relicensed version used for Google Image search [10], it did not impact the field of AI.

There was initially little interest in ImageNet or other similar datasets like ESP. At the time, very few people believed that the performance of models could be improved through more data.

Most researchers dismissed the dataset as being too large and complex. In hindsight, this seems surprising. However, at the time, models struggled on datasets with 12 categories, so ImageNet’s 22,000 categories must have seemed absurd.

ImageNet Challenge

By the following year, the ImageNet team managed to organize the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Competitors had to correctly classify images and detect different objects and scenes across a trimmed list of 1,000 ImageNet categories. Every year, the team that produced the model with the lowest error rate won [2].

The scale of the dataset and competition resulted in ILSVRC becoming the primary benchmark in computer vision. Researchers realized that more data could be a good thing.

2012 was not like the previous years. On September 30, 2012, the latest ILSVRC results were released. One model called AlexNet was clearly distinguished from the others [11].

Results of ILSVRC 2012 [11].

AlexNet was the first model to score a sub-25% error rate. The nearest competitor scored 9.8 percentage points behind [1]. AlexNet dominated the competition, and it did so with a deep Convolutional Neural Network (CNN), an architecture dismissed by most as impractical.

Convolutional Neural Networks

A CNN is a neural network model architecture that uses convolutional layers. These models are known today for their high performance on image data and minimal preprocessing or manual feature extraction requirements.

Typical architecture of a CNN.

CNNs use several convolutional layers stacked on top of one another. The first layers can recognize simple features, like edges, shapes, and textures. As the network gets deeper, it produces more “abstract” representations, eventually identifying concepts from mammals to dogs and even Siberian huskies.
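To see how an early layer can respond to something as simple as an edge, here is a minimal pure-Python sketch of the sliding-window operation at the heart of a convolutional layer. The kernel is hand-picked for illustration; in a real CNN these weights are learned, and layers also use multiple channels, strides, and padding that we leave out.

```python
def conv2d(image, kernel):
    """Stride-1 2D convolution without padding (cross-correlation, as in CNNs)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# 4x6 image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Hand-picked vertical-edge detector.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = conv2d(image, kernel)
for row in feature_map:
    print(row)   # each row is [0, 3, 3, 0]
```

The output is zero over the flat regions and large only where the window straddles the dark-to-bright boundary, which is exactly the "simple feature" behavior described above.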


Convolutional Neural Networks will be explained in more detail in the next chapter of Embedding Methods for Image Search.


These networks generally work best with many layers and large amounts of data, so they were overlooked. Shallow implementations lacked benefits over other networks, and deeper implementations were computationally unrealistic; the odds were stacked against these networks.

Despite these potential challenges, the authors of AlexNet won ILSVRC by a 9.8 percentage point margin with one of these models. It turns out they were the right people in the right place at the right time.

Several pieces came together for this to work. ImageNet provided the massive amounts of data required to train a deep CNN. A few years earlier, Nvidia had released CUDA, an API that enabled software access to highly parallel GPU processing [12][13]. GPU power had reached a point where training AlexNet’s 60 million parameters became practical with the use of multiple GPUs.

AlexNet

AlexNet was by no means small. To make it work, the authors had to solve many problems. The model consisted of eight layers: five convolutional layers followed by three fully-connected linear layers. To produce the 1000-label classification needed for ImageNet, the final layer used a 1000-node softmax, creating a probability distribution over the 1000 classes.
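The softmax in the final layer turns raw output activations into a probability distribution. Below is a minimal pure-Python sketch using three made-up activations rather than 1000; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(logits):
    """Map raw activations to probabilities that are positive and sum to 1."""
    m = max(logits)                              # for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Three made-up output activations standing in for AlexNet's 1000.
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs], "sum =", round(sum(probs), 6))
```

The largest activation always receives the largest probability, so taking the most probable class is the same as taking the largest raw activation.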

Network architecture of AlexNet [1].

A key conclusion from AlexNet was that the depth of the network had been instrumental to its performance. That depth produced a lot of parameters, making training on CPU either impractically slow or simply impossible. Training on GPU could make training time practical. Still, high-end GPUs of the time were limited to ~3GB of memory, not enough to train AlexNet.

To make this work, AlexNet was distributed across two GPUs. Each GPU handled one-half of AlexNet. The two halves would communicate in specific layers to ensure they were not training two separate models.

ReLU activation function.

Training time was reduced further by swapping the standard sigmoid or tanh activation functions of the time for Rectified Linear Unit (ReLU) activation functions.

A four-layer CNN with ReLU activation functions reached a 25% training error rate on the CIFAR-10 dataset six times faster than the equivalent network with tanh activation functions [1].
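One commonly cited reason for this speed-up is that tanh saturates: for large inputs its gradient shrinks towards zero, while the gradient of ReLU stays at 1 for any positive input. The sketch below compares the two gradients at a few arbitrary input values that we chose for illustration.

```python
import math

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Derivative of max(0, x): 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

def tanh_grad(x):
    # Derivative of tanh(x) is 1 - tanh(x)^2.
    return 1.0 - math.tanh(x) ** 2

for x in (0.5, 2.0, 5.0):
    print(f"x={x}: relu={relu(x):.1f}  relu'={relu_grad(x):.1f}  tanh'={tanh_grad(x):.5f}")
```

With tanh, a unit pushed far from zero passes back almost no gradient, so its weights update slowly; a ReLU unit with a positive input does not have this problem.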

ReLU is a simpler operation and, unlike saturating functions, does not require input normalization to keep activations from congregating towards min/max values (saturation). Nonetheless, another type of normalization called Local Response Normalization (LRN) was included. Adding LRN reduced top-1 and top-5 error rates by 1.4% and 1.2%, respectively [1].
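LRN divides each activation by a term that grows with the squared activations of neighboring channels at the same spatial position. The sketch below follows the formula and the constants (k = 2, n = 5, alpha = 1e-4, beta = 0.75) given in the AlexNet paper [1], applied to a single spatial position; the function name and the example activations are our own.

```python
def local_response_norm(acts, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize across channels at one spatial position.

    acts[i] is the activation of channel i at that position. Each value is
    divided by (k + alpha * sum of squares over up to n adjacent channels) ** beta.
    """
    num_channels = len(acts)
    out = []
    for i, a in enumerate(acts):
        lo = max(0, i - n // 2)
        hi = min(num_channels - 1, i + n // 2)
        sq_sum = sum(acts[j] ** 2 for j in range(lo, hi + 1))
        out.append(a / (k + alpha * sq_sum) ** beta)
    return out

# Made-up activations for eight channels; channel 3 is much larger than the rest.
normalized = local_response_norm([1.0, 2.0, 1.5, 40.0, 1.0, 0.5, 2.0, 1.0])
print([round(v, 3) for v in normalized])
```

Channels that sit next to a very large activation are scaled down slightly more than channels far from it, which creates a mild competition between neighboring channels.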

Another critical component of AlexNet was the use of overlapping pooling. Pooling was already used by CNNs to summarize a group of activations in one layer to a single activation in the following layer.

Overlapping pooling.

Overlapping pooling performs the same operation, but, as the pooling window moves across the preceding layer, it overlaps with the previous window. The AlexNet authors found that this improved top-1 and top-5 error rates by 0.4% and 0.3%, respectively, and reduced overfitting [1].
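The difference is easiest to see in one dimension. In the sketch below, non-overlapping pooling uses a window of 2 with a stride of 2, while overlapping pooling uses a window of 3 with a stride of 2, the setting reported in the AlexNet paper [1]. The input values are made up.

```python
def max_pool_1d(values, window, stride):
    """Slide a window over `values` and keep the maximum of each window."""
    return [
        max(values[i:i + window])
        for i in range(0, len(values) - window + 1, stride)
    ]

row = [1, 5, 2, 3, 9, 4, 6]

# Window equals stride: every input belongs to at most one window.
print(max_pool_1d(row, window=2, stride=2))   # [5, 3, 9]

# Window larger than stride: neighboring windows share one input.
print(max_pool_1d(row, window=3, stride=2))   # [5, 9, 9]
```

Both versions produce three outputs here, but with overlap the strong activation 9 contributes to two neighboring outputs instead of one.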

AlexNet in Action

While it’s great to talk about all of this, it’s even better to see it implemented in code. You can find the Colab notebook here, TK, if you’d like to follow along.

Data Preprocessing

Let’s start by downloading and preprocessing our dataset. We will use a small sample from ImageNet hosted on HuggingFace.

The Maysee/tiny-imagenet dataset contains 100K and 10K labeled images in the train and validation sets, respectively. All images are stored as Python PIL objects. Preprocessing of these images consists of several steps:

  • Convert all images to RGB format.
  • Resize to fit AlexNet’s expected input dimensions.
  • Convert to tensor format.
  • Normalize values.
  • Stack this set of tensors into a single batch.

We start with RGB; AlexNet assumes all images will have three color channels (Red, Green, and Blue). But many other formats are supported by PIL, such as L (grayscale), RGBA, and CMYK. We must convert any non-RGB PIL objects into RGB format.

AlexNet, and many other pretrained models, expect input images to be tensors of shape (3 x H x W), where 3 represents the three color channels. H and W are expected to be at least 224 [14]. We must resize our images; this is done easily using torchvision.transforms.

Finally, the image tensors must be loaded in the range [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225] as per the implementation notes on PyTorch docs [14].

We can combine all of this into a batch of 50 images. Rather than running preprocessing on our entire dataset and keeping everything in memory, we process it in batches. For this example, we will only test the first 50 images as we know these should all be labeled as “goldfish”.

We have preprocessed our first batch and produced a single tensor containing 50 input tensors of shape (3, 224, 224), ready for inference with AlexNet.

Inference

To perform inference (i.e., make predictions) with AlexNet, we first need to download the model. We will download the pretrained AlexNet hosted by PyTorch.

We can see the network architecture of AlexNet here with five convolutional layers followed by three feed-forward linear layers. This is a more efficient modification of the original AlexNet and was proposed by Krizhevsky in a later paper [15].

By default, the model is loaded to the CPU. We can run it here, but running on a CUDA-enabled GPU or MPS on Apple Silicon is more efficient. We do this by setting the device like so:
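A minimal sketch of that device selection is shown here. It prefers CUDA, then MPS, and falls back to the CPU; the getattr guard is our addition so the snippet also runs on older PyTorch versions that do not ship the MPS backend.

```python
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")    # Nvidia GPU
elif getattr(torch.backends, "mps", None) is not None and torch.backends.mps.is_available():
    device = torch.device("mps")     # Apple Silicon GPU
else:
    device = torch.device("cpu")     # fallback

print(device)
```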

From here on, we must always move the input tensors and model to the device before performing inference. Once moved, we run inference with model(inputs).

The model will output a set of logits (output activations) for each possible class. There are 1000 of these for every image we feed into the model. The highest activation represents the class that the model predicts for each image. We convert these logits into class predictions with an argmax function.

Most of the predicted values belong to class 1. That has no meaning for us, so we cross check this with the PyTorch AlexNet classes like so:

Clearly, the AlexNet model is predicting goldfish correctly. We can calculate the accuracy with:

sum(preds == 1) / len(preds)

This returns an accuracy of 72% for the goldfish class. A top-1 error rate of 28% beats the reported average error rate of 37.5% from the original AlexNet paper. However, this is only for a single class, and the model performance varies from class to class.


That’s our overview of one of the most significant events in computer vision and machine learning. The ImageNet Challenge was hosted annually until 2017. By then, 29 of 38 contestants had an error rate of less than 5% [16], demonstrating the massive progress made in computer vision during ImageNet’s active years.

AlexNet was superseded by even more powerful CNNs. In 2015, ResNet from Microsoft Research Asia won ILSVRC [17]. Since then, many more CNN architectures have come and gone. Recently, another network architecture known as the transformer has begun to disrupt CNNs’ domination of computer vision.

The final paragraph of the AlexNet paper proved almost prophetic for the future of AI and computer vision. The authors noted that they:

"did not use any unsupervised pre-training even though we expect it will help", and "our results have improved as we have made our network larger... we still have many orders of magnitude to go in order to match the infero-temporal pathway of the human visual system"

Unsupervised pre-training and ever larger models would later become the hallmark of ever better models.

Resources

[1] A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks (2012), NeurIPS

[2] ImageNet Large Scale Visual Recognition Challenge (ILSVRC), ImageNet

[3] J. Dean, Machine Learning for Systems and Systems for Machine Learning (2017), NeurIPS 2017

[4] F. Li, Visual Recognition: Computational Models and Human Psychophysics (2005), Caltech

[5] G. Miller, R. Beckwith, C. Fellbaum, D. Gross, K. Miller, Introduction to WordNet: An On-line Lexical Database (1993)

[6] D. Gershgorn, The data that transformed AI research — and possibly the world (2017), Quartz

[7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database (2009), CVPR

[8] L. Ahn, L. Dabbish, Labeling images with a computer game (2004), Proc. SIGCHI

[9] I. Weber, S. Robertson, M. Vojnovic, Rethinking the ESP Game (2009), ACM

[10] A. Saini, Solving the web’s image problem (2008), BBC News

[11] O. Russakovsky, J. Deng, et al., ImageNet Large Scale Visual Recognition Challenge (2015), IJCV

[12] F. Abi-Chahla, Nvidia’s CUDA: The End of the CPU? (2008), Tom’s Hardware

[13] A. Krizhevsky, cuda-convnet (2011), Google Code Archive

[14] AlexNet Implementation in PyTorch, PyTorch Resources

[15] A. Krizhevsky, One weird trick for parallelizing convolutional neural networks (2014)

[16] ILSVRC2017 Results (2017)

[17] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition (2015)



