For those familiar with convolutional neural networks (if you're not, check out this post), you will know that, for many architectures, the last set of layers are often of the fully connected variety. This is like bolting a standard neural network classifier onto the end of an image processor. The convolutional neural network starts with a series of convolutional (and, potentially, pooling) layers which create feature maps that represent different components of the input images. The fully connected layers at the end then "interpret" the output of these feature maps and make category predictions. However, as with many things in the fast moving world of deep learning research, this practice is starting to fall by the wayside in favor of something called Global Average Pooling (GAP). In this post, I'll introduce the benefits of Global Average Pooling and apply it on the Cats vs Dogs image classification task using TensorFlow 2. In the process, I'll compare its performance to the standard fully connected layer paradigm. The code for this tutorial can be found in a Jupyter Notebook on this site's Github repository, set up for use in Google Colaboratory.


Eager to build deep learning systems? Get the book here


Global Average Pooling

Global Average Pooling is an operation that calculates the average output of each feature map in the previous layer. This fairly simple operation reduces the data significantly and prepares the model for the final classification layer. It also has no trainable parameters – just like Max Pooling (see here for more details). The diagram below shows how it is commonly used in a convolutional neural network:

Global Average Pooling in a CNN architecture


As can be observed, the last layers consist only of a Global Average Pooling layer and a final softmax output layer. In the architecture above, there are 64 averaging calculations corresponding to the 64 7 x 7 channels at the output of the second convolutional layer. The GAP layer transforms the dimensions from (7, 7, 64) to (1, 1, 64) by performing the averaging across the 7 x 7 channel values. Global Average Pooling has the following advantages over the fully connected final layers paradigm:

  • The removal of a large number of trainable parameters from the model. Fully connected or dense layers have lots of parameters. A 7 x 7 x 64 CNN output being flattened and fed into a 500 node dense layer yields 1.56 million weights which need to be trained. Removing these layers speeds up the training of your model.
  • The elimination of all these trainable parameters also reduces the tendency of over-fitting, which needs to be managed in fully connected layers by the use of dropout.
  • The authors argue in the original paper that removing the fully connected classification layers forces the feature maps to be more closely related to the classification categories – so that each feature map becomes a kind of "category confidence map".
  • Finally, the authors also argue that, due to the averaging operation over the feature maps, the model is more robust to spatial translations in the data. In other words, as long as the requisite feature is included / activated in the feature map somewhere, it will still be "picked up" by the averaging operation.
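These properties are easy to check with a quick NumPy sketch (a toy stand-in for real feature maps – illustrative only, not part of the model code later in this post):

```python
import numpy as np

# Toy feature-map stack: batch of one, 7 x 7 spatial grid, 64 channels,
# mirroring the (7, 7, 64) output described above.
feature_maps = np.random.rand(1, 7, 7, 64)

# Global Average Pooling: average over the two spatial axes only.
gap = feature_maps.mean(axis=(1, 2))
print(gap.shape)  # (1, 64) - one averaged value per feature map

# Translation robustness: shifting a feature map spatially (with
# wrap-around here, for simplicity) leaves its global average unchanged.
shifted = np.roll(feature_maps, shift=3, axis=1)
print(np.allclose(gap, shifted.mean(axis=(1, 2))))  # True

# Parameter saving: flattening 7 x 7 x 64 into a 500-node dense layer
# needs 7 * 7 * 64 * 500 weights; the averaging itself needs none.
print(7 * 7 * 64 * 500)  # 1568000 - the ~1.56 million weights noted above
```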

To test out these ideas in practice, in the next section I'll show you an example comparing the benefits of Global Average Pooling with the historical paradigm. This example problem will be the Cats vs Dogs image classification task and I'll be using TensorFlow 2 to build the models. At the time of writing, only TensorFlow 2 Alpha is available, and the reader can follow this link to find out how to install it.

Global Average Pooling with TensorFlow 2 and Cats vs Dogs

To download the Cats vs Dogs data for this example, you can use the following code:

import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_datasets as tfds

split = (80, 10, 10)
splits = tfds.Split.TRAIN.subsplit(weighted=split)
(cat_train, cat_valid, cat_test), info = tfds.load('cats_vs_dogs', split=list(splits), with_info=True, as_supervised=True)

The code above utilizes the TensorFlow Datasets repository which allows you to import common machine learning datasets into TF Dataset objects. For more on using Dataset objects in TensorFlow 2, check out this post. A few things to note. First, the split tuple (80, 10, 10) signifies the (training, validation, test) split as percentages of the dataset. This is then passed to the tensorflow_datasets split object which tells the dataset loader how to break up the data. Finally, the tfds.load() function is invoked. The first argument is a string specifying the dataset name to load. The following arguments relate to whether a split should be used, whether to return an argument with information about the dataset (info) and whether the dataset is intended to be used in a supervised learning problem, with labels being included. In order to examine the images in the data set, the following code can be run:

import matplotlib.pylab as plt

for image, label in cat_train.take(2):
  plt.figure()
  plt.imshow(image)

This produces the following images: As can be observed, the images are of varying sizes. This will need to be rectified so that the images have a consistent size to feed into our model. As usual, the image pixel values (which range from 0 to 255) need to be normalized – in this case, to between 0 and 1. The function below performs these tasks:

IMAGE_SIZE = 100

def pre_process_image(image, label):
  image = tf.cast(image, tf.float32)
  image = image / 255.0
  image = tf.image.resize(image, (IMAGE_SIZE, IMAGE_SIZE))
  return image, label

In this example, we'll be resizing the images to 100 x 100 using tf.image.resize. To get state of the art levels of accuracy, you would probably want a larger image size, say 200 x 200, but in this case I've chosen speed over accuracy for demonstration purposes. As can be observed, the image values are also cast into the tf.float32 datatype and normalized by dividing by 255. Next we apply this function to the datasets, and also shuffle and batch where appropriate:

TRAIN_BATCH_SIZE = 64

cat_train = cat_train.map(pre_process_image).shuffle(1000).repeat().batch(TRAIN_BATCH_SIZE)
cat_valid = cat_valid.map(pre_process_image).repeat().batch(1000)

For more on TensorFlow datasets, see this post. Now it is time to build the model – in this case, we'll be using the Keras API in TensorFlow 2. In this example, I'll be using a common "head" model, which consists of layers of standard convolutional operations – convolution and max pooling, with batch normalization and ReLU activations:

head = tf.keras.Sequential()
head.add(layers.Conv2D(32, (3, 3), input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3)))
head.add(layers.BatchNormalization())
head.add(layers.Activation('relu'))
head.add(layers.MaxPooling2D(pool_size=(2, 2)))
head.add(layers.Conv2D(32, (3, 3)))
head.add(layers.BatchNormalization())
head.add(layers.Activation('relu'))
head.add(layers.MaxPooling2D(pool_size=(2, 2)))
head.add(layers.Conv2D(64, (3, 3)))
head.add(layers.BatchNormalization())
head.add(layers.Activation('relu'))
head.add(layers.MaxPooling2D(pool_size=(2, 2)))
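As an aside, you can trace the spatial dimensions through this head model by hand: each 3 x 3 convolution above uses "valid" padding (the Keras default), trimming a 1-pixel border, and each 2 x 2 max pool halves the dimension (rounding down). A quick sketch of that arithmetic:

```python
# Trace the spatial size through the head model: three stages of
# (3x3 'valid' convolution -> 2x2 max pooling), from the 100 x 100 input.
size = 100  # IMAGE_SIZE
for _ in range(3):
    size -= 2    # 3x3 convolution with 'valid' padding trims 2 pixels
    size //= 2   # 2x2 max pooling halves the dimension (floor)
print(size)  # 10 - the head outputs (10, 10, 64) feature maps
```

So with 100 x 100 inputs the head actually produces 10 x 10 feature maps rather than the 7 x 7 shown in the earlier diagram – the GAP idea is unchanged, only the size of the averaging window differs.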

Next, we need to add the "back-end" of the network to perform the classification.

Standard fully connected classifier results

In the first instance, I'll show the results of a standard fully connected classifier, without dropout. Because, for this example, there are only two possible classes – "cat" or "dog" – the final output layer is a dense / fully connected layer with a single node and a sigmoid activation.
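As a refresher (this snippet is illustrative only, not part of the model code), the sigmoid squashes the final node's raw score into a value between 0 and 1, which can be read as the probability of one of the two classes and pairs naturally with the binary crossentropy loss used later:

```python
import numpy as np

def sigmoid(z):
    # Squash a raw score into the (0, 1) range.
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_pred):
    # Average negative log-likelihood for binary labels.
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

scores = np.array([-2.0, 0.0, 3.0])  # raw outputs of a single-node dense layer
probs = sigmoid(scores)
print(probs.round(3))  # [0.119 0.5   0.953]

labels = np.array([0.0, 0.0, 1.0])
print(binary_crossentropy(labels, probs))  # small loss: predictions mostly agree
```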

standard_classifier = tf.keras.Sequential()
standard_classifier.add(layers.Flatten())
standard_classifier.add(layers.BatchNormalization())
standard_classifier.add(layers.Dense(100))
standard_classifier.add(layers.Activation('relu'))
standard_classifier.add(layers.BatchNormalization())
standard_classifier.add(layers.Dense(100))
standard_classifier.add(layers.Activation('relu'))
standard_classifier.add(layers.Dense(1))
standard_classifier.add(layers.Activation('sigmoid'))

As can be observed, in this case, the classification output includes two 100-node dense layers. To combine the head model and this standard classifier, the following commands can be run:

standard_model = tf.keras.Sequential([
    head,
    standard_classifier
])

Finally, the model is compiled, a TensorBoard callback is created for visualization purposes, and the Keras fit command is executed:

import datetime as dt

standard_model.compile(optimizer=tf.keras.optimizers.Adam(),
                       loss='binary_crossentropy',
                       metrics=['accuracy'])
callbacks = [tf.keras.callbacks.TensorBoard(log_dir='./log/{}'.format(dt.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")))]
standard_model.fit(cat_train, steps_per_epoch=23262 // TRAIN_BATCH_SIZE, epochs=10,
                   validation_data=cat_valid, validation_steps=10, callbacks=callbacks)
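One detail worth spelling out: because .repeat() makes the training dataset loop indefinitely, steps_per_epoch tells Keras how many batches to count as one epoch. The figure used above is simply the Cats vs Dogs image count (23,262) divided by the batch size:

```python
# steps_per_epoch = dataset size // batch size; the integer division
# drops any final partial batch.
TRAIN_BATCH_SIZE = 64
steps_per_epoch = 23262 // TRAIN_BATCH_SIZE
print(steps_per_epoch)  # 363 batches per epoch
```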

Note that the loss used is binary crossentropy, due to the binary classes for this example. The training progress over 7 epochs can be seen in the figure below:

Standard classifier without average pooling

Standard classifier accuracy (red – training, blue – validation)

Standard classifier loss without average pooling

Standard classifier loss (red – training, blue – validation)

As can be observed, with a standard fully connected classifier back-end to the model (without dropout), the training accuracy reaches high values but the model overfits with respect to the validation dataset. The validation accuracy stagnates around 80% and the validation loss begins to increase – a sure sign of overfitting.

Global Average Pooling results

The next step is to test the results of Global Average Pooling in TensorFlow 2. To build the GAP layer and associated model, the following code is added:

average_pool = tf.keras.Sequential()
average_pool.add(layers.GlobalAveragePooling2D())
average_pool.add(layers.Dense(1, activation='sigmoid'))

pool_model = tf.keras.Sequential([
    head,
    average_pool
])
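The parameter saving argued for earlier is easy to quantify with some back-of-envelope arithmetic (a rough sketch that ignores the batch normalization parameters and uses the 7 x 7 x 64 feature map from the earlier diagram for illustration):

```python
# Rough trainable-parameter counts for the two back-ends, assuming a
# 7 x 7 x 64 feature map (batch normalization parameters ignored).
flat = 7 * 7 * 64  # 3136 values after flattening

# FC back-end: Dense(100) -> Dense(100) -> Dense(1), each with biases.
fc_params = (flat * 100 + 100) + (100 * 100 + 100) + (100 * 1 + 1)
print(fc_params)  # 323901

# GAP back-end: the averaging has no parameters; only the Dense(1)
# layer on the 64 pooled channel values remains.
gap_params = 64 * 1 + 1
print(gap_params)  # 65
```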

The accuracy results for this model, along with the results of the standard fully connected classifier model, are shown below:

Global Average Pooling accuracy

Global average pooling accuracy vs standard fully connected classifier model (pink – GAP training, green – GAP validation, blue – FC classifier validation)

As can be observed from the graph above, the Global Average Pooling model has a higher validation accuracy by the 7th epoch than the fully connected model. The training accuracy is lower than the FC model, but this is clearly due to overfitting being reduced in the GAP model. A final comparison including the case of the FC model with a dropout layer inserted is shown below:

standard_classifier_with_do = tf.keras.Sequential()
standard_classifier_with_do.add(layers.Flatten())
standard_classifier_with_do.add(layers.BatchNormalization())
standard_classifier_with_do.add(layers.Dense(100))
standard_classifier_with_do.add(layers.Activation('relu'))
standard_classifier_with_do.add(layers.Dropout(0.5))
standard_classifier_with_do.add(layers.BatchNormalization())
standard_classifier_with_do.add(layers.Dense(100))
standard_classifier_with_do.add(layers.Activation('relu'))
standard_classifier_with_do.add(layers.Dense(1))
standard_classifier_with_do.add(layers.Activation('sigmoid'))

Global Average Pooling accuracy vs FC with dropout

Global average pooling validation accuracy vs FC classifier with and without dropout (green – GAP model, blue – FC model without DO, orange – FC model with DO)

As can be seen, of the three model options sharing the same convolutional front end, the GAP model has the best validation accuracy after 7 epochs of training (the x-axis in the graph above is the number of batches). Dropout improves the validation accuracy of the FC model, but the GAP model is still narrowly out in front. Further tuning could be performed on the fully connected models and results may improve. However, one would expect Global Average Pooling to be at least equivalent to a FC model with dropout – even though it has hundreds of thousands of fewer parameters. I hope this short tutorial gives you a good understanding of Global Average Pooling and its benefits. You may want to consider it in the architecture of your next image classifier design.

