In the previous post I showed that shallow convolutional neural networks, with rectified linear hidden units and max-pooling, were able to achieve classification accuracies as high as 74%. I then speculated that we could improve the accuracy further by adding a fully-connected layer at the end to allow the network to learn more global features.
I tried that initially, but found that training slowed down by an order of magnitude! Training even a single epoch with around 100 hidden units in the top fully-connected layer took more than an hour. Combined with my previous observation that we need at least 100 epochs, this means it would take more than 4 days (100 hours) to train a single model. The reason is that the fully-connected layer increases the total number of parameters ten-fold (from a few hundred thousand parameters to millions of parameters). Of course, we could reduce the number of units in the fully-connected layer, but then we’d probably also lose all of its benefits.
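To make the parameter blow-up concrete, here is a minimal sketch that counts parameters for a convolutional layer and for a fully-connected layer. The layer sizes are hypothetical and chosen only for illustration; they are not the exact architecture used in these experiments.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per output feature map, for a square kernel."""
    return out_channels * (in_channels * kernel_size * kernel_size + 1)


def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit, for a fully-connected layer."""
    return n_units * (n_inputs + 1)


# Hypothetical sizes (assumptions, not the network from this post).
conv_count = conv_params(in_channels=3, out_channels=32, kernel_size=5)
flattened = 32 * 24 * 24  # 32 feature maps of 24x24 units, flattened
dense_count = dense_params(n_inputs=flattened, n_units=100)

print("convolutional layer:  ", conv_count)
print("fully-connected layer:", dense_count)
```

A convolutional layer shares its small kernels across all image positions, so its parameter count does not depend on the image size. The fully-connected layer needs one weight per input unit per hidden unit, which is why even 100 hidden units can dominate the model.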
So I changed direction. Instead, I added an additional convolutional layer of rectified linear units. The layer has the same architecture as before, except that it now has 64 feature maps. The reasoning is that since the top feature maps now represent a larger part of the original image, they must also be able to represent a larger number of features.
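The claim that the top layer sees a larger part of the image can be checked with a little receptive-field arithmetic. The sketch below assumes each block is a 5x5 convolution with stride 1 followed by 2x2 max-pooling with stride 2; these sizes are assumptions for illustration, not values taken from the experiments.

```python
def receptive_field(layers):
    """Width, in input pixels, seen by one unit after the given layers.

    Each layer is a (kernel_size, stride) pair, applied in order.
    """
    size, jump = 1, 1  # jump = distance in input pixels between adjacent units
    for kernel_size, stride in layers:
        size += (kernel_size - 1) * jump
        jump *= stride
    return size


# Assumed block: 5x5 convolution (stride 1), then 2x2 max-pooling (stride 2).
block = [(5, 1), (2, 2)]
one_block = receptive_field(block)
two_blocks = receptive_field(block * 2)

print("one conv+pool block: ", one_block, "pixels wide")
print("two conv+pool blocks:", two_blocks, "pixels wide")
```

Under these assumptions, a unit after the second block covers a patch several times larger in area than a unit after the first, which is the motivation for giving the second layer more feature maps.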
Interestingly, this yielded even better accuracies than before!
|Dataset||Negative Log-likelihood (NLL)||Accuracy|
However, we should also note that the negative log-likelihood has increased for the validation and test sets. This means that while we are overfitting heavily w.r.t. log-likelihood, we are still improving our discrimination performance. The log-likelihood and misclassification error plots below confirm this.
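It may seem odd that the two metrics can move in opposite directions, so here is a toy illustration. The probabilities are made up for a two-class problem and are not results from these experiments: each number is the probability a model assigns to the true class of one example.

```python
import math


def nll(probs_true_class):
    """Mean negative log-likelihood of the true classes."""
    return -sum(math.log(p) for p in probs_true_class) / len(probs_true_class)


def accuracy(probs_true_class):
    """Two-class case: a prediction is correct when the true class gets > 0.5."""
    return sum(p > 0.5 for p in probs_true_class) / len(probs_true_class)


# Made-up probabilities assigned to the true class of four examples.
earlier = [0.6, 0.6, 0.4, 0.4]       # cautious model: two right, two wrong
later = [0.99, 0.99, 0.99, 0.001]    # confident model: three right, one badly wrong

for name, probs in [("earlier", earlier), ("later", later)]:
    print(name, "accuracy:", accuracy(probs), "NLL:", round(nll(probs), 3))
```

The later model classifies more examples correctly, but its single very confident mistake is penalized so heavily by the logarithm that the average NLL still goes up. Accuracy only cares which side of the decision boundary a prediction falls on; NLL also cares how confident it was.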
The next step is to try training with dropout to prevent overfitting. This was found to be very effective for ImageNet, but also required twice as many training epochs.
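For reference, here is a minimal sketch of what dropout does to a layer's activations. It uses the "inverted" variant, which rescales the surviving units during training so that nothing changes at test time; the original formulation instead rescales the weights at test time. The function name and signature are my own for illustration and do not come from any particular library.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero each unit with probability drop_prob.

    Surviving units are divided by the keep probability so the expected
    activation is unchanged. At test time the input is passed through as-is.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)  # fixed seed so the run is repeatable
hidden = [1.0] * 10     # stand-in for one layer's activations
train_out = dropout(hidden, 0.5, rng)
test_out = dropout(hidden, 0.5, rng, training=False)

print("training:", train_out)
print("testing: ", test_out)
```

Because a different random subset of units is silenced on every training step, no unit can rely on specific other units being present, which is what discourages overfitting.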