It is well known that neural networks have been extremely successful at cracking problems previously thought to be intractable. The recently announced Google Assistant's ability to book restaurant and hairdresser appointments is just the latest in a long list of achievements of neural networks. In this blog post, I want to emphasize that deep learning is not always a necessity, and that often nearly as accurate predictions can be achieved with simpler, more transparent and less computationally expensive algorithms. To this end, we will try to crack handwritten digit recognition without neural networks. We will use the MNIST database, a well known and extensively used database of handwritten digits. You can download it here: http://yann.lecun.com/exdb/mnist/. On the same site, the performance of different classifiers is reported, including neural nets. So, clearly, what I will illustrate in this blog post is not new (this is usually the case in machine learning), but it is still worth my time (and, I hope, yours).

Python, get the data!

The MNIST database is so popular that it comes with Python's sklearn module. To get it, we just need to call the function fetch_mldata:

from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original', data_home="./")
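
Note that fetch_mldata has been removed from more recent versions of sklearn (0.20 and later). If the call above fails, the same data can be fetched from OpenML instead; a minimal sketch, assuming sklearn 0.24 or later:

from sklearn.datasets import fetch_openml

# Same 70000 images, fetched from OpenML; the labels come back as strings ('0'...'9')
mnist = fetch_openml('mnist_784', version=1, as_frame=False, data_home="./")
mnist.target = mnist.target.astype(int)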

The call to fetch_mldata downloads the data to the specified local path as a .mat file and loads it into the Python environment. There are 70000 images, with a resolution of 28x28 pixels, ready to go into a machine learning pipeline. Here are a few examples:

import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=(16,4));

# image 0
plt.subplot(1, 4, 1);
plt.imshow(mnist.data[0].reshape(28,28),
              cmap = plt.cm.gray, interpolation='nearest',
              clim=(0, 255));

# image 1
plt.subplot(1, 4, 2);
plt.imshow(mnist.data[10000].reshape(28, 28),
              cmap = plt.cm.gray, interpolation='nearest',
              clim=(0, 255));

# image 2
plt.subplot(1, 4, 3);
plt.imshow(mnist.data[15000].reshape(28, 28),
              cmap = plt.cm.gray, interpolation='nearest',
              clim=(0, 255));

# image 3
plt.subplot(1, 4, 4);
plt.imshow(mnist.data[20000].reshape(28, 28),
              cmap = plt.cm.gray, interpolation='nearest',
              clim=(0, 255));

[Figure: four example handwritten digits from the dataset]

Pre-processing

The database is clean, in the sense that all elements are handwritten digits. So we do not have to worry about data cleaning, often considered to be 80% of the data scientist’s job. Instead, we can directly start pre-processing.

Training and testing

The first step is to divide the data into training and test sets. As their names suggest, the former is used to train predictive models, whereas the latter is used to test their performance. An even better practice would be k-fold cross-validation: divide the dataset into k equally sized subsets and train k models, each tested on a different subset, the final performance estimate being the average over the k folds. The clear advantage of this method is that the estimate depends far less on one particular split of the data, so it tells us more reliably how well the model generalizes to new data. However, it is time consuming and not necessary for this blog post's purposes, so we only sketch it below.
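
For reference, here is a minimal sketch of what k-fold cross-validation could look like with sklearn's cross_val_score, using a logistic regression as a placeholder model (we will not use this in the rest of the post, and on the full dataset it would be slow):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# 5-fold cross-validation: each fold is used once as a test set,
# and the five accuracy scores are averaged into a single estimate
scores = cross_val_score(LogisticRegression(solver='lbfgs'), mnist.data, mnist.target, cv=5)
print(scores.mean(), scores.std())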

For us, a good compromise is to keep 60000 images in the training set and 10000 images in the test set. We can do this using the train_test_split function of sklearn:

from sklearn.model_selection import train_test_split
# divide the data into train and test sets, with a proportion of 6:1
train_img, test_img, train_lbl, test_lbl = train_test_split(mnist.data, 
                                                            mnist.target, 
                                                            test_size=1/7.0,
                                                            random_state=0)

Rescaling

Next, we have to rescale the feature vectors. This is a very important step, as it can dramatically increase the performance of the classifier. sklearn's StandardScaler does this by subtracting from each feature its mean and dividing by its standard deviation, both computed on the training set:

from sklearn.preprocessing import StandardScaler
import warnings; warnings.simplefilter('ignore')

scaler = StandardScaler()
# Fit on training set only
scaler.fit(train_img)
# Apply transform to both the training set and the test set
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)

Dimensionality reduction

Before feeding the data to our favourite machine learning algorithms, there is one last step that can help achieve better performance. The images come as vectors of dimension 784, so they are already respectable feature vectors. However, not all of these dimensions necessarily contain relevant information: for instance, the pixels in the corners are almost always empty, as the figures above show. To avoid carrying unnecessary information along, we can use Principal Component Analysis (PCA), a well known technique in statistical analysis that, roughly speaking, finds a rotation of the feature space such that the variance of the data projected onto the first principal component is maximal and decreases monotonically for the following principal components. As you might have guessed, sklearn has a function for that. As an argument, we can specify the fraction of the variance (i.e. information) that we want to keep in our feature vectors. We will try three different options: 95%, 90% and 50% of the variance.

from sklearn.decomposition import PCA
# apply PCA with different variance thresholds
pca95 = PCA(.95)
pca95.fit(train_img)

pca90 = PCA(.90)
pca90.fit(train_img)

pca50 = PCA(.50)
pca50.fit(train_img)

pca95_train_img = pca95.transform(train_img)
pca95_test_img = pca95.transform(test_img)

pca90_train_img = pca90.transform(train_img)
pca90_test_img = pca90.transform(test_img)

pca50_train_img = pca50.transform(train_img)
pca50_test_img = pca50.transform(test_img)

[pca95.n_components_, pca90.n_components_, pca50.n_components_]
[330, 236, 39]

It turns out that 95% of the information is contained in only 330 dimensions, 90% in 236 and 50% in 39! It is important to note that the results of Principal Component Analysis are not independent of the rescaling: applying PCA before rather than after rescaling, or rescaling differently, would have given different results. Still, it is a great tool for feature preparation, if used responsibly. Now let's illustrate the effect of PCA by plotting the same image before and after these transformations. To do this, we need to rotate the feature vectors back to their original axes and revert the scaling:

res95_train_img = pca95.inverse_transform(pca95_train_img)
res90_train_img = pca90.inverse_transform(pca90_train_img)
res50_train_img = pca50.inverse_transform(pca50_train_img)

res95_train_img = scaler.inverse_transform(res95_train_img)
res90_train_img = scaler.inverse_transform(res90_train_img)
res50_train_img = scaler.inverse_transform(res50_train_img)

res100_train_img = scaler.inverse_transform(train_img)
import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=(16,4));

# Original Image
plt.subplot(1, 4, 1);
plt.imshow(res100_train_img[1].reshape(28,28),
              cmap = plt.cm.gray, interpolation='nearest',
              clim=(0, 255));
plt.xlabel('784 components', fontsize = 14)
plt.title('Original Image', fontsize = 14);

# 330 principal components
plt.subplot(1, 4, 2);
plt.imshow(res95_train_img[1].reshape(28, 28),
              cmap = plt.cm.gray, interpolation='nearest',
              clim=(0, 255));
plt.xlabel('330 components', fontsize = 14)
plt.title('95% of Variance', fontsize = 14);

# 236 principal components
plt.subplot(1, 4, 3);
plt.imshow(res90_train_img[1].reshape(28, 28),
              cmap = plt.cm.gray, interpolation='nearest',
              clim=(0, 255));
plt.xlabel('236 components', fontsize = 14)
plt.title('90% of Variance', fontsize = 14);

# 39 principal components
plt.subplot(1, 4, 4);
plt.imshow(res50_train_img[1].reshape(28, 28),
              cmap = plt.cm.gray, interpolation='nearest',
              clim=(0, 255));
plt.xlabel('39 components', fontsize = 14)
plt.title('50% of Variance', fontsize = 14);

[Figure: the same digit reconstructed from 784 components (original image) and from 330, 236 and 39 principal components (95%, 90% and 50% of the variance)]

As expected, leaving out information leads to increasingly blurred images.
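
We can also look at how the variance accumulates over the principal components, using the explained_variance_ratio_ attribute of the fitted PCA object (a quick sketch, reusing pca95 from above):

# Cumulative fraction of the variance explained by the first n principal components
cumulative = np.cumsum(pca95.explained_variance_ratio_)

plt.figure(figsize=(6, 4));
plt.plot(cumulative);
plt.xlabel('number of components', fontsize=14);
plt.ylabel('cumulative explained variance', fontsize=14);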

Another way to illustrate the effect of PCA is to look at the scatter plot of the different digits in feature space, before and after the rotation to the principal components. Using the seaborn and pandas modules, we can conveniently gather the data into a single dataframe and plot it as follows:

import seaborn as sns
import pandas as pd

rng = list(range(99))
mrks = ["$0$", "$1$", "$2$", "$3$", "$4$", "$5$", "$6$", "$7$", "$8$", "$9$"]

df1 = pd.DataFrame({"x" : pca95_train_img[rng,0],
                    "y" : pca95_train_img[rng,1],
                    "label" : train_lbl[rng],
                    "PCA" : "yes"})
df2 = pd.DataFrame({"x" : 7.5*train_img[rng,370],
                    "y" : 7.5*train_img[rng,371],
                    "label" : train_lbl[rng],
                    "PCA" : "no"})

df_plot = pd.concat([df1, df2])

plot = sns.lmplot(x="x", 
                  y="y",  
                  data=df_plot, 
                  fit_reg=False, 
                  hue='label', 
                  markers=mrks, 
                  legend = False,
                  col="PCA")

[Figure: scatter plot of the digits in the plane of the first two principal components (left) and in the plane of two central pixels of the original feature vectors (right)]

The plot on the left shows the first two principal components, which account for nearly 10% of the total variance. Compared to the plot on the right, which shows two central pixels of the original feature vectors (rescaled for illustration purposes), it is striking that after PCA the feature vectors occupy the feature space much more uniformly. Also, with just 10% of the variance, we can already distinguish the zeros in the lower right corner and the sevens in the upper left corner.

Learning to distinguish digits

We are now ready to train some models for handwritten digit recognition. The aim isn't to come up with the best possible performance, but rather to show that surprisingly good performance can be attained with very little effort. We will use five different models: logistic regression, k-nearest neighbours, random forest, support vector machine and gradient boosting machine.

Logistic regression

Simpler things first. Logistic regression is a very simple classification algorithm and probably the first one to try. Let us train classifiers with the previously prepared feature vectors: do we need 100% of the information, or do we get better performance by leaving some combination of pixels out?

from sklearn.linear_model import LogisticRegression
# all parameters not specified are set to their defaults
# default solver is incredibly slow which is why it was changed to 'lbfgs'

lr1 = LogisticRegression(solver = 'lbfgs')
lr1.fit(train_img, train_lbl)
lr100 = lr1.score(test_img, test_lbl)

lr2 = LogisticRegression(solver = 'lbfgs')
lr2.fit(pca95_train_img, train_lbl)
lr95 = lr2.score(pca95_test_img, test_lbl)

lr3 = LogisticRegression(solver = 'lbfgs')
lr3.fit(pca90_train_img, train_lbl)
lr90 = lr3.score(pca90_test_img, test_lbl)

lr4 = LogisticRegression(solver = 'lbfgs')
lr4.fit(pca50_train_img, train_lbl)
lr50 = lr4.score(pca50_test_img, test_lbl)
df_lr = pd.DataFrame({"% of variance" : [100, 95, 90, 50],
                      "dimension of feature vector" : [784, pca95.n_components_, pca90.n_components_, pca50.n_components_],
                      "performance" : [lr100, lr95, lr90, lr50]})

df_lr
   % of variance  dimension of feature vector  performance
0            100                          784       0.9157
1             95                          330       0.9201
2             90                          236       0.9168
3             50                           39       0.8861

The table above summarizes our findings. All models reach roughly 90% accuracy, a very good result given the little effort we put in. Keeping 95% of the variance gives the best performance, with 92% accuracy. This illustrates the usefulness of Principal Component Analysis: it can help train models that generalise better.
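
Accuracy is a single number; for a per-digit breakdown we could use sklearn's classification_report, which gives precision and recall for each class. A quick sketch, applied to lr2, the 95%-variance model:

from sklearn.metrics import classification_report

# Precision, recall and F1-score for each digit, for the 95%-variance model
print(classification_report(test_lbl, lr2.predict(pca95_test_img)))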

K-nearest neighbours

K-nearest neighbours (knn) is another simple algorithm that deserves a try early in the model building process. It works by assigning a new data point to the class that is most common amongst its k nearest neighbours. In its simplest form, the k neighbours all have the same weight and the distance is computed with the Euclidean metric in feature space. But all of these are hyperparameters that one could tune to improve the model's performance. Here, we choose to consider 5 neighbours with equal weights:

from sklearn.neighbors import KNeighborsClassifier

knn1 = KNeighborsClassifier(n_neighbors=5)
knn1.fit(train_img, train_lbl)
knn100 = knn1.score(test_img, test_lbl)

knn2 = KNeighborsClassifier(n_neighbors=5)
knn2.fit(pca95_train_img, train_lbl)
knn95 = knn2.score(pca95_test_img, test_lbl)

knn3 = KNeighborsClassifier(n_neighbors=5)
knn3.fit(pca90_train_img, train_lbl)
knn90 = knn3.score(pca90_test_img, test_lbl)

knn4 = KNeighborsClassifier(n_neighbors=5)
knn4.fit(pca50_train_img, train_lbl)
knn50 = knn4.score(pca50_test_img, test_lbl)
df_knn = pd.DataFrame({"% of variance" : [100, 95, 90, 50],
                       "dimension of feature vector" : [784, pca95.n_components_, pca90.n_components_, pca50.n_components_],
                       "performance" : [knn100, knn95, knn90, knn50]})

df_knn
   % of variance  dimension of feature vector  performance
0            100                          784       0.9481
1             95                          330       0.9497
2             90                          236       0.9535
3             50                           39       0.9592

As can be seen in the table above, knn also benefits from feature preparation with PCA. In fact, the best performing model is the one trained with only 50% of the information, and performs quite well, at nearly 96% accuracy!
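
As mentioned above, the number of neighbours and the weighting scheme are hyperparameters that could be tuned further. A minimal sketch of such a tuning with GridSearchCV, run on the 39-dimensional feature vectors since that is where knn did best (the grid values are arbitrary):

from sklearn.model_selection import GridSearchCV

# Small grid over the number of neighbours and the weighting scheme,
# evaluated with 3-fold cross-validation on the PCA-reduced training set
param_grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3)
grid.fit(pca50_train_img, train_lbl)
print(grid.best_params_, grid.best_score_)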

Random forest

Random forest is a very popular algorithm. Its name derives from the fact that it trains a collection of decision trees, each on a different random subset of the training samples and of the features. It is a very robust model that does the feature selection for you. For this reason, PCA is less useful in this case, but let's see whether it nevertheless improves the model's performance:

from sklearn.ensemble import RandomForestClassifier

rf1 = RandomForestClassifier(max_features=35, max_depth=1000, random_state=0)
rf1.fit(train_img, train_lbl)
rf100 = rf1.score(test_img, test_lbl)

rf2 = RandomForestClassifier(max_features=35, max_depth=1000, random_state=0)
rf2.fit(pca95_train_img, train_lbl)
rf95 = rf2.score(pca95_test_img, test_lbl)

rf3 = RandomForestClassifier(max_features=35, max_depth=1000, random_state=0)
rf3.fit(pca90_train_img, train_lbl)
rf90 = rf3.score(pca90_test_img, test_lbl)

rf4 = RandomForestClassifier(max_features=35, max_depth=1000, random_state=0)
rf4.fit(pca50_train_img, train_lbl)
rf50 = rf4.score(pca50_test_img, test_lbl)
df_rf = pd.DataFrame({"% of variance" : [100, 95, 90, 50],
                          "dimension of feature vector" : [784, pca95.n_components_, pca90.n_components_, pca50.n_components_],
                          "performance" : [rf100, rf95, rf90, rf50]})

df_rf
   % of variance  dimension of feature vector  performance
0            100                          784       0.9514
1             95                          330       0.9048
2             90                          236       0.9129
3             50                           39       0.9174

As anticipated, PCA does not help with improving the model here. The good news is that we reach over 95% accuracy without even doing a PCA! That is one of the biggest strengths of random forests: they generalize very well, with little effort.
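
A nice by-product is that a trained random forest tells us how much each feature contributed to its decisions, through its feature_importances_ attribute. A quick sketch that visualises these importances as a 28x28 image, for the forest trained on the full feature vectors (rf1):

# Pixel importances of rf1, reshaped back onto the 28x28 image grid
plt.figure(figsize=(4, 4));
plt.imshow(rf1.feature_importances_.reshape(28, 28), cmap=plt.cm.hot);
plt.title('pixel importances', fontsize=14);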

Support vector machine

Support vector machine (SVM) is based on geometric considerations: it separates the classes in feature space by looking for the boundary with the maximal margin between them. Using suitably chosen kernel functions, it can also separate classes that are not linearly separable, effectively working in a much higher (even infinite) dimensional space. Let's see how PCA affects the SVM classifier:

from sklearn import svm

clf1 = svm.SVC()
clf1.fit(train_img, train_lbl)
svm100 = clf1.score(test_img, test_lbl)

clf2 = svm.SVC()
clf2.fit(pca95_train_img, train_lbl)
svm95 = clf2.score(pca95_test_img, test_lbl)

clf3 = svm.SVC()
clf3.fit(pca90_train_img, train_lbl)
svm90 = clf3.score(pca90_test_img, test_lbl)

clf4 = svm.SVC()
clf4.fit(pca50_train_img, train_lbl)
svm50 = clf4.score(pca50_test_img, test_lbl)
df_svm = pd.DataFrame({"% of variance" : [100, 95, 90, 50],
                          "dimension of feature vector" : [784, pca95.n_components_, pca90.n_components_, pca50.n_components_],
                          "performance" : [svm100, svm95, svm90, svm50]})

df_svm
   % of variance  dimension of feature vector  performance
0            100                          784       0.9642
1             95                          330       0.9619
2             90                          236       0.9597
3             50                           39       0.8945

The best performing model is the one trained with all the information, with over 96% accuracy. This is a great result, considering that we used the default hyperparameters of SVM. Tuning them would presumably increase the performance considerably. Furthermore, PCA does not seem to help: the less information we keep, the worse the performance, although it decreases only slightly between 100%, 95% and 90%. Such small differences are not conclusive: perhaps with new data, things would change.
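
The kernel itself is another choice worth exploring. Here is a sketch comparing a linear and an RBF kernel on a 10000-image subsample of the training set (the subsample size is an arbitrary choice to keep the training time manageable):

# Compare a linear and an RBF kernel, each trained on a subsample of the training data
lin = svm.SVC(kernel='linear').fit(train_img[:10000], train_lbl[:10000])
rbf = svm.SVC(kernel='rbf').fit(train_img[:10000], train_lbl[:10000])
print(lin.score(test_img, test_lbl), rbf.score(test_img, test_lbl))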

Gradient boosting machine

Gradient boosting machine (gbm) is a relatively new and really cool learning algorithm. Its strategy relies on minimizing the prediction errors (residuals): a simple model, such as a logistic regression, is first fit to the data, and then another model is fit to the residuals of the first, that is, the difference between its predictions and the ground truth. The sum of these two models is then taken and the previous step is repeated, until the residuals stop improving: no meaningful pattern can be fit to them any more. Given the complexity of this model, it usually takes a while to train and does not benefit much from PCA. For these reasons I decided to train the model only once, with the full feature vectors. I will use the XGBClassifier from xgboost, a version of gbm optimized for performance. Let's see:

from xgboost import XGBClassifier

# fit model on the training data
xgb = XGBClassifier()
xgb.fit(train_img, train_lbl)
xgb100 = xgb.score(test_img, test_lbl)
xgb100
0.9339

We get about 93.4% accuracy, not the best score so far but still pretty good. The XGBClassifier default settings use a decision tree model to fit the data and the residuals, with specific choices for many parameters. So presumably there is still a large margin for improvement!
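
For instance, one could increase the number of trees, adjust their depth and lower the learning rate. A sketch of such an experiment (the values below are arbitrary guesses and would themselves need tuning):

# A larger ensemble of deeper trees with a smaller learning rate;
# these values are guesses, not the result of a proper search
xgb_tuned = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
xgb_tuned.fit(train_img, train_lbl)
print(xgb_tuned.score(test_img, test_lbl))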

Putting it all together

Let’s summarize our findings in a single table:

df_sum = pd.DataFrame({"% of variance" : [100, 95, 90, 50],
                       "dimension of feature vector" : [784, pca95.n_components_, pca90.n_components_, pca50.n_components_],
                       "logistic regression" : [lr100, lr95, lr90, lr50],
                       "k-nearest neighbours" : [knn100, knn95, knn90, knn50],
                       "random forest" : [rf100, rf95, rf90, rf50],
                       "support vector machine" : [svm100, svm95, svm90, svm50],
                       "gradient boosting machine" : [xgb100, "--", "--", "--"]})

df_sum
   % of variance  dimension of feature vector  gradient boosting machine  k-nearest neighbours  logistic regression  random forest  support vector machine
0            100                          784                     0.9339                0.9481               0.9157         0.9514                  0.9642
1             95                          330                         --                0.9497               0.9201         0.9048                  0.9619
2             90                          236                         --                0.9535               0.9168         0.9129                  0.9597
3             50                           39                         --                0.9592               0.8861         0.9174                  0.8945

The best performing model is SVM with the full feature vectors. But this does not tell us much, since we did not really try to optimize any of these models. The take-home message is that we can go a long way with little effort.

Let them vote

Perhaps we can get a further increase in performance by putting a few models together and making them vote. Let's select the five best models, plus the best random forest and the gradient boosting machine. The reason for adding the last two is to include conceptually different models, which are likely to make different types of errors.

from statistics import mode

pred_svm1 = clf1.predict(test_img)
pred_svm2 = clf2.predict(pca95_test_img)
pred_svm3 = clf3.predict(pca90_test_img)
pred_knn4 = knn4.predict(pca50_test_img)
pred_knn3 = knn3.predict(pca90_test_img)
pred_rf = rf1.predict(test_img)
pred_gbm = xgb.predict(test_img)

df_predict = pd.DataFrame({"pred.svm1" : pred_svm1,
                           "pred.svm2" : pred_svm2,
                           "pred.svm3" : pred_svm3,
                           "pred.knn4" : pred_knn4,
                           "pred.knn3" : pred_knn3,
                           "pred.rf" : pred_rf,
                           "pred.gbm" : pred_gbm})

df_predict["predicted"] = df_predict.mode(axis=1)[0]
df_predict["actual"] = test_lbl
           

incorrect = df_predict[df_predict["actual"] != df_predict["predicted"]]

1 - incorrect.shape[0]/df_predict.shape[0]
0.9723

We get to over 97.2%! Not bad, democracy helps.
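
For the record, sklearn provides the same idea out of the box with its VotingClassifier. Here is a sketch of how a hard (majority) vote could be set up with it; unlike our manual ensemble, it takes a single input, so all models would be retrained on the full feature vectors:

from sklearn.ensemble import VotingClassifier

# Hard voting: each model gets one vote and the majority wins
vote = VotingClassifier(estimators=[('svm', svm.SVC()),
                                    ('rf', RandomForestClassifier(max_features=35, max_depth=1000, random_state=0)),
                                    ('knn', KNeighborsClassifier(n_neighbors=5))],
                        voting='hard')
vote.fit(train_img, train_lbl)
print(vote.score(test_img, test_lbl))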

A good benchmark to assess this model is to compare it with human performance. So let's have a look at some of the errors that our best model made:

import warnings; warnings.simplefilter('ignore')

test_img = scaler.inverse_transform(test_img)
plt.figure(figsize=(16,4));

# image 0
plt.subplot(1, 4, 1);
plt.imshow(test_img[incorrect.index.values[0]].reshape(28,28),
              cmap = plt.cm.gray, interpolation='nearest',
              clim=(0, 255));
plt.title('classified as ' + str(int(incorrect["predicted"][[incorrect.index.values[0]]])), fontsize = 14);

# image 1
plt.subplot(1, 4, 2);
plt.imshow(test_img[incorrect.index.values[1]].reshape(28, 28),
              cmap = plt.cm.gray, interpolation='nearest',
              clim=(0, 255));
plt.title('classified as ' + str(int(incorrect["predicted"][[incorrect.index.values[1]]])), fontsize = 14);

# image 2
plt.subplot(1, 4, 3);
plt.imshow(test_img[incorrect.index.values[2]].reshape(28, 28),
              cmap = plt.cm.gray, interpolation='nearest',
              clim=(0, 255));
plt.title('classified as ' + str(int(incorrect["predicted"][[incorrect.index.values[2]]])), fontsize = 14);

# image 3
plt.subplot(1, 4, 4);
plt.imshow(test_img[incorrect.index.values[3]].reshape(28, 28),
              cmap = plt.cm.gray, interpolation='nearest',
              clim=(0, 255));
plt.title('classified as ' + str(int(incorrect["predicted"][[incorrect.index.values[3]]])), fontsize = 14);

[Figure: four misclassified test digits, each shown with its predicted label]

Arguably a human would still do a better job than our best model. But keep in mind that we did not try very hard to optimize the models, so perhaps we could eventually do better than the average human…

Let’s have a look at the most common predicted classes amongst errors:

df_error_shares = pd.DataFrame({"as 0" : round(incorrect[incorrect["predicted"] == 0].shape[0]/incorrect.shape[0],3),
                                "as 1" : round(incorrect[incorrect["predicted"] == 1].shape[0]/incorrect.shape[0],3),
                                "as 2" : round(incorrect[incorrect["predicted"] == 2].shape[0]/incorrect.shape[0],3),
                                "as 3" : round(incorrect[incorrect["predicted"] == 3].shape[0]/incorrect.shape[0],3),
                                "as 4" : round(incorrect[incorrect["predicted"] == 4].shape[0]/incorrect.shape[0],3),
                                "as 5" : round(incorrect[incorrect["predicted"] == 5].shape[0]/incorrect.shape[0],3),
                                "as 6" : round(incorrect[incorrect["predicted"] == 6].shape[0]/incorrect.shape[0],3),
                                "as 7" : round(incorrect[incorrect["predicted"] == 7].shape[0]/incorrect.shape[0],3),
                                "as 8" : round(incorrect[incorrect["predicted"] == 8].shape[0]/incorrect.shape[0],3),
                                "as 9" : round(incorrect[incorrect["predicted"] == 9].shape[0]/incorrect.shape[0],3)},
                              index=[0])

df_error_shares                 
    as 0   as 1   as 2   as 3   as 4   as 5   as 6   as 7   as 8   as 9
0  0.051  0.054  0.188  0.126  0.116  0.097  0.083  0.097  0.072  0.116

So 2 is the most common wrong prediction, accounting for nearly 19% of the errors, while 0, 1 and 8 are the least common.

Finally, which digits does the model most often get wrong?

df_error_shares = pd.DataFrame({"digit 0" : round(incorrect[incorrect["actual"] == 0].shape[0]/incorrect.shape[0],3),
                                "digit 1" : round(incorrect[incorrect["actual"] == 1].shape[0]/incorrect.shape[0],3),
                                "digit 2" : round(incorrect[incorrect["actual"] == 2].shape[0]/incorrect.shape[0],3),
                                "digit 3" : round(incorrect[incorrect["actual"] == 3].shape[0]/incorrect.shape[0],3),
                                "digit 4" : round(incorrect[incorrect["actual"] == 4].shape[0]/incorrect.shape[0],3),
                                "digit 5" : round(incorrect[incorrect["actual"] == 5].shape[0]/incorrect.shape[0],3),
                                "digit 6" : round(incorrect[incorrect["actual"] == 6].shape[0]/incorrect.shape[0],3),
                                "digit 7" : round(incorrect[incorrect["actual"] == 7].shape[0]/incorrect.shape[0],3),
                                "digit 8" : round(incorrect[incorrect["actual"] == 8].shape[0]/incorrect.shape[0],3),
                                "digit 9" : round(incorrect[incorrect["actual"] == 9].shape[0]/incorrect.shape[0],3)},
                              index=[0])

df_error_shares   
   digit 0  digit 1  digit 2  digit 3  digit 4  digit 5  digit 6  digit 7  digit 8  digit 9
0    0.029    0.061    0.054    0.126    0.101    0.134     0.04    0.108    0.141    0.206

The most difficult digits to classify are 9, 8 and 5. We could continue investigating, but I think we have already covered enough for today!
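
Should we want to dig further, both tables above could be condensed into a single confusion matrix, for instance with pandas' crosstab. A quick sketch:

# Full confusion matrix of the voting ensemble: actual digits in rows, predicted digits in columns
pd.crosstab(df_predict["actual"], df_predict["predicted"],
            rownames=['actual'], colnames=['predicted'])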

Conclusions

We have seen how five different learning algorithms, none of them a neural network, can do a good job at identifying handwritten digits. Two of them, logistic regression and knn, benefit from a preliminary PCA applied to the feature vectors, whereas the others, random forest, SVM and gbm, give their best results on the full feature vectors. Our best model is a majority vote of seven models, at 97.23% accuracy. SVM can go up to 99.44%, as documented here. With neural networks, the accuracy can be as high as 99.65%, presumably with considerable effort.

Acknowledgements

The inspiration to write this post came to me while reading Michael Nielsen's online book on neural networks. He uses the MNIST database to train a neural network, mentioning that good results can also be achieved with other learning algorithms. This triggered my curiosity, so I dug deeper and ended up writing this post. I also borrowed some ideas from here, particularly concerning the visualisation of PCA's effect on the digits.