Almost a year ago, I did a project to classify handwritten digits using machine learning in Python. I used Python's scikit-learn library, along with Pandas and NumPy, to read and classify handwritten digits from an open data set.
I used K-Nearest Neighbors and Decision Trees, with GridSearchCV, to achieve an accuracy of 92.5%. The process I followed is described below, along with the .ipynb file containing the code.
The logic behind creating a machine learning classifier for any image is very simple.
Scikit-learn works only with numeric features, and every row in the training and testing data sets must have the same number of columns. Therefore, to use different images as training and testing data, the best approach is to standardize them and then convert them to numbers. Thankfully, there is an open data set that has already done this for us.
The Semeion Handwritten Digit Data Set, published at the Machine Learning Repository of the University of California, Irvine, is a ready-to-use data set. In case you wish to use your own data set, you can process yours using the following approach before running the code on it:
- All the images of handwritten digits are first converted to monochrome, so that we are left with only two colors: black and white. This simplifies the task of assigning numbers to the colors, e.g. now that we have only black and white, we can assign 1 to black and 0 to white.
- The images are then resized to 16 x 16 pixels. This gives us the same number of columns for all the rows, which can then be split into training and testing data sets. A sketch of this preprocessing is shown below.
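For illustration, here is a minimal sketch of that preprocessing in Python, assuming the Pillow library is available; the file name digit.png is a hypothetical placeholder:

```python
import numpy as np
from PIL import Image

def image_to_row(path):
    """Convert one handwritten-digit image to a flat row of 256 binary pixels."""
    img = Image.open(path).convert("L")   # grayscale, as a step toward monochrome
    img = img.resize((16, 16))            # standardize to 16 x 16 pixels
    pixels = np.asarray(img)
    # Threshold to two colors: 1 for dark (ink), 0 for light (paper).
    return (pixels < 128).astype(float).ravel()   # shape: (256,)

row = image_to_row("digit.png")  # hypothetical input file
```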
This data set provides exactly that. It has 256 (16 x 16) columns of floats with values 0 and 1, which serve as the input variables (X), and 10 columns, for the digits 0 to 9, representing the output variable (y). E.g. in the series 0 0 0 0 0 0 0 1 0 0, the 1 appears at index 7 (counting from 0), so the correct output is the digit 7.
Importing Libraries
We will start by importing the libraries. I primarily used scikit-learn, Pandas, and NumPy, along with Matplotlib to render charts. From scikit-learn, I used train_test_split to split the data sets; metrics, confusion_matrix, and precision_recall_fscore_support to check the accuracy and other metrics; KNeighborsClassifier and tree for the KNN and Decision Tree classifiers respectively; and GridSearchCV to find the best parameters.
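For reference, the imports could look like the sketch below; the module paths assume a recent scikit-learn release (older versions exposed train_test_split and GridSearchCV under different modules):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics, tree
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
from sklearn.neighbors import KNeighborsClassifier
```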
Importing and Splitting the Data Set
I imported the data set as a Pandas dataframe. The data set has 1,593 rows, one per handwritten digit image, and 266 columns. As discussed earlier, the first 256 columns are the 256 pixels of the 16 x 16 image, and the last 10 columns encode the digits 0 to 9, marking the correct output variable.
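Loading the data could look like this minimal sketch; the file is whitespace-separated with no header row, and the UCI download path shown is an assumption:

```python
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/semeion/semeion.data"
df = pd.read_csv(url, sep=r"\s+", header=None)  # 256 pixel columns + 10 label columns
print(df.shape)  # (1593, 266)
```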
```
In [8]: df.head()
Out[8]:
     0    1    2    3    4    5    6    7    8    9  ...  256  257  258  259  260  261  262  263  264  265
0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0  ...    1    0    0    0    0    0    0    0    0    0
1  0.0  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0  1.0  ...    1    0    0    0    0    0    0    0    0    0
2  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  ...    1    0    0    0    0    0    0    0    0    0
3  0.0  0.0  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0  ...    1    0    0    0    0    0    0    0    0    0
4  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  ...    1    0    0    0    0    0    0    0    0    0

5 rows × 266 columns
```
I split the first 256 columns into the input variables (X) and the remaining 10 columns into the output variable (y).
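A sketch of that split; collapsing the ten one-hot label columns into a single digit label with argmax is an assumption, but it matches the 1-D shape of y in the outputs below:

```python
X = df.iloc[:, :256]                        # the 256 pixel columns
y = df.iloc[:, 256:].values.argmax(axis=1)  # one-hot label columns -> digits 0-9
```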
Further, both X and y were split into training and testing data sets in a 95:5 ratio:
```
In [24]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=5)

In [25]: X_train.shape
Out[25]: (1513, 256)

In [26]: y_train.shape
Out[26]: (1513,)

In [27]: X_test.shape
Out[27]: (80, 256)

In [28]: y_test.shape
Out[28]: (80,)
```
Using Classifiers
I first used K-Nearest Neighbors (KNN) with GridSearchCV, and got an accuracy of 92.27%, with the best estimator shown below.
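The exact parameter grid isn't reproduced here, but the search setup could look like this sketch (the grid below is an assumption, not the original code):

```python
knn_grid = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": range(1, 11),
     "weights": ["uniform", "distance"],
     "algorithm": ["auto", "brute"]},
    cv=5,  # 5-fold cross-validation over the training set
)
knn_grid.fit(X_train, y_train)
```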
```
In [39]: knn_grid.best_estimator_
Out[39]: KNeighborsClassifier(algorithm='brute', leaf_size=30, metric='minkowski',
                              metric_params=None, n_jobs=1, n_neighbors=4, p=2,
                              weights='distance')
```
Then, I used Decision Trees with GridSearchCV, and got an accuracy of 75.89%, with the best estimator shown below.
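Analogously, a sketch of the decision tree search (again, the grid shown is an assumption):

```python
dt_grid = GridSearchCV(
    tree.DecisionTreeClassifier(random_state=10),
    {"criterion": ["gini", "entropy"],
     "splitter": ["best", "random"]},
    cv=5,
)
dt_grid.fit(X_train, y_train)
```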
```
In [49]: dt_grid.best_estimator_
Out[49]: DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
                                max_features=None, max_leaf_nodes=None,
                                min_impurity_decrease=0.0, min_impurity_split=None,
                                min_samples_leaf=1, min_samples_split=2,
                                min_weight_fraction_leaf=0.0, presort=False,
                                random_state=10, splitter='random')
```
I finally used the KNN classifier with the following parameters, and got an accuracy of 92.5%:
```
In [50]: clf_final = KNeighborsClassifier(algorithm='brute', leaf_size=30,
                                          metric='minkowski', metric_params=None,
                                          n_jobs=1, n_neighbors=4, p=2,
                                          weights='distance')
In [51]: clf_final.fit(X_train, y_train)
In [52]: y_pred = clf_final.predict(X_test)
In [53]: print(metrics.accuracy_score(y_test, y_pred))
>>> 0.925
```
Analyzing the Errors
I got a precision of 93.4%, a recall of 92.5%, and an F-beta score of 92.1%.
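Judging by the three values printed below, scr_clf_knn could have been computed along these lines; averaging with "weighted" across the ten digit classes is an assumption:

```python
# Returns (precision, recall, fbeta_score, support); support is None when averaged.
scr_clf_knn = precision_recall_fscore_support(y_test, y_pred, average="weighted")
```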
```
In [58]: print("classifier's precision: " + str(scr_clf_knn[0]))
         print("classifier's recall: " + str(scr_clf_knn[1]))
         print("classifier's fbeta_score: " + str(scr_clf_knn[2]))
>>> classifier's precision: 0.934371843434
>>> classifier's recall: 0.925
>>> classifier's fbeta_score: 0.921461011039
```
I plotted the confusion matrix using Matplotlib, and got the following chart:
On the Y-axis are the actual digits, and on the X-axis are their predicted results, for all 80 rows in the testing data set. The main diagonal shows the counts where the actual digits were predicted correctly, while the off-diagonal cells show the errors.
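The plotting code isn't reproduced above, but a minimal Matplotlib version could look like this sketch (the styling choices are assumptions, not the original chart's code):

```python
cm = confusion_matrix(y_test, y_pred)  # 10 x 10 matrix of actual vs. predicted counts
plt.imshow(cm, interpolation="nearest", cmap=plt.cm.Blues)
plt.colorbar()
plt.xticks(range(10))
plt.yticks(range(10))
plt.xlabel("Predicted digit")
plt.ylabel("Actual digit")
plt.show()
```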
We had 6 errors in total, where:
- A 9 was wrongly predicted as a 1 once and as a 5 once
- An 8 was predicted as a 3 twice and as a 9 once
- A 7 was predicted as a 1 once
To check further why the classifier predicted these digits wrongly, I printed the digits as images, converting the 256 floats back into 16 x 16 pixel images.
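A sketch of rendering one such digit, assuming X_test is still a dataframe of 256 pixel columns:

```python
errors = np.where(y_pred != y_test)[0]  # indices of the misclassified test rows
i = errors[0]
plt.imshow(X_test.iloc[i].values.reshape(16, 16), cmap="gray_r")
plt.title("actual: %d, predicted: %d" % (y_test[i], y_pred[i]))
plt.show()
```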
If we look at the image below, we notice that the 6 wrongly predicted digits are hard to read even by eye:
On the other hand, if we look at the correctly predicted digits, we notice that they are clearly written.
The Jupyter Notebook (.ipynb) with all of the code can be accessed here.