Handwritten Digit Recognition through Machine Learning

About a year ago, I did a project to classify handwritten digits using machine learning in Python. I used Python's scikit-learn library, along with Pandas and NumPy, on an open data set to read and classify the handwritten digits.

I used K-Nearest Neighbors and Decision Trees, tuned with GridSearchCV, to achieve an accuracy of 92.5%. The process I followed is described below, along with the .ipynb file containing the code.

The logic behind creating a machine learning classifier for any image is very simple.

Scikit-learn only understands numeric data, and the number of columns must be the same for every row in the training and testing data sets. Therefore, in order to use different images for training and testing, the best way is to standardize them, and then convert them to numbers. Thankfully, there is an open data set that has already done this for us.

The Semeion Handwritten Digit Data Set, published at the Machine Learning Repository of the University of California, Irvine, is a ready-to-use data set. In case you wish to use your own data set, you can process it with the following approach before running the code on it:

  • All the images of handwritten digits are first converted to monochrome, so that we are left with only two colors – black and white. This simplifies assigning numbers to the colors: now that we have only black and white, we can assign 1 to black and 0 to white.
  • The images are then resized to 16 X 16 pixels. This gives us the same number of columns for all the rows, which can then be split into training and testing data sets.

The data that this data set provides is exactly that. It has 256 (16 X 16) columns of floats containing 0s and 1s, which are used as the input variables (X), and 10 columns, for digits 0 to 9, representing the output variable (y). E.g. in the series 0 0 0 0 0 0 0 1 0 0, the 1 sits at index 7 (indexing starts at 0), so the correct output is 7.


Importing Libraries

We will start by importing the libraries. I have primarily used scikit-learn, Pandas, and NumPy, along with Matplotlib to render charts. From scikit-learn, I have used train_test_split to split the data sets; metrics, confusion_matrix, and precision_recall_fscore_support to check the accuracy and other metrics; KNeighborsClassifier and tree for the KNN and Decision Tree classifiers respectively; and GridSearchCV to find the best parameters.
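Put together, the imports would look roughly like this (a sketch; the module paths match recent scikit-learn releases, and older versions kept train_test_split in sklearn.cross_validation):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import metrics, tree
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
from sklearn.neighbors import KNeighborsClassifier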

Importing and Splitting the Data Set

I imported the data set as a Pandas DataFrame. The data set had 1,593 rows for images of different digits, and 266 columns. As discussed earlier, the first 256 columns are the 256 pixels of the 16 X 16 image, and the last 10 columns encode the digits 0 – 9, marking the correct output variable.
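Loading the file is a one-liner. A sketch, assuming the space-separated semeion.data file from the UCI repository:

# semeion.data is space-separated with no header row; the 266 columns
# come out labeled 0..265, as in the df.head() output below
df = pd.read_csv("semeion.data", sep=r"\s+", header=None)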

In[8]: df.head()
Out[8]:
         0	1	2	3	4	5	6	7	8	9	...	256	257	258	259	260	261	262	263	264	265
0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	1.0	1.0	1.0	...	1	0	0	0	0	0	0	0	0	0
1	0.0	0.0	0.0	0.0	0.0	1.0	1.0	1.0	1.0	1.0	...	1	0	0	0	0	0	0	0	0	0
2	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	...	1	0	0	0	0	0	0	0	0	0
3	0.0	0.0	0.0	0.0	0.0	0.0	1.0	1.0	1.0	1.0	...	1	0	0	0	0	0	0	0	0	0
4	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	...	1	0	0	0	0	0	0	0	0	0
5 rows × 266 columns

I split the first 256 columns into the input variables (X). The remaining 10 one-hot columns were collapsed into a single output variable (y) holding the digit label, which is why y is one-dimensional in the shapes below.
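A sketch of that split (collapsing the one-hot block with argmax is my reading, but it matches the shapes printed below):

X = df.iloc[:, :256].values                  # the 256 pixel columns
y = df.iloc[:, 256:].values.argmax(axis=1)   # position of the 1 in the one-hot block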

Further, both X and y were split into training and testing data sets in a 95:5 ratio:

In [24]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=5)

In [25]: X_train.shape
Out[25]: (1513, 256)

In [26]: y_train.shape
Out[26]: (1513,)

In [27]: X_test.shape
Out[27]: (80, 256)

In [28]: y_test.shape
Out[28]: (80,)

Using Classifiers

I first used K-Nearest Neighbors (KNN) with GridSearchCV, and got an accuracy of 92.27%, along with the following results:

In [39]: knn_grid.best_estimator_

Out[39]: KNeighborsClassifier(algorithm='brute', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=4, p=2,
           weights='distance')
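The post's notebook defines the grid elsewhere; here is a minimal sketch whose ranges cover the best estimator above (the exact parameter grid is my assumption):

knn_params = {
    "n_neighbors": range(1, 11),
    "weights": ["uniform", "distance"],
    "algorithm": ["auto", "brute"],
}
knn_grid = GridSearchCV(KNeighborsClassifier(), knn_params, cv=5)
knn_grid.fit(X_train, y_train)

The Decision Tree search below follows the same pattern, with tree.DecisionTreeClassifier() and its own parameter grid.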

Then, I used Decision Trees with GridSearchCV, and got an accuracy of 75.89%, along with the following results:

In [49]: dt_grid.best_estimator_
Out[49]: DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=10,
            splitter='random')

I finally used the KNN classifier with the following parameters, and got an accuracy of 92.5%:

In [50]: clf_final = KNeighborsClassifier(algorithm='brute', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=4, p=2,
           weights='distance')
In [51]: clf_final.fit(X_train,y_train)
In [52]: y_pred = clf_final.predict(X_test)
In [53]: print (metrics.accuracy_score(y_test,y_pred))

>>> 0.925

Analyzing the Errors

I got a precision of 93.4%, a recall of 92.5%, and an F-beta score of 92.1%.

In [58]: print ("classifier's precision: "+str(scr_clf_knn[0]) )
         print ("classifier's recall: "+str(scr_clf_knn[1]) )
         print ("classifier's fbeta_score: "+str(scr_clf_knn[2]) )

>>> classifier's precision: 0.934371843434
>>> classifier's recall: 0.925
>>> classifier's fbeta_score: 0.921461011039
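For reference, scr_clf_knn was most likely computed as follows; average='weighted' is my assumption, matching the single aggregate numbers above:

scr_clf_knn = precision_recall_fscore_support(y_test, y_pred, average="weighted")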

I plotted the confusion matrix using Matplotlib, and got the following chart:

[Image: confusion matrix of actual vs. predicted digits]

On the Y-axis are the actual digits, and on the X-axis are their predicted results, for all the 80 rows in the testing data set. The main diagonal shows the counts where the actual digits were predicted correctly, while the off-diagonal cells show the errors.
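A minimal sketch of producing such a chart (the original's exact styling isn't shown):

cm = confusion_matrix(y_test, y_pred)  # 10 x 10 matrix of actual vs. predicted
plt.imshow(cm, interpolation="nearest", cmap=plt.cm.Blues)
plt.xlabel("Predicted digit")
plt.ylabel("Actual digit")
plt.colorbar()
plt.show()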

We had 6 errors in total, where:

  • A 9 was wrongly predicted as a 1 once, and as a 5 once
  • An 8 was predicted as a 3 twice, and as a 9 once
  • A 7 was predicted as a 1 once

To check further why the classifier predicted these digits wrongly, I printed the digits as images, by converting the floats back into 16 X 16 pixel images.
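A sketch of rendering one row back into an image (assuming X_test is the NumPy array from the split above):

digit = X_test[0].reshape(16, 16)   # 256 floats back to a 16 x 16 grid
plt.imshow(digit, cmap="gray_r")    # black strokes on a white background
plt.show()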

If we look at the image below, we notice that the 6 wrongly predicted digits are hard to read even with the naked eye:

[Image: the six wrongly predicted digits]

On the other hand, if we look at the correctly predicted digits, we notice that they are clearly written:

[Image: a sample of the correctly predicted digits]

The Jupyter Notebook (.ipynb) with the whole of the code can be accessed here.

Parsing Job Profiles through NLP

I am working on a Python project which does the following:

  • Reads all the job profiles across job boards for a given keyword (urllib)
  • Scrapes the job descriptions, and stores them in a text file (Beautiful Soup)
  • Parses the file to find the words and phrases with the highest frequencies (NLTK + Pandas)
  • Uses the words to create a machine learning model that matches a given resume with the suitable keywords in it (scikit-learn + Pandas)

This post covers the third part of the process, where the job descriptions are already stored in a .txt file and I parse it to find the most frequent keywords and phrases.

 

For this post, I have worked on the job profile of ‘Data Analyst’.

 

I have used NLTK for Natural Language Processing (NLP), Pandas for analysis, WordCloud and Matplotlib for visualization, the Anaconda Jupyter Notebook as an IDE, and other Python libraries such as os and re.

Reading the Text File

Reading the .txt file is pretty simple. I used the os library to get to the file:

import os

path =  "path_of_txt_file"
os.chdir(path)

with open('data_analyst.txt') as f:
    data = f.read()

 

Analyzing the Text

In order to remove redundancies, the first thing I did was convert the whole text to lowercase.

data = data.lower()

 

I wanted to find the following from the text:

  • Most frequent words
  • Most frequent phrases
    • Phrases with 2 words (Bigrams)
    • Phrases with 3 words (Trigrams)

 

Data Cleaning

First, I checked the total number of words, numbers, and punctuation marks in my text file. I imported NLTK, and used the word_tokenize function.

 

import nltk
from nltk import word_tokenize
tokens = nltk.word_tokenize(data)
print (len(tokens))

= 9848

I got a result of 9,848.

 

Next, I removed all the stop-words from the text. For this, I used the stopwords corpus from nltk.corpus.

from nltk.corpus import stopwords

stop = set(stopwords.words('english'))

token_list1 = [ ]
for token in tokens:
    if token not in stop:
        token_list1.append(token)

print(len(token_list1))

= 7149

After removing stop-words, I was left with 7,149 words, numbers, and punctuation marks.

 

Next, I removed all the numbers and punctuation. For this, I used a regular expression.

import re

punctuation = re.compile(r'[-.?!,":;()|0-9]')

token_list2 = [ ]

for token in token_list1:
    word = punctuation.sub("", token)
    if len(word)>0:
        token_list2.append(word)

print(len(token_list2))

= 6359

 

I was now left with 6,359 words.

 

Finding most frequent words

Before finding the most frequent words, I checked the part-of-speech (POS) tags of the words, to select the types of words I wanted in my final analysis.

 

I used NLTK's pos_tag to tag all the words, and converted the result to a Pandas DataFrame to count the words under each POS tag.

import pandas as pd
import numpy as np

tokens_pos_tag = nltk.pos_tag(token_list2)
pos_df = pd.DataFrame(tokens_pos_tag, columns = ('word','POS'))

pos_sum = pos_df.groupby('POS', as_index=False).count() # group by POS tags
pos_sum.sort_values(['word'], ascending=[False]) # in descending order of number of words per tag

 

I got the following DataFrame:

Index	POS	word
10	NN	2309
12	NNS	1304
6	JJ	1147
21	VBG	437
23	VBP	402
15	RB	184
22	VBN	131
19	VB	107
24	VBZ	94
20	VBD	93
5	IN	37
2	CC	20
9	MD	19
11	NNP	18
4	FW	13
3	CD	13
8	JJS	7
16	RBR	7
14	PRP	6
7	JJR	4
13	POS	2
0	#	2
1	''	1
17	RBS	1
18	RP	1

 

Looking at the POS tags of individual words, I noticed that I needed only nouns for my analysis.

 

[('maintain', 'NN'), ('validate', 'NN'), ('equity', 'NN'), ('fixedincome', 'VBP'), ('security', 'NN'), ('data', 'NNS'), ('including', 'VBG'), ('pricing', 'NN'), ('ensure', 'VB'), ('consistency', 'NN'), ('investment', 'NN'), ('data', 'NNS'), ('across', 'IN'), ('multiple', 'JJ'), ('business', 'NN'), ('applications', 'NNS'), ('databases', 'VBZ'), ('work', 'NN'), ('closely', 'RB'), ('business', 'NN'), ('teams', 'NNS'), ('ensure', 'VB'), ('data', 'NNS'), ('integrity', 'NN'), ('define', 'NN'), ('process', 'NN'), ('improvements', 'NNS'), ('respond', 'VB'), ('data', 'NNS'), ('requests', 'NNS'), ('support', 'VBP'), ('analytic', 'JJ'), ('investing', 'VBG'), ('portfolio', 'NN'), ('management', 'NN'), ('functions', 'NNS'), ('analyze', 'VBP'), ('exception', 'NN'), ('reports', 'NNS'), ('followup', 'JJ'), ('ensure', 'VB'), ('timely', 'JJ'), ('resolution', 'NN'), ('analyze', 'IN'), ('user', 'JJ'), ('requests', 'NNS'), ('respond', 'VBP'), ('data', 'NNS'), ('issues', 'NNS'), ('assessment', 'JJ'), ('resolution', 'NN'), ('root', 'NN'), ('cause', 'NN'), ('work', 'NN'), ('business', 'NN'), ('development', 'NN'), ('extensive', 'JJ'), ('list', 'NN'), ('data', 'NNS'), ('requirements', 'NNS'), ('needed', 'VBD'), ('ongoing', 'JJ'), ('basis', 'NN'), ('ensuring', 'VBG'), ('deadlines', 'NNS'), ('met', 'VBD'), ('high', 'JJ'), ('level', 'NN'), ('accuracy', 'NN'), ('assist', 'JJ'), ('projects', 'NNS'), ('specific', 'JJ'), ('data', 'NNS'), ('team', 'NN'), ('well', 'RB'), ('initiatives', 'VBZ'), ('review', 'NN'), ('existing', 'VBG'), ('business', 'NN'), ('processes', 'NNS'), ('identify', 'VBP'), ('improvements', 'NNS'), ('and/or', 'JJ'), ('opportunities', 'NNS'), ('leverage', 'VBP'), ('technology', 'NN'), ('achieve', 'NN'), ('business', 'NN'), ('objectives', 'NNS'), ('document', 'NN'), ('data', 'NNS'), ('integrity', 'NN'), ('processes', 'VBZ'), ('procedures', 'NNS'), ('controls', 'NNS'), ('maintain', 'VBP'), ('data', 'NNS'), ('flow', 'JJ'), .........]

 

Hence, I kept only the nouns, and got rid of all other POS tags, such as adjectives, verbs, and adverbs.

filtered_pos = [ ]

for one in tokens_pos_tag:
    if one[1] == 'NN' or one[1] == 'NNS' or one[1] == 'NNP' or one[1] == 'NNPS':
        filtered_pos.append(one)

print (len(filtered_pos))

= 3631

Finally, I was left with 3,631 words.

 

Once I had all the nouns, finding the most frequent words was easy. I used NLTK's FreqDist() to get the frequency distribution of the words, and then selected the top 100.

fdist_pos = nltk.FreqDist(filtered_pos)
top_100_words = fdist_pos.most_common(100)
print(top_100_words)

[(('data', 'NNS'), 236), (('experience', 'NN'), 117), (('skills', 'NNS'), 83), (('ability', 'NN'), 83), (('business', 'NN'), 73), (('analysis', 'NN'), 51), (('work', 'NN'), 48), (('management', 'NN'), 39), (('years', 'NNS'), 35), (('knowledge', 'NN'), 31), (('reports', 'NNS'), 31), (('quality', 'NN'), 29), (('support', 'NN'), 25), (('communication', 'NN'), 25), (('requirements', 'NNS'), 24), (('team', 'NN'), 23), (('environment', 'NN'), 23), (('design', 'NN'), 22), (('product', 'NN'), 22), (('tools', 'NNS'), 22), (('projects', 'NNS'), 21), (('systems', 'NNS'), 20), (('analyst', 'NN'), 20), (('project', 'NN'), 18), (('sql', 'NN'), 18), (('process', 'NN'), 17), (('health', 'NN'), 17), (('statistics', 'NNS'), 17), (('dashboards', 'NNS'), 17), (('sources', 'NNS'), 17), (('office', 'NN'), 16), (('asset', 'NN'), 16), (('information', 'NN'), 16), (('software', 'NN'), 15), (('opportunities', 'NNS'), 15), (('computer', 'NN'), 15), (('time', 'NN'), 15), (('analytics', 'NNS'), 15), (('processes', 'NNS'), 14), (('development', 'NN'), 14), (('field', 'NN'), 14), (('issues', 'NNS'), 14), (('detail', 'NN'), 14), (('science', 'NN'), 14), (('results', 'NNS'), 13), (('problems', 'NNS'), 13), (('attention', 'NN'), 13), (('customer', 'NN'), 13), (('performance', 'NN'), 13), (('integrity', 'NN'), 13), (('qualifications', 'NNS'), 13), (('solutions', 'NNS'), 13), (('insights', 'NNS'), 13), (('teams', 'NNS'), 12), (('marketing', 'NN'), 12), (('degree', 'NN'), 12), (('market', 'NN'), 12), (('report', 'NN'), 12), (('problem', 'NN'), 11), (('users', 'NNS'), 11), (('position', 'NN'), 11), (('findings', 'NNS'), 11), (('mathematics', 'NNS'), 11), (('company', 'NN'), 11), (('techniques', 'NNS'), 11), (('client', 'NN'), 11), (('databases', 'NNS'), 10), (('reporting', 'NN'), 10), (('collection', 'NN'), 10), (('database', 'NN'), 10), (('metrics', 'NNS'), 10), (('presentation', 'NN'), 10), (('improvement', 'NN'), 10), (('functions', 'NNS'), 10), (('engineering', 'NN'), 10), (('system', 'NN'), 10), (('excel', 'NN'), 9), (('recommendations', 'NNS'), 9), (('research', 'NN'), 9), (('analyses', 'NNS'), 9), (('job', 'NN'), 9), (('ms', 'NN'), 9), (('trends', 'NNS'), 9), (('sets', 'NNS'), 9), (('plan', 'NN'), 8), (('clients', 'NNS'), 8), (('execution', 'NN'), 8), (('manage', 'NN'), 8), (('stakeholders', 'NNS'), 8), (('needs', 'NNS'), 8), (('level', 'NN'), 8), (('operations', 'NNS'), 8), (('queries', 'NNS'), 8), (('r', 'NN'), 8), (('and/or', 'NN'), 7), (('initiatives', 'NNS'), 7), (('models', 'NNS'), 7), (('technologies', 'NNS'), 7), (('education', 'NN'), 7), (('technology', 'NN'), 7)]

 

I converted the list into a DataFrame to clean it up a bit.

top_words_df = pd.DataFrame(top_100_words, columns = ('pos','count'))
top_words_df['Word'] = top_words_df['pos'].apply(lambda x: x[0]) # split the tuple of POS
top_words_df = top_words_df.drop('pos', 1) # drop the previous column

top_words_df.head()

  count	Word
0	236	data
1	117	experience
2	83	skills
3	83	ability
4	73	business

 

And finally, I used the wordcloud and matplotlib libraries to present it in the form of a word cloud.

 

from wordcloud import WordCloud
import matplotlib.pyplot as plt

subset_pos = top_words_df[['Word', 'count']]
tuples_pos = [tuple(x) for x in subset_pos.values]
wordcloud = WordCloud()
wordcloud.generate_from_frequencies(dict(tuples_pos))  # newer wordcloud versions expect a dict of word -> frequency
plt.figure(figsize=(20,15))
plt.imshow(wordcloud, interpolation="bilinear")

plt.show()

I got the following image:

[Image: word cloud of the most frequent words]

Finding the most frequent phrases

The next part of this project was finding the most frequent phrases, which can be used to customize resumes. For this, I filtered out the most frequent bigrams (2 words) and trigrams (3 words).

For filtering the bigrams and trigrams, I had to use the original text (before cleaning it), as removing stop-words, numbers, or punctuation breaks the original sentences.

 

I started by creating bigrams from the tokens that the word tokenizer had produced on the original text. Then, I selected the top 100 bigrams through a frequency distribution, and converted them into a DataFrame.

 

bgs = nltk.bigrams(tokens)

fdist2 = nltk.FreqDist(bgs) # selecting bigrams from tokens
bgs_100 = fdist2.most_common(100) # top-100 bigrams
bgs_df = pd.DataFrame(bgs_100, columns = ('bigram','count'))

bgs_df.head()

         bigram	                count
0	(ability, to)	        79
1	(,, and)	        74
2	(experience, with)	41
3	(in, a)	                35
4	(to, work)	        28

 

Then, I converted the tuples into strings, and removed the ones that had numbers or punctuation in them.

Finally, after removing the excess columns, I was left with the most frequent bigrams.

bgs_df['phrase'] = bgs_df['bigram'].apply(lambda x: x[0]+" "+x[1]) # merging the tuple into a string
bgs_df['filter_bgs'] = bgs_df['phrase'].str.contains(punctuation) # finding strings with numbers and punctuation

bgs_df = bgs_df[bgs_df.filter_bgs == False] # removing strings with numbers and punctuation
bgs_df = bgs_df.drop('bigram', 1)
bgs_df = bgs_df.drop('filter_bgs', 1) # removing the excess columns

bgs_df.reset_index()
bgs_df.head(10) #Final bigrams

  count	phrase
0	79	ability to
2	41	experience with
3	35	in a
4	28	to work
5	26	years of
6	26	knowledge of
7	22	of data
8	22	such as
9	21	and data
10	20	understanding of

 

However, the bigrams were not of much use, so I repeated the exercise with trigrams. The trigram pass mirrors the bigram code above; a minimal sketch (I haven't reproduced the exact notebook code, so the variable names are my assumption):
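tgs = nltk.trigrams(tokens)

fdist3 = nltk.FreqDist(tgs)          # frequency distribution over the trigrams
tgs_100 = fdist3.most_common(100)    # top-100 trigrams
tgs_df = pd.DataFrame(tgs_100, columns = ('trigram','count'))

tgs_df['phrase'] = tgs_df['trigram'].apply(lambda x: " ".join(x))  # merging the tuple into a string
tgs_df['filter_tgs'] = tgs_df['phrase'].str.contains(punctuation)
tgs_df = tgs_df[tgs_df.filter_tgs == False]  # removing strings with numbers and punctuation
tgs_df = tgs_df.drop(['trigram', 'filter_tgs'], axis=1)

This time, I got some good phrases and keywords: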

  count	phrase
0	14	ability to work
1	13	attention to detail
2	13	the ability to
3	10	years of experience
6	8	to work in
8	7	as well as
10	6	to work independently
11	6	with the ability
12	6	experience as a
13	6	a data analyst
14	6	as a data
15	6	working knowledge of
16	6	demonstrated ability to
17	6	experience with sql
18	6	written communication skills
19	6	able to work
20	5	strong analytical skills
22	5	verbal communication skills
23	5	in computer science
24	5	considered an asset

The trigrams that I got can be used to improve resumes/CVs when applying for Data Analyst roles.

 

Click here to access the .ipynb file.

Django Reverse Query through ManyToManyField

This post explains querying Django models with ManyToManyField relationships.

Let's say we have the following two models, Country and News:

models.py

class Country(models.Model):
    country = models.CharField(max_length=100)
    continent = models.CharField(max_length=100)

    class Meta:
        ordering = ('country',)  # helps in alphabetical listing. Should be a tuple

    def __str__(self):
        return self.country + " : " + self.continent


class News(models.Model):
    news_title = models.CharField(max_length=200)
    news_link = models.CharField(max_length=100)
    news_date = models.DateField()
    news_country = models.ManyToManyField(Country)  
    def __str__(self):
        return self.news_title

 

A country can appear in more than one news item, and a news item can mention more than one country. Therefore, in the News class, we have given news_country a ManyToMany relationship with the Country class.

 

Now, once some data is populated, we can query the models through the following commands.

 

First, we will activate the shell

(venv) C:\django_venv\djangoproject>python manage.py shell
Python 3.5.1 (v3.5.1:37a07cee5969, Dec  6 2015, 01:38:48) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>>

Next, we will import the classes to be queried from the models

>>> from app.models import News, Country

Running News.objects.all() will output all the rows of the News model.

 

We will take the first news item, and assign it to a variable

>>> news1 = News.objects.all()[0]

# Alternatively, we can also get the first news by:

>>> news1 = News.objects.get(id=1)


>>> news1
<News: London fintech startup iwoca just got $20 million from German ...>
>>>

 

We can access the different fields of the object by calling them

>>> news1.news_title
'London fintech startup iwoca just got $20 million from German ...'

>>> news1.news_date
datetime.date(2015, 7, 30)

>>>

 

However, when we call a field with a ManyToMany relationship, we get the following response:

>>> news1.news_country
<django.db.models.fields.related_descriptors.create_forward_many_to_many_manager.<locals>.ManyRelatedManager object at 0x03E17CD0>
>>>

 

To query a field with a ManyToMany relationship, we call its .all() method, and assign the result to a variable:

 

>>> country1 = news1.news_country.all()
>>> country1
<QuerySet [<Country: UK : Europe>]>
>>>

 

Since the QuerySet has only one country, we can retrieve it by index (starting from zero). Otherwise, we would use a for loop to get all the results.

>>> country1[0]
<Country: UK : Europe>
>>>
>>> type(country1[0])
<class 'dashboard.models.Country'>
>>>
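A quick sketch of that loop, for when a news item is linked to several countries:

>>> for c in country1:
...     print(c.country, ":", c.continent)
...
UK : Europe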

 

The QuerySet is actually pointing us to the Country class in our models. By calling the country or continent field of the object, we can retrieve the country and continent name, respectively.

>>> country1[0].country
'UK'
>>>
>>> country1[0].continent
'Europe'
>>>
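Finally, the reverse query that gives this post its title: from a Country object, Django auto-generates a news_set manager (the lowercased model name plus _set), which returns all News rows linked to that country. A sketch, assuming the UK news item above is the only one linked:

>>> uk = Country.objects.get(country='UK')
>>> uk.news_set.all()
<QuerySet [<News: London fintech startup iwoca just got $20 million from German ...>]>
>>>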

 

 

Django: Importing a CSV File to Database Models

This post covers the simplest way to import a .CSV file into a Django model.

I have created an app in Django – with news items, and the countries mentioned in those news items. The idea is that when a user adds a news item as a new entry, s/he can also mention the countries present in it. E.g. if there is a news item saying 'US and UK sign a bilateral deal,' then the user can add the countries — US and UK — while adding that item to the model.

The models.py file is straightforward:

class Country(models.Model):
    country = models.CharField(max_length=100)
    continent = models.CharField(max_length=100)

    class Meta:
        ordering = ('country',)  # helps in alphabetical listing. Should be a tuple

    def __str__(self):
        return self.country + " : " + self.continent



class News(models.Model):
    news_title = models.CharField(max_length=200)
    news_link = models.CharField(max_length=100)
    news_date = models.DateField()
    news_country = models.ManyToManyField(Country)  

    def __str__(self):
        return self.news_title

Now, in the Country model, I had to enter all the countries and their continents. One way is to go to the admin panel, and start adding each country — one at a time.

A much easier way is to get all the countries into a CSV file, and then import the whole CSV at once. I googled and found a site with countries, their capitals, and their continents. I copied the countries and continents into a CSV, adding the column names as the top row.

Country,Continent
Afghanistan,Asia
Albania,Europe
Algeria,Africa
Andorra,Europe
Angola,Africa
Argentina,South America
Armenia,Asia

Once I had the CSV ready, I went to the Django shell using python manage.py shell, and imported the CSV using the following code:

 

Python 3.5.1 (v3.5.1:37a07cee5969, Dec  6 2015, 01:38:48) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> import csv
>>> import os
>>> path =  "C:\\...." # Set path of new directory here
>>> os.chdir(path) # changes the directory
>>> from dashboard.models import Country # imports the model
>>> with open('countries_continents.csv') as csvfile:
...     reader = csv.DictReader(csvfile)
...     for row in reader:
...             p = Country(country=row['Country'], continent=row['Continent'])
...             p.save()
...
>>>
>>> exit()

That's it. The CSV has been imported.
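As a side note, saving one row at a time issues one query per country; for larger files, Django's bulk_create does it in a single query. A sketch, using the same model and file as above:

import csv
from dashboard.models import Country

with open('countries_continents.csv') as csvfile:
    rows = list(csv.DictReader(csvfile))

# one INSERT for all rows, instead of one per row
Country.objects.bulk_create([
    Country(country=row['Country'], continent=row['Continent'])
    for row in rows
])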

Running python manage.py runserver and checking the Country model in the admin module gave the list of all the countries and their continents.

[Screenshot: the imported countries in the Django admin]

 

Installing Django in Windows through a virtual environment

In this post, I have outlined the important steps of installing the Django framework on Windows 10, using Python 3.5. I have used the command prompt to create a virtual environment, and installed Django in it.

 

1) Go to the directory via command prompt

Microsoft Windows [Version 10.0.15063]
(c) 2017 Microsoft Corporation. All rights reserved.

C:\WINDOWS\system32>cd..

C:\Windows>cd..

C:\>E:

 

2) Install virtual environment in the directory

E:\>python -m pip install virtualenv
Collecting virtualenv
 Using cached virtualenv-15.1.0-py2.py3-none-any.whl
Installing collected packages: virtualenv
Successfully installed virtualenv-15.1.0

 

3) Create the virtual environment folder (e.g. venv17) in the directory

E:\>python -m virtualenv venv17
Using base prefix 'C:\\Users\\Abhishek\\AppData\\Local\\Programs\\Python\\Python36-32'
New python executable in E:\venv17\Scripts\python.exe
Installing setuptools, pip, wheel...done.

E:\>

 

4) Go to the Scripts folder inside the virtual environment directory (venv17), and activate it

E:\>cd venv17

E:\venv17>cd Scripts

E:\venv17\Scripts>activate

(venv17) E:\venv17\Scripts>

 

(venv17) shows that the virtual environment is now activated

 

5) Come back out to the venv17 folder, and install Django

(venv17) E:\venv17\Scripts>cd..



(venv17) E:\venv17>pip install django
Collecting django
  Downloading Django-1.11.7-py2.py3-none-any.whl (6.9MB)
    100% |████████████████████████████████| 7.0MB 126kB/s
Collecting pytz (from django)
  Downloading pytz-2017.3-py2.py3-none-any.whl (511kB)
    100% |████████████████████████████████| 512kB 843kB/s
Installing collected packages: pytz, django
Successfully installed django-1.11.7 pytz-2017.3

 

6) Create a new Django project. This will create a new directory under venv17, named 'mydjango'

(venv17) E:\venv17>django-admin startproject mydjango

(venv17) E:\venv17>

 

The new directory ‘mydjango’ has the following structure

 

mydjango/
      manage.py
      mydjango/
            __init__.py
            settings.py
            urls.py
            wsgi.py

 

  • manage.py: A thin wrapper around the django-admin.py tool; it should not be edited. It is used as a command-line utility to interact with the project.
  • mydjango/: This is the main project directory, and consists of the following files:
  • __init__.py: An empty file that tells Python to treat the mydjango directory as a Python package. It should not be edited.
  • settings.py: Contains the settings and configurations for the project, including the default settings.
  • urls.py: Contains the URL patterns.
  • wsgi.py: Contains the configuration to run the project as a WSGI application.

 

[Screenshot: the mydjango project directory]

 

 

Installing Database

7) Create the initial tables in the database

# go to the directory that contains manage.py

 
(venv17) E:\venv17\mydjango>python manage.py migrate
Operations to perform:
  Apply all migrations: admin, auth, contenttypes, sessions
Running migrations:
  Applying contenttypes.0001_initial... OK
  Applying auth.0001_initial... OK
  Applying admin.0001_initial... OK
  Applying admin.0002_logentry_remove_auto_add... OK
  Applying contenttypes.0002_remove_content_type_name... OK
  Applying auth.0002_alter_permission_name_max_length... OK
  Applying auth.0003_alter_user_email_max_length... OK
  Applying auth.0004_alter_user_username_opts... OK
  Applying auth.0005_alter_user_last_login_null... OK
  Applying auth.0006_require_contenttypes_0002... OK
  Applying auth.0007_alter_validators_add_error_messages... OK
  Applying auth.0008_alter_user_username_max_length... OK
  Applying sessions.0001_initial... OK

(venv17) E:\venv17\mydjango>

 

The default database is SQLite. In case you want to use MySQL, go to the settings.py file and replace the database settings with the following:

# Database
# https://docs.djangoproject.com/en/1.10/ref/settings/#databases


DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'your_DB_name',
        'USER': 'your_user_name',
        'PASSWORD': 'your_password',
        'HOST': 'your_host',
        # a dict can hold only one 'OPTIONS' key, so charset and
        # init_command go in the same dictionary
        'OPTIONS': {
            'charset': 'utf8mb4',
            'init_command': "SET sql_mode='STRICT_TRANS_TABLES'",
        },
    }
}

 

And then make the migrations

 

(venv17) E:\venv17\mydjango>python manage.py makemigrations

 

8) Run the Development Server

(venv17) E:\venv17\mydjango>python manage.py runserver
Performing system checks...


System check identified no issues (0 silenced).
January 06, 2017 - 11:14:39
Django version 1.11.7, using settings 'mydjango.settings'
Starting development server at http://127.0.0.1:8000/
Quit the server with CTRL-BREAK.

 

 

The site can be accessed at http://127.0.0.1:8000/

[Screenshot: the Django welcome page at http://127.0.0.1:8000/]

Note that this is a development server, not a production server.

In order to deploy Django in a production environment, it should be run as a Web Server Gateway Interface (WSGI) application using a real web server such as Apache, Gunicorn, or uWSGI.

 

Creating Admin and Super User

9) Create a Super User

(venv17) E:\venv17\mydjango>python manage.py createsuperuser

Username (leave blank to use 'computer_name'): admin
Email address: ******@gmail.com
Password:
Password (again):
Superuser created successfully.


 

10) Run admin panel

First, run the server

(venv17) E:\venv17\mydjango>python manage.py runserver
Performing system checks...

System check identified no issues (0 silenced).
January 06, 2017 - 12:55:33
Django version 1.11.7, using settings 'mydjango.settings'
Starting development server at http://127.0.0.1:8000/
Quit the server with CTRL-BREAK.

 

Then go to http://127.0.0.1:8000/admin in the browser

[Screenshot: the Django admin login page]

[Screenshot: the Django admin dashboard]

Creating an Application

 

11) Create an application

 

First, stop the running server by pressing Ctrl+C / Ctrl+Break

^C
(venv17) E:\venv17\mydjango>

 

 

Then, create a new app, ‘firstapp’

(venv17) E:\venv17\mydjango>python manage.py startapp firstapp
(venv17) E:\venv17\mydjango>

 

The app's directory (firstapp) is created in the same project directory that contains the database and the manage.py file.

The directory contains the following:

firstapp/
   migrations/
      __init__.py
   __init__.py
   admin.py
   apps.py
   models.py
   tests.py
   views.py

 

 

The models.py file in the app contains the models — which are used to create the tables of the database:

 

from django.db import models

# Create your models here.

class fin_Trend(models.Model):
    trend = models.CharField(max_length=100)

    class Meta:
        ordering = ('trend',)  # lists the trends in order in the admin panel

    def __str__(self):
        return self.trend


class fin_Country(models.Model):
    country = models.CharField(max_length=100)  # a charfield of length 100
    continent = models.CharField(max_length=100)

    class Meta:
        ordering = ('country',)

    def __str__(self):
        return self.country + " : " + self.continent  # displays Country : Continent in the admin panel


class fin_News(models.Model):
    news_title = models.CharField(max_length=500)
    news_main_text = models.TextField(blank=True)  # a textfield
    news_snippet = models.TextField(blank=True)  # this field can be blank
    news_trend = models.ManyToManyField(fin_Trend)
    news_date = models.DateField()
    news_country = models.ManyToManyField(fin_Country)  # many-to-many link to the fin_Country class above

    class Meta:
        ordering = ('news_date',)

    def __str__(self):
        return self.news_title


 

Note the many-to-many relationships in the models above; a many-to-one relationship would use a ForeignKey instead.

 

12) Add the app to the main framework

 

Go to the main settings.py file in mydjango > mydjango > settings.py (where the database was configured), and add the app name to INSTALLED_APPS

 

# Application definition

INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'firstapp.apps.FirstappConfig', # don't forget to put the comma
]

 

 

In the apps.py file of the firstapp directory, we have the 'FirstappConfig' class; that is the name we add here.

 

This will make our app active, and connected to our framework.

 

13) Create and apply migrations

(venv17) E:\venv17\mydjango>python manage.py makemigrations

Migrations for 'firstapp':
  firstapp\migrations\0001_initial.py
    - Create model fin_Country
    - Create model fin_News
    - Create model fin_Trend
    - Add field news_trend to fin_news

(venv17) E:\venv17\mydjango>python manage.py migrate

Operations to perform:
  Apply all migrations: admin, auth, contenttypes, firstapp, sessions
Running migrations:
  Applying firstapp.0001_initial... OK

(venv17) E:\venv17\mydjango>

 

Adding models to the admin site

14) Update the admin.py file

Go to the application (firstapp) > admin.py file, and register the models from the models.py file

from django.contrib import admin

# Register your models here.

from .models import fin_Trend, fin_Country, fin_News # adding all the models

class Fin_News_admin(admin.ModelAdmin):
    list_display = ('id', 'news_title', 'news_date') # display the columns in the admin site
    list_filter = ('news_trend',) # for side filter
    search_fields = ('news_title',) # for search box. should be a list or tuple

class Fin_Country_admin(admin.ModelAdmin):
    list_display = ('id', 'country')
    list_filter = ('country',) # make sure the model names are correct
    search_fields = ('country',)

admin.site.register(fin_News, Fin_News_admin) # register the models
admin.site.register(fin_Trend)
admin.site.register(fin_Country,Fin_Country_admin)

 

Check for pending migrations, if any (registering models in the admin does not itself create migrations)

 

The admin panel now looks like this:

[Screenshot: the admin panel showing the registered models]

 

After adding some dummy data in Fin_countrys, the filter and search panels, as well as the sorting arrow, are also visible

[Screenshot: the list view with filter, search box, and sorting arrow]

 

 

Editing and Querying models

15) Go to Python shell

(venv17) E:\venv17\mydjango>python manage.py shell


Python 3.6.2 (v3.6.2:5fd33b5, Jul  8 2017, 04:14:34) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> 

 

We can exit the shell by:

>>> exit()

 

16) Import the models that have to be edited / queried

>>> from firstapp.models import fin_Country
>>> 

 

17) Get all the entries of that model

>>> fin_Country.objects.all()
<QuerySet [<fin_Country: China : Asia>, <fin_Country: France : Europe>, <fin_Country: UK : Europe>, <fin_Country: US : North America>]>
>>> 

 

18) Add a new entry

 

# Option 1

>>> 
>>> cont1 = fin_Country()
>>> cont1.country = "India"
>>> cont1.continent = "Asia"
>>> cont1.save()# save is equivalent to commit in SQL

# Option 2
>>> 
>>> cont2 = fin_Country(country = 'South Africa', continent = 'Africa')
>>> cont2.save()
>>> 
>>> 

>>> fin_Country.objects.all()
<QuerySet [<fin_Country: China : Asia>, <fin_Country: France : Europe>, <fin_Country: India : Asia>, <fin_Country: South Africa : Africa>, <fin_Country: UK : Europe>, <fin_Country: US : North America>]>
>>> 
>>> 
>>> len(fin_Country.objects.all())
6
>>> 

 

19) Query Models

 

# Querying all objects

>>> fin_Country.objects.all()
<QuerySet [<fin_Country: China : Asia>, <fin_Country: France : Europe>, <fin_Country: India : Asia>, <fin_Country: South Africa : Africa>, <fin_Country: UK : Europe>, <fin_Country: US : North America>]>
>>> 


# Slicing the objects

>>> fin_Country.objects.all()[:2]
<QuerySet [<fin_Country: China : Asia>, <fin_Country: France : Europe>]>
>>> 


# Using the get method (will work when there is only one result)

>>> fin_Country.objects.get(continent = 'Africa')
<fin_Country: South Africa : Africa>
>>> 


# Using the filter method

>>> 
>>> fin_Country.objects.filter(continent = 'Asia')
<QuerySet [<fin_Country: China : Asia>, <fin_Country: India : Asia>]>
>>> 
>>> 
>>> fin_Country.objects.filter(continent__contains="sia")
<QuerySet [<fin_Country: China : Asia>, <fin_Country: India : Asia>]>
>>> 
>>> 

 

 

20) Connecting data from one model to another

E.g. We will create a News object, which will have trends from the fin_Trend model, and country from the fin_Country model

 

First, we will create an instance, and add the data that is not connected to other models (title, main text, snippet, date)

 

>>> from firstapp.models import fin_News, fin_Trend, fin_Country
>>> news1 = fin_News()
>>> news1.news_title = "Sample news title"
>>> news1.news_main_text = "This is some text that has been written in the news"
>>> news1.news_snippet = "The news also has snippet"
>>> news1.news_date = "2017-01-01"
>>> news1.save()
>>> 

 

Then, we will find the ID of the trend related to this news item (e.g. Collaboration), and add it to this instance

 

 

>>> trnd1 = fin_Trend.objects.get(trend = "Collaboration").id
>>> trnd1
3
>>> news1.news_trend.add(trnd1)
>>> news1.save()
>>> 

 

Similarly, we will find the ID of the country related to this news item (e.g. US), and add it to the instance

>>> cont1 = fin_Country.objects.get(country = "US").id
>>> cont1
1
>>> 
>>> news1.news_country.add(cont1)
>>> news1.save()
>>> 

 

 

If we run the server and check our admin site, we have 1 entry for news, with the filled-in fields

[Screenshots: the news entry in the Django admin, with its many-to-many fields filled in]

 

 

We will add two more dummy news items, so that we can render a list on our web page

 

Creating Views

 

21) Edit views.py

In the app > views.py file, import the model from models.py, and then create a view function

 

from django.shortcuts import render
from .models import fin_News  # importing the model that needs to be rendered

# Create your views here.

def news_list(request):
    all_news = fin_News.objects.all()
    context = {'all_news': all_news}
    return render(request, 'firstapp/index.html', context)

 

22) Create urls.py file

In the app folder, create a urls.py file, and add the following code:

from django.conf.urls import url
from . import views

urlpatterns = [
    # anything /app/
    url(r'^$', views.news_list, name='index'),  # news_list is the view function in views.py
]

 

 

23) Create templates directory

In the apps directory, create a directory with the name ‘templates’. Within the ‘templates’ directory, create another directory with the app’s name. Within that directory, create index.html

mydjango/
   firstapp/
      templates/
         firstapp/
            index.html

 

In index.html file, add the following:

<h1>Page Heading</h1>

<ul>
    {% for news in all_news %}
        <li>{{ news.news_title }}</li> <!-- {{ }} prints the value -->
    {% endfor %}
</ul>
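One wiring step the post doesn't show explicitly: the project-level mydjango/urls.py must include the app's urls.py for the route below to resolve. A sketch in the Django 1.11 style used here:

from django.conf.urls import url, include
from django.contrib import admin

urlpatterns = [
    url(r'^admin/', admin.site.urls),
    url(r'^firstapp/', include('firstapp.urls')),  # routes /firstapp/ to the app's urls.py
]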

 

If we go to http://127.0.0.1:8000/firstapp/, we get the web page index.html

 

[Screenshot: index.html rendered at http://127.0.0.1:8000/firstapp/]

Accessing Google Sheets through Python IDE

Google Sheets is a good alternative for storing data and accessing it online. In fact, if we have relatively small data of ~1,000 rows that we want to analyze, we can go for Google Sheets instead of creating a database.

 

For one of my projects, I wanted end-users to fill in surveys, which could be aggregated in one place and analyzed. Rather than creating a contact form in WordPress to populate a DB and then querying it, I experimented with Google Docs / Forms. The survey responses are stored in a Google Sheet, which I then access using a Python script.

 

I have used the gspread library, through the following steps:

1) Go to Google Developers’ console and create a new project

[Screenshot: creating a new project in the Google Developers Console]

 

2) Go to 'Credentials' > select 'Service Account Key'

[Screenshot: selecting 'Service Account Key' under Credentials]

3) Select ‘App Engine default service account’ under ‘service account’ and ‘JSON’ under ‘Key type’

[Screenshot: choosing the service account and the JSON key type]

4) Open the .json file using notepad, and copy the ‘client_email’ > Save the .json file in a directory

[Screenshot: the client_email field in the .json file]

5) Open the command prompt > install oauth2client and PyOpenSSL through pip
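The commands would look something like this (gspread itself also has to be installed, if it isn't already):

pip install gspread oauth2client pyopenssl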

 

6) Open the Google Sheet that has to be accessed > click on 'File' > click on 'Share' > enter the client_email from the .json file in step 4 > click Done

 

7) Create a new Python file (.py) in the same directory where the .json file was saved in step 4, and enter the following code:

import gspread
from oauth2client.service_account import ServiceAccountCredentials


scope = ['https://spreadsheets.google.com/feeds']
credentials = ServiceAccountCredentials.from_json_keyfile_name('#ENTER JSON FILE NAME HERE#', scope)

gc = gspread.authorize(credentials)

wks = gc.open_by_key('#ENTER KEY OF GOOGLE SHEET HERE#').sheet1
val = wks.acell('#ENTER A CELL NUMBER HERE E.G. B1#').value
print(val)

 

Running the script will print the value of the cell specified (e.g. B1).


	

Site appearance for a search query in Google

This post is only for educational and non-commercial purposes.

Many times, we have to check how our website ranks for a particular query on Google/Yahoo/Bing. Ideally, we should be on the first page. But if we are competing against established websites on a general and competitive query, then we might not appear on the first page as soon as we publish our content targeting that query.

In this project, my objective is to create a dashboard, which displays:

  1. How many times does my site appear for a particular query on Google/Yahoo/Bing?
  2. On which pages does my site appear?
  3. What keywords from my content is Google/Yahoo/Bing matching with the query?
  4. Which sites are coming up on the first page, and what keywords are they using?
  5. How are my site's search results improving over time?

How many times does my site appear for a particular query on Google/Yahoo/Bing?

I have taken care of this problem (as well as point 2) by writing a simple script that searches for a domain name in the results of a query on Google.

For Google, we could also use the Google Search API. However, I have tried to write a common script for all search engines, and to educate readers about simple Python modules.

E.g. I want to search for "Best hotel prices", and see how many times "tripadvisor.com" appears in the first 10 pages of Google search results.

The result looks like this:

[Screenshot: the script's output]

And following is the code outline:

1) Import modules: I have used Requests and Beautiful Soup. Though I was able to get the output from these two, I also tried re (regular expressions) and Selenium as a backup plan.

2) I used input() to enter the domain and query, and used Requests to fetch the URL of the Google, Yahoo, and Bing search results for the query.

3) I used Beautiful Soup to parse the HTML of the first page, followed by a for loop and a simple text search to look for my domain in the search results.

4) Then I selected the links of the next nine pages from the footer (class: f1), and converted them into URLs by adding the search engine's domain (https://www.google.com in the case of Google).

5) Finally, a for loop was used to parse each of the nine pages, with a nested loop searching for my domain within the search results. Alternatively, I could have created a function while searching the first page, and called it here as well.
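This is not the exact script, but a minimal sketch of the first-page check under the same approach (Google's markup changes often, so real use needs throttling and more robust parsing):

import requests
from bs4 import BeautifulSoup

domain = "tripadvisor.com"
query = "Best hotel prices"

# fetch the first page of results for the query
url = "https://www.google.com/search?q=" + query.replace(" ", "+")
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text

# count the result links that contain the domain
soup = BeautifulSoup(html, "html.parser")
hits = [a["href"] for a in soup.find_all("a", href=True) if domain in a["href"]]
print(domain, "appears", len(hits), "times on page 1")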


The next phase of this project is using NLTK to find the keywords that are helping the search results on the first page of the search engines.