I am working on a Python project which basically does the following:

  • Reads all the job profiles across job boards for a given keyword (Urllib)
  • Scrapes the job-description, and stores in a text file (Beautiful Soup)
  • Parses the file to find the words and phrases with the highest frequencies (NLTK + Pandas)
  • Uses those words to create a machine learning model that matches a given resume with the most suitable keywords (Scikit-Learn + Pandas)

This post covers the third part of the process: the job descriptions are already stored in a .txt file, and I'll be parsing it to find the most frequent keywords and phrases.

 

For this post, I have worked on the job profile of ‘Data Analyst’.

 

I have used NLTK for Natural Language Processing (NLP), Pandas for analysis, WordCloud and Matplotlib for visualization, Anaconda's Jupyter Notebook as an IDE, and other Python libraries such as os and re.

Reading the Text File

Reading the .txt file is pretty simple. I used the os library to switch to the file's directory and then read it:

import os

path =  "path_of_txt_file"
os.chdir(path)

with open('data_analyst.txt') as f:
    data = f.read()
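Depending on how the scraped text was saved, it may be safer to pass an explicit encoding when opening the file (assuming UTF-8 here):

with open('data_analyst.txt', encoding='utf-8') as f: # explicit encoding, in case the platform default differs
    data = f.read()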

 

Analyzing the Text

In order to remove redundancies, the first thing I did was to convert the whole text to lowercase.

data = data.lower()

 

I wanted to find the following from the text:

  • Most frequent words
  • Most frequent phrases (see the quick illustration after this list)
    • Phrases with 2 words (Bigrams)
    • Phrases with 3 words (Trigrams)
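As a quick illustration of what bigrams and trigrams look like, using NLTK's bigrams and trigrams helpers (the sample sentence here is just for demonstration):

import nltk

sample = "strong analytical and problem solving skills".split()
print(list(nltk.bigrams(sample)))   # [('strong', 'analytical'), ('analytical', 'and'), ...]
print(list(nltk.trigrams(sample)))  # [('strong', 'analytical', 'and'), ('analytical', 'and', 'problem'), ...]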

 

Data Cleaning

First, I checked the total number of tokens (words, numbers and punctuation) in my text file. I imported NLTK and used its word_tokenize function.

 

import nltk
from nltk import word_tokenize

tokens = word_tokenize(data)
print(len(tokens))

= 9848

I got a result of 9,848.
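If the NLTK data files aren't already available locally, word_tokenize, the stop-word list and the POS tagger used below may raise a LookupError; downloading the resources once takes care of that:

import nltk

# one-time downloads (safe to re-run); only needed if the resources are missing
nltk.download('punkt')                       # tokenizer models used by word_tokenize
nltk.download('stopwords')                   # stop-word lists
nltk.download('averaged_perceptron_tagger')  # tagger used by nltk.pos_tag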

 

Next, I removed all the stop-words from the text. For this, I used the stopwords corpus from nltk.corpus.

from nltk.corpus import stopwords

stop = set(stopwords.words('english'))

token_list1 = [ ]
for token in tokens:
    if token not in stop:
        token_list1.append(token)

print(len(token_list1))

= 7149

After removing the stop-words, I was left with 7,149 words, numbers and punctuation marks.

 

Next, I removed all the numbers and punctuation. For this, I used a regular expression.

import re

punctuation = re.compile(r'[-.?!,":;()|0-9]')

token_list2 = [ ]

for token in token_list1:
    word = punctuation.sub("", token)
    if len(word)>0:
        token_list2.append(word)

print(len(token_list2))

= 6359

 

I was now left with 6,359 words.

 

Finding the Most Frequent Words

Before finding the most frequent words, I checked the part-of-speech (POS) tags of the words to select the types of words I wanted in my final analysis.

 

I used NLTK’s pos_tag to add the POS tags to all the words, and converted them to a Pandas DataFrame to get the count of all POS tags.

import pandas as pd
import numpy as np

tokens_pos_tag = nltk.pos_tag(token_list2)
pos_df = pd.DataFrame(tokens_pos_tag, columns = ('word','POS'))

pos_sum = pos_df.groupby('POS', as_index=False).count() # group by POS tags
pos_sum.sort_values(['word'], ascending=[False]) # in descending order of number of words per tag

 

I got the following DataFrame (the word column holds the number of words per POS tag):

#Index	POS	word
10	NN	2309
12	NNS	1304
6	JJ	1147
21	VBG	437
23	VBP	402
15	RB	184
22	VBN	131
19	VB	107
24	VBZ	94
20	VBD	93
5	IN	37
2	CC	20
9	MD	19
11	NNP	18
4	FW	13
3	CD	13
8	JJS	7
16	RBR	7
14	PRP	6
7	JJR	4
13	POS	2
0	#	2
1	''	1
17	RBS	1
18	RP	1

 

Looking at the POS tags of individual words, I noticed that I needed only nouns for my analysis.
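For reference, a sample like the one below can be printed with a simple slice of the tagged list (the slice size is arbitrary, just to keep the output readable):

print(tokens_pos_tag[:100])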

 

[('maintain', 'NN'), ('validate', 'NN'), ('equity', 'NN'), ('fixedincome', 'VBP'), ('security', 'NN'), ('data', 'NNS'), ('including', 'VBG'), ('pricing', 'NN'), ('ensure', 'VB'), ('consistency', 'NN'), ('investment', 'NN'), ('data', 'NNS'), ('across', 'IN'), ('multiple', 'JJ'), ('business', 'NN'), ('applications', 'NNS'), ('databases', 'VBZ'), ('work', 'NN'), ('closely', 'RB'), ('business', 'NN'), ('teams', 'NNS'), ('ensure', 'VB'), ('data', 'NNS'), ('integrity', 'NN'), ('define', 'NN'), ('process', 'NN'), ('improvements', 'NNS'), ('respond', 'VB'), ('data', 'NNS'), ('requests', 'NNS'), ('support', 'VBP'), ('analytic', 'JJ'), ('investing', 'VBG'), ('portfolio', 'NN'), ('management', 'NN'), ('functions', 'NNS'), ('analyze', 'VBP'), ('exception', 'NN'), ('reports', 'NNS'), ('followup', 'JJ'), ('ensure', 'VB'), ('timely', 'JJ'), ('resolution', 'NN'), ('analyze', 'IN'), ('user', 'JJ'), ('requests', 'NNS'), ('respond', 'VBP'), ('data', 'NNS'), ('issues', 'NNS'), ('assessment', 'JJ'), ('resolution', 'NN'), ('root', 'NN'), ('cause', 'NN'), ('work', 'NN'), ('business', 'NN'), ('development', 'NN'), ('extensive', 'JJ'), ('list', 'NN'), ('data', 'NNS'), ('requirements', 'NNS'), ('needed', 'VBD'), ('ongoing', 'JJ'), ('basis', 'NN'), ('ensuring', 'VBG'), ('deadlines', 'NNS'), ('met', 'VBD'), ('high', 'JJ'), ('level', 'NN'), ('accuracy', 'NN'), ('assist', 'JJ'), ('projects', 'NNS'), ('specific', 'JJ'), ('data', 'NNS'), ('team', 'NN'), ('well', 'RB'), ('initiatives', 'VBZ'), ('review', 'NN'), ('existing', 'VBG'), ('business', 'NN'), ('processes', 'NNS'), ('identify', 'VBP'), ('improvements', 'NNS'), ('and/or', 'JJ'), ('opportunities', 'NNS'), ('leverage', 'VBP'), ('technology', 'NN'), ('achieve', 'NN'), ('business', 'NN'), ('objectives', 'NNS'), ('document', 'NN'), ('data', 'NNS'), ('integrity', 'NN'), ('processes', 'VBZ'), ('procedures', 'NNS'), ('controls', 'NNS'), ('maintain', 'VBP'), ('data', 'NNS'), ('flow', 'JJ'), .........]

 

Hence, I kept only the nouns and got rid of all other POS tags, such as adjectives, verbs and adverbs.

filtered_pos = [ ]

for one in tokens_pos_tag:
    if one[1] in ('NN', 'NNS', 'NNP', 'NNPS'): # keep only the noun tags
        filtered_pos.append(one)

print(len(filtered_pos))

= 3631

Finally, I was left with 3,631 words.

 

Once I had all the nouns, finding the most frequent words was an easy task. I used NLTK's FreqDist() to get the frequency distribution of the words and then selected the top 100.

fdist_pos = nltk.FreqDist(filtered_pos)
top_100_words = fdist_pos.most_common(100)
print(top_100_words)

[(('data', 'NNS'), 236), (('experience', 'NN'), 117), (('skills', 'NNS'), 83), (('ability', 'NN'), 83), (('business', 'NN'), 73), (('analysis', 'NN'), 51), (('work', 'NN'), 48), (('management', 'NN'), 39), (('years', 'NNS'), 35), (('knowledge', 'NN'), 31), (('reports', 'NNS'), 31), (('quality', 'NN'), 29), (('support', 'NN'), 25), (('communication', 'NN'), 25), (('requirements', 'NNS'), 24), (('team', 'NN'), 23), (('environment', 'NN'), 23), (('design', 'NN'), 22), (('product', 'NN'), 22), (('tools', 'NNS'), 22), (('projects', 'NNS'), 21), (('systems', 'NNS'), 20), (('analyst', 'NN'), 20), (('project', 'NN'), 18), (('sql', 'NN'), 18), (('process', 'NN'), 17), (('health', 'NN'), 17), (('statistics', 'NNS'), 17), (('dashboards', 'NNS'), 17), (('sources', 'NNS'), 17), (('office', 'NN'), 16), (('asset', 'NN'), 16), (('information', 'NN'), 16), (('software', 'NN'), 15), (('opportunities', 'NNS'), 15), (('computer', 'NN'), 15), (('time', 'NN'), 15), (('analytics', 'NNS'), 15), (('processes', 'NNS'), 14), (('development', 'NN'), 14), (('field', 'NN'), 14), (('issues', 'NNS'), 14), (('detail', 'NN'), 14), (('science', 'NN'), 14), (('results', 'NNS'), 13), (('problems', 'NNS'), 13), (('attention', 'NN'), 13), (('customer', 'NN'), 13), (('performance', 'NN'), 13), (('integrity', 'NN'), 13), (('qualifications', 'NNS'), 13), (('solutions', 'NNS'), 13), (('insights', 'NNS'), 13), (('teams', 'NNS'), 12), (('marketing', 'NN'), 12), (('degree', 'NN'), 12), (('market', 'NN'), 12), (('report', 'NN'), 12), (('problem', 'NN'), 11), (('users', 'NNS'), 11), (('position', 'NN'), 11), (('findings', 'NNS'), 11), (('mathematics', 'NNS'), 11), (('company', 'NN'), 11), (('techniques', 'NNS'), 11), (('client', 'NN'), 11), (('databases', 'NNS'), 10), (('reporting', 'NN'), 10), (('collection', 'NN'), 10), (('database', 'NN'), 10), (('metrics', 'NNS'), 10), (('presentation', 'NN'), 10), (('improvement', 'NN'), 10), (('functions', 'NNS'), 10), (('engineering', 'NN'), 10), (('system', 'NN'), 10), (('excel', 'NN'), 9), (('recommendations', 'NNS'), 9), (('research', 'NN'), 9), (('analyses', 'NNS'), 9), (('job', 'NN'), 9), (('ms', 'NN'), 9), (('trends', 'NNS'), 9), (('sets', 'NNS'), 9), (('plan', 'NN'), 8), (('clients', 'NNS'), 8), (('execution', 'NN'), 8), (('manage', 'NN'), 8), (('stakeholders', 'NNS'), 8), (('needs', 'NNS'), 8), (('level', 'NN'), 8), (('operations', 'NNS'), 8), (('queries', 'NNS'), 8), (('r', 'NN'), 8), (('and/or', 'NN'), 7), (('initiatives', 'NNS'), 7), (('models', 'NNS'), 7), (('technologies', 'NNS'), 7), (('education', 'NN'), 7), (('technology', 'NN'), 7)]

 

I converted the list into a DataFrame to clean it a bit.

top_words_df = pd.DataFrame(top_100_words, columns = ('pos','count'))
top_words_df['Word'] = top_words_df['pos'].apply(lambda x: x[0]) # extract the word from the (word, POS) tuple
top_words_df = top_words_df.drop('pos', axis=1) # drop the tuple column

top_words_df.head()

  count	Word
0	236	data
1	117	experience
2	83	skills
3	83	ability
4	73	business

 

And finally, I used the wordcloud and matplotlib libraries to present it in the form of a word cloud.

 

from wordcloud import WordCloud
import matplotlib.pyplot as plt

subset_pos = top_words_df[['Word', 'count']]
freqs_pos = dict(zip(subset_pos['Word'], subset_pos['count'])) # dict of word -> count (newer wordcloud versions expect a dict rather than a list of tuples)
wordcloud = WordCloud()
wordcloud.generate_from_frequencies(freqs_pos)
plt.figure(figsize=(20,15))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off") # hide the axes around the word cloud

plt.show()

I got the following image:

[Word cloud of the top keywords: parsing job profiles through NLP]

Finding the Most Frequent Phrases

The next part of this project was finding the most frequent phrases, which can be used to customize resumes. For this, I filtered out the most frequent bigrams (2-word phrases) and trigrams (3-word phrases).

For the bigrams and trigrams, I had to work with the original text (before cleaning it), as removing stop-words, numbers or punctuation breaks the original phrases; for example, 'ability to work' would collapse into 'ability work' once the stop-word 'to' is removed.

 

I started by creating bigrams from the word tokens that I had generated from the original text. Then I selected the top 100 bigrams through a frequency distribution and converted them into a DataFrame.

 

bgs = nltk.bigrams(tokens) # bigrams from the original tokens

fdist2 = nltk.FreqDist(bgs) # frequency distribution of the bigrams
bgs_100 = fdist2.most_common(100) # top-100 bigrams
bgs_df = pd.DataFrame(bgs_100, columns = ('bigram','count'))

bgs_df.head()

         bigram	                count
0	(ability, to)	        79
1	(,, and)	        74
2	(experience, with)	41
3	(in, a)	                35
4	(to, work)	        28

 

Then, I converted the tuples into strings and removed the ones that have numbers or punctuation in them.

Finally, after removing the excess columns, I was left with the most frequent bigrams.

bgs_df['phrase'] = bgs_df['bigram'].apply(lambda x: x[0]+" "+x[1]) # merge the tuple into a single string
bgs_df['filter_bgs'] = bgs_df['phrase'].str.contains(punctuation) # flag strings containing numbers or punctuation

bgs_df = bgs_df[bgs_df.filter_bgs == False] # remove strings with numbers and punctuation
bgs_df = bgs_df.drop('bigram', axis=1)
bgs_df = bgs_df.drop('filter_bgs', axis=1) # drop the helper columns

bgs_df.head(10) # final bigrams

  count	phrase
0	79	ability to
2	41	experience with
3	35	in a
4	28	to work
5	26	years of
6	26	knowledge of
7	22	of data
8	22	such as
9	21	and data
10	20	understanding of

 

However, the bigrams were not of much use, so I repeated the exercise with trigrams.
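A minimal sketch of the trigram step, mirroring the bigram code above (it reuses the same tokens list and punctuation regex; the variable names here are illustrative rather than taken from the original notebook):

tgs = nltk.trigrams(tokens) # trigrams from the original tokens

fdist3 = nltk.FreqDist(tgs) # frequency distribution of the trigrams
tgs_100 = fdist3.most_common(100) # top-100 trigrams
tgs_df = pd.DataFrame(tgs_100, columns = ('trigram','count'))

tgs_df['phrase'] = tgs_df['trigram'].apply(lambda x: " ".join(x)) # merge the tuple into a single string
tgs_df['filter_tgs'] = tgs_df['phrase'].str.contains(punctuation) # flag strings containing numbers or punctuation

tgs_df = tgs_df[tgs_df.filter_tgs == False] # remove strings with numbers and punctuation
tgs_df = tgs_df.drop('trigram', axis=1)
tgs_df = tgs_df.drop('filter_tgs', axis=1) # drop the helper columns

tgs_df.head(20) # final trigrams

This time, I got some good phrases and keywords: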

  count	phrase
0	14	ability to work
1	13	attention to detail
2	13	the ability to
3	10	years of experience
6	8	to work in
8	7	as well as
10	6	to work independently
11	6	with the ability
12	6	experience as a
13	6	a data analyst
14	6	as a data
15	6	working knowledge of
16	6	demonstrated ability to
17	6	experience with sql
18	6	written communication skills
19	6	able to work
20	5	strong analytical skills
22	5	verbal communication skills
23	5	in computer science
24	5	considered an asset

The trigrams that I got can be used to improve resumes/CVs when applying for Data Analyst roles.

 

Click here to access the .IPYNB file