I am working on a Python project which basically does the following:
- Reads all the job profiles across job boards for a given keyword (Urllib)
- Scrapes the job-description, and stores in a text file (Beautiful Soup)
- Parses the file to find the words and phrases with the highest frequencies (NLTK + Pandas)
- Uses the words to create a machine learning model that matches a given resume with the suitable keywords in it (Scikit-Learn + Pandas)
This post covers the third part of the process, where the job descriptions are already stored in a .txt file and I parse it to find the most frequent keywords and phrases.
For this post, I have worked on the job profile of ‘Data Analyst’.
I have used NLTK for Natural Language Processing (NLP) and Pandas for analysis, WordCloud and Matplotlib for visualization, the Anaconda Jupyter Notebook as an IDE, and other Python libraries such as os and re.
Reading the Text File
Reading the .txt file is pretty simple. I used the os library to change into the file's directory and then opened the file.
import os

path = "path_of_txt_file"
os.chdir(path)

with open('data_analyst.txt') as f:
    data = f.read()
Analyzing the Text
To remove redundancies, the first thing I did was convert the whole text to lower case.
data = data.lower()
I wanted to find the following from the text:
- Most frequent words
- Most frequent phrases
- Phrases with 2 words (Bigrams)
- Phrases with 3 words (Trigrams); a quick illustration of bigrams and trigrams follows this list
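As a quick, hypothetical illustration of what bigrams and trigrams look like (this snippet is only an example and is not part of the actual pipeline below):

import nltk

# illustrative phrase, not taken from the job-description data
sample = "strong attention to detail".split()
print(list(nltk.bigrams(sample)))
# [('strong', 'attention'), ('attention', 'to'), ('to', 'detail')]
print(list(nltk.trigrams(sample)))
# [('strong', 'attention', 'to'), ('attention', 'to', 'detail')]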
Data Cleaning
First, I checked the total number of words + numbers + punctuation in my text file. I imported NLTK, and used the word_tokenize function.
import nltk
from nltk import word_tokenize

tokens = nltk.word_tokenize(data)
print(len(tokens))
# 9848
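Note: if the NLTK data packages are not already installed, the tokenizer, stop-word list and POS tagger used in this post may need a one-time download first (a small setup step, assuming a standard NLTK installation):

import nltk

# one-time downloads of the NLTK data used in this post
nltk.download('punkt')                       # tokenizer models used by word_tokenize
nltk.download('stopwords')                   # English stop-word list
nltk.download('averaged_perceptron_tagger')  # tagger used by pos_tag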
I got a result of 9,848.
Next, I removed all the stop-words in the text. For this, I used the stopwords corpus from nltk.corpus.
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
token_list1 = []
for token in tokens:
    if token not in stop:
        token_list1.append(token)
print(len(token_list1))
# 7149
After removing stop-words, I was left with 7,149 words, numbers and punctuation marks.
Next, I removed all the numbers and punctuation. For this, I used regular expressions.
import re

punctuation = re.compile(r'[-.?!,":;()|0-9]')
token_list2 = []
for token in token_list1:
    word = punctuation.sub("", token)
    if len(word) > 0:
        token_list2.append(word)
print(len(token_list2))
# 6359
I was now left with 6,359 words.
Finding the Most Frequent Words
Before finding the most frequent words, I checked the part-of-speech (POS) tags of the words to select the types of words I wanted in my final analysis.
I used NLTK’s pos_tag to add the POS tags to all the words, and converted them to a Pandas DataFrame to get the count of all POS tags.
import pandas as pd
import numpy as np

tokens_pos_tag = nltk.pos_tag(token_list2)
pos_df = pd.DataFrame(tokens_pos_tag, columns=('word', 'POS'))

pos_sum = pos_df.groupby('POS', as_index=False).count()   # group by POS tags
pos_sum.sort_values(['word'], ascending=[False])           # in descending order of number of words per tag
I got the following DataFrame:
#Index  POS   word
10      NN    2309
12      NNS   1304
6       JJ    1147
21      VBG    437
23      VBP    402
15      RB     184
22      VBN    131
19      VB     107
24      VBZ     94
20      VBD     93
5       IN      37
2       CC      20
9       MD      19
11      NNP     18
4       FW      13
3       CD      13
8       JJS      7
16      RBR      7
14      PRP      6
7       JJR      4
13      POS      2
0       #        2
1       ''       1
17      RBS      1
18      RP       1
Looking at the POS tags of individual words, I noticed that I needed only nouns for my analysis.
[('maintain', 'NN'), ('validate', 'NN'), ('equity', 'NN'), ('fixedincome', 'VBP'), ('security', 'NN'), ('data', 'NNS'), ('including', 'VBG'), ('pricing', 'NN'), ('ensure', 'VB'), ('consistency', 'NN'), ('investment', 'NN'), ('data', 'NNS'), ('across', 'IN'), ('multiple', 'JJ'), ('business', 'NN'), ('applications', 'NNS'), ('databases', 'VBZ'), ('work', 'NN'), ('closely', 'RB'), ('business', 'NN'), ('teams', 'NNS'), ('ensure', 'VB'), ('data', 'NNS'), ('integrity', 'NN'), ('define', 'NN'), ('process', 'NN'), ('improvements', 'NNS'), ('respond', 'VB'), ('data', 'NNS'), ('requests', 'NNS'), ('support', 'VBP'), ('analytic', 'JJ'), ('investing', 'VBG'), ('portfolio', 'NN'), ('management', 'NN'), ('functions', 'NNS'), ('analyze', 'VBP'), ('exception', 'NN'), ('reports', 'NNS'), ('followup', 'JJ'), ('ensure', 'VB'), ('timely', 'JJ'), ('resolution', 'NN'), ('analyze', 'IN'), ('user', 'JJ'), ('requests', 'NNS'), ('respond', 'VBP'), ('data', 'NNS'), ('issues', 'NNS'), ('assessment', 'JJ'), ('resolution', 'NN'), ('root', 'NN'), ('cause', 'NN'), ('work', 'NN'), ('business', 'NN'), ('development', 'NN'), ('extensive', 'JJ'), ('list', 'NN'), ('data', 'NNS'), ('requirements', 'NNS'), ('needed', 'VBD'), ('ongoing', 'JJ'), ('basis', 'NN'), ('ensuring', 'VBG'), ('deadlines', 'NNS'), ('met', 'VBD'), ('high', 'JJ'), ('level', 'NN'), ('accuracy', 'NN'), ('assist', 'JJ'), ('projects', 'NNS'), ('specific', 'JJ'), ('data', 'NNS'), ('team', 'NN'), ('well', 'RB'), ('initiatives', 'VBZ'), ('review', 'NN'), ('existing', 'VBG'), ('business', 'NN'), ('processes', 'NNS'), ('identify', 'VBP'), ('improvements', 'NNS'), ('and/or', 'JJ'), ('opportunities', 'NNS'), ('leverage', 'VBP'), ('technology', 'NN'), ('achieve', 'NN'), ('business', 'NN'), ('objectives', 'NNS'), ('document', 'NN'), ('data', 'NNS'), ('integrity', 'NN'), ('processes', 'VBZ'), ('procedures', 'NNS'), ('controls', 'NNS'), ('maintain', 'VBP'), ('data', 'NNS'), ('flow', 'JJ'), .........]
Hence, I filtered the nouns, and got rid of all other POS tags, such as adjectives, verbs, adverbs etc.
filtered_pos = []
for one in tokens_pos_tag:
    if one[1] == 'NN' or one[1] == 'NNS' or one[1] == 'NNP' or one[1] == 'NNPS':
        filtered_pos.append(one)
print(len(filtered_pos))
# 3631
Finally, I was left with 3,631 words.
Once I had all the nouns, finding the most frequent words was an easy task. I used NLTK’s FreqDist() to get the frequency distribution of the words, and then selected the top-100 words.
fdist_pos = nltk.FreqDist(filtered_pos)
top_100_words = fdist_pos.most_common(100)
print(top_100_words)

[(('data', 'NNS'), 236), (('experience', 'NN'), 117), (('skills', 'NNS'), 83), (('ability', 'NN'), 83), (('business', 'NN'), 73), (('analysis', 'NN'), 51), (('work', 'NN'), 48), (('management', 'NN'), 39), (('years', 'NNS'), 35), (('knowledge', 'NN'), 31), (('reports', 'NNS'), 31), (('quality', 'NN'), 29), (('support', 'NN'), 25), (('communication', 'NN'), 25), (('requirements', 'NNS'), 24), (('team', 'NN'), 23), (('environment', 'NN'), 23), (('design', 'NN'), 22), (('product', 'NN'), 22), (('tools', 'NNS'), 22), (('projects', 'NNS'), 21), (('systems', 'NNS'), 20), (('analyst', 'NN'), 20), (('project', 'NN'), 18), (('sql', 'NN'), 18), (('process', 'NN'), 17), (('health', 'NN'), 17), (('statistics', 'NNS'), 17), (('dashboards', 'NNS'), 17), (('sources', 'NNS'), 17), (('office', 'NN'), 16), (('asset', 'NN'), 16), (('information', 'NN'), 16), (('software', 'NN'), 15), (('opportunities', 'NNS'), 15), (('computer', 'NN'), 15), (('time', 'NN'), 15), (('analytics', 'NNS'), 15), (('processes', 'NNS'), 14), (('development', 'NN'), 14), (('field', 'NN'), 14), (('issues', 'NNS'), 14), (('detail', 'NN'), 14), (('science', 'NN'), 14), (('results', 'NNS'), 13), (('problems', 'NNS'), 13), (('attention', 'NN'), 13), (('customer', 'NN'), 13), (('performance', 'NN'), 13), (('integrity', 'NN'), 13), (('qualifications', 'NNS'), 13), (('solutions', 'NNS'), 13), (('insights', 'NNS'), 13), (('teams', 'NNS'), 12), (('marketing', 'NN'), 12), (('degree', 'NN'), 12), (('market', 'NN'), 12), (('report', 'NN'), 12), (('problem', 'NN'), 11), (('users', 'NNS'), 11), (('position', 'NN'), 11), (('findings', 'NNS'), 11), (('mathematics', 'NNS'), 11), (('company', 'NN'), 11), (('techniques', 'NNS'), 11), (('client', 'NN'), 11), (('databases', 'NNS'), 10), (('reporting', 'NN'), 10), (('collection', 'NN'), 10), (('database', 'NN'), 10), (('metrics', 'NNS'), 10), (('presentation', 'NN'), 10), (('improvement', 'NN'), 10), (('functions', 'NNS'), 10), (('engineering', 'NN'), 10), (('system', 'NN'), 10), (('excel', 'NN'), 9), (('recommendations', 'NNS'), 9), (('research', 'NN'), 9), (('analyses', 'NNS'), 9), (('job', 'NN'), 9), (('ms', 'NN'), 9), (('trends', 'NNS'), 9), (('sets', 'NNS'), 9), (('plan', 'NN'), 8), (('clients', 'NNS'), 8), (('execution', 'NN'), 8), (('manage', 'NN'), 8), (('stakeholders', 'NNS'), 8), (('needs', 'NNS'), 8), (('level', 'NN'), 8), (('operations', 'NNS'), 8), (('queries', 'NNS'), 8), (('r', 'NN'), 8), (('and/or', 'NN'), 7), (('initiatives', 'NNS'), 7), (('models', 'NNS'), 7), (('technologies', 'NNS'), 7), (('education', 'NN'), 7), (('technology', 'NN'), 7)]
I converted the list into a DataFrame to clean it a bit.
top_words_df = pd.DataFrame(top_100_words, columns=('pos', 'count'))
top_words_df['Word'] = top_words_df['pos'].apply(lambda x: x[0])  # take the word out of the (word, POS) tuple
top_words_df = top_words_df.drop('pos', 1)                        # drop the previous column
top_words_df.head()

   count        Word
0    236        data
1    117  experience
2     83      skills
3     83     ability
4     73    business
And finally, I used the wordcloud and matplotlib libraries to present it in the form of a word cloud.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

subset_pos = top_words_df[['Word', 'count']]
tuples_pos = [tuple(x) for x in subset_pos.values]

wordcloud = WordCloud()
# newer versions of the wordcloud library expect a dict of word -> frequency
wordcloud.generate_from_frequencies(dict(tuples_pos))

plt.figure(figsize=(20, 15))
plt.imshow(wordcloud, interpolation="bilinear")
plt.show()
I got the following image:
Finding the Most Frequent Phrases
The next part of this project was finding the most frequent phrases, which can be used to customize resumes. For this, I filtered out the most frequent bigrams (2 words) and trigrams (3 words).
For filtering the bigrams and trigrams, I had to consider the original text (before cleaning it) as removing stop-words, numbers or punctuation breaks the original sentences.
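As a quick, illustrative sanity check (this snippet is not part of the pipeline), removing stop-words collapses a phrase like "ability to work independently" and destroys its bigrams:

from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
phrase = "ability to work independently".split()
print([w for w in phrase if w not in stop])
# ['ability', 'work', 'independently']  -- the bigram ('ability', 'to') is gone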
I started by creating bigrams from the word tokens I had generated from the original text. Then, I selected the top-100 bigrams using a frequency distribution and converted them into a DataFrame.
bgs = nltk.bigrams(tokens)         # bigrams from the original (uncleaned) tokens
fdist2 = nltk.FreqDist(bgs)        # frequency distribution of the bigrams
bgs_100 = fdist2.most_common(100)  # top-100 bigrams
bgs_df = pd.DataFrame(bgs_100, columns=('bigram', 'count'))
bgs_df.head()

               bigram  count
0       (ability, to)     79
1            (,, and)     74
2  (experience, with)     41
3             (in, a)     35
4          (to, work)     28
Then, I converted the tuples into strings and removed the ones that had numbers or punctuation in them.
Finally, after removing the excess columns, I was left with the most frequent bigrams.
bgs_df['phrase'] = bgs_df['bigram'].apply(lambda x: x[0] + " " + x[1])  # merge the tuple into a string
bgs_df['filter_bgs'] = bgs_df['phrase'].str.contains(punctuation)       # flag strings with numbers or punctuation
bgs_df = bgs_df[bgs_df.filter_bgs == False]                             # remove strings with numbers or punctuation
bgs_df = bgs_df.drop('bigram', 1)
bgs_df = bgs_df.drop('filter_bgs', 1)                                   # remove the excess columns
bgs_df.reset_index()
bgs_df.head(10)

# Final bigrams
    count            phrase
0      79        ability to
2      41   experience with
3      35              in a
4      28           to work
5      26          years of
6      26      knowledge of
7      22           of data
8      22           such as
9      21          and data
10     20  understanding of
However, the bigrams were not of much use, so I repeated the exercise with trigrams. This time, I got some good phrases and keywords:
    count                         phrase
0      14                ability to work
1      13            attention to detail
2      13                 the ability to
3      10            years of experience
6       8                     to work in
8       7                     as well as
10      6          to work independently
11      6               with the ability
12      6                experience as a
13      6                 a data analyst
14      6                      as a data
15      6           working knowledge of
16      6        demonstrated ability to
17      6            experience with sql
18      6   written communication skills
19      6                   able to work
20      5       strong analytical skills
22      5    verbal communication skills
23      5            in computer science
24      5            considered an asset
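The trigram code itself is not shown above; here is a minimal sketch, assuming it mirrors the bigram steps (the variable names tgs, fdist3 and tgs_df are illustrative, not from the original code):

tgs = nltk.trigrams(tokens)        # trigrams from the original (uncleaned) tokens
fdist3 = nltk.FreqDist(tgs)        # frequency distribution of the trigrams
tgs_100 = fdist3.most_common(100)  # top-100 trigrams
tgs_df = pd.DataFrame(tgs_100, columns=('trigram', 'count'))

tgs_df['phrase'] = tgs_df['trigram'].apply(lambda x: " ".join(x))   # merge the tuple into a string
tgs_df['filter_tgs'] = tgs_df['phrase'].str.contains(punctuation)   # flag phrases with numbers or punctuation
tgs_df = tgs_df[tgs_df.filter_tgs == False]                         # remove them
tgs_df = tgs_df.drop(['trigram', 'filter_tgs'], axis=1)             # drop the helper columns
tgs_df.head(20)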
The trigrams I got can be used to improve resumes/CVs when applying for a Data Analyst position.