For this project, I wanted to find a way to predict the salaries for data science jobs in the Boston area.

I analyzed search results from Indeed.com, an online jobs board, and highlighted key phrases in each job posting. I also determined if each job’s salary is above a threshold value. I used logistic regression to model this data. Machine learning and python experience were the strongest predictors of salary. Machine learning, python and scientist were the strongest predictors of salary. The model had low precision — approximately 63% precision and recall, where random guesses could have achieved 57% — which may have been caused by variability in the source salary data.

Part I outlines the code used. Part II contains the writeup.

Part I: Code

Import libraries

# web scraping
from urllib.request import urlopen
from bs4 import BeautifulSoup

#Data and analysis
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm
from sklearn import preprocessing, linear_model, metrics

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
params = {'legend.fontsize': 'x-large',
          'figure.figsize': (10, 6),
         'axes.labelsize': 'x-large',
         'axes.titlesize': 20,
         'xtick.labelsize':'x-large',
         'ytick.labelsize':'x-large'}
plt.rcParams.update(params)

Define web scraping functions

def count_results(query = None, location = "Boston", binary_level = 1):
    job_ids = pd.DataFrame()

    result_list = []

    # Find the number of results
    URL_for_count = "http://www.indeed.com/jobs?q=data+scientist+{}&jt=fulltime&l={}".format(query, location)
    soup_for_count = BeautifulSoup(urlopen(URL_for_count).read(), 'html.parser')

    results_number = soup_for_count.find("div", attrs = {"id": "searchCount"}).text
    number_of_results = int(results_number.split(sep = ' ')[-1].replace(',', ''))

    # Now loop through the pages. Viewing 100 results at a time means fewer page refreshes, which are a bottleneck.
    i = int(number_of_results/100)
    for page_number in range(i + 1):
        URL_for_results = "http://www.indeed.com/jobs?q=data+scientist+{}&jt=fulltime&l={}&limit=100&start={}".format(query, location, str(100 * page_number))
        soup_for_results = BeautifulSoup(urlopen(URL_for_results).read(), 'html.parser')
        results = soup_for_results.find_all('div', attrs={'data-tn-component': 'organicJob'})

        # Extract the ID for each job listing
        for x in results:
            job_id = x.find('h2', attrs={"class": "jobtitle"})['id']
            job_title = x.find('a', attrs={'data-tn-element': "jobTitle"}).text.strip().capitalize()
            job_link = "https://www.indeed.com" + x.find('h2', attrs={"class": "jobtitle"}).find('a')['href']
            result_list.append([job_id, job_title, job_link, binary_level])

        # Add the job ID numbers
        job_ids = job_ids.append(result_list)

    # Remove re-posted jobs
    job_ids.drop_duplicates(inplace = True)
    return job_ids
# String format: "keyword, keyword combination, keyword, ..."
def count_results_by_keywords(query_string = None):

    # Ends the function if given invalid inputs
    if query_string == None:
        return(print("No keyword entered."))

    # Format the keyword string in to URL query
    query = "%20OR%20".join(query_string.split(", "))
    query = query.replace(' ', '+')

    # Perform the search
    job_ids = count_results("%28{}%29".format(query))

    # Rename job_ids's columns
    job_ids.columns = ['id', 'job title', 'job link', '{}'.format(" OR ".join(query_string.split(", ")))]

    return (job_ids)


def count_results_by_salary(salary_range_divider = None, salary_floor = 20000):

    if salary_range_divider <= salary_floor:
        return(print("Enter a number larger than ${}.".format(salary_floor)))

    job_ids = pd.DataFrame()
    # Set dividing salaries
    divider_strings = ["+${}-{}".format(salary_floor, salary_range_divider), "+${}".format(salary_range_divider)]

    # Perform two searches, starting with the low-salary jobs
    for level, salary_criterion in enumerate(divider_strings):
        job_ids = job_ids.append(count_results(salary_criterion, binary_level=level))

    # Rename job_ids's columns
    job_ids.columns = ['id', 'job title', 'job link', 'salary over {}'.format(salary_range_divider)]

    return(job_ids)


def count_results_by_years_experience(years_experience = None):

    # Ends the function if given invalid inputs
    if years_experience == None or type(years_experience) != int:
        return(print("Enter an integer value."))

    # Format the keyword string in to URL query
    query = "{}+years+or+{}%2B+years".format(str(years_experience), str(years_experience))

    # Perform the search
    job_ids = count_results("%28{}%29".format(query), binary_level = years_experience)

    # Rename job_ids's columns
    job_ids.columns = ['id', 'job title', 'job link', 'years experience']

    return (job_ids)

Collect data

Each dataframe is produced in a separate cell so that keywords can be adjusted or new searches added more quickly.

phd_dataframe = count_results_by_keywords("PhD, ph.d")
bachelors_dataframe = count_results_by_keywords("Bachelors, BS, BA") # Indeed.com's search includes "Bachelor's", "B.S.", etc.
python_dataframe = count_results_by_keywords("Python")
# Use %22 ... %22 to ensure at least one of these keywords are present
management_dataframe = count_results_by_keywords("%22Manager%22, %22director%22")

# Rename the column to a readable format
management_dataframe.columns = ['id', 'job title', 'job link', 'Manager OR director']
startup_dataframe = count_results_by_keywords("Startup, start-up")
scientist_dataframe = count_results_by_keywords("Scientist")
machine_learning_dataframe = count_results_by_keywords("Machine learning")
lab_dataframe = count_results_by_keywords("Lab, laboratory")
# Use %22 ... %22 to search for the exact phrase "Software engineer"
software_engineer_dataframe = count_results_by_keywords("%22Software engineer%22")

# Rename the column to a readable format
software_engineer_dataframe.columns = ['id', 'job title', 'job link', 'Software engineer']
experience_results_dataframe = pd.DataFrame()

# Scan search results from a range of year requirements
for years in range(1+7):
    experience_results_dataframe = experience_results_dataframe.append(count_results_by_years_experience(years))

experience_results_dataframe['years experience'].astype(int).astype('category', ordered = True, copy = False)
years_dummies_dataframe = pd.get_dummies(experience_results_dataframe['years experience'], prefix = 'years experience', prefix_sep = ' ')
experience_dataframe = pd.concat(objs = [experience_results_dataframe.drop('years experience', axis = 1), years_dummies_dataframe], axis = 1)
salary_dataframe = count_results_by_salary(90000)

Combine data

common_columns = ['id', 'job title', 'job link']
master_dataframe = phd_dataframe.merge(bachelors_dataframe, on = common_columns, how = 'outer')\
    .merge(python_dataframe, on = common_columns, how = 'outer')\
    .merge(startup_dataframe, on = common_columns, how = 'outer')\
    .merge(scientist_dataframe, on = common_columns, how = 'outer')\
    .merge(machine_learning_dataframe, on = common_columns, how = 'outer')\
    .merge(lab_dataframe, on = common_columns, how = 'outer')\
    .merge(software_engineer_dataframe, on = common_columns, how = 'outer')\
    .merge(management_dataframe, on = common_columns, how = 'outer')\
    .merge(experience_dataframe, on = common_columns, how = 'outer')\
    .merge(salary_dataframe, on = common_columns, how = 'outer')

# Convert non-id columns to integers and fill NaN values
data_conversion_mask = (master_dataframe.columns != 'id') & (master_dataframe.columns != 'job title') & (master_dataframe.columns != 'job link')
master_dataframe.ix[:, data_conversion_mask] = master_dataframe.ix[:, data_conversion_mask].fillna(value = 0).astype(int)
master_dataframe.head(3)
id job title job link PhD OR ph.d Bachelors OR BS OR BA Python Startup OR start-up Scientist Machine learning Lab OR laboratory ... Manager OR director years experience 0 years experience 1 years experience 2 years experience 3 years experience 4 years experience 5 years experience 6 years experience 7 salary over 90000
0 jl_7896e61d3e45dfb4 Scientist, research and modeling - cyber risk https://www.indeed.com/rc/clk?jk=7896e61d3e45d... 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 jl_a116f7a8d8db626e Research scientist, muscle https://www.indeed.com/rc/clk?jk=a116f7a8d8db6... 1 0 0 0 1 0 0 ... 0 0 0 0 1 0 0 0 0 0
2 jl_89b3b83a5b0fd82c Cellular analytics scientist https://www.indeed.com/rc/clk?jk=89b3b83a5b0fd... 1 1 0 0 1 0 1 ... 0 0 0 1 0 0 0 0 0 1

3 rows × 21 columns

# Create a dataframe that keeps only the data features for analysis
data_dataframe_mask = (master_dataframe['Lab OR laboratory'] == 0) | (master_dataframe['Software engineer'] == 1)
data_dataframe = master_dataframe[data_dataframe_mask].drop(labels = common_columns, axis = 1).drop(labels = ['Lab OR laboratory', 'Software engineer'], axis = 1)
#data_dataframe.loc[:, 'years experience'] = sklearn.preprocessing.minmax_scale(data_dataframe['years experience'].astype(float), copy = False)

Model with statsmodel logistic regression

sm_model_dataframe = data_dataframe.copy()
sm_model_dataframe.columns = sm_model_dataframe.columns.str.replace(' ', '_').str.replace('.', '_').str.replace('-', '_')
sm_model_dataframe.columns
Index(['PhD_OR_ph_d', 'Bachelors_OR_BS_OR_BA', 'Python', 'Startup_OR_start_up',
       'Scientist', 'Machine_learning', 'Manager_OR_director',
       'years_experience_0', 'years_experience_1', 'years_experience_2',
       'years_experience_3', 'years_experience_4', 'years_experience_5',
       'years_experience_6', 'years_experience_7', 'salary_over_90000'],
      dtype='object')
sm_model = sm.logit("salary_over_90000 ~ PhD_OR_ph_d + Bachelors_OR_BS_OR_BA + Python + Startup_OR_start_up + Scientist + Machine_learning + Manager_OR_director + years_experience_0 + years_experience_1 + years_experience_2 + years_experience_3 + years_experience_4 + years_experience_5 + years_experience_6 + years_experience_7", data=sm_model_dataframe).fit()
sm_model.summary()

Optimization terminated successfully.
         Current function value: 0.625174
         Iterations 5
Logit Regression Results
Dep. Variable: salary_over_90000 No. Observations: 1128
Model: Logit Df Residuals: 1112
Method: MLE Df Model: 15
Date: Tue, 03 Jan 2017 Pseudo R-squ.: 0.08790
Time: 10:06:47 Log-Likelihood: -705.20
converged: True LL-Null: -773.16
LLR p-value: 1.460e-21
coef std err z P>|z| [95.0% Conf. Int.]
Intercept -0.6329 0.132 -4.795 0.000 -0.892 -0.374
PhD_OR_ph_d -0.0532 0.148 -0.360 0.718 -0.343 0.236
Bachelors_OR_BS_OR_BA 0.0686 0.173 0.397 0.691 -0.270 0.407
Python 1.1100 0.181 6.117 0.000 0.754 1.466
Startup_OR_start_up 0.2948 0.216 1.366 0.172 -0.128 0.718
Scientist 0.5263 0.168 3.125 0.002 0.196 0.856
Machine_learning 0.4836 0.194 2.489 0.013 0.103 0.864
Manager_OR_director 0.4290 0.169 2.537 0.011 0.098 0.760
years_experience_0 0.7374 1.157 0.637 0.524 -1.531 3.006
years_experience_1 -0.2845 0.254 -1.119 0.263 -0.783 0.214
years_experience_2 -0.1667 0.219 -0.760 0.447 -0.597 0.263
years_experience_3 -0.0004 0.212 -0.002 0.998 -0.417 0.416
years_experience_4 -0.0320 0.304 -0.105 0.916 -0.628 0.564
years_experience_5 0.2982 0.206 1.450 0.147 -0.105 0.701
years_experience_6 0.2757 0.403 0.685 0.493 -0.513 1.065
years_experience_7 0.3683 0.360 1.024 0.306 -0.336 1.073

Model with sklearn’s logistic regression

import sklearn
X = data_dataframe.drop('salary over 90000', axis = 1)
y = data_dataframe['salary over 90000']
skl_logreg = sklearn.linear_model.LogisticRegressionCV(cv = 6)
skl_model = skl_logreg.fit(X, y)
predictions = skl_model.predict(X)

model_confusion_matrix = sklearn.metrics.confusion_matrix(y, predictions)
print("Average cross validation scores: {}".format(sklearn.cross_validation.cross_val_score(skl_logreg, X, y, scoring='roc_auc').mean()))
print("Confusion matrix:\n", model_confusion_matrix)
print("Model performance: \n", sklearn.metrics.classification_report(y, predictions))
print("Area under ROC curve for model: {}".format(sklearn.metrics.roc_auc_score(y, predictions)))
Average cross validation scores: 0.6704394926246255
Confusion matrix:
 [[146 348]
 [ 79 555]]
Model performance:
              precision    recall  f1-score   support

          0       0.65      0.30      0.41       494
          1       0.61      0.88      0.72       634

avg / total       0.63      0.62      0.58      1128

Area under ROC curve for model: 0.5854704402355074
logreg_probabilites_dataframe = pd.DataFrame(skl_logreg.predict_proba(X), columns = ["Probability over threshold", "Probability under threshold"])
model_vs_data_dataframe = pd.concat(objs = [logreg_probabilites_dataframe, master_dataframe[['salary over 90000']]],axis=1)
model_vs_data_dataframe.head()
Probability over threshold Probability under threshold salary over 90000
0 0.499859 0.500141 0
1 0.448430 0.551570 0
2 0.435767 0.564233 1
3 0.451176 0.548824 1
4 0.346364 0.653636 0
skl_logreg_model_coefficients = pd.DataFrame(data = skl_model.coef_, index = ['coefficient'], columns = data_dataframe.drop(labels = 'salary over 90000', axis = 1).columns).T

Part II: Report

Since each company has different data science needs, I expected each job posting’s requirements to vary by a large range. I therefore decided to focus our analysis on a few key pieces of information in each job posting, specifically:

  • If the job required a PhD or a bachelor’s degree
  • If the job asked for experience with Python
  • If the company was likely to be a start-up
  • If the job specifically used “scientist”, as a proxy for expectations of technical knowledge
  • If the job description mentioned machine learning
  • If the job function includes managing or directing responsibilities
  • The mean number of years of experience required

My goal was to predict which jobs will have a salary exceeding $90,000 based on these features. This value was chosen to represent the mean salary of our data, as is shown below. While I’d prefer to predict the salary more exactly, job salaries typically vary according to a number of factors (location, responsibilities, company policies, etc.) which means computing a salary range may be more useful.

print("Ratio of jobs over salary threshold to under salary threshold: {}".format(round(data_dataframe['salary over 90000'].describe()['mean'],3)))
Ratio of jobs over salary threshold to under salary threshold: 0.562
sns.countplot(data = data_dataframe, x = 'salary over 90000')
plt.title("Number of Jobs Under and Over Salary Threshold")
sns.plt.show()

png

Procedure: Scraping Job Listings From Indeed.com

By a rough estimation, only a fraction of jobs posts explicitly included more that one of these features. Rather than follow the hyperlink for each post and scanning each page, which would likely have formatting particular to each company, I reasoned that I could leverage Indeed.com’s search to obtain information we could not readily access.

Each job posting on Indeed.com has a unique ID. (When capturing the data, we noticed approximately 20% had duplicated IDs. I determined by inspection that these were reposted jobs, and therefore reasonably removed these data points.) For this reason I could employ a strategy where I changed the search query and kept track of which jobs were among the results, perform multiple keyword searches, and merge the resulting data on the for each ID. For example, the following are the first few jobs that required a PhD and mentioned 3 years of experience:

master_dataframe[(master_dataframe['PhD OR ph.d'] == 1) & (master_dataframe['years experience 3'] == 1)][['job title', 'PhD OR ph.d', 'years experience 3']].head()
job title PhD OR ph.d years experience 3
1 Research scientist, muscle 1 1
4 Scientist, cancer cell dependencies, oncology ... 1 1
13 Research scientist / 40 / day / bwh - neurology 1 1
17 Phd research scientist- machine learning 1 1
30 Research data scientist 1 1

This method of analysis is much simpler than the most obvious alternative strategy — following each job posting on Indeed.com to its description on an external website and performing a keyword search — but the method relied on Indeed.com’s search algorithm, which introduced complications:

Indeed.com’s support documentation stated that only about 20% of the job postings in a search result include salary information. The search results deliver this information if available; otherwise, the salaries are estimated from past job postings and user-provided data.

The web scraping provided a significant number of results that contained keywords like “data” and “science” but may be unrelated to data science. For example:

master_dataframe[master_dataframe['Lab OR laboratory'] == 1]['job title'].head()
2                         Cellular analytics scientist
3                         Cellular analytics scientist
4    Scientist, cancer cell dependencies, oncology ...
5    Scientist, cancer cell dependencies, oncology ...
6    Scientist, cancer cell dependencies, oncology ...
Name: job title, dtype: object

An inspection of a sampling of the potentially unrelated jobs suggested that job postings that referred to a laboratory were more likely to focus on biology or related sciences. Similarly, job postings with titles that included “software engineer” were less likely to relate to data science. These keywords were used to filter these jobs from the data.

Results and Analysis

The results are not be current since they were drawn from non-static webpages.

I analyzed postings for 1692 jobs. I removed 564 observations from the data, a 33.3% reduction, leaving 1128 observations. “Scientist” was by far the most common keyword, occurring in approximately 75% of observations, followed by PhD, “laboratory”, and bachelor’s degree (36%, 29% and 28%, respectively).

print("Jobs analyzed: {}\nJob postings removed: {} ({}% change)\nRemaining postings: {}".format(master_dataframe.shape[0], master_dataframe.shape[0] - data_dataframe.shape[0], round(100 * (data_dataframe.shape[0] / master_dataframe.shape[0] - 1),1), data_dataframe.shape[0]))
print("\nKeywords\t\tFrequency\n{}".format(master_dataframe.drop('salary over 90000', axis = 1).describe().loc['mean'].sort_values(ascending = False)))
Jobs analyzed: 1692
Job postings removed: 564 (-33.3% change)
Remaining postings: 1128

Keywords		Frequency
Scientist                0.791962
PhD OR ph.d              0.372340
Lab OR laboratory        0.336879
Bachelors OR BS OR BA    0.217494
Python                   0.211584
Manager OR director      0.196809
Machine learning         0.167849
years experience 5       0.132388
years experience 3       0.131797
years experience 2       0.119385
Startup OR start-up      0.091017
years experience 1       0.085106
years experience 4       0.059102
Software engineer        0.048463
years experience 7       0.043735
years experience 6       0.033097
years experience 0       0.007092
Name: mean, dtype: float64

Of the keywords examined, Python, Scientist and Machine Learning were the strongest predictors of a job’s salary.

skl_logreg_model_coefficients.sort_values(by = 'coefficient', ascending = False)
coefficient
Python 0.273112
Scientist 0.195354
Machine learning 0.188584
Startup OR start-up 0.068621
Manager OR director 0.062443
PhD OR ph.d 0.055231
years experience 5 0.053895
years experience 7 0.026132
Bachelors OR BS OR BA 0.018832
years experience 3 0.011098
years experience 6 0.006221
years experience 0 0.006133
years experience 4 0.000689
years experience 2 -0.013593
years experience 1 -0.022556
plt.figure(figsize=(10,6))
sns.barplot(data = skl_logreg_model_coefficients.sort_values(by = 'coefficient', ascending = False).T, orient='h')
plt.title("Influence of Features on Predicting Job Salary")
sns.plt.show()

png

Conclusion

Python, Scientist, and Machine learning were the strongest predictors of salary. However this may be because these keywords are better correlated with data science roles; therefore, these keywords could be predicting a certain class of job from others which tend to have different corresponding salaries. (E.g., a data scientist is typically paid more than a data analyst.)

The model’s precision was likely impinged from the variability in Indeed.com’s salary estimates. If given more time, one could measure this variability by comparing job postings where the salary is explicitly stated to those with estimated salaries.