Peeking through the lens of data scientists’ data: why they search for new jobs

Jubert Roldan
4 min read · Jan 6, 2021

Introduction

We have seen rampant demand for data-driven technologies in this era, and one of the key careers fuelling it is data science, which has earned the title of one of the ‘sexiest jobs’ out there.

This demand, and the plentiful opportunities that come with it, gives greater flexibility to those lucky enough to work in the field. This exploratory analysis takes a basic look at publicly available data to see how data scientists behave and to unravel what is happening in the market, using the HR Analytics: Job Change of Data Scientists dataset found on Kaggle. https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists/tasks?taskId=3015

There are 3 things that I looked at (the answers are in the conclusion):

  1. Exploring the numerical features in the data: what is the correlation between city development index and training hours?
  2. Exploring the categorical features in the data using odds and Weight of Evidence (WoE).
  3. Taking a shot at building a baseline model that shows a basic metric.

Preparing the data

As with many Kaggle example datasets, the data has already been cleaned and structured. The only thing I needed to work on was identifying the null values and thinking of a way to manage them.

As this is only an initial baseline model, I opted to simply remove the nulls, which still leaves a decent volume of the imbalanced dataset (80% not looking, 20% looking). To me, that looks alright for a baseline :)
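As a rough illustration of this step, here is a minimal sketch of dropping rows with nulls and checking the class balance afterwards. The helper name `drop_nulls_and_report` and the tiny made-up frame are assumptions for illustration only; the column name `target` is assumed to match the Kaggle file, and the notebook may handle this differently.

```python
import pandas as pd


def drop_nulls_and_report(df, target_col="target"):
    """Drop every row that contains a null and return the class shares."""
    cleaned = df.dropna().reset_index(drop=True)
    # normalize=True returns proportions instead of raw counts
    balance = cleaned[target_col].value_counts(normalize=True)
    return cleaned, balance


# Made-up rows standing in for the real CSV (values are illustrative only)
toy = pd.DataFrame({
    "training_hours": [10, 25, None, 40, 8, 60],
    "gender": ["Male", None, "Female", "Male", "Female", "Male"],
    "target": [0, 0, 1, 0, 1, 0],
})

cleaned, balance = drop_nulls_and_report(toy)
print(len(cleaned), "rows kept out of", len(toy))
print(balance)
```

Dropping nulls is the bluntest option; it is worth comparing the row count before and after, since a column with many missing values can remove most of the data.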

Analyse the data

The simplest way to analyse the data is to look at the distribution of each feature; for this I used pandas profiling. I also wanted to see how the categorical features relate to the target variable, so I looked at the odds and the Weight of Evidence (WoE) that each variable provides.
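For the two numerical features, a quick check is the Pearson correlation between them. The sketch below assumes the columns are named `city_development_index` and `training_hours`, and uses made-up values rather than the real data; the helper `numeric_correlation` is hypothetical.

```python
import pandas as pd


def numeric_correlation(df, col_a="city_development_index", col_b="training_hours"):
    """Pearson correlation between two numeric columns (pandas default method)."""
    return df[col_a].corr(df[col_b])


# Made-up values chosen so the two columns are uncorrelated
toy = pd.DataFrame({
    "city_development_index": [0.6, 0.7, 0.8, 0.9],
    "training_hours": [10, 30, 30, 10],
})

corr = numeric_correlation(toy)
print(f"correlation: {corr:.3f}")
```

A value close to 0 suggests no linear relationship between the two columns; it does not rule out a non-linear one.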

Odds: experience and being enrolled in university tend to show higher odds of moving.

Weight of evidence: shows the same pattern for experience and for those enrolled in university.
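To make the two measures concrete, here is a minimal sketch of computing both per category. WoE conventions vary; this sketch assumes WoE = ln(share of "looking" rows in the category / share of "not looking" rows in the category), so a positive value means the category leans towards looking. The helper `odds_and_woe`, the category labels and the toy rows are illustrative assumptions, not taken from the notebook.

```python
import numpy as np
import pandas as pd


def odds_and_woe(df, feature, target="target"):
    """Per-category odds and Weight of Evidence for a binary 0/1 target."""
    counts = pd.crosstab(df[feature], df[target])
    looking = counts[1]       # rows with target == 1
    not_looking = counts[0]   # rows with target == 0
    return pd.DataFrame({
        "odds": looking / not_looking,
        # Assumed convention: ln(% of all looking / % of all not looking)
        "woe": np.log((looking / looking.sum()) / (not_looking / not_looking.sum())),
    })


# Made-up rows: 4 full-time students (3 looking), 6 not enrolled (1 looking)
toy = pd.DataFrame({
    "enrolled_university": ["Full time course"] * 4 + ["no_enrollment"] * 6,
    "target": [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
})

table = odds_and_woe(toy, "enrolled_university")
print(table)
```

A category with zero rows in either class gives infinite or undefined values, so sparse categories usually need grouping or smoothing before this calculation.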

Modelling the baseline

As a very basic modelling approach, I used the most common model: logistic regression. I also tried Random Forest, but for the purposes of exploring, let's just focus on the logistic regression for now.

The model I created shows an AUC (area under the ROC curve) of 0.75. What I really wanted to see, though, are the coefficients produced by the model, found below:

Coefficients (plotted)

This gives me a sense, and intuitively shows, that years of experience is one of the indicators of job movement as a data scientist.
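A baseline of this kind can be sketched as follows. Everything in this example is an assumption for illustration: the data is synthetic, `experience` is treated as a plain number of years, the direction and size of its effect are invented, and the scaling step and train/test split are choices made here rather than the notebook's. It only shows how the AUC and the coefficients can be read off a scikit-learn logistic regression.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared features (not the Kaggle data)
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "experience": rng.integers(0, 21, n),
    "training_hours": rng.integers(1, 300, n),
})
# Invented relationship: fewer years of experience -> more likely to be looking
p_looking = 1 / (1 + np.exp(-(1.0 - 0.25 * X["experience"])))
y = (rng.random(n) < p_looking).astype(int)

# Stratify so the class imbalance is preserved in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling first puts the coefficients on a comparable footing
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
coefs = pd.Series(
    model.named_steps["logisticregression"].coef_[0], index=X.columns
)
print(f"AUC: {auc:.2f}")
print(coefs.sort_values())
```

On an imbalanced target like this one, AUC is a more useful first metric than accuracy, because predicting "not looking" for everyone would already be right most of the time.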

Conclusion/Call to Action:

Question 1. Answer: in relation to the question asked initially, the 2 numerical features are not correlated with each other, which makes them good candidates to use together as predictors.

Question 2. Answer: looking at the categorical variables, experience and being a full-time student are good indicators.

Question 3. Answer: trying out modelling the data, experience is a factor in a logistic regression model with an AUC of 0.75.

Of course, there is a lot more work that could drive this analysis further if time permits. To conclude this specific iteration, though: we have seen that experience would be a driver of job change. Maybe expectations are different? Maybe job satisfaction? Well, personally, I would agree with it.

GitHub link: all the code can be found in the Jupyter notebook below.

https://github.com/jubertroldan/hr_job_change_ds/blob/master/HR_Analytics_DS.ipynb

Jubert Roldan

Data professional crafting insights and solutions across domains. I am passionate about writing to contribute, express and share my own knowledge.