Looking into the Starbucks dataset — an example analysis of transactional data from the Starbucks mobile app to predict who picks up a discount.

Jubert Roldan
7 min read · Feb 28, 2021


Introduction

In this analysis, we look into how to build a model that predicts whether or not a promotional offer will be successful.

The dataset used here is simulated data that mimics customer behaviour on the Starbucks rewards mobile app. The mobile app sends out offers and/or informational material to its customers, such as discounts (%), BOGO (buy one, get one free), and informational material about what's happening at Starbucks.

In this analysis, we focus only on the actionable offers, BOGO and discounts, to look at what is happening and try to predict whether an offer will be successful or not. The reason is that we want to see the complete journey taken by the customer in order to judge an offer's effectiveness.

This analysis can be applied to transactional data in other industries, such as banking or mobile transactions, which makes it a potentially useful, transferable skill on the data science career journey.

Metrics

The metric that we will be using is the AUC. It indicates how well the probabilities for the positive class are separated from those of the negative class.

The aim is to get an AUC that is better than random for the baseline model. From there, we try to improve the model.

To ensure the metric is consistent, I also implemented KFold cross-validation. The KFold cross-validation algorithm divides the dataset into k non-overlapping folds, each of which is used in turn as a held-back test set. The mean across the folds then gives a more reliable metric for the model.
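As a rough illustration of this evaluation setup (a sketch, not the exact code from the repo; X and y stand in for the feature matrix and the offer_successful target), a 10-fold cross-validation run with scikit-learn could look like this:

    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.linear_model import LogisticRegression

    # X = feature matrix, y = offer_successful target (placeholders here)
    cv = KFold(n_splits=10, shuffle=True, random_state=42)
    model = LogisticRegression(max_iter=1000)

    # Score each of the 10 held-back folds and report the mean and spread
    scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)
    print("Accuracy: %.3f (%.3f)" % (scores.mean(), scores.std()))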

Preparing the Data

The dataset comprises 3 main tables, provided as JSON files:

  • portfolio — details about the offers available
  • profile — customer demographics
  • transcript — transactions/offers the customer got from the mobile app.

In order to prepare the data, we needed to assign data types, understand what is happening in each dataset, and look out for null values. Below is a summary of what was done to each dataset before merging them together.

portfolio — we needed to make sense of the channels available. This column has 4 possible values, and it is best to turn them into dummy variables (1's and 0's) so they can be used later as features.
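A minimal sketch of that step, assuming the channels column holds a list of channel names per offer (the exact wrangling in the repo may differ):

    import pandas as pd

    # Turn the list-valued "channels" column into 0/1 dummy columns,
    # e.g. ["email", "mobile", "web"] -> email=1, mobile=1, web=1, social=0
    channel_dummies = (
        portfolio["channels"]
        .explode()
        .str.get_dummies()
        .groupby(level=0)
        .max()
    )
    portfolio = portfolio.drop(columns="channels").join(channel_dummies)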

profile — the profile data is a bit sparse; however, the rows with null values are straightforward to fix. Dropping age = 118 removes the unnecessary null values (this will be shown in the visualization part).
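A minimal sketch of that cleaning step, assuming the dataframe is called profile:

    # Age 118 is the placeholder used where demographics are missing; dropping
    # those rows also removes the null gender and income values.
    profile = profile[profile["age"] != 118].copy()
    print(profile.isnull().sum())  # confirm no nulls remain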

transcript — the transactional data is the main dataset containing the events and the person. In order to process this, we need to extract the key-value pairs from the dictionaries in the value column. This is also where we create our target variable.
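One way to unpack the value column (a sketch; it assumes each row's value is a dict holding keys such as the offer id or a transaction amount):

    # Expand the dict stored in "value" into its own columns and attach them
    # back to the events; column names come from the dict keys themselves.
    value_cols = transcript["value"].apply(pd.Series)
    transcript = pd.concat([transcript.drop(columns="value"), value_cols], axis=1)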

Merging these together, we get the dataframe below with a target variable "offer_successful", defined as the complete journey of a customer who received an offer, viewed the offer and completed the offer. To me this is a legitimate definition of what an end-to-end successful offer looks like.

Steps to define a successful offer (a code sketch of these steps follows the list):

  1. Merge the 3 tables together using their specific ids.
  2. Get the dummy variables based on the event (in transcript).
  3. Group by person and offer_id.
  4. Identify which customers have passed through the entire journey from offer received to offer viewed to offer completed.
  5. For all the customers who have crossed the full journey (offer received = 1, offer viewed = 1 and offer completed = 1), assign 1, otherwise assign 0. (This is the target variable for the model we will build in the next part.)
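A rough sketch of steps 2-5 (the column and dataframe names here are illustrative, not necessarily those used in the repo):

    # One-hot encode the event types on the merged dataframe
    events = pd.get_dummies(merged["event"])
    merged = pd.concat([merged, events], axis=1)

    # For each (person, offer) pair, check whether every step of the journey
    # happened at least once
    journey = merged.groupby(["person", "offer_id"])[
        ["offer received", "offer viewed", "offer completed"]
    ].max()

    # Target variable: 1 only when the full journey was completed
    journey["offer_successful"] = (
        (journey["offer received"] == 1)
        & (journey["offer viewed"] == 1)
        & (journey["offer completed"] == 1)
    ).astype(int)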

All code can be found in the GitHub repo linked at the end of this post.

Analysing the Data — Data Visualization

Here is a quick look at the dataset that we have.

  • Income demographics (histogram)
  • Gender demographics
  • Age profile (histogram) — removing the outlier (age = 118) shows normally distributed data
  • Events — far more transactions than offers, mainly because a customer can transact without receiving, viewing or completing an offer

Data Exploration — correlation plots

The correlation plot does not indicate any highly correlated features, except for duration and difficulty.
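The plot itself was likely produced with something along these lines (a sketch; merged is the combined dataframe):

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Correlation matrix of the numeric features
    corr = merged.corr(numeric_only=True)
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
    plt.title("Feature correlations")
    plt.show()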

Descriptive Statistics

As part of data cleaning, it is good to highlight that the data isn't always perfect; therefore, outliers and null values should be dealt with accordingly.

  1. There are null values in the profile data.

2. Descriptive stats.

portfolio — descriptive stats

profile — shows around 2,175 customers with age = 118, which is unrealistic, and these rows also have null values for gender. Removing them still leaves sufficient data to use for prediction.

transcript — no null values in the transcript data.

The amount customers spend has a maximum of ~$1k, with a mean of $13.

Methodology

Build a model that predicts better than random as the baseline, then optimise using an alternative model for comparison.

Data Preprocessing — Note that the null values have been dropped to handle the sparse dataset. Enough samples remain, so this is an acceptable solution.

Model

Logistic Regression — a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) estimates the parameters of a logistic model (a form of binary regression). (Source: Wikipedia)

The initial logistic regression model provides a 0.75 area under the curve; this means it separates the positive and negative classes noticeably better than a random classifier (AUC = 0.5).

Logistic regression model — ROC curve
ROC AUC score: 0.67
F1 score: around 0.7, which is better than random.
KFold validation metric (n_splits = 10):
Accuracy: 0.672 (0.007) — baseline
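A sketch of how such a baseline could be trained and scored (the notebook in the repo may differ; X and y are the merged features and the offer_successful target):

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, f1_score

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )

    baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # The probabilities for the positive class drive the ROC AUC
    probs = baseline.predict_proba(X_test)[:, 1]
    print("ROC AUC:", roc_auc_score(y_test, probs))
    print("F1:", f1_score(y_test, baseline.predict(X_test)))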

Data Refinement — can we improve it a little bit further?

Random Forest Classifier — an ensemble of decision trees for classification. Though it is prone to overfitting, it is one of the more popular machine learning algorithms out there, alongside other variants of tree-based models such as GBM, Extra Trees and XGBoost.

The random forest classifier has been optimised using GridSearchCV with the grid parameters below (a code sketch follows them). It reaches an AUC of 0.79, a better model than the previous one.

  • max_depth: 100
  • min_samples_leaf: 5
  • n_estimators: 1000
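A rough sketch of that tuning step (the grid values are assumptions built around the parameters listed above; the exact grid in the repo may differ):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "max_depth": [10, 50, 100],
        "min_samples_leaf": [1, 5, 10],
        "n_estimators": [100, 500, 1000],
    }

    grid = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid,
        scoring="roc_auc",
        cv=5,
        n_jobs=-1,
    )
    grid.fit(X_train, y_train)
    print(grid.best_params_, grid.best_score_)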

Random Forest classifier — ROC curve (a better AUC of 0.79)
ROC AUC score: 0.79
KFold validation metric (n_splits = 10):
Accuracy: 0.705 (0.010) — improved!
This shows slightly better results compared to the previous model (logistic regression).

Conclusion/Call to action

In summary, this exercise has analysed, cleaned and modelled the available data to predict which members will respond to an offer with an effective, complete journey.

A baseline model such as logistic regression can be used for this classification. Logistic regression is easy to implement and very efficient to train. Though it provides a decent AUC (area under the curve), there are models that can improve the results because of the way their algorithms work: a logistic regression constructs a linear decision boundary.

A Random Forest, on the other hand, is not constrained to linear boundaries; it is based on an ensemble of trees that can partition the data to a specified depth (and can be optimised using GridSearchCV).

An AUC of 0.79 compared to an AUC of 0.67 does not make a huge difference in the real world; however, it's good to note that there may be further opportunity, given the time and processing power to run a grid search over a larger number of parameters.

The result can be further refined and enriched with more data, which would then be of value to Starbucks. Other models can also be tested to improve performance.

Reflection

The problem I am trying to solve is to create a model that can predict whether an offer will be successful or not, where success is specifically defined as received, viewed and completed. If the customer completed an offer without receiving or viewing it, it does not count as successful.

This analysis can serve as a starting point, providing the business with an initial model that is better than random if Starbucks has no model yet. Giving the business a score for how likely an offer is to succeed would be genuinely helpful.

This model can still be further optimised by enriching the features and by trying out more advanced models such as XGBoost and GBM. The model can also be exported as a pickle file, which can then be used in a production-ready environment; however, that is outside the scope of this exercise.
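For completeness, exporting a fitted model to a pickle file could look like this (a sketch; grid.best_estimator_ is the tuned random forest from the grid search above, and X_new stands for new customer data):

    import pickle

    # Save the tuned model for use in a production environment
    with open("offer_success_model.pkl", "wb") as f:
        pickle.dump(grid.best_estimator_, f)

    # Later, load it back and score new customers
    with open("offer_success_model.pkl", "rb") as f:
        model = pickle.load(f)
    scores = model.predict_proba(X_new)[:, 1]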

GitHub repository

https://github.com/jubertroldan/Starbucks


Written by Jubert Roldan

Data professional crafting insights and solutions across domains. I am passionate about writing to contribute, express and share my own knowledge
