Starbucks Capstone Project

Nihal Shah
Apr 30, 2021
Fig 1: Courtesy Google Images

This post is about my Udacity Data Science Nanodegree capstone project. One of the project choices was the Starbucks Offer Analysis project: analyze the offer and transaction data over a period of 1 month and recommend offers based on the customer segment.

Project Definition

Project Overview

In this capstone project, I worked with datasets from Starbucks that contain simulated data mimicking customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). The task was to combine transaction, demographic and offer data to determine which demographic groups respond best to which type of offer. This project derived insights by merging the following 3 datasets:

  1. Rewards program users (17000 users x 5 fields)
  2. Offers sent during 30-day test period (10 offers x 6 fields)
  3. Event log (306648 events x 4 fields)

Problem Statement

I’ve used the CRISP-DM process to understand the datasets and derive insights to make recommendations to the Starbucks team. For more information about this approach, please refer to this article. Through this capstone project, I wanted to understand the following:

  1. How does offer response depend on gender, age and income?
  2. What type of demographic is least affected by offers?
  3. What are the top 2 offers for each customer segment?
  4. If we’re given demographic features about a customer, how can we determine if they’ll respond to a BOGO or a discount offer?

For the first 3 questions, I did Exploratory Data Analysis (EDA) to identify patterns in the data. For question 4, I developed a machine learning model which takes the demographic features and offer type as inputs and determines if a customer will respond to it.

Metrics

For the machine learning model to determine what type of offer a customer will respond to, I used the classification report from the sklearn library to determine model performance. It consists of the following metrics:

  • Precision: the ability not to label an instance positive that is actually negative (True Positive / (True Positive + False Positive))
  • Recall: the ability to find all positive instances (True Positive / (True Positive + False Negative))
  • F-1 score: the harmonic mean of precision and recall, such that the best score is 1.0 and the worst is 0.0
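To make the three formulas concrete, here is a minimal sketch that computes them from raw counts. The function name and the counts are made up for illustration and do not come from the project data.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    # Guard against empty denominators so the function never divides by zero
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only: 40 true positives, 10 false positives, 20 false negatives
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=20)
print(round(p, 2), round(r, 2), round(f1, 2))
```

With these counts, precision is 40/50, recall is 40/60, and the F-1 score sits between the two, closer to the lower value.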

For a holistic assessment of the model, I decided to use the classification report rather than a single metric such as the f-score or accuracy_score. Reported for each class, these metrics cover all possible model outcomes: true and false positives, and true and false negatives. I could have added accuracy and prevalence as 2 additional metrics, but for the purpose of this project I decided to go with the 3 mentioned above.

Methodology

Data Understanding

The data is contained in three files:

  • portfolio.json — containing offer ids and meta data about each offer (duration, type, etc.)
  • profile.json — demographic data for each customer
  • transcript.json — records for transactions, offers received, offers viewed, and offers completed

Data Preprocessing

Even though the first 3 questions don’t involve a machine learning model, I still had to clean and merge the datasets to get the desired dataframes. Data pre-processing involved 2 main steps:

  1. Data Cleaning: This was the most important part of the project and probably took 70% of the effort of the capstone project. Let’s take a top-down approach. To answer my questions, I needed 3 main dataframes:
  • Transaction data — combining the 3 datasets (This section)
  • Customer data — Reprocessing the transaction dataframe to create a consolidated customer dataframe (Refer to customer segmentation analysis section below)
  • Model data — Reprocessing the transaction dataframe to determine if a customer is affected by a BOGO or a discount offer (Refer to Implementation section below)

Cleaning the 3 datasets was the first step in preparing the data for the analysis. I created 3 functions to clean the 3 datasets to ensure I could reuse them in the future for further analysis.

Please find a snippet below which gives an overview of the steps I took in cleaning the transcript file:

Fig 2: Cleaning of the Transcript Data Set

Here are the steps I took to clean it and prepare it for analysis:

  1. I renamed the person column to customer_id. I did the same in the other datasets so I could merge them later on the customer_id column
  2. I created dummies for the event column and filled them with 0s and 1s depending upon the event
  3. I split the value column into offer_id and txn_amount columns. This would later help me group customers by offer_ids and sum the transaction amount for multiple transactions by a user
  4. In the end, I deleted the event and value columns since I no longer needed them
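The four steps above can be sketched roughly as follows. This is not the function from the notebook: the name clean_transcript, the sample rows, and the dictionary keys inside the value column ("offer id", "offer_id", "amount") are assumptions made for illustration.

```python
import pandas as pd


def clean_transcript(transcript):
    """Sketch of the four cleaning steps on a transcript-like dataframe."""
    # Step 1: rename person to customer_id so it matches the other datasets
    df = transcript.rename(columns={"person": "customer_id"})

    # Step 2: one 0/1 indicator column per event type
    dummies = pd.get_dummies(df["event"]).astype(int)
    dummies.columns = [c.replace(" ", "_") for c in dummies.columns]
    df = pd.concat([df, dummies], axis=1)

    # Step 3: pull offer_id and txn_amount out of the value dictionaries
    # (assumed keys: "offer id" or "offer_id" for offers, "amount" for transactions)
    df["offer_id"] = df["value"].apply(lambda v: v.get("offer id", v.get("offer_id")))
    df["txn_amount"] = df["value"].apply(lambda v: v.get("amount", 0.0))

    # Step 4: drop the columns that are no longer needed
    return df.drop(columns=["event", "value"])


# Tiny made-up sample in the assumed raw shape
sample = pd.DataFrame({
    "person": ["a1", "a1", "b2"],
    "event": ["offer received", "transaction", "offer completed"],
    "value": [{"offer id": "x9"}, {"amount": 12.5}, {"offer_id": "x9", "reward": 5}],
    "time": [0, 6, 12],
})
cleaned = clean_transcript(sample)
print(cleaned)
```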

Below are the before and after pics of the transcript dataset

Fig 3: Transcript Dataset before cleaning
Fig 4: Transcript Dataset after cleaning

For the detailed cleaning of the other datasets, please refer to the Jupyter notebook in the Github repository.

2. Data Merging: This was one of the most satisfying moments after spending hours on cleaning. I cleaned the datasets in a way that I could merge them later on the customer_id column.

  • I first merged the transcript and the profile dataset to match the transaction data against their customer profiles
  • Second merge was a combination of above and the portfolio dataset so we could match the transaction data against their offer_ids
  • I also simplified the offer_ids and gave them simpler names so it’s easier for us to relate to them.
  • Lastly, the profile dataframe has fewer customer ids than the transcript dataframe, so the last step after the merge was to remove the rows with NaN values.
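A rough sketch of the merge sequence described above, using tiny made-up frames. The column names, the offer ids and the short-name mapping are placeholders, and restricting dropna to the profile columns is my assumption about how the missing profiles were removed.

```python
import pandas as pd

# Made-up stand-ins for the three cleaned datasets
transcript = pd.DataFrame({
    "customer_id": ["a1", "a1", "b2", "c3"],
    "offer_id": ["x9", None, "y7", "x9"],
    "txn_amount": [0.0, 12.5, 0.0, 0.0],
})
profile = pd.DataFrame({
    "customer_id": ["a1", "b2"],  # c3 has no profile row
    "gender": ["F", "M"],
    "age": [54, 33],
    "income": [72000.0, 48000.0],
})
portfolio = pd.DataFrame({
    "offer_id": ["x9", "y7"],
    "offer_type": ["bogo", "discount"],
    "difficulty": [10, 7],
})

# Merge 1: attach the customer profile to every event
merged = transcript.merge(profile, on="customer_id", how="left")

# Merge 2: attach the offer attributes to every offer event
merged = merged.merge(portfolio, on="offer_id", how="left")

# Replace the raw offer ids with short tags (hypothetical mapping)
short_names = {"x9": "B1", "y7": "D1"}
merged["offer_id"] = merged["offer_id"].map(short_names)

# Drop events whose customer has no profile data
merged = merged.dropna(subset=["gender", "age", "income"])
print(merged)
```

Using left joins keeps every event until the final dropna, which makes it easy to see how many rows are lost because of missing profiles.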

Implementation

This part of the process involves 3 main steps:

  1. Data Prep:

I created 2 functions to transform the transaction dataframe I created by cleaning and merging the three original data sets

a. ml_model_prep function: Through this function, I categorized the offers into BOGO and discount based on the type of offer and then grouped the dataframe by customer_id and offer_type.

Fig 5: Model Data Prep — Function 1

b. ml_model_data function: Through this function, I went one step further and created columns which determined if a particular customer would respond to a BOGO or a discount offer. For the detailed analysis, please refer to the Jupyter notebook since the function is too long to post here.

2. Splitting the Data into Training and Test Sets

To start with, I selected the relevant columns from the model dataframe created from above. Thereafter, I split the data into X (input) and y (output) datasets and then used the train_test_split function from the sklearn library to split the dataset into training and test splits.

I created separate output datasets for BOGO and discount offers so I could analyze them individually in the model.
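A minimal sketch of this split, assuming a model dataframe with three demographic inputs and one 0/1 response column per offer type. The column names and the randomly generated rows are placeholders, not the project's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random placeholder data in the assumed shape of the model dataframe
rng = np.random.default_rng(0)
n = 200
model_df = pd.DataFrame({
    "age": rng.integers(18, 90, n),
    "income": rng.integers(30_000, 120_000, n),
    "gender_F": rng.integers(0, 2, n),
    "bogo_response": rng.integers(0, 2, n),
    "discount_response": rng.integers(0, 2, n),
})

X = model_df[["age", "income", "gender_F"]]
y_bogo = model_df["bogo_response"]
y_discount = model_df["discount_response"]

# Passing both targets in one call keeps the rows aligned across all splits
X_train, X_test, yb_train, yb_test, yd_train, yd_test = train_test_split(
    X, y_bogo, y_discount, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)
```

Splitting both output columns in a single call means the BOGO and discount models are evaluated on exactly the same test customers.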

3. Data Modeling

As a first iteration of the model, I decided to use the KNeighborsClassifier() model from the sklearn library. I trained the model on the training set and then used it to predict the outcome of the test set. Then, I used the classification_report function from sklearn to output the metrics I discussed above.

Fig 6: Classification Report Output — KNN

For the BOGO model and the discount model, f-1 score, precision and recall are higher for 0 which means they’re higher when it comes to predicting if a customer will not respond to a particular offer.
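The first-iteration workflow described above can be sketched as follows. The data here is synthetic (generated with make_classification, with an imbalanced class split chosen by me), so the numbers it prints have nothing to do with the results in Fig 6.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the model data: class 0 = no response, class 1 = response
X, y = make_classification(
    n_samples=400, n_features=4, weights=[0.7, 0.3], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit KNN with its default settings and predict the held-out set
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Per-class precision, recall and F-1 score
report = classification_report(y_test, y_pred)
print(report)
```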

Refinement

To refine the results further, I decided to do the following:

  1. Perform GridSearchCV with the KNeighborsClassifier Model
Fig 7: GridSearchCV with KNN

From Fig 7, we can see that the metrics have improved marginally for the situation where the model predicts if a customer will not respond to an offer. However, it’s almost the same for the situation where the model predicts if a customer will respond to an offer.

Fig 8: GridSearchCV with KNN — best parameters

In order to find the best parameters from the GridSearchCV, I used the best_params_ attribute, which yielded the same n_neighbors for BOGO and discount but with different weights.
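For reference, a grid search of this kind might look like the sketch below. The parameter grid is my own guess at a reasonable search space and the data is synthetic, so the best parameters it prints are not the ones reported in Fig 8.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic placeholder data
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

# Hypothetical search space over neighbor count and weighting scheme
param_grid = {
    "n_neighbors": [3, 5, 7, 9, 11],
    "weights": ["uniform", "distance"],
}

# 5-fold cross-validation scored on classification accuracy
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Running one search per target (BOGO and discount) is what allows the two models to end up with different weights.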

2. I also trained an SVC model to see if it yielded better results

Fig 9: SVC model

From Fig 9, we can see that the metrics have improved considerably for the situation where the model predicts if a customer will not respond to an offer. However, they are almost the same, if not lower, for the situation where the model predicts if a customer will respond to an offer.

Analysis

Data Exploration

For detailed understanding of the schema, please refer to the README present in this Github repository.

For each of the datasets, I started off the cleaning exercise by checking the following:

  • Shape of the dataset
  • Description of the columns
  • Understanding the data types of columns and changing them as necessary
  • Checking for non-nulls and removing rows as necessary
  • Identifying outliers and removing them as necessary
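In pandas, most of these checks are one-liners. The sketch below runs them on a small made-up profile-like frame; the column names and values are placeholders rather than the real schema.

```python
import pandas as pd

# Made-up rows in a profile-like shape
profile = pd.DataFrame({
    "customer_id": ["a1", "b2", "c3", "d4"],
    "gender": ["F", "M", None, "F"],
    "age": [54, 33, 41, 29],
    "income": [72000.0, 48000.0, None, 61000.0],
    "became_member_on": ["20170715", "20180102", "20160930", "20171120"],
})

print(profile.shape)         # shape of the dataset
print(profile.dtypes)        # data type of each column
print(profile.isna().sum())  # null count per column

# Change a data type where needed: date strings to datetimes
profile["became_member_on"] = pd.to_datetime(
    profile["became_member_on"], format="%Y%m%d"
)

# Remove rows with missing values
clean = profile.dropna()

# Summary statistics help spot outliers in numeric columns
print(clean["income"].describe())
```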

This exercise formed the basis for the cleaning functions. What also helped me was thinking about the end-state dataframe that would let me answer the questions I had for this project.

For a detailed exploratory analysis and cleaning thereafter, please refer to the Jupyter notebook in the Github repository

Initial Experiment Overview

Before I dive deep into the offer and the customer profile analysis, I want to give a brief overview of the experiment:

  • The offer data is roughly over a period of 1 month (~29.75 days)
  • The cleaned and merged dataset contains 14,825 unique customers
  • The overall offer success rate is roughly 49%, which seems low

To start off, I wanted to see how customers received the different types of offers and how each offer performed. For simplicity, I replaced the offer ids with simpler tags. All offers starting with B are BOGO (Buy One Get One), D are Discount and I are Informational offers.

Data Visualization

Fig 10: Count of offers received by customers
Fig 11: View and Completion Ratios of Offers

From the graphs above, we can see that the offers were generated uniformly. D2, D3, B3 and B4 were the most successful offers, with a view ratio of 0.96 and a completion ratio greater than 0.6. Below are the attributes that made these offers more successful than others:

  • Lower difficulty (required spend)
  • Higher reward to difficulty (spend) ratio
  • Longer duration

1. Impact of Gender, Age and Income on Offer Response

For the purpose of the analysis below, I wanted to limit my scope and focus on the revenue-generating offers. As a result, I removed the I1 and I2 offers since they were informational and couldn’t actually be completed by the customer. I also removed gender = “O” (Other) because it accounted for only 1.5% of the dataset. After the removal, the offer success rate jumped to 61% (from 49% above). This figure is more accurate because the initial rate included the I1 and I2 offers, which can never be completed.
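The effect of this filtering on the success rate can be illustrated with a toy example. I am assuming here that the success rate is offers completed divided by offers received; the frame below is made up and does not reproduce the 49% or 61% figures.

```python
import pandas as pd

# Made-up offer-level rows: one row per offer received
offers = pd.DataFrame({
    "offer_id": ["B1", "D2", "I1", "I2", "B3", "D3"],
    "gender": ["F", "M", "F", "M", "O", "M"],
    "offer_received": [1, 1, 1, 1, 1, 1],
    "offer_completed": [1, 0, 0, 0, 1, 1],
})


def success_rate(df):
    """Completed offers divided by received offers (assumed definition)."""
    return df["offer_completed"].sum() / df["offer_received"].sum()


overall = success_rate(offers)

# Remove informational offers and the gender = "O" rows, as described above
mask = ~offers["offer_id"].isin(["I1", "I2"]) & (offers["gender"] != "O")
filtered = success_rate(offers[mask])

print(round(overall, 2), round(filtered, 2))
```

Informational offers add to the denominator but can never add to the numerator, which is why removing them raises the rate.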

Fig 12: Offers Received by Income and Gender
Fig 13: Offer Completion Rate by Income and Gender

As you can see from Fig 13, females have a higher completion ratio than males. The offer completion % also increases as the income level increases. However, we can’t conclude that an increase in income causes the increase in completion %, because the higher income levels received far fewer offers than the lower income levels. This might be because there are fewer customers in the higher income brackets, as you can see in Fig 12. Starbucks needs to ensure that the higher income brackets receive more offers per customer than they do currently.

Fig 14: Offers Received by Age and Gender
Fig 15: Offer Completion % by Age and Gender

Similarly, with respect to age and gender, females and males between ages 48–68 received the most offers, and their offer completion rate is also one of the highest. We can also see that the offer completion ratio for customers between 98–108 years is the highest, but I would again caution against relying on this stat because we have very little data for this age segment.

2. Customer Segmentation Analysis

To answer the next few questions, I created a consolidated dataframe for all the customers with the following attributes:

  • Demographic Features: Gender, Age, Income, Membership Date
  • Offer Features: Offers Received, Offers Completed, Transactions, Transaction Amount, Rewards, Difficulty (Total Offer Spend)

I created multiple helper dataframes based on the offer completion condition, using the groupby and agg functions. Since I wanted customer-level data, I grouped them by the customer_id column.

Fig 16: Creation of the Customer Data — Merge 1
Fig 17: Creation of the Customer Data — Merge 2

I decided to use the transaction amount and the offer difficulty columns to determine if an offer had an impact on the customer. I created a new column called txn_dif ratio which was a ratio of the transaction amount and the offer difficulty

Let me explain this through an illustration:

  • Situation 1: A Starbucks customer completes all the offers but still spends much more than the difficulty (minimum offer spend). The customer is really responsive to the offer, but their spend is not really affected by it, so even if Starbucks doesn’t send the offer, the customer will still spend approximately the same amount. In this situation, Starbucks can save some money by sending fewer offers (while still maintaining loyalty).
  • Situation 2: A Starbucks customer makes very few transactions and does not even spend equal to the offer minimum so never completes the offer. In this case as well, Starbucks can probably tweak its offering to provide offers with a lower offer minimum.

I used the offer minimum (difficulty) and the transaction amount column to determine the impact of the offer. Here’s the approach I took:

  • High Offer Impact: transaction amount to offer ratio is less than 2 (8% of customers)
  • Medium Offer Impact: transaction amount to offer ratio is greater than 2 and less than 10 (63% of customers)
  • Low Offer Impact: transaction amount to offer ratio is greater than 10 (29% of customers)

Situations 1 and 2 fall under the low offer impact category.
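The three thresholds listed above translate directly into a small helper. This sketch follows them literally; the function name is mine, and how ratios of exactly 2 or 10 are assigned is an assumption, since the cutoffs above don't say which side they fall on.

```python
def offer_impact(txn_amount, difficulty):
    """Classify offer impact from the transaction amount to difficulty ratio.

    Assumes difficulty > 0. Boundary handling (exactly 2 or 10) is a guess.
    """
    ratio = txn_amount / difficulty
    if ratio < 2:
        return "high"
    if ratio < 10:
        return "medium"
    return "low"


# Illustrative customers: (total transaction amount, total offer difficulty)
for spend, difficulty in [(15, 10), (50, 10), (250, 10)]:
    print(spend, difficulty, offer_impact(spend, difficulty))
```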

Fig 18: Income Category and Offer Impact
Fig 19: Age Category and Offer Impact

Since the majority of the customers fall under the medium category, the bar graphs in Fig 18 and Fig 19 look skewed towards the medium category. However, two things are worth noting:

  • Customers with higher income (>$84,000) are mainly concentrated in the medium and low impact category
  • Customers between 17 and 30 years of age are also mainly concentrated in the medium and low impact category

The offers will have the least impact on these customer segments.

3. Top Offers for Each Customer Segment

To recommend the top 2 offers for each segment, I first divided the customer data frame into different segments by gender, age and income. There were a total of 44 segments. I analyzed the offer behavior of those segments and recommended the top 2 offers used by that particular segment. Below is a snippet of what the recommendation table looked like:

Fig 20: Offer Recommendation Table — Females
Fig 21: Offer Recommendation Table — Males
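One way to build such a recommendation table is to count completions per segment and offer, then keep the two highest counts in each segment. The sketch below does this on made-up rows with only two segment columns; the column names are placeholders, and ranking by completion count is my assumption about what "top" means here.

```python
import pandas as pd

# Made-up offer rows with segment labels
completed = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M", "M"],
    "income_group": ["high", "high", "high", "high", "low", "low", "low"],
    "offer_id": ["B1", "B1", "D2", "D3", "D2", "D2", "D3"],
    "offer_completed": [1, 1, 1, 0, 1, 1, 1],
})

# Completions per (segment, offer)
counts = (
    completed.groupby(["gender", "income_group", "offer_id"])["offer_completed"]
    .sum()
    .reset_index()
)

# Keep the two offers with the most completions in each segment
top2 = (
    counts.sort_values("offer_completed", ascending=False, kind="stable")
    .groupby(["gender", "income_group"])
    .head(2)
)
print(top2)
```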

D2 and D3 were the most popular offers in the discount category and B1 and B3 were the most popular offers in the BOGO category. BOGO offers have a higher impact on Females than Males.

4. Offer Response ML Model Analysis

Model Evaluation

To determine the offer response from a customer, we ran 3 different models:

  1. KNeighborsClassifier (KNN)
  2. KNN with GridSearchCV
  3. SVC

I also tuned my parameters using GridSearchCV, which ran a 5-fold cross-validation on a KNN model using classification accuracy as the evaluation metric. Please refer to Fig 22 to see which parameters yielded the best results.

Fig 22: Hyperparameter Tuning

The best parameters were n_neighbors: 11 and weights: distance. However, SVC yielded even better results than KNN with GridSearchCV, as you can see from the classification report in the Implementation/Refinement section of the blog.

Hence, using the SVC model gives a fairly robust solution to the problem of identifying the right type of offer.

Justification

To put my model into action, I created a function which will take the customer info (Demographic features) as inputs and predict if the customer will respond to a BOGO or a discount offer. In this function, I used the trained SVC model to predict the outcome
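A function of this kind could look roughly like the sketch below. It is not the notebook's function: the name predict_offer_response, the three input features, the random training data and the scaling step are all assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Random placeholder training data: age, income, gender flag (1 = female)
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(18, 90, 300),
    rng.integers(30_000, 120_000, 300),
    rng.integers(0, 2, 300),
])
y_bogo = rng.integers(0, 2, 300)
y_discount = rng.integers(0, 2, 300)

# One SVC per offer type; scaling is added here because SVC is scale-sensitive
bogo_model = make_pipeline(StandardScaler(), SVC()).fit(X, y_bogo)
discount_model = make_pipeline(StandardScaler(), SVC()).fit(X, y_discount)


def predict_offer_response(age, income, is_female):
    """Return a 0/1 prediction per offer type for one customer."""
    features = np.array([[age, income, int(is_female)]])
    return {
        "bogo": int(bogo_model.predict(features)[0]),
        "discount": int(discount_model.predict(features)[0]),
    }


result = predict_offer_response(age=45, income=65_000, is_female=True)
print(result)
```

Because the training labels here are random, the printed prediction is meaningless; the point is only the shape of the interface.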

This was actually one of the most satisfying functions because it basically put this entire project in a nutshell. I had a sigh of relief when the function spit out the results.

Through EDA and the different machine learning models, I was able to answer the questions I had initially set out with. The first 3 questions were fairly straightforward and I was able to identify patterns in the dataset. However, I wish we had more metrics we could use for a more accurate prediction.

I used the SVC model for my last question to create a function which would determine the type of offer a customer would respond to.

Results

1. Impact of Gender, Age and Income

  • Females have a higher completion ratio than males.
  • The offer completion % also increases as the income level increases
  • We can’t conclude that an increase in income causes the increase in completion %, because the higher income levels received far fewer offers than the lower income levels
  • This might be because there are fewer customers in the higher income brackets. Starbucks needs to ensure that the higher income brackets receive more offers per customer than they do currently
  • Similarly, with respect to age and gender, females and males between ages 48–68 received the most offers, and their offer completion rate is also one of the highest
  • We can also see that the offer completion ratio for customers between 98–108 years is the highest, but I would again caution against relying on this stat because we have very little data for this age segment

2. Demographic Least Affected by Offers

  • Customers with higher income (>$84,000) are mainly concentrated in the medium and low impact category
  • Customers between 17 and 30 years of age are also mainly concentrated in the medium and low impact category

3. Top 2 Offers: Please refer to the tables above or the Jupyter notebook for the detailed recommendation

4. ML Model: Please refer to the Analysis/Implementation/Refinement sections for results

Conclusion

Reflection

Overall, this was a great, challenging capstone project to test my knowledge from the last 5 months on the course. If I look back at where I started, I would never have imagined being at this place 5 months down the line.

One of the interesting and difficult aspects of this project was that it was open-ended and up to me as to how I analyzed and published my recommendation. It was a little daunting when I started, but as I got into the problem, it became easier to connect the dots and arrive at the results. I was able to answer the questions I had in the beginning. It was also very relatable because in the real world, problems are open-ended and it’s up to the data scientist to figure out how to derive the insights and showcase them to the audience.

Improvement

  1. As mentioned above, I want a better accuracy when the model predicts that a customer will respond to/complete a BOGO/discount offer
  2. I would also like to understand how Starbucks can drive offer completion by achieving higher view percentages for offers. This would indirectly help me with 1.
  3. I would also like to find out the time it would take approximately to complete an offer after the customer receives it
  4. In order to improve my accuracy further, I want to understand why the metrics were poor when the model predicted the offer a customer would respond to. I would like to test other models as well as perform GridSearchCV with the SVC model to identify the best parameters. I could also increase the number of cv folds and add more parameters to increase the accuracy of the model

For a detailed analysis, please refer to my Github repository.
