Author: Coskun Erden
Date: 11/27/2024
This project focuses on the Starbucks Capstone Challenge, which simulates customer purchasing decisions influenced by promotional offers. The challenge explores how machine learning models can predict offer completion rates and provide actionable strategies to optimize marketing performance.
Promotional offers are a cornerstone of marketing strategies, designed to boost customer engagement and increase sales. However, their effectiveness often depends on how well they resonate with customers. This project aims to optimize these campaigns by identifying which offers work best for different customer groups and developing a data-driven approach to improve offer completion rates.
The analysis relies on three key datasets:
The `transcript` dataset forms a critical component of the analysis by documenting customer interactions with promotional offers. It contains 306,534 entries and four key columns:
- `person`: a unique customer ID.
- `event`: the type of interaction, such as `offer received`, `offer viewed`, or `transaction`.
- `value`: additional details stored as a dictionary; for instance, the offer ID for promotional events or the transaction amount for purchases.
- `time`: the time of the interaction, in hours since the start of the test.

Notably, the dataset is complete, with no missing values, ensuring consistency in the analysis.
This dataset provides a detailed timeline of customer behavior, offering insights into how customers engage with promotions and make transactions. By analyzing the `transcript`, we can uncover trends in offer effectiveness and customer preferences.
Here’s a snapshot of the dataset for better understanding:
| person | event | value | time |
|---|---|---|---|
| 78afa995795e4d85 | offer received | {"offer_id": "ae264e3637204a6fb9bb56bc8210ddfd"} | 0 |
| 78afa995795e4d85 | offer viewed | {"offer_id": "ae264e3637204a6fb9bb56bc8210ddfd"} | 6 |
| 78afa995795e4d85 | transaction | {"amount": 9.64} | 12 |
The `value` column is particularly noteworthy for its nested structure, which requires parsing to extract meaningful details for further analysis.
Observations
Distribution of Events
- Transactions are the most common event, with 138,953 occurrences, indicating frequent customer purchases or monetary interactions.
- Offer received events follow with 76,277 occurrences, suggesting widespread distribution of promotional offers.
- Offer viewed and offer completed events are less frequent (57,725 and 33,579, respectively), indicating a gap between receiving offers and acting on them.
Time Analysis
The `time` column has:

- Mean: 366 hours
- Median: 408 hours
- Range: 0 to 714 hours
A box plot of time (below) shows no extreme outliers, with data evenly distributed over the 714-hour range.
Event Timing Distribution
The histogram (below) reveals:
Peaks in offer received events occur at regular intervals, suggesting scheduled distributions.
Offer viewed and offer completed events are spread more consistently, indicating customer engagement over time.
Box Plot by Event
The box plot (below) indicates:
Offer received and offer viewed events have similar time ranges.
Transactions and offer completed events show slightly broader distributions, reflecting more variability in customer responses.
By breaking down and preprocessing this dataset, we can generate valuable insights into customer engagement patterns.
The `portfolio` dataset contains 10 unique promotional offers, each with details on the reward amount, distribution channels, difficulty level, duration, offer type, and an identifier.

- Reward values take one of five levels: `10`, `5`, `3`, `2`, and `0`.
- Difficulty values take one of five levels: `20`, `10`, `7`, `5`, and `0`.
- Durations range from `3` to `10` days, with a mean of `6.5`.

The `portfolio.describe()` output provides the following statistical summary:
| Metric | Reward | Difficulty | Duration |
|---|---|---|---|
| Count | 10 | 10 | 10 |
| Mean | 4.2 | 7.7 | 6.5 |
| Std Dev | 3.5 | 5.8 | 2.3 |
| Min | 0 | 0 | 3 |
| 25% | 0 | 5 | 5 |
| 50% | 4 | 8.5 | 7 |
| 75% | 5 | 10 | 7 |
| Max | 10 | 20 | 10 |
The `profile` dataset provides demographic and membership information for 17,000 customers. It contains five columns that help analyze customer characteristics and their potential influence on marketing strategies. Below is a summary of the dataset:
- `gender`: Gender of the customer.
- `age`: Age of the customer.
- `id`: Unique identifier for each customer.
- `became_member_on`: Date when the customer joined, formatted as `YYYYMMDD`.
- `income`: Annual income of the customer (in dollars).
Missing values are confined to `gender` and `income`.

- A box plot of `age` shows a median of 58 years, with an interquartile range of 45 to 73 years.
- The distribution of `income` shows most customers earn between $50,000 and $75,000, with a few high-income outliers.
- Comparing `income` across age categories reveals that Mid Career customers tend to have the highest earnings.
The violin plot provides a detailed visualization of income distribution across different age categories in the `profile` dataset. Each violin represents an age group, with the width of the plot indicating the density of income values within that group. Median incomes are marked by white dots inside the violins, and the spread of the data is captured by the shape.
Insights from the Violin Plot

Future analysis can leverage the `became_member_on` column to identify periods of high customer acquisition. The `profile` dataset provides foundational insights into customer demographics, enabling the development of targeted marketing strategies and personalized campaigns.
The preprocessing and feature engineering of the `transcript` dataset began with transforming the `value` column, which initially contained unstructured dictionary data, into meaningful and interpretable features:
- `offer_id` column: `offer_id` was extracted from the `value` column. This unique identifier enables tracking of specific offers across user interactions such as receiving, viewing, and completing offers.
- `amount` column: transaction amounts were extracted into a dedicated `amount` column. This feature provides monetary details tied specifically to transaction-related events.
- `reward` column: rewards earned were extracted into a `reward` column. This feature quantifies the benefits users receive from engaging with offers.

The newly created columns (`amount` and `reward`) contain missing values (`NaN`) for events not directly related to transactions or rewards (e.g., offer viewing or receiving). These missing values are handled later to keep the features robust and relevant for further analysis.
This step improved the interpretability of the dataset by splitting the dictionary-based `value` column into structured components that can be analyzed directly.
By creating the `offer_id`, `amount`, and `reward` columns, the dataset became more granular and interpretable. This structured representation of user interactions allows for deeper insights into user behavior and offer performance in subsequent analysis and modeling stages.
The preprocessing journey continued with an exploration of the `amount` column, which displayed a significant skew and extreme outliers. A box plot highlighted these anomalies, showing that a few large transaction amounts disproportionately influenced the distribution (Figure 1). To address this, winsorization was applied, capping values at the 95th percentile. This adjustment brought the data into a more balanced range while preserving the overall shape of the distribution. A post-winsorization box plot (Figure 2) and histogram (Figure 3) confirmed that the adjustment effectively reduced the impact of outliers.
Figure 1: Box Plot of the `amount` Column (Before Winsorization)

Figure 2: Box Plot of the `amount` Column (After Winsorization)

Figure 3: Histogram of the `amount` Column (After Winsorization)
The relationships between key features were then explored using scatter plots with regression lines. For instance, a positive but variable trend was observed between age and income (Figure 4), where income generally increased with age. However, when examining membership duration against both age (Figure 5) and income (Figure 6), no significant patterns emerged, suggesting that membership duration is relatively stable across these variables. These findings provide important insights into how customer attributes interact with one another.
Figure 4: Scatter Plot of Age vs. Income with Regression Line
Figure 5: Scatter Plot of Membership Duration vs. Age with Regression Line
Figure 6: Scatter Plot of Membership Duration vs. Income with Regression Line
A bar plot was created to visualize the distribution of offer types across different age categories (Figure 7). The analysis revealed that “BOGO” and “Discount” offers were the most popular across all age groups, with higher counts observed in the “mid-career” and “approaching retirement” segments. On the other hand, “Informational” offers were less frequent but evenly distributed across various life stages, highlighting their selective use.
Figure 7: Distribution of Offer Types Across Age Categories
A correlation heatmap of numerical features (Figure 8) was then generated to identify significant relationships. It revealed a strong positive correlation between “difficulty” and “duration,” indicating that more challenging offers tend to have longer durations. A moderate negative correlation was also observed between “reward earned” and “duration,” suggesting that shorter offers might lead to higher rewards.
Figure 8: Correlation Heatmap of Selected Numeric Variables
To address missing data, the dataset was analyzed for completeness. The `offer_type` column showed a clear breakdown: "BOGO" had 71,617 instances, "Discount" had 69,898, and "Informational" had 26,066. Missing values were identified and handled to ensure robustness in subsequent analyses.
The dataset was then segmented into two subsets: users who completed offers and earned rewards, and those who received offers but did not complete them. This segmentation provides a focused lens to examine the factors that influence offer completion. The completed subset contained 33,579 entries, while the incomplete subset had 272,955 entries, underscoring the challenge of driving higher offer engagement.
New features were engineered to enhance the dataset's utility. Binary columns were created to indicate the presence of specific engagement channels such as "web," "email," "mobile," and "social," enabling a detailed analysis of how different channels influence user behavior. Additionally, a binary target variable, `offer_completed`, was introduced, set to 1 for completed offers and 0 for all other events. Descriptive statistics for this variable revealed that only 10.95% of offers were completed, providing a baseline for predictive modeling.
In this section, I systematically addressed missing values in the dataset to ensure the data's completeness and integrity for analysis and modeling. First, the pattern of missingness was assessed: columns such as `amount`, `reward_earned`, and `offer_id` had substantial missing values because they are relevant only to specific events (e.g., transactions or rewards).

Figure 1: Heatmap of Missing Values
- In the `gender` column, missing entries were filled with the value `"Unknown"`, ensuring categorical integrity and enabling the inclusion of these entries in demographic analyses.
- Missing `offer_id` and `offer_type` values were likewise assigned `"Unknown"` to differentiate missing data from valid entries.
- For `amount` and `reward_earned`, missing values were replaced with `0`. This ensured that non-applicable entries (e.g., no transaction or no reward) were accurately represented, avoiding distortions in the analysis.
- `reward_offer`, `difficulty`, and `duration` were also filled with `0` to handle instances where offers were not completed or applicable, ensuring consistency.
- The `age` column was imputed with the median value, preserving the dataset's demographic information, and a flag column (`age_imputed`) was created to indicate whether an age value was imputed.
- The `age_cat` column was recreated from the imputed `age` values, categorizing individuals into meaningful demographic segments (e.g., young adult, early career).

Figure 2: Age Categorization
- A cleaned dataset (`df_nonull`) was prepared by dropping rows with missing values in critical columns (`age`, `income`, `age_cat`). This dataset ensures completeness for models requiring clean inputs.
- Finally, the columns were reordered to place `income`, `time`, and `age` at the end of the DataFrame, improving usability during subsequent analysis.

Figure 3: Reordered Dataset
In the `df_nonull` dataset, the heatmap illustrates the correlation between several key continuous features:
- `difficulty`, `duration`, and `no_channels` exhibit strong positive correlations with each other. This suggests that offers with higher difficulty are often associated with longer durations and require more engagement channels for completion. These features are likely interconnected in the design of more complex or demanding offers.
- The `reward_offer` variable shows a moderate correlation with `difficulty`, `duration`, and `no_channels`. This indicates that as the reward amount increases, offers tend to be more challenging, involve more channels, or run longer, which aligns with the idea that higher rewards may incentivize more effort from users.
- Other features, such as `membership_duration` and `time`, show low or negligible correlations with most variables. This implies that these features may operate independently, without strong interdependencies with offer-related attributes.
- For modeling, the strongly correlated features (`difficulty`, `duration`, `no_channels`) may necessitate careful handling, such as dimensionality reduction, to prevent multicollinearity issues. On the other hand, independent variables like `membership_duration` could contribute unique explanatory power to models predicting user behavior or offer completion.

This correlation analysis serves as a foundation for understanding the interplay between offer characteristics and user engagement, guiding further exploration and modeling efforts.
- Numerical features included `time`, `duration`, `difficulty`, and `income`. These represent values critical for capturing quantitative relationships in the data.
- Categorical features such as `offer_type`, age categories (`age_cat`), `gender`, and channel indicators (`web`, `email`, `mobile`, and `social`) were included to capture qualitative aspects of the dataset.
- The categorical features `gender`, `age_cat`, and `offer_type` were one-hot encoded, resulting in 24 features after encoding.
- The numerical features (`income`, `time`, `age`, `reward_offer`, `difficulty`, `duration`, `no_channels`, and `membership_duration`) were standardized using a `StandardScaler`.

A logistic regression model was fit using the SAGA solver with a maximum of 500 iterations, with a random state of 123 for reproducibility. The model was then trained on the prepared training dataset.
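A minimal, self-contained sketch of this pipeline (one-hot encoding plus `StandardScaler` feeding a SAGA logistic regression) is shown below. The feature names, the ~11% positive rate, and the solver settings follow the text; the synthetic data and the reduced feature list are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(123)
n = 400
df = pd.DataFrame({
    "income": rng.normal(65000, 15000, n),
    "time": rng.uniform(0, 714, n),
    "difficulty": rng.choice([0, 5, 7, 10, 20], n),
    "gender": rng.choice(["M", "F", "O"], n),
    "offer_type": rng.choice(["bogo", "discount", "informational"], n),
})
y = (rng.random(n) < 0.11).astype(int)  # ~11% completion rate, as in the data

pre = ColumnTransformer([
    ("encoder", OneHotEncoder(handle_unknown="ignore"), ["gender", "offer_type"]),
    ("scaler", StandardScaler(), ["income", "time", "difficulty"]),
])
model = Pipeline([
    ("pre", pre),
    ("clf", LogisticRegression(solver="saga", max_iter=500, random_state=123)),
])

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=123)
model.fit(X_train, y_train)
preds = model.predict(X_test)
```

Wrapping the preprocessing in a `Pipeline` ensures the scaler and encoder are fit only on training folds, avoiding leakage during later hyperparameter searches.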
This confusion matrix demonstrates the model’s difficulty in identifying completed offers (positive class) while maintaining high accuracy for the incomplete offers (negative class).
Best parameters from RandomizedSearchCV:

- Solver: `lbfgs`
- Penalty: `l2`

The tuned model achieved a best F1-score of 0.41, significantly improving minority-class performance while maintaining balanced class weights.
This analysis highlights the preprocessing steps and key results, emphasizing the challenges posed by class imbalance and the need for further refinement to enhance minority class prediction.
The dataset was divided into training (80%) and testing (20%) subsets to ensure the model’s generalization capability. This approach allows for evaluating the model on unseen data while maintaining a large enough training set for learning patterns effectively. The random seed ensured reproducibility of results.
The second model was a Random Forest, a robust ensemble learning algorithm designed to improve predictive accuracy and control overfitting. To optimize the model, a grid search was employed, tuning hyperparameters such as the number of trees (`n_estimators`), maximum tree depth (`max_depth`), minimum samples required for node splitting (`min_samples_split`), and the number of features considered for the best split (`max_features`). Furthermore, the `class_weight` parameter was set to `balanced` to adjust for the class imbalance in the dataset.
The optimized Random Forest model achieved the following metrics:
Overall, the Random Forest model provided a better balance in predicting completed and incomplete offers compared to logistic regression, showcasing its suitability for handling imbalanced datasets with complex features.
A Gradient Boosting model was the third classifier evaluated in this study. Gradient Boosting is a powerful ensemble learning algorithm that builds models sequentially, optimizing for errors made by prior models. This model is particularly effective for imbalanced datasets, as it incrementally improves predictions by minimizing the loss function.
The Gradient Boosting model was fine-tuned using RandomizedSearchCV to optimize its hyperparameters. The parameters considered for tuning included:

- Number of trees (`n_estimators`): sampled between 50 and 300 to balance model complexity and overfitting risk.
- Learning rate (`learning_rate`): sampled between 0.01 and 0.3, controlling the contribution of each tree.
- Maximum depth (`max_depth`): ranged between 3 and 10 to manage the complexity of individual trees.

The best parameters identified during cross-validation were:
These parameters were selected to optimize the F1-score, which balances precision and recall.
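The tuning setup above can be sketched as follows; the search ranges follow the text, while `n_iter` and the synthetic data are illustrative.

```python
import numpy as np
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(123)
X = rng.normal(size=(300, 5))
y = (X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=300) > 1.0).astype(int)

param_dist = {
    "n_estimators": randint(50, 300),      # integers from 50 to 299
    "learning_rate": uniform(0.01, 0.29),  # floats in [0.01, 0.30]
    "max_depth": randint(3, 11),           # integers from 3 to 10
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=123),
    param_dist,
    n_iter=5,
    scoring="f1",
    cv=3,
    random_state=123,
)
search.fit(X, y)
best_gb = search.best_estimator_
```

Randomized search samples the continuous and integer ranges directly, which is why it suits the wide `n_estimators` and `learning_rate` intervals better than an exhaustive grid.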
A Support Vector Machine (SVM) model was implemented using the RBF kernel for nonlinear decision boundaries, with a regularization parameter C = 1 and gamma set to 'scale' to handle the data distribution effectively. The model was trained on the preprocessed training dataset and evaluated on the test dataset.
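The configuration described above amounts to the following sketch; the nonlinear synthetic data stands in for the preprocessed features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(123)
X = rng.normal(size=(400, 4))
# Rare, nonlinearly separable positive class (points far from the origin).
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 4.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
svm = SVC(kernel="rbf", C=1, gamma="scale")
svm.fit(X_train, y_train)
acc = svm.score(X_test, y_test)
```

Note that `SVC` here uses no `class_weight`, mirroring the reported setup; that omission is exactly what produces the majority-class bias discussed next.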
Imbalanced Performance: The SVM’s inability to identify any instances of the minority class underscores its sensitivity to imbalanced datasets. While the overall accuracy is high, it is misleading as the model neglects the minority class entirely.
Bias Toward the Majority Class: The perfect recall and high precision for the majority class indicate a strong bias, which is common for models trained on imbalanced datasets without appropriate handling strategies.
Class Weight Adjustments: Incorporating class weights to penalize misclassification of the minority class can help shift the focus of the model.
Oversampling/Undersampling: Techniques such as SMOTE or undersampling the majority class can create a more balanced dataset for training.
Kernel and Parameter Optimization: Exploring other kernels, such as polynomial or linear kernels, combined with hyperparameter tuning, may improve the model’s ability to generalize across both classes.
Alternative Models: Ensemble methods like Random Forest or Gradient Boosting may better capture complex relationships and handle class imbalance more effectively.
This analysis demonstrates the limitations of the SVM model in addressing the class imbalance challenge, highlighting the need for further refinement to achieve balanced performance.
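Of the remedies listed above, random oversampling is the simplest to sketch. SMOTE itself lives in the third-party imblearn package, so the sketch below uses only scikit-learn's `resample`; the toy data is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

rng = np.random.default_rng(123)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
df["offer_completed"] = (rng.random(200) < 0.11).astype(int)  # ~11% minority

minority = df[df["offer_completed"] == 1]
majority = df[df["offer_completed"] == 0]

# Random oversampling: draw minority rows with replacement until classes match.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=123)
balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=123)
```

A balanced training set like this would be used only for fitting; evaluation must stay on the untouched, imbalanced test split.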
The XGBoost model, renowned for its efficiency in handling imbalanced datasets and complex patterns, was employed to further address the limitations observed in previous models. Hyperparameter tuning was conducted using RandomizedSearchCV, optimizing for the F1-score to ensure better balance between precision and recall for the minority class.
The best parameters identified through RandomizedSearchCV were:
These parameters provided a model tailored to address the dataset’s challenges, particularly class imbalance, ensuring both minority and majority classes were effectively represented.
### Model Comparison

The analysis highlights XGBoost's competitive performance on the minority class.

### Model Performance Summary Table
| Model | Precision (Minority Class) | Recall (Minority Class) | F1 Score (Minority Class) | Precision (Majority Class) | Recall (Majority Class) | F1 Score (Majority Class) | Overall Accuracy |
|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.26 | 0.99 | 0.41 | 0.88 | 0.62 | 0.73 | 0.67 |
| Random Forest | 0.32 | 0.66 | 0.44 | 0.95 | 0.81 | 0.88 | 0.80 |
| Gradient Boosting | 0.64 | 0.58 | 0.61 | 0.94 | 0.96 | 0.95 | 0.91 |
| Support Vector Machine | 1.00 | 0.00 | 0.00 | 0.88 | 1.00 | 0.94 | 0.88 |
| XGBoost | 0.49 | 0.91 | 0.64 | 0.99 | 0.87 | 0.93 | 0.88 |
This table provides a detailed breakdown of key performance metrics, comparing the models’ ability to handle both the minority and majority classes. The metrics highlight the strengths and weaknesses of each model in the context of this dataset.
- `remainder__time` (0.443): the most important feature, suggesting that the timing of offer presentations (e.g., time of day or day of the week) significantly impacts offer completion.
- `remainder__duration` (0.298): the duration of the offer plays a critical role, indicating that longer offers give customers more opportunities to act.
- `remainder__difficulty` (0.153): simpler offers are more likely to be completed, emphasizing the importance of designing accessible promotions.
- `remainder__reward_offer` (0.055): while rewards are important, they are less impactful than timing and duration.
- `remainder__membership_duration` (0.021): reflects customer loyalty as a potential driver of offer completion.
- `remainder__income` (0.006) and `remainder__social` (0.003): these factors show minimal influence, suggesting income levels and social channels may not strongly determine offer completion.
- Age-category encodings (e.g., `encoder__age_cat_early_career`, `senior`) and offer-type encodings (e.g., `bogo`, `discount`, `informational`) had nearly zero importance, indicating limited differentiation in their contribution to predictions.

Behavioral features such as `time` and `duration` directly influence the model's ability to predict offer completion. These insights can inform strategies for optimizing when and how offers are presented. Offer design features such as `difficulty` and `reward_offer` highlight areas where small changes could improve engagement rates.

This project provided valuable insights into predicting offer completion and optimizing promotional strategies using machine learning. By evaluating and comparing five models—Logistic Regression, Random Forest, Gradient Boosting, Support Vector Machine (SVM), and XGBoost—the analysis identified Gradient Boosting as the top-performing model, achieving the highest overall accuracy (91%) and a strong balance between precision and recall for both the majority and minority classes.
Key findings from the feature importance analysis revealed that behavioral factors, such as the timing and duration of offers, significantly influence customer engagement, while demographic variables like age and gender had minimal impact. These insights emphasize the importance of focusing on dynamic, behavior-driven targeting strategies rather than static demographic characteristics.
Several actionable recommendations emerged from this study, chief among them prioritizing the timing and duration of offers and simplifying offer requirements rather than targeting on static demographic attributes.
Despite its successes, the project highlighted challenges such as class imbalance, which affected minority class predictions. Future work could explore advanced sampling techniques (e.g., SMOTE) and ensemble methods to address this issue. Additionally, further hyperparameter tuning of XGBoost may enhance its performance, given its competitive recall for the minority class.