While learning about common data analysis tasks outside of HEP, I invariably stumbled upon A/B testing. The premise is simple enough: set up two test groups, one control and one experimental, and determine whether a metric in the experimental group is significantly different from the one in the control (this is the same idea as in Physics Lab 101, which I taught for two semesters at Johns Hopkins).

However, I’ve never done it before with anything resembling real marketing study data, so I grabbed a dataset off Kaggle and decided to give it a shot.

Setup

The dataset I chose on Kaggle is here.

I copy below the “scenario”:

A fast-food chain plans to add a new item to its menu. However, they are still undecided between three possible marketing campaigns for promoting the new product. In order to determine which promotion has the greatest effect on sales, the new item is introduced at locations in several randomly selected markets. A different promotion is used at each location, and the weekly sales of the new item are recorded for the first four weeks.

Here are the features of the dataset:

  • MarketID: unique identifier for market
  • MarketSize: size of market area by sales
  • LocationID: unique identifier for store location
  • AgeOfStore: age of store in years
  • Promotion: one of three promotions that were tested
  • Week: one of four weeks when the promotions were run
  • SalesInThousands: sales amount for a specific LocationID, Promotion, and week

Thoughts:

  • There will be an obvious dependence of SalesInThousands on MarketSize. I may want to normalize sales to market size, but the meaning of that is vague since MarketSize is a categorical variable (“Small”, “Medium”, “Large”). In practice that probably just means splitting the analysis by MarketSize.
  • The structure of the data is such that MarketA -> [LocationA, LocationB, …]. That may or may not matter.

Questions:

  • Does SalesInThousands have a dependence on AgeOfStore? Week?
  • Is SalesInThousands normally distributed for a given MarketSize? Market? Location?
  • Is the same promotion run for all weeks at one location? (Quick check shows “Yes!”)
  • Sales will be different each week even if the promotion is the same. Does that mean it’s a good idea to immediately average over the weeks?
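The quick check behind the third bullet can be done with a one-line groupby. A sketch, assuming the data is loaded into a pandas DataFrame (the file name is my guess at the Kaggle file, and the toy rows below are stand-ins for it):

```python
import pandas as pd

# In practice the real file would be loaded, e.g.:
# df = pd.read_csv("WA_Marketing-Campaign.csv")  # file name is an assumption
# Toy stand-in with the same columns:
df = pd.DataFrame({
    "LocationID": [1, 1, 1, 1, 2, 2, 2, 2],
    "Week":       [1, 2, 3, 4, 1, 2, 3, 4],
    "Promotion":  [3, 3, 3, 3, 1, 1, 1, 1],
})

# If the same promotion runs all four weeks at a location, each
# LocationID should map to exactly one distinct Promotion value.
promos_per_location = df.groupby("LocationID")["Promotion"].nunique()
print((promos_per_location == 1).all())  # True
```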

Answering initial questions

I first investigated the last bullet point - should I average sales per location over all weeks? This seemed like a good idea at first since it reduced the dataset in a logical way. However, I eventually realized that sales per location per week are largely independent of location and week. That is, the sales per week in a given market are normally distributed, and a specific location performing well or not seems to be random (or at least, there isn’t enough data to say otherwise). This will be shown with the scatter plots further down, but for now, I’ll treat each week as a separate measurement and won’t average over the four weeks.

Back to the first question then… Looking at the correlation matrix below, we can see that AgeOfStore does not have a strong (linear) correlation with SalesInThousands. The only feature with a strong correlation is MarketSize, as expected.
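For reference, the correlation matrix can be built with pandas. A minimal sketch using made-up numbers (the real analysis uses the full Kaggle frame), with MarketSize encoded ordinally so it can enter a Pearson correlation:

```python
import pandas as pd

# Made-up rows with the dataset's column names
df = pd.DataFrame({
    "MarketSize":       ["Small", "Small", "Medium", "Medium", "Large", "Large"],
    "AgeOfStore":       [4, 12, 1, 9, 25, 7],
    "Week":             [1, 2, 1, 2, 1, 2],
    "SalesInThousands": [33.7, 35.1, 45.0, 44.2, 90.2, 88.8],
})

# MarketSize is categorical, so map it to an ordinal scale before
# computing Pearson correlations
df["MarketSizeNum"] = df["MarketSize"].map({"Small": 0, "Medium": 1, "Large": 2})

corr = df[["MarketSizeNum", "AgeOfStore", "Week", "SalesInThousands"]].corr()
print(corr["SalesInThousands"].round(2))
```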

Examining the scatter matrix, we can see if there are any non-linear correlations that may be hidden by the lack of linear correlations.

There don’t seem to be any between SalesInThousands and AgeOfStore. However, the plot of SalesInThousands vs LocationID (bottom-left corner) actually shows a very nice breakdown of SalesInThousands over locations in markets of different sizes. The vertical bands of this plot also reveal that the hundreds digit of the LocationID is determined by the MarketID (this is why MarketID is not included as a separate variable in the plot).
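The hundreds-digit observation is easy to verify directly. A sketch with hypothetical IDs that mimic the structure:

```python
import pandas as pd

# Hypothetical IDs illustrating the structure: in the real file,
# the hundreds digit of LocationID should identify the market
df = pd.DataFrame({
    "MarketID":   [1, 1, 2, 2, 3, 3],
    "LocationID": [101, 105, 202, 210, 301, 318],
})

hundreds = df["LocationID"] // 100
# If the hundreds digit is determined by MarketID, each digit
# maps back to exactly one market
markets_per_digit = df.groupby(hundreds)["MarketID"].nunique()
print((markets_per_digit == 1).all())  # True
```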

Projecting this plot onto the SalesInThousands axis, it’s easy to see that SalesInThousands is normally distributed for Small and Medium markets, while the Large markets appear to follow a mixture of two normal distributions.
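One way to quantify the “normal vs. not” claim is a Shapiro–Wilk test, which I didn’t run in the original pass. A sketch on synthetic data, where the stand-in for the Large markets is deliberately a two-component mixture:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins: a single normal vs. a two-component mixture,
# mimicking the Small and Large sales distributions
small = rng.normal(55, 10, 200)
large = np.concatenate([rng.normal(50, 8, 100), rng.normal(90, 8, 100)])

pvals = {}
for name, sales in [("Small", small), ("Large", large)]:
    stat, p = stats.shapiro(sales)
    pvals[name] = p
    print(name, p)  # the mixture should fail the normality test badly
```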

Just to make sure no effect is being missed, let’s view another set of scatter plots to check whether anything among the existing features differentiates these two peaks.

It’s clear there are two “Large” markets which should be treated separately. We’ll use MarketID 3 to create a fourth MarketSize of “Giant”, existing as its own group for the final analysis.
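The relabeling itself is a one-liner. A sketch on a toy frame (in the real data, MarketID 3 is the high-sales “Large” market being split off):

```python
import pandas as pd

# Toy frame; only the two relevant columns
df = pd.DataFrame({
    "MarketID":   [1, 3, 3, 5],
    "MarketSize": ["Large", "Large", "Large", "Small"],
})

# Split MarketID 3 into its own "Giant" category
df.loc[df["MarketID"] == 3, "MarketSize"] = "Giant"
print(df["MarketSize"].tolist())  # ['Large', 'Giant', 'Giant', 'Small']
```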

Analysis

We first perform a simple ANOVA test to see if the three promotions are significantly different from one another in each market size category.

Taking the average SalesInThousands per week in each MarketSize, we have the following table summarizing the values:

Sales ($k)   Promo 1   Promo 2   Promo 3
Giant          89.65     79.59     84.92
Large          60.82     48.76     54.05
Medium         47.67     39.11     45.47
Small          60.16     50.81     59.51

The one-way ANOVA returns a p-value of \(7.2 \times 10^{-5}\), which is plenty significant to show that the means of these three promos are not all equivalent.
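The test itself is scipy.stats.f_oneway on the three promotion groups. A sketch with synthetic per-week averages standing in for the real table:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic per-week average sales for the three promotions;
# the means roughly mimic the table above, the values are made up
promo1 = rng.normal(60, 5, 16)
promo2 = rng.normal(48, 5, 16)
promo3 = rng.normal(55, 5, 16)

# One-way ANOVA: are the three group means all equal?
f_stat, p_value = stats.f_oneway(promo1, promo2, promo3)
print(f"F = {f_stat:.2f}, p = {p_value:.2e}")
```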

We can then break the data down and test each pair of promotions with a Welch’s t-test, always using the one-sided “greater than” test, taking the alternative as the distribution with the higher mean. Below, we see the distribution of SalesInThousands over all weeks in the four market sizes (rows) and for the three promotion types (color).
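Each pairwise comparison uses scipy.stats.ttest_ind with equal_var=False (Welch’s test) and alternative="greater". A sketch on synthetic weekly sales for one market size:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Made-up weekly sales for two promotions within one market size
promo1 = rng.normal(60, 10, 40)
promo2 = rng.normal(49, 10, 40)

# Welch's t-test (unequal variances), one-sided:
# alternative="greater" tests whether promo1's mean exceeds promo2's
t_stat, p_value = stats.ttest_ind(promo1, promo2,
                                  equal_var=False, alternative="greater")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```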

The distributions clearly show that Promotion 1 has the highest mean across the four market sizes (it’s not totally obvious for “Medium”, but it’s nonetheless true). Performing the t-tests for each pair (the table below quotes confidence levels, i.e. one minus the p-value) shows that Promotions 1 and 3 each beat Promotion 2 with confidence greater than 0.99 in all market sizes. Additionally, Promotion 1 has a significantly higher mean than Promotion 3 (confidence \(> 0.95\)) everywhere but in the small markets. There, a simple inversion shows that neither Promo 1 nor Promo 3 can be rejected as the null in favor of the other as the alternative.

Null   Alternative   Small   Medium   Large   Giant
2      1             1.00    1.00     1.00    1.00
2      3             1.00    1.00     1.00    1.00
3      1             0.66    0.98     1.00    1.00

Given that Promotion 1 performs best, and significantly so in most market sizes, it is the recommended version to use everywhere.