import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from apyori import apriori

A Whole Foods Competitor¶

Let's say our Company X is a big competitor to Whole Foods ie. specializing in groceries, produce and other consumer products. The company would have quite a huge array of products that are on sale. And very often, management ends up working with simple logical rules to augment business decisions.

But looking at the store from a data-drive point of view is entirely decision. There's a lot of hidden customer behavior to mine - that can help the store offer a better experience for the customer and also improve the bottom line!

from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "https://i.insider.com/58ac734c54905756258b5989?width=800&format=jpeg&auto=webp")

Loading the Dataset¶

This is pretty similar to a cleaned up Point of Sale Dataset. Each row represents a transaction, and the items that were purchased

orders = pd.read_csv('store_data.csv', header=None)
orders.head()

This is order-level data, meaning each row contains the items bought by a customer in a given order. This is hence in a long format

We can note that customers buy from anywhere between 1 and 20 items in a given order

We have about 7.5K records to work with, which is enough for us to implement Association Rules on

Primer to Association Rules¶

An association rule is a relationship where the one or multiple items co-occur with other items, preferably with a strong significance.

Example: {Bread, Egg} -> {Milk} Someone who buys bread and eggs is highly likely to also buy Milk

The use of association rule mining is widespread - starting from Optimization in Retail Stores to the 'Other Products You May Like' feature on Amazon. In our scenario, we'd like to look at what products are often co-purchased with what items - so we can drive some smart decisions for the store.

But before that, a bit of terminology

Image(url= "https://miro.medium.com/max/1970/1*NoUFBxsokdnw3wrYyVXkgw.png")

Support¶

This is the number of times a certain itemset appeared in our data. For eg: {Milk} would be extremely common, and hence the support would be high, whereas {French Wine} will have fewer rows and hence lesser support.

We want to mine associations that have a good amount of support.

Image(url= "https://miro.medium.com/max/2590/1*bqdq-z4Ec7Uac3TT3H_1Gg.png")

len(orders)*0.004

30.004

Since our dataset is fairly big, even a small support of 0.5% ~~ 30 records will provide us with meaningful associations. But the parameter has to be set by the client based on the requirements

Confidence¶

This is the likelihood of a person buying item Y given they bought X.

Image(url= "https://miro.medium.com/max/2590/1*E3mNKHcudWzHySGMvo_vPg.png")

Lift

This is the most important parameter we look out for in Association Rules Mining. The lift tells us the likelihood of a customer buying both X and Y over the odds of buying just X

Explain Like I'm 5: If the lift is 5, a customer who has bought X is 5 times more likely to also buy Y

Lo, that's the magic of Association Rules. Simple, Computationally Efficient, but also incredibly meaningful in the insights it offers.

Association Rule Mining - Apriori Algorithm: The How¶

We first generate all possible combinations of itemsets from the different transactions we have.
Pruning: We remove the combinations that have very low support (set a threshold)
Confidence Threshold: We filter for rules that have sufficient confidence
Use the rules with the highest lift values!

The apyori package makes this process super simple for us.

Installation: pip install apyori

Data Munging¶

We need to convert the items purchased in an order into a list of strings for our model to be run

def remover(lit, x):
    try:
        lit = [l for l in lit if l != 'nan']
    except:
        pass
    
    return lit

orderList = []

for order in orders.itertuples():
    orde = [str(o) for o in order[1:]]
    orderList.append(remover(orde, 'nan'))
    
orderList[1]

['burgers', 'meatballs', 'eggs']

Items¶

We have everything from vegetables and meat to smoothies and wine.

import itertools
product_list = itertools.chain.from_iterable(orderList)
plist = pd.DataFrame()
plist['item'] = list(set(product_list))
plist

Generating the rules¶

association_rules = apriori(orderList, min_support=0.0035, min_confidence=0.1, min_lift=2, min_length=2)
association_results = list(association_rules)

Viewing the results in a dataframe

rows = []
for row in association_results:
    obj = row
    item = list(obj[2][0][0])
    support = obj[1]
    to_add = list(obj[2][0][1])
    confidence = obj[2][0][2]
    lift = obj[2][0][3]
    newRow = {'item': item, 'support': support, 'to_add':to_add, 'confidence': confidence, 'lift': lift}
    rows.append(newRow)

rules = pd.DataFrame(rows)
rules.sort_values(by="lift", ascending=False)

Analyzing the rules¶

We have 588 rules generated based on our constraints of support being 0.35% and confidence being 10%.

We can clearly see some logical rules forming. A customer who buys whole wheat pasta is about 4.8x more likely to also buy olive oil and mineral water. (Everyone loves an Italian night!)

Someone who buys brownies is 2x more likely to buy cake as well.

But some of the not so straightforward rules are more informative to analysts. Light Cream co-occuring with Chicken has a lift of 4.8x. These kind of rules can drive strategy by offering new insight of customer behavior to the management

Now, how can we use these rules to profit?

Loss Leader Pricing and Placement¶

A loss leader (also leader) is a pricing strategy where a product is sold at a price below its market cost to stimulate other sales of more profitable goods or services. With this sales promotion/marketing strategy, a "leader" is used as a related term and can mean any popular article, i.e., sold at a normal price.

In short

Offer steep discounts on items you're making loss on anyway
Customers buy the items related by association rules, which make more money.
Profit!

Placement We can also alter the locations of items in stores. For example, if we see that Whole Wheat Pasta and Olive Oil co-occur a lot and both generate good revenue - A good idea is to keep the Mineral Water next to these sections, thereby increasing the chances of the customer buying this too.

Mocking some data

We don't have the information for which products are doing well and which ones aren't. So we come up with boolean (True/False) filter for whether the product is generating loss for the store - and generate it randomly.

np.random.seed = 15
plist['loss'] = np.random.randint(0,2, 120)

lossleaders = plist[plist['loss']==1]
winners = plist[plist['loss']==0]
lossleaders.head()

We can see that Sparkling Water, Brownies and Sandwich are among the items that are not profitable for the store

winners.head()

Soup, Candy Bars and Extra Dark Chocolate are incredibly popular at the store and profitable.

Strategy - The Supercharging Part¶

Now I want to find out the products which are generating loss for us currently, but lead the customer to buy other products along - which are extremely profitable to us.

Explain Like I'm 5 If I give off a loss leader for free, which the customer wasn't anyway going to buy, they are more likely to walk into the store and buy other products which are incredibly profitable for us. Win

Hence, we filter the rules for the same

lossleaders['rules'] = [len(rules[[(item in lhs) & (bool(set(rhs) & set(winners['item']))) for lhs, rhs in zip(rules['item'], rules['to_add'])]]) for item in lossleaders.item]
lossleaders[lossleaders['rules']>0].sort_values(by='rules', ascending=False)

C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

#rules['to_add'] = rules['to_add'].apply(lambda x: remover(x, 'nan'))
rules['no_winners_rhs'] = [len(set(rhs) & set(winners['item'])) for rhs in rules['to_add']]

Making Decisions¶

rules[[['frozen vegetables'] == r for r in rules['item']]].sort_values(by=['lift'], ascending=False).head()

We can see that a customer who bought frozen vegetables is 2.5x more likely to also buy milk and mineral water, and 2.4x more likely to buy shrimp.

So we can use frozen vegetables as a loss leader as they are not performing well anyway.

Offering a steep discount like 50% off, would mean customers walk into the store's frozen section and buy frozen vegetables, and also pick up shrimp alongside. Profit!

rules[[['olive oil'] == r for r in rules['item']]].sort_values(by=['lift'], ascending=False).head(10)

We can see here that a customer who buys olive oil is 4.1 times more likely to also buy pasta; and 3x more likely to buy spaghetti. (Italian date night at home?)

This can be extremely insightful to the person who plans the layout of the store. A good idea would be to place the olive oil and spaghetti products right next in the pasta aisle, and then offer discounts on the olive oil to get the customer into the store looking for it.

Potential of Association Rules Mining¶

Given more data, I'd ideally do this analysis with a cost-aware prediction - where we can quantify the revenue gain that can be created by using an Association Rule. There are wide variety of contexts in which Association Rules Mining can be applied - from market basket analysis to picking out the dream team for a football game.

I hope this was an interesting read!

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19
0	shrimp	almonds	avocado	vegetables mix	green grapes	whole weat flour	yams	cottage cheese	energy drink	tomato juice	low fat yogurt	green tea	honey	salad	mineral water	salmon	antioxydant juice	frozen smoothie	spinach	olive oil
1	burgers	meatballs	eggs	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	chutney	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	turkey	avocado	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	mineral water	milk	energy bar	whole wheat rice	green tea	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

	item
0	soup
1	candy bars
2	ham
3	sparkling water
4	extra dark chocolate
...	...
115	champagne
116	bacon
117	frozen smoothie
118	cream
119	honey

	item	support	to_add	confidence	lift
12	[light cream]	0.004533	[chicken]	0.290598	4.843951
258	[whole wheat pasta]	0.003866	[olive oil, mineral water]	0.131222	4.755044
20	[pasta]	0.005866	[escalope]	0.372881	4.700812
289	[frozen vegetables, ground beef]	0.003733	[milk, mineral water]	0.220472	4.593788
52	[pasta]	0.005066	[shrimp]	0.322034	4.506672
...	...	...	...	...	...
170	[ground beef, french fries]	0.003600	[milk]	0.259615	2.003472
111	[eggs, chocolate]	0.006532	[ground beef]	0.196787	2.002850
114	[escalope, mineral water]	0.005599	[chocolate]	0.328125	2.002657
206	[grated cheese]	0.006266	[spaghetti, mineral water]	0.119593	2.002380
89	[cake, pancakes]	0.004133	[spaghetti]	0.348315	2.000542

	item	loss
0	soup	0
1	candy bars	0
2	ham	0
4	extra dark chocolate	0
5	cooking oil	0

	item	loss	rules
62	frozen vegetables	1	28
95	eggs	1	27
44	chocolate	1	25
27	spaghetti	1	16
69	olive oil	1	14
15	french fries	1	9
25	chicken	1	7
30	pancakes	1	7
66	herb & pepper	1	4
42	fresh tuna	1	2
18	meatballs	1	2
33	fresh bread	1	1
40	pasta	1	1
76	black tea	1	1
77	cookies	1	1
91	hot dogs	1	1
93	escalope	1	1
10	brownies	1	1

	item	support	to_add	confidence	lift	no_winners_rhs
29	[frozen vegetables]	0.016131	[tomatoes]	0.169231	2.474464	1
28	[frozen vegetables]	0.016664	[shrimp]	0.174825	2.446574	1
185	[frozen vegetables]	0.011065	[milk, mineral water]	0.116084	2.418737	2
194	[frozen vegetables]	0.011998	[spaghetti, mineral water]	0.125874	2.107549	1

	item	support	to_add	confidence	lift	no_winners_rhs
51	[olive oil]	0.007999	[whole wheat pasta]	0.121457	4.122410	1
247	[olive oil]	0.007199	[spaghetti, milk]	0.109312	3.082509	1
134	[olive oil]	0.007066	[spaghetti, chocolate]	0.107287	2.737290	0
239	[olive oil]	0.008532	[milk, mineral water]	0.129555	2.699415	2
50	[olive oil]	0.008932	[soup]	0.135628	2.684280	1
256	[olive oil]	0.010265	[spaghetti, mineral water]	0.155870	2.609786	1
219	[olive oil]	0.006666	[ground beef, mineral water]	0.101215	2.472998	2
130	[olive oil]	0.008266	[chocolate, mineral water]	0.125506	2.383344	1