In [194]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from apyori import apriori

A Whole Foods Competitor

Let's say our Company X is a big competitor to Whole Foods ie. specializing in groceries, produce and other consumer products. The company would have quite a huge array of products that are on sale. And very often, management ends up working with simple logical rules to augment business decisions.

But looking at the store from a data-drive point of view is entirely decision. There's a lot of hidden customer behavior to mine - that can help the store offer a better experience for the customer and also improve the bottom line!

In [195]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "https://i.insider.com/58ac734c54905756258b5989?width=800&format=jpeg&auto=webp")
Out[195]:

Loading the Dataset

This is pretty similar to a cleaned up Point of Sale Dataset. Each row represents a transaction, and the items that were purchased

In [196]:
orders = pd.read_csv('store_data.csv', header=None)
orders.head()
Out[196]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 shrimp almonds avocado vegetables mix green grapes whole weat flour yams cottage cheese energy drink tomato juice low fat yogurt green tea honey salad mineral water salmon antioxydant juice frozen smoothie spinach olive oil
1 burgers meatballs eggs NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 chutney NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 turkey avocado NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 mineral water milk energy bar whole wheat rice green tea NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

This is order-level data, meaning each row contains the items bought by a customer in a given order. This is hence in a long format

We can note that customers buy from anywhere between 1 and 20 items in a given order

We have about 7.5K records to work with, which is enough for us to implement Association Rules on

Primer to Association Rules

An association rule is a relationship where the one or multiple items co-occur with other items, preferably with a strong significance.

Example: {Bread, Egg} -> {Milk} Someone who buys bread and eggs is highly likely to also buy Milk

The use of association rule mining is widespread - starting from Optimization in Retail Stores to the 'Other Products You May Like' feature on Amazon. In our scenario, we'd like to look at what products are often co-purchased with what items - so we can drive some smart decisions for the store.

But before that, a bit of terminology

In [197]:
Image(url= "https://miro.medium.com/max/1970/1*NoUFBxsokdnw3wrYyVXkgw.png")
Out[197]:

Support

This is the number of times a certain itemset appeared in our data. For eg: {Milk} would be extremely common, and hence the support would be high, whereas {French Wine} will have fewer rows and hence lesser support.

We want to mine associations that have a good amount of support.

In [198]:
Image(url= "https://miro.medium.com/max/2590/1*bqdq-z4Ec7Uac3TT3H_1Gg.png")
Out[198]:
In [199]:
len(orders)*0.004
Out[199]:
30.004

Since our dataset is fairly big, even a small support of 0.5% ~~ 30 records will provide us with meaningful associations. But the parameter has to be set by the client based on the requirements

Confidence

This is the likelihood of a person buying item Y given they bought X.

In [200]:
Image(url= "https://miro.medium.com/max/2590/1*E3mNKHcudWzHySGMvo_vPg.png")
Out[200]:

Lift

This is the most important parameter we look out for in Association Rules Mining. The lift tells us the likelihood of a customer buying both X and Y over the odds of buying just X

Explain Like I'm 5: If the lift is 5, a customer who has bought X is 5 times more likely to also buy Y

Lo, that's the magic of Association Rules. Simple, Computationally Efficient, but also incredibly meaningful in the insights it offers.

Association Rule Mining - Apriori Algorithm: The How

  1. We first generate all possible combinations of itemsets from the different transactions we have.
  2. Pruning: We remove the combinations that have very low support (set a threshold)
  3. Confidence Threshold: We filter for rules that have sufficient confidence
  4. Use the rules with the highest lift values!

The apyori package makes this process super simple for us.

Installation: pip install apyori

Data Munging

We need to convert the items purchased in an order into a list of strings for our model to be run

In [254]:
def remover(lit, x):
    try:
        lit = [l for l in lit if l != 'nan']
    except:
        pass
    
    return lit

orderList = []

for order in orders.itertuples():
    orde = [str(o) for o in order[1:]]
    orderList.append(remover(orde, 'nan'))
    
orderList[1]
Out[254]:
['burgers', 'meatballs', 'eggs']

Items

We have everything from vegetables and meat to smoothies and wine.

In [255]:
import itertools
product_list = itertools.chain.from_iterable(orderList)
plist = pd.DataFrame()
plist['item'] = list(set(product_list))
plist
Out[255]:
item
0 soup
1 candy bars
2 ham
3 sparkling water
4 extra dark chocolate
... ...
115 champagne
116 bacon
117 frozen smoothie
118 cream
119 honey

120 rows × 1 columns

Generating the rules

In [256]:
association_rules = apriori(orderList, min_support=0.0035, min_confidence=0.1, min_lift=2, min_length=2)
association_results = list(association_rules)

Viewing the results in a dataframe

In [257]:
rows = []
for row in association_results:
    obj = row
    item = list(obj[2][0][0])
    support = obj[1]
    to_add = list(obj[2][0][1])
    confidence = obj[2][0][2]
    lift = obj[2][0][3]
    newRow = {'item': item, 'support': support, 'to_add':to_add, 'confidence': confidence, 'lift': lift}
    rows.append(newRow)

rules = pd.DataFrame(rows)
rules.sort_values(by="lift", ascending=False)
Out[257]:
item support to_add confidence lift
12 [light cream] 0.004533 [chicken] 0.290598 4.843951
258 [whole wheat pasta] 0.003866 [olive oil, mineral water] 0.131222 4.755044
20 [pasta] 0.005866 [escalope] 0.372881 4.700812
289 [frozen vegetables, ground beef] 0.003733 [milk, mineral water] 0.220472 4.593788
52 [pasta] 0.005066 [shrimp] 0.322034 4.506672
... ... ... ... ... ...
170 [ground beef, french fries] 0.003600 [milk] 0.259615 2.003472
111 [eggs, chocolate] 0.006532 [ground beef] 0.196787 2.002850
114 [escalope, mineral water] 0.005599 [chocolate] 0.328125 2.002657
206 [grated cheese] 0.006266 [spaghetti, mineral water] 0.119593 2.002380
89 [cake, pancakes] 0.004133 [spaghetti] 0.348315 2.000542

293 rows × 5 columns

Analyzing the rules

We have 588 rules generated based on our constraints of support being 0.35% and confidence being 10%.

We can clearly see some logical rules forming. A customer who buys whole wheat pasta is about 4.8x more likely to also buy olive oil and mineral water. (Everyone loves an Italian night!)

Someone who buys brownies is 2x more likely to buy cake as well.

But some of the not so straightforward rules are more informative to analysts. Light Cream co-occuring with Chicken has a lift of 4.8x. These kind of rules can drive strategy by offering new insight of customer behavior to the management

Now, how can we use these rules to profit?

Loss Leader Pricing and Placement

A loss leader (also leader) is a pricing strategy where a product is sold at a price below its market cost to stimulate other sales of more profitable goods or services. With this sales promotion/marketing strategy, a "leader" is used as a related term and can mean any popular article, i.e., sold at a normal price.

In short

  1. Offer steep discounts on items you're making loss on anyway
  2. Customers buy the items related by association rules, which make more money.
  3. Profit!

Placement We can also alter the locations of items in stores. For example, if we see that Whole Wheat Pasta and Olive Oil co-occur a lot and both generate good revenue - A good idea is to keep the Mineral Water next to these sections, thereby increasing the chances of the customer buying this too.

Mocking some data

We don't have the information for which products are doing well and which ones aren't. So we come up with boolean (True/False) filter for whether the product is generating loss for the store - and generate it randomly.

In [288]:
np.random.seed = 15
plist['loss'] = np.random.randint(0,2, 120)
In [289]:
lossleaders = plist[plist['loss']==1]
winners = plist[plist['loss']==0]
lossleaders.head()
Out[289]:
item loss
3 sparkling water 1
6 gums 1
8 sandwich 1
9 strawberries 1
10 brownies 1

We can see that Sparkling Water, Brownies and Sandwich are among the items that are not profitable for the store

In [290]:
winners.head()
Out[290]:
item loss
0 soup 0
1 candy bars 0
2 ham 0
4 extra dark chocolate 0
5 cooking oil 0

Soup, Candy Bars and Extra Dark Chocolate are incredibly popular at the store and profitable.

Strategy - The Supercharging Part

Now I want to find out the products which are generating loss for us currently, but lead the customer to buy other products along - which are extremely profitable to us.

Explain Like I'm 5 If I give off a loss leader for free, which the customer wasn't anyway going to buy, they are more likely to walk into the store and buy other products which are incredibly profitable for us. Win

Hence, we filter the rules for the same

In [291]:
lossleaders['rules'] = [len(rules[[(item in lhs) & (bool(set(rhs) & set(winners['item']))) for lhs, rhs in zip(rules['item'], rules['to_add'])]]) for item in lossleaders.item]
lossleaders[lossleaders['rules']>0].sort_values(by='rules', ascending=False)
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
Out[291]:
item loss rules
62 frozen vegetables 1 28
95 eggs 1 27
44 chocolate 1 25
27 spaghetti 1 16
69 olive oil 1 14
15 french fries 1 9
25 chicken 1 7
30 pancakes 1 7
66 herb & pepper 1 4
42 fresh tuna 1 2
18 meatballs 1 2
33 fresh bread 1 1
40 pasta 1 1
76 black tea 1 1
77 cookies 1 1
91 hot dogs 1 1
93 escalope 1 1
10 brownies 1 1
In [292]:
#rules['to_add'] = rules['to_add'].apply(lambda x: remover(x, 'nan'))
rules['no_winners_rhs'] = [len(set(rhs) & set(winners['item'])) for rhs in rules['to_add']]

Making Decisions

In [294]:
rules[[['frozen vegetables'] == r for r in rules['item']]].sort_values(by=['lift'], ascending=False).head()
Out[294]:
item support to_add confidence lift no_winners_rhs
29 [frozen vegetables] 0.016131 [tomatoes] 0.169231 2.474464 1
28 [frozen vegetables] 0.016664 [shrimp] 0.174825 2.446574 1
185 [frozen vegetables] 0.011065 [milk, mineral water] 0.116084 2.418737 2
194 [frozen vegetables] 0.011998 [spaghetti, mineral water] 0.125874 2.107549 1

We can see that a customer who bought frozen vegetables is 2.5x more likely to also buy milk and mineral water, and 2.4x more likely to buy shrimp.

So we can use frozen vegetables as a loss leader as they are not performing well anyway.

Offering a steep discount like 50% off, would mean customers walk into the store's frozen section and buy frozen vegetables, and also pick up shrimp alongside. Profit!

In [297]:
rules[[['olive oil'] == r for r in rules['item']]].sort_values(by=['lift'], ascending=False).head(10)
Out[297]:
item support to_add confidence lift no_winners_rhs
51 [olive oil] 0.007999 [whole wheat pasta] 0.121457 4.122410 1
247 [olive oil] 0.007199 [spaghetti, milk] 0.109312 3.082509 1
134 [olive oil] 0.007066 [spaghetti, chocolate] 0.107287 2.737290 0
239 [olive oil] 0.008532 [milk, mineral water] 0.129555 2.699415 2
50 [olive oil] 0.008932 [soup] 0.135628 2.684280 1
256 [olive oil] 0.010265 [spaghetti, mineral water] 0.155870 2.609786 1
219 [olive oil] 0.006666 [ground beef, mineral water] 0.101215 2.472998 2
130 [olive oil] 0.008266 [chocolate, mineral water] 0.125506 2.383344 1

We can see here that a customer who buys olive oil is 4.1 times more likely to also buy pasta; and 3x more likely to buy spaghetti. (Italian date night at home?)

This can be extremely insightful to the person who plans the layout of the store. A good idea would be to place the olive oil and spaghetti products right next in the pasta aisle, and then offer discounts on the olive oil to get the customer into the store looking for it.

Potential of Association Rules Mining

Given more data, I'd ideally do this analysis with a cost-aware prediction - where we can quantify the revenue gain that can be created by using an Association Rule. There are wide variety of contexts in which Association Rules Mining can be applied - from market basket analysis to picking out the dream team for a football game.

I hope this was an interesting read!

In [ ]: