RFM Analysis for Strategic Marketing Segmentation by Barrett Duna¶
Executive Summary¶
An online retailer's transaction data is used to segment its customers (many of whom are wholesalers) into distinct groups with similar purchasing behavior. RFM scoring is applied to engineer features that capture recency of purchase, frequency of purchase and monetary spend. Each segment is labeled with a 3-digit code, and each digit scores the segment on one aspect of purchasing behavior. This analysis provides the information needed to run highly targeted marketing campaigns, with messaging tailored to each segment's particular needs.
Mathematical Details of the Triangular Plot¶
Curious as to why the data in the RFM cluster plot is triangular? Skip ahead to the section "Mathematical Derivations and Proofs Related to the Triangular Plot" for the mathematics that give rise to this phenomenon.
RFM Scoring¶
Below is an explanation of RFM scoring's three components: recency (R), frequency (F) and monetary (M). Note that each RFM component below is scored on a scale of $1$ (worst) to $5$ (best).
Recency To score highly on recency, a customer or wholesaler must have purchased recently. On the other hand, a purchaser with a low recency score hasn't done business with the retailer in a long period of time.
Frequency A high-frequency score indicates that the customer or wholesaler purchases often, with small time intervals between purchases. A low-frequency score is indicative of purchases that occur at wide intervals.
Monetary High monetary scores imply that the purchaser spends a large amount of money over time, producing a high average daily expenditure. A low monetary score translates to a small economic outlay over time, leading to a low average daily spend.
Segment Strategic Analyses¶
Segment 444¶
This group of purchasers is the online retailer's strongest segment. Given its high rating on all three metrics, it is likely comprised mostly of wholesalers. The members of this segment have purchased reasonably recently (within the last 20 days on average), so the segment is still actively doing business with the retailer, and it maintains a healthy frequency relative to the other segments (an average of 51 days between purchases). Most notably, this segment overwhelmingly leads in spending, with an average expenditure of \$12.98 a day (the second highest is \$4.34).
Since, in all likelihood, most of these individuals are wholesalers, one solid strategy would be to implement wholesale pricing optimization based on past transactional data. Online wholesale is likely an incredibly competitive business, so optimizing prices is essential for survival. One thing to note is that margins are likely extremely thin, so price optimization must incorporate costs into the calculation when deriving optimal prices.
Segment 132¶
This segment struggles with recency. On average, the members of this segment purchased 223 days ago. Most of the consumers and wholesalers in this segment have likely churned, and the likelihood that one of these individuals will do business with the online retailer again is small. A "last attempt" email campaign could be sent to these customers and wholesalers, drawing some of them back in, but conversion rates would likely be low, delivering a suboptimal ROI. The online retailer is best off deprioritizing this segment due to likely churn, avoiding wasted efforts and budgetary resources.
Segment 312¶
This segment's members, on average, purchased 42 days ago, ranking the segment 2nd in recency. The strong recency score signals that this segment is actively interacting with the retailer, so it shouldn't be ignored. Its biggest weakness is the average 224-day frequency, the worst among all segments. Solid recency combined with weak frequency likely means this segment is comprised of customers and wholesalers who purchased early on and have recently become active again. Attention should be given to retaining this recent resurgence in activity and encouraging the segment to keep purchasing.
Segment 222¶
This segment needs special attention. All RFM scores are above $1$, meaning the segment is not seriously lacking in any specific area. This segment could become a substantial source of additional sales if properly handled. The online retailer's primary goal should be to build strong relationships with this segment's members. Strengthening relationships with this segment can be accomplished through typical digital marketing means such as email drip campaigns, social advertising, SEO, PPC and content marketing.
Mathematical Derivations and Proofs Related to the Triangular Plot¶
Introduction¶
Let $k \ge 2$ be the total number of transactions for a specific customer. ($k$ cannot equal $1$: with a single purchase the frequency defined below is undefined, so single-purchase customers have been removed from the dataset.)
Let $\theta_0$ be the date of the customer or wholesaler's first transaction.
Let $\theta_1$ be the date of the most recent transaction entered into by the customer or wholesaler.
Let $S$ be the specific customer's total spend over the customer's lifetime.

Let $N$ be the number of days spanned by the dataset, with dates measured as integer day offsets so that day $0$ is the earliest transaction date in the dataset and day $N$ the most recent.
Fundamental Definitions¶
We define $f$ (frequency), $r$ (recency) and $m$ (monetary) as follows:
$f = \frac{\theta_1 - \theta_0}{k-1}$
$r = N - \theta_1$
$m = \frac{S}{N - \theta_0}$
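These definitions translate directly into code. Below is a minimal sketch for a single hypothetical customer, with dates expressed as integer day offsets; the function name `rfm_values` and the example numbers are illustrative, not from the notebook.

```python
def rfm_values(theta_0, theta_1, k, S, N):
    """Compute the raw frequency, recency and monetary values.

    theta_0 -- day of first transaction (integer offset)
    theta_1 -- day of most recent transaction
    k       -- total number of transactions (k >= 2)
    S       -- lifetime spend
    N       -- most recent day in the dataset
    """
    f = (theta_1 - theta_0) / (k - 1)  # average days between purchases
    r = N - theta_1                    # days since last purchase
    m = S / (N - theta_0)              # average daily spend
    return f, r, m

# hypothetical customer: first bought on day 0, last on day 90,
# 10 transactions, $500 lifetime spend; dataset ends on day 100
f, r, m = rfm_values(0, 90, 10, 500.0, 100)
print(f, r, m)  # 10.0 10 5.0
```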
Variable Constraints¶
$f \in [1, N]$ (since $k$ counts distinct purchase dates, $\theta_1 - \theta_0 \ge k-1$, and so $f \ge 1$)
$r \in [0, N-1]$
$ m \in (0, \infty )$
$ \theta_0 \in [0, N-1]$
$ \theta_1 \in [1, N]$
$ k \in [2, N+1]$
Special Cases¶
CASE #1:¶
Let $\theta_1 = N$, $\theta_0 = 0$ and $k=2$.
$f = \frac{\theta_1 - \theta_0}{k-1} = \frac{N - 0}{2-1} = N$
$r = N - \theta_1 = N - N = 0$
So in this case, $f = N$ and $r = 0$.
CASE #2:¶
Assume $\theta_1 = 1$, $\; \theta_0 = 0$ and $k=2$.
$f = \frac{\theta_1 - \theta_0}{k-1} = \frac{1-0}{2-1} = \frac{1}{1} = 1$
$r = N - \theta_1 = N-1$
Given the values, we observe $f = 1$ and $r = N-1$.
CASE #3:¶
Set $\theta_1 = N$, $\theta_0 = 1$ and $k=N$.
$f = \frac{\theta_1 - \theta_0}{k-1} = \frac{N-1}{N-1} = 1$
$r = N - \theta_1 = N-N = 0$
We see $f = 1$ and $r = 0$.
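The three cases can be verified numerically for a concrete dataset length (here the arbitrary choice $N = 10$):

```python
N = 10  # arbitrary dataset length in days

def f_val(theta_0, theta_1, k):
    return (theta_1 - theta_0) / (k - 1)

def r_val(theta_1):
    return N - theta_1

# Case 1: theta_1 = N, theta_0 = 0, k = 2  ->  f = N, r = 0
assert f_val(0, N, 2) == N and r_val(N) == 0
# Case 2: theta_1 = 1, theta_0 = 0, k = 2  ->  f = 1, r = N - 1
assert f_val(0, 1, 2) == 1 and r_val(1) == N - 1
# Case 3: theta_1 = N, theta_0 = 1, k = N  ->  f = 1, r = 0
assert f_val(1, N, N) == 1 and r_val(N) == 0
```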
Derivation of the Boundary Line Equation¶
Given the cases above, we now have enough information to determine the boundary line's slope and intercept. As a reminder, here are two points $\lt f, r \gt$ on the boundary line, obtained from cases #1 and #2 above: $\lt N, 0 \gt$ and $\lt 1, N-1 \gt$.
$slope = \frac{0 - (N-1)}{N-1} = \frac{-(N-1)}{N-1} = -1$
$intercept = y_1 - (slope)x_1 = 0-(-1)N = N$
Thus the equation for the boundary line is $r(f) = N - f$.
Note the distinction between formulas such as $f = \frac{\theta_1 - \theta_0}{k-1}$ and $r = N - \theta_1$, which give the values of $f$ and $r$ for an individual customer, and the equation for the boundary line derived here, which describes the edge of the region those $\lt f, r \gt$ points can occupy.
Proof of Lower Triangular Property¶
Note, our proposed boundary line is $r(f) = N - f$.
We claim $ r \le N - f $ (a statement which implies the triangular property) for any valid point $\lt f, r \gt$. This is equivalent to the claim that $ r + f \le N$.
Note...
$f = \frac{\theta_1 - \theta_0}{k-1}$
$r = N - \theta_1$
And $k-1 \ge 1$ because $k \ge 2$, and thus...
$$ r + f \le r + (k-1)f = (N - \theta_1) + (\theta_1 - \theta_0) = N - \theta_0 \le N$$

Thus, the proof is complete and the lower triangular property has been explained.
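The claim $r + f \le N$ can also be sanity-checked empirically by simulating random valid customers. This is a quick numerical check under the constraints stated above, not a substitute for the proof:

```python
import random

random.seed(0)
N = 365

for _ in range(10_000):
    theta_0 = random.randint(0, N - 1)
    theta_1 = random.randint(theta_0 + 1, N)
    # k distinct purchase dates require theta_1 - theta_0 >= k - 1
    k = random.randint(2, theta_1 - theta_0 + 1)
    f = (theta_1 - theta_0) / (k - 1)
    r = N - theta_1
    assert r + f <= N  # every valid point lies on or below r = N - f
```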
Maximizing $f$¶
By the above, it is sufficient to maximize $f$ on the boundary line.
We know that $r = N - f$.
This implies that $f = N - r$.
Since $r$ must be nonnegative, the value of $f$ when maximized occurs when $r = 0$ which is observed in case #1.
Thus, $max(f) = N$.
Maximizing $r$¶
Since we have shown above that all points lie on or below the boundary line, it suffices to find the max on the boundary line.
We have $r(f) = N - f$.
Due to the fact that $f \ge 1$, $r$ is maximized when $f = 1$ which occurs in case #2.
So $max(r) = N-1$.
Preliminaries¶
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.preprocessing import StandardScaler # data standardization
import matplotlib.pyplot as plt # data viz
from sklearn.cluster import KMeans # KMeans Algo
import seaborn as sns # data viz
# read in data
f_path = 'online_retail.csv'
data = pd.read_csv(f_path)
data.head(3)
|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID |
|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 |
# drop columns 'StockCode' and 'Description'
data.drop(columns=['StockCode', 'Description'], inplace=True)
# drop rows with missing 'CustomerID' (the raw data contains NaNs,
# which is why the column loads as float), then cast the feature to int
data.dropna(subset=['CustomerID'], inplace=True)
data['CustomerID'] = data['CustomerID'].astype(int)
# convert 'InvoiceDate' to a datetime.date
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate']).dt.date
Purchase Recency Analysis¶
# get the most recent purchase date by customer
most_recent_purchase = data.groupby('CustomerID')
most_recent_purchase = most_recent_purchase.agg({'InvoiceDate': 'max'})
most_recent_purchase.rename(columns={'InvoiceDate': 'MostRecentPurchase'}, inplace=True)
# calculate the most recent date in the dataset
most_recent_date = data['InvoiceDate'].max()
# calculate recency
recency = (most_recent_date - most_recent_purchase)
recency.rename(columns={'MostRecentPurchase': 'Recency'}, inplace=True)
recency['Recency'] = recency['Recency'].apply(lambda x: int(x.days))
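The recency pipeline above can be traced on a tiny hypothetical DataFrame. This sketch uses pandas Timestamps rather than the notebook's `datetime.date` objects, which doesn't change the arithmetic:

```python
import pandas as pd

# toy data (hypothetical): two customers, three invoices
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2],
    'InvoiceDate': pd.to_datetime(['2011-01-01', '2011-03-01', '2011-02-01']),
})

# each customer's most recent purchase date
last = toy.groupby('CustomerID')['InvoiceDate'].max()
# days between the dataset's latest date and each customer's last purchase
recency_days = (toy['InvoiceDate'].max() - last).dt.days
print(recency_days.to_dict())  # {1: 0, 2: 28}
```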
Purchase Monetary Analysis¶
data.head(3)
|   | InvoiceNo | Quantity | InvoiceDate | UnitPrice | CustomerID |
|---|---|---|---|---|---|
| 0 | 536365 | 6 | 2010-12-01 | 2.55 | 17850 |
| 1 | 536365 | 6 | 2010-12-01 | 3.39 | 17850 |
| 2 | 536365 | 8 | 2010-12-01 | 2.75 | 17850 |
# calculate quantity x price and total customer spend
order_value = data[['CustomerID', 'Quantity', 'UnitPrice']].copy()
order_value['OrderValue'] = order_value['Quantity'] * order_value['UnitPrice']
total_customer_spend = order_value.groupby('CustomerID')
total_customer_spend = total_customer_spend.agg({'OrderValue': 'sum'})
total_customer_spend.rename(columns={'OrderValue': 'TotalCustomerSpend'}, inplace=True)
# calculate each customer's first purchase date
first_purchase_date = data.groupby('CustomerID').agg({'InvoiceDate': 'min'})
first_purchase_date.rename(columns={'InvoiceDate': 'FirstPurchaseDate'}, inplace=True)
# calculate most recent transaction date in the dataset
most_recent_date = data['InvoiceDate'].max()
# calculate time from first transaction to most recent
account_age = pd.DataFrame((most_recent_date - first_purchase_date['FirstPurchaseDate']).apply(lambda x: int(x.days)))
account_age.rename(columns={'FirstPurchaseDate': 'AccountAge'}, inplace=True)
# join total customer spend with account age to calculate the monetary metric
monetary = account_age.join(total_customer_spend)
monetary['Monetary'] = monetary['TotalCustomerSpend'] / monetary['AccountAge']
monetary.drop(columns=['AccountAge', 'TotalCustomerSpend'], inplace=True)
monetary.head(3)
| CustomerID | Monetary |
|---|---|
| 12747 | 11.287561 |
| 12748 | 77.512413 |
| 12749 | 18.830892 |
Purchase Frequency Analysis¶
# mrt is each customer's most recent transaction date
mrt = data.groupby('CustomerID').agg({'InvoiceDate': 'max'})
mrt.rename(columns={'InvoiceDate': 'MostRecentTransaction'}, inplace=True)
# ft is each customer's earliest invoice date
ft = data.groupby('CustomerID').agg({'InvoiceDate': 'min'})
ft.rename(columns={'InvoiceDate': 'FirstTransaction'}, inplace=True)
# reduce to one row per distinct (customer, invoice date) pair
invoice_dates = data.groupby(['CustomerID', 'InvoiceDate'])
invoice_dates = invoice_dates.agg({'InvoiceNo': 'first'}).reset_index()
invoice_dates.drop(columns='InvoiceNo', inplace=True)
invoice_dates.set_index('CustomerID', inplace=True)
# calculate number of transactions per customer (k)
k = invoice_dates.groupby('CustomerID')
k = k.agg({'InvoiceDate': 'nunique'})
k = k.rename(columns={'InvoiceDate': 'K'})
# join most recent transaction, first
# transaction and number of transactions
frequency = mrt.join(ft).join(k)
# only keep customers who made two or
# more transactions because frequency is
# undefined for a single transaction
frequency = frequency[frequency['K'] > 1].copy()
# adjust k to k-1 to match the definition
# of frequency
frequency['K'] = frequency['K'] - 1
# calculate frequency
frequency['Frequency'] = (frequency['MostRecentTransaction'] - frequency['FirstTransaction'])
frequency['Frequency'] = frequency['Frequency'].apply(lambda x: int(x.days))
frequency['Frequency'] = frequency['Frequency']/(frequency['K'])
# drop unnecessary columns
frequency.drop(columns=['MostRecentTransaction', 'FirstTransaction', 'K'], inplace=True)
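The pipeline above implements the same $f = \frac{\theta_1 - \theta_0}{k-1}$ from the math section. A minimal sketch for one hypothetical customer with purchases on three distinct dates:

```python
import pandas as pd

# hypothetical customer: purchases on days 0, 10 and 30 (k = 3 distinct dates)
dates = pd.to_datetime(['2011-01-01', '2011-01-11', '2011-01-31'])
k = dates.nunique()                         # number of distinct purchase dates
span_days = (dates.max() - dates.min()).days  # theta_1 - theta_0
frequency = span_days / (k - 1)             # average days between purchases
print(frequency)  # 15.0
```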
Join RFM Values¶
# join recency, frequency and monetary data
rfm_data = recency.join(frequency).join(monetary)
rfm_data.dropna(inplace=True)
KMeans RFM Clustering¶
# KMeans clustering of rfm data with 4 clusters
n_clusters = 4
random_state = 0
rfm_kmeans = KMeans(
n_clusters=n_clusters,
random_state=random_state
)
rfm_kmeans.fit(rfm_data)
# extract the centroids from the fit model
centroids = rfm_kmeans.cluster_centers_
# create a new column with predicted clusters on rfm data
rfm_data['RFMCluster'] = rfm_kmeans.predict(rfm_data)
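Note that `StandardScaler` is imported in the preliminaries but the clustering above runs on raw R/F/M values. Because KMeans is distance-based and the three features live on very different scales, a scaled variant is worth considering. Scaling would change the cluster assignments (and hence the tables here), so the sketch below, which uses synthetic stand-in data, is an alternative rather than what this analysis used:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# synthetic stand-in for rfm_data[['Recency', 'Frequency', 'Monetary']],
# with columns on deliberately different scales
X = rng.random((100, 3)) * [365, 200, 50]

# standardize each column to zero mean and unit variance before clustering
scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(scaled)
```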
RFM Cluster Aggregation¶
# build a dataframe with means of recency, frequency
# and monetary, broken apart by cluster
cluster_data = rfm_data.groupby('RFMCluster')
cluster_data = cluster_data.agg({
'Recency': 'mean', 'Frequency': 'mean', 'Monetary': 'mean'
})
cluster_data
| RFMCluster | Recency | Frequency | Monetary |
|---|---|---|---|
| 0 | 19.583051 | 50.994689 | 12.981322 |
| 1 | 42.113269 | 224.430421 | 2.419265 |
| 2 | 222.645161 | 60.128648 | 2.507621 |
| 3 | 95.102450 | 84.252849 | 4.343776 |
# function to score recency values
def r_score(recency_value):
if recency_value <= r_quantiles[0.2]:
return 5
elif recency_value <= r_quantiles[0.4]:
return 4
elif recency_value <= r_quantiles[0.6]:
return 3
elif recency_value <= r_quantiles[0.8]:
return 2
else:
return 1
# function to score frequency values
def f_score(frequency_value):
if frequency_value <= f_quantiles[0.2]:
return 5
elif frequency_value <= f_quantiles[0.4]:
return 4
elif frequency_value <= f_quantiles[0.6]:
return 3
elif frequency_value <= f_quantiles[0.8]:
return 2
else:
return 1
# function to score monetary values
def m_score(monetary_value):
if monetary_value <= m_quantiles[0.2]:
return 1
elif monetary_value <= m_quantiles[0.4]:
return 2
elif monetary_value <= m_quantiles[0.6]:
return 3
elif monetary_value <= m_quantiles[0.8]:
return 4
else:
return 5
# calculate r quantiles for r scoring
r_quantiles = rfm_data['Recency'].quantile([0.2, 0.4, 0.6, 0.8])
# calculate f quantiles for f scoring
f_quantiles = rfm_data['Frequency'].quantile([0.2, 0.4, 0.6, 0.8])
# calculate m quantiles for m scoring
m_quantiles = rfm_data['Monetary'].quantile([0.2, 0.4, 0.6, 0.8])
# calculate r,f and m scores
rfm_data['RScore'] = rfm_data['Recency'].apply(r_score)
rfm_data['FScore'] = rfm_data['Frequency'].apply(f_score)
rfm_data['MScore'] = rfm_data['Monetary'].apply(m_score)
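The three scoring functions above hand-roll quintile binning. A more compact equivalent (not used in this notebook) is `pd.qcut` with inverted labels for recency and frequency, where lower raw values earn higher scores; one caveat is that tie handling at bin edges can differ slightly from the explicit `<=` comparisons above.

```python
import pandas as pd

recency = pd.Series([5, 40, 90, 180, 300])  # hypothetical raw recency values
# lower recency is better, so quintile labels run 5 (best) down to 1 (worst)
r_scores = pd.qcut(recency, q=5, labels=[5, 4, 3, 2, 1]).astype(int)
print(r_scores.tolist())  # [5, 4, 3, 2, 1]
```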
# calculate average RFM scores by cluster
cluster_scores = rfm_data.groupby('RFMCluster')
cluster_scores = cluster_scores.agg({'RScore': 'mean', 'FScore': 'mean', 'MScore': 'mean'})
cluster_scores = cluster_scores.round().astype(int)
# combine cluster scores into one 3-digit code
cluster_scores['ClusterScores'] = (
cluster_scores['RScore'].astype(str) +
cluster_scores['FScore'].astype(str) +
cluster_scores['MScore'].astype(str)
)
# average RFM scores by cluster
cluster_scores
| RFMCluster | RScore | FScore | MScore | ClusterScores |
|---|---|---|---|---|
| 0 | 4 | 4 | 4 | 444 |
| 1 | 3 | 1 | 2 | 312 |
| 2 | 1 | 3 | 2 | 132 |
| 3 | 2 | 2 | 2 | 222 |
RFM Cluster Plot¶
# calculate min and max date of the dataset
# to determine N
min_date = data['InvoiceDate'].min()
max_date = data['InvoiceDate'].max()
# calculate N
N = (max_date - min_date).days
# define slope and intercept for the RFM boundary line
intercept = N
slope = -1
# create a 10x6 figure
plt.figure(figsize=(10, 6))
# set up scatter plot
sns.scatterplot(
data=rfm_data, x='Frequency', y='Recency',
hue='RFMCluster', palette='viridis', s=100, alpha=0.6
)
# title, label and legend text
plt.title('Customer Segments Based on Frequency and Recency of Purchase')
plt.xlabel('Average Days Between Purchases (Frequency)')
plt.ylabel('Days Since Last Purchase (Recency)')
plt.legend(title='Segment')
# add x and y axis lines
plt.axhline(y=0, color='black')
plt.axvline(x=0, color='black')
# draw RFM Score boxes and text on centroids
for i in range(centroids.shape[0]):
    plt.text(
        centroids[i][1], centroids[i][0],
        s=cluster_scores.loc[i, 'ClusterScores'],
        bbox=dict(facecolor='white')
    )
# draw RFM boundary dashed line
axes = plt.gca()
x = np.array(axes.get_xlim())
y = intercept + slope * x
plt.plot(x, y, '--')
# save figure and show it
plt.savefig('ccs.png', bbox_inches='tight')
plt.show()