April 8, 2018

E-commerce recommendation systems: basket analysis.

Once a novelty, recommendation systems are now used by more and more e-commerce sites to help customers find products to purchase. For e-commerce business owners these tools facilitate cross-sales.

Usage

Amazon is one of the most prominent organizations that has used recommendations to increase sales. According to fortune.com, Amazon was able to increase sales by 29% in 2012 as a result of implementing a recommendation system, and 35% of Amazon’s revenue is generated by its recommendation engine (source).

Amazon uses a collaborative filtering approach.

However, there is another approach to mining items that are frequently bought together: association rule mining.

The difference is the unit of aggregation: for collaborative filtering, the unit is a user or an item.

With the user-to-user collaborative filtering approach, the algorithm finds users similar to you and suggests buying the items they bought. This is computationally difficult, as there might be many more users than items, and users may have very dissimilar interests.

With the item-to-item collaborative filtering approach, which is used by Amazon, the algorithm finds a neighborhood of items bought/viewed by you and by users who bought/viewed the same items. Items from that neighborhood are then displayed as recommendations.
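To make the idea concrete, here is a minimal sketch of item-to-item similarity over a hypothetical toy purchases dictionary (illustrative only; Amazon’s actual algorithm is more involved):

# Illustrative item-to-item sketch; `purchases` is a made-up toy input
# mapping each user to the set of items they bought.
from collections import defaultdict
import math

purchases = {
    'user1': {'item1', 'item3', 'item4'},
    'user2': {'item2', 'item3', 'item5'},
    'user3': {'item1', 'item2', 'item3', 'item5'},
}

# Invert to item -> set of users who bought it.
item_users = defaultdict(set)
for user, items in purchases.items():
    for item in items:
        item_users[item].add(user)

def similarity(a, b):
    # Cosine similarity over binary "who bought it" vectors.
    shared = len(item_users[a] & item_users[b])
    return shared / math.sqrt(len(item_users[a]) * len(item_users[b]))

# Neighborhood of item3: other items ranked by similarity to it.
print(sorted((i for i in item_users if i != 'item3'),
             key=lambda i: similarity('item3', i), reverse=True))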

The association rule unit is a session (order), which makes this approach less personalized, as users and their purchase histories are not taken into account. But it has some strong points:

  • it’s extremely simple and fast
  • it will work with very small and sparse customer bases
  • knowledge of the customer beyond what products or services they currently have is not necessary

However, there is one major drawback that makes association rule algorithms less favored in e-commerce: the inability to catch the so-called “long tail”. Online stores carry a lot of items, and some of them are bought rarely. Among these rare items there might be a pair that is of interest to a specific group of people, but association rules are unlikely to find it.

But these algorithms are useful for offline stores, where it’s not possible to track users, and they find application in other areas.

Algorithms

Apriori: introduced in 1993, more than 20 years ago. It remains one of the most important data mining algorithms, not because it is the fastest, but because it has influenced the development of many other algorithms (source).

Apriori-TID: a variation of the Apriori algorithm. It was proposed in the same article as Apriori as an alternative implementation. It produces the same output but does fewer scans of the transaction database.

Eclat: faster than Apriori, but less memory efficient.

FP-Growth: the most advanced and efficient approach to frequent pattern mining of the four.

Input parameters

  • order list: a database of all orders (transactions). Each entry is a subset of all items that are being sold (item counts are dropped here)

Example:

1 {item1, item3, item4}
2 {item2, item3, item5}
3 {item1, item2, item3, item5}
4 {item2, item5}
  • list of all possible items on sale: an optional parameter that can be determined by scanning the order list once

Example: items = {item1, item2, item3, item4, item5}

  • minimum support: minimum frequency for an item or itemset to count as frequent

Example: min_support=0.1 means that we are interested in items or itemsets that appear in at least 10% of all orders

  • minimum confidence: a parameter that says how likely item Y is to be purchased when item X is purchased, expressed as {X -> Y}. This is measured as the proportion of transactions with item X in which item Y also appears

This can be calculated like so: confidence({X -> Y}) = support({X, Y}) / support(X)

  • minimum lift (optional, as it is not used in many implementations): a parameter that says how likely item Y is to be purchased when item X is purchased, while controlling for how popular item Y is. A lift value greater than 1 means that item Y is likely to be bought if item X is bought, while a value less than 1 means that item Y is unlikely to be bought if item X is bought

The formula for lift is: lift({X -> Y}) = support({X, Y}) / (support(X) * support(Y))
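For instance, in the four-order example above, support({item2, item5}) = 3/4 = 0.75 and support(item2) = support(item5) = 0.75, so confidence({item2 -> item5}) = 0.75 / 0.75 = 1.0 (every order with item2 also contains item5) and lift({item2 -> item5}) = 0.75 / (0.75 * 0.75) ≈ 1.33 (the pair co-occurs more often than the items’ popularity alone would predict).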

Output:

A list of rules of the form (from_itemset, to_itemset, confidence, lift), where from_itemset and to_itemset are subsets of the items on sale.

Making a recommendation

A recommendation can be made, for example, at checkout, when the user has item1 and item2 in the shopping cart. In that case from_itemset = {item1, item2}, and using the list of rules we may find the respective to_itemset with maximum confidence and/or lift, as sketched below.
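A minimal sketch of that lookup, assuming the rules are stored as the tuples described above with the itemsets as frozensets:

def recommend(cart, rules):
    # Keep rules whose from_itemset is fully in the cart and whose
    # to_itemset would add something new; best confidence/lift first.
    cart = frozenset(cart)
    matching = [r for r in rules
                if r[0] <= cart and not r[1] <= cart]
    matching.sort(key=lambda r: (r[2], r[3]), reverse=True)
    return [r[1] for r in matching]

rules = [(frozenset({'item1', 'item2'}), frozenset({'item5'}), 0.8, 1.2)]
print(recommend({'item1', 'item2'}, rules))  # -> [frozenset({'item5'})]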

Apriori algorithm.

1st step:

Let the candidates_1 set contain all individual items.

Now let’s calculate support for each candidate in candidates_1.

Populate the layer_1 set with those candidates whose support is >= min_support.

2nd step:

Let candidates_2 be populated with all pairs of items from layer_1.

For example, if layer_1 = {item1, item3, item4}

then candidates_2 = [{item1, item3}, {item1, item4}, {item3, item4}]

Now let’s calculate support for each candidate in candidates_2 as in the previous step, by going through each order in the order list. Populate layer_2 with those candidates from candidates_2 whose support is >= min_support.

Nth step:

Repeat step 2 until no more candidate sets can be formed. Candidates for the next step are formed from the current step’s layer by taking unions of all possible pairs in the layer (elements of a layer should be sets to support this). This is also known as a self-join. This is how the support dictionary - a filtered set of item combinations and their respective supports - is formed.

In each step, populate the support dictionary using itemsets as keys and their corresponding supports as values.
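Putting the steps together, here is a minimal sketch of the layer-building loop (my own illustrative code; note it skips the extra pruning of candidates with infrequent subsets that full Apriori performs):

from itertools import combinations

def apriori(order_list, min_support):
    # Return {itemset: support} for all frequent itemsets.
    orders = [frozenset(o) for o in order_list]
    n = len(orders)

    def supports(candidates):
        # Fraction of orders in which each candidate appears.
        return {c: sum(c <= o for o in orders) / n for c in candidates}

    # 1st step: individual items.
    items = {frozenset({i}) for o in orders for i in o}
    layer = {c: s for c, s in supports(items).items() if s >= min_support}
    support_dict = dict(layer)

    # 2nd..Nth steps: self-join the current layer until no candidates remain.
    while layer:
        size = len(next(iter(layer))) + 1
        candidates = {a | b for a, b in combinations(layer, 2)
                      if len(a | b) == size}
        layer = {c: s for c, s in supports(candidates).items()
                 if s >= min_support}
        support_dict.update(layer)
    return support_dict

orders = [{'item1', 'item3', 'item4'}, {'item2', 'item3', 'item5'},
          {'item1', 'item2', 'item3', 'item5'}, {'item2', 'item5'}]
print(apriori(orders, 0.5))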

Association rules

The next step is to calculate all association rules. Let’s fix the hyper-parameters min_confidence and min_lift.

For each set L in the support dictionary, create all non-empty proper subsets S of L, evaluate confidence and lift for the rule {S -> L-S}, and save the rules whose parameters are greater than or equal to the minimums.

In the notation {X -> Y}, X is called the from-itemset, and Y the to-itemset.
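A matching sketch of the rule-generation step. It assumes the support dictionary holds every frequent itemset found above; since all subsets of a frequent itemset are themselves frequent, the lookups below are safe:

from itertools import combinations

def gen_rules(support_dict, min_confidence, min_lift):
    # For each frequent itemset L, try every non-empty proper subset S
    # as the from-itemset of a rule {S -> L-S}.
    for L, sup_L in support_dict.items():
        for r in range(1, len(L)):
            for S in map(frozenset, combinations(L, r)):
                confidence = sup_L / support_dict[S]
                lift = sup_L / (support_dict[S] * support_dict[L - S])
                if confidence >= min_confidence and lift >= min_lift:
                    yield (S, L - S, confidence, lift)

# e.g. rules = list(gen_rules(apriori(orders, 0.5), 0.7, 1.0)),
# reusing the apriori() sketch and example orders from above.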

Apriori-TID

  • the order list is not used at all for counting the support of candidate itemsets after the first pass.
  • the candidate itemsets are generated the same way as in the Apriori algorithm
  • rules are generated the same way after the support dictionary is populated. The only difference is that for each layer > 1 a separate list C' is created. C' holds transaction IDs with the corresponding candidate itemsets of that layer that are contained in that particular transaction.

For example, using the order list from above, the first layer would yield the following support dictionary:

item1 0.5
item2 0.75
item3 0.75 
item5 0.75

The second layer candidate set, with supports, is:

{item1, item2} 0.25
{item1, item3} 0.5 *
{item1, item5} 0.25
{item2, item3} 0.5 *
{item2, item5} 0.75 *
{item3, item5} 0.5 *

So layer_2 will consist of the itemsets marked with *. At this step a separate data structure C' is also created, mapping each transaction ID to the candidate itemsets contained in that transaction:

1 {item1, item3}
2 {item2, item3} {item2, item5} {item3, item5}
3 {item1, item2} {item1, item3} {item1, item5} {item2, item3} {item2, item5} {item3, item5}
4 {item2, item5}

Using this additional list, we will calculate support for the layer_3 candidates, as sketched below.
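A sketch of how one such pass might look (my own illustration): a layer-k candidate is contained in a transaction exactly when two of the layer-(k-1) itemsets already recorded for that transaction join to it.

def apriori_tid_pass(c_prime, candidates, n_orders, min_support):
    # c_prime: transaction ID -> candidate itemsets of the previous layer
    # contained in that transaction. Support is counted from it directly,
    # without rescanning the order list.
    counts = dict.fromkeys(candidates, 0)
    new_c_prime = {}
    for tid, itemsets in c_prime.items():
        present = {c for c in candidates
                   if any(a | b == c for a in itemsets for b in itemsets)}
        for c in present:
            counts[c] += 1
        if present:
            new_c_prime[tid] = present
    layer = {c: counts[c] / n_orders for c in candidates
             if counts[c] / n_orders >= min_support}
    return layer, new_c_prime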

Eclat

Another way to speed up support calculation is to keep a list of transaction IDs for every itemset at each layer. It speeds up support calculation even more, but these transaction lists can be large.

Again using the same order list, in the first step the transaction ID lists will be:

item1 1, 3
item2 2, 3, 4
item3 1, 2, 3
item4 1
item5 2, 3, 4

item4 won’t be included in the next layer, and these lists help us calculate the lists for all possible pairs in the next step (by intersecting them):

{item1 item2} 3
{item1 item3} 1, 3
{item1 item5} 3
{item2 item3} 2, 3
...etc
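A minimal sketch of this tid-list intersection approach (illustrative, not the linked implementation):

from itertools import combinations

def eclat(order_list, min_support):
    n = len(order_list)
    # First layer: a transaction ID set per individual item.
    tids = {}
    for tid, order in enumerate(order_list, start=1):
        for item in order:
            tids.setdefault(frozenset({item}), set()).add(tid)
    layer = {i: t for i, t in tids.items() if len(t) / n >= min_support}
    support_dict = {i: len(t) / n for i, t in layer.items()}
    # Next layers: intersect tid sets instead of rescanning orders.
    while layer:
        size = len(next(iter(layer))) + 1
        next_layer = {}
        for (a, ta), (b, tb) in combinations(layer.items(), 2):
            c = a | b
            if len(c) == size and len(ta & tb) / n >= min_support:
                next_layer[c] = ta & tb
        layer = next_layer
        support_dict.update({c: len(t) / n for c, t in layer.items()})
    return support_dict

orders = [{'item1', 'item3', 'item4'}, {'item2', 'item3', 'item5'},
          {'item1', 'item2', 'item3', 'item5'}, {'item2', 'item5'}]
print(eclat(orders, 0.5))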

FP-Growth

This one differs from Apriori-like algorithms in that candidate itemsets are not generated explicitly. FP-Growth uses a prefix tree (FP-tree) structure to encode transactions, which speeds up support calculation.

Tests

Dataset was taken from http://archive.ics.uci.edu/ml/datasets/Online+Retail Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197–208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).

Preprocessing

As each row contains InvoiceNo and StockCode, let’s group items by InvoiceNo:

import csv
from collections import defaultdict

# Map each invoice (order) to the list of stock codes it contains.
order_list = defaultdict(list)

with open('data.csv', encoding="iso-8859-1") as f:
    csv_reader = csv.DictReader(f)
    for row in csv_reader:
        order_list[row['InvoiceNo']].append(row['StockCode'])

Some stats of the dataset:

  • number of orders: 25900
  • 4070 items on sale
  • minimum order: 1 item
  • maximum order: 1114 items
  • mean order size: ~21 items
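These numbers can be reproduced from the order_list built above, e.g.:

# Quick stats over the grouped orders (assumes order_list from the
# preprocessing snippet above).
sizes = [len(items) for items in order_list.values()]
print('number of orders:', len(order_list))
print('items on sale:', len({i for o in order_list.values() for i in o}))
print('min / max / mean order size:',
      min(sizes), max(sizes), sum(sizes) / len(sizes))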

Python implementation

I’ve implemented all 4 aforementioned algorithms here. My FP-Growth needs some more testing though, so for it I’ve used Apache Spark instead.
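For reference, a sketch of running FP-Growth through Spark MLlib (pyspark.ml.fpm); the thresholds and column names here are my illustrative choices, not necessarily the settings behind the results below:

from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName('basket-analysis').getOrCreate()
# Spark's FPGrowth requires unique items per basket, hence the set().
baskets = spark.createDataFrame(
    [(invoice, list(set(items))) for invoice, items in order_list.items()],
    ['id', 'items'])
fp = FPGrowth(itemsCol='items', minSupport=0.01, minConfidence=0.5)
model = fp.fit(baskets)
model.freqItemsets.show()
model.associationRules.show()  # antecedent, consequent, confidence, ...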

Generated rules

All algorithms produced the same 3 rules, with the 2nd and 3rd involving the same pair of items.

Not very impressive findings, so it’s worth experimenting with different input parameters. Still, based on them we can recommend that someone buying “JUMBO BAG PINK POLKADOT” get a “JUMBO BAG RED RETROSPOT” as well. The same goes for teacups.

© Alexey Smirnov 2023