Simplifying Concepts

Essence of Life is in Simplicity

Apriori Algorithm

Apriori Algorithm

  • It is a scalable mining method that is it can be applied on large number of transactions.
  • It is mining algorithm designed to extract frequent itemsets.
  • The items that occur frequently in the transactions are knows as frequent items.
  • Apriori algorithm uses the prior knowledge of frequent itemset properties, hence apriori.

Certain terminologies

  • Ck: Candidate itemset of size k
  • Lk: Frequent itemset of size k
  • Transaction-Id: basically is reference to a transaction, like say receipt number of customer that would be stored in a database on buying certain items from shopping mart or mall.
  • Items: These are the commodities that are purchased like say bread, milk. In below example I1, I2… represent items.
  • Support:It states the minimum number of count of particular item or itemset that is permitted.

I would be soon explaining what Ck and Lk are once I give out the steps of the algorithm.For proper understanding we would directly dive into an example.In this case we are considering support to be 2(Support=2)

Table-1

Transaction-Id Items
T100 I1,I2,I5
T200 I2,I4
T300 I2,I3
T400 I1,I2,I4
T500 I1,I3
T600 I2,I3
T700 I1,I3
T800 I1,I2,I3,I5
T900 I1,I2,I3

(Don’t worry if you are not able to decipher these steps )

General Apriori Algorithm Steps

  • Step1: Find frequent itemset Lk+1.
  • Step2: This step is called join step. In this Ck is generated by performing join operation. A kind of cross multiply operation you may have seen in SQl. That is (Lk+1* Lk+1). No worries if you have not understood this would be clear once you start seeing the example.
  • Step3: Pruning step, in this we remove certain items based on their count.
  • Step4: A frequent itemset Lk has been achieved.

Now, moving back to our example, in this T100 represents transaction number.Initially the algorithm will scan the database and prepare a candidate set-1 (C1) table where k=1. This table basically contains the item count for each item. From Table-1 it is clear that there are five items namely I1, I2, I3, I4 and I5.

C1

Item Count
I1 6
I2 7
I3 6
I4 2
I5 2

In C1 basically we have count of the items that are present in all of the transactions. From this C1, L1 is computed or generated.

L1

Item Count
I1 6
I2 7
I3 6
I4 2
I5 2

In case if the C1 (candidate itemset1) had any item less than support count(less than 2) that would have been removed or pruned. Say if I5 were 1 it would be not there in the L1 table.

Now moving ahead C2 is generated from L1. C2 is obtained by performing join operation on L1. So the candidate set formed is candidate set of order which contains two items. In this a pair can appear only once for instance I1, I2 is considered same as I2, I1.

C2

Item Count
I1,I2 4
I1,I3 4
I1,I4 1
I1,I5 2
I2,I3 4
I2,I4 2
I2,I5 2
I3,I4 0
I3,I5 1
I4,I5 0

Now pruning is performed to obtain L2 from C2. Thus L2 contains frequent items with count >= 2.

L2

Item Count
I1,I2 4
I1,I3 4
I1,I5 2
I2,I3 4
I2,I4 2
I2,I5 2

The count is obtained from Table-1 that is how many times (I1, I2) have occurred together. So it is clear that {I4, I5}, {I1, I4}, {I3, I4}, {I3, I5} have count less than 2 so they are not placed in L2 table above.Now generate C3 that is candidate set of size 3. That is each itemset can now contain 3 items.

C3

Itemset Count
I1,I2,I3 2
I1,I2,I5 2
I1,I2,I4 1
I1,I3,I5 0
I2,I3,I4 1
I2,I4,I5 0
I1,I3,I4 0

C3 is built by join operation on L2 that is (L2*L2). So in this itemset of 3 is formed by joining (I1, I2) and (I1, I3) so itemset created would contain (I1, I2,I3). In other words in case if there is one common item we remove and if say we had (I1, I2) and (I3, I4) then it would not be put into C3 that is it would have been ignored. Similarly (I1, I2) and (I1, I5) are joined to form (I1, I2, I5) and its count is noted from Table-1.(I1, I2) and (I1, I3) are joined to form (I1, I2, I3) so its count is 2. In the similar fashion whole C3 is built.

Now L3 is built by pruning:
L3

Itemset Count
I1,I2,I3 2
I1,I2,I5 2

Proceeding further C4 is built from L3 through join operation. That is L3*L3. Since it is C4 it should contain four items in itemset.

C4

Itemset Count
I1,I2,I3,I5 1

Now L4 will have no records or itemset because count of (I1, I2, I3, I5) is 1. L4 is said to be null.The frequent itemset by definition is Lk-1 when Lk becomes null.So L4 is null the frequent itemset by definition is L3

So in other words itemset {I1, I2, I3} and (I1, I2, I5) are the most frequently brought items.This was the glimpse of Apriori algorithm. In the next post I would be giving insights on Pseudo code of Apriori algorithm and Generation of Association rules.”Click here”


Note: Hey guys please do share this knowledge because “Knowledge Grows on Sharing” but do not copy & paste to other forums or websites. If you like this do give a facebook share and fb like. Stay tuned !!!

Post a comment

error: Content is protected !!