Simplifying Concepts

Essence of Life is in Simplicity

Association Rule Mining

Apriori Algorithm

Hi guys, welcome back. In this post I will cover the Apriori algorithm’s association rules and, more specifically, its limitations. First of all I will deal with the pseudo code of the Apriori algorithm.

Pseudo Code: (Ignore this if you get confused; the steps in the previous post are sufficient)
Ck: Candidate itemset of size k.
Lk: Frequent itemset of size k.
L1 = {frequent items of size 1}
For (k = 1; Lk != null; k++)

    1. Ck+1 = candidates generated from Lk (through the join operation);
    2. For each transaction t (say T100, T200, ...) in the database, increment the count of every candidate in Ck+1 that is contained in t;
    3. Lk+1 = candidates in Ck+1 whose count >= min_support;

End
Return Union(Lk);
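
The loop above translates almost line for line into Python. Below is a minimal runnable sketch, assuming the transaction database from the Apriori example later on this page with min_support = 2; note the join here simply unions pairs of frequent k-itemsets, which after pruning yields the same Lk as the prefix-based join described in that post.

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]

    def support_count(itemset):
        # One database scan per call: count transactions containing itemset
        return sum(1 for t in transactions if itemset <= t)

    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items
          if support_count(frozenset([i])) >= min_support}
    frequent = {s: support_count(s) for s in Lk}

    k = 1
    while Lk:
        # Join step: candidates Ck+1 of size k+1 built from pairs in Lk
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune step: keep only candidates whose count meets min_support
        Lk = {c for c in Ck1 if support_count(c) >= min_support}
        frequent.update({s: support_count(s) for s in Lk})
        k += 1
    return frequent

# Transactions T100..T900 from the example in the Apriori post below
db = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
      {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
      {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
print(apriori(db, min_support=2))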

What is an Association Rule?
Representation: computer ➞ Antivirus software
An association rule depicts whether the purchase of a computer is associated with antivirus software. That is: is antivirus software bought frequently, provided the customer buys a computer? This is what association is.

An association rule thus describes whether the item on the right-hand side of the arrow is bought frequently with the item on the left-hand side. Association uses the concept of confidence. Confidence is the percentage that depicts how often a particular item is purchased together with another item. If an association has confidence >= the specified min_confidence, it is said to be a strong association. (Again, don’t worry, this will be clearer in the example below.)

Confidence(s ➞ (L - s)) = support_count(L) / support_count(s)

So from the previous post (“Click here”) we got the frequent 3-itemsets L3 = {(I1, I2, I3), (I1, I2, I5)}

Generating Association Rules:

For each frequent itemset L (L refers to an individual element of L3, i.e. (I1, I2, I3) or (I1, I2, I5)), generate all non-empty proper subsets of L.
s refers to the itemset on the left-hand side of the rule and is a non-empty proper subset of L.

The general association rule is depicted by s ➞ (L - s)

Considering L= (I1, I2, I5)
Subsets possible = ({I1}, {I2}, {I5}, {I1, I2}, {I2, I5}, {I1, I5})

When s = {I1}, from s ➞ (L - s) the rule is {I1} ➞ {I2 ^ I5}.
Confidence = support_count(L) / support_count(s) = support_count(I1, I2, I5) / support_count(I1) = 2/6 = 33.33% < 50% (support_count(I1, I2, I5) = 2 and support_count(I1) = 6, from the previous post’s Apriori algorithm example; min_confidence is taken to be 50%).
So the association {I1} ➞ {I2 ^ I5} is not a strong association.

Similarly, calculate the confidence for s = {I2}:
{I2} ➞ {I1 ^ I5}: confidence = support_count(I1, I2, I5) / support_count(I2) = 2/7 = 28.6% < 50%, so again this is not a strong association.
{I5} ➞ {I1 ^ I2}: confidence = 2/2 = 100% > 50%.
So the above association is a strong association.

{I1 ^ I2} ➞ {I5}: confidence = support_count(I1, I2, I5) / support_count(I1, I2) = 2/4 = 50%, which meets the 50% threshold, hence a strong association.
{I1 ^ I5} ➞ {I2}: confidence = 2/2 = 100% > 50%, hence a strong association.
{I2 ^ I5} ➞ {I1}: confidence = 2/2 = 100% > 50%, hence a strong association.

Now follow the same procedure to generate the association rules for L = {I1, I2, I3}.
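
If you want to check these numbers yourself, the whole rule-generation procedure condenses into a short Python sketch (a minimal illustration; the support counts are hard-coded from the previous post’s Apriori example, and min_confidence is the 50% assumed above):

from itertools import combinations

# Support counts from the previous post's Apriori example
support = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2,
}

def rules_from(L, min_confidence=0.5):
    # Generate every rule s -> (L - s) for non-empty proper subsets s of L
    L = frozenset(L)
    for size in range(1, len(L)):
        for s in map(frozenset, combinations(sorted(L), size)):
            conf = support[L] / support[s]   # support_count(L) / support_count(s)
            verdict = "strong" if conf >= min_confidence else "not strong"
            print(sorted(s), "->", sorted(L - s), f"confidence = {conf:.1%} ({verdict})")

rules_from({"I1", "I2", "I5"})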

Limitations of the Apriori Algorithm
1. It involves the generation of candidate sets of size k. Generating huge candidate sets is not feasible for a large number of items, and even less so for a large number of transactions/records.
2. Further, the database needs to be repeatedly scanned and checked against a large set of candidates by pattern matching, which is again a performance issue when the number of transactions is large.

This has led to the creation and usage of a new algorithm for frequent itemset mining known as frequent pattern growth (the FP-growth method).


Note: That’s it. For any doubts and queries you can post in the comment section below. Keep visiting and sharing this page because “Sharing is caring”.

Apriori Algorithm

  • It is a scalable mining method, that is, it can be applied to a large number of transactions.
  • It is a mining algorithm designed to extract frequent itemsets.
  • The items that occur frequently in the transactions are known as frequent items.
  • The Apriori algorithm uses prior knowledge of frequent itemset properties, hence the name Apriori.

Certain terminologies

  • Ck: Candidate itemset of size k
  • Lk: Frequent itemset of size k
  • Transaction-Id: a reference to a transaction, like say the receipt number stored in a database when a customer buys certain items from a shopping mart or mall.
  • Items: the commodities that are purchased, like say bread or milk. In the example below, I1, I2, ... represent items.
  • Support: the minimum count of a particular item or itemset that is permitted, i.e. the minimum support threshold.

I will explain what Ck and Lk are once I give out the steps of the algorithm. For proper understanding we will dive directly into an example. In this case we consider the support threshold to be 2 (support = 2).

Table-1

Transaction-Id Items
T100 I1,I2,I5
T200 I2,I4
T300 I2,I3
T400 I1,I2,I4
T500 I1,I3
T600 I2,I3
T700 I1,I3
T800 I1,I2,I3,I5
T900 I1,I2,I3

(Don’t worry if you are not able to decipher these steps.)

General Apriori Algorithm Steps

  • Step 1: Scan the database to find the frequent 1-itemset L1.
  • Step 2: The join step. In this, Ck+1 is generated by performing a join operation on Lk, that is (Lk * Lk), a kind of cross-join operation you may have seen in SQL. No worries if you have not understood; this will become clear once you start seeing the example.
  • Step 3: The pruning step, in which we remove candidate itemsets whose count falls below the support threshold.
  • Step 4: Repeat steps 2 and 3 until no new frequent itemset is found; the frequent itemsets Lk have then been achieved.

Now, moving back to our example: T100 represents a transaction number. Initially the algorithm scans the database and prepares a candidate set 1 (C1) table, where k = 1. This table contains the count of each item. From Table-1 it is clear that there are five items, namely I1, I2, I3, I4 and I5.

C1

Item Count
I1 6
I2 7
I3 6
I4 2
I5 2

In C1 we basically have the count of each item across all of the transactions. From this C1, L1 is computed or generated.

L1

Item Count
I1 6
I2 7
I3 6
I4 2
I5 2

If C1 (candidate itemset 1) had contained any item with a count below the support threshold (less than 2), that item would have been removed, or pruned. Say if the count of I5 were 1, it would not appear in the L1 table.
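
As a quick illustration, the scan that produces C1 and the prune that produces L1 take only a couple of lines of Python (a small sketch using the Table-1 transactions):

from collections import Counter

# The nine transactions of Table-1
db = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
      {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
      {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]

# C1: count every individual item across all transactions
C1 = Counter(item for t in db for item in t)

# L1: prune items whose count falls below the support threshold of 2
min_support = 2
L1 = {item: count for item, count in C1.items() if count >= min_support}
print(L1)   # all five items survive, matching the L1 table above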

Now, moving ahead, C2 is generated from L1. C2 is obtained by performing a join operation on L1, so the candidate set formed contains itemsets of two items each. A pair can appear only once; for instance (I1, I2) is considered the same as (I2, I1).

C2

Item Count
I1,I2 4
I1,I3 4
I1,I4 1
I1,I5 2
I2,I3 4
I2,I4 2
I2,I5 2
I3,I4 0
I3,I5 1
I4,I5 0

Now pruning is performed to obtain L2 from C2. Thus L2 contains the frequent 2-itemsets, those with count >= 2.

L2

Item Count
I1,I2 4
I1,I3 4
I1,I5 2
I2,I3 4
I2,I4 2
I2,I5 2

Each count is obtained from Table-1, that is, how many times the pair (e.g. I1, I2) occurred together. It is clear that {I4, I5}, {I1, I4}, {I3, I4} and {I3, I5} have counts less than 2, so they are not placed in the L2 table above. Now generate C3, the candidate set of size 3; each itemset can now contain 3 items.

C3

Itemset Count
I1,I2,I3 2
I1,I2,I5 2
I1,I3,I5 1
I2,I3,I4 0
I2,I3,I5 1
I2,I4,I5 0

C3 is built by a join operation on L2, that is (L2 * L2). Two 2-itemsets are joined only if they share their first item, and the common item is written once: joining (I1, I2) and (I1, I3) creates the itemset (I1, I2, I3). If two itemsets have no item in common, say (I1, I2) and (I3, I4), they are not joined, that is, no candidate is produced from them. Similarly, (I1, I2) and (I1, I5) are joined to form (I1, I2, I5). Each candidate’s count is then read from Table-1; for example, (I1, I2, I3) occurs in T800 and T900, so its count is 2. In this fashion the whole of C3 is built.
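
The join rule just described, merging two itemsets only when they agree on everything but their last item, can be sketched as follows (a small illustration on sorted tuples; note that (I1, I2) and (I3, I4) produce nothing, exactly as stated above):

def join(Lk):
    # Join step: merge two sorted k-itemsets that share their first k-1 items
    Lk = sorted(Lk)
    Ck1 = []
    for i, a in enumerate(Lk):
        for b in Lk[i + 1:]:
            if a[:-1] == b[:-1]:            # first k-1 items agree
                Ck1.append(a + (b[-1],))    # the common items are written once
    return Ck1

L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]
print(join(L2))   # the six C3 candidates listed in the table above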

Now L3 is built by pruning:
L3

Itemset Count
I1,I2,I3 2
I1,I2,I5 2

Proceeding further, C4 is built from L3 through the join operation, that is (L3 * L3). Since it is C4, each itemset contains four items.

C4

Itemset Count
I1,I2,I3,I5 1

Now L4 has no records or itemsets, because the count of (I1, I2, I3, I5) is 1, below the support threshold; L4 is said to be null. By definition, the frequent itemsets are those of Lk-1 when Lk becomes null. Since L4 is null, the frequent itemsets are those of L3.

So, in other words, the itemsets {I1, I2, I3} and {I1, I2, I5} are the most frequently bought items. This was a glimpse of the Apriori algorithm. In the next post I will give insights on the pseudo code of the Apriori algorithm and the generation of association rules. “Click here”


Note: Hey guys please do share this knowledge because “Knowledge Grows on Sharing” but do not copy & paste to other forums or websites. If you like this do give a facebook share and fb like. Stay tuned !!!

HITS Algorithm

  • HITS stands for Hyperlink Induced Topic Search.
  • The HITS algorithm classifies relevant web pages as authorities and hubs for a given search query.
  • It is query dependent: the scores are computed on a set of pages assembled for each search query.
  • It has been implemented by the ASK.COM search engine.

The reason and concept behind the development of the HITS algorithm:

  • Prior to the HITS algorithm, ranking systems were text based.
  • For a given query (search), keyword matches were performed.
  • The documents with the most occurrences of the keyword appeared in the results.
  • This was crude and often returned irrelevant data.
  • It also lacked synonymic capacity.
  • For instance, a user may say “automobile” to refer to a car or vehicle.
  • Finally, it would return web pages containing just the word “automobile” a billion times as the first result. So we can imagine how ridiculous it was.

In order to tackle these anomalies, the HITS algorithm was introduced. It uses the link structure of the web to discover and rank relevant pages. To understand this, let’s take an example:

  • Consider an example where a user wants to search for the top automobile manufacturers of the last 2 years.
  • The user may expect a list of top car brands as the result of the search.
  • However, while from the user’s perspective an automobile is a car, for the computer an automobile is just an automobile.
  • A semantic mapping is required (automobile = car).
  • Again, however, this is of limited use, since the search remains text based and a car manufacturer may not use the word “automobile” in its description.
  • So a different ranking mechanism is needed.
  • Thus the concept of hubs and authorities was introduced, which forms the basis of HITS.

Concepts

  • A page ‘i’ is called an authority for the query if it contains valuable information on the subject.
  • The official websites of carmakers can be considered AUTHORITATIVE.
  • These are the pages that are truly relevant to the query.
  • There is a SECOND category of pages, relevant in the process of finding the authoritative pages; these are called HUBS.
  • A HUB’s role is to advertise the authoritative pages.
  • They point the search engine in the right direction.
  • HUBS can even be blogs describing automobiles.

The HITS algorithm identifies good authorities and hubs by assigning two weights to each page, namely:

    a. Authority weight/ranking:

    • These are pages with many in-links (many links pointing to the particular URL).
    • They are also the pages pointed to by pages with HIGH HUB WEIGHT.

    b. Hub weight/ranking:

    • These are pages with many out-links (pointing or referring to other sites).
    • These pages serve as organisers of information on a topic.
    • Also, pages that point to pages with HIGH AUTHORITY have HIGH HUB WEIGHT.

Authorities and hubs have a mutual reinforcement relationship: a good hub points to many good authorities, and a good authority is pointed to by many good hubs.
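
That mutual reinforcement translates directly into an iterative update: recompute every authority weight from the hub weights, then every hub weight from the authority weights, and normalise. Below is a minimal Python sketch on a hypothetical four-page link graph (the page names are made up for illustration; a real implementation would also test for convergence instead of running a fixed number of iterations):

import math

# Hypothetical tiny web: page -> pages it links out to
links = {
    "blogA":  ["maker1", "maker2"],   # hub-like pages with many out-links
    "blogB":  ["maker1", "maker2"],
    "maker1": [],                     # authority-like pages with many in-links
    "maker2": ["maker1"],
}

hub = {p: 1.0 for p in links}
auth = {p: 1.0 for p in links}

for _ in range(20):
    # Authority update: sum of hub weights of the pages linking in
    auth = {p: sum(hub[q] for q in links if p in links[q]) for p in links}
    norm = math.sqrt(sum(v * v for v in auth.values()))
    auth = {p: v / norm for p, v in auth.items()}
    # Hub update: sum of authority weights of the pages linked to
    hub = {p: sum(auth[q] for q in links[p]) for p in links}
    norm = math.sqrt(sum(v * v for v in hub.values()))
    hub = {p: v / norm for p, v in hub.items()}

print("authority weights:", auth)   # the carmaker pages score highest
print("hub weights:", hub)          # the blog pages score highest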

That’s it for this post; more information regarding the working of the HITS algorithm and its advantages & disadvantages will be covered in my next post. “Click Here”.


Note: Hey guys please do share this knowledge because “Knowledge Grows on Sharing” but do not copy & paste to other forums or websites. If you like this do give a facebook share and fb like. Stay tuned !!!
