Clustering & Retrieval 

K-Nearest Neighbors:

The K-Nearest Neighbors algorithm allows us to identify a group of "k" elements whose features are most similar to those of a "query" element. For example, identifying 10 books that are of a similar type to "Harry Potter" or "Sherlock Holmes".

The basic concept is to compute the "distance" between the query and every other element and pick the closest "K".
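To make this concrete, here is a minimal sketch of K-Nearest Neighbors on the book example. The book names, the three feature dimensions, and their values are all hypothetical, purely for illustration:

```python
import math

def k_nearest(query, items, k):
    """Return the names of the k items closest to the query
    by Euclidean distance on their feature vectors."""
    dist = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    ranked = sorted(items.items(), key=lambda kv: dist(query, kv[1]))
    return [name for name, _ in ranked[:k]]

# Hypothetical features per book: [fantasy-ness, crime-ness, youth appeal]
books = {
    "Harry Potter":    [9, 1, 8],
    "Sherlock Holmes": [2, 9, 3],
    "The Hobbit":      [8, 1, 6],
    "Agatha Christie": [1, 8, 2],
}

# A query book that is very "fantasy": the two fantasy titles come back first.
print(k_nearest([9, 1, 7], books, 2))  # → ['Harry Potter', 'The Hobbit']
```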

Well, how does this work in the real world when there is so much noise? For example, I want to know the top 10 books on Football/Soccer in an online library. Each book contains thousands of words. The first step is identifying the keywords in the "query" book and counting those words. The second step is weighting each "feature". This is part of the feature selection/engineering concept; it requires a lot of experimentation as well as domain knowledge.

Now, how do we filter out uninformative words? TF * IDF (Term Frequency * Inverse Document Frequency)

TF = count of each unique word in the document
IDF = LOG ( Total Documents / (1 + count of documents containing the specific word) )

The IDF will down-weight (effectively eliminate) words like "a", "the", "of", etc., which are very common in English.
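A small sketch of the TF * IDF computation above, using plain Python on a tiny made-up three-document corpus (real corpora are far larger, so the effect on common words is much stronger):

```python
import math

def tf_idf(docs):
    """Compute TF * IDF for every word in every document.

    TF  = raw count of the word in the document.
    IDF = log(total docs / (1 + docs containing the word)).
    """
    n = len(docs)
    df = {}  # document frequency: how many docs contain each word
    for doc in docs:
        for word in set(doc.split()):
            df[word] = df.get(word, 0) + 1
    scores = []
    for doc in docs:
        words = doc.split()
        tf = {w: words.count(w) for w in set(words)}
        scores.append({w: c * math.log(n / (1 + df[w])) for w, c in tf.items()})
    return scores

docs = [
    "the striker scored a goal",
    "the keeper saved the shot",
    "the coach praised the team",
]
scores = tf_idf(docs)
# "the" appears in all 3 docs, so its IDF = log(3/4) is negative (pushed down),
# while "goal" appears in only 1 doc, so its IDF = log(3/2) is positive.
print(scores[0]["the"], scores[0]["goal"])
```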

This concept is not limited to books; it applies to any clustering and retrieval exercise in which features can be used to differentiate good vs. better vs. best.

For example, on Amazon you want to buy men's trail-running shoes in the price range $45-$89. You have now identified the specific features that make sense to you. But even if you did not make a specific selection and just clicked on a listing randomly, the TF-IDF concept helps retrieve similar items based on the product features as defined by the manufacturer. Isn't that cool :)

Here we re-introduce a concept of Euclidean distance that we learned in high school. The distance between two products P and Q with features (p1, p2, p3, ..., pn) and (q1, q2, q3, ..., qn) is
sqrt( (p1-q1)² + (p2-q2)² + ... + (pn-qn)² )

Well, this is not good enough, because we have not prioritized our features or assigned weights. Hence weights need to be assigned to each feature, which typically requires a subject matter expert.
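The weighted version of the distance is a small change: multiply each squared difference by that feature's weight. A minimal sketch, where the feature vectors and the expert-chosen weights are hypothetical:

```python
import math

def weighted_distance(p, q, w):
    """Euclidean distance between feature vectors p and q,
    with one weight per feature (weights would come from a domain expert)."""
    return math.sqrt(sum(wi * (pi - qi) ** 2 for pi, qi, wi in zip(p, q, w)))

p = [1.0, 5.0, 0.0]
q = [2.0, 3.0, 4.0]

print(weighted_distance(p, q, [1, 1, 1]))        # equal weights = plain Euclidean
print(weighted_distance(p, q, [2.0, 0.5, 0.1]))  # e.g. price matters more than color
```

With equal weights this reduces exactly to the formula above; changing the weights changes which products count as "close".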

Well, all looks good so far. How can we know the similarity between two products on Amazon?
Here are the steps:
1) Identify your features
2) Calculate TF and IDF for each product
3) Generate the TF * IDF vector for each product
4) For each product X, calculate the dot product Q · X of the TF * IDF vectors

   

     F1   F2   F3   F4   F5
Q     1    5    0    2    0
A     2    3    4    1    0
B     0    1    2    3    2
C     2    3    2    0    1
 
So for Product
A = 2 + 15 + 0 + 2 + 0 = 19
B = 0 + 5 + 0 + 6 + 0 = 11
C = 2 + 15 + 0 + 0 + 0 = 17

Hence, product A is the most similar to Q.


