COSC 419: Learning Analytics

A2: Document Similarity [19 pts]

Due date: Jan 31, 2020, 2:00pm (extended to Sunday Feb 02, 2020, midnight)

What to submit:

Submit the following on Canvas:


This exercise gets you to build a document profile using the content-based method described in class.

Download the set of 28 text files from alldocs.zip. For each file, preprocess the text by (i) converting everything to lowercase, (ii) removing all the stop-words (a list of 125 here) as well as punctuation and other non-word characters, and (iii) applying the Porter stemming algorithm (there are various implementations depending on the language of your choice). As an example, after preprocessing the file t11.txt, you should get the output here. Note that I applied stemming only once to produce this output, so if you applied it multiple times, you may get a shorter version. Also note that some terms were causing issues for my IDF calculation (more on this below), so I also ended up removing all words that were fewer than 3 characters long.
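If you are working in Python, a preprocessing pass along the lines of the sketch below may be a useful starting point. The folder name alldocs/, the file name stopwords.txt, and the choice of NLTK's PorterStemmer are assumptions for illustration, not requirements.

```python
# Minimal preprocessing sketch. Assumptions: the 28 files are unpacked into a
# local "alldocs/" folder, the 125 stop-words are saved one per line in
# "stopwords.txt", and NLTK is installed for its PorterStemmer.
import os
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

with open("stopwords.txt") as f:
    stopwords = set(line.strip().lower() for line in f if line.strip())

def preprocess(path):
    with open(path) as f:
        text = f.read().lower()                   # (i) lowercase
    tokens = re.findall(r"[a-z]+", text)          # (ii) drop punctuation / non-word chars
    tokens = [t for t in tokens                   #      remove stop-words and short words
              if t not in stopwords and len(t) >= 3]
    return [stemmer.stem(t) for t in tokens]      # (iii) stem exactly once

docs = {name: preprocess(os.path.join("alldocs", name))
        for name in sorted(os.listdir("alldocs")) if name.endswith(".txt")}
```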

Apply the TF-IDF algorithm to create a document vector for each text file. Note that while the TF of a term may be 0 in a particular document, its document frequency can never be 0: the vocabulary is built from all unique words across the entire document set, so every term occurs in at least one document and IDF is always well defined. Also note that log( 1 ) = 0, so any term that occurs in all 28 docs has an IDF of 0 and will need to be removed; if you still see an IDF of 0 after that, something is probably wrong in your code and you may want to double-check. In my case, only a few words of one or two letters occurred in every document, so I removed those in my preprocessing phase. My document vectors were over 3000 terms long.
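Continuing from the hypothetical docs dictionary in the sketch above, one minimal way to build the vectors is raw term counts for TF and IDF = log(N / df); treat this as a sketch of one possible weighting, not the required one.

```python
# TF-IDF sketch: raw-count TF, IDF = log(N / df) over the shared vocabulary.
# Assumes terms occurring in all N documents were already removed upstream,
# so no remaining IDF comes out as 0.
import math
from collections import Counter

N = len(docs)                                # 28 documents

# Document frequency: in how many documents each term appears (always >= 1).
df = Counter()
for tokens in docs.values():
    df.update(set(tokens))

vocab = sorted(df)                           # shared vocabulary (3000+ terms)
idf = {term: math.log(N / df[term]) for term in vocab}

# One TF-IDF vector per document, all aligned to the same vocabulary order.
tfidf = {}
for name, tokens in docs.items():
    tf = Counter(tokens)
    tfidf[name] = [tf[term] * idf[term] for term in vocab]
```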

Next, for each pair of documents, compute their cosine similarity and store the results in an NxN matrix (where N = total number of documents; in our case, N = 28). My cosine similarity matrix looks like this. Yours is unlikely to be identical.
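A NumPy sketch of the pairwise matrix, continuing from the tfidf dictionary above (the variable names are mine):

```python
# Pairwise cosine similarity matrix over all document vectors.
import numpy as np

names = sorted(tfidf)                              # fixed document order for rows/columns
X = np.array([tfidf[name] for name in names])      # shape (N, vocabulary size)

norms = np.linalg.norm(X, axis=1, keepdims=True)
sim = (X @ X.T) / (norms * norms.T)                # sim[i, j] = cos(doc_i, doc_j)
```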

Lastly, apply agglomerative hierarchical clustering on the set of documents using MIN (i.e., single linkage) as your measure of similarity between two clusters. If you are using a package to call this algorithm, then include the resulting dendrogram in your submission. Corresponding to that result, identify the set of documents (by file name) that are in each of your clusters for k=2.
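If you take the package route in Python, SciPy's single-linkage clustering is one option. It works on distances rather than similarities, so this sketch converts the matrix to 1 - sim first; it assumes the sim matrix and names list from the previous block.

```python
# Single-linkage (MIN) hierarchical clustering and dendrogram via SciPy.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

dist = 1.0 - sim
dist = (dist + dist.T) / 2.0        # enforce exact symmetry against float drift
np.fill_diagonal(dist, 0.0)         # self-distance must be exactly 0 for squareform

Z = linkage(squareform(dist), method="single")

dendrogram(Z, labels=names)
plt.tight_layout()
plt.savefig("dendrogram.png")
```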

On the other hand, if you implemented the above clustering algorithm on your own, then display the items in each of your clusters (by file name) for k=2. My implementation produced the following merges, in this order. Again, your solution might not be exactly the same as mine; just make sure it is consistent with your cosine similarity matrix.
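Either way, you need the cluster membership at k=2. If you used SciPy as sketched above, one way to read it off the linkage matrix Z is fcluster; if you merged clusters by hand, the equivalent is simply stopping your merge loop when two clusters remain.

```python
# Read off the k=2 cluster membership from the SciPy linkage result above.
from scipy.cluster.hierarchy import fcluster

labels = fcluster(Z, t=2, criterion="maxclust")    # one label (1 or 2) per document
for k in (1, 2):
    members = [name for name, lab in zip(names, labels) if lab == k]
    print(f"Cluster {k}: {', '.join(members)}")
```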

Since I implemented my algorithm manually, I created a low-tech version of the dendrogram for your reference (y-axis omitted and not to scale). Here, I'm showing all the merges (down to k=1). Note that in my solution, when k=2, one cluster contains only k2.txt while the other contains all the remaining documents, which isn't very interesting. What is interesting is that you can roughly see two main groups of docs (see the two highlighted clusters in the dendrogram). The left highlighted cluster contains mostly t*.txt files while the right highlighted cluster contains mostly k*.txt files (with t14.txt mixed in). After that, the dendrogram shows (k3, k12) being merged, then t9, t11, and t4 merging in individually, and lastly k2.

What can we make of this? The dendrogram tells us that most of the content in the t*.txt files ("teen" horror stories) is more similar to each other, and most of the content in the k*.txt files ("kids" fables) is more similar to each other, with a few exceptions. If you read the story in t14.txt, the one that got mixed into the kids cluster, you'll find that it's not very scary or gory. You can also read the story in k2.txt, which is actually Jack and the Beanstalk, and it has words like: terrified, cried, fearsome, blood, bones, more blood, and died. On a personal note, my son read this story when he was around 3 and was really scared of giants for a long time afterwards.

Grading Criteria