Assignment 5: Clustering by Language

In this assignment you are given some text written in English or non-English (Lorem Ipsum). You are to write some code to automatically cluster this text into the appropriate category. To do so, you will implement k-means clustering.

The file text.txt contains 40 lines of text written in either English or Lorem Ipsum. Your PHP code should: (1) read this text into a data structure; (2) extract from all of the text a list of unique words; (3) extract from each line of text the number of occurrences of each word found in step 2; (4) perform k-means clustering using this word count as a feature vector.

To help get you started, we provide code for steps 1-3 in kmeans.txt. I’ve named this file kmeans.txt so that your browser doesn’t try to execute the PHP code that it contains – you should copy the contents of this file into a file called kmeans.php. This code generates a data structure that you will need to implement k-means clustering. The data structure features is a 2-D array of size 40 x 907 corresponding to, respectively, the number of data points to be clustered and the number of unique words. This means that $features[$i] is a 1-D array containing the features (word count) of the i-th line of text.

In the code that we provide you should only have to add code below the comment your code here...

We have also provided two additional functions that you may find useful when implementing k-means clustering: (1) addArrays takes as input two arrays of the same length and returns a new array that is the point-wise sum of these two arrays; and (2) computeDistance takes as input two vectors of any length and returns the Euclidean distance between them. You can assume that there will be only two clusters (you are welcome to write your code in such a way that it can partition more than two clusters and then differentiate between multiple languages – this will require you to add more data, in a third language, to the file text.txt).

In my solution, I created an array called membership of length 40. The i-th entry of this array corresponds to the cluster (a value of 0 or 1) to which the i-th line of text belongs. On each iteration of the k-means clustering, you should print the value of membership so that the output of your code looks something like this:

0110000000100100000001000000100000000000

1111100001101100001101111000110001000000

1111100001101100001111111000110001100000

To keep things simple, you can hardwire your k-means to make five iterations or if you are feeling ambitious, test the previous vs. current cluster assignment to determine if you should stop iterating.

Your output may look a little different depending on the initial randomly selected clusters but if your clustering is working properly it should (usually) return the same answer as shown on the last line (because the cluster assignment of 1 or 0 is arbitrary your solution may be the same except that the values of 1 and 0 are reversed).