MET CS777
Assignment 6
Large-Scale Supervised Learning
Description
In this assignment, you will implement regularized logistic regression to classify text documents.
Data
You will be dealing with a training data set that consists of around 170,000 text documents (7.6 million lines of text) and a test/evaluation data set that consists of 18,700 text documents (almost exactly one million lines of text). All but around 6,000 of these text documents are Wikipedia pages; the remaining documents are descriptions of Australian court cases and rulings. At a high level, your task is to build a classifier that can automatically figure out whether a text document is an Australian court case or not.
We have prepared three data sets for you to use.
1. The Training Data Set (1.9 GB of text). This is the set used to train the logistic regression model.
2. The Testing Data Set (200 MB of text). This is the set used to evaluate the model.
3. The Small Data Set (37.5 MB of text). Use this set to train and test the model locally before trying anything in the cloud.
Some Data Details to Be Aware Of. You should download and look at the SmallTrainingData.txt file before you begin. You’ll see that the contents are a sort of pseudo-XML, where each text document begins with a <doc id = ... > tag and ends with </doc>.
Note that all of the Australian legal cases begin with something like <doc id = “AU1222” ...>; the doc id for an Australian legal case always starts with AU. You will be trying to figure out whether a document is an Australian legal case by looking only at the contents of the document and its document id.
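As a rough illustration, the PySpark sketch below parses each document's opening tag to recover the doc id and assigns label 1 when the id starts with AU. It assumes each document sits on a single line of the input file; if documents span several lines, those lines would first have to be grouped per document. The file path is a placeholder.

import re
from pyspark import SparkContext

sc = SparkContext(appName="A6-DataPrep")

# Placeholder path -- point this at the actual training file.
corpus = sc.textFile("SmallTrainingData.txt")

# Assumption: each document occupies a single line, wrapped in
# <doc id = "..." ...> ... </doc>.  If documents span multiple lines,
# the lines must first be grouped per document before this step.
doc_pattern = re.compile(r'<doc\s+id\s*=\s*"([^"]+)"[^>]*>(.*)</doc\s*>', re.DOTALL)

def parse_doc(line):
    m = doc_pattern.search(line)
    if m is None:
        return []
    doc_id, text = m.group(1), m.group(2)
    label = 1 if doc_id.startswith("AU") else 0   # 1 = Australian court case
    return [(doc_id, (label, text))]

# RDD of (docID, (label, documentText)) pairs
labeled_docs = corpus.flatMap(parse_doc)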
Tasks
Task 1 (10 points): Data Preparation
First, you need to write Spark code that builds a dictionary of the 20,000 most frequent words in the training corpus. This dictionary is essentially an RDD with words as the keys and the relative frequency position of each word as the value. For example, the value is zero for the most frequent word and 19,999 for the least frequent word in the dictionary.
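One possible way to build this dictionary in PySpark is sketched below. It assumes the labeled_docs RDD from the parsing sketch above and a simple lowercase, alphabetic-only tokenizer; neither choice is mandated by the assignment.

import re

# Assumption: labeled_docs is the (docID, (label, text)) RDD from the
# parsing sketch above.  The tokenizer (lowercase, alphabetic-only words)
# is one reasonable choice, not something the assignment requires.
word_regex = re.compile(r'[^a-zA-Z]')

words = labeled_docs.flatMap(
    lambda kv: word_regex.sub(' ', kv[1][1]).lower().split())

# Corpus-wide counts, then keep the 20,000 most frequent words.
word_counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
top_words = word_counts.top(20000, key=lambda kv: kv[1])

# (word, position) pairs: position 0 is the most frequent word,
# 19,999 the least frequent word that made it into the dictionary.
dictionary = sc.parallelize(
    [(w, i) for i, (w, c) in enumerate(top_words)])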
Next, you will convert each document in the training set to a TF (“term frequency”) vector with 20,000 entries. For example, for a particular document, the 177th entry of this vector is a double that captures the frequency of the 177th most common corpus word in that document. Likewise, the first entry of the vector is a double that captures the frequency, within this document, of the most common word in the corpus.
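A minimal sketch of this conversion, assuming the dictionary RDD and tokenizer from the previous sketch, and that the 20,000-entry dictionary is small enough to collect and broadcast:

import numpy as np

# Assumption: dictionary and word_regex come from the sketch above.
dict_map = sc.broadcast(dict(dictionary.collect()))

def tf_vector(text):
    # 20,000-entry term-frequency vector: entry i is the count of dictionary
    # word i in this document, divided by the number of tokens in the document.
    counts = np.zeros(20000)
    tokens = word_regex.sub(' ', text).lower().split()
    for t in tokens:
        pos = dict_map.value.get(t)
        if pos is not None:
            counts[pos] += 1.0
    return counts / max(len(tokens), 1)

# RDD of (docID, (label, tfVector)) pairs; cached because it is reused below.
tf_vectors = labeled_docs.mapValues(lambda lv: (lv[0], tf_vector(lv[1]))).cache()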
Then, create the TF-IDF matrix based on the top 20,000 words, as in the previous assignments.
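One way to attach IDF weights to the TF vectors above is sketched here; the exact TF-IDF formula should match whatever variant you used in the earlier assignments (this sketch uses log(N / n_w)).

import numpy as np

# Assumption: tf_vectors is the (docID, (label, tfVector)) RDD built above,
# and n_w is the number of documents containing word w.
num_docs = tf_vectors.count()

doc_freq = tf_vectors.map(lambda kv: np.where(kv[1][1] > 0, 1.0, 0.0)) \
                     .reduce(lambda a, b: a + b)
idf = np.log(num_docs / np.maximum(doc_freq, 1.0))

# RDD of (docID, (label, tfidfVector)) pairs
tfidf_vectors = tf_vectors.mapValues(lambda lv: (lv[0], lv[1] * idf))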
To get credit for this task, report the average TF value of the words “applicant”, “and”, “attack”, “protein”, and “court” for the court documents and for the Wikipedia documents (averaged over documents). Your code must print these outputs. This yields five numbers for the Wikipedia documents and five numbers for the court cases. Print these values for the large training data set.
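A hypothetical way to compute and print these ten averages, assuming the tf_vectors RDD and the broadcast dict_map from the sketches above:

# The word list below is exactly the one named in the task.
query_words = ["applicant", "and", "attack", "protein", "court"]
positions = [dict_map.value.get(w) for w in query_words]

for label, name in [(1, "court"), (0, "Wikipedia")]:
    subset = tf_vectors.filter(lambda kv, lab=label: kv[1][0] == lab) \
                       .map(lambda kv: kv[1][1])
    n = subset.count()
    sums = subset.reduce(lambda a, b: a + b)
    for w, pos in zip(query_words, positions):
        avg = sums[pos] / n if pos is not None else 0.0
        print("Average TF of '%s' in %s documents: %f" % (w, name, avg))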
Report how long the task takes to run.