1 Project Description
The aim of this project is to implement a machine learning model based on Naive Bayes for a
sentiment analysis task using the Rotten Tomatoes movie review dataset. Obstacles like sentence
negation, sarcasm, terseness, language ambiguity, and many others make this task very
challenging.
3 Data Description
The dataset is a corpus of movie reviews originally collected by Pang and Lee. This dataset
contains tab-separated files with phrases from the Rotten Tomatoes dataset. The data are split
into train/dev/test sets and the sentences are shuffled from their original order.
• Each sentence has a SentenceId.
• They all have been tokenized already.
The training, dev and test sets contain 7529, 1000 and 3310 sentences respectively. The sentences
are labelled on a scale of five values:
0. negative
1. somewhat negative
2. neutral
3. somewhat positive
4. positive
In the following table you can find several sentences and their sentiment score.
4 Evaluation
Systems are evaluated according to macro-F1 score, i.e. the mean of the class-wise F1-scores:
$$\text{macro-F1} = \frac{1}{N} \sum_{i=1}^{N} F1_i$$
where N is the number of classes. The F1-score is calculated for each class i:
$$F1_i = \frac{2 \cdot P_i \cdot R_i}{P_i + R_i}$$
where P_i and R_i are the precision and recall for class i.
5 Project Roadmap
1. Implement some preprocessing steps:
• You are free to add any preprocessing step (e.g. lowercasing) before training your
models. Explain what you did in your report.
• Implement a function to map the 5-value sentiment scale to a 3-value sentiment scale.
Namely, the labels “negative” (value 0) and “somewhat negative” (value 1) are merged
into label “negative” (value 0). “Neutral” (value 2) will be mapped to “neutral” (value
1). And finally, “somewhat positive” (value 3) and “positive” (value 4) will be mapped
to the label “positive” (value 2).
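The 5-to-3 label mapping described above, together with the macro-F1 metric from the Evaluation section, could be sketched in Python as follows. The names FIVE_TO_THREE, map_to_three_classes and macro_f1 are hypothetical and not prescribed by the project. The sketch also assumes that a class with no gold and no predicted instances receives an F1 of 0, which is one common convention rather than a stated requirement.

```python
from typing import Sequence

# Mapping taken from the roadmap text: 0,1 -> 0 (negative),
# 2 -> 1 (neutral), 3,4 -> 2 (positive).
FIVE_TO_THREE = {0: 0, 1: 0, 2: 1, 3: 2, 4: 2}


def map_to_three_classes(label: int) -> int:
    """Map a label on the 5-value scale to the 3-value scale."""
    return FIVE_TO_THREE[label]


def macro_f1(gold: Sequence[int], pred: Sequence[int], num_classes: int) -> float:
    """Mean of the class-wise F1-scores over num_classes classes."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    f1_scores = []
    for c in range(num_classes):
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        # 2*tp / (2*tp + fp + fn) is algebraically equal to 2PR / (P + R).
        denom = 2 * tp + fp + fn
        # Assumption: F1 is 0 when the class never occurs in gold or pred.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes
```

For example, map_to_three_classes would be applied to every training label before fitting the 3-class model, and macro_f1 would be called with num_classes set to 3 or 5 depending on the scale being evaluated.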