This repository contains all the information for the first assignment of the Data Mining 2018 Fall course. All instructions are posted below.
- Student Name: Henry
- File Name: Henry_103032027.ipynb
- First, you should attempt the take-home exercises provided in the notebook we used for the first lab session. Attempt all the exercises, as this counts towards the final grade of your first assignment (20%).
- Then, download the dataset provided in this link. The sentiment dataset contains a `sentence` and a `score` label. Read the specifications of the dataset before you start exploring it.
- Then, you are asked to apply each of the data exploration and data operation steps learned in the first lab session to the new dataset. You don't need to explain all the procedures as we did in the notebook, but you are expected to provide some minimal comments explaining your code. You are also expected to use the same libraries used in the first lab session. You are allowed to use and modify the `helper` functions we provided in the first lab session or create your own. Be aware that the helper functions may need modification, as you are dealing with a completely different dataset. This part is worth 30% of your grade!
- In addition to applying the same operations from the first lab, we ask that you attempt the following tasks on the new sentiment dataset as well (40%):
  - Use your creativity and imagination to generate new data visualizations. Refer to online resources and the Data Mining textbook for inspiration and ideas.
  - Generate TF-IDF features from the tokens of each text. Refer to this scikit-learn guide on how you may go about doing this. Keep in mind that you are generating a matrix similar to the term-document matrix we implemented in our first lab session. However, the weights are computed differently and should represent the TF-IDF value of each word per document as opposed to the word frequency.
  - Using both the TF-IDF and word frequency features, compute the similarity between random sentences and report the results. Read the "distance similarity" section of the Data Mining textbook to see which measures you can use here. Cosine similarity is one of these measures, but there are others. Explore a few of them in this exercise and report the differences in the results.
  - Lastly, implement a simple Naive Bayes classifier that automatically classifies the records into their categories. Implement this using scikit-learn's built-in classifiers, and use both the TF-IDF features and the word frequency features to build two separate classifiers. Refer to this nice article on how to build this type of classifier using scikit-learn. Report the classification accuracy of both your models. If you are struggling with this step, please reach out to us on Slack as soon as possible.
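To make the difference between word-frequency weights and TF-IDF weights concrete, here is a minimal pure-Python sketch. It is not the scikit-learn implementation: the toy sentences, the whitespace tokenizer, and the helper names (`count_vector`, `tfidf_vector`, `cosine_similarity`) are assumptions made for illustration. It uses the textbook weight tf × log(N / df), whereas scikit-learn's `TfidfVectorizer` applies smoothing and normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy sentences standing in for rows of the sentiment dataset.
docs = [
    "the food was good",
    "the service was bad",
    "good food and good service",
]

# Naive whitespace tokenization (an assumption; real text needs more care).
tokenized = [doc.split() for doc in docs]
vocab = sorted({token for tokens in tokenized for token in tokens})

# Document frequency: in how many sentences does each term occur?
n_docs = len(tokenized)
doc_freq = {term: sum(term in tokens for tokens in tokenized) for term in vocab}


def count_vector(tokens):
    """Word-frequency features: raw count of each vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def tfidf_vector(tokens):
    """Textbook TF-IDF features: count * log(N / document frequency)."""
    counts = Counter(tokens)
    return [counts[term] * math.log(n_docs / doc_freq[term]) for term in vocab]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


count_vectors = [count_vector(tokens) for tokens in tokenized]
tfidf_vectors = [tfidf_vector(tokens) for tokens in tokenized]

# Compare the first two sentences under both feature sets.
count_sim = cosine_similarity(count_vectors[0], count_vectors[1])
tfidf_sim = cosine_similarity(tfidf_vectors[0], tfidf_vectors[1])
print(f"count-based cosine similarity: {count_sim:.3f}")
print(f"TF-IDF cosine similarity:      {tfidf_sim:.3f}")
```

On these toy sentences the TF-IDF similarity comes out lower than the count-based one, because the words the two sentences share (`the`, `was`) also occur elsewhere in the collection and are down-weighted, while the rarer word `bad` gains weight.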
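For the classifier task, one possible shape of the comparison is sketched below using scikit-learn. The toy sentences, their labels, the split parameters, and the `evaluate` helper are all hypothetical; on the real dataset you would substitute your own sentences and scores, and the accuracies from eight toy sentences are not meaningful in themselves.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 1 = positive sentiment, 0 = negative sentiment.
sentences = [
    "the food was good",
    "great service and good food",
    "i loved this place",
    "really good experience",
    "the food was bad",
    "terrible service and bad food",
    "i hated this place",
    "really bad experience",
]
scores = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out a quarter of the data; stratify keeps both classes in each split.
X_train, X_test, y_train, y_test = train_test_split(
    sentences, scores, test_size=0.25, random_state=0, stratify=scores
)


def evaluate(vectorizer):
    """Fit the vectorizer and a Naive Bayes model on the training split,
    then return the accuracy on the held-out split."""
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)  # reuse the training vocabulary
    classifier = MultinomialNB()
    classifier.fit(X_train_vec, y_train)
    return accuracy_score(y_test, classifier.predict(X_test_vec))


count_accuracy = evaluate(CountVectorizer())  # word-frequency features
tfidf_accuracy = evaluate(TfidfVectorizer())  # TF-IDF features
print(f"word-frequency accuracy: {count_accuracy:.2f}")
print(f"TF-IDF accuracy:         {tfidf_accuracy:.2f}")
```

Fitting each vectorizer on the training split only, and then calling `transform` on the test split, keeps vocabulary and document-frequency statistics from leaking out of the held-out sentences.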
- Presentation matters! You are also expected to tidy up your notebook and attempt new data operations and techniques that you have learned so far in the Data Mining course. Surprise us! This segment is worth 10% of your grade. The idea of this exercise is to begin thinking of how you will program the concepts you have learned and the process that is involved.
- After completing all the above tasks, you are free to remove this header block and submit your assignment following the guide provided in the README.md file of the assignment's repository.