phlinhng / dm_2018_hw_1

This project forked from omarsar/dm_2018_hw_1

Data Mining (NTHU 2018 Fall) Assignment 1

Jupyter Notebook 99.91% Python 0.09%

Data Mining (2018 Fall) - Assignment 1

This repository contains all the information for the first assignment of the Data Mining 2018 Fall course. All instructions are posted below.

Information

Assignment Instructions

  • First, attempt the take-home exercises provided in the notebook we used for the first lab session. Attempt all the exercises, as they count towards the final grade of your first assignment (20%).

  • Then, download the dataset provided in this link. Each record in the sentiment dataset contains a sentence and a score label. Read the specifications of the dataset before you start exploring it.

  • Then, you are asked to apply each of the data exploration and data operation steps learned in the first lab session on the new dataset. You don't need to explain all the procedures as we did in the notebook, but you are expected to provide some minimal comments explaining your code. You are also expected to use the same libraries used in the first lab session. You are allowed to use and modify the helper functions we provided in the first lab session or create your own. Also, be aware that the helper functions may need modification as you are dealing with a completely different dataset. This part is worth 30% of your grade!

  • In addition to applying the same operations from the first lab, we are asking that you attempt the following tasks on the new sentiment dataset as well (40%):

    • Use your creativity and imagination to generate new data visualizations. Refer to online resources and the Data Mining textbook for inspiration and ideas.
    • Generate TF-IDF features from the tokens of each text. Refer to this scikit-learn guide on how you may go about doing this. Keep in mind that you are generating a matrix similar to the term-document matrix we implemented in our first lab session. However, the weights are computed differently and should represent the TF-IDF value of each word per document, as opposed to the word frequency.
    • Using both the TF-IDF and word frequency features, compute the similarity between randomly chosen sentences and report the results. Read the "distance similarity" section of the Data Mining textbook to see which measures you can use here. Cosine similarity is one of these measures, but there are others. Explore a few of them in this exercise and report the differences in the results.
    • Lastly, implement a simple Naive Bayes classifier that automatically classifies the records into their categories. Implement this using scikit-learn's built-in classifiers, and use the TF-IDF features and the word frequency features to build two separate classifiers. Refer to this nice article on how to build this type of classifier using scikit-learn. Report the classification accuracy of both models. If you are struggling with this step, please reach out to us on Slack as soon as possible.
  • Presentation matters! You are also expected to tidy up your notebook and attempt new data operations and techniques that you have learned so far in the Data Mining course. Surprise us! This segment is worth 10% of your grade. The idea of this exercise is to begin thinking of how you will program the concepts you have learned and the process that is involved.

  • After completing all the above tasks, you are free to remove this header block and submit your assignment following the guide provided in the README.md file of the assignment's repository.
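
As a rough starting point for the TF-IDF, similarity, and Naive Bayes tasks above, the following is a minimal sketch using scikit-learn. It is not the reference solution. The toy `sentences` and `labels` are made-up stand-ins for the real sentiment dataset, and the helper names `build_features`, `compare_similarity`, and `train_and_score` are hypothetical. Euclidean distance is used here only as one example of a second measure to compare with cosine similarity.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Assumption: made-up toy records standing in for the real sentiment dataset
# (one sentence and one score label per record).
sentences = [
    "I love this phone",
    "This movie was great",
    "The food was wonderful",
    "What a great experience",
    "I hate this phone",
    "This movie was terrible",
    "The food was awful",
    "What a bad experience",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]


def build_features(texts):
    """Return (word-frequency matrix, TF-IDF matrix) for the given texts."""
    counts = CountVectorizer().fit_transform(texts)
    tfidf = TfidfVectorizer().fit_transform(texts)
    return counts, tfidf


def compare_similarity(matrix, i, j):
    """Return (cosine similarity, Euclidean distance) between rows i and j."""
    row_i, row_j = matrix[i:i + 1], matrix[j:j + 1]  # slices stay 2-D
    cos = cosine_similarity(row_i, row_j)[0, 0]
    euc = euclidean_distances(row_i, row_j)[0, 0]
    return cos, euc


def train_and_score(vectorizer, texts, y, seed=0):
    """Train a multinomial Naive Bayes model and return its test accuracy."""
    train_txt, test_txt, y_train, y_test = train_test_split(
        texts, y, test_size=0.25, random_state=seed, stratify=y
    )
    # Fit the vectorizer on the training split only, then reuse it on the test split.
    x_train = vectorizer.fit_transform(train_txt)
    x_test = vectorizer.transform(test_txt)
    model = MultinomialNB().fit(x_train, y_train)
    return accuracy_score(y_test, model.predict(x_test))


counts, tfidf = build_features(sentences)
print("count features:", compare_similarity(counts, 0, 4))
print("tf-idf features:", compare_similarity(tfidf, 0, 4))
print("accuracy (word frequency):", train_and_score(CountVectorizer(), sentences, labels))
print("accuracy (TF-IDF):", train_and_score(TfidfVectorizer(), sentences, labels))
```

On a dataset this small the accuracy figures carry no meaning; the sketch only shows how the pieces fit together. With the real dataset, replace `sentences` and `labels` with the loaded columns and report both accuracy values.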
