GithubHelp home page GithubHelp logo

pokemondataanalysis's Introduction

Hadoop - Pokémon Data Analysis

Motivation and background

This project consists of analyzing different Pokémons’ data for different purposes. We focused on doing top ten lists and averages of certain values. We chose this project because it seemed fun and creative. We grew up watching Pokémon on TV when we were kids and it was easy for us to make this project, because we are familiar with this subject. The data set is represented by a csv file containing informations regarding Pokémons. The information is organised in 10 columns, as follows: Number, Name, Type 1, Type 2, Total, Hit Points, Attack, Defense, Special Attack, Special Defense and Speed, each of them containing different values represented by strings or numbers. For the processing part, the data set has to fulfill some well established structures. As the csv file is structured in 10 columns, there were a lot of empty cells on the Type 2 column which had to be filled with predefined notations in order to achieve the deserved form. The purpose of this project consists in 5 tasks:

Find out the top 10 Pokémons based on their total power; Find out the top 10 fastest Pokémons;
Find out the average HP (Hit Points) of all the Pokémon; Find out the top 10 Pokémon’s based on their special attack; Find out the average speed of all the Pokémon.

Computational experiments

Before we began this project, we looked upon the WordCount code to see how a basic Hadoop project works. After that, we modified the code based on our needs. Firstly, in the Mapper class we have to extract the informations we need. In order to do this, we are splitting the input data using a comma as a delimiter and then we have to select which column will be used in the processing part. According to the Text input format, we receive one line for each iteration of the Mapper. This line is split and then we collect the columns which are going to be used later in the Reducer class. For the “Top 10 Pokémons” tasks, firstly, we created a custom class named TopPokemonWritable, a Writable object that stores two values, to help us define an useful data type and to override the compareTo method. Reducer part involves using a TreeSet having the size equal to 10. We overridden the reducer method so we can add in the TreeSet up to ten values who meet the required properties. If we have more that 10 records, we remove the one with the least value for the current property as the TreeSet is sorted in descending order. This order was achieved by the override we made to the compareTo method in the TopPokemonWritable class. Without this override, the TreeSet would have been sorted in ascending order by default. This way we have obtained the top 10 records from the data set. The “Average” tasks were a little more trickier. To calculate an average, we need two values for each group: the sum of the values that we want to average and the number of values that went into the sum. These two values can be calculated in the Reducer class. The Mapper will output two columns of data, which will be processed in reduce method. For every value in valueList (i.e. see line 42 in the AverageSpeed class), we perform an addition to a predefined sum variable. For every map that comes in reducer, the cleanup method will calculate the average, so at the end we will have the main average that we need.

pokemondataanalysis's People

Contributors

margaritcatalin avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.