# VoterMLA

Machine learning algorithms to determine whether a county in the U.S. would vote for the Republican or the Democratic candidate for president, using data such as population size and density along with the candidate each county voted for in the 2016 election.

## Table of Contents

- [Dataset](#dataset)
- [Supervised-Learning](#supervised-learning)
  - [KNN](#knn)
  - [Perceptron](#perceptron)
  - [ID3](#id3)
- [Unsupervised-Learning](#unsupervised-learning)
  - [K-Means](#k-means)
  - [Agnes](#agnes)

## Dataset

The dataset used for training and [testing](https://github.com/msk610/VoterMLA/blob/master/votes-test.csv) the algorithms was taken from Kaggle. Each dataset has 10 attributes. The first column indicates Democrat and is 1 or 0: 1 means the county voted for the Democratic candidate (Clinton), and 0 means the county voted for the Republican candidate (Trump) in the 2016 election. The dataset is somewhat biased, as most of the counties used for training and testing voted for the Republican candidate. To train and test a specific algorithm, simply uncomment the algorithm that you wish to use.
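
As a rough sketch of how rows in this format could be read, assuming a simplified `County` record (the repository's actual County class and file handling may differ):

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Simplified stand-in for the repository's County data structure.
struct County {
    int democrat;               // 1 = voted Democrat (Clinton), 0 = Republican (Trump)
    std::vector<double> attrs;  // the remaining 9 attributes
};

// Reads one county per CSV row; a header row, if present, would need skipping.
std::vector<County> loadCounties(const std::string& path) {
    std::vector<County> counties;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        std::stringstream ss(line);
        std::string field;
        County c;
        bool first = true;
        while (std::getline(ss, field, ',')) {
            if (first) { c.democrat = std::stoi(field); first = false; }
            else       { c.attrs.push_back(std::stod(field)); }
        }
        if (!first) counties.push_back(c);
    }
    return counties;
}
```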

## Supervised-Learning

For this project three supervised algorithms were implemented: K-Nearest-Neighbor (KNN), Perceptron, and ID3 (decision tree). Of these, KNN performed the best, achieving an accuracy of 100% on the testing set after normalizing the data. The normalization of the data was implemented by finding the MAX, MIN, and AVG values of the dataset using Excel; of course, a simple script could be used to do that as well. The formula for normalizing the data is:

```
normalized_attr_val = (data[attr] - MIN[attr]) / (MAX[attr] - MIN[attr])
```
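
A minimal sketch of that min-max normalization in C++ (the function name and the precomputed `minv`/`maxv` vectors are my own; the repository implements this inside the County class):

```cpp
#include <cstddef>
#include <vector>

// Min-max normalization per the formula above; minv/maxv hold the
// precomputed MIN and MAX values for each attribute.
void minMaxNormalize(std::vector<double>& attrs,
                     const std::vector<double>& minv,
                     const std::vector<double>& maxv) {
    for (std::size_t i = 0; i < attrs.size(); ++i) {
        double range = maxv[i] - minv[i];
        // Guard against a constant attribute (zero range).
        attrs[i] = (range != 0.0) ? (attrs[i] - minv[i]) / range : 0.0;
    }
}
```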

Normalization is necessary for the Perceptron algorithm, and it is already implemented in the County data structure: for an instance aCounty of the County class you would simply write `aCounty.normalize()`. The following table shows the accuracy attained by the three algorithms with and without normalization of the data:

| Algorithm  | Not Normalized | Normalized |
|------------|----------------|------------|
| KNN        | 87.25%         | 100%       |
| Perceptron | N/A            | 89.96%     |
| ID3        | 86.23%         | 83.78%     |

### KNN

For the KNN algorithm, a simple Euclidean distance over the attribute values was used to determine the set of nearest neighbors. Various neighbor-set sizes were tried, and the best size according to the statistics was 7. The algorithm then predicted whichever candidate the majority of those neighbors voted for. This algorithm had the best performance with respect to accuracy, but to make a prediction for one instance it iterates through the entire dataset, so for each instance in the testing set it iterates through the whole training set.
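
A sketch of this majority-vote classifier, reusing the `County` record from the dataset sketch above (helper names are my own):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Euclidean distance between two attribute vectors of equal length.
double euclidean(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(sum);
}

// Predict 1 (Democrat) or 0 (Republican) by majority vote of the k nearest
// training instances; assumes train.size() >= k.
int knnPredict(const std::vector<County>& train,
               const std::vector<double>& query, int k = 7) {
    std::vector<std::pair<double, int>> dists;  // (distance, label)
    for (const County& c : train)
        dists.push_back({euclidean(c.attrs, query), c.democrat});
    std::partial_sort(dists.begin(), dists.begin() + k, dists.end());
    int votes = 0;
    for (int i = 0; i < k; ++i) votes += dists[i].second;
    return (2 * votes > k) ? 1 : 0;  // majority of the k neighbors
}
```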

### Perceptron

Unlike KNN, the Perceptron algorithm passes through the dataset only as many times as it is trained. In my implementation, the Perceptron was trained for about 5 passes. For this algorithm the dataset, whether training or testing, needs to be normalized, because the algorithm uses a list of weights, one per attribute. I initialized the weights to 0.1 and set a learning rate of 1; on every pass through the training set it attempts to predict 0 or 1 (i.e., which candidate) using the weight values, and if it guesses incorrectly, the weights are modified based on the learning rate and the error. After each pass I divided the learning rate by 10, since the next time through the set you want a smaller effect on the weights. I trained it for only 5 passes because that produced the best accuracy on the training set.
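
A sketch of that update rule under the stated settings; the bias term and the zero threshold are my own assumptions, as the write-up does not specify them:

```cpp
#include <cstddef>
#include <vector>

// Threshold the weighted sum at zero to predict 1 (Democrat) or 0 (Republican).
int perceptronPredict(const std::vector<double>& w, double bias,
                      const std::vector<double>& attrs) {
    double sum = bias;
    for (std::size_t i = 0; i < attrs.size(); ++i) sum += w[i] * attrs[i];
    return sum >= 0.0 ? 1 : 0;
}

// 5 passes; weights start at 0.1, the learning rate starts at 1 and is divided
// by 10 after every pass, and each miss nudges the weights by rate * error.
void perceptronTrain(std::vector<double>& w, double& bias,
                     const std::vector<County>& train) {
    double rate = 1.0;
    for (int pass = 0; pass < 5; ++pass) {
        for (const County& c : train) {
            int error = c.democrat - perceptronPredict(w, bias, c.attrs);
            if (error != 0) {
                for (std::size_t i = 0; i < w.size(); ++i)
                    w[i] += rate * error * c.attrs[i];
                bias += rate * error;
            }
        }
        rate /= 10.0;  // smaller effect on the weights each subsequent pass
    }
}
```

This would be called as, for example, `std::vector<double> w(9, 0.1); double bias = 0.1; perceptronTrain(w, bias, train);`.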

### ID3

ID3 is a decision tree that splits the bigger dataset into smaller subsets using attribute values. For example, if you set the split value to 10,000 for the population-size attribute, then every county with a larger population goes into the right node; otherwise it goes into the left node. In this implementation the nodes are called bins. The algorithm keeps splitting until there are no more attributes to split the data on. Each split is chosen by how much information would be gained by splitting the dataset, which is determined by calculating the entropy of the attribute and finding the net information gain. The initial best split that my ID3 implementation found was on the percentage of the population with a Bachelor's degree.
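
A sketch of the entropy and information-gain computation that drives those splits (the tree/bin bookkeeping is omitted, and the helper names are my own):

```cpp
#include <cmath>
#include <vector>

// Binary entropy of a set of counties, based on the Democrat/Republican split.
double entropy(const std::vector<County>& s) {
    if (s.empty()) return 0.0;
    double p = 0.0;
    for (const County& c : s) p += c.democrat;
    p /= s.size();                         // fraction that voted Democrat
    if (p == 0.0 || p == 1.0) return 0.0;  // a pure bin has no uncertainty
    return -p * std::log2(p) - (1.0 - p) * std::log2(1.0 - p);
}

// Information gained by splitting a non-empty set on attrs[attr] <= splitVal.
double infoGain(const std::vector<County>& s, int attr, double splitVal) {
    std::vector<County> left, right;
    for (const County& c : s)
        (c.attrs[attr] <= splitVal ? left : right).push_back(c);
    double weighted = (left.size() * entropy(left) +
                       right.size() * entropy(right)) / s.size();
    return entropy(s) - weighted;
}
```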

## Unsupervised-Learning

For unsupervised learning, two clustering algorithms were implemented: K-Means and Agglomerative Nesting (Agnes). For K-Means, the silhouette coefficient was computed for each cluster to determine whether the algorithm generated meaningful clusters; the closer the silhouette coefficient is to 1, the more meaningful the cluster. As expected, normalizing the data did not have any significant effect on the coefficient values. The following table shows the silhouette coefficients for the three clusters generated by K-Means (a sketch of the computation follows the table):

| K-Means Cluster | Not Normalized | Normalized |
|-----------------|----------------|------------|
| Cluster 1       | 0.993047       | 0.99826    |
| Cluster 2       | 0.986083       | 0.99928    |
| Cluster 3       | 0.998157       | 0.99779    |
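
One way to compute a per-cluster silhouette coefficient, reusing `euclidean` from the KNN sketch; the write-up does not show its exact formulation, so this standard version is an assumption:

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

using Point   = std::vector<double>;
using Cluster = std::vector<Point>;

// Mean distance from p to the other points of a cluster (skips p itself).
double meanDist(const Point& p, const Cluster& cluster) {
    double sum = 0.0;
    int n = 0;
    for (const Point& q : cluster) {
        if (&q == &p) continue;
        sum += euclidean(p, q);
        ++n;
    }
    return n ? sum / n : 0.0;
}

// Mean silhouette over one cluster: for each point, a = mean distance within
// its own cluster, b = smallest mean distance to any other cluster, and
// s = (b - a) / max(a, b).
double clusterSilhouette(std::size_t idx, const std::vector<Cluster>& clusters) {
    if (clusters[idx].empty()) return 0.0;
    double total = 0.0;
    for (const Point& p : clusters[idx]) {
        double a = meanDist(p, clusters[idx]);
        double b = std::numeric_limits<double>::max();
        for (std::size_t j = 0; j < clusters.size(); ++j)
            if (j != idx) b = std::min(b, meanDist(p, clusters[j]));
        double denom = std::max(a, b);
        total += (denom > 0.0) ? (b - a) / denom : 0.0;
    }
    return total / clusters[idx].size();
}
```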

### K-Means

K-Means is one of the simpler clustering algorithms. It begins with three random centroids from the dataset, forms clusters based on how close each instance is to the centroids, and then reinitializes the centroids based on the clusters that were formed. This process is repeated until the centroids no longer change. To determine a centroid, I calculated the mean of all the attributes and, using Euclidean distance, found the instance closest to that mean. The same process was used to assign an instance to a cluster: first find the means, then find which mean the instance is closest to.
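
A sketch of that loop, reusing `euclidean` and the `Point`/`Cluster` aliases from the silhouette sketch (k = 3, with three random instances assumed as the initial centroids):

```cpp
#include <cstddef>
#include <vector>

// Mean of all points in a non-empty cluster.
Point meanOf(const Cluster& cluster) {
    Point m(cluster[0].size(), 0.0);
    for (const Point& p : cluster)
        for (std::size_t i = 0; i < m.size(); ++i) m[i] += p[i];
    for (double& v : m) v /= cluster.size();
    return m;
}

// Repeat assign-then-recenter until the centroids stop changing; each new
// centroid is the instance closest to its cluster's mean, as described above.
std::vector<Point> kMeans(const std::vector<Point>& data,
                          std::vector<Point> centroids) {
    while (true) {
        std::vector<Cluster> clusters(centroids.size());
        // Assign each instance to its nearest centroid.
        for (const Point& p : data) {
            std::size_t best = 0;
            for (std::size_t j = 1; j < centroids.size(); ++j)
                if (euclidean(p, centroids[j]) < euclidean(p, centroids[best]))
                    best = j;
            clusters[best].push_back(p);
        }
        // Re-pick each centroid as the instance closest to its cluster's mean.
        std::vector<Point> next = centroids;
        for (std::size_t j = 0; j < clusters.size(); ++j) {
            if (clusters[j].empty()) continue;
            Point m = meanOf(clusters[j]);
            next[j] = clusters[j][0];
            for (const Point& p : clusters[j])
                if (euclidean(p, m) < euclidean(next[j], m)) next[j] = p;
        }
        if (next == centroids) return centroids;  // converged
        centroids = next;
    }
}
```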

### Agnes

This clustering algorithm builds what is essentially a tree of clusters. Every instance of the training data begins as its own node, and bigger nodes are built from these smaller nodes; the merging is repeated until a bigger node cannot be made, that is, until the node that was built contains all of the instances. This clustering algorithm takes a long time to run.
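
A rough sketch of that bottom-up merging, again reusing `euclidean` and the `Point`/`Cluster` aliases; the single-linkage distance between clusters is my assumption, as the write-up does not name a linkage:

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Single-linkage distance: the closest pair of points across two clusters.
double clusterDist(const Cluster& a, const Cluster& b) {
    double best = std::numeric_limits<double>::max();
    for (const Point& p : a)
        for (const Point& q : b)
            best = std::min(best, euclidean(p, q));
    return best;
}

// Start with every instance as its own cluster and repeatedly merge the two
// closest clusters until one cluster contains all of the instances. A real
// implementation would record each merge to build the cluster tree.
void agnes(std::vector<Cluster> clusters) {
    while (clusters.size() > 1) {
        std::size_t bi = 0, bj = 1;
        double best = std::numeric_limits<double>::max();
        // Find the pair of clusters that are closest together.
        for (std::size_t i = 0; i < clusters.size(); ++i)
            for (std::size_t j = i + 1; j < clusters.size(); ++j) {
                double d = clusterDist(clusters[i], clusters[j]);
                if (d < best) { best = d; bi = i; bj = j; }
            }
        // Merge cluster bj into cluster bi and drop bj.
        clusters[bi].insert(clusters[bi].end(),
                            clusters[bj].begin(), clusters[bj].end());
        clusters.erase(clusters.begin() + bj);
    }
}
```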
