GithubHelp home page GithubHelp logo

je-nunez / apache_spark Goto Github PK

View Code? Open in Web Editor NEW
0.0 1.0 0.0 109 KB

K-Means clustering in Apache Spark in local mode, creating a Spark RDD from one sheet of an Excel spreadsheet

License: GNU General Public License v2.0

Scala 100.00%

apache_spark's Introduction

spark

Build Status

Some tests with clustering in Apache Spark in local mode

WIP

This project is a work in progress. The implementation is incomplete and subject to change. The documentation can be inaccurate.

Description

It tries to do a K-Means clustering using Apache Spark in Scala on data from the The Center for Microeconomic Data of the Federal Reserve of New York. In particular, from the Excel workbook 2014 Housing Survey, available under https://www.newyorkfed.org/microeconomics/data.html.

The clustering tries to increase different cardinalities in the set of clusters, and to compare each set according to the measure of its Within Set Sum of Squared Errors.

Note: The program tries to download the Excel workbook of the 2014 Housing Survey directly, so it depends on the availability of the link https://www.newyorkfed.org/medialibrary/interactives/sce/sce/downloads/data/FRBNY-SCE-Housing-Module-Public-Microdata-Complete.xlsx.

Example Output

The output reported by this program is the set of clusters found, for example, reporting one cluster <#i>, for some integer i:

 Reporting center of KMeans cluster <#i>
   ...
   ... value of the columns of the K-means center in this cluster <#i>...
   ...
 Number for non-zero columns for center of KMeans cluster <#i>: <#number-non-zero-columns>

where the value of the columns have the column index, the column name, the value of that column in the K-means center, and the description of that column.

For example:

 Reporting center of KMeans cluster 0
 ...
 Column 117 HQH6a         value:  0.55962734 Last refinanced mortgage on primary residence
 Column 118 HQH6a2_1      value:  0.70164064 Terms changed when last refinanced: The interest rate was lowered
 Column 121 HQH6a2_4      value:  0.61233999 Terms changed when last refinanced: The term of the mortgage decreased
 Column 124 HQH6a2_7      value:  0.52907140 Terms changed when last refinanced: Changed mortgage servicer
 Column 154 HQH6a3part2_1 value: -0.02503919 Magnitude of change in monthly payment
 Column 156 HQH6a4_2      value:  0.13441488 Use of money from decreased monthly payment: Paid down other debt (e.g. on credit cards, auto loans, student loans, or medical bills)
 Column 158 HQH6a4_4      value:  0.10726969 Use of money from decreased monthly payment: Used it to make renovations or improvements to the home
 Column 159 HQH6a4_5      value:  0.30495740 Use of money from decreased monthly payment: Used it to pay for other expenses
 Column 160 HQH6a4_6      value:  0.46714597 Use of money from decreased monthly payment: Used it to purchase financial assets (e.g. stocks)
 Column 163 HQH6b_1       value:  0.72465210 Percent chance will refinance over next 12 months
 ...
 Number for non-zero columns for center of KMeans cluster 0: 175
 ... [the centers of the other clusters, cluster 1 and so on, reported below] ...

Note in the sample above that the columns indexed 119, 120, 122, 123, 125 to 153, 155, 157, 161 and 162, do not appear in the center of this cluster 0 found, because their value in this center is 0.0, so they are omitted in the report for the sake of brevity.

apache_spark's People

Contributors

je-nunez avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.