
# InsightCodingChallenge2015

Insight Data Engineering Coding Challenge

## Dependencies

| Name         | Version |
|--------------|---------|
| Apache Spark | 1.4.0   |
| Apache Maven | 3.3.3   |
| Java         | 1.8     |

Tested on OS X Yosemite 10.10.4 with Java 1.8.0_31.

## Instructions

1. Download Maven 3.3.3: https://maven.apache.org/download.cgi
2. Install Maven 3.3.3: https://maven.apache.org/install.html
3. Download Spark 1.4.0 from https://spark.apache.org/downloads.html (select the package pre-built for Hadoop 2.6 and later).
4. Install Spark 1.4.0: https://spark.apache.org/docs/latest/
5. Extract Spark and clone this repository into the `spark-1.4.0-bin-hadoop2.6` directory.
6. `cd` into the `InsightCodingChallenge2015` directory (inside `spark-1.4.0-bin-hadoop2.6`) and execute `./run.sh`.

## WordsTweeted Description

I chose to use Apache Spark for this problem. Spark is awesome. It lets the unique-word counting for this problem be done in parallel on a cluster, and the subsequent word sorting can also be done in parallel, so the solution scales well. Thanks to Resilient Distributed Datasets (RDDs), the system tolerates node failures. I hope to learn more about Spark and map/reduce. I implemented a version very similar to the WordCount example at http://spark.apache.org/examples.html using Java 8 lambda expressions.
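A minimal sketch of that approach against the Spark 1.4 Java API is shown below; the class name and the input/output paths are placeholders for illustration, not necessarily what `run.sh` wires up:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class WordsTweetedSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordsTweeted");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // One element per line of the input file, i.e. one tweet per element.
        // (Path is a placeholder.)
        JavaRDD<String> tweets = sc.textFile("tweet_input/tweets.txt");

        // Split each tweet on whitespace, pair every word with a 1,
        // sum the pairs per word across the cluster, then sort by word.
        JavaPairRDD<String, Integer> counts = tweets
                .flatMap(line -> Arrays.asList(line.split("\\s+")))
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b)
                .sortByKey();

        counts.saveAsTextFile("tweet_output");
        sc.stop();
    }
}
```

Here `reduceByKey` performs the parallel counting and `sortByKey` the parallel sort mentioned above; both run across the cluster's partitions.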

## MedianUnique Algorithm Description

The MedianUnique implementation is based on counting sort. We will keep track of the number of times that we have seen tweets with 1 unique word, tweets with 2 unique words, and so on.

The number of unique words in a given tweet is produced by splitting the tweet on whitespace and constructing a HashSet from the resulting array of Strings. The HashSet discards duplicate words, so the number of elements remaining is the unique-word count for that tweet.
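A minimal illustration of that step (the class and method names here are for illustration only, not taken from this repository):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class UniqueWordCount {
    // Split the tweet on whitespace; the HashSet drops duplicates,
    // so its size is the tweet's unique-word count.
    static int uniqueWordCount(String tweet) {
        Set<String> words = new HashSet<>(Arrays.asList(tweet.trim().split("\\s+")));
        return words.size();
    }

    public static void main(String[] args) {
        System.out.println(uniqueWordCount("to be or not to be"));  // prints 4
    }
}
```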

Next, since a tweet cannot be longer than 140 characters, the maximum number of unique words in a tweet is 70 (single-character words separated by single spaces), so we only need 70 buckets. We also know that every unique-word count is a natural number. Taking advantage of this, we keep a count of how many times we have seen each unique-word count from 1 to 70, and we separately track the total number of tweets seen. When we need to compute the median (after every tweet is parsed), we walk through the buckets, accumulating their counts, until the running sum is greater than or equal to half the tweets seen so far; the index of the bucket where this happens is the median. There are special cases when the average of two buckets must be computed (an even number of tweets) and when there is only one element.

Inserting a new unique-word count in this way is an O(1) operation, and computing the median involves, in the worst case, summing over 70 buckets, no matter how many tweets we have seen. Because we never need to store one number per tweet, memory usage stays small regardless of input size.
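A sketch of the bucket bookkeeping described above follows; it assumes the per-tweet unique-word count is computed as in the earlier snippet, and the class and method names are again illustrative rather than taken from this repository:

```java
public class RunningMedian {
    // A tweet is at most 140 characters, so at most 70 single-character
    // words separated by single spaces; 70 buckets suffice.
    private static final int MAX_UNIQUE_WORDS = 70;

    // buckets[i] = number of tweets seen so far with i unique words
    private final long[] buckets = new long[MAX_UNIQUE_WORDS + 1];
    private long totalTweets = 0;

    /** O(1) insert: bump the bucket for this tweet's unique-word count. */
    public void add(int uniqueWords) {
        // Clamp defensively; by the 140-character argument this never exceeds 70.
        buckets[Math.min(uniqueWords, MAX_UNIQUE_WORDS)]++;
        totalTweets++;
    }

    /** Scan at most 70 buckets to find the running median. */
    public double median() {
        if (totalTweets == 0) {
            throw new IllegalStateException("no tweets seen yet");
        }
        long lowerTarget = (totalTweets + 1) / 2;  // position of the lower middle value
        long upperTarget = totalTweets / 2 + 1;    // position of the upper middle value
        long seen = 0;
        int lower = -1;
        int upper = -1;
        for (int i = 1; i <= MAX_UNIQUE_WORDS; i++) {
            seen += buckets[i];
            if (lower < 0 && seen >= lowerTarget) {
                lower = i;
            }
            if (upper < 0 && seen >= upperTarget) {
                upper = i;
                break;
            }
        }
        // Odd count: both targets land in the same bucket.
        // Even count: the median is the average of the two middle buckets.
        return (lower + upper) / 2.0;
    }
}
```

After each tweet, a stream processor would call `add(uniqueWordCount(tweet))` and emit `median()`, producing the running median after every tweet as described above.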

## Acknowledgements

Thanks to the Apache Spark project for providing the distributed computing framework and the starting example for this word-counting solution.

Thanks to the Insight Data Engineering group for providing this opportunity.
