
This is an assignment for the Software Architecture class at the Technische Hochschule Nürnberg.

Assignment 13: Map-Reduce and Collect

This assignment covers some of the more advanced concepts of the Java 8 Streams API:

  • Map-Reduce
  • Collecting
  • Grouping-By

We'll use Map-Reduce to implement the classic word count example. The sample repository contains 3,225 tweets by Donald Trump which we will analyze in this assignment.

As an alternative, you can use the provided generator which retrieves live tweets from the Twitter API. To be able to use this generator, you have to do some additional configuration.

A clever data scientist discovered that most of the angrier tweets came from Android, whereas the nicer ones were written on an iPhone (first article and follow-up article). We will have a look at whether we can use Java's Streams to group the tweets by the kind of client that was used to create them.

Setup

  1. Create a fork of this repository (button in the upper right corner)
  2. Clone the project (get the link by clicking the green Clone or download button)
  3. Import the project into IntelliJ
  4. Read the whole assignment spec!

Remark: the given test suite is incomplete and will not succeed right after checkout!

Generators

To be able to analyze the tweets, we need a generator which loads the given tweets from a JSON file and exposes them as a Stream.

The following UML shows the class structure of the generators and the factory which exposes them.

Generator class spec

Remark: the UML is not complete; it is just meant as an implementation guideline and for orientation.

Hint: the dependency on GSON is already added to your build.gradle, and GSON can be used to deserialize the tweets:

Gson gson = new Gson();
Tweet[] tweets = gson.fromJson(reader, Tweet[].class);

Here, reader is an instance of the abstract class Reader. To access files from the resources folder, use the Class.getResourceAsStream method:

getClass().getResourceAsStream("/path/to/trump_tweets.json");

Side note: this is only possible if you are in a non-static context; otherwise you have to write something like this: MyClass.class.getResourceAsStream(...).
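
The snippets above leave out how to get from the InputStream to a Reader. One way to bridge the two (an assumption, not part of the given code) is to wrap the stream in an InputStreamReader:

// Wrap the resource InputStream in a Reader so GSON can consume it.
// UTF-8 is assumed as the encoding of the JSON file.
Reader reader = new InputStreamReader(
        getClass().getResourceAsStream("/trump_tweets.json"),
        StandardCharsets.UTF_8);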

To implement the OfflineTweetStreamGenerator follow these steps:

  1. Create the class OfflineTweetStreamGenerator.
  2. Implement the interface TweetStreamGenerator.
  3. Implement the method getTweetStream by using GSON and the helper method Arrays.stream(T[] array) to load the JSON file and create a new Stream from the deserialized tweet array.

Advanced variants:

  • Use the try-with-resources statement for the deserialization of the tweets (see the sketch below)
  • For a more functional way, read up on this Gist.
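
Putting these steps together, a minimal sketch of the generator might look like this (the interface and method names follow the UML; the resource path, charset, and error handling are assumptions):

import com.google.gson.Gson;

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.stream.Stream;

public class OfflineTweetStreamGenerator implements TweetStreamGenerator {
    @Override
    public Stream<Tweet> getTweetStream() {
        // try-with-resources: the reader is closed as soon as the array is deserialized
        try (Reader reader = new InputStreamReader(
                getClass().getResourceAsStream("/trump_tweets.json"),
                StandardCharsets.UTF_8)) {
            Tweet[] tweets = new Gson().fromJson(reader, Tweet[].class);
            return Arrays.stream(tweets);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}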

Collecting

Up to now, Streams are nice to have, but printing all results to the command line with the terminal operation forEach(...) is not really practical.

To be able to process the data further, we need to collect it, e.g. into a List<>, a Set<>, or any other Collection.

Fortunately, the Stream API already defines a method .collect(...) and the JDK contains the utility class Collectors which defines the most common collectors.
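
For example, a stream of tweets can be collected into a List like this (a minimal illustration; generator stands for any TweetStreamGenerator instance):

// Drain the stream into a List for further processing.
List<Tweet> tweets = generator.getTweetStream()
        .collect(Collectors.toList());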

Once the OfflineTweetStreamGenerator works, we can start to analyze the tweets in the class TrumpTweetStats. The following UML contains the signatures of the methods to be implemented in this assignment.

TrumpTweetStats spec

The method stubs are given already; the UML is only meant to keep the signatures in mind while you're reading the spec.

The first two tasks are the implementation of the methods:

  • calculateSourceAppStats
  • calculateTweetsBySourceApp

calculateSourceAppStats

This method groups the tweets by the app they were created with and counts how many tweets were created with which app. In SQL you would write it like this:

SELECT source, count(*) FROM tweets GROUP BY source

If you want to try it on your own: the repository contains a SQL script that creates a tweet table and inserts all the tweets into it. The script has been tested on MSSQL and PostgreSQL, but if you want to use it on MySQL or MariaDB, some additional work might be necessary.

Every Tweet instance has two methods to access the source:

  • getSource()
  • getSourceApp()

The first one returns a string like this: <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>. The second one extracts the actual app name, Twitter for iPhone. It does not matter for the assignment which one you choose, but the second one is a little prettier when the result is printed.

Side note: the task is intended to be solved with collect, but it's also possible using reduce!
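
A minimal sketch using Collectors.groupingBy with the counting downstream collector (tweets stands for the incoming Stream<Tweet>; Map<String, Long> is simply what this collector pair produces):

// Group by source app and count the tweets per group.
Map<String, Long> sourceAppStats = tweets
        .collect(Collectors.groupingBy(Tweet::getSourceApp, Collectors.counting()));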

calculateTweetsBySourceApp

This method is very similar to calculateSourceAppStats, but instead of just counting the tweets, it collects them into a Set<Tweet> for further analysis.

While the method calculateSourceAppStats was very easy to express in SQL, this method has no direct SQL equivalent, because a flat SQL result set cannot represent a map of collections (each key mapping to a set of whole rows)!

Side note: the task is intended to be solved with collect, but it's also possible using reduce!
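
Analogously, a sketch that only swaps the downstream collector for Collectors.toSet():

// Group by source app, keeping the full tweets instead of a count.
Map<String, Set<Tweet>> tweetsBySourceApp = tweets
        .collect(Collectors.groupingBy(Tweet::getSourceApp, Collectors.toSet()));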

Map-Reduce

A classical algorithm used to process huge amounts of data is Map-Reduce. As the name already indicates, the algorithm consists of two steps:

  • Map - transform the data (in parallel)
  • Reduce - retrieve all partial results and aggregate them

If you're looking for examples of Map-Reduce, the first hit will most likely be the word count problem. It's relatively simple to implement as there's not much transformation required and it demonstrates the concept very well.

We want to analyze which words are the most common in the given tweets. The following flow chart is meant as a guide to implementing the Map-Reduce algorithm.

Map-Reduce flow chart

The text of the tweet can be split like this:

String[] split = "Hello World".split("( )+");

The reduce method requires a so-called identity, the initial value of the accumulator. For this method, an instance of HashMap<> or LinkedHashMap<> seems to be a good idea. The next part is the reduction step and should be an instance of BiFunction<>: a function that receives the accumulator and a single value and returns the accumulator after the value has been processed (e.g. inserted into the map). (The last part is a combiner that merges two accumulator values; it is only invoked for parallel streams, so you won't really need it this time.)
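
Putting the pieces together, a sequential word count sketch could look like this (Tweet::getText is an assumed accessor for the tweet text; mutating the shared map is only safe here because the stream is sequential):

Map<String, Integer> wordCount = tweets
        .map(Tweet::getText)                                // assumed accessor for the tweet text
        .flatMap(text -> Arrays.stream(text.split("( )+"))) // map: text -> single words
        .reduce(new HashMap<String, Integer>(),             // identity: the empty accumulator
                (acc, word) -> {                            // reduction step (BiFunction)
                    acc.merge(word, 1, Integer::sum);
                    return acc;
                },
                (left, right) -> {                          // combiner: only used by parallel streams
                    right.forEach((w, c) -> left.merge(w, c, Integer::sum));
                    return left;
                });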

Debugging hint: recent versions of IntelliJ IDEA ship with the plugin Java Stream Debugger. The plugin visualizes how the stream is transformed step by step (including the actual data). That's very helpful if something happens that you don't expect!

Another hint: the given 3,225 tweets are a little too many to debug your Stream if something is going wrong. It may help to limit the Stream you're passing to the method calculateWordCount to e.g. 200 elements!

Last but not least: the Map-Reduce algorithm is a little bit tricky when you implement it for the first time. If you're stuck, Google can help you!

There are already some unit tests but it might be a good idea to extend the test suite.

Using the Twitter API

To be able to use Twitter4j, you have to configure it by setting the OAuth consumer key, consumer secret, access token, and access token secret in a file called twitter4j.properties located in the root of your resources folder (right next to the files stopwords.txt and trump_tweets.json). The file should have the following structure:

debug=true
oauth.consumerKey=<dummy>
oauth.consumerSecret=<dummy>
oauth.accessToken=<dummy>
oauth.accessTokenSecret=<dummy>

To get these tokens you need to register a Twitter app. Then you have to fill in some basic information about the "app" you're creating.

After registering your new app you'll be able to retrieve the required information. Copy the structure above into a new file twitter4j.properties, replace the <dummy> strings with your actual keys and secrets, and you should be able to fetch the tweets live from the API.
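
Once the properties file is on the classpath, Twitter4j picks it up automatically. A rough sketch of fetching tweets live (the screen name and page size are hypothetical examples, not part of the assignment):

import java.util.List;
import twitter4j.Paging;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;

// TwitterFactory reads twitter4j.properties from the classpath.
Twitter twitter = TwitterFactory.getSingleton();
// Fetch the most recent 200 tweets of a user (screen name is a placeholder);
// getUserTimeline throws the checked TwitterException.
List<Status> statuses = twitter.getUserTimeline("realDonaldTrump", new Paging(1, 200));
statuses.forEach(status -> System.out.println(status.getText()));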
