samujjwaal / dblp-mapreduce

Hadoop MapReduce computational model to perform analyses on DBLP publication data

Scala 82.04% Java 17.96%

hadoop mapreduce scala scalatest scala-xml sbt dblp-dataset

dblp-mapreduce's Introduction

Map Reduce on DBLP data

Description: Design and implement an instance of the Hadoop MapReduce computational model to perform analyses on DBLP publication data

Overview

This project implements a MapReduce program for parallel processing of the publicly available DBLP dataset. The dataset contains publication records, each listing its author(s) and the type of venue where it appeared (such as conferences, schools, books, and journals). Several map/reduce jobs are defined to extract insights from the dataset.

The map/reduce jobs created are:

1. List the top 10 authors published at each venue
2. List publications with only 1 author at each venue
3. List publications with the highest number of authors for each venue
4. List the top 100 authors who publish with the most co-authors (in descending order)
5. List 100 authors who publish without co-authors
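
To make the shape of these jobs concrete, here is a minimal sketch of the first job listed above, written with plain Scala collections instead of Hadoop. It is not the repository's implementation: the `Publication` record, the object name and the tie-breaking rule (alphabetical order) are assumptions made for illustration.

```scala
object TopAuthorsSketch {
  // Hypothetical in-memory record; the real jobs read records from dblp.xml.
  final case class Publication(venue: String, authors: List[String])

  /** For each venue, return up to `n` authors ranked by how often they published there. */
  def topAuthorsPerVenue(pubs: List[Publication], n: Int): Map[String, List[String]] =
    pubs
      // "Map" step: emit one (venue, author) pair per author of each publication.
      .flatMap(p => p.authors.map(a => (p.venue, a)))
      // "Shuffle" step: bring all pairs for the same venue together.
      .groupBy(_._1)
      .map { case (venue, pairs) =>
        // "Reduce" step: count publications per author, then keep the top n.
        val counts = pairs.groupBy(_._2).map { case (author, ps) => (author, ps.size) }
        // Assumption: ties are broken alphabetically so the result is deterministic.
        val ranked = counts.toList.sortBy { case (author, c) => (-c, author) }.take(n).map(_._1)
        venue -> ranked
      }
}
```

Calling `TopAuthorsSketch.topAuthorsPerVenue(pubs, 10)` on a list of such records returns one ranked author list per venue, which mirrors the key/value layout described for Job 1.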

Instructions to Execute

Set up a Hadoop environment on the target system (skip this step if Hadoop is already installed).
Generate executable jar file
- Clone this repository
- Open root folder of the project in the terminal and assemble the project jar using command:
  
  sbt clean compile assembly
  
  This command compiles the source code, runs the test cases, and builds the executable jar file “hw2_dblp_mapred.jar” in the folder “target/scala-2.13”.
Setup Hadoop environment
- Start Hadoop DFS & YARN services using:
  - Start NameNode & DataNode daemons
    
    start-dfs.sh
  - Start ResourceManager & NodeManager daemons
    
    start-yarn.sh
  - Verify if daemons are running using:
    
    jps
- Create directory in HDFS to store the input file:
  
  hdfs dfs -mkdir input
- Place the dblp.xml file in the directory created above:
  
  hdfs dfs -put path/to/dblp.xml input
Execute jar file
- Run the jar file using:
  
  hadoop jar hw2_dblp_mapred.jar job_num input
  - The argument ‘job_num’ must be supplied by the user and takes a value from 1 to 5, selecting one of the jobs listed in the Overview and detailed under MapReduce Jobs.
  - The output folder for each job's results is set in the config file ‘JobSpec.conf’ as follows:
```
# Output paths for MapReduce jobs
master_output_path = "output_hw2"
Job1_output_path = "/top_10_authors_at_venues"
Job2_output_path = "/pubs_with_1_author_at_venues"
Job3_output_path = "/pubs_with_max_authors_at_venues"
Job4_output_path = "/top_100_authors_max_coauthors"
Job5_output_path = "/100_authors_0_coauthors"
```
  - The main output folder ‘output_hw2’ must be deleted before repeating any map/reduce job; otherwise an error is raised. Delete the folder using:
    
    hdfs dfs -rm -r output_hw2
- After executing all jobs, copy the output files from HDFS into a local directory “mapreduce_output” using:
  
  hdfs dfs -get output_hw2 mapreduce_output
  
  Output of all jobs is in CSV format.
Stop Hadoop services
- Stop all daemons after execution is completed using:
```
stop-yarn.sh
stop-dfs.sh
```
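
The relationship between the ‘job_num’ argument and the output folders from the ‘JobSpec.conf’ excerpt can be sketched as follows. The path values are copied from that excerpt; the object name, the function name and the error message are hypothetical, and the real project reads these values from the config file instead of hard-coding them.

```scala
object OutputPathSketch {
  // Values copied from the JobSpec.conf excerpt in the instructions.
  val masterOutputPath = "output_hw2"
  val jobOutputPaths: Map[Int, String] = Map(
    1 -> "/top_10_authors_at_venues",
    2 -> "/pubs_with_1_author_at_venues",
    3 -> "/pubs_with_max_authors_at_venues",
    4 -> "/top_100_authors_max_coauthors",
    5 -> "/100_authors_0_coauthors"
  )

  /** Resolve the HDFS output folder for a job_num argument, or explain why it is invalid. */
  def outputPathFor(jobNumArg: String): Either[String, String] =
    jobNumArg.toIntOption            // None if the argument is not an integer
      .flatMap(jobOutputPaths.get)   // None if the integer is not 1..5
      .map(masterOutputPath + _)     // e.g. "output_hw2" + "/top_10_authors_at_venues"
      .toRight(s"job_num must be one of 1/2/3/4/5, got '$jobNumArg'")
}
```

Returning an `Either` keeps the invalid-argument case explicit, so a caller can print the message and exit before any job is submitted.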

Application Design

XML parsing
- To parse the dblp.xml file against the dblp.dtd schema, I used the multi-tag XMLInputFormatter by Mohammed Siddiq, an implementation of Mahout's XMLInputFormat that supports multiple input and output tags.
- The input and output tags are specified in the config file.
- The tags considered are:
  
  <article ,<book ,<incollection ,<inproceedings ,<mastersthesis ,<proceedings ,<phdthesis ,<www
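
As a rough illustration of how a multi-tag input format might locate the start of the next record, the sketch below searches a chunk of text for the earliest of the tags listed above. It is not the XMLInputFormatter code. It assumes the space before each comma in the list is part of the tag (so that "<book " does not match "<booktitle>") and, as a further assumption, gives "<www" the same trailing space; the object and function names are hypothetical.

```scala
object StartTagSketch {
  // Start-tag prefixes from the list above, each with an assumed trailing space.
  val startTags: List[String] = List(
    "<article ", "<book ", "<incollection ", "<inproceedings ",
    "<mastersthesis ", "<proceedings ", "<phdthesis ", "<www ")

  /** Return the earliest start tag found in `text` together with its offset, if any. */
  def firstStartTag(text: String): Option[(String, Int)] =
    startTags
      .map(tag => (tag, text.indexOf(tag)))   // indexOf yields -1 when the tag is absent
      .filter { case (_, idx) => idx >= 0 }
      .minByOption { case (_, idx) => idx }   // None when no tag occurs at all
}
```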

MapReduce Jobs

Job 1
- Mapper Class: VenueTopTenAuthorsMapper
- Reducer Class: VenueTopTenAuthorsReducer
- Output path: output_hw2/top_10_authors_at_venues
- Output format: key:<venue name> & value:<list of authors (separated by ';')>
Job 2
- Mapper Class: VenueOneAuthorMapper
- Reducer Class: VenueOneAuthorReducer
- Output path: output_hw2/pubs_with_1_author_at_venues
- Output format: key:<venue name> & value:<list of publications (separated by ';')>
Job 3
- Mapper Class: VenueTopPubMapper
- Reducer Class: VenueTopPubReducer
- Output path: output_hw2/pubs_with_max_authors_at_venues
- Output format: key:<venue name> & value:<publication name>
Job 4
- Mapper Class: CoAuthorCountMapper
- Reducer Class: MostCoAuthorCountReducer
- Output path: output_hw2/top_100_authors_max_coauthors
- Output format: key:<author name> & value:<max. number of coauthors>
Job 5
- Mapper Class: CoAuthorCountMapper
- Reducer Class: ZeroCoAuthorCountReducer
- Output path: output_hw2/100_authors_0_coauthors
- Output format: key:<author name> & value:<0>
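
Jobs 4 and 5 share one mapper and differ only in the reducer. A minimal sketch of that split, using plain Scala collections, is shown below. It reads "max. number of coauthors" as the largest co-author count an author has on any single publication, which is an assumption about the repository's logic; all names in the sketch are hypothetical.

```scala
object CoAuthorSketch {
  /** Shared step: largest co-author count per author over all publications. */
  def maxCoAuthors(pubs: List[List[String]]): Map[String, Int] =
    pubs
      // Each author of a publication with k authors has k - 1 co-authors on it.
      .flatMap(authors => authors.map(a => (a, authors.size - 1)))
      // Group by author and keep the maximum count seen for each one.
      .groupMapReduce(_._1)(_._2)(_ max _)

  /** Job 4 style: top `n` authors by co-author count, in descending order. */
  def topByCoAuthors(pubs: List[List[String]], n: Int): List[(String, Int)] =
    maxCoAuthors(pubs).toList.sortBy { case (author, c) => (-c, author) }.take(n)

  /** Job 5 style: up to `n` authors who never appear with a co-author. */
  def withoutCoAuthors(pubs: List[List[String]], n: Int): List[String] =
    maxCoAuthors(pubs).collect { case (author, 0) => author }.toList.sorted.take(n)
}
```

Keeping the counting step in one function and the selection in two small ones reflects the shared `CoAuthorCountMapper` with separate reducers.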


dblp-mapreduce's Issues

Job 1: Failed

Job 1 Failed

Compiled with SBT 1.4 on Windows
No changes to the source code
Executed on Ubuntu WSL 2

LOG:

$ hadoop jar hw2_dblp_mapred.jar 1 input
2021-05-07 18:11:03,386 INFO hw2.RunJobs$: Starting up MapReduce job..
2021-05-07 18:11:03,500 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8032
2021-05-07 18:11:04,078 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2021-05-07 18:11:04,161 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hduser/.staging/job_1620420770486_0001
2021-05-07 18:11:06,687 INFO input.FileInputFormat: Total input files to process : 1
2021-05-07 18:11:07,251 INFO mapreduce.JobSubmitter: number of splits:24
2021-05-07 18:11:08,085 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1620420770486_0001
2021-05-07 18:11:08,085 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-05-07 18:11:08,245 INFO conf.Configuration: resource-types.xml not found
2021-05-07 18:11:08,246 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2021-05-07 18:11:08,611 INFO impl.YarnClientImpl: Submitted application application_1620420770486_0001
2021-05-07 18:11:08,640 INFO mapreduce.Job: The url to track the job: http://[MYUSER].localdomain:8088/proxy/application_1620420770486_0001/
2021-05-07 18:11:08,640 INFO mapreduce.Job: Running job: job_1620420770486_0001
2021-05-07 18:11:15,697 INFO mapreduce.Job: Job job_1620420770486_0001 running in uber mode : false
2021-05-07 18:11:15,698 INFO mapreduce.Job:  map 0% reduce 0%
2021-05-07 18:11:30,885 INFO mapreduce.Job: Task Id : attempt_1620420770486_0001_m_000000_0, Status : FAILED
Error: java.util.NoSuchElementException: head of empty list
        at scala.collection.immutable.Nil$.head(List.scala:629)
        at scala.collection.immutable.Nil$.head(List.scala:628)
        at com.samujjwaal.hw2.mappers.VenueTopTenAuthorsMapper.$anonfun$map$1(VenueTopTenAuthorsMapper.scala:33)
        at com.samujjwaal.hw2.mappers.VenueTopTenAuthorsMapper.$anonfun$map$1$adapted(VenueTopTenAuthorsMapper.scala:28)
        at scala.collection.immutable.List.foreach(List.scala:333)
        at com.samujjwaal.hw2.mappers.VenueTopTenAuthorsMapper.map(VenueTopTenAuthorsMapper.scala:28)
        at com.samujjwaal.hw2.mappers.VenueTopTenAuthorsMapper.map(VenueTopTenAuthorsMapper.scala:11)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

2021-05-07 18:11:45,041 INFO mapreduce.Job:  map 1% reduce 0%
2021-05-07 18:11:58,103 INFO mapreduce.Job:  map 2% reduce 0%
2021-05-07 18:12:15,205 INFO mapreduce.Job:  map 3% reduce 0%
2021-05-07 18:12:27,277 INFO mapreduce.Job:  map 4% reduce 0%
2021-05-07 18:12:39,344 INFO mapreduce.Job:  map 5% reduce 0%
2021-05-07 18:12:53,448 INFO mapreduce.Job:  map 6% reduce 0%
2021-05-07 18:13:09,529 INFO mapreduce.Job:  map 7% reduce 0%
2021-05-07 18:13:21,595 INFO mapreduce.Job:  map 8% reduce 0%
2021-05-07 18:13:33,643 INFO mapreduce.Job:  map 9% reduce 0%
2021-05-07 18:13:44,676 INFO mapreduce.Job: Task Id : attempt_1620420770486_0001_m_000006_0, Status : FAILED
Error: java.util.NoSuchElementException: head of empty list
        at scala.collection.immutable.Nil$.head(List.scala:629)
        at scala.collection.immutable.Nil$.head(List.scala:628)
        at com.samujjwaal.hw2.mappers.VenueTopTenAuthorsMapper.$anonfun$map$1(VenueTopTenAuthorsMapper.scala:33)
        at com.samujjwaal.hw2.mappers.VenueTopTenAuthorsMapper.$anonfun$map$1$adapted(VenueTopTenAuthorsMapper.scala:28)
        at scala.collection.immutable.List.foreach(List.scala:333)
        at com.samujjwaal.hw2.mappers.VenueTopTenAuthorsMapper.map(VenueTopTenAuthorsMapper.scala:28)
        at com.samujjwaal.hw2.mappers.VenueTopTenAuthorsMapper.map(VenueTopTenAuthorsMapper.scala:11)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

2021-05-07 18:13:45,679 INFO mapreduce.Job:  map 8% reduce 0%
2021-05-07 18:13:51,708 INFO mapreduce.Job:  map 9% reduce 0%
2021-05-07 18:13:52,728 INFO mapreduce.Job: Task Id : attempt_1620420770486_0001_m_000000_1, Status : FAILED
Error: java.util.NoSuchElementException: head of empty list
        at scala.collection.immutable.Nil$.head(List.scala:629)
        at scala.collection.immutable.Nil$.head(List.scala:628)
        at com.samujjwaal.hw2.mappers.VenueTopTenAuthorsMapper.$anonfun$map$1(VenueTopTenAuthorsMapper.scala:33)
        at com.samujjwaal.hw2.mappers.VenueTopTenAuthorsMapper.$anonfun$map$1$adapted(VenueTopTenAuthorsMapper.scala:28)
        at scala.collection.immutable.List.foreach(List.scala:333)
        at com.samujjwaal.hw2.mappers.VenueTopTenAuthorsMapper.map(VenueTopTenAuthorsMapper.scala:28)
        at com.samujjwaal.hw2.mappers.VenueTopTenAuthorsMapper.map(VenueTopTenAuthorsMapper.scala:11)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

2021-05-07 18:14:06,788 INFO mapreduce.Job: Task Id : attempt_1620420770486_0001_m_000005_0, Status : FAILED
Error: java.util.NoSuchElementException: head of empty list
        at scala.collection.immutable.Nil$.head(List.scala:629)
        at scala.collection.immutable.Nil$.head(List.scala:628)
        at com.samujjwaal.hw2.mappers.VenueTopTenAuthorsMapper.$anonfun$map$1(VenueTopTenAuthorsMapper.scala:33)
        at com.samujjwaal.hw2.mappers.VenueTopTenAuthorsMapper.$anonfun$map$1$adapted(VenueTopTenAuthorsMapper.scala:28)
        at scala.collection.immutable.List.foreach(List.scala:333)
        at com.samujjwaal.hw2.mappers.VenueTopTenAuthorsMapper.map(VenueTopTenAuthorsMapper.scala:28)
        at com.samujjwaal.hw2.mappers.VenueTopTenAuthorsMapper.map(VenueTopTenAuthorsMapper.scala:11)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

2021-05-07 18:14:07,794 INFO mapreduce.Job:  map 7% reduce 0%
2021-05-07 18:14:09,801 INFO mapreduce.Job:  map 8% reduce 0%
2021-05-07 18:14:15,820 INFO mapreduce.Job: Task Id : attempt_1620420770486_0001_m_000000_2, Status : FAILED
Error: java.util.NoSuchElementException: head of empty list
        at scala.collection.immutable.Nil$.head(List.scala:629)
        at scala.collection.immutable.Nil$.head(List.scala:628)
        at com.samujjwaal.hw2.mappers.VenueTopTenAuthorsMapper.$anonfun$map$1(VenueTopTenAuthorsMapper.scala:33)
        at com.samujjwaal.hw2.mappers.VenueTopTenAuthorsMapper.$anonfun$map$1$adapted(VenueTopTenAuthorsMapper.scala:28)
        at scala.collection.immutable.List.foreach(List.scala:333)
        at com.samujjwaal.hw2.mappers.VenueTopTenAuthorsMapper.map(VenueTopTenAuthorsMapper.scala:28)
        at com.samujjwaal.hw2.mappers.VenueTopTenAuthorsMapper.map(VenueTopTenAuthorsMapper.scala:11)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

2021-05-07 18:14:27,889 INFO mapreduce.Job:  map 9% reduce 0%
2021-05-07 18:14:39,927 INFO mapreduce.Job:  map 10% reduce 0%
2021-05-07 18:14:51,963 INFO mapreduce.Job:  map 11% reduce 0%
2021-05-07 18:15:04,008 INFO mapreduce.Job:  map 12% reduce 0%
2021-05-07 18:15:12,038 INFO mapreduce.Job:  map 14% reduce 0%
2021-05-07 18:15:15,053 INFO mapreduce.Job:  map 17% reduce 0%
2021-05-07 18:15:20,068 INFO mapreduce.Job:  map 100% reduce 100%
2021-05-07 18:15:22,075 INFO mapreduce.Job: Job job_1620420770486_0001 failed with state FAILED due to: Task failed task_1620420770486_0001_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0

2021-05-07 18:15:22,130 INFO mapreduce.Job: Counters: 39
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=137105630
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=536871973
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=12
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
                HDFS: Number of bytes read erasure-coded=0
        Job Counters
                Failed map tasks=6
                Killed map tasks=19
                Killed reduce tasks=1
                Launched map tasks=14
                Other local map tasks=5
                Data-local map tasks=9
                Total time spent by all maps in occupied slots (ms)=1445398
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=1445398
                Total vcore-milliseconds taken by all map tasks=1445398
                Total megabyte-milliseconds taken by all map tasks=1480087552
        Map-Reduce Framework
                Map input records=1030940
                Map output records=3175471
                Map output bytes=129695240
                Map output materialized bytes=136046206
                Input split bytes=452
                Combine input records=0
                Spilled Records=3175471
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=10286
                CPU time spent (ms)=907880
                Physical memory (bytes) snapshot=2259881984
                Virtual memory (bytes) snapshot=10277044224
                Total committed heap usage (bytes)=1890058240
                Peak Map Physical memory (bytes)=586129408
                Peak Map Virtual memory (bytes)=2577076224
        File Input Format Counters
                Bytes Read=536871521
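
The repeated `NoSuchElementException: head of empty list` in the trace above is raised by a `.head` call on an empty `List` at VenueTopTenAuthorsMapper.scala:33. The mapper source is not shown here, so the following is only a sketch of the failure mode and one defensive pattern, assuming the empty list comes from a record that lacks the element the mapper expects. All names are hypothetical.

```scala
object SafeHeadSketch {
  /** First venue-like value of a record, or None when the record has none. */
  def firstVenue(candidates: List[String]): Option[String] =
    candidates.headOption // `candidates.head` would throw NoSuchElementException on Nil

  /** Emit (venue, author) pairs only for records that actually carry a venue. */
  def venueAuthorPairs(venues: List[String], authors: List[String]): List[(String, String)] =
    firstVenue(venues) match {
      case Some(venue) => authors.map(a => (venue, a))
      case None        => Nil // skip the record instead of failing the whole map task
    }
}
```

With `headOption`, a record without a venue contributes no output rather than failing the task attempt, which is what causes the job to abort after the retries seen in the log.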

Job 3: Failed

Job 3 Failed

Compiled with SBT 1.4 on Windows
No changes to the source code
Executed on Ubuntu WSL 2

LOG

$ hadoop jar hw2_dblp_mapred.jar 3 input
2021-05-07 19:30:59,751 INFO hw2.RunJobs$: Starting up MapReduce job..
2021-05-07 19:30:59,806 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8032
2021-05-07 19:31:00,034 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2021-05-07 19:31:00,135 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hduser/.staging/job_1620420770486_0003
2021-05-07 19:31:02,476 INFO input.FileInputFormat: Total input files to process : 1
2021-05-07 19:31:03,467 INFO mapreduce.JobSubmitter: number of splits:24
2021-05-07 19:31:03,717 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1620420770486_0003
2021-05-07 19:31:03,717 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-05-07 19:31:03,837 INFO conf.Configuration: resource-types.xml not found
2021-05-07 19:31:03,838 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2021-05-07 19:31:03,880 INFO impl.YarnClientImpl: Submitted application application_1620420770486_0003
2021-05-07 19:31:03,905 INFO mapreduce.Job: The url to track the job: http://DESKTOP-Marcos.localdomain:8088/proxy/application_1620420770486_0003/
2021-05-07 19:31:03,905 INFO mapreduce.Job: Running job: job_1620420770486_0003
2021-05-07 19:31:09,963 INFO mapreduce.Job: Job job_1620420770486_0003 running in uber mode : false
2021-05-07 19:31:09,964 INFO mapreduce.Job:  map 0% reduce 0%
2021-05-07 19:31:23,130 INFO mapreduce.Job: Task Id : attempt_1620420770486_0003_m_000000_0, Status : FAILED
Error: java.util.NoSuchElementException: head of empty list
        at scala.collection.immutable.Nil$.head(List.scala:629)
        at scala.collection.immutable.Nil$.head(List.scala:628)
        at com.samujjwaal.hw2.mappers.VenueTopPubMapper.map(VenueTopPubMapper.scala:28)
        at com.samujjwaal.hw2.mappers.VenueTopPubMapper.map(VenueTopPubMapper.scala:8)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

2021-05-07 19:31:33,230 INFO mapreduce.Job:  map 1% reduce 0%
2021-05-07 19:31:36,262 INFO mapreduce.Job: Task Id : attempt_1620420770486_0003_m_000000_1, Status : FAILED
Error: java.util.NoSuchElementException: head of empty list
        at scala.collection.immutable.Nil$.head(List.scala:629)
        at scala.collection.immutable.Nil$.head(List.scala:628)
        at com.samujjwaal.hw2.mappers.VenueTopPubMapper.map(VenueTopPubMapper.scala:28)
        at com.samujjwaal.hw2.mappers.VenueTopPubMapper.map(VenueTopPubMapper.scala:8)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

2021-05-07 19:31:52,422 INFO mapreduce.Job:  map 2% reduce 0%
2021-05-07 19:32:10,556 INFO mapreduce.Job:  map 3% reduce 0%
2021-05-07 19:32:22,884 INFO mapreduce.Job:  map 4% reduce 0%
2021-05-07 19:32:39,991 INFO mapreduce.Job:  map 5% reduce 0%
2021-05-07 19:32:52,045 INFO mapreduce.Job:  map 6% reduce 0%
2021-05-07 19:33:07,146 INFO mapreduce.Job:  map 7% reduce 0%
2021-05-07 19:33:22,240 INFO mapreduce.Job:  map 8% reduce 0%
2021-05-07 19:33:34,302 INFO mapreduce.Job:  map 9% reduce 0%
2021-05-07 19:33:52,393 INFO mapreduce.Job:  map 10% reduce 0%
2021-05-07 19:34:00,436 INFO mapreduce.Job: Task Id : attempt_1620420770486_0003_m_000006_0, Status : FAILED
Error: java.util.NoSuchElementException: head of empty list
        at scala.collection.immutable.Nil$.head(List.scala:629)
        at scala.collection.immutable.Nil$.head(List.scala:628)
        at com.samujjwaal.hw2.mappers.VenueTopPubMapper.map(VenueTopPubMapper.scala:28)
        at com.samujjwaal.hw2.mappers.VenueTopPubMapper.map(VenueTopPubMapper.scala:8)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

2021-05-07 19:34:01,445 INFO mapreduce.Job:  map 9% reduce 0%
2021-05-07 19:34:10,480 INFO mapreduce.Job:  map 10% reduce 0%
2021-05-07 19:34:10,481 INFO mapreduce.Job: Task Id : attempt_1620420770486_0003_m_000000_2, Status : FAILED
Error: java.util.NoSuchElementException: head of empty list
        at scala.collection.immutable.Nil$.head(List.scala:629)
        at scala.collection.immutable.Nil$.head(List.scala:628)
        at com.samujjwaal.hw2.mappers.VenueTopPubMapper.map(VenueTopPubMapper.scala:28)
        at com.samujjwaal.hw2.mappers.VenueTopPubMapper.map(VenueTopPubMapper.scala:8)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

2021-05-07 19:34:11,492 INFO mapreduce.Job: Task Id : attempt_1620420770486_0003_m_000005_0, Status : FAILED
Error: java.util.NoSuchElementException: head of empty list
        at scala.collection.immutable.Nil$.head(List.scala:629)
        at scala.collection.immutable.Nil$.head(List.scala:628)
        at com.samujjwaal.hw2.mappers.VenueTopPubMapper.map(VenueTopPubMapper.scala:28)
        at com.samujjwaal.hw2.mappers.VenueTopPubMapper.map(VenueTopPubMapper.scala:8)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

2021-05-07 19:34:12,497 INFO mapreduce.Job:  map 8% reduce 0%
2021-05-07 19:34:22,571 INFO mapreduce.Job:  map 100% reduce 100%
2021-05-07 19:34:24,579 INFO mapreduce.Job: Job job_1620420770486_0003 failed with state FAILED due to: Task failed task_1620420770486_0003_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0

2021-05-07 19:34:24,627 INFO mapreduce.Job: Counters: 14
        Job Counters
                Failed map tasks=6
                Killed map tasks=23
                Killed reduce tasks=1
                Launched map tasks=11
                Other local map tasks=4
                Data-local map tasks=7
                Total time spent by all maps in occupied slots (ms)=1135720
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=1135720
                Total vcore-milliseconds taken by all map tasks=1135720
                Total megabyte-milliseconds taken by all map tasks=1162977280
        Map-Reduce Framework
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0

Development Environment

Hi, my name is Marcos.

I found your project while searching for Hadoop examples and want to explore it further.
Would you describe how you set up the development environment for this project?

Thank you in advance!
Marcos

Refactor 10 lines occurring 2 times in 2 files: MostCoAuthorCountReducer.scala, ZeroCoAuthorCountReducer.scala

I've selected for refactoring 10 lines of code which are duplicated in 2 files. Addressing this will make our codebase more maintainable and improve Better Code Hub's Write Code Once guideline rating! 👍

Here's the gist of this guideline:

Definition 📖
Do not copy code.
Why❓
When code is copied, bugs need to be fixed in multiple places. This is both inefficient and a source of regression bugs.
How 🔧
Avoid duplication by never copy/pasting blocks of code, and reduce duplication by extracting shared code, either to a new unit or to a superclass if the language permits.

You can find more info about this guideline in Building Maintainable Software. 📖

ℹ️ To know how many other refactoring candidates need addressing to get a guideline compliant, select some by clicking on the 🔲 next to them. The risk profile below the candidates signals (✅) when it's enough! 🏁

Good luck and happy coding! ✨ 💯

Job 2: Failed

Job 2 Failed

Compiled with SBT 1.4 on Windows
No changes to the source code
Executed on Ubuntu WSL 2

LOG:

$ hadoop jar hw2_dblp_mapred.jar 2 input
2021-05-07 18:22:02,295 INFO hw2.RunJobs$: Starting up MapReduce job..
2021-05-07 18:22:02,354 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8032
2021-05-07 18:22:02,583 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2021-05-07 18:22:02,669 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hduser/.staging/job_1620420770486_0002
2021-05-07 18:22:05,246 INFO input.FileInputFormat: Total input files to process : 1
2021-05-07 18:22:05,810 INFO mapreduce.JobSubmitter: number of splits:24
2021-05-07 18:22:06,035 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1620420770486_0002
2021-05-07 18:22:06,035 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-05-07 18:22:06,158 INFO conf.Configuration: resource-types.xml not found
2021-05-07 18:22:06,158 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2021-05-07 18:22:06,204 INFO impl.YarnClientImpl: Submitted application application_1620420770486_0002
2021-05-07 18:22:06,233 INFO mapreduce.Job: The url to track the job: http://[MYUSER].localdomain:8088/proxy/application_1620420770486_0002/
2021-05-07 18:22:06,234 INFO mapreduce.Job: Running job: job_1620420770486_0002
2021-05-07 18:22:12,284 INFO mapreduce.Job: Job job_1620420770486_0002 running in uber mode : false
2021-05-07 18:22:12,284 INFO mapreduce.Job:  map 0% reduce 0%
2021-05-07 18:22:36,479 INFO mapreduce.Job:  map 1% reduce 0%
2021-05-07 18:22:48,545 INFO mapreduce.Job:  map 2% reduce 0%
2021-05-07 18:23:06,653 INFO mapreduce.Job:  map 3% reduce 0%
2021-05-07 18:23:18,751 INFO mapreduce.Job:  map 4% reduce 0%
2021-05-07 18:23:30,821 INFO mapreduce.Job:  map 5% reduce 0%
2021-05-07 18:23:42,913 INFO mapreduce.Job:  map 6% reduce 0%
2021-05-07 18:24:01,016 INFO mapreduce.Job:  map 7% reduce 0%
2021-05-07 18:24:13,079 INFO mapreduce.Job:  map 8% reduce 0%
2021-05-07 18:24:25,130 INFO mapreduce.Job:  map 9% reduce 0%
2021-05-07 18:24:37,168 INFO mapreduce.Job:  map 10% reduce 0%
2021-05-07 18:24:49,211 INFO mapreduce.Job:  map 11% reduce 0%
2021-05-07 18:25:01,260 INFO mapreduce.Job:  map 12% reduce 0%
2021-05-07 18:25:19,327 INFO mapreduce.Job:  map 13% reduce 0%
2021-05-07 18:25:31,393 INFO mapreduce.Job:  map 14% reduce 0%
2021-05-07 18:25:43,422 INFO mapreduce.Job:  map 15% reduce 0%
2021-05-07 18:25:55,465 INFO mapreduce.Job:  map 16% reduce 0%
2021-05-07 18:25:59,493 INFO mapreduce.Job:  map 19% reduce 0%
2021-05-07 18:26:06,523 INFO mapreduce.Job:  map 20% reduce 0%
2021-05-07 18:26:07,534 INFO mapreduce.Job:  map 22% reduce 0%
2021-05-07 18:26:18,576 INFO mapreduce.Job:  map 24% reduce 0%
2021-05-07 18:26:19,581 INFO mapreduce.Job:  map 25% reduce 0%
2021-05-07 18:26:28,619 INFO mapreduce.Job:  map 26% reduce 0%
2021-05-07 18:26:35,660 INFO mapreduce.Job:  map 26% reduce 8%
2021-05-07 18:26:38,680 INFO mapreduce.Job: Task Id : attempt_1620420770486_0002_m_000010_0, Status : FAILED
Error: java.util.NoSuchElementException: head of empty list
        at scala.collection.immutable.Nil$.head(List.scala:629)
        at scala.collection.immutable.Nil$.head(List.scala:628)
        at com.samujjwaal.hw2.mappers.VenueOneAuthorMapper.map(VenueOneAuthorMapper.scala:31)
        at com.samujjwaal.hw2.mappers.VenueOneAuthorMapper.map(VenueOneAuthorMapper.scala:8)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

2021-05-07 18:26:47,724 INFO mapreduce.Job:  map 27% reduce 8%
2021-05-07 18:26:58,778 INFO mapreduce.Job: Task Id : attempt_1620420770486_0002_m_000010_1, Status : FAILED
Error: java.util.NoSuchElementException: head of empty list
        at scala.collection.immutable.Nil$.head(List.scala:629)
        at scala.collection.immutable.Nil$.head(List.scala:628)
        at com.samujjwaal.hw2.mappers.VenueOneAuthorMapper.map(VenueOneAuthorMapper.scala:31)
        at com.samujjwaal.hw2.mappers.VenueOneAuthorMapper.map(VenueOneAuthorMapper.scala:8)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

2021-05-07 18:27:09,836 INFO mapreduce.Job:  map 28% reduce 8%
2021-05-07 18:27:27,913 INFO mapreduce.Job:  map 29% reduce 8%
2021-05-07 18:27:45,966 INFO mapreduce.Job:  map 30% reduce 8%
2021-05-07 18:28:04,032 INFO mapreduce.Job:  map 31% reduce 8%
2021-05-07 18:28:21,072 INFO mapreduce.Job:  map 32% reduce 8%
2021-05-07 18:28:39,139 INFO mapreduce.Job:  map 33% reduce 8%
2021-05-07 18:28:59,777 INFO mapreduce.Job:  map 34% reduce 8%
2021-05-07 18:29:17,832 INFO mapreduce.Job:  map 35% reduce 8%
2021-05-07 18:29:35,891 INFO mapreduce.Job:  map 36% reduce 8%
2021-05-07 18:29:40,907 INFO mapreduce.Job: Task Id : attempt_1620420770486_0002_m_000009_0, Status : FAILED
Error: java.util.NoSuchElementException: head of empty list
        at scala.collection.immutable.Nil$.head(List.scala:629)
        at scala.collection.immutable.Nil$.head(List.scala:628)
        at com.samujjwaal.hw2.mappers.VenueOneAuthorMapper.map(VenueOneAuthorMapper.scala:31)
        at com.samujjwaal.hw2.mappers.VenueOneAuthorMapper.map(VenueOneAuthorMapper.scala:8)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

2021-05-07 18:29:41,921 INFO mapreduce.Job:  map 33% reduce 8%
2021-05-07 18:29:47,940 INFO mapreduce.Job:  map 34% reduce 8%
2021-05-07 18:29:53,283 INFO mapreduce.Job:  map 37% reduce 8%
2021-05-07 18:29:54,292 INFO mapreduce.Job:  map 37% reduce 10%
2021-05-07 18:30:00,313 INFO mapreduce.Job:  map 37% reduce 11%
2021-05-07 18:30:03,340 INFO mapreduce.Job: Task Id : attempt_1620420770486_0002_m_000010_2, Status : FAILED
Error: java.util.NoSuchElementException: head of empty list
        at scala.collection.immutable.Nil$.head(List.scala:629)
        at scala.collection.immutable.Nil$.head(List.scala:628)
        at com.samujjwaal.hw2.mappers.VenueOneAuthorMapper.map(VenueOneAuthorMapper.scala:31)
        at com.samujjwaal.hw2.mappers.VenueOneAuthorMapper.map(VenueOneAuthorMapper.scala:8)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:178)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:172)

2021-05-07 18:30:06,346 INFO mapreduce.Job:  map 38% reduce 11%
2021-05-07 18:30:12,370 INFO mapreduce.Job:  map 38% reduce 13%
2021-05-07 18:30:19,385 INFO mapreduce.Job:  map 39% reduce 13%
2021-05-07 18:30:25,396 INFO mapreduce.Job:  map 100% reduce 100%
2021-05-07 18:30:27,402 INFO mapreduce.Job: Job job_1620420770486_0002 failed with state FAILED due to: Task failed task_1620420770486_0002_m_000010
Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0

2021-05-07 18:30:27,463 INFO mapreduce.Job: Counters: 43
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=38575671
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=1207962496
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=27
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
                HDFS: Number of bytes read erasure-coded=0
        Job Counters
                Failed map tasks=5
                Killed map tasks=14
                Killed reduce tasks=1
                Launched map tasks=18
                Launched reduce tasks=1
                Other local map tasks=4
                Data-local map tasks=14
                Total time spent by all maps in occupied slots (ms)=2679662
                Total time spent by all reduces in occupied slots (ms)=245579
                Total time spent by all map tasks (ms)=2679662
                Total time spent by all reduce tasks (ms)=245579
                Total vcore-milliseconds taken by all map tasks=2679662
                Total vcore-milliseconds taken by all reduce tasks=245579
                Total megabyte-milliseconds taken by all map tasks=2743973888
                Total megabyte-milliseconds taken by all reduce tasks=251472896
        Map-Reduce Framework
                Map input records=2364905
                Map output records=417710
                Map output bytes=35339828
                Map output materialized bytes=36191841
                Input split bytes=1017
                Combine input records=0
                Spilled Records=417710
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=25146
                CPU time spent (ms)=2043870
                Physical memory (bytes) snapshot=5030256640
                Virtual memory (bytes) snapshot=23105875968
                Total committed heap usage (bytes)=4161798144
                Peak Map Physical memory (bytes)=567414784
                Peak Map Virtual memory (bytes)=2582102016
        File Input Format Counters
                Bytes Read=1207961479
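
All four failed attempts in the log above stop at the same frame: `Nil.head` called from `VenueOneAuthorMapper.map` (VenueOneAuthorMapper.scala:31). That suggests at least one input record produced an empty list where the mapper expected one or more elements. The mapper's source is not shown here, so the following is only a minimal sketch of the usual guard, assuming the mapper reads a venue from a list of extracted values. `SafeHeadSketch`, `firstVenue`, and the sample records are hypothetical names and data, not code from the repository.

```scala
// Hypothetical sketch: names and record shape are assumptions, not the repository's code.
object SafeHeadSketch {
  // List.head on Nil throws NoSuchElementException("head of empty list"),
  // which is the error in the trace. headOption returns None instead.
  def firstVenue(venues: List[String]): Option[String] =
    venues.headOption.map(_.trim).filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    val records: List[List[String]] = List(
      List("ICSE"), // record with a venue
      Nil,          // record with no venue: .head would throw here
      List("TSE")
    )
    // Keep only the records whose venue could be read; skip the rest
    // instead of failing the whole map task.
    val emitted = records.flatMap(firstVenue)
    println(emitted)
  }
}
```

In a real mapper the `None` case would typically skip the record (and perhaps increment a counter) rather than write to the context, so a single malformed record does not fail the task and, after repeated attempts, the job.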

