Comments (15)
I cannot find enough information to root-cause the problem.
Do you have the dictionary file /usr/share/dict/linux.words on your Linux machines?
You can also try the Bayes benchmark to see whether it works. If it does not, a missing /usr/share/dict/linux.words may be the root cause.
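A quick per-node check can be scripted; this is a minimal sketch (the `check_dict` helper name is made up, and the install hint assumes an RPM-based distro):

```shell
# Report whether the dictionary file the data generator reads exists.
check_dict() {
  local dict="${1:-/usr/share/dict/linux.words}"
  if [ -f "$dict" ]; then
    echo "present"
  else
    echo "missing: install the 'words' package (e.g. yum install words)"
  fi
}
check_dict   # run this on every node of the cluster
```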
from hibench.
I have /usr/share/dict/linux.words.
[hdfs@r720-h1 bin]$ ls /usr/share/dict/
linux.words words
I am following the steps listed in the README.md file: set up the configuration for nutchindexing in configure.sh (I am using the defaults, so I didn't change anything), then run the prepare.sh script for nutchindexing to set up the dataset for the benchmark. But I ran into the failures above. Am I missing any steps?
I have hdfs and mapreduce services running on the Hadoop cluster.
I can provide more information. Let me know what you need.
Madhura
from hibench.
Can someone help me to get the nutchindexing benchmark working?
Madhura
from hibench.
We hope to reproduce the problem in-house and then debug it. Could you share your basic running configuration?
e.g., cluster size, data size, parameter settings in the configuration files, Hadoop confs, etc.
from hibench.
A 3-node cluster, with the name node and a data node running on one machine, and the other two machines being data nodes.
Configuration file settings for the nutch indexing benchmark:
PAGES=10000000
NUM_MAPS=96
NUM_REDS=48
I did not change any of the other configuration file parameters. I am not sure about the data size, but I am guessing that the dataset created depends on the number of PAGES listed in the configuration file.
Madhura
from hibench.
I am also including the stack trace from the failure (from prepare.sh), if it will help:
attempt_201403100956_0024_m_000027_0: java.io.FileNotFoundException: File does not exist: urls-0/data
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:824)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.io.SequenceFile$Reader.&lt;init&gt;(SequenceFile.java:1704)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.io.MapFile$Reader.createDataFileReader(MapFile.java:452)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:426)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.io.MapFile$Reader.&lt;init&gt;(MapFile.java:396)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.io.MapFile$Reader.&lt;init&gt;(MapFile.java:405)
attempt_201403100956_0024_m_000027_0: at HiBench.Utils.getSharedMapFile(Utils.java:196)
attempt_201403100956_0024_m_000027_0: at HiBench.NutchData$CreateNutchPages.configure(NutchData.java:307)
attempt_201403100956_0024_m_000027_0: at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
attempt_201403100956_0024_m_000027_0: at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
attempt_201403100956_0024_m_000027_0: at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
attempt_201403100956_0024_m_000027_0: at java.lang.reflect.Method.invoke(Method.java:597)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
attempt_201403100956_0024_m_000027_0: at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
attempt_201403100956_0024_m_000027_0: at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
attempt_201403100956_0024_m_000027_0: at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
attempt_201403100956_0024_m_000027_0: at java.lang.reflect.Method.invoke(Method.java:597)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:413)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
attempt_201403100956_0024_m_000027_0: at java.security.AccessController.doPrivileged(Native Method)
attempt_201403100956_0024_m_000027_0: at javax.security.auth.Subject.doAs(Subject.java:396)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.mapred.Child.main(Child.java:262)
14/03/10 12:33:25 INFO mapred.JobClient: Task Id : attempt_201403100956_0024_m_000005_0, Status : FAILED
from hibench.
Can you tell us which Hadoop version and JDK version you are using?
Here is a brief description of the nutch indexing workflow as it relates to the missing urls-0/data, along with some checks to try:
1. prepare.sh runs two MapReduce jobs. From the info you provided, it seems the error came from the second job and the first job succeeded. Is that the case?
2. If the first job completed correctly, there should be content in /HiBench/Nutch/temp/urls/part-#####/data (##### represents numbers) and /HiBench/Nutch/temp/urls/part-#####/index (the index may be deleted after the second job finishes). Would you please verify this?
3. The second job uses DistributedCache to cache each /HiBench/Nutch/temp/urls/part-#####/ directory on every node. It calls createSymlink(), which creates a link named urls-# pointing to the cached part-##### directory on each node. The link is created under the task's current working directory: FileSystem fs = FileSystem.getLocal(job); fs.getWorkingDirectory() gives the directory where the urls-# link exists.
Would you please verify that DistributedCache actually caches the directories on the nodes, and check whether there is any problem creating the links that point to them? (The problem you reported suggests the link is not created, or does not point to the directory as expected.)
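The link mechanism can be mimicked from the shell to see what a task expects to find; this is a sketch with throwaway paths, not HiBench code:

```shell
# Simulate what the task runner does: symlink a cached part-00000
# directory into a task work dir under the name urls-0, then read the
# data file through the link the way the MapFile reader would.
root=$(mktemp -d)
mkdir -p "$root/distcache/part-00000" "$root/work"
echo "dummy-data" > "$root/distcache/part-00000/data"
echo "dummy-index" > "$root/distcache/part-00000/index"
ln -s "$root/distcache/part-00000" "$root/work/urls-0"
cat "$root/work/urls-0/data"   # prints: dummy-data
```

If the equivalent read through the real urls-0 link fails on a task node, the problem is in the link or the cached directory rather than in the job's own logic.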
from hibench.
Answers to your questions:
- Hadoop version - I am using Cloudera Distribution of Hadoop version 4.8.1
- JDK version -
java version "1.7.0_51"
OpenJDK Runtime Environment (rhel-2.4.4.1.el6_5-x86_64 u51-b02)
OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)
- The first job within prepare.sh succeeded.
- The contents of /HiBench/Nutch/temp/urls/ are as expected:
[hdfs@r720-h1 bin]$ hdfs dfs -ls /HiBench/Nutch/temp/urls/part*/
Found 2 items
-rw-r--r-- 3 hdfs supergroup 12471032 2014-03-13 13:54 /HiBench/Nutch/temp/urls/part-00000/data
-rw-r--r-- 3 hdfs supergroup 7263 2014-03-13 13:54 /HiBench/Nutch/temp/urls/part-00000/index
Found 2 items
-rw-r--r-- 3 hdfs supergroup 12474468 2014-03-13 13:54 /HiBench/Nutch/temp/urls/part-00001/data
-rw-r--r-- 3 hdfs supergroup 7261 2014-03-13 13:54 /HiBench/Nutch/temp/urls/part-00001/index
- I am not sure how to check the current working directory, so I couldn't check whether the symlinks exist. Is there a way to do this from the command line? Also, you mentioned DistributedCache usage. I assume the cache is available for use by default and I don't need to actually configure it?
from hibench.
First, it is strange that you set NUM_MAPS=96 but only got 2 part-xxxxx files/folders under the "urls" directory. It should contain 96 folders in total (i.e., part-00000 ~ part-00095).
Second, the symbolic links used to access the hashed URLs do not seem to work on your system; otherwise you would not see the "file does not exist" exception.
So would you help check how the first issue happened? If there really are only 2 part-xxxxx folders, I'm afraid the first job produced wrong (or at least not enough) data. We should rule that out first and then dig into the second issue.
I suggest setting a smaller number of PAGES for a fast test. I'm afraid a size of 10000000 might be a little large for your 3-node cluster (depending on the hardware you use). Maybe try 1000 first, simply as a test.
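For instance, a quick-test setting in configure.sh might look like the following (PAGES, NUM_MAPS, and NUM_REDS are the names quoted earlier in this thread; the smaller values are only illustrative):

```shell
# Quick sanity-test settings for the nutchindexing workload
PAGES=1000      # small dataset just to validate the pipeline
NUM_MAPS=6      # e.g. ~2 map slots per node on a 3-node cluster
NUM_REDS=3
```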
from hibench.
Allan,
I do see all 96 parts; I just didn't list them all in my comment. I listed the first couple just as an illustration of how the parts show up. So, the first job definitely completes successfully.
Also, someone else responded saying that the links pointing into the distributed cache are not being created, and asked me to check whether the links are indeed present, but I wasn't sure where to look. The comments mentioned the current working directory, but I don't know how to find it from the CLI without having to modify any code. Any help in this regard is appreciated.
I can certainly try with a smaller number of pages for the next runs.
Thanks,
Madhura
from hibench.
It seems you are using YARN, right?
For YARN, the cached files are located in ${yarn.nodemanager.local-dirs}/filecache. The default value of yarn.nodemanager.local-dirs is documented here:
http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
Or you may have already customized it in yarn-site.xml.
On my YARN cluster, after the second job has run for some time, that directory contains subdirectories with numbers as names, such as 1305, 1314, etc. Inside each is a cached part-XXXXX directory containing the data and index files.
Also, after the second job has run for some time, ${yarn.nodemanager.local-dirs}/usercache/${user}/appcache contains a directory named after the job ID. Inside it are several directories named after container IDs. Inside one of them (for me, container_1394081617508_0294_01_000015) there are links named urls-0, urls-1, etc.
The cached files seem to be kept even after the second job finishes, while the application and container directories are removed when the job finishes. Also note that ${yarn.nodemanager.local-dirs} and its subdirectories should be owned by user "yarn", group "yarn".
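A sketch of commands to inspect those locations on a node manager (the helper name is made up; pass the actual value of yarn.nodemanager.local-dirs):

```shell
# List cached part-XXXXX directories and any urls-* task links under a
# YARN node-manager local dir (layout as described in the comment above).
inspect_yarn_cache() {
  local d="${1:?path from yarn.nodemanager.local-dirs}"
  ls "$d/filecache" 2>/dev/null                              # numbered cache dirs
  find "$d" -path '*/appcache/*' -name 'urls-*' 2>/dev/null  # task links
  return 0
}
# e.g.: inspect_yarn_cache /tmp/hadoop-yarn/nm-local-dir
```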
from hibench.
I am actually using MR1. Where are the links created for MR1?
Also, if I find that the links are not being created, what do I need to check for? Permissions, and what else?
Madhura
from hibench.
For MR1, the distributed cache contents are located at:
${mapred.local.dir}/taskTracker/distcache/XXXXXXXXX (seemingly random numbers)/{machinename}/HiBench/Nutch/temp/urls/part-00000 etc., and
the link in the working directory appears to be:
${mapred.local.dir}/taskTracker/{username}/jobcache/job_XXX(jobID)/attempt-XXX(attemptID)/work/urls-X
For the default value of mapred.local.dir, you may refer to this page:
http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.0/src/mapred/mapred-default.xml
When I click one of the map tasks of the second job on the job tracking website and then click "All" in the "task logs" column, all logs are displayed. Inside the syslog I can find lines like:
2014-03-17 09:01:16,464 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
2014-03-17 09:01:18,398 INFO org.apache.hadoop.mapred.TaskRunner: Creating symlink: /mnt/DP_disk2/cdh4/mapred/taskTracker/distcache/5545382444559646992_-958364022_1305357600/sr231/HiBench/Nutch/temp/urls/part-00000 <- /mnt/DP_disk1/cdh4/mapred/taskTracker/hdfs/jobcache/job_201403171008_0002/attempt_201403171008_0002_m_000000_0/work/urls-9
2014-03-17 09:01:18,454 INFO org.apache.hadoop.mapred.TaskRunner: Creating symlink: /mnt/DP_disk3/cdh4/mapred/taskTracker/distcache/3238116305694177633_2083908161_1305351166/sr231/HiBench/Nutch/temp/urls/part-00001 <- /mnt/DP_disk1/cdh4/mapred/taskTracker/hdfs/jobcache/job_201403171008_0002/attempt_201403171008_0002_m_000000_0/work/urls-10
2014-03-17 09:01:18,543 INFO org.apache.hadoop.mapred.TaskRunner: Creating symlink: /mnt/DP_disk1/cdh4/mapred/taskTracker/distcache/-8580248963676784844_1453472193_1305351229/sr231/HiBench/Nutch/temp/urls/part-00002 <- /mnt/DP_disk1/cdh4/mapred/taskTracker/hdfs/jobcache/job_201403171008_0002/attempt_201403171008_0002_m_000000_0/work/urls-11
.......
which show the execution trace. You may check this log further.
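Those two locations can be scanned with a small helper (the name is made up; run it on each task node with your actual mapred.local.dir, which may be a list of disks):

```shell
# Find distributed-cache part-* directories and urls-* links under an
# MR1 mapred.local.dir (layout as described in the comment above).
inspect_mr1_cache() {
  local d="${1:?mapred.local.dir path}"
  find "$d/taskTracker" -path '*distcache*' -name 'part-*' 2>/dev/null
  find "$d/taskTracker" -path '*jobcache*' -name 'urls-*' 2>/dev/null
  return 0
}
# e.g.: inspect_mr1_cache /data5/mapred/local
```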
from hibench.
Hello,
I attempted another run. This is what I found:
- I found the contents as expected in the ${mapred.local.dir}/taskTracker/distcache/XXXXXXXXX (seemingly random numbers)/{machinename}/HiBench/Nutch/temp/urls/part-00000 directories.
- But the jobcache directories are all empty.
- I checked the task logs as well. It seems the symlinks are being created, and there don't seem to be any errors related to their creation. But for some reason the links do not exist in the jobcache:
INFO org.apache.hadoop.mapred.TaskRunner: Creating symlink: /data5/mapred/local/taskTracker/distcache/5902015323532002611_-530266511_1615634528/r720-h1.sdcorp.global.sandisk.com/HiBench/Nutch/temp/urls/part-00095 <- /data5/mapred/local/taskTracker/hdfs/jobcache/job_201403200919_0002/attempt_201403200919_0002_m_000000_0/work/urls-0
There is another error that I see at the bottom of the syslog:
2014-03-20 09:22:29,700 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2014-03-20 09:22:29,702 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.NullPointerException
at HiBench.NutchData$CreateNutchPages.map(NutchData.java:349)
at HiBench.NutchData$CreateNutchPages.map(NutchData.java:296)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
2014-03-20 09:22:29,707 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
I am not sure if this is related to the file-not-found exception that I eventually see in the stderr log, or if this is what causes the symlinks not to show up.
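A shell check that may help narrow this down: whether the created links are dangling, i.e. the link exists but its target does not resolve, which would also surface as "File does not exist". A portable sketch (the helper name is made up):

```shell
# Print symlinks whose targets do not resolve ("dangling" links).
# 'test -e' follows the link, so links with missing targets are printed.
broken_links() {
  find "${1:?directory to scan}" -type l ! -exec test -e {} \; -print 2>/dev/null
  return 0
}
# e.g.: broken_links /data5/mapred/local/taskTracker/hdfs/jobcache
```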
Any suggestions to try out?
Madhura
from hibench.
Hello,
I was wondering if anyone has suggestions about what to check to figure out why the nutch indexing benchmark is failing for me.
Thanks,
Madhura
from hibench.