Comments (15)
I cannot find enough information to root-cause the problem.
Do you have the dictionary file /usr/share/dict/linux.words on your Linux machines?
You can also try the Bayes benchmark to see whether it works. If it does not, a missing /usr/share/dict/linux.words may be the root cause.
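A quick per-node check can be scripted; this is a minimal sketch (the `check_dict` helper name is made up, and the install hint assumes an RPM-based distro):

```shell
# Report whether the dictionary file the data generator reads exists.
check_dict() {
  local dict="${1:-/usr/share/dict/linux.words}"
  if [ -f "$dict" ]; then
    echo "present"
  else
    echo "missing: install the 'words' package (e.g. yum install words)"
  fi
}
check_dict   # run this on every node of the cluster
```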
from hibench.
I have /usr/share/dict/linux.words.
[hdfs@r720-h1 bin]$ ls /usr/share/dict/
linux.words words
I am following the steps listed in the README.md file: set up the configuration for nutchindexing in configure.sh (I am using the defaults, so I didn't change anything), then run the prepare.sh script for nutchindexing to set up the dataset for the benchmark. But I ran into the failures above. Am I missing any steps?
I have hdfs and mapreduce services running on the Hadoop cluster.
I can provide more information. Let me know what you need.
Madhura
from hibench.
Can someone help me to get the nutchindexing benchmark working?
Madhura
from hibench.
We hope to reproduce the problem in-house and then debug it. Could you share your basic running configuration?
e.g., cluster size, data size, parameter settings in the configuration files, Hadoop confs, etc.
from hibench.
A 3-node cluster, with the name node and a data node running on one machine, and the other two machines being data nodes.
Configuration file settings for the nutch indexing benchmark:
PAGES=10000000
NUM_MAPS=96
NUM_REDS=48
I did not change any of the other configuration file parameters. I am not sure about the data size, but I am guessing that the dataset created depends on the number of PAGES listed in the configuration file.
Madhura
from hibench.
I am also including the stack trace from the failure (from prepare.sh), if it will help:
attempt_201403100956_0024_m_000027_0: java.io.FileNotFoundException: File does not exist: urls-0/data
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:824)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.io.SequenceFile$Reader.&lt;init&gt;(SequenceFile.java:1704)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.io.MapFile$Reader.createDataFileReader(MapFile.java:452)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:426)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.io.MapFile$Reader.&lt;init&gt;(MapFile.java:396)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.io.MapFile$Reader.&lt;init&gt;(MapFile.java:405)
attempt_201403100956_0024_m_000027_0: at HiBench.Utils.getSharedMapFile(Utils.java:196)
attempt_201403100956_0024_m_000027_0: at HiBench.NutchData$CreateNutchPages.configure(NutchData.java:307)
attempt_201403100956_0024_m_000027_0: at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
attempt_201403100956_0024_m_000027_0: at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
attempt_201403100956_0024_m_000027_0: at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
attempt_201403100956_0024_m_000027_0: at java.lang.reflect.Method.invoke(Method.java:597)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
attempt_201403100956_0024_m_000027_0: at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
attempt_201403100956_0024_m_000027_0: at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
attempt_201403100956_0024_m_000027_0: at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
attempt_201403100956_0024_m_000027_0: at java.lang.reflect.Method.invoke(Method.java:597)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:413)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
attempt_201403100956_0024_m_000027_0: at java.security.AccessController.doPrivileged(Native Method)
attempt_201403100956_0024_m_000027_0: at javax.security.auth.Subject.doAs(Subject.java:396)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
attempt_201403100956_0024_m_000027_0: at org.apache.hadoop.mapred.Child.main(Child.java:262)
14/03/10 12:33:25 INFO mapred.JobClient: Task Id : attempt_201403100956_0024_m_000005_0, Status : FAILED
from hibench.
Can you tell us which Hadoop version and JDK version you are using?
Here is a brief description of the nutch indexing workflow as it relates to the missing urls-0/data, along with some checks to try:
1. prepare.sh runs two MapReduce jobs. From the info you provided, it seems the error came from the second job and the first job succeeded. Is that the case?
2. If the first job completed correctly, there should be content in /HiBench/Nutch/temp/urls/part-#####/data (##### represents numbers) and /HiBench/Nutch/temp/urls/part-#####/index (the index may be deleted after the second job finishes). Would you please verify this?
3. The second job uses DistributedCache to cache each /HiBench/Nutch/temp/urls/part-#####/ directory on every node. It calls createSymlink(), which creates a link named urls-# pointing to the cached part-##### directory on each node. The link is created under the task's current working directory: FileSystem fs = FileSystem.getLocal(job); fs.getWorkingDirectory() gives the directory where the urls-# link exists.
Would you please verify that DistributedCache actually caches the directories on the nodes, and check whether there is any problem creating the links that point to them? (The problem you reported suggests the link is not created, or does not point to the directory as expected.)
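The link mechanism can be mimicked from the shell to see what a task expects to find; this is a sketch with throwaway paths, not HiBench code:

```shell
# Simulate what the task runner does: symlink a cached part-00000
# directory into a task work dir under the name urls-0, then read the
# data file through the link the way the MapFile reader would.
root=$(mktemp -d)
mkdir -p "$root/distcache/part-00000" "$root/work"
echo "dummy-data" > "$root/distcache/part-00000/data"
echo "dummy-index" > "$root/distcache/part-00000/index"
ln -s "$root/distcache/part-00000" "$root/work/urls-0"
cat "$root/work/urls-0/data"   # prints: dummy-data
```

If the equivalent read through the real urls-0 link fails on a task node, the problem is in the link or the cached directory rather than in the job's own logic.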
from hibench.
Answers to your questions:
- Hadoop version - I am using Cloudera Distribution of Hadoop version 4.8.1
- JDK version -
java version "1.7.0_51"
OpenJDK Runtime Environment (rhel-2.4.4.1.el6_5-x86_64 u51-b02)
OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)
- The first job within prepare.sh succeeded.
- The contents of /HiBench/Nutch/temp/urls/ are as expected:
[hdfs@r720-h1 bin]$ hdfs dfs -ls /HiBench/Nutch/temp/urls/part*/
Found 2 items
-rw-r--r-- 3 hdfs supergroup 12471032 2014-03-13 13:54 /HiBench/Nutch/temp/urls/part-00000/data
-rw-r--r-- 3 hdfs supergroup 7263 2014-03-13 13:54 /HiBench/Nutch/temp/urls/part-00000/index
Found 2 items
-rw-r--r-- 3 hdfs supergroup 12474468 2014-03-13 13:54 /HiBench/Nutch/temp/urls/part-00001/data
-rw-r--r-- 3 hdfs supergroup 7261 2014-03-13 13:54 /HiBench/Nutch/temp/urls/part-00001/index
- I am not sure how to check the current working directory, so I couldn't check whether the symlinks exist. Is there a way to do this from the command line? Also, you mentioned DistributedCache usage. I assume the cache is available for use by default and I don't need to actually configure it?
from hibench.
First, it is strange that you set NUM_MAPS=96 but only got 2 part-xxxxx files/folders under the "urls" directory. It should contain 96 folders in total (i.e., part-00000 ~ part-00095).
Second, the symbolic links used to access the hashed URLs do not seem to work on your system; otherwise you would not see the "file does not exist" exception.
So would you help check how the first issue happened? If there really are only 2 part-xxxxx folders, I'm afraid the first job produced wrong (or at least not enough) data. We should rule that out first and then dig into the second issue.
I suggest setting a smaller number of PAGES for a fast test. I'm afraid a size of 10000000 might be a little large for your 3-node cluster (depending on the hardware you use). Maybe try 1000 first, simply as a test.
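For instance, a quick-test setting in configure.sh might look like the following (PAGES, NUM_MAPS, and NUM_REDS are the names quoted earlier in this thread; the smaller values are only illustrative):

```shell
# Quick sanity-test settings for the nutchindexing workload
PAGES=1000      # small dataset just to validate the pipeline
NUM_MAPS=6      # e.g. ~2 map slots per node on a 3-node cluster
NUM_REDS=3
```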
from hibench.
Allan,
I do see all 96 parts; I just didn't list them all in my comment. I listed the first couple just as an illustration of how the parts show up. So, the first job definitely completes successfully.
Also, someone else responded saying that the links pointing into the distributed cache are not being created, and asked me to check whether the links are indeed present, but I wasn't sure where to look. The comments mentioned the current working directory, but I don't know how to find it from the CLI without having to modify any code. Any help in this regard is appreciated.
I can certainly try with a smaller number of pages for the next runs.
Thanks,
Madhura
from hibench.
It seems you are using YARN, right?
For YARN, the cached files are located in ${yarn.nodemanager.local-dirs}/filecache. The default value of yarn.nodemanager.local-dirs is documented here:
http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
Or you may have already customized it in yarn-site.xml.
On my YARN cluster, after the second job has run for some time, that directory contains subdirectories with numbers as names, such as 1305, 1314, etc. Inside each is a cached part-XXXXX directory containing the data and index files.
Also, after the second job has run for some time, ${yarn.nodemanager.local-dirs}/usercache/${user}/appcache contains a directory named after the job ID. Inside it are several directories named after container IDs. Inside one of them (for me, container_1394081617508_0294_01_000015) there are links named urls-0, urls-1, etc.
The cached files seem to be kept even after the second job finishes, while the application and container directories are removed when the job finishes. Also note that ${yarn.nodemanager.local-dirs} and its subdirectories should be owned by user "yarn", group "yarn".
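A sketch of commands to inspect those locations on a node manager (the helper name is made up; pass the actual value of yarn.nodemanager.local-dirs):

```shell
# List cached part-XXXXX directories and any urls-* task links under a
# YARN node-manager local dir (layout as described in the comment above).
inspect_yarn_cache() {
  local d="${1:?path from yarn.nodemanager.local-dirs}"
  ls "$d/filecache" 2>/dev/null                              # numbered cache dirs
  find "$d" -path '*/appcache/*' -name 'urls-*' 2>/dev/null  # task links
  return 0
}
# e.g.: inspect_yarn_cache /tmp/hadoop-yarn/nm-local-dir
```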
from hibench.
I am actually using MR1. Where are the links created for MR1?
Also, if I find that the links are not being created, what do I need to check for? Permissions, and what else?
Madhura
from hibench.
For MR1, the distributed cache contents are located at:
${mapred.local.dir}/taskTracker/distcache/XXXXXXXXX (seemingly random numbers)/{machinename}/HiBench/Nutch/temp/urls/part-00000 etc., and
the link in the working directory appears to be:
${mapred.local.dir}/taskTracker/{username}/jobcache/job_XXX(jobID)/attempt-XXX(attemptID)/work/urls-X
For the default value of mapred.local.dir, you may refer to this page:
http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.0/src/mapred/mapred-default.xml
When I click one of the map tasks of the second job on the job tracking website and then click "All" in the "task logs" column, all logs are displayed. Inside the syslog I can find lines like:
2014-03-17 09:01:16,464 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
2014-03-17 09:01:18,398 INFO org.apache.hadoop.mapred.TaskRunner: Creating symlink: /mnt/DP_disk2/cdh4/mapred/taskTracker/distcache/5545382444559646992_-958364022_1305357600/sr231/HiBench/Nutch/temp/urls/part-00000 <- /mnt/DP_disk1/cdh4/mapred/taskTracker/hdfs/jobcache/job_201403171008_0002/attempt_201403171008_0002_m_000000_0/work/urls-9
2014-03-17 09:01:18,454 INFO org.apache.hadoop.mapred.TaskRunner: Creating symlink: /mnt/DP_disk3/cdh4/mapred/taskTracker/distcache/3238116305694177633_2083908161_1305351166/sr231/HiBench/Nutch/temp/urls/part-00001 <- /mnt/DP_disk1/cdh4/mapred/taskTracker/hdfs/jobcache/job_201403171008_0002/attempt_201403171008_0002_m_000000_0/work/urls-10
2014-03-17 09:01:18,543 INFO org.apache.hadoop.mapred.TaskRunner: Creating symlink: /mnt/DP_disk1/cdh4/mapred/taskTracker/distcache/-8580248963676784844_1453472193_1305351229/sr231/HiBench/Nutch/temp/urls/part-00002 <- /mnt/DP_disk1/cdh4/mapred/taskTracker/hdfs/jobcache/job_201403171008_0002/attempt_201403171008_0002_m_000000_0/work/urls-11
.......
which show the execution trace. You may check this log further.
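Those two locations can be scanned with a small helper (the name is made up; run it on each task node with your actual mapred.local.dir, which may be a list of disks):

```shell
# Find distributed-cache part-* directories and urls-* links under an
# MR1 mapred.local.dir (layout as described in the comment above).
inspect_mr1_cache() {
  local d="${1:?mapred.local.dir path}"
  find "$d/taskTracker" -path '*distcache*' -name 'part-*' 2>/dev/null
  find "$d/taskTracker" -path '*jobcache*' -name 'urls-*' 2>/dev/null
  return 0
}
# e.g.: inspect_mr1_cache /data5/mapred/local
```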
from hibench.
Hello,
I attempted another run. This is what I found:
- I found the contents as expected in the ${mapred.local.dir}/taskTracker/distcache/XXXXXXXXX (seemingly random numbers)/{machinename}/HiBench/Nutch/temp/urls/part-00000 directories.
- But the jobcache directories are all empty.
- I checked the task logs as well. It seems the symlinks are being created, and there don't seem to be any errors related to their creation. But for some reason the links do not exist in the jobcache:
INFO org.apache.hadoop.mapred.TaskRunner: Creating symlink: /data5/mapred/local/taskTracker/distcache/5902015323532002611_-530266511_1615634528/r720-h1.sdcorp.global.sandisk.com/HiBench/Nutch/temp/urls/part-00095 <- /data5/mapred/local/taskTracker/hdfs/jobcache/job_201403200919_0002/attempt_201403200919_0002_m_000000_0/work/urls-0
There is another error that I see at the bottom of the syslog:
2014-03-20 09:22:29,700 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2014-03-20 09:22:29,702 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.NullPointerException
at HiBench.NutchData$CreateNutchPages.map(NutchData.java:349)
at HiBench.NutchData$CreateNutchPages.map(NutchData.java:296)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
2014-03-20 09:22:29,707 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
I am not sure if this is related to the file-not-found exception that I eventually see in the stderr log, or if this is what causes the symlinks not to show up.
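A shell check that may help narrow this down: whether the created links are dangling, i.e. the link exists but its target does not resolve, which would also surface as "File does not exist". A portable sketch (the helper name is made up):

```shell
# Print symlinks whose targets do not resolve ("dangling" links).
# 'test -e' follows the link, so links with missing targets are printed.
broken_links() {
  find "${1:?directory to scan}" -type l ! -exec test -e {} \; -print 2>/dev/null
  return 0
}
# e.g.: broken_links /data5/mapred/local/taskTracker/hdfs/jobcache
```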
Any suggestions to try out?
Madhura
from hibench.
Hello,
I was wondering if anyone has suggestions about what to check to figure out why the nutch indexing benchmark is failing for me.
Thanks,
Madhura
from hibench.