Comments (3)
...which could be said this way: the crawler is actually working to store the page, but why is it so slow? (More than 30 minutes to grab a single web page.)
Same behaviour with www.caf.fr (webarchive http://repos.kbaccess.org/20140524003034/www.caf.fr/ )
from kbaccess.
Extract from the crawl.log of www.pasdecalais.fr:
2014-05-23T12:35:12.189Z 1 56 dns:www.pasdecalais.fr P http://www.pasdecalais.fr/Sports-Loisirs/Politique-sportive text/dns #002 20140523123511687+1 - - -
2014-05-23T12:35:13.096Z 200 569 http://www.pasdecalais.fr/robots.txt P http://www.pasdecalais.fr/Sports-Loisirs/Politique-sportive text/plain #001 20140523123512686+409 - - -
2014-05-23T12:40:13.288Z 200 39858 http://www.pasdecalais.fr/Sports-Loisirs/Politique-sportive - - text/html #001 20140523124013142+100 - - 3t
2014-05-23T12:45:13.367Z 200 1699 http://www.pasdecalais.fr/css/4570c5e.css E http://www.pasdecalais.fr/Sports-Loisirs/Politique-sportive text/css #001 20140523124513317+48 - - -
2014-05-23T12:50:13.437Z 200 524 http://www.pasdecalais.fr/img/commun/suivre/rss-focus.gif E http://www.pasdecalais.fr/Sports-Loisirs/Politique-sportive image/gif #001 20140523125013395+42 - - -
2014-05-23T12:55:13.517Z 200 1173 http://www.pasdecalais.fr/bundles/telmediasocle/images/header/header/togglers/rubriques.png E http://www.pasdecalais.fr/Sports-Loisirs/Politique-sportive image/png #001 20140523125513467+49 - - -
2014-05-23T13:00:13.591Z 200 964 http://www.pasdecalais.fr/img/commun/outils/rss-focus.gif E http://www.pasdecalais.fr/Sports-Loisirs/Politique-sportive image/gif #001 20140523130013548+42 - - -
2014-05-23T13:05:13.680Z 200 6401 http://www.pasdecalais.fr/bundles/telmediasocle/img/footer/legal/logo.png E http://www.pasdecalais.fr/Sports-Loisirs/Politique-sportive image/png #001 20140523130513622+57 - - -
2014-05-23T13:10:13.763Z 200 483 http://www.pasdecalais.fr/img/commun/social/googleplus.gif E http://www.pasdecalais.fr/Sports-Loisirs/Politique-sportive image/gif #001 20140523131013711+51 - - -
2014-05-23T13:15:15.376Z 404 316 http://www.pasdecalais.fr/favicon.ico I http://www.pasdecalais.fr/Sports-Loisirs/Politique-sportive text/html #022 20140523131514025+1350 - - -
2014-05-23T13:20:15.450Z 200 396 http://www.pasdecalais.fr/img/commun/outils/faire-suivre-focus.gif E http://www.pasdecalais.fr/Sports-Loisirs/Politique-sportive image/gif #022 20140523132015408+42 - - -
2014-05-23T13:25:15.550Z 200 5641 http://www.pasdecalais.fr/var/cg62/storage/images/mediatheque/sports/photos/carte-des-referents-territoriaux-de-la-direction-des-sports/265933-5-fre-FR/Carte-des-referents-territoriaux-de-la-Direction-des-Sports_small-medium.jpg E http://www.pasdecalais.fr/Sports-Loisirs/Politique-sportive image/jpeg #022 20140523132515482+67 - - robotExcluded
2014-05-23T13:30:15.650Z 200 494 http://www.pasdecalais.fr/img/commun/suivre/lettre-info-focus.gif E http://www.pasdecalais.fr/Sports-Loisirs/Politique-sportive image/gif #022 20140523133015590+60 - - -
2014-05-23T13:35:15.727Z 200 3539 http://www.pasdecalais.fr/bundles/telmediasocle/images/page/icon-home.png E http://www.pasdecalais.fr/Sports-Loisirs/Politique-sportive image/png #022 20140523133515682+45 - - -
2014-05-23T13:40:15.802Z 200 228 http://www.pasdecalais.fr/img/commun/outils/imprimer-focus.gif E http://www.pasdecalais.fr/Sports-Loisirs/Politique-sportive image/gif #022 20140523134015759+42 - - -
2014-05-23T13:45:15.878Z 200 380 http://www.pasdecalais.fr/img/commun/outils/faire-suivre.gif E http://www.pasdecalais.fr/Sports-Loisirs/Politique-sportive image/gif #022 20140523134515834+43 - - -
2014-05-23T13:50:15.981Z 200 17900 http://www.pasdecalais.fr/img/page/to-top.png E http://www.pasdecalais.fr/Sports-Loisirs/Politique-sportive image/png #022 20140523135015909+72 - - -
2014-05-23T13:55:16.144Z 200 257958 http://www.pasdecalais.fr/var/cg62/storage/images/mediatheque/galeries/sports-loisirs/bandeau-3-terrils/720398-1-fre-FR/Bandeau-3-terrils.jpg E http://www.pasdecalais.fr/Sports-Loisirs/Politique-sportive image/jpeg #022 20140523135516013+130 - - robotExcluded
Requests are fired one at a time, one every 5 minutes.
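To double-check the 5-minute spacing, here is a minimal sketch that computes the gap between consecutive fetches from the first field of the crawl.log lines. The timestamps are copied from the extract above; the helper names `parse_ts` and `gaps_in_seconds` are only illustrative and not part of any crawler tooling.

```python
from datetime import datetime

# First field of four consecutive HTTP lines from the crawl.log extract
timestamps = [
    "2014-05-23T12:35:13.096Z",
    "2014-05-23T12:40:13.288Z",
    "2014-05-23T12:45:13.367Z",
    "2014-05-23T12:50:13.437Z",
]


def parse_ts(ts):
    """Parse a crawl.log timestamp such as 2014-05-23T12:35:13.096Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def gaps_in_seconds(stamps):
    """Return the delay in seconds between each pair of consecutive timestamps."""
    parsed = [parse_ts(s) for s in stamps]
    return [(b - a).total_seconds() for a, b in zip(parsed, parsed[1:])]


gaps = gaps_in_seconds(timestamps)
print([round(g) for g in gaps])  # [300, 300, 300]
```

Every gap rounds to 300 seconds, which matches the observation that one request goes out every 5 minutes.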
File http://pasdecalais.fr/robots.txt:
# Applies to all crawlers
User-agent:*
Sitemap: http://www.pasdecalais.fr/sitemap.xml
# the following directive lets us limit robot visits to one every 5 minutes
Crawl-delay: 300
# the following files will not be indexed by search engines
Disallow: *.gif$
Disallow: *.png$
Disallow: *.ico$
Disallow: *.exe$
Disallow: *.js$
Disallow: *.css$
Disallow: *.zip$
Disallow: *.xls$
Disallow: /layout/set/pdf*/
Disallow: /www/var/cg62/storage
Disallow: /var/cg62/storage
Disallow: /recherche
Disallow: /content/tipafriend
Disallow: /content/download
And an extract from http://www.caf.fr/robots.txt:
User-agent: *
Crawl-delay: 10
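For reference, a small sketch of how the `Crawl-delay` values in these two files translate into crawl time. It uses Python's standard `urllib.robotparser` (the `crawl_delay` method needs Python 3.6 or later), not the parser the crawler itself uses. The two robots.txt samples are shortened versions of the extracts above, and the count of 17 HTTP fetches comes from the crawl.log extract (every line except the DNS lookup). The names `crawl_delay` and `minimum_crawl_seconds` are hypothetical helpers.

```python
from urllib.robotparser import RobotFileParser

# Shortened versions of the two robots.txt extracts quoted in this thread
PASDECALAIS_ROBOTS = [
    "User-agent:*",
    "Crawl-delay: 300",
    "Disallow: /recherche",
]
CAF_ROBOTS = [
    "User-agent: *",
    "Crawl-delay: 10",
]


def crawl_delay(robots_lines, user_agent="*"):
    """Return the Crawl-delay (seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.crawl_delay(user_agent)


def minimum_crawl_seconds(fetch_count, delay):
    """Lower bound on crawl time when fetches are spaced `delay` seconds apart."""
    return max(fetch_count - 1, 0) * delay


delay_pasdecalais = crawl_delay(PASDECALAIS_ROBOTS)
delay_caf = crawl_delay(CAF_ROBOTS)
print(delay_pasdecalais, delay_caf)

# 17 HTTP fetches appear in the crawl.log extract for www.pasdecalais.fr
estimate = minimum_crawl_seconds(17, delay_pasdecalais)
print(estimate / 60, "minutes")
```

With a 300-second delay, 17 fetches need at least 80 minutes, which is consistent with the extract running from 12:35 to 13:55.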
Conclusion: we have to disable robots.txt support in slurp-manager.
duplicate of #225