c4software / python-sitemap
Mini website crawler to make sitemap from a website.
License: GNU General Public License v3.0
Hi,
I am getting a SyntaxError when trying to execute the file, no matter what link I type in; wrapping the link in "" or '' doesn't help either.
Is there a way to "revert" the Python version back to 3.6 without installing another instance?
Or am I doing something wrong here?
Thx
If http://domain/dir/page1.html contains a link to page2.html, the parser interprets this as http://domain/page2.html; correct is http://domain/dir/page2.html.
Furthermore, on a page containing references to the upper directories (..), these are changed to . by self.clean_link.
I recommend using urllib.parse.urljoin(crawling_url, link) to resolve a link to an absolute URL. This will handle everything except "//" in the path.
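As a quick illustration of the suggested fix, urljoin from the standard library resolves a link against the page being crawled rather than the site root, and handles parent-directory references as well (the URLs are the examples above):

```python
from urllib.parse import urljoin

# A link found on http://domain/dir/page1.html resolves against that
# page, not against the site root:
print(urljoin("http://domain/dir/page1.html", "page2.html"))     # http://domain/dir/page2.html
# Parent-directory references (..) are handled too:
print(urljoin("http://domain/dir/page1.html", "../other.html"))  # http://domain/other.html
```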
Wanted to try and see if this would work with single-page apps like Angular, and it appears it will only pick up the index page. Hopefully support can be added for these types of use cases. Thanks.
$ python3 main.py --domain http://ua.shop-ink.su --output sitemap.xml
Fatal Python error: Cannot recover from stack overflow.
Current thread 0x00007fff7edeb180:
....
File "/Users/dchaplinsky/Projects/python-sitemap/crawler.py", line 201 in __continue_crawling
File "/Users/dchaplinsky/Projects/python-sitemap/crawler.py", line 197 in __crawling
...
Abort trap: 6
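The "Cannot recover from stack overflow" abort matches the recursive __crawling/__continue_crawling frames in the traceback: one stack frame per crawled page, so a deep site exhausts the stack. A hedged sketch of the usual fix, an explicit queue instead of recursion (crawl and fetch_links are illustrative names, not the script's API):

```python
from collections import deque

def crawl(start, fetch_links):
    """Breadth-first crawl with an explicit queue; no recursion, so the
    stack stays flat no matter how many pages the site has."""
    seen, queue = {start}, deque([start])
    while queue:
        url = queue.popleft()
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

# Tiny in-memory "site" standing in for real HTTP fetches:
site = {"a": ["b", "c"], "b": ["c"], "c": []}
print(sorted(crawl("a", site.__getitem__)))  # ['a', 'b', 'c']
```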
Could it be possible to restrict the search to a certain path? A (contrived) example would be to restrict a search to http://google.com/maps/ and ignore results which are in other "subdirectories" of http://google.com/. Using "domain" for this purpose does not work.
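A minimal sketch of how such a scope check could work, using urlparse from the standard library; the in_scope helper and its base URL are hypothetical, not part of the script:

```python
from urllib.parse import urlparse

def in_scope(url, base="http://google.com/maps/"):
    """Keep only URLs on the same host AND under the base path (hypothetical helper)."""
    b, u = urlparse(base), urlparse(url)
    return u.netloc == b.netloc and u.path.startswith(b.path)

print(in_scope("http://google.com/maps/place/foo"))  # True
print(in_scope("http://google.com/search?q=x"))      # False
```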
Command
python3 ~/sitemap/python-sitemap-master/main.py --domain https://www.books.2globalnomads.info --image --output sitemap.xml
Output
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 1: invalid start byte
With multiple errors: HTTP Error 404: Not Found
Running:
python3 main.py --domain https://www.2globalnomads.info --output sitemap.xml --images --report --parserobots
Output:
<image:loc>https://www.2globalnomads.infopaivi-santeri-kannisto/subscribe.png</image:loc></image:image><image:image><image:loc>https://www.2globalnomads.infopaivi-santeri-kannisto/logo.png</image:loc>
There should be a "/" in the URL before the path, between "info" and "paivi", like this: "info/paivi".
The same issue happens with all local URLs; remote URLs are all OK.
sebclick$ python3 main.py --config config.fb6.json
...
DEBUG:root:http://www.freebox-v6.fr//www.mediawiki.org/ ==> HTTP Error 404: Not Found
...
and in the sitemap.xml, I find the following line:
On running the command mentioned in the simple usage section of readme.md:
File "main.py", line 6, in <module>
import crawler
File "/Users/kartikey/Desktop/SoftwareIncubator/sitemapgen/python-sitemap/crawler.py", line 85
print(config.xml_header, file=self.output_file)
The issue with this tool is that once it halts, you have to start all over again from scratch.
And with large sites this is a very common scenario.
Since we already have the partially generated XML, it would be nice to continue from where it was interrupted. Let me know your thoughts on this and how to achieve it; I am willing to send a pull request once I have a better understanding of the code.
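One possible shape for such resumption, sketched under the assumption that the crawler keeps a set of crawled URLs and a set of URLs still to crawl; the checkpoint file name and JSON layout are made up for illustration:

```python
import json, os

STATE_FILE = "crawl_state.json"  # hypothetical checkpoint file name

def save_state(crawled, to_crawl):
    # Persist both the finished and the pending URL sets after each page
    # (or every N pages), so an interruption loses almost no work.
    with open(STATE_FILE, "w") as f:
        json.dump({"crawled": sorted(crawled), "to_crawl": sorted(to_crawl)}, f)

def load_state():
    # On start-up, resume from the checkpoint if one exists.
    if not os.path.exists(STATE_FILE):
        return set(), set()
    with open(STATE_FILE) as f:
        data = json.load(f)
    return set(data["crawled"]), set(data["to_crawl"])
```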
Sitemap should contain only URLs that belong to the same domain and are under the current directory where the sitemap is located. The same rule applies to images and videos. Currently the script adds all images without checking the domain or directory.
Hello,
Does this sitemap tool generate an index XML sitemap that lists all the smaller sitemap.xml files?
Can I add a limit? E.g., I have a lot of links on my website; could I set a limit of, say, 10,000 links, after which the script stops?
Does this script avoid adding duplicate links to the sitemap?
Thanks.
Hi,
Would you consider adding support for images in the future?
i.e.
https://support.google.com/webmasters/answer/178636?hl=en
Command:
python3 ~/sitemap/python-sitemap-master/main.py --domain https://www.forum.2globalnomads.info --image --output sitemap.xml --verbose
Your fix
--drop "sid=[a-z0-9]{32}"
Although it is not likely that people will run this script on phpbb3 forums, as there is already a mod for making sitemaps, please consider adding your workaround to the documentation.
The same infinite loop will happen with every phpbb3 installation, and there are tens of thousands of them. Also, it might be a good idea to add some kind of guard or timeout to detect loops, so the script can exit gracefully with a proper error message. A similar issue can happen with any website that has session management.
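For illustration, the --drop pattern above can be understood as a regular expression stripped from each URL before deduplication (a sketch of the idea, not the script's exact internals; the forum URL is made up):

```python
import re

# The pattern from the fix above, applied to a made-up phpbb-style URL:
drop = re.compile(r"sid=[a-z0-9]{32}")
url = "http://forum.example/viewtopic.php?t=1&sid=" + "a" * 32
# With the session token removed, revisits of the same page compare equal:
print(drop.sub("", url))  # http://forum.example/viewtopic.php?t=1&
```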
My french is not great but here goes
When there is an "HTTP Error 404", I wonder which hyperlink caused the problem.
This is not really an issue, but I did not find any better/other way to talk with you.
I made an online interface for the script. I am updating your code there manually after testing each release myself first. Hopefully it will make it easier for people to run the script and help attract others to join the project and do testing or possibly even coding.
The interface is available at: https://www.2globalnomads.info/web-design-websites/#generateimagesitemap.
If the URL contains non-ASCII characters, Python will report an error.
debug info:
INFO:root:Crawling #1: https://gvo.wiki/html/NPC掉落書籍.html
DEBUG:root:https://gvo.wiki/html/NPC掉落書籍.html ==> 'ascii' codec can't encode characters in position 13-16: ordinal not in range(128)
Solution:
import string
from urllib.parse import quote
then search for
current_url = self.urls_to_crawl.pop()
and add this line below it:
current_url = quote(current_url, safe=string.printable)
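A self-contained check of that fix (note that the original post imported unquote, but it is quote that is actually used):

```python
import string
from urllib.parse import quote

url = "https://gvo.wiki/html/NPC掉落書籍.html"
# Percent-encode everything outside printable ASCII so urlopen can
# handle the request without the 'ascii' codec error:
print(quote(url, safe=string.printable))
# https://gvo.wiki/html/NPC%E6%8E%89%E8%90%BD%E6%9B%B8%E7%B1%8D.html
```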
Dear Creator,
Thank you very much for creating this.
Is there a way to add hreflang tags automatically?
Take care.
All of my site's URLs include a trailing '/':
https://www.example.com/
https://www.example.com/dir/
not the following:
https://www.example.com
https://www.example.com/dir
This script emitted all of my links without the trailing '/'. How do I add the trailing '/' back in?
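One hedged sketch of a post-processing step that would restore the slashes: append '/' to directory-like paths while leaving file URLs alone (the helper name and the "no extension means directory" heuristic are assumptions, not the script's behavior):

```python
import posixpath
from urllib.parse import urlparse, urlunparse

def ensure_trailing_slash(url):
    """Hypothetical helper: append '/' to directory-like paths, leaving
    file URLs (paths with an extension) unchanged."""
    p = urlparse(url)
    path = p.path or "/"
    if not path.endswith("/") and not posixpath.splitext(path)[1]:
        path += "/"
    return urlunparse(p._replace(path=path))

print(ensure_trailing_slash("https://www.example.com"))      # https://www.example.com/
print(ensure_trailing_slash("https://www.example.com/dir"))  # https://www.example.com/dir/
```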
diff -urN python-sitemap-master/crawler.py python-sitemap-master/crawler.py
--- python-sitemap-master/crawler.py	2013-04-03 09:25:00.000000000 +0300
+++ python-sitemap-master/crawler.py	2013-06-08 11:09:44.910676587 +0300
@@ -84,8 +84,8 @@
 		url = urlparse(crawling)
 		self.crawled.add(crawling)
-		request = Request(crawling, headers={"User-Agent":config.crawler_user_agent})
 		try:
+			request = Request(crawling, headers={"User-Agent":config.crawler_user_agent})
 			response = urlopen(request)
 		except Exception as e:
 			if hasattr(e,'code'):
 				response.close()
 			return self.__continue_crawling()
 		# Read the response
When I run with any number of workers greater than 1, I get the following error after crawling around 40 urls.
INFO:root:Crawling #56: https://up.codes/s/natural-ventilation
ERROR:concurrent.futures:exception calling callback for <Future at 0x10ddc1190 state=finished returned NoneType>
Traceback (most recent call last):
File "/Users/danpatz/.pyenv/versions/3.7.4/lib/python3.7/concurrent/futures/_base.py", line 324, in _invoke_callbacks
callback(self)
File "/Users/danpatz/.pyenv/versions/3.7.4/lib/python3.7/asyncio/futures.py", line 362, in _call_set_state
dest_loop.call_soon_threadsafe(_set_state, destination, source)
File "/Users/danpatz/.pyenv/versions/3.7.4/lib/python3.7/asyncio/base_events.py", line 728, in call_soon_threadsafe
self._check_closed()
File "/Users/danpatz/.pyenv/versions/3.7.4/lib/python3.7/asyncio/base_events.py", line 475, in _check_closed
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
I'm on a Mac with Catalina. Seems to run fine on Linux.
Here's the command I'm using to repro:
python main.py --domain="https://up.codes" --output="sitemap.xml" -v -n 2
Command:
python3 ~/sitemap/python-sitemap-master/main.py --domain https://www.xetnet.fi --image --output sitemap.xml --verbose
Output:
INFO:root:Start the crawling process
INFO:root:Crawling #1: https://www.xetnet.fi
INFO:root:Crawling #2: https://www.xetnet.fi/category/ror/
INFO:root:Crawling #3: https://www.xetnet.fi/wordpress-asennus-webhotelliin-2/
INFO:root:Crawling #4: https://www.xetnet.fi/category/ruby/
INFO:root:Crawling #5: https://www.xetnet.fi/asiakaspalvelu/reilua-palvelua/
INFO:root:Crawling #6: https://www.xetnet.fi/webhotelli/wordpress-webhotelli/
INFO:root:Crawling #7: https://www.xetnet.fi/wordpress/
INFO:root:Crawling #8: https://www.xetnet.fi/palvelupaketin-vaihtaminen-suurempaan-tai-pienempaan/
Traceback (most recent call last):
File "/home/paivisanteri/sitemap/python-sitemap-master/main.py", line 53, in <module>
crawl.run()
File "/home/paivisanteri/sitemap/python-sitemap-master/crawler.py", line 101, in run
self.__crawling()
File "/home/paivisanteri/sitemap/python-sitemap-master/crawler.py", line 205, in __crawling
print ("<url><loc>"+self.htmlspecialchars(url.geturl())+"</loc>" + lastmod + image_list + "</url>", file=self.output_file)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 745: ordinal not in range(128)
Could this somehow be a local problem, or maybe something in my Python settings? I am not familiar with Python.
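One plausible cause (an assumption, since the script's file-handling code isn't shown here): the output file is opened with the platform's default encoding, which can be ASCII under some locales, so characters like the 'ä' in Finnish URLs fail to encode. Opening the file explicitly as UTF-8 avoids that:

```python
# Sketch: open the sitemap output explicitly as UTF-8 instead of relying
# on the locale's default encoding, so non-ASCII URLs don't raise
# UnicodeEncodeError.
with open("sitemap.xml", "w", encoding="utf-8") as output_file:
    print("<url><loc>https://www.xetnet.fi/päivä/</loc></url>", file=output_file)
```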
The script adds the root directory twice to the sitemap: the first entry, at the beginning, is without the ending slash, and the second entry, at the end, is with the ending slash.
See:
python3 sitemap.py --domain https://www.2globalnomads.info --image --output sitemap.xml
Output:
<url><loc>https://www.2globalnomads.info</loc><lastmod>2017-08-22T15:28:56+00:00</lastmod>
...
<url><loc>https://www.2globalnomads.info/</loc><lastmod>2017-08-22T15:28:56+00:00</lastmod>
Take image:title from TITLE and/or ALT and image:caption from FIGCAPTION tags if they are present.
I took a peek at your source code. One source of crawling issues is that you currently define not_parseable_ressources in the code. Instead, if you define parseable resources and limit those to only truly parseable resources that are supported in the sitemap and may contain plain-text HTML links, you can limit issues with unknown extensions. You might also take a look at using MIME types instead of file extensions. I am not sure how that works in Python, though.
Command:
python3 ~/sitemap/python-sitemap-master/main.py --domain https://www.2globalnomads.info --image --output sitemap.xml --report
Missing page:
Command
python3 ~/sitemap/python-sitemap-master/main.py --domain https://www.forum.2globalnomads.info --image --output sitemap.xml --verbose
Data URI image links get added, but they should be left out. They are commonly used, for example, for lazy-loading images. The real image URLs are inside NOSCRIPT tags, and they get added OK.
Running:
python3 main.py --domain https://www.2globalnomads.info --output sitemap.xml --images --report --parserobots
Output:
<image:loc>https://www.2globalnomads.info/data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7</image:loc>
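A sketch of the requested filter: data: URIs are inline placeholders, not addressable images a sitemap could reference, so they can be rejected with a simple prefix check (the helper name is made up):

```python
def is_real_image_link(link):
    """Hypothetical filter: skip data: URIs, which are inline lazy-load
    placeholders rather than fetchable image URLs."""
    return not link.strip().lower().startswith("data:")

print(is_real_image_link("data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"))  # False
print(is_real_image_link("https://www.2globalnomads.info/logo.png"))  # True
```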
Image sitemap is the only way to tell search engines the licenses of images. Please consider adding the script an option for a site-wide license for all images. It could work like this:
python3 main.py --domain https://www.2globalnomads.info --output sitemap.xml --license http://creativecommons.org/publicdomain/zero/1.0/
With the following output added inside <image:image> after <image:loc>:
<image:license>http://creativecommons.org/publicdomain/zero/1.0/</image:license>
You could pretty-print the sitemap.xml a bit and add newlines after every closing tag. That would make it a bit more human-readable.
If you want, you can also take <image:title> from TITLE and/or ALT and <image:caption> from FIGCAPTION tags if they are present.
Cheers,
Santeri
I got such error
python3 main.py --domain https://domain.com --output sitemap.xml
Traceback (most recent call last):
File "main.py", line 60, in <module>
crawl.run()
File "/root/python-sitemap/crawler.py", line 127, in run
self.__crawl(current_url)
File "/root/python-sitemap/crawler.py", line 264, in __crawl
final_url = response.geturl()
AttributeError: 'NoneType' object has no attribute 'geturl'
Hi, just wanted to say thanks for such a great library.
One need we have is to generate a sitemap for a site that has more than 50,000 URLs. The search engines typically only handle a maximum of 50,000 URLs per sitemap file, which means today that we manually create a sitemap index and move the URLs into individual sitemap files, each containing less than 50,000 URLs each.
One option I was considering was adding a feature to python-sitemap
that would optionally output a sitemap index and multiple sitemap files if there are more than 50,000 URLs; would that be of interest? Just wanted to make sure that kind of feature would be desired prior to implementing; thanks!
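The proposed feature could be sketched roughly like this: chunk the collected URLs into files of at most 50,000 entries each and emit one index pointing at them (file names, the base URL, and the bare-bones XML strings are all assumptions for illustration):

```python
def split_sitemaps(urls, base="https://www.example.com/", limit=50000):
    """Chunk a URL list into <=limit-entry sitemap files plus one index.
    File naming scheme and XML layout are illustrative assumptions."""
    chunks = [urls[i:i + limit] for i in range(0, len(urls), limit)]
    files = {f"sitemap-{n}.xml": chunk for n, chunk in enumerate(chunks, 1)}
    index = "<sitemapindex>" + "".join(
        f"<sitemap><loc>{base}{name}</loc></sitemap>" for name in files
    ) + "</sitemapindex>"
    return files, index

files, index = split_sitemaps([f"https://www.example.com/p{i}" for i in range(120000)])
print(len(files), [len(c) for c in files.values()])  # 3 [50000, 50000, 20000]
```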
Sometimes we have URLs that are canonicalized to other pages, and these should not be included in the sitemap. See Google's reference: https://developers.google.com/search/docs/advanced/sitemaps/build-sitemap
So the logic would be to look for a canonical tag and check whether it matches the crawled URL. If it does not, do not include that page in the sitemap.
I'm working on updating your code myself to include this but I'm still new to Python.
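A starting point for that logic, sketched with the standard-library HTMLParser (the class is a hypothetical illustration of the proposed check, not existing crawler code):

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Pull the href out of <link rel="canonical">, if present."""
    def __init__(self):
        super().__init__()
        self.canonical = None
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "canonical":
            self.canonical = a.get("href")

page = '<head><link rel="canonical" href="https://example.com/page/"></head>'
finder = CanonicalFinder()
finder.feed(page)
# Only keep the crawled URL in the sitemap if it IS the canonical one:
print(finder.canonical == "https://example.com/page/")  # True
```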
How do I create a video sitemap for my site?
Number of found URL : 1
Number of links crawled : 1
python main.py --domain https://www.domain.com --output sitemap.xml --report
<urlset
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
</urlset>
Just pulled the code and I get this:
$ python main.py www.cafonline.com --output sitemap.xml --verbose
Traceback (most recent call last):
File "main.py", line 8, in <module>
import crawler
File "/home/francisco/python-sitemap/crawler.py", line 105
print(config.xml_header, file=self.output_file)
^
SyntaxError: invalid syntax
Adding videos to the sitemap would work the same way as images, and if you handle ALT/TITLE and FIGCAPTION, the same code would work with <video:video> as well. So far I have not found a single public image sitemap generator, and the same applies to video sitemaps.
About video sitemaps: https://developers.google.com/webmasters/videosearch/sitemaps
Hello
I have a problem with python-sitemap on Windows and Python 3.7.2.
I haven't looked into the problem yet, but whatever I do (even a bare 'python main.py') I get:
Traceback (most recent call last):
File "C:\! git !\python-sitemap\main.py", line 8, in <module>
import crawler
File "C:\! git !\python-sitemap\crawler.py", line 240
image_link = f"{self.domain.strip("/")}{image_link.replace("./", "/")}"
^
SyntaxError: invalid syntax
It seems there is a problem:
[valentin@valentinpc crawler]$ python main.py --config config.json --debug
[...]
DEBUG:root:Number of link crawled : 15
DEBUG:root:Nb Code HTTP 200 : 14
Can you add a crawling-depth setting? Because my website has filtering and searching, the number of URLs grows to a very large amount.
I just read that Microsoft is acquiring GitHub. I have seen enough of Microsoft's love for open source for a lifetime to avoid everything that involves them. It is at best just a kiss of death: soon all users will be required to install Microsoft malware and open Microsoft accounts to use GitHub, and all our information will be for sale. I am quitting GitHub. So long, and thanks for all the fish.
Add xmllint to produce human-readable XML.
Ref to #26
Tracker image links get added, but they should be left out. You could simply check that the image extension is not php or js, or that it is a valid image type, before adding it:
Running:
python3 main.py --domain https://www.2globalnomads.info --output sitemap.xml --images --report --parserobots
Output:
<image:image><image:loc>https://analytics.2globalnomads.info/piwik.php?idsite=1&rec=1</image:loc>
It appears that the exclusion parameters (--skipext, --exclude, --drop) don't have any effect on images.
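The extension check suggested above could look like this; the whitelist and helper name are assumptions, and the query string is ignored so tracker endpoints such as piwik.php are rejected:

```python
import posixpath
from urllib.parse import urlparse

# Assumed whitelist of image types; adjust to taste:
VALID_IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg"}

def looks_like_image(url):
    # Inspect only the path's extension, ignoring the query string.
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    return ext in VALID_IMAGE_EXTS

print(looks_like_image("https://analytics.2globalnomads.info/piwik.php?idsite=1&rec=1"))  # False
print(looks_like_image("https://www.2globalnomads.info/logo.png"))  # True
```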
Hi! I propose changing the project name to Pysitemap.
Command:
python3 ~/sitemap/python-sitemap-master/main.py --domain https://www.2globalnomads.info//web-design-websites/ --image --output sitemap.xml --report
This location will appear twice in the sitemap because of the double slash:
<loc>https://www.2globalnomads.info//web-design-websites/</loc>
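One way to normalize such URLs before deduplication: collapse repeated slashes in the path while leaving the '//' of the scheme separator intact (a sketch; the helper name is made up):

```python
import re

def collapse_slashes(url):
    """Collapse repeated slashes in the path, but not in 'https://'
    (the negative lookbehind skips a run preceded by ':')."""
    return re.sub(r"(?<!:)//+", "/", url)

print(collapse_slashes("https://www.2globalnomads.info//web-design-websites/"))
# https://www.2globalnomads.info/web-design-websites/
```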
I have a website with millions of categorized records; it would be useful if I could limit the number of URLs to parse per section, e.g. the first 900,000 URLs under the /products/toys/ section but not from a higher category.
diff -urN python-sitemap-master/crawler.py python-sitemap-master/crawler.py
--- python-sitemap-master/crawler.py	2013-04-03 09:25:00.000000000 +0300
+++ python-sitemap-master/crawler.py	2013-06-08 11:27:24.706698113 +0300
@@ -5,6 +5,7 @@
 from urllib.request import urlopen, Request
 from urllib.robotparser import RobotFileParser
 from urllib.parse import urlparse
+from datetime import datetime
 import os
@@ -105,12 +106,17 @@
 			else:
 				self.response_code[response.getcode()]=1
 			response.close()
+			if 'last-modified' in response.headers:
+				date = response.headers['Last-Modified']
+			else:
+				date = response.headers['Date']
+			date = datetime.strptime(date, '%a, %d %b %Y %H:%M:%S %Z')
 		except Exception as e:
 			logging.debug ("{1} ===> {0}".format(e, crawling))
 			return self.__continue_crawling()
-		print ("<url><loc>"+url.geturl()+"</loc></url>", file=self.output_file)
+		print ("<url><loc>"+url.geturl()+"</loc><lastmod>"+date.strftime('%Y-%m-%dT%H:%M:%S')+"</lastmod></url>", file=self.output_file)
 		if self.output_file:
 			self.output_file.flush()