c4software / python-sitemap
Mini website crawler to make sitemap from a website.
License: GNU General Public License v3.0
Hi,
I am getting a SyntaxError when trying to execute the file, no matter what link I type in; wrapping the link in "" or '' doesn't help either.
Is there a way to "revert" the Python version back to 3.6 without installing another instance?
Or am I doing something wrong here?
Thx
If http://domain/dir/page1.html contains a link to page2.html, the parser interprets this as http://domain/page2.html; correct is http://domain/dir/page2.html.
Furthermore, on a page containing references to the upper directories (..), these are changed to . by self.clean_link.
I recommend using urllib.parse.urljoin(crawling_url, link) to resolve a link to an absolute URL. This will handle everything except "//" in the path.
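As a quick illustration of the suggested fix, urljoin from the standard library resolves a link against the page being crawled rather than the site root, and handles parent-directory references as well (the URLs are the examples above):

```python
from urllib.parse import urljoin

# A link found on http://domain/dir/page1.html resolves against that
# page, not against the site root:
print(urljoin("http://domain/dir/page1.html", "page2.html"))     # http://domain/dir/page2.html
# Parent-directory references (..) are handled too:
print(urljoin("http://domain/dir/page1.html", "../other.html"))  # http://domain/other.html
```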
Wanted to try and see if this would work with single-page apps like Angular, and it appears it will only pick up the index page. Hopefully support can be added for these types of use cases. Thanks.
$ python3 main.py --domain http://ua.shop-ink.su --output sitemap.xml
Fatal Python error: Cannot recover from stack overflow.
Current thread 0x00007fff7edeb180:
....
File "/Users/dchaplinsky/Projects/python-sitemap/crawler.py", line 201 in __continue_crawling
File "/Users/dchaplinsky/Projects/python-sitemap/crawler.py", line 197 in __crawling
...
Abort trap: 6
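The "Cannot recover from stack overflow" abort matches the recursive __crawling/__continue_crawling frames in the traceback: one stack frame per crawled page, so a deep site exhausts the stack. A hedged sketch of the usual fix, an explicit queue instead of recursion (crawl and fetch_links are illustrative names, not the script's API):

```python
from collections import deque

def crawl(start, fetch_links):
    """Breadth-first crawl with an explicit queue; no recursion, so the
    stack stays flat no matter how many pages the site has."""
    seen, queue = {start}, deque([start])
    while queue:
        url = queue.popleft()
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

# Tiny in-memory "site" standing in for real HTTP fetches:
site = {"a": ["b", "c"], "b": ["c"], "c": []}
print(sorted(crawl("a", site.__getitem__)))  # ['a', 'b', 'c']
```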
Could it be possible to restrict the search to a certain path? A (contrived) example would be to restrict a search to http://google.com/maps/ and ignore results which are in other "subdirectories" of http://google.com/. Using "domain" for this purpose does not work.
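A minimal sketch of how such a scope check could work, using urlparse from the standard library; the in_scope helper and its base URL are hypothetical, not part of the script:

```python
from urllib.parse import urlparse

def in_scope(url, base="http://google.com/maps/"):
    """Keep only URLs on the same host AND under the base path (hypothetical helper)."""
    b, u = urlparse(base), urlparse(url)
    return u.netloc == b.netloc and u.path.startswith(b.path)

print(in_scope("http://google.com/maps/place/foo"))  # True
print(in_scope("http://google.com/search?q=x"))      # False
```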
Command
python3 ~/sitemap/python-sitemap-master/main.py --domain https://www.books.2globalnomads.info --image --output sitemap.xml
Output
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 1: invalid start byte
With multiple errors: HTTP Error 404: Not Found
Running:
python3 main.py --domain https://www.2globalnomads.info --output sitemap.xml --images --report --parserobots
Output:
<image:loc>https://www.2globalnomads.infopaivi-santeri-kannisto/subscribe.png</image:loc></image:image><image:image><image:loc>https://www.2globalnomads.infopaivi-santeri-kannisto/logo.png</image:loc>
There should be a "/" in the URL before the path, between "info" and "paivi", like this: "info/paivi".
The same issue happens with all local URLs; remote URLs are all OK.
sebclick$ python3 main.py --config config.fb6.json
...
DEBUG:root:http://www.freebox-v6.fr//www.mediawiki.org/ ==> HTTP Error 404: Not Found
...
and in the sitemap.xml, I find the following line:
On running the command mentioned in the simple usage section of readme.md:
File "main.py", line 6, in <module>
import crawler
File "/Users/kartikey/Desktop/SoftwareIncubator/sitemapgen/python-sitemap/crawler.py", line 85
print(config.xml_header, file=self.output_file)
The issue with this tool is that once it halts, you have to start all over again from scratch.
And with large sites this is a very common scenario.
Since we already have the partially generated XML, it would be nice to continue from where it was interrupted. Let me know your thoughts on this and how to achieve it; I am willing to send a pull request once I have a better understanding of the code.
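One possible shape for such resumption, sketched under the assumption that the crawler keeps a set of crawled URLs and a set of URLs still to crawl; the checkpoint file name and JSON layout are made up for illustration:

```python
import json, os

STATE_FILE = "crawl_state.json"  # hypothetical checkpoint file name

def save_state(crawled, to_crawl):
    # Persist both the finished and the pending URL sets after each page
    # (or every N pages), so an interruption loses almost no work.
    with open(STATE_FILE, "w") as f:
        json.dump({"crawled": sorted(crawled), "to_crawl": sorted(to_crawl)}, f)

def load_state():
    # On start-up, resume from the checkpoint if one exists.
    if not os.path.exists(STATE_FILE):
        return set(), set()
    with open(STATE_FILE) as f:
        data = json.load(f)
    return set(data["crawled"]), set(data["to_crawl"])
```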
Sitemap should contain only URLs that belong to the same domain and are under the current directory where the sitemap is located. The same rule applies to images and videos. Currently the script adds all images without checking the domain or directory.
Hello,
Does this sitemap tool generate an index XML sitemap that lists all the smaller sitemap.xml files?
Can I add a limit? E.g., I have a lot of links on my website; could I set a limit of, say, 10,000 links, after which the script stops?
Does this script avoid adding duplicate links to the sitemap?
Thanks.
Hi,
Would you consider adding support for images in the future?
i.e.
https://support.google.com/webmasters/answer/178636?hl=en
Command:
python3 ~/sitemap/python-sitemap-master/main.py --domain https://www.forum.2globalnomads.info --image --output sitemap.xml --verbose
Your fix
--drop "sid=[a-z0-9]{32}"
Although it is not likely that people will run this script on phpbb3 forums, as there is already a mod for making sitemaps, please consider adding your workaround to the documentation.
The same infinite loop will happen with every phpbb3 installation, and there are tens of thousands of them. Also, it might be a good idea to add some kind of guard or timeout to detect loops, so the script can exit gracefully with a proper error message. A similar issue can happen with any website that has session management.
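For illustration, the --drop pattern above can be understood as a regular expression stripped from each URL before deduplication (a sketch of the idea, not the script's exact internals; the forum URL is made up):

```python
import re

# The pattern from the fix above, applied to a made-up phpbb-style URL:
drop = re.compile(r"sid=[a-z0-9]{32}")
url = "http://forum.example/viewtopic.php?t=1&sid=" + "a" * 32
# With the session token removed, revisits of the same page compare equal:
print(drop.sub("", url))  # http://forum.example/viewtopic.php?t=1&
```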
My french is not great but here goes
When there is an "HTTP Error 404", I wonder which hyperlink caused the problem.
This is not really an issue, but I did not find any better/other way to talk with you.
I made an online interface for the script. I am updating your code there manually after testing each release myself first. Hopefully it will make it easier for people to run the script and help attract others to join the project and do testing or possibly even coding.
The interface is available at: https://www.2globalnomads.info/web-design-websites/#generateimagesitemap.
If the URL contains non-ASCII characters, Python will report an error.
debug info:
INFO:root:Crawling #1: https://gvo.wiki/html/NPC掉落書籍.html
DEBUG:root:https://gvo.wiki/html/NPC掉落書籍.html ==> 'ascii' codec can't encode characters in position 13-16: ordinal not in range(128)
Solution:
import string
from urllib.parse import quote
then search for
current_url = self.urls_to_crawl.pop()
and add this line below it:
current_url = quote(current_url, safe=string.printable)
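A self-contained check of that fix (note that the original post imported unquote, but it is quote that is actually used):

```python
import string
from urllib.parse import quote

url = "https://gvo.wiki/html/NPC掉落書籍.html"
# Percent-encode everything outside printable ASCII so urlopen can
# handle the request without the 'ascii' codec error:
print(quote(url, safe=string.printable))
# https://gvo.wiki/html/NPC%E6%8E%89%E8%90%BD%E6%9B%B8%E7%B1%8D.html
```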
Dear Creator,
Thank you very much for creating this.
Is there a way to add hreflang tags automatically?
Take care.
All of my site's URLs include a trailing '/':
https://www.example.com/
https://www.example.com/dir/
not the following:
https://www.example.com
https://www.example.com/dir
This script emitted all of my links without the trailing '/'. How do I add the trailing '/' back in?
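One hedged sketch of a post-processing step that would restore the slashes: append '/' to directory-like paths while leaving file URLs alone (the helper name and the "no extension means directory" heuristic are assumptions, not the script's behavior):

```python
import posixpath
from urllib.parse import urlparse, urlunparse

def ensure_trailing_slash(url):
    """Hypothetical helper: append '/' to directory-like paths, leaving
    file URLs (paths with an extension) unchanged."""
    p = urlparse(url)
    path = p.path or "/"
    if not path.endswith("/") and not posixpath.splitext(path)[1]:
        path += "/"
    return urlunparse(p._replace(path=path))

print(ensure_trailing_slash("https://www.example.com"))      # https://www.example.com/
print(ensure_trailing_slash("https://www.example.com/dir"))  # https://www.example.com/dir/
```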
diff -urN python-sitemap-master/crawler.py python-sitemap-master/crawler.py
--- python-sitemap-master/crawler.py	2013-04-03 09:25:00.000000000 +0300
+++ python-sitemap-master/crawler.py	2013-06-08 11:09:44.910676587 +0300
@@ -84,8 +84,8 @@
 		url = urlparse(crawling)
 		self.crawled.add(crawling)
-		request = Request(crawling, headers={"User-Agent":config.crawler_user_agent})
 		try:
+			request = Request(crawling, headers={"User-Agent":config.crawler_user_agent})
 			response = urlopen(request)
 		except Exception as e:
 			if hasattr(e,'code'):
 				response.close()
 			return self.__continue_crawling()
 		# Read the response
When I run with any number of workers greater than 1, I get the following error after crawling around 40 urls.
INFO:root:Crawling #56: https://up.codes/s/natural-ventilation
ERROR:concurrent.futures:exception calling callback for <Future at 0x10ddc1190 state=finished returned NoneType>
Traceback (most recent call last):
File "/Users/danpatz/.pyenv/versions/3.7.4/lib/python3.7/concurrent/futures/_base.py", line 324, in _invoke_callbacks
callback(self)
File "/Users/danpatz/.pyenv/versions/3.7.4/lib/python3.7/asyncio/futures.py", line 362, in _call_set_state
dest_loop.call_soon_threadsafe(_set_state, destination, source)
File "/Users/danpatz/.pyenv/versions/3.7.4/lib/python3.7/asyncio/base_events.py", line 728, in call_soon_threadsafe
self._check_closed()
File "/Users/danpatz/.pyenv/versions/3.7.4/lib/python3.7/asyncio/base_events.py", line 475, in _check_closed
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
I'm on a Mac with Catalina. Seems to run fine on Linux.
Here's the command I'm using to repro:
python main.py --domain="https://up.codes" --output="sitemap.xml" -v -n 2
Command:
python3 ~/sitemap/python-sitemap-master/main.py --domain https://www.xetnet.fi --image --output sitemap.xml --verbose
Output:
INFO:root:Start the crawling process
INFO:root:Crawling #1: https://www.xetnet.fi
INFO:root:Crawling #2: https://www.xetnet.fi/category/ror/
INFO:root:Crawling #3: https://www.xetnet.fi/wordpress-asennus-webhotelliin-2/
INFO:root:Crawling #4: https://www.xetnet.fi/category/ruby/
INFO:root:Crawling #5: https://www.xetnet.fi/asiakaspalvelu/reilua-palvelua/
INFO:root:Crawling #6: https://www.xetnet.fi/webhotelli/wordpress-webhotelli/
INFO:root:Crawling #7: https://www.xetnet.fi/wordpress/
INFO:root:Crawling #8: https://www.xetnet.fi/palvelupaketin-vaihtaminen-suurempaan-tai-pienempaan/
Traceback (most recent call last):
File "/home/paivisanteri/sitemap/python-sitemap-master/main.py", line 53, in <module>
crawl.run()
File "/home/paivisanteri/sitemap/python-sitemap-master/crawler.py", line 101, in run
self.__crawling()
File "/home/paivisanteri/sitemap/python-sitemap-master/crawler.py", line 205, in __crawling
print ("<url><loc>"+self.htmlspecialchars(url.geturl())+"</loc>" + lastmod + image_list + "</url>", file=self.output_file)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 745: ordinal not in range(128)
Could this somehow be a local problem, or maybe something in my Python settings? I am not familiar with Python.
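One plausible cause (an assumption, since the script's file-handling code isn't shown here): the output file is opened with the platform's default encoding, which can be ASCII under some locales, so characters like the 'ä' in Finnish URLs fail to encode. Opening the file explicitly as UTF-8 avoids that:

```python
# Sketch: open the sitemap output explicitly as UTF-8 instead of relying
# on the locale's default encoding, so non-ASCII URLs don't raise
# UnicodeEncodeError.
with open("sitemap.xml", "w", encoding="utf-8") as output_file:
    print("<url><loc>https://www.xetnet.fi/päivä/</loc></url>", file=output_file)
```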
The script adds the root directory twice to the sitemap: the first entry, at the beginning, is without the ending slash, and the second entry, at the end, is with the ending slash.
See:
python3 sitemap.py --domain https://www.2globalnomads.info --image --output sitemap.xml
Output:
<url><loc>https://www.2globalnomads.info</loc><lastmod>2017-08-22T15:28:56+00:00</lastmod>
...
<url><loc>https://www.2globalnomads.info/</loc><lastmod>2017-08-22T15:28:56+00:00</lastmod>
Take image:title from TITLE and/or ALT and image:caption from FIGCAPTION tags if they are present.
I took a peek at your source code. One source of crawling issues is that you currently define not_parseable_ressources in the code. Instead, if you define parseable resources and limit those to only truly parseable resources that are supported in the sitemap and may contain plain-text HTML links, you can limit issues with unknown extensions. You might also take a look at using MIME types instead of file extensions. I am not sure how that works in Python, though.
Command:
python3 ~/sitemap/python-sitemap-master/main.py --domain https://www.2globalnomads.info --image --output sitemap.xml --report
Missing page:
Command
python3 ~/sitemap/python-sitemap-master/main.py --domain https://www.forum.2globalnomads.info --image --output sitemap.xml --verbose
Data URI image links get added, but they should be left out. They are commonly used, for example, for lazy-loading images. The real image URLs are inside NOSCRIPT tags, and they get added OK.
Running:
python3 main.py --domain https://www.2globalnomads.info --output sitemap.xml --images --report --parserobots
Output:
<image:loc>https://www.2globalnomads.info/data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7</image:loc>
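A sketch of the requested filter: data: URIs are inline placeholders, not addressable images a sitemap could reference, so they can be rejected with a simple prefix check (the helper name is made up):

```python
def is_real_image_link(link):
    """Hypothetical filter: skip data: URIs, which are inline lazy-load
    placeholders rather than fetchable image URLs."""
    return not link.strip().lower().startswith("data:")

print(is_real_image_link("data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"))  # False
print(is_real_image_link("https://www.2globalnomads.info/logo.png"))  # True
```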
Image sitemap is the only way to tell search engines the licenses of images. Please consider adding the script an option for a site-wide license for all images. It could work like this:
python3 main.py --domain https://www.2globalnomads.info --output sitemap.xml --license http://creativecommons.org/publicdomain/zero/1.0/
With the following output added inside <image:image> after <image:loc>:
<image:license>http://creativecommons.org/publicdomain/zero/1.0/</image:license>
You could pretty-print the sitemap.xml a bit and add newlines after every closing tag. That would make it a bit more human-readable.
If you want, you can also take <image:title> from TITLE and/or ALT and <image:caption> from FIGCAPTION tags if they are present.
Cheers,
Santeri
I got such error
python3 main.py --domain https://domain.com --output sitemap.xml
Traceback (most recent call last):
File "main.py", line 60, in <module>
crawl.run()
File "/root/python-sitemap/crawler.py", line 127, in run
self.__crawl(current_url)
File "/root/python-sitemap/crawler.py", line 264, in __crawl
final_url = response.geturl()
AttributeError: 'NoneType' object has no attribute 'geturl'
Hi, just wanted to say thanks for such a great library.
One need we have is to generate a sitemap for a site that has more than 50,000 URLs. The search engines typically only handle a maximum of 50,000 URLs per sitemap file, which means today that we manually create a sitemap index and move the URLs into individual sitemap files, each containing less than 50,000 URLs each.
One option I was considering was adding a feature to python-sitemap
that would optionally output a sitemap index and multiple sitemap files if there are more than 50,000 URLs; would that be of interest? Just wanted to make sure that kind of feature would be desired prior to implementing; thanks!
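The proposed feature could be sketched roughly like this: chunk the collected URLs into files of at most 50,000 entries each and emit one index pointing at them (file names, the base URL, and the bare-bones XML strings are all assumptions for illustration):

```python
def split_sitemaps(urls, base="https://www.example.com/", limit=50000):
    """Chunk a URL list into <=limit-entry sitemap files plus one index.
    File naming scheme and XML layout are illustrative assumptions."""
    chunks = [urls[i:i + limit] for i in range(0, len(urls), limit)]
    files = {f"sitemap-{n}.xml": chunk for n, chunk in enumerate(chunks, 1)}
    index = "<sitemapindex>" + "".join(
        f"<sitemap><loc>{base}{name}</loc></sitemap>" for name in files
    ) + "</sitemapindex>"
    return files, index

files, index = split_sitemaps([f"https://www.example.com/p{i}" for i in range(120000)])
print(len(files), [len(c) for c in files.values()])  # 3 [50000, 50000, 20000]
```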
Sometimes we have URLs that are canonicalized to other pages, and these should not be included in the sitemap. See Google's reference: https://developers.google.com/search/docs/advanced/sitemaps/build-sitemap
So the logic would be to look for a canonical tag and check whether it matches the crawled URL. If it does not, do not include that page in the sitemap.
I'm working on updating your code myself to include this but I'm still new to Python.
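A starting point for that logic, sketched with the standard-library HTMLParser (the class is a hypothetical illustration of the proposed check, not existing crawler code):

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Pull the href out of <link rel="canonical">, if present."""
    def __init__(self):
        super().__init__()
        self.canonical = None
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "canonical":
            self.canonical = a.get("href")

page = '<head><link rel="canonical" href="https://example.com/page/"></head>'
finder = CanonicalFinder()
finder.feed(page)
# Only keep the crawled URL in the sitemap if it IS the canonical one:
print(finder.canonical == "https://example.com/page/")  # True
```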
How do I create a video sitemap for my site?
Number of found URL : 1
Number of links crawled : 1
python main.py --domain https://www.domain.com --output sitemap.xml --report
<urlset
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
</urlset>
Just pulled the code and I get this:
$ python main.py www.cafonline.com --output sitemap.xml --verbose
Traceback (most recent call last):
File "main.py", line 8, in <module>
import crawler
File "/home/francisco/python-sitemap/crawler.py", line 105
print(config.xml_header, file=self.output_file)
^
SyntaxError: invalid syntax
Adding videos to the sitemap would work the same way as images, and if you handle ALT/TITLE and FIGCAPTION, the same code would work with <video:video> as well. So far I have not found a single public image sitemap generator, and the same applies to video sitemaps.
About video sitemaps: https://developers.google.com/webmasters/videosearch/sitemaps
Hello
I have a problem with python-sitemap on Windows and Python 3.7.2.
I haven't looked into the problem yet, but whatever I do (even a bare 'python main.py') I get:
Traceback (most recent call last):
File "C:\! git !\python-sitemap\main.py", line 8, in <module>
import crawler
File "C:\! git !\python-sitemap\crawler.py", line 240
image_link = f"{self.domain.strip("/")}{image_link.replace("./", "/")}"
^
SyntaxError: invalid syntax
It seems there is a problem:
[valentin@valentinpc crawler]$ python main.py --config config.json --debug
[...]
DEBUG:root:Number of link crawled : 15
DEBUG:root:Nb Code HTTP 200 : 14
Can you add a crawling-depth setting? Because my website has filtering and searching, the number of URLs grows to a very large amount.
I just read that Microsoft is acquiring GitHub. I have seen enough of Microsoft's love for open source for a lifetime to avoid everything that involves them. It is at best just a kiss of death: soon all users will be required to install Microsoft malware and open Microsoft accounts to use GitHub, and all our information will be for sale. I am quitting GitHub. So long, and thanks for all the fish.
Add xmllint to produce human-readable XML.
Ref to #26
Tracker image links get added, but they should be left out. You could simply check that the image extension is not php or js, or that it is a valid image type, before adding it:
Running:
python3 main.py --domain https://www.2globalnomads.info --output sitemap.xml --images --report --parserobots
Output:
<image:image><image:loc>https://analytics.2globalnomads.info/piwik.php?idsite=1&rec=1</image:loc>
It appears that the exclusion parameters (--skipext, --exclude, --drop) don't have any effect on images.
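The extension check suggested above could look like this; the whitelist and helper name are assumptions, and the query string is ignored so tracker endpoints such as piwik.php are rejected:

```python
import posixpath
from urllib.parse import urlparse

# Assumed whitelist of image types; adjust to taste:
VALID_IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg"}

def looks_like_image(url):
    # Inspect only the path's extension, ignoring the query string.
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    return ext in VALID_IMAGE_EXTS

print(looks_like_image("https://analytics.2globalnomads.info/piwik.php?idsite=1&rec=1"))  # False
print(looks_like_image("https://www.2globalnomads.info/logo.png"))  # True
```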
Hi! I propose changing the project name to Pysitemap.
Command:
python3 ~/sitemap/python-sitemap-master/main.py --domain https://www.2globalnomads.info//web-design-websites/ --image --output sitemap.xml --report
This location will appear twice in the sitemap because of the double slash:
<loc>https://www.2globalnomads.info//web-design-websites/</loc>
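One way to normalize such URLs before deduplication: collapse repeated slashes in the path while leaving the '//' of the scheme separator intact (a sketch; the helper name is made up):

```python
import re

def collapse_slashes(url):
    """Collapse repeated slashes in the path, but not in 'https://'
    (the negative lookbehind skips a run preceded by ':')."""
    return re.sub(r"(?<!:)//+", "/", url)

print(collapse_slashes("https://www.2globalnomads.info//web-design-websites/"))
# https://www.2globalnomads.info/web-design-websites/
```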
I have a website with millions of categorized records; it would be useful if I could limit the number of URLs to parse per section, e.g. the first 900,000 URLs under the /products/toys/ section but not from a higher category.
diff -urN python-sitemap-master/crawler.py python-sitemap-master/crawler.py
--- python-sitemap-master/crawler.py	2013-04-03 09:25:00.000000000 +0300
+++ python-sitemap-master/crawler.py	2013-06-08 11:27:24.706698113 +0300
@@ -5,6 +5,7 @@
 from urllib.request import urlopen, Request
 from urllib.robotparser import RobotFileParser
 from urllib.parse import urlparse
+from datetime import datetime
 import os
@@ -105,12 +106,17 @@
 			else:
 				self.response_code[response.getcode()]=1
 			response.close()
+			if 'last-modified' in response.headers:
+				date = response.headers['Last-Modified']
+			else:
+				date = response.headers['Date']
+			date = datetime.strptime(date, '%a, %d %b %Y %H:%M:%S %Z')
 		except Exception as e:
 			logging.debug ("{1} ===> {0}".format(e, crawling))
 			return self.__continue_crawling()
-		print ("<url><loc>"+url.geturl()+"</loc></url>", file=self.output_file)
+		print ("<url><loc>"+url.geturl()+"</loc><lastmod>"+date.strftime('%Y-%m-%dT%H:%M:%S')+"</lastmod></url>", file=self.output_file)
 		if self.output_file:
 			self.output_file.flush()