
Comments (14)

akkana commented on August 16, 2024

Thanks for the report. It looks like I never updated weborphans for python 3: some syntaxes are different, but there were also a couple of more subtle differences in the way the urllib library functions worked.

I've updated it and I think I have it working now. I'm not sure it's 100%, but the output at least looks reasonable. Try it and see, and if you still see problems, either reopen this bug or file a new one.

from scripts.

mi-hol commented on August 16, 2024

I've also tested with Python 3.10.8; the issue remains the same.


mi-hol commented on August 16, 2024

I've added print(dir(lurl)), and the output shows all available attributes:
['__add__',
'__class__',
'__class_getitem__',
'__contains__',
'__delattr__',
'__delitem__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__getitem__',
'__gt__',
'__hash__',
'__iadd__',
'__imul__',
'__init__',
'__init_subclass__',
'__iter__',
'__le__',
'__len__',
'__lt__',
'__mul__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__reversed__',
'__rmul__',
'__setattr__',
'__setitem__',
'__sizeof__',
'__str__',
'__subclasshook__',
'append',
'clear',
'copy',
'count',
'extend',
'index',
'insert',
'pop',
'remove',
'reverse',
'sort']


mi-hol commented on August 16, 2024

Thanks for this great script!

It worked, with one minor issue:

mholzer@PB6460b-DE:/mnt/c/Users/mholzer/Documents/git/rusefi_documentation$ python3.10 -d ./wiki-tools/weborphans.py /mnt/c/Users/mholzer/Documents/git/rusefi_documentation/ https://wiki.rusefi.com/ >./wiki-tools/weborphans_wiki.rusefi.com.log
Traceback (most recent call last):
  File "/mnt/c/Users/mholzer/Documents/git/rusefi_documentation/./wiki-tools/weborphans.py", line 12, in <module>
    from bs4 import BeautifulSoup
ModuleNotFoundError: No module named 'bs4'
mholzer@PB6460b-DE:/mnt/c/Users/mholzer/Documents/git/rusefi_documentation$ pip install bs4
Requirement already satisfied: bs4 in /home/mholzer/.local/lib/python3.9/site-packages (0.0.1)
Requirement already satisfied: beautifulsoup4 in /usr/lib/python3/dist-packages (from bs4) (4.9.3)
Requirement already satisfied: soupsieve>1.2 in /usr/lib/python3/dist-packages (from beautifulsoup4->bs4) (2.2.1)

I worked around it by commenting out the import: #from bs4 import BeautifulSoup

The output below seems to indicate that the logic for identifying the "web root" will need adapting for my case.
Did I understand that correctly?

rootdir: /mnt/c/Users/mholzer/Documents/git/rusefi_documentation/
rooturl: https://wiki.rusefi.com/
rooturlpath: /
scheme: https
host: wiki.rusefi.com

Can't find an index file inside /mnt/c/Users/mholzer/Documents/git/rusefi_documentation
Can't find an index file inside /mnt/c/Users/mholzer/Documents/git/rusefi_documentation
Done spiding

URLs succeeded:


Outside URLs:


URLs failed:


Orphans:
....
0 good links, 0 external urls not checked, 0 bad links, 2489 orphaned files.


mi-hol commented on August 16, 2024

The root URL is built from the file https://raw.githubusercontent.com/rusefi/rusefi_documentation/master/_Sidebar.md


akkana commented on August 16, 2024

The script won't be able to follow links if you comment out the BeautifulSoup import. Is it possible your PYTHONPATH isn't seeing the bs4 from pip install? What happens if you just run python3 and type: from bs4 import BeautifulSoup ?
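One way to check this is to ask the interpreter itself where it would load each module from. The following is only a diagnostic sketch, not part of weborphans; the helper name describe_module is made up for illustration. Run it with the same interpreter used to run the script (for example python3.10).

```python
import importlib.util
import sys


def describe_module(name):
    """Return the file a top-level module would be imported from, or None.

    describe_module is a hypothetical helper for this sketch only.
    """
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # The interpreter path matters: pip may have installed bs4 for a
    # different Python version than the one running the script.
    print("interpreter:", sys.executable)
    print("version:    ", sys.version.split()[0])
    for mod in ("bs4", "lxml"):
        print(f"{mod:12}", describe_module(mod) or "NOT FOUND")
    print("search path:")
    for entry in sys.path:
        print("   ", entry)
```

If bs4 shows as NOT FOUND while pip reports it as installed, comparing the pip install location with the printed search path usually shows which Python version the package was installed for.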

weborphans assumes that a link of directoryname/ will be filled by the web server (I use apache2) looking for index.html, index.php or index.cgi in that directory. If you're using some sort of content management system that remaps URLs in some other way, then weborphans has no way to tell that and it probably won't work for you. If you can figure out what file in the directory is providing the content, it might be possible to add that.
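The index-file assumption described above can be sketched roughly as follows. This is an illustration of the idea, not the actual weborphans code; the function name find_index_file and the INDEX_NAMES tuple are assumptions made for this example.

```python
import os

# File names a web server such as apache2 is assumed to serve for "dirname/".
INDEX_NAMES = ("index.html", "index.php", "index.cgi")


def find_index_file(localdir, index_names=INDEX_NAMES):
    """Return the path of the first index file found in localdir, or None.

    Hypothetical helper: it mirrors the lookup described in the text,
    not the script's real implementation.
    """
    for name in index_names:
        candidate = os.path.join(localdir, name)
        if os.path.isfile(candidate):
            return candidate
    return None
```

Supporting another setup would then mostly be a matter of passing a different index_names tuple, provided the file that supplies the content can be identified.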


mi-hol commented on August 16, 2024

You are right, PYTHONPATH seems to be the root cause:

Python 3.10.8 (main, Dec 26 2022, 15:36:55) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'bs4'


akkana commented on August 16, 2024

I don't really have any useful advice on getting pip install to .local working reliably. I know it used to be super flaky several years ago, to the point where I gave up on it. I use python packages from my distro when I can (Debian and Ubuntu both package python3-bs4), and when I need to use packages installed with pip, I use a virtualenv, which seems to be better supported and tested than .local.


mi-hol commented on August 16, 2024

Thank you so much for the hints. It all works with a venv.

weborphans assumes that a link of directoryname/ will be filled by the web server (I use apache2) looking for index.html, index.php or index.cgi in that directory.

As this seems to be the basic assumption that must be met to produce correct output, how about making this a fatal error?

instead of

if not localpath:
      print("Can't find an index file inside", localdir)
      return

I'd make it

if not localpath:
      raise RuntimeError(f"Can't find an index file inside localdir: {localdir} !! All files would be reported incorrectly as orphans")


mi-hol commented on August 16, 2024

If you can figure out what file in the directory is providing the content, it might be possible to add that.

The MkDocs framework is used to create the website.
The variable mkdocs_page_input_path contains the source file name.
Would that be a starting point?

[two screenshots]


akkana commented on August 16, 2024

I don't want check_url to raise an exception, because it's not just called on the first URL, it's called on sub-URLs too. So if you have a home page that points to /subpage, and there's no subpage/index.*, raising an exception would kill the whole run instead of just printing the error about subpage/.
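A possible middle ground, sketched here only as an idea and not as what weborphans does, is to abort when the root directory has no index file but keep just printing the message for subdirectories. The names has_index and check_dir and the is_root flag are hypothetical.

```python
import os
import sys

# Index file names assumed from the discussion above.
INDEX_NAMES = ("index.html", "index.php", "index.cgi")


def has_index(localdir):
    """True if localdir contains one of the assumed index files."""
    return any(os.path.isfile(os.path.join(localdir, name))
               for name in INDEX_NAMES)


def check_dir(localdir, is_root=False):
    """Return True if an index file exists; abort only for the root directory.

    Hypothetical helper: a missing index in a subdirectory is reported
    and the run continues, so one bad sub-URL does not kill everything.
    """
    if has_index(localdir):
        return True
    if is_root:
        sys.exit(f"Can't find an index file inside root directory {localdir}; "
                 "every file would be reported as an orphan")
    print("Can't find an index file inside", localdir)
    return False
```

With this split, a wrong root directory fails fast, while a single broken subpage still ends up only as a printed error.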

The screenshots don't really answer what MkDocs is doing. I'm not even sure what (Index) means in your file browser: is that the actual filename, including the parentheses and everything? I guess MkDocs is converting markdown (Home.md) on the fly to HTML and serving it as index.html? Something that specific would probably need a web checker with rules specific to MkDocs, and since I don't use it, I don't know what those rules are. Parsing the markdown files probably isn't too hard, but figuring out which markdown files correspond to which URLs requires knowledge of how MkDocs works.

But MkDocs's home page says it's a static site generator. Couldn't you generate the HTML site, then run weborphans on that?


mi-hol commented on August 16, 2024

Couldn't you generate the HTML site, then run weborphans on that?

Yes, after I wrote the reply I had the same idea :)
Unfortunately, I'm fighting with Python again to make that happen.

(weborphans_venv) mholzer@PB6460b-DE:/mnt/c/Users/mholzer/Documents/git/rusefi_documentation$ python3.10 wiki-tools/weborphans.py mkdocs/site/Home/ https://wiki.rusefi.com
rootdir: mkdocs/site/Home/
rooturl: https://wiki.rusefi.com/
rooturlpath: /
scheme: https
host: wiki.rusefi.com

EEK! Non-relative URL passed to check_url, bailing
Traceback (most recent call last):
  File "/mnt/c/Users/mholzer/Documents/git/rusefi_documentation/wiki-tools/weborphans.py", line 346, in <module>
    spider.spide()
  File "/mnt/c/Users/mholzer/Documents/git/rusefi_documentation/wiki-tools/weborphans.py", line 95, in spide
    self.check_url(self.urls_to_check.pop())
  File "/mnt/c/Users/mholzer/Documents/git/rusefi_documentation/wiki-tools/weborphans.py", line 303, in check_url
    soup = BeautifulSoup(content, 'lxml')
  File "/usr/lib/python3/dist-packages/bs4/__init__.py", line 243, in __init__
    raise FeatureNotFound(
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

The issues to solve seem to be:

  1. How to solve "bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml."
  2. The correct command line / local setup to avoid "EEK! Non-relative URL passed to check_url,"

Would you have any suggestions?

Below is the directory structure and index.html content
[screenshot]


akkana commented on August 16, 2024

I'm surprised lxml isn't pulled in as a dependency of bs4. On Debian, the package is python3-lxml; on pip, it's just lxml.

I've just pushed a change that clarifies the EEK non-relative message by showing the URL that triggered it, but that's not the real problem. The problem is that BeautifulSoup needs an HTML parser (lxml).
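For anyone who cannot install lxml, one workaround is to fall back to html.parser, the parser from the Python standard library that BeautifulSoup also accepts. This is a sketch of that idea, not something weborphans does; pick_parser is a made-up name, and the two parsers can differ in how they treat malformed HTML.

```python
import importlib.util


def pick_parser():
    """Return "lxml" if it is importable, otherwise the stdlib "html.parser".

    Hypothetical helper: the returned string is meant to be passed as the
    second argument to BeautifulSoup.
    """
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


if __name__ == "__main__":
    # Intended use, assuming bs4 is installed:
    #     soup = BeautifulSoup(content, pick_parser())
    print("parser that would be used:", pick_parser())
```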


mi-hol commented on August 16, 2024

On Debian, the package is python3-lxml; on pip, it's just lxml.

It appears my Debian WSL environment is screwed.
I apologize for the effort it caused.

I switched to Ubuntu, where it all worked instantly.
On the first run, the script displayed the "bailing" message because of a missing trailing slash on the URL.
On the second run, with the trailing slash added to the URL, the "bailing" message was no longer displayed.
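The trailing-slash difference can be illustrated with urllib.parse from the standard library. This is a sketch under assumptions, not the script's code: normalize_root_url is a hypothetical helper that appends the missing slash so that both spellings of the root URL compare equal.

```python
from urllib.parse import urlparse, urlunparse


def normalize_root_url(url):
    """Return url with a trailing slash on its path (hypothetical helper)."""
    parts = urlparse(url)
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunparse(parts._replace(path=path))


if __name__ == "__main__":
    for url in ("https://wiki.rusefi.com", "https://wiki.rusefi.com/"):
        parts = urlparse(url)
        # Without the slash the path component is empty, so a naive string
        # comparison against "https://wiki.rusefi.com/" fails.
        print(repr(url), "path:", repr(parts.path),
              "normalized:", normalize_root_url(url))
```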

I was surprised by the "0 orphaned files" result, which I did not expect.
I'll check the bad links first, fix them, and then re-run.

mholzer@PB6460b-DE:/mnt/c/Users/mholzer/Documents/git/rusefi_documentation$ python3 wiki-tools/weborphans.py mkdocs/site/Home/ https://wiki.rusefi.com
rootdir: mkdocs/site/Home/
rooturl: https://wiki.rusefi.com/
rooturlpath: /
scheme: https
host: wiki.rusefi.com

EEK! Non-relative URL 'https://wiki.rusefi.com' passed to check_url, bailing
Done spiding

URLs succeeded:
/

Outside URLs:
...


Orphans:


1 good links, 16 external urls not checked, 62 bad links, 0 orphaned files.

