Comments (14)
Thanks for the report. It looks like I never updated weborphans for Python 3: some of the syntax is different, and there were also a couple of subtler differences in the way the urllib library functions work.
I've updated it and I think I have it working now. I'm not sure it's 100%, but the output at least looks reasonable. Try it and see, and if you still see problems, either reopen this bug or file a new one.
from scripts.
I've tested with Python 3.10.8 too; the issue remains the same.
I've added print(dir(lurl)), and the output shows all available attributes:
['__add__', '__class__', '__class_getitem__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'copy', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']
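The names listed above match the attributes of a built-in Python list, so lurl appears to be a plain list at that point. A shorter diagnostic than dumping dir() is to print the type, as in the sketch below. The sample value assigned to lurl is an assumption for illustration; the thread does not show what the variable actually holds.

```python
# Hypothetical stand-in for the value being inspected; the real contents
# of lurl are not shown in the thread.
lurl = ["https://example.com/", "text/html"]


def describe(value):
    """Return the type name and, for sized objects, the length."""
    name = type(value).__name__
    try:
        return f"{name} (len {len(value)})"
    except TypeError:
        # Objects without a length (ints, response objects, ...) land here.
        return name


print(describe(lurl))

# The public attribute names are the familiar list methods.
public = sorted(n for n in dir(lurl) if not n.startswith("_"))
print(public)
```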
Thanks for this great script!
It worked, with one minor issue:
mholzer@PB6460b-DE:/mnt/c/Users/mholzer/Documents/git/rusefi_documentation$ python3.10 -d ./wiki-tools/weborphans.py /mnt/c/Users/mholzer/Documents/git/rusefi_documentation/ https://wiki.rusefi.com/ >./wiki-tools/weborphans_wiki.rusefi.com.log
Traceback (most recent call last):
File "/mnt/c/Users/mholzer/Documents/git/rusefi_documentation/./wiki-tools/weborphans.py", line 12, in <module>
from bs4 import BeautifulSoup
ModuleNotFoundError: No module named 'bs4'
mholzer@PB6460b-DE:/mnt/c/Users/mholzer/Documents/git/rusefi_documentation$ pip install bs4
Requirement already satisfied: bs4 in /home/mholzer/.local/lib/python3.9/site-packages (0.0.1)
Requirement already satisfied: beautifulsoup4 in /usr/lib/python3/dist-packages (from bs4) (4.9.3)
Requirement already satisfied: soupsieve>1.2 in /usr/lib/python3/dist-packages (from beautifulsoup4->bs4) (2.2.1)
I worked around it by commenting out the import: #from bs4 import BeautifulSoup
The output below seems to indicate that the logic for identifying the "web root" will need adapting for my case.
Did I understand that correctly?
rootdir: /mnt/c/Users/mholzer/Documents/git/rusefi_documentation/
rooturl: https://wiki.rusefi.com/
rooturlpath: /
scheme: https
host: wiki.rusefi.com
Can't find an index file inside /mnt/c/Users/mholzer/Documents/git/rusefi_documentation
Can't find an index file inside /mnt/c/Users/mholzer/Documents/git/rusefi_documentation
Done spiding
URLs succeeded:
Outside URLs:
URLs failed:
Orphans:
....
0 good links, 0 external urls not checked, 0 bad links, 2489 orphaned files.
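The scheme, host, and rooturlpath values in the output above are the kind of pieces that Python's urllib.parse.urlparse returns. The sketch below shows that split for the URL used in this run. Whether weborphans computes them exactly this way is an assumption; the variable names simply mirror the labels in the output.

```python
from urllib.parse import urlparse

rooturl = "https://wiki.rusefi.com/"
parsed = urlparse(rooturl)

scheme = parsed.scheme       # protocol part of the URL
host = parsed.netloc         # host name (and port, if one is given)
rooturlpath = parsed.path    # path component below the host

print("scheme:", scheme)
print("host:", host)
print("rooturlpath:", rooturlpath)
```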
The root URL is built from the file https://raw.githubusercontent.com/rusefi/rusefi_documentation/master/_Sidebar.md
The script won't be able to follow links if you comment out the BeautifulSoup import. Is it possible your PYTHONPATH isn't seeing the bs4 from pip install? What happens if you just run python3 and type: from bs4 import BeautifulSoup ?
weborphans assumes that a link of directoryname/ will be filled by the web server (I use apache2) looking for index.html, index.php or index.cgi in that directory. If you're using some sort of content management system that remaps URLs in some other way, then weborphans has no way to tell that and it probably won't work for you. If you can figure out what file in the directory is providing the content, it might be possible to add that.
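The index-file assumption described above can be sketched as follows. find_index_file is a hypothetical helper, not the function weborphans actually uses; the candidate names come from the comment above, and their order of preference is an assumption.

```python
import os
import tempfile

# Candidate names taken from the comment above; the order is an assumption.
INDEX_NAMES = ("index.html", "index.php", "index.cgi")


def find_index_file(localdir):
    """Return the path of the first index file found in localdir, or None."""
    for name in INDEX_NAMES:
        candidate = os.path.join(localdir, name)
        if os.path.isfile(candidate):
            return candidate
    return None


# Small demonstration using a temporary directory as the web root.
with tempfile.TemporaryDirectory() as webroot:
    print(find_index_file(webroot))  # nothing to find in an empty directory
    with open(os.path.join(webroot, "index.html"), "w") as fp:
        fp.write("<html></html>")
    print(find_index_file(webroot))  # now the index.html path is returned
```

A directory that holds only other file names, such as markdown sources, yields None in this sketch, which corresponds to the "Can't find an index file" message in the run above.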
You are right, PYTHONPATH seems to be the root cause:
Python 3.10.8 (main, Dec 26 2022, 15:36:55) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'bs4'
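One detail worth noticing in the earlier pip output is that bs4 was reported under a python3.9 site-packages directory, while the script was run with python3.10. A small check like the one below, run with the same interpreter that runs weborphans, shows which Python is in use and whether it can locate a module. This is only a diagnostic sketch; module_location is a hypothetical helper name.

```python
import importlib.util
import sys


def module_location(name):
    """Return the file a top-level module would load from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


print("interpreter:", sys.executable)
print("version:", sys.version.split()[0])
for name in ("bs4", "lxml"):
    print(name, "->", module_location(name) or "NOT FOUND")
```

If a module is reported as not found, installing it through the interpreter's own pip (for example, python3.10 -m pip install beautifulsoup4) targets that interpreter rather than whichever pip happens to be first on PATH.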
I don't really have any useful advice on getting pip install to .local working reliably. I know it used to be super flaky several years ago, to the point where I gave up on it. I use python packages from my distro when I can (Debian and Ubuntu both package python3-bs4), and when I need to use packages installed with pip, I use a virtualenv, which seems to be better supported and tested than .local.
Thank you so much for the hints. It all works with a venv.
weborphans assumes that a link of directoryname/ will be filled by the web server (I use apache2) looking for index.html, index.php or index.cgi in that directory.
As this seems to be the basic assumption that must hold in order to produce correct output, how about making this a fatal error?
Instead of
    if not localpath:
        print("Can't find an index file inside", localdir)
        return
I'd make it
    if not localpath:
        raise RuntimeError(f"Can't find an index file inside localdir: {localdir} !! All files would be reported incorrectly as orphans")
If you can figure out what file in the directory is providing the content, it might be possible to add that.
The MkDocs framework is used to create the website.
The variable mkdocs_page_input_path contains the source file name.
Would that be a starting point?
I don't want check_url to raise an exception, because it's not just called on the first URL, it's called on sub-URLs too. So if you have a home page that points to /subpage, and there's no subpage/index.*, raising an exception would kill the whole run instead of just printing the error about subpage/.
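One possible middle ground, which the thread itself does not adopt, is to treat a missing index file as fatal only for the root directory and keep the printed message for sub-URLs. The sketch below illustrates that idea with hypothetical names (resolve_index, is_root); it is not the actual check_url code.

```python
import os
import tempfile


def resolve_index(localdir, is_root=False):
    """Return the index file path in localdir, or None if there is none.

    For the root directory a missing index raises an error, since nothing
    could be spidered from there. For sub-URLs only a message is printed,
    so the rest of the run can continue.
    """
    for name in ("index.html", "index.php", "index.cgi"):
        path = os.path.join(localdir, name)
        if os.path.isfile(path):
            return path
    if is_root:
        raise RuntimeError(f"Can't find an index file inside root dir {localdir}")
    print("Can't find an index file inside", localdir)
    return None


with tempfile.TemporaryDirectory() as empty_dir:
    resolve_index(empty_dir)  # sub-URL case: prints a message and continues
    try:
        resolve_index(empty_dir, is_root=True)  # root case: raises
    except RuntimeError as err:
        print("fatal:", err)
```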
The screenshots don't really answer what MkDocs is doing. I'm not even sure what (Index) means in your file browser: is that the actual filename, including the parentheses and everything? I guess MkDocs is converting markdown (Home.md) on the fly to HTML and serving it as index.html? Something that specific would probably need a web checker with rules specific to MkDocs, and since I don't use it, I don't know what those rules are. Parsing the markdown files probably isn't too hard, but figuring out which markdown files correspond to which URLs requires knowledge of how MkDocs works.
But MkDocs's home page says it's a static site generator. Couldn't you generate the HTML site, then run weborphans on that?
Couldn't you generate the HTML site, then run weborphans on that?
Yes, after I wrote the reply I had the same idea :)
Unfortunately, I'm fighting with Python again to make that happen.
(weborphans_venv) mholzer@PB6460b-DE:/mnt/c/Users/mholzer/Documents/git/rusefi_documentation$ python3.10 wiki-tools/weborphans.py mkdocs/site/Home/ https://wiki.rusefi.com
rootdir: mkdocs/site/Home/
rooturl: https://wiki.rusefi.com/
rooturlpath: /
scheme: https
host: wiki.rusefi.com
EEK! Non-relative URL passed to check_url, bailing
Traceback (most recent call last):
File "/mnt/c/Users/mholzer/Documents/git/rusefi_documentation/wiki-tools/weborphans.py", line 346, in <module>
spider.spide()
File "/mnt/c/Users/mholzer/Documents/git/rusefi_documentation/wiki-tools/weborphans.py", line 95, in spide
self.check_url(self.urls_to_check.pop())
File "/mnt/c/Users/mholzer/Documents/git/rusefi_documentation/wiki-tools/weborphans.py", line 303, in check_url
soup = BeautifulSoup(content, 'lxml')
File "/usr/lib/python3/dist-packages/bs4/__init__.py", line 243, in __init__
raise FeatureNotFound(
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
The issues to solve seem to be:
- how to fix "bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml."
- the correct command line or local setup to avoid "EEK! Non-relative URL passed to check_url,"
Would you have any suggestions?
Below is the directory structure and index.html content
I'm surprised lxml isn't pulled in as a dependency of bs4. On Debian, the package is python3-lxml; on pip, it's just lxml.
I've just pushed a change that clarifies the EEK non-relative message by showing the URL that triggered it, but that's not the real problem: BeautifulSoup needs an HTML parser (lxml).
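If installing lxml is not an option, BeautifulSoup also accepts 'html.parser', the parser from Python's standard library, as its second argument. The sketch below picks a parser name depending on whether lxml can be found. pick_parser is a hypothetical helper, and switching parsers in weborphans this way is an assumption rather than something the script does; the two parsers may also treat malformed HTML differently.

```python
import importlib.util


def pick_parser():
    """Return "lxml" if it can be imported, else the stdlib "html.parser"."""
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


parser = pick_parser()
print("parser:", parser)
# With bs4 installed, the chosen name would then be passed along, e.g.:
#     soup = BeautifulSoup(content, parser)
```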
On Debian, the package is python3-lxml; on pip, it's just lxml.
It appears my Debian WSL environment is screwed.
Apologies for the effort it caused.
I switched to Ubuntu, where it all worked instantly.
On the first run, the script displayed the "bailing" message because of the missing trailing slash on the URL.
On the second run, with the trailing slash added to the URL, the "bailing" message was no longer displayed.
I was surprised by the "0 orphaned files" result, which I did not expect.
I'll check the bad links first, fix them, and then re-run.
mholzer@PB6460b-DE:/mnt/c/Users/mholzer/Documents/git/rusefi_documentation$ python3 wiki-tools/weborphans.py mkdocs/site/Home/ https://wiki.rusefi.com
rootdir: mkdocs/site/Home/
rooturl: https://wiki.rusefi.com/
rooturlpath: /
scheme: https
host: wiki.rusefi.com
EEK! Non-relative URL 'https://wiki.rusefi.com' passed to check_url, bailing
Done spiding
URLs succeeded:
/
Outside URLs:
...
Orphans:
1 good links, 16 external urls not checked, 62 bad links, 0 orphaned files.
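The bailing message above was triggered by a root URL given without a trailing slash. One way to guard against that, sketched below, is to normalize the URL before spidering so that an empty path becomes "/". ensure_root_slash is a hypothetical helper and is not part of weborphans.

```python
from urllib.parse import urlsplit, urlunsplit


def ensure_root_slash(url):
    """Return url with "/" as its path if the path is empty."""
    parts = urlsplit(url)
    if parts.path == "":
        parts = parts._replace(path="/")
    return urlunsplit(parts)


print(ensure_root_slash("https://wiki.rusefi.com"))
print(ensure_root_slash("https://wiki.rusefi.com/"))
```

URLs that already have a path are returned unchanged, so the helper only affects the bare-host case seen in this run.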