urlstechie / urlchecker-python

:snake: :link: Python module and client for checking URLs

Home Page: https://urlchecker-python.readthedocs.io

License: MIT License

Python 97.97% Dockerfile 2.03%
urlchecker python url-check url-checker hacktoberfest

urlchecker-python's Introduction


urlchecker-python

This is a Python module to collect URLs from static files (code and documentation) and then test for and report broken links. If you are interested in using this as a GitHub Action, see urlchecker-action. There are also container bases available on quay.io/urlstechie/urlchecker. As of version 0.0.26, we use multiprocessing so the checks run a lot faster, and you can set URLCHECKER_WORKERS to change the number of workers (defaults to 9). If you don't want multiprocessing, use version 0.0.25 or earlier.
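For example, to pin the number of workers on a small machine, here is a minimal sketch (assuming URLCHECKER_WORKERS is read from the environment at check time; the Python API used here is described under Usage from Python below):

import os

# Assumption: URLCHECKER_WORKERS is read from the environment when a check
# starts; the value below is arbitrary.
os.environ["URLCHECKER_WORKERS"] = "4"

from urlchecker.core.check import UrlChecker

checker = UrlChecker(os.getcwd())
checker.run()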

Module Documentation

Detailed documentation of the code is available at urlchecker-python.readthedocs.io

Usage

Install

You can install the urlchecker from pypi. Before you do, it's recommended to first install fake-useragent from this fork:

pip install git+https://github.com/danger89/fake-useragent.git

And then urlchecker:

$ pip install urlchecker

or install from the repository directly:

$ git clone https://github.com/urlstechie/urlchecker-python.git
$ cd urlchecker-python
$ python setup.py install

Installation will place a binary, urlchecker, in your Python path.

$ which urlchecker
/home/vanessa/anaconda3/bin/urlchecker

Check Local Folder

Your most likely use case will be to check a local directory with static files (documentation or code) for broken links. In this case, you can use urlchecker check:

$ urlchecker check --help
usage: urlchecker check [-h] [-b BRANCH] [--subfolder SUBFOLDER] [--cleanup] [--serial] [--no-check-certs]
                        [--force-pass] [--no-print] [--verbose] [--file-types FILE_TYPES] [--files FILES]
                        [--exclude-urls EXCLUDE_URLS] [--exclude-patterns EXCLUDE_PATTERNS]
                        [--exclude-files EXCLUDE_FILES] [--save SAVE] [--retry-count RETRY_COUNT] [--timeout TIMEOUT]
                        path

positional arguments:
  path                  the local path or GitHub repository to clone and check

options:
  -h, --help            show this help message and exit
  -b BRANCH, --branch BRANCH
                        if cloning, specify a branch to use (defaults to main)
  --subfolder SUBFOLDER
                        relative subfolder path within path (if not specified, we use root)
  --cleanup             remove root folder after checking (defaults to False, no cleaup)
  --serial              run checks in serial (no multiprocess)
  --no-check-certs      Allow urls to validate that fail certificate checks
  --force-pass          force successful pass (return code 0) regardless of result
  --no-print            Skip printing results to the screen (defaults to printing to console).
  --verbose             Print file names for failed urls in addition to the urls.
  --file-types FILE_TYPES
                        comma separated list of file extensions to check (defaults to .md,.py)
  --files FILES         comma separated list of exact files or patterns to check.
  --exclude-urls EXCLUDE_URLS
                        comma separated links to exclude (no spaces)
  --exclude-patterns EXCLUDE_PATTERNS
                        comma separated list of patterns to exclude (no spaces)
  --exclude-files EXCLUDE_FILES
                        comma separated list of files and patterns to exclude (no spaces)
  --save SAVE           Path to a csv file to save results to.
  --retry-count RETRY_COUNT
                        retry count upon failure (defaults to 2, one retry).
  --timeout TIMEOUT     timeout (seconds) to provide to the requests library (defaults to 5)

You have a lot of flexibility to define patterns of urls or files to skip, along with the number of retries or timeout (seconds). The most basic usage will check an entire directory. Let's clone and check the urlchecker action:

$ git clone https://github.com/urlstechie/urlchecker-action.git
$ cd urlchecker-action

and run the simplest command to check the present working directory (.).

$ urlchecker check .
           original path: .
              final path: /tmp/urlchecker-action
               subfolder: None
                  branch: master
                 cleanup: False
              file types: ['.md', '.py']
                   files: []
               print all: True
           urls excluded: []
   url patterns excluded: []
  file patterns excluded: []
              force pass: False
             retry count: 2
                    save: None
                 timeout: 5

 /tmp/urlchecker-action/README.md 
 --------------------------------
https://github.com/urlstechie/urlchecker-action/blob/master/LICENSE
https://github.com/r-hub/docs/blob/bc1eac71206f7cb96ca00148dcf3b46c6d25ada4/.github/workflows/pr.yml
https://img.shields.io/static/v1?label=Marketplace&message=urlchecker-action&color=blue?style=flat&logo=github
https://github.com/rseng/awesome-rseng
https://github.com/rseng/awesome-rseng/blob/5f5cb78f8392cf10aec2f3952b305ae9611029c2/.github/workflows/urlchecker.yml
https://github.com/HPC-buildtest/buildtest-framework/actions?query=workflow%3A%22Check+URLs%22
https://www.codefactor.io/repository/github/urlstechie/urlchecker-action/badge
https://github.com/berlin-hack-and-tell/berlinhackandtell.rocks
https://github.com/urlstechie/urlchecker-action/issues
https://github.com/USRSE/usrse.github.io
https://github.com/berlin-hack-and-tell/berlinhackandtell.rocks/actions?query=workflow%3ACommands
https://github.com/USRSE/usrse.github.io/blob/abcbed5f5703e0d46edb9e8850eea8bb623e3c1c/.github/workflows/urlchecker.yml
https://github.com/urlstechie/urlchecker-action/releases
https://img.shields.io/badge/license-MIT-brightgreen
https://github.com/r-hub/docs/actions?query=workflow%3ACommands
https://github.com/rseng/awesome-rseng/actions?query=workflow%3AURLChecker
https://github.com/buildtesters/buildtest
https://github.com/r-hub/docs
https://www.codefactor.io/repository/github/urlstechie/urlchecker-action
https://github.com/urlstechie/URLs-checker-test-repo
https://github.com/marketplace/actions/urlchecker-action
https://github.com/actions/checkout
https://github.com/SuperKogito/URLs-checker/issues/1,https://github.com/SuperKogito/URLs-checker/issues/2
https://github.com/SuperKogito/URLs-checker/issues/1,https://github.com/SuperKogito/URLs-checker/issues/2
https://github.com/USRSE/usrse.github.io/actions?query=workflow%3A%22Check+URLs%22
https://github.com/SuperKogito/Voice-based-gender-recognition/issues
https://github.com/buildtesters/buildtest/blob/v0.9.1/.github/workflows/urlchecker.yml
https://github.com/berlin-hack-and-tell/berlinhackandtell.rocks/blob/master/.github/workflows/urlchecker-pr-label.yml

 /tmp/urlchecker-action/examples/README.md 
 -----------------------------------------
https://github.com/urlstechie/urlchecker-action/releases
https://github.com/urlstechie/urlchecker-action/issues
https://help.github.com/en/actions/reference/events-that-trigger-workflows


Done. The following urls did not pass:
https://github.com/SuperKogito/URLs-checker/issues/1,https://github.com/SuperKogito/URLs-checker/issues/2

The url that didn't pass above is an example parameter for the library! Let's add a simple pattern to exclude it.

$ urlchecker check --exclude-pattern SuperKogito .
           original path: .
              final path: /tmp/urlchecker-action
               subfolder: None
                  branch: master
                 cleanup: False
              file types: ['.md', '.py']
                   files: []
               print all: True
           urls excluded: []
   url patterns excluded: ['SuperKogito']
  file patterns excluded: []
              force pass: False
             retry count: 2
                    save: None
                 timeout: 5

 /tmp/urlchecker-action/README.md 
 --------------------------------
https://github.com/urlstechie/urlchecker-action/blob/master/LICENSE
https://github.com/urlstechie/urlchecker-action/issues
https://github.com/rseng/awesome-rseng/actions?query=workflow%3AURLChecker
https://github.com/USRSE/usrse.github.io/actions?query=workflow%3A%22Check+URLs%22
https://github.com/actions/checkout
https://github.com/USRSE/usrse.github.io/blob/abcbed5f5703e0d46edb9e8850eea8bb623e3c1c/.github/workflows/urlchecker.yml
https://github.com/r-hub/docs/blob/bc1eac71206f7cb96ca00148dcf3b46c6d25ada4/.github/workflows/pr.yml
https://github.com/berlin-hack-and-tell/berlinhackandtell.rocks/blob/master/.github/workflows/urlchecker-pr-label.yml
https://github.com/rseng/awesome-rseng
https://www.codefactor.io/repository/github/urlstechie/urlchecker-action/badge
https://github.com/urlstechie/URLs-checker-test-repo
https://www.codefactor.io/repository/github/urlstechie/urlchecker-action
https://github.com/r-hub/docs
https://github.com/berlin-hack-and-tell/berlinhackandtell.rocks
https://github.com/buildtesters/buildtest
https://img.shields.io/badge/license-MIT-brightgreen
https://github.com/urlstechie/urlchecker-action/releases
https://github.com/marketplace/actions/urlchecker-action
https://img.shields.io/static/v1?label=Marketplace&message=urlchecker-action&color=blue?style=flat&logo=github
https://github.com/r-hub/docs/actions?query=workflow%3ACommands
https://github.com/HPC-buildtest/buildtest-framework/actions?query=workflow%3A%22Check+URLs%22
https://github.com/buildtesters/buildtest/blob/v0.9.1/.github/workflows/urlchecker.yml
https://github.com/berlin-hack-and-tell/berlinhackandtell.rocks/actions?query=workflow%3ACommands
https://github.com/USRSE/usrse.github.io
https://github.com/rseng/awesome-rseng/blob/5f5cb78f8392cf10aec2f3952b305ae9611029c2/.github/workflows/urlchecker.yml

 /tmp/urlchecker-action/examples/README.md 
 -----------------------------------------
https://help.github.com/en/actions/reference/events-that-trigger-workflows
https://github.com/urlstechie/urlchecker-action/issues
https://github.com/urlstechie/urlchecker-action/releases


Done. All URLS passed.

We can also filter by file type. For example, to check only certain file types, or hidden files, we might do any of the following:

# Check only html files
urlchecker check --file-types *.html .

# Check hidden files
urlchecker check --file-types ".*" .

# Check hidden files and html files
urlchecker check --file-types ".*,*.html" .

Note that while some patterns will work without quotes, it's recommended to quote most of them, because if the shell expands any part of the pattern it will not work as expected. By default, the urlchecker checks Python and Markdown files. If a multiprocessing worker has an error, you can also add --serial to run the checks in serial. The run will be slower, but it's useful for debugging.

$ urlchecker check . --files "content/docs/hacking/contributing/documentation/index.md" --serial

Check GitHub Repository

But wouldn't it be easier to not have to clone the repository first? Of course! We can specify a GitHub url instead, and add --cleanup if we want to clean up the folder after.

$ urlchecker check https://github.com/SuperKogito/SuperKogito.github.io.git

If you specify any arguments for an exclusion list (or any other kind of list), make sure that you provide a comma separated list without any spaces:

$ urlchecker check --exclude-files=README.md,_config.yml

Save Results

If you want to save your results to file, perhaps for some kind of record or other data analysis, you can provide the --save argument:

$ urlchecker check --save results.csv .

The file that you save to will include a comma separated tabular listing of the urls and their results. The result options are "passed" and "failed" and the default header is URL,RESULT. All of these defaults are exposed if you want to change them (e.g., using a tab separator or a different header) when you call the function from within Python. Here is an example of the default file produced, which should satisfy most use cases:

URL,RESULT
https://github.com/SuperKogito,passed
https://www.google.com/,passed
https://github.com/SuperKogito/Voice-based-gender-recognition/issues,passed
https://github.com/SuperKogito/Voice-based-gender-recognition,passed
https://github.com/SuperKogito/spafe/issues/4,passed
https://github.com/SuperKogito/Voice-based-gender-recognition/issues/2,passed
https://github.com/SuperKogito/spafe/issues/5,passed
https://github.com/SuperKogito/URLs-checker/blob/master/README.md,passed
https://img.shields.io/,passed
https://github.com/SuperKogito/spafe/,passed
https://github.com/SuperKogito/spafe/issues/3,passed
https://www.google.com/,passed
https://github.com/SuperKogito,passed
https://github.com/SuperKogito/spafe/issues/8,passed
https://github.com/SuperKogito/spafe/issues/7,passed
https://github.com/SuperKogito/Voice-based-gender-recognition/issues/1,passed
https://github.com/SuperKogito/spafe/issues,passed
https://github.com/SuperKogito/URLs-checker/issues,passed
https://github.com/SuperKogito/spafe/issues/2,passed
https://github.com/SuperKogito/URLs-checker,passed
https://github.com/SuperKogito/spafe/issues/6,passed
https://github.com/SuperKogito/spafe/issues/1,passed
https://github.com/SuperKogito/URLs-checker/README.md,failed
https://github.com/SuperKogito/URLs-checker/issues/3,failed
https://none.html,failed
https://github.com/SuperKogito/URLs-checker/issues/2,failed
https://github.com/SuperKogito/URLs-checker/README.md,failed
https://github.com/SuperKogito/URLs-checker/issues/1,failed
https://github.com/SuperKogito/URLs-checker/issues/4,failed
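From Python, those defaults can be changed when saving. A minimal sketch, assuming a save_results method that accepts a separator and header (check the module documentation at urlchecker-python.readthedocs.io for the exact name and signature):

from urlchecker.core.check import UrlChecker
import os

checker = UrlChecker(os.getcwd())
checker.run()

# Assumption: the checker exposes a save_results method that accepts a custom
# separator and header; verify the exact signature in the module docs.
checker.save_results("results.tsv", sep="\t", header=["URL", "RESULT"])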

Usage from Python

Checking a Path

If you want to run checks outside of the provided client, this is fairly easy to do! Let's say we have a path (our present working directory) and we want to check .py and .md files (the default):

from urlchecker.core.check import UrlChecker
import os

path = os.getcwd()
checker = UrlChecker(path)    
# UrlChecker:/home/vanessa/Desktop/Code/urlstechie/urlchecker-python

And of course you can provide more substantial arguments to derive the original file list:

checker = UrlChecker(
    path=path,
    file_types=[".md", ".py", ".rst"],
    include_patterns=[],
    exclude_files=["README.md", "LICENSE"],
    print_all=True,
)

I can then run the checker like this:

checker.run()

Or with more customization of excluded urls:

checker.run(
    exclude_urls=exclude_urls,
    exclude_patterns=exclude_patterns,
    retry_count=3,
    timeout=5,
)

You'll get the results object returned, which is also available at checker.results, a simple dictionary with "passed" and "failed" keys to show passes and fails across all files.

{'passed': ['https://github.com/SuperKogito/spafe/issues/4',
  'http://shachi.org/resources',
  'https://superkogito.github.io/blog/SpectralLeakageWindowing.html',
  'https://superkogito.github.io/figures/fig4.html',
  'https://github.com/urlstechie/urlchecker-test-repo',
  'https://www.google.com/',
  ...
  'https://github.com/SuperKogito',
  'https://img.shields.io/',
  'https://www.google.com/',
  'https://docs.python.org/2'],
 'failed': ['https://github.com/urlstechie/urlschecker-python/tree/master',
  'https://github.com/SuperKogito/Voice-based-gender-recognition,passed',
  'https://github.com/SuperKogito/URLs-checker/README.md',
   ...
  'https://superkogito.github.io/tables',
  'https://github.com/SuperKogito/URLs-checker/issues/2',
  'https://github.com/SuperKogito/URLs-checker/README.md',
  'https://github.com/SuperKogito/URLs-checker/issues/4',
  'https://github.com/SuperKogito/URLs-checker/issues/3',
  'https://github.com/SuperKogito/URLs-checker/issues/1',
  'https://none.html']}
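Because the results are a plain dictionary, a script (or CI job) can act on failures directly; for example, a minimal sketch that exits non-zero when anything failed:

import sys

# checker.results is the {"passed": [...], "failed": [...]} dictionary shown above.
if checker.results["failed"]:
    print("Broken URLs found:")
    for url in checker.results["failed"]:
        print("  %s" % url)
    sys.exit(1)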

You can look at checker.checks, which is a dictionary of result objects, organized by the filename:

for file_name, result in checker.checks.items(): 
    print() 
    print(result) 
    print("Total Results: %s " % result.count) 
    print("Total Failed: %s" % len(result.failed)) 
    print("Total Passed: %s" % len(result.passed)) 

...

UrlCheck:/home/vanessa/Desktop/Code/urlstechie/urlchecker-python/tests/test_files/sample_test_file.md
Total Results: 26 
Total Failed: 6
Total Passed: 20

UrlCheck:/home/vanessa/Desktop/Code/urlstechie/urlchecker-python/.pytest_cache/README.md
Total Results: 1 
Total Failed: 0
Total Passed: 1

UrlCheck:/home/vanessa/Desktop/Code/urlstechie/urlchecker-python/.eggs/pytest_runner-5.2-py3.7.egg/ptr.py
Total Results: 0 
Total Failed: 0
Total Passed: 0

UrlCheck:/home/vanessa/Desktop/Code/urlstechie/urlchecker-python/docs/source/conf.py
Total Results: 3 
Total Failed: 0
Total Passed: 3

For any result object, you can print the list of passed, failed, excluded, or all urls.

result.all                                                                                                                                                                       
['https://www.sphinx-doc.org/en/master/usage/configuration.html',
 'https://docs.python.org/3',
 'https://docs.python.org/2']

result.failed                                                                                                                                                                    
[]

result.exclude
[]

result.passed                                                                                                                                                                    
['https://www.sphinx-doc.org/en/master/usage/configuration.html',
 'https://docs.python.org/3',
 'https://docs.python.org/2']

result.count
3

Checking a List of URLs

If you start with a list of urls you want to check, you can do that too!

from urlchecker.core.urlproc import UrlCheckResult

urls = ['https://www.github.com', "https://github.com", "https://banana-pudding-doesnt-exist.com"]

# Instantiate an empty checker to extract urls
checker = UrlCheckResult()
File name None is undefined or does not exist, skipping extraction.

If you provide a file name, the urls will be extracted for you.

checker = UrlCheckResult(
    file_name=file_name,
    exclude_patterns=exclude_patterns,
    exclude_urls=exclude_urls,
    print_all=self.print_all,
)

or you can provide all the parameters without the filename:

checker = UrlCheckResult(
    exclude_patterns=exclude_patterns,
    exclude_urls=exclude_urls,
    print_all=self.print_all,
)

If you don't provide the file_name to check urls, you can give the urls you defined previously directly to the check_urls function:

checker.check_urls(urls)

https://www.github.com
https://github.com
HTTPSConnectionPool(host='banana-pudding-doesnt-exist.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f989abdfa10>: Failed to establish a new connection: [Errno -2] Name or service not known'))
https://banana-pudding-doesnt-exist.com

And of course you can specify a timeout and retry:

checker.check_urls(urls, retry_count=retry_count, timeout=timeout)

After you run the checker you can get all the urls, the passed, and failed sets:

checker.failed                                                                                                                                                                   
['https://banana-pudding-doesnt-exist.com']

checker.passed                                                                                                                                                                   
['https://www.github.com', 'https://github.com']

checker.all                                                                                                                                                                      
['https://www.github.com',
 'https://github.com',
 'https://banana-pudding-doesnt-exist.com']

checker.count                                                                                                                                                                    
3

If you have any questions, please don't hesitate to open an issue.

Docker

A Docker container is provided if you want to build a base container with urlchecker, meaning that you don't need to install it on your host. You can build the container as follows:

docker build -t urlchecker .

And then the entrypoint will expose the urlchecker.

docker run -it urlchecker

Development

Organization

The module is organized as follows:

├── client              # command line client
├── main                # functions for supported integrations (e.g., GitHub)
├── core                # core file and url processing tools
└── version.py          # package and versioning

In the "client" folder, for example, the commands that are exposed for the client (e.g., check) would named accordingly, e.g., client/check.py. Functions for Github are be provided in main/github.py. This organization should be fairly straight forward to always find what you are looking for.

Drivers

To test more difficult urls, we use a web driver (used with selenium). The driver is optional, but will come by default with our action. To install it, download a driver for your browser and ensure you install the selenium extra:

$ pip install urlchecker[selenium]

and either:

  1. Add it directly to your path
  2. Export the directory where it lives as URLCHECKER_DRIVERS_PATH
  3. Put it in the root of the urlchecker clone (it will be looked for here)

Support

If you need help, or want to suggest a project for the organization, please open an issue.

urlchecker-python's People

Contributors

ax3l, mrmundt, superkogito, vsoch


urlchecker-python's Issues

Travis runs: enable for only PR?

As discussed in #2 , Travis currently runs twice (for PR and branch pushes) and it might make sense to only run for PR, since pushing to master won't be a thing (but always done via a pull request). A note from @SuperKogito :

Well you make a good point about travis running twice. Technically, travis is running once on the new branch and once on the PR. I think we should drop the branch run and settle for one run for the PR? if you think it is slow, we can use pytest-xdist (like I did initially) but the logs won't be as detailed as now :P

Bug: urls that end with }

We currently have a bug where urls represented in yaml / json (or another data structure) that end in } are not correctly parsed. Here is an example:

[screenshot from the issue]

@SuperKogito can you take a look soon?

option to only print red URLs?

Could there be an option to only print the URLs for which there was a failure? This way it'd be easier to go through them for a website that has many URLs?
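In the meantime, a workaround from the Python side, using the results dictionary and print_all option documented earlier in this README (a sketch, not a new client option):

from urlchecker.core.check import UrlChecker

# print_all is the constructor option shown earlier in this README; the failed
# set is available on checker.results after a run.
checker = UrlChecker(".", print_all=False)
checker.run()

for url in checker.results["failed"]:
    print(url)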

Improve urls regex

@vsoch Once again, here is another attempt at improving our long and forgiving regex 😄

A little background, the current regex is something I found online and after testing it along with other links I deemed it to be good enough. However, I was never comfortable with how long it was.

Complexity, simplicity and regex visualizations

Here is a simplified graph (domain extensions are replaced with ... except .com and .org) of what we have at the moment:
[regex visualization]

So after hacking and tweaking for a couple of days, I think I came up with an improved regex that is shorter, which means faster and simpler. Here is how it looks:
[regex visualization]

Comparing efficiency and speed

Here is a small idea on how it performs: https://regex101.com/r/zvnFp6/1
Unfortunately I couldn't run the same thing for our current regex because it is too long. However, I did run the following comparison locally:

import re 
import time 


domain_extensions = "".join(
    (
        "com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|",
        "jobs|mobi|museum|name|post|pro|tel|travel|xxx|",
        "ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|",
        "ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|",
        "ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|",
        "dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|",
        "fi|fj|fk|fm|fo|fr|",
        "ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|",
        "hk|hm|hn|hr|ht|hu|",
        "id|ie|il|im|in|io|iq|ir|is|it|",
        "je|jm|jo|jp|ke|kg|kh|ki|",
        "km|kn|kp|kr|kw|ky|kz|",
        "la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|",
        "ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|",
        "na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|",
        "om|",
        "pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|",
        "qa|",
        "re|ro|rs|ru|rw|",
        "sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|",
        "tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|",
        "ua|ug|uk|us|uy|uz|",
        "va|vc|ve|vg|vi|vn|vu|",
        "wf|ws|",
        "ye|yt|yu|",
        "za|zm|zw",
    )
)
    
URL_REGEX1 = "".join(
    (
        "(?i)\\b(",
        "(?:",
        "https?:(?:htt/{1,3}|[a-z0-9%]",
        ")",
        "|[a-z0-9.\\-]+[.](?:%s)/)" % domain_extensions,
        "(?:",
        "[^\\s()<>\[\\]]+|\\([^\\s()]*?\\([^\\s()]+\\)[^\\s()]*?\\)",
        "|\\([^\\s]+?\\)",
        ")",
        "+",
        "(?:",
        "\\([^\\s()]*?\\([^\\s()]+\\)[^\\s()]*?\\)",
        "|\\([^\\s]+?\\)",
        "|[^\\s`!()\\[\\];:'\".,<>?ยซยปโ€œโ€โ€˜โ€™]",
        ")",
        "|",
        "(?:",
        "(?<!@)[a-z0-9]",
        "+(?:[.\\-][a-z0-9]+)*[.]",
        "(?:%s)\\b/?(?!@)" % domain_extensions,
        "))",
    )
)
CURRENT_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?ยซยปโ€œโ€โ€˜โ€™])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""
NEW_REGEX = "(http[s]?:\/\/)(www\.)?([a-zA-Z0-9$-_@&+!*\(\),\/\.]+[\.])([a-zA-Z]+)([\/\.\-\_\=?#a-zA-Z0-9@&_=:%+~\(\)]+)"

# read file content
file_path = "links.txt"
with open(file_path, "r") as file:
    content = file.read()

links  = [ l for l in content.split("\n") if "http" in l ]


# 1st regex
t01 = time.time()
for i in range(1000):
    urls0 = re.findall(URL_REGEX1, content)
t02 = time.time()
print("DT0  =", t02-t01)
print("LEN0 = ", len(urls0))


# final regex
t11 = time.time()
for i in range(1000):
    urls1 = re.findall(CURRENT_REGEX, content)
t12 = time.time()
print("DT1  =", t12-t11)
print("LEN1 = ", len(urls1))


# 2nd regex
t21   = time.time()
for i in range(1000):
    urls2 = ["".join(x) for x in re.findall(NEW_REGEX, content)]
t22   = time.time()
print("DT2  =", t22-t21)
print("LEN2 = ", len(urls2))

links.txt is a file with 755 urls, each on a separate line. These urls were collected from the logs of buildtest and us-rse. The results of the previous comparison are the following:

DT0  = 2.3765275478363037
LEN0 =  748

DT1  = 0.7541322708129883
LEN1 =  755

DT2  = 0.6342747211456299
LEN2 =  755

As you can see, the long beautifully formatted regex takes a lot of time and is worse than the others. The newest regex is the fastest, and it returns urls that are guaranteed to have http or https in them.

So what's next?

I suggest you take a look at all this, and maybe test the regex with different urls and different ideas to check its robustness; if your results are positive too, then I can submit a PR 😉 This blog post: In search of the perfect URL validation regex is a good inspiration. I think we rank somewhere around third according to their test.

Add a tests summary/ tests run stats

The link provided by maelle in urlstechie/urlchecker-action#61 gave me an idea, so I am just laying this out here to get your input on it. So far we only provide the results of the url checks, but wouldn't it also be nice to provide a summary of the tests and some stats, something like:

Urls checks summary
-----------------------------
Total count of tested files     : 10
Total count of tested urls      : 51
Total count of working urls     : 48  
Total count of failed urls      : 3

etc.
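Until something like this is built into the client, a rough sketch of the same summary can already be assembled from the Python API documented in the README above (checker.checks, result.count, result.passed, result.failed), run after checker.run():

files = checker.checks  # {file_name: UrlCheckResult}, as documented in the README above
total_urls = sum(result.count for result in files.values())
total_passed = sum(len(result.passed) for result in files.values())
total_failed = sum(len(result.failed) for result in files.values())

print("Urls checks summary")
print("-----------------------------")
print("Total count of tested files     : %d" % len(files))
print("Total count of tested urls      : %d" % total_urls)
print("Total count of working urls     : %d" % total_passed)
print("Total count of failed urls      : %d" % total_failed)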

Urlchecker doesn't clean previous configuration. Got `file types: ['.md', '.py']` without adding such pattern

Preconditions:

  • Python 3.9.13
  • urlchecker version: 0.0.31

Steps to reproduce:

  1. Clone the repo: https://gitlab.com/cki-project/documentation
  2. Navigate to the repo: cd documentation
  3. Execute: urlchecker check . --timeout 60 --retry-count 5 --files "content/docs/hacking/contributing/documentation/index.md"

Actual result:

โฏ urlchecker check .      --timeout 60     --retry-count 5  --files "content/docs/hacking/contributing/documentation/index.md"
           original path: .
              final path: /home/sturivny/git-cki/documentation
               subfolder: None
                  branch: main
                 cleanup: False
              file types: ['.md', '.py']
                   files: ['content/docs/hacking/contributing/documentation/index.md']
               print all: True
                 verbose: False
           urls excluded: []
   url patterns excluded: []
  file patterns excluded: []
              force pass: False
             retry count: 5
                    save: None
                 timeout: 60

Expected result:
No Errors

ERROR: Packages installed from PyPI cannot depend on packages which are not also hosted on PyPI

So far, pip install urlchecker was everything needed to install the package. With version 0.0.33, this fails with:

(env) ~$ pip install urlchecker
Collecting urlchecker
  Downloading urlchecker-0.0.33.tar.gz (30 kB)
  Preparing metadata (setup.py) ... done
ERROR: Packages installed from PyPI cannot depend on packages which are not also hosted on PyPI.
urlchecker depends on fake-useragent@ git+https://github.com/danger89/fake-useragent@master#egg=fake-useragent

If I follow the installation instructions in your README file, I get this:

(urlchecker) ~$ pip install git+https://github.com/danger89/fake-useragent.git
Collecting git+https://github.com/danger89/fake-useragent.git
  Cloning https://github.com/danger89/fake-useragent.git to /tmp/pip-req-build-qvfvjz9m
  Running command git clone --filter=blob:none --quiet https://github.com/danger89/fake-useragent.git /tmp/pip-req-build-qvfvjz9m
  Resolved https://github.com/danger89/fake-useragent.git to commit a8f2b57910cdb496dcebd4a828f6e33984b2124f
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: fake-useragent
  Building wheel for fake-useragent (setup.py) ... done
  Created wheel for fake-useragent: filename=fake_useragent-0.1.12-py3-none-any.whl size=13815 sha256=b897ee62f3d8cd7f726d523aa4e2b5d99a18fdaba36804312bd778f797651233
  Stored in directory: /tmp/pip-ephem-wheel-cache-asneomg1/wheels/4a/80/0e/42b3e4541f3f2e603f91cf7c5c83b5342621e45a46d45b98ed
Successfully built fake-useragent
Installing collected packages: fake-useragent
Successfully installed fake-useragent-0.1.12

(urlchecker) ~$ pip install urlchecker
Collecting urlchecker
  Using cached urlchecker-0.0.33.tar.gz (30 kB)
  Preparing metadata (setup.py) ... done
ERROR: Packages installed from PyPI cannot depend on packages which are not also hosted on PyPI.
urlchecker depends on fake-useragent@ git+https://github.com/danger89/fake-useragent@master#egg=fake-useragent

Consider dummy repository for testing

@SuperKogito we are currently using your personal site for most tests, and there are quite a few files! What do you think about creating a dummy (smaller version of it) here, and then using that for consistent testing?

Regression in 0.0.31?

Hello, I've observed what seems to be a regression in the latest release.

With version 0.0.30, the URLs in the test file I have there are correctly flagged as problematic but 0.0.31 doesn't appear to work at all.

crd@raspberrypi:~ $ python3 -m venv uc
crd@raspberrypi:~ $ . uc/bin/activate
(uc) crd@raspberrypi:~ $ cd uc
(uc) crd@raspberrypi:~/uc $ mkdir foo
(uc) crd@raspberrypi:~/uc $ cat > foo/foo.md
 *   Principal Lead Software Engineer -- https://uwhires.admin.washington.edu/ENG/Candidates/default.cfm?szCategory=jobprofile&szOrderID=209997
 *   Senior Software Engineer -- https://uwhires.admin.washington.edu/ENG/Candidates/default.cfm?szCategory=jobprofile&szOrderID=209999
 *    Associate Software Engineer -- https://uwhires.admin.washington.edu/ENG/Candidates/default.cfm?szCategory=jobprofile&szOrderID=210005
(uc) crd@raspberrypi:~/uc $ pip install urlchecker==0.0.30
Looking in indexes: https://pypi.org/simple, https://www.piwheels.org/simple
Collecting urlchecker==0.0.30
  Using cached https://www.piwheels.org/simple/urlchecker/urlchecker-0.0.30-py3-none-any.whl (26 kB)
Collecting fake-useragent
  Using cached https://www.piwheels.org/simple/fake-useragent/fake_useragent-0.1.11-py3-none-any.whl (13 kB)
Collecting requests>=2.18.4
  Using cached https://www.piwheels.org/simple/requests/requests-2.28.1-py3-none-any.whl (62 kB)
Collecting certifi>=2017.4.17
  Using cached https://www.piwheels.org/simple/certifi/certifi-2022.6.15-py3-none-any.whl (160 kB)
Collecting idna<4,>=2.5
  Using cached https://www.piwheels.org/simple/idna/idna-3.3-py3-none-any.whl (64 kB)
Collecting charset-normalizer<3,>=2
  Using cached https://www.piwheels.org/simple/charset-normalizer/charset_normalizer-2.1.0-py3-none-any.whl (39 kB)
Collecting urllib3<1.27,>=1.21.1
  Using cached https://www.piwheels.org/simple/urllib3/urllib3-1.26.11-py2.py3-none-any.whl (139 kB)
Installing collected packages: urllib3, idna, charset-normalizer, certifi, requests, fake-useragent, urlchecker
Successfully installed certifi-2022.6.15 charset-normalizer-2.1.0 fake-useragent-0.1.11 idna-3.3 requests-2.28.1 urlchecker-0.0.30 urllib3-1.26.11
(uc) crd@raspberrypi:~/uc $ urlchecker check foo/
           original path: foo/
              final path: foo/
               subfolder: None
                  branch: main
                 cleanup: False
              file types: ['.md', '.py']
                   files: []
               print all: True
                 verbose: False
           urls excluded: []
   url patterns excluded: []
  file patterns excluded: []
              force pass: False
             retry count: 2
                    save: None
                 timeout: 5
HTTPSConnectionPool(host='uwhires.admin.washington.edu', port=443): Max retries exceeded with url: /ENG/Candidates/default.cfm?szCategory=jobprofile&szOrderID=210005 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))
https://uwhires.admin.washington.edu/ENG/Candidates/default.cfm?szCategory=jobprofile&szOrderID=210005
HTTPSConnectionPool(host='uwhires.admin.washington.edu', port=443): Max retries exceeded with url: /ENG/Candidates/default.cfm?szCategory=jobprofile&szOrderID=210005 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))
https://uwhires.admin.washington.edu/ENG/Candidates/default.cfm?szCategory=jobprofile&szOrderID=210005
HTTPSConnectionPool(host='uwhires.admin.washington.edu', port=443): Max retries exceeded with url: /ENG/Candidates/default.cfm?szCategory=jobprofile&szOrderID=209999 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))
https://uwhires.admin.washington.edu/ENG/Candidates/default.cfm?szCategory=jobprofile&szOrderID=209999
HTTPSConnectionPool(host='uwhires.admin.washington.edu', port=443): Max retries exceeded with url: /ENG/Candidates/default.cfm?szCategory=jobprofile&szOrderID=209999 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))
https://uwhires.admin.washington.edu/ENG/Candidates/default.cfm?szCategory=jobprofile&szOrderID=209999
HTTPSConnectionPool(host='uwhires.admin.washington.edu', port=443): Max retries exceeded with url: /ENG/Candidates/default.cfm?szCategory=jobprofile&szOrderID=209997 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))
https://uwhires.admin.washington.edu/ENG/Candidates/default.cfm?szCategory=jobprofile&szOrderID=209997
HTTPSConnectionPool(host='uwhires.admin.washington.edu', port=443): Max retries exceeded with url: /ENG/Candidates/default.cfm?szCategory=jobprofile&szOrderID=209997 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))
https://uwhires.admin.washington.edu/ENG/Candidates/default.cfm?szCategory=jobprofile&szOrderID=209997

🤔 Uh oh... The following urls did not pass:
❌️ https://uwhires.admin.washington.edu/ENG/Candidates/default.cfm?szCategory=jobprofile&szOrderID=210005
❌️ https://uwhires.admin.washington.edu/ENG/Candidates/default.cfm?szCategory=jobprofile&szOrderID=209999
❌️ https://uwhires.admin.washington.edu/ENG/Candidates/default.cfm?szCategory=jobprofile&szOrderID=209997
(uc) crd@raspberrypi:~/uc $ pip install --upgrade urlchecker==0.0.31
Looking in indexes: https://pypi.org/simple, https://www.piwheels.org/simple
Collecting urlchecker==0.0.31
  Using cached https://www.piwheels.org/simple/urlchecker/urlchecker-0.0.31-py3-none-any.whl (28 kB)
Requirement already satisfied: fake-useragent in ./lib/python3.9/site-packages (from urlchecker==0.0.31) (0.1.11)
Requirement already satisfied: requests>=2.18.4 in ./lib/python3.9/site-packages (from urlchecker==0.0.31) (2.28.1)
Requirement already satisfied: charset-normalizer<3,>=2 in ./lib/python3.9/site-packages (from requests>=2.18.4->urlchecker==0.0.31) (2.1.0)
Requirement already satisfied: idna<4,>=2.5 in ./lib/python3.9/site-packages (from requests>=2.18.4->urlchecker==0.0.31) (3.3)
Requirement already satisfied: certifi>=2017.4.17 in ./lib/python3.9/site-packages (from requests>=2.18.4->urlchecker==0.0.31) (2022.6.15)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in ./lib/python3.9/site-packages (from requests>=2.18.4->urlchecker==0.0.31) (1.26.11)
Installing collected packages: urlchecker
  Attempting uninstall: urlchecker
    Found existing installation: urlchecker 0.0.30
    Uninstalling urlchecker-0.0.30:
      Successfully uninstalled urlchecker-0.0.30
Successfully installed urlchecker-0.0.31
(uc) crd@raspberrypi:~/uc $ urlchecker check foo/
           original path: foo/
              final path: foo/
               subfolder: None
                  branch: main
                 cleanup: False
              file types: ['.md', '.py']
                   files: []
               print all: True
                 verbose: False
           urls excluded: []
   url patterns excluded: []
  file patterns excluded: []
              force pass: False
             retry count: 2
                    save: None
                 timeout: 5
2022-07-26 20:55:30,171 - urlchecker - ERROR - Error running task
🤔 There were no URLs to check.
(uc) crd@raspberrypi:~/uc $

Maybe I'm doing something wrong?

failing url check.

According to https://github.com/buildtesters/buildtest/runs/6903580431?check_suite_focus=true I am getting a failure on this url: https://www.hpcwire.com/2019/01/17/pfizer-hpc-engineer-aims-to-automate-software-stack-testing/

It would be nice if there was extra output to explain why these urls are breaking. I know the urlchecker does work for the most part and I can ignore this url from the list.

I don't want to use force_pass : true, though I do want the action to pass.

I have not tried verbose: true, but does it provide the detail needed for debugging?
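In the meantime, one way to see why a single url is failing is to request it directly with the requests library (the library the checker hands the timeout to, per the --timeout help text) and inspect the status code or exception; a quick sketch:

import requests

url = "https://www.hpcwire.com/2019/01/17/pfizer-hpc-engineer-aims-to-automate-software-stack-testing/"
try:
    # A browser-like User-Agent, since some sites reject default client agents.
    response = requests.get(url, timeout=5, headers={"User-Agent": "Mozilla/5.0"})
    print(response.status_code)
except requests.RequestException as error:
    print(error)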

Accelerate urls checking using concurrency (asyncio + aiohttp)

Urls are checked using a loop that tests the response of each request sequentially, which becomes slow for huge websites.

Alternatively, we can use concurrency to process requests and responses asynchronously and speed up the system.

I already integrated this concept in my local repo using the asyncio and aiohttp libraries and the results look promising. The speed difference is notable based on various blogs (Python and fast HTTP clients, HTTP in Python: aiohttp vs. Requests, Making 1 million requests with python-aiohttp) and so far my tests confirm that.

The new libraries are slightly different from requests and so the following is true:

  • Different requests response format.
  • Different exceptions from previous ones.
  • Some disorder in the printed urls list (asynchronicity).
  • Need to remove duplicate urls before checking instead of adding urls to a seen set.
  • Possibly different timeout value needed.

I managed to almost replicate the same features we have in the current version, but I will definitely need your feedback. Anyway, these differences bring me to my major question @vsoch: do you think we should add this feature as an option (--accelerated-run) or replace the current implementation with it?
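For illustration only, a minimal sketch of the kind of concurrent checking being proposed here (asyncio + aiohttp); this is not the project's implementation, and the response and exception semantics differ from requests as noted above:

import asyncio

import aiohttp


async def check_url(session, url, timeout=5):
    # Treat any non-4xx/5xx response as a pass; any exception (DNS failure,
    # SSL error, timeout) counts as a failure.
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=timeout)) as response:
            return url, response.status < 400
    except Exception:
        return url, False


async def check_urls(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(check_url(session, url) for url in urls))


results = asyncio.run(check_urls(["https://github.com", "https://www.google.com"]))
print(results)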

Add --report option for html output

And along with this, an example CI run on CircleCI, which would allow us to save an html report and then render it (in the browser) directly as an artifact. Related #38

Provide base container with urlchecker-python

heyo! So this will take a bit more time to develop, but I'm working on the action refactor and I think we can really streamline it if we release a build for the container here. For a quick solution I'm going to build the container and manually push to Docker Hub to use as a base, but I wanted to open this issue so we can discuss:

  • an organization for containers built here
  • an automated build on release to do it.

Minor: misspelling 'whitetlist'

This is a followup to urlchecker-action #72: The misspelling "whitetlist" (note extra T) is in the python script.

check.py line 66
README.md lines 115, 192

I was going to create a PR, but then I thought you might be considering changing "white(_)list(ed)" options to "exclude(d)" to match the change in urlchecker-action, in which case much more than these lines would be getting changed.

As I noted in the title, this is a minor fix, just wanted to follow up since I mentioned it in the original issue and figured out where it was from.

Syntax for 'file_types' including dotfiles

Let's say I wanted to check all .html files, but also ALL dotfiles -- that is, files that have no filename, only an extension (.editorconfig, say, or .pylintrc).

Why would I want to do that? Because sometimes I use comments in these files to source where I got something, and I want to check URLs there to know if/when they break.

I'm sure this is an edge case, but is there a syntax for file_types that would include all dotfiles (not just specific ones)? I tried a few different things but none of them worked.

Thanks!

Add a logo for the python package?

I saw that you included the urlstechie logo, but I think it is better to use something unique for the lib. After playing with some stuff today I came up with this, what do you think of it?
[proposed logo image]

it is a link logo made of two snakes with the colors of python xD

Clarification on exclude_patterns

Hi there,

I was playing around with your library, and had to dig into your code to see how exclude patterns work. I think I understand correctly that they're substrings (not glob/regex) that can appear in a URL to exclude it from consideration.

Might I suggest that their format and use is clarified in the docs?
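For reference, the behaviour described above boils down to something like the following sketch (an illustration of the substring semantics, not necessarily the exact implementation):

def is_excluded(url, exclude_patterns):
    # A url is skipped when any exclude pattern appears anywhere in it
    # as a plain substring (no glob or regex expansion).
    return any(pattern in url for pattern in exclude_patterns)

print(is_excluded("https://github.com/SuperKogito/URLs-checker/issues/1", ["SuperKogito"]))  # True
print(is_excluded("https://github.com/urlstechie/urlchecker-python", ["SuperKogito"]))       # False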

Add selenium for (still) failing requests

According to urlstechie/urlchecker-action#93, it seems that the checker is not able to check doi and sciencedirect links. A quick inspection shows that a simple 'Accept': 'application/json' in the headers should fix this. You can amend this to #72, I think. Unfortunately I cannot test or help much here these days.

This fails:

import requests

agent = (
            "Mozilla/5.0 (X11; Linux x86_64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/63.0.3239.108 "
            "Safari/537.36"
        )
timeout = 5
url_doi = 'https://doi.org/10.1063/5.0023771'
url_sci = 'https://www.sciencedirect.com/science/article/pii/S0013468608005045'

r = requests.get(url_sci, 
                 headers={"User-Agent": agent})

print(f"Status Code: {r.status_code}")

This works:

import requests

agent = (
            "Mozilla/5.0 (X11; Linux x86_64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/63.0.3239.108 "
            "Safari/537.36"
        )
timeout = 5
url_doi = 'https://doi.org/10.1063/5.0023771'
url_sci = 'https://www.sciencedirect.com/science/article/pii/S0013468608005045'

r = requests.get(url_sci, 
                 headers={"User-Agent": agent, 'Accept': 'application/json'})

print(f"Status Code: {r.status_code}")

you can test the code online here

OSError: [Errno 9] Bad file descriptor

Preconditions:

  • Python 3.9.13
  • urlchecker version: 0.0.31

Steps to reproduce:

  1. Clone the repo: https://gitlab.com/cki-project/documentation
  2. Navigate to the repo: cd documentation
  3. Execute: urlchecker check . --timeout 60 --retry-count 5 --files "content/docs/hacking/contributing/documentation/index.md"

Actual result:

โฏ urlchecker check .      --timeout 60     --retry-count 5  --files "content/docs/hacking/contributing/documentation/index.md"
           original path: .
              final path: /home/sturivny/git-cki/documentation
               subfolder: None
                  branch: main
                 cleanup: False
              file types: ['.md', '.py']
                   files: ['content/docs/hacking/contributing/documentation/index.md']
               print all: True
                 verbose: False
           urls excluded: []
   url patterns excluded: []
  file patterns excluded: []
              force pass: False
             retry count: 5
                    save: None
                 timeout: 60
https://gitlab.com/cki-project/documentation/
https://gohugo.io/
https://www.docsy.dev/
https://gohugo.io/content-management/page-bundles/
https://cki-project.org/
https://www.gnu.org/software/stow/
2022-08-04 12:42:32,421 - urlchecker - ERROR - Error running task
ERROR:urlchecker:Error running task
🤔 There were no URLs to check.
Exception ignored in: <function Pool.__del__ at 0x7fd954e7eee0>
Traceback (most recent call last):
  File "/usr/lib64/python3.9/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 377, in put
    self._writer.send_bytes(obj)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 205, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
    self._send(header + buf)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

Expected result:
No Errors

tests should use tmp_path

Right now (it looks like) there is a clone being done to the PWD where the repo is / tests are run, I will fix this up in my current (WIP) PR #31 probably tomorrow.
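For reference, a minimal sketch of the tmp_path pattern using the API shown in this README (the real tests will differ and need network access):

from urlchecker.core.check import UrlChecker

def test_check_in_tmp_path(tmp_path):
    # tmp_path is a pytest fixture providing a unique pathlib.Path per test,
    # so nothing is cloned or written into the repository checkout itself.
    sample = tmp_path / "sample.md"
    sample.write_text("A link: https://github.com\n")
    checker = UrlChecker(str(tmp_path))
    checker.run()
    assert "https://github.com" in checker.results["passed"]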

Got "🤔 There were no URLs to check" message but several URLs were checked

Preconditions:

  • Python 3.9.13
  • urlchecker version: 0.0.31

Steps to reproduce:

  1. Clone the repo: https://gitlab.com/cki-project/documentation
  2. Navigate to the repo: cd documentation
  3. Execute: urlchecker check . --timeout 60 --retry-count 5 --files "content/docs/hacking/contributing/documentation/index.md"

Actual result:

โฏ urlchecker check .      --timeout 60     --retry-count 5  --files "content/docs/hacking/contributing/documentation/index.md"
           original path: .
              final path: /home/sturivny/git-cki/documentation
               subfolder: None
                  branch: main
                 cleanup: False
              file types: ['.md', '.py']
                   files: ['content/docs/hacking/contributing/documentation/index.md']
               print all: True
                 verbose: False
           urls excluded: []
   url patterns excluded: []
  file patterns excluded: []
              force pass: False
             retry count: 5
                    save: None
                 timeout: 60
https://gitlab.com/cki-project/documentation/
https://gohugo.io/
https://www.docsy.dev/
https://gohugo.io/content-management/page-bundles/
https://cki-project.org/
https://www.gnu.org/software/stow/
2022-08-04 12:42:32,421 - urlchecker - ERROR - Error running task
ERROR:urlchecker:Error running task
🤔 There were no URLs to check.
Exception ignored in: <function Pool.__del__ at 0x7fd954e7eee0>
Traceback (most recent call last):
  File "/usr/lib64/python3.9/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 377, in put
    self._writer.send_bytes(obj)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 205, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
    self._send(header + buf)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

Expected result:
No Errors

Support for internal links

hey @SuperKogito - it occurred to me today that we don't support checking internal links, meaning that if we render a jekyll site, we aren't able to see if something that starts with a / in an img src or link href works (internally). I think this is something that would be needed, and also would mimic the functionality of html-proofer. What do you think? How would it work?
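One possible direction, just to make the idea concrete (a sketch against a hypothetical rendered site directory, not an agreed design): a root-relative href could be resolved against the build output and checked for existence.

import os

def check_internal_link(href, site_root):
    # Hypothetical helper: resolve a root-relative link (e.g. "/assets/logo.png")
    # against the rendered site directory and report whether the target exists,
    # also accepting directory-style links that resolve to an index.html.
    target = os.path.join(site_root, href.lstrip("/"))
    return os.path.exists(target) or os.path.exists(os.path.join(target, "index.html"))

print(check_internal_link("/assets/logo.png", "_site"))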

Promoting the project and gaining attention

The article @vsoch wrote on her blog, along with her tweet, should help spread the word about the urlstechie tools. This issue is a place to brainstorm and collect suggestions and ideas on how to gain more attention for the project. We should also discuss pacing and maybe construct a timeline for the actions we will take, because too much posting can come across as spam, and then the actions we take will backfire :/

Here is my list of things I intend to do in the upcoming days that I hope will help us get more attention:

  • Post about the tools in various programming groups on Facebook. (This usually gets me some views).
  • Post about it on Reddit (I am new to this, it seems okay but nothing major).
  • Write a post about it on my Blog / or maybe better on Medium (never done it before but might help).
  • Post about it on Hacker rank and other forums.
  • Post about urlstechie on my LinkedIn (never tried this before but should help).

I partially tried the previous steps with other projects I have. In all fairness, they did help boost some projects a little, but nothing major, with the exception of submitting a paper to JOSS; that seemed to be the most successful action I took. However, for our tools, I don't think a research paper is an option.

As you can see, there is little experience here with promoting tools, so any suggestions are welcome. The tool is very useful thanks to the additions of @vsoch and it should be able to prove itself; that imo is the best promotion eventually, because a good tool is reliable. So the aim here is not to do advertisement but rather to give the tool a little push so we can take this project to the next level ;)

Don't include urls with {} or similar patterns

Sorry I'm breaking my promise to no longer comment for a while!

I've run urlchecker via the action on a website, https://github.com/r-hub/docs/runs/576617965?check_suite_focus=true

I get false positives for e.g. https://www.r-pkg.org/badges/version/, which is actually escaped (as code) in my content file: https://github.com/r-hub/docs/blob/master/content/badges/_index.md#cran-versions. I'm guessing it's because of the regex?

To extract links (in R) I use the commonmark package, which uses cmark (see my posts about URL cleaning and about commonmark). There seems to be a Python commonmark library. That said, a limitation of this approach is that it wouldn't recognize URLs in plain text (that e.g. Hugo would transform to links).
