psl's Introduction

publicsuffixlist

Public Suffix List parser implementation for Python 3.5+.

  • Compliant with TEST DATA
  • Supports IDN (unicode and punycoded).
  • Supports Python3.5+
  • Shipped with built-in PSL and an updater script.
  • Written in Pure Python with no library dependencies.

Install

publicsuffixlist can be installed via pip.

$ pip install publicsuffixlist

Usage

Basic Usage:

from publicsuffixlist import PublicSuffixList

psl = PublicSuffixList()
# Uses built-in PSL file

print(psl.publicsuffix("www.example.com"))   # "com"
# the longest public suffix part

print(psl.privatesuffix("www.example.com"))  # "example.com"
# the shortest domain assigned for a registrant

print(psl.privatesuffix("com")) # None
# Returns None if no private (non-public) part found

print(psl.publicsuffix("www.example.unknownnewtld")) # "unknownnewtld"
# New TLDs are valid public suffix by default

print(psl.publicsuffix("www.example.香港")) #"香港"
# Accepts unicode

print(psl.publicsuffix("www.example.xn--j6w193g")) # "xn--j6w193g"
# Accepts Punycode IDNs by default

print(psl.privatesuffix("WWW.EXAMPLE.COM")) # "example.com"
# Returns in lowercase by default

print(psl.privatesuffix("WWW.EXAMPLE.COM", keep_case=True)) # "EXAMPLE.COM"
# kwarg `keep_case=True` to disable the case conversion

The latest PSL is packaged once a day. If you need to parse your own version, it can be passed as a file-like iterable object, or just a str:

with open("latest_psl.dat", "rb") as f:
    psl = PublicSuffixList(f)

The unittest and PSL updater can be invoked as module.

$ python -m publicsuffixlist.test
$ python -m publicsuffixlist.update

Additional convenient methods:

print(psl.is_private("example.com"))  # True
print(psl.is_public("example.com"))   # False
print(psl.privateparts("aaa.www.example.com")) # ("aaa", "www", "example.com")
print(psl.subdomain("aaa.www.example.com", depth=1)) # "www.example.com"

Limitation

Domain Label Validation

publicsuffixlist does NOT validate domain names or labels. In the DNS protocol, most 8-bit characters are acceptable in the labels of a domain name. While ICANN-compliant registries do not accept domain names containing underscores (_), hostnames may include them. For example, DMARC records can contain underscores. Users must confirm that the input domain names are valid for their specific context.

Punycode Handling

Partially encoded (Unicode-mixed) Punycode is not supported, because Punycode encoding/decoding is very slow and the encoding results are unpredictable. If you are unsure whether an input is valid Punycode, you should use: unknowndomain.encode("idna").decode("ascii"). This conversion to IDNA is idempotent, so it is safe to apply to input that is already encoded.
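A minimal sketch of that normalization step, using only Python's built-in idna codec. The helper name to_punycode is hypothetical and is not part of publicsuffixlist; the codec raises UnicodeError for labels it cannot encode, which this sketch does not handle.

```python
def to_punycode(domain: str) -> str:
    """Return the all-ASCII (Punycode) form of a domain name."""
    return domain.encode("idna").decode("ascii")


once = to_punycode("www.example.香港")
twice = to_punycode(once)

print(once)           # www.example.xn--j6w193g
print(once == twice)  # True: encoding an already-encoded name changes nothing
```

The normalized string can then be passed to publicsuffix() or privatesuffix() as usual.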

Handling Arbitrary Binary

If you need to accept arbitrary or malicious binary data, it can be passed as a tuple of bytes. Note that the returned bytes may include byte patterns that cannot be decoded or represented as a standard domain name. Example:

psl.privatesuffix((b"a.a", b"a.example\xff", b"com"))  # (b"a.example\xff", b"com")

# Note that IDNs must be punycoded when passed as tuple of bytes.
psl = PublicSuffixList("例.example")
psl.publicsuffix((b"xn--fsq", b"example"))  # (b"xn--fsq", b"example")
# UTF-8 encoded bytes of "例" do not match.
psl.publicsuffix((b"\xe4\xbe\x8b", b"example"))  # (b"example",)

License

  • This module is licensed under Mozilla Public License 2.0.
  • The Public Suffix List maintained by the Mozilla Foundation is licensed under the Mozilla Public License 2.0.
  • The PSL testcase dataset is in the public domain (CC0).

Development / Packaging

This module and its packaging workflow are maintained in the author's repository located at https://github.com/ko-zu/psl.

A new package, which includes the latest PSL file, is automatically generated and uploaded to PyPI. The last part of the version number represents the release date. For example, 0.10.1.20230331 indicates a release date of March 31, 2023.
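As a small illustration of that version scheme, the release date can be recovered from the last component. This is a sketch only; the function name release_date is hypothetical and it assumes the version string ends in a .yyyymmdd part.

```python
from datetime import date, datetime


def release_date(version: str) -> date:
    """Parse the trailing YYYYMMDD component of a version string."""
    stamp = version.rsplit(".", 1)[-1]
    return datetime.strptime(stamp, "%Y%m%d").date()


print(release_date("0.10.1.20230331"))  # 2023-03-31
```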

This package dropped support for Python 2.7 and Python 3.4 or prior versions at the version 1.0.0 release in June 2024. The last version that works on Python 2.x is 0.10.0.x.

psl's People

Contributors

ko-zu, megabug, tomers

psl's Issues

Tag the source

Could you please tag the source? This allows distributions to get the complete source from GitHub if they want.

Thanks

Dashes cause a parsing error

Try this example, as I believe there is a bug with a dash. Note that all I did was change "compute-1" to "compute1" and then it works as expected.

>>> from publicsuffixlist import PublicSuffixList
>>> psl = PublicSuffixList()
>>> psl.privatesuffix('ec2-107-21-74-29.compute-1.amazonaws.com')
>>> psl.publicsuffix('ec2-107-21-74-29.compute-1.amazonaws.com')
'ec2-107-21-74-29.compute-1.amazonaws.com'
>>> psl.publicsuffix('ec2-107-21-74-29.compute1.amazonaws.com')
'com'
>>> psl.privatesuffix('ec2-107-21-74-29.compute1.amazonaws.com')
'amazonaws.com'

Notice: Dropping Support for Python 2.7 and 3.4

This is an important notice for users.

In the upcoming version 1.0.0, support for Python 2.7 and 3.4 will be discontinued. Version 0.10.x (or auto-released versions with the .yyyymmdd suffix) will be the last to support Python 2.7.

The minimum requirement for new versions will be Python 3.5 or later.

The new version will include type hinting to enhance API stability. The updated code is currently available in the devel branch.
https://github.com/ko-zu/psl/tree/devel

If you know of any users still relying on this module with Python 2.7, please comment here.

Fails on ccTLDs

Calling privatesuffix("something.com.mx") returns "com.mx".

UTF-8 encoded bytes should not match the list

In version 1.0.0, the tuple of bytes input matches the list if the bytes are valid UTF-8.

# custom psl rule to demo
psl = PublicSuffixList("例.example")
psl.publicsuffix("例.example")              # "例.example"
psl.publicsuffix("xn--fsq.example")         # "xn--fsq.example"
psl.publicsuffix((b"xn--fsq", b"example"))  # (b"xn--fsq", b"example")

# UTF-8 binary of "例" does match, but it should not.
psl.publicsuffix((b"\xe4\xbe\x8b", b"example"))  # (b"\xe4\xbe\x8b", b"example")

Expected behavior should be:

# b"\xe4\xbe\x8b" should not match b"xn--fsq".  Only its level 1 tld should match.
psl.publicsuffix((b"\xe4\xbe\x8b", b"example"))  # (b"example",)

The last case should not match in its entirety since the bytes object does not contain its encoding information. We should evaluate the binary input as-is, except for the ASCII case conversion defined in the evaluation rule.

This can be problematic if the encoding of arbitrary input cannot be enforced and/or the input must be decoded from bytes to str using punycode. Assuming UTF-8 is incorrect in this context.

In cases where evaluating binary as UTF-8 is required, the callers should encode the input to punycoded bytes tuples, as pspacesk commented in #29.
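A minimal sketch of such a caller-side conversion, assuming the input is a str domain whose labels are separated by plain dots. The helper name to_punycoded_labels is hypothetical and is not part of the library.

```python
from typing import Tuple


def to_punycoded_labels(domain: str) -> Tuple[bytes, ...]:
    """Split a str domain into labels and Punycode-encode each one."""
    # The built-in "idna" codec leaves ASCII labels unchanged and turns
    # non-ASCII labels into their xn-- form.
    return tuple(label.encode("idna") for label in domain.split("."))


print(to_punycoded_labels("例.example"))  # (b'xn--fsq', b'example')
```

The resulting tuple has the same shape as the bytes-tuple input shown in the examples above.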

`--help` fails on win

When running publicsuffixlist-download --help on Windows, I get the following error:

Error:

Traceback (most recent call last):
  File "C:\bld\publicsuffixlist_1675107463020\_test_env\Scripts\publicsuffixlist-download-script.py", line 9, in <module>
    sys.exit(updatePSL())
             ^^^^^^^^^^^
  File "C:\bld\publicsuffixlist_1675107463020\_test_env\Lib\site-packages\publicsuffixlist\update.py", line 41, in updatePSL
    os.rename(psl_file + ".swp", psl_file)
FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'C:\\bld\\publicsuffixlist_1675107463020\\_test_env\\Lib\\site-packages\\publicsuffixlist\\public_suffix_list.dat.swp' -> 'C:\\bld\\publicsuffixlist_1675107463020\\_test_env\\Lib\\site-packages\\publicsuffixlist\\public_suffix_list.dat'

Logs: https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=650026&view=logs&j=3ff94dba-189a-527c-65e3-ce8503824159&t=35acf2bd-66a8-5b9f-4368-b52d351bfcc2
Context: conda-forge/staged-recipes#21906
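The traceback ends in os.rename, which on Windows refuses to overwrite an existing destination. One possible remedy, sketched here under the assumption that the updater only needs to move the .swp file over the existing list, is os.replace (available since Python 3.3). The function name install_file is hypothetical, and this sketch does not address why --help triggers a download in the first place.

```python
import os
import tempfile


def install_file(tmp_path: str, final_path: str) -> None:
    """Move tmp_path over final_path, replacing any existing file."""
    # os.replace overwrites the destination on both POSIX and Windows,
    # whereas os.rename raises FileExistsError on Windows if it exists.
    os.replace(tmp_path, final_path)


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as workdir:
    final = os.path.join(workdir, "public_suffix_list.dat")
    swap = final + ".swp"
    with open(final, "w") as f:
        f.write("old")
    with open(swap, "w") as f:
        f.write("new")
    install_file(swap, final)
    with open(final) as f:
        print(f.read())  # new
```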

issue/inconsistent behavior for all *. rules

If there is a rule like:
*.abc.com
then, given substuf.def.abc.com, I would expect the public suffix to be def.abc.com.

from publicsuffixlist import PublicSuffixList

# RULES TESTED:
# *.awdev.ca
# *.advisor.ws
#
# *.compute.amazonaws.com
# *.compute-1.amazonaws.com
# *.compute.amazonaws.com.cn
#
# *.elb.amazonaws.com
# *.elb.amazonaws.com.cn

psl = PublicSuffixList()
input = [
    'test.awdev.ca',
    'test.advisor.ws',
    
    'test.compute.amazonaws.com',
    'test.compute-1.amazonaws.com',
    'test.compute.amazonaws.com.cn',
    
    'test.elb.amazonaws.com',
    'test.amazonaws.com.cn',
    
    # add another level and it gets weird
    'sub.test.awdev.ca',
    'sub.test.advisor.ws',
    
    'sub.test.compute.amazonaws.com',
    'sub.test.compute-1.amazonaws.com',
    'sub.test.compute.amazonaws.com.cn',

    'sub.test.elb.amazonaws.com',
    'sub.test.amazonaws.com.cn',
]


output = [(i, psl.privatesuffix(i)) for i in input]

for t in output:
    print(f'{t[0]} -> {t[1]}')

Output from the run:

test.awdev.ca -> None
test.advisor.ws -> None
test.compute.amazonaws.com -> None
test.compute-1.amazonaws.com -> None
test.compute.amazonaws.com.cn -> None
test.elb.amazonaws.com -> None
test.amazonaws.com.cn -> amazonaws.com.cn
sub.test.awdev.ca -> sub.test.awdev.ca
sub.test.advisor.ws -> sub.test.advisor.ws
sub.test.compute.amazonaws.com -> sub.test.compute.amazonaws.com
sub.test.compute-1.amazonaws.com -> sub.test.compute-1.amazonaws.com
sub.test.compute.amazonaws.com.cn -> sub.test.compute.amazonaws.com.cn
sub.test.elb.amazonaws.com -> sub.test.elb.amazonaws.com
sub.test.amazonaws.com.cn -> amazonaws.com.cn

I would have expected the first set to return the domains unchanged and the second set to return the domain minus the sub. part.

In either case, the behavior is inconsistent for two reasons:

  1. test.amazonaws.com.cn -> amazonaws.com.cn: the return value was not None like all the others.
  2. Why are all the domains with sub returned unchanged? Again, sub.test.amazonaws.com.cn -> amazonaws.com.cn behaves differently.
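For comparison, the matching rules published with the Public Suffix List treat a rule such as *.compute.amazonaws.com as making every name exactly one label below compute.amazonaws.com a public suffix. The following is a minimal sketch of that algorithm over a small hypothetical rule set; it omits exception rules (those starting with !) and is not the library's implementation.

```python
from typing import List, Optional

# Hypothetical rule set for illustration; the real PSL is far larger.
RULES = {"com", "cn", "com.cn", "*.compute.amazonaws.com"}


def public_suffix_length(labels: List[str]) -> int:
    """Return how many trailing labels form the public suffix."""
    best = 1  # implicit default rule "*": the last label is a public suffix
    for n in range(1, len(labels) + 1):
        tail = labels[-n:]
        exact = ".".join(tail)
        wildcard = ".".join(["*"] + tail[1:])
        if exact in RULES or wildcard in RULES:
            best = max(best, n)  # the rule with the most labels wins
    return best


def privatesuffix(domain: str) -> Optional[str]:
    """Return the public suffix plus one more label, or None."""
    labels = domain.lower().split(".")
    n = public_suffix_length(labels)
    if len(labels) <= n:
        return None  # the whole name is itself a public suffix
    return ".".join(labels[-(n + 1):])


for name in ("test.compute.amazonaws.com",
             "sub.test.compute.amazonaws.com",
             "test.amazonaws.com.cn"):
    print(name, "->", privatesuffix(name))
```

Under these assumptions the sketch prints None, sub.test.compute.amazonaws.com and amazonaws.com.cn, which is the same pattern as the output reported above: the wildcard names are public suffixes themselves, and amazonaws.com.cn differs only because no wildcard rule covers it.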

compute-1.amazonaws.com return none

Hi,
I ran a test on the hostname "ec2-100-24-188-149.compute-1.amazonaws.com", expecting it to return amazonaws.com.
Instead, I'm getting None.

    def test_amazonaws(self):
        self.assertEqual(self.psl.privatesuffix("ec2-100-24-188-149.compute-1.amazonaws.com"), "amazonaws.com")

'amazonaws.com' != None
Expected :None
Actual :'amazonaws.com'

If I remove the leading ec2-... label, I get the correct result:

    def test_amazonaws(self):
        self.assertEqual(self.psl.privatesuffix("compute-1.amazonaws.com"), "amazonaws.com")

PASSED [100%]
Process finished with exit code 0

In https://publicsuffix.org/list/public_suffix_list.dat I can see *.compute-1.amazonaws.com.
Shouldn't the first hostname match as well?

incorrect handling of weird domain names

Version affected

0.10.0.20240525

Summary

DNS encoding for "weird" names is not handled.

Reproducer

from publicsuffixlist import PublicSuffixList
psl = PublicSuffixList()
print(psl.privatesuffix("www.exa\\.mple.com"))
mple.com

The result "mple.com" is wrong because the \. sequence is not a label separator. It is an escaped dot that is part of the exa\.mple label. The correct return value should therefore be exa\.mple.com.

An underlying problem

Currently this library handles DNS names as strings. This does not match the DNS definition of names: a DNS name is defined as a sequence of labels, and individual labels can contain arbitrary binary data on the wire. "Unusual" bytes are then encoded with \ escape sequences when presented in text format.

Use-case

Processing real traffic from traffic captures. It has lots of weird names which require escaping and the current string-based processing leads to incorrect results for these weird names.

Proposed solution

Extend the current API to accept a tuple of labels instead of a string. In that case it is the caller's responsibility to do the right thing. If software is reading names from PCAP files, it is actually easier to pass the labels directly than to construct an escaped string from them and then have it decoded once again inside the publicsuffixlist library.

An alternative would be to implement full decoding of DNS names, but I think that is more work and would be slower for my use case.

References

\X              where X is any character other than a digit (0-9), is
                used to quote that character so that its special meaning
                does not apply.  For example, "\." can be used to place
                a dot character in a label.

\DDD            where each D is a digit is the octet corresponding to
                the decimal number described by DDD.  The resulting
                octet is assumed to be text and is not checked for
                special meaning.
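The two escape rules quoted above can be sketched as a small splitter that turns a text-format name into raw byte labels. This is an illustration only: the function name split_presentation_name is hypothetical, the input is assumed to be ASCII text, and no such helper is known to exist in the library.

```python
from typing import List, Tuple

DIGITS = "0123456789"


def split_presentation_name(name: str) -> Tuple[bytes, ...]:
    """Split an escaped, text-format DNS name into raw byte labels."""
    labels = []  # type: List[bytes]
    current = bytearray()
    i = 0
    while i < len(name):
        ch = name[i]
        if ch == "\\":
            if i + 1 >= len(name):
                raise ValueError("dangling backslash")
            if name[i + 1] in DIGITS:
                # \DDD: exactly three decimal digits naming one octet
                ddd = name[i + 1:i + 4]
                if len(ddd) != 3 or any(c not in DIGITS for c in ddd):
                    raise ValueError("malformed \\DDD escape")
                value = int(ddd)
                if value > 255:
                    raise ValueError("\\DDD escape out of range")
                current.append(value)
                i += 4
            else:
                # \X: the next character is taken literally
                current += name[i + 1].encode("ascii")
                i += 2
        elif ch == ".":
            # an unescaped dot ends the current label
            labels.append(bytes(current))
            current = bytearray()
            i += 1
        else:
            current += ch.encode("ascii")
            i += 1
    if current:
        labels.append(bytes(current))
    return tuple(labels)


print(split_presentation_name("www.exa\\.mple.com"))
# (b'www', b'exa.mple', b'com')
```

With labels in this form, the escaped dot stays inside its label, so a tuple-based API would not mistake it for a separator.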

This problem was encountered by other people and there was a proposal to integrate PSL matching into a DNS-aware library dnspython:
rthalley/dnspython#1082

I think it would be better if we could get this improved in publicsuffixlist itself. What do you think?

publicsuffix of cloudfront.net

cloudfront.net is a public suffix and belongs to Amazon.
However, before the suffix was registered, Amazon already owned the domain cloudfront under the .net TLD.
This makes it confusing to determine the root domain of *.cloudfront.net.

examples:

In [164]: ps.privatesuffix('d2os3n5ieuk9g5.cloudfront.net')
Out[164]: 'd2os3n5ieuk9g5.cloudfront.net'

In [165]: ps.privatesuffix('a286330aad4e096be6cdda229527774f4.profile.tlv50-c1.cloudfront.net')
Out[165]: 'tlv50-c1.cloudfront.net'

We know that every root domain has an NS record, so let's check:

dig d2os3n5ieuk9g5.cloudfront.net NS

; <<>> DiG 9.10.6 <<>> d2os3n5ieuk9g5.cloudfront.net NS
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1735
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 3

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4000
;; QUESTION SECTION:
;d2os3n5ieuk9g5.cloudfront.net.	IN	NS

;; ANSWER SECTION:
d2os3n5ieuk9g5.cloudfront.net. 830 IN	NS	ns-1961.awsdns-53.co.uk.
d2os3n5ieuk9g5.cloudfront.net. 830 IN	NS	ns-1525.awsdns-62.org.
d2os3n5ieuk9g5.cloudfront.net. 830 IN	NS	ns-765.awsdns-31.net.
d2os3n5ieuk9g5.cloudfront.net. 830 IN	NS	ns-224.awsdns-28.com.

;; ADDITIONAL SECTION:
ns-1961.awsdns-53.co.uk. 2488	IN	A	205.251.199.169
ns-1525.awsdns-62.org.	8341	IN	A	205.251.197.245

;; Query time: 36 msec
;; SERVER: 10.95.44.53#53(10.95.44.53)
;; WHEN: Wed Sep 09 13:12:46 CST 2020
;; MSG SIZE  rcvd: 227
dig tlv50-c1.cloudfront.net NS

; <<>> DiG 9.10.6 <<>> tlv50-c1.cloudfront.net NS
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 868
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4000
;; QUESTION SECTION:
;tlv50-c1.cloudfront.net.	IN	NS

;; AUTHORITY SECTION:
cloudfront.net.		59	IN	SOA	ns-418.awsdns-52.com. hostmaster.cloudfront.net. 1377556270 16384 2048 1048576 60

;; Query time: 1018 msec
;; SERVER: 10.95.44.53#53(10.95.44.53)
;; WHEN: Wed Sep 09 13:13:56 CST 2020
;; MSG SIZE  rcvd: 119
nslookup a286330aad4e096be6cdda229527774f4.profile.tlv50-c1.cloudfront.net 8.8.8.8
Server:		8.8.8.8
Address:	8.8.8.8#53

Non-authoritative answer:
Name:	a286330aad4e096be6cdda229527774f4.profile.tlv50-c1.cloudfront.net
Address: 13.226.6.197
Name:	a286330aad4e096be6cdda229527774f4.profile.tlv50-c1.cloudfront.net
Address: 13.226.6.231
Name:	a286330aad4e096be6cdda229527774f4.profile.tlv50-c1.cloudfront.net
Address: 13.226.6.22
Name:	a286330aad4e096be6cdda229527774f4.profile.tlv50-c1.cloudfront.net
Address: 13.226.6.45

tlv50-c1.cloudfront.net has no NS record, but a286330aad4e096be6cdda229527774f4.profile.tlv50-c1.cloudfront.net has A records,
so the root domain of a286330aad4e096be6cdda229527774f4.profile.tlv50-c1.cloudfront.net is cloudfront.net.

Wrong timestamp parsed from last-modified header

I reside in UTC+03. When I use the update.py script, the timestamp parsed from the Last-Modified header is off by three hours:

>>> import time
>>> from email.utils import parsedate
>>> lastmod = "Thu, 28 May 2020 16:40:36 GMT"
>>> parsedate(lastmod)
(2020, 5, 28, 16, 40, 36, 0, 1, -1)  # <-- ok
>>> time.mktime(parsedate(lastmod))
1590673236.0  # <-- not ok! 3 hours offset (caused by my TZ)

# 1590673236 is "Thursday, May 28, 2020 1:40:36 PM GMT" (notice the 3 hours difference, caused by my TZ)
# should be 1590684036

A resolution for this is to replace time.mktime() with calendar.timegm().
Reference: https://docs.python.org/3/library/time.html#index-4
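A minimal sketch of the proposed replacement, using only the standard library. The function name http_date_to_epoch is hypothetical and is not taken from update.py.

```python
import calendar
from email.utils import parsedate


def http_date_to_epoch(value: str) -> int:
    """Convert an HTTP date header (always GMT) to a Unix timestamp."""
    parsed = parsedate(value)
    if parsed is None:
        raise ValueError("unparseable date: %r" % value)
    # timegm interprets the fields as UTC, independent of the local
    # timezone, whereas time.mktime would treat them as local time.
    return calendar.timegm(parsed)


print(http_date_to_epoch("Thu, 28 May 2020 16:40:36 GMT"))  # 1590684036
```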

Public Suffix data incorrect?

Hello,
Thank you for this project. If I understand the purpose of these methods correctly, I believe the parser is returning incorrect information. It's my understanding that eTLDs (effective top-level domains), where an organization could register a private domain, such as ".com" and "com.uk", would be "public suffixes". Domains that someone would register there, such as "google.com" and "google.com.uk", would both be "private suffixes". However, that isn't what the tool produces.

>>> PublicSuffixList().is_private("com.uk")   <-Should be False
True
>>> PublicSuffixList().is_private("com")
False

Furthermore if I try to retrieve the public and private suffixes I get incorrect data as well.

>>> PublicSuffixList().publicsuffix("google.com.uk")   <- should be com.uk
'uk'          
>>> PublicSuffixList().privatesuffix("google.com.uk")  <- Should be google.com.uk
'com.uk'
>>> PublicSuffixList().privatesuffix("google.com")     <- should be google.com and is correct
'google.com'

is_public is broken for upper case input

psl.is_public() is broken for upper case input with 2 or more labels.

psl.is_public("Jp") # => True
psl.is_public("Co.jp") # => False

A TLD-only domain returns the right value only by accident. Related to #20.

Uppercase domain causes inconsistent result for TLD

publicsuffix() in 0.7.14 returns non-lower suffix for TLDs.

psl = publicsuffixlist.PublicSuffixList()
psl.publicsuffix("example.COM") # => "com"
psl.publicsuffix("COM") # => "COM"

The shortcut code path for a TLD-only domain should return the lowercased value for consistency.

Consider using @typing.overload instead of Union input/output types

I recently pulled in the new release of the library into my project and had some difficulty with API changes to the privatesuffix() method in particular. My project's existing code that used the old version passed in a str hostname and relied on getting an Optional[str] back, handling the result like this (as an isolated example):

r = self.psl.privatesuffix(hostname)
return r if r else hostname

The updated version of privatesuffix() now takes in a RelaxDomain (i.e., Union[str, BytesTuple, Iterable]) and returns an Optional[Domain] (i.e., Optional[Union[str, BytesTuple]]). Looking over the code, my understanding is that this extends the method with a new specialization that additionally takes in a BytesTuple or Iterable and in turn outputs a BytesTuple, but the "original" str version still exists (that is, if you pass in a str you effectively get out an Optional[str]). This means that our calling code now does:

r = self.psl.privatesuffix(hostname)
return r if isinstance(r, str) else hostname

This isn't too bad, but it does mean we're making some assumptions about the internal implementation of this method, where the API/types contract is a bit opaque (there's no type-system guarantee that str in means str out).

If this was instead implemented as method overloads using @typing.overload, the type system would know that the str in type was explicitly connected to the str out type, like this:

@overload
def privatesuffix(self, domain: str, ...) -> Optional[str]: ...
@overload
def privatesuffix(self, domain: BytesTuple, ...) -> Optional[BytesTuple]: ...
def privatesuffix(self, domain, ...) -> Optional[str] | Optional[BytesTuple]:
    # actual implementation here

If this was something that seemed desirable for this library, I'm happy to try working on a PR, but wanted to discuss before doing that (and if it was desirable I'd want to discuss how far to extend this pattern throughout the library). I also understand if the additional complexity doesn't seem warranted. Ultimately the downstream burden we have for this is pretty minimal :-)
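To make the idea concrete, here is a self-contained, runnable sketch of the overload pattern. Everything in it is a stand-in: ToySuffixList is a hypothetical class that simply treats the last label as the public suffix, and BytesTuple is redefined locally rather than imported from the library.

```python
from typing import Optional, Tuple, Union, overload

BytesTuple = Tuple[bytes, ...]  # local stand-in for the library's alias


class ToySuffixList:
    """Toy stand-in: treats the last label as the public suffix."""

    @overload
    def privatesuffix(self, domain: str) -> Optional[str]: ...

    @overload
    def privatesuffix(self, domain: BytesTuple) -> Optional[BytesTuple]: ...

    def privatesuffix(
        self, domain: Union[str, BytesTuple]
    ) -> Union[str, BytesTuple, None]:
        # Single runtime implementation; the overloads above only
        # inform the type checker that str in means str out.
        if isinstance(domain, str):
            labels = domain.split(".")
            return ".".join(labels[-2:]) if len(labels) >= 2 else None
        return domain[-2:] if len(domain) >= 2 else None


psl = ToySuffixList()
r = psl.privatesuffix("www.example.com")  # a type checker infers Optional[str]
print(r if r else "www.example.com")      # example.com
```

With overloads like these, the calling code can go back to a plain truthiness check instead of isinstance(r, str).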

Sdist on PyPI

Thank you for writing publicsuffixlist!

For those of us using buildout or other non-wheel-aware installers (or at least for me) it would be convenient to have an sdist available on PyPI. Could I bother you to upload one?
