ko-zu / psl
publicsuffixlist for python
License: Mozilla Public License 2.0
It'd be nice to not have to rely on a person to get the latest updates.
Calling privatesuffix("something.com.mx") returns "com.mx".
Thank you for writing publicsuffixlist!
For those of us using buildout or other non-wheel-aware installers (or at least for me) it would be convenient to have an sdist available on PyPI. Could I bother you to upload one?
On PyPI, version 0.6.1 has been published, but the repo on GitHub is still at version 0.6.0.
When running publicsuffixlist-download --help on Windows, I get the following error:
Error:
Traceback (most recent call last):
File "C:\bld\publicsuffixlist_1675107463020\_test_env\Scripts\publicsuffixlist-download-script.py", line 9, in <module>
sys.exit(updatePSL())
^^^^^^^^^^^
File "C:\bld\publicsuffixlist_1675107463020\_test_env\Lib\site-packages\publicsuffixlist\update.py", line 41, in updatePSL
os.rename(psl_file + ".swp", psl_file)
FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'C:\\bld\\publicsuffixlist_1675107463020\\_test_env\\Lib\\site-packages\\publicsuffixlist\\public_suffix_list.dat.swp' -> 'C:\\bld\\publicsuffixlist_1675107463020\\_test_env\\Lib\\site-packages\\publicsuffixlist\\public_suffix_list.dat'
Logs: https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=650026&view=logs&j=3ff94dba-189a-527c-65e3-ce8503824159&t=35acf2bd-66a8-5b9f-4368-b52d351bfcc2
Context: conda-forge/staged-recipes#21906
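The traceback above is consistent with os.rename refusing to overwrite an existing destination on Windows. Below is a minimal sketch of one possible workaround, assuming the updater first writes the download to a temporary .swp file. os.replace overwrites the destination on both POSIX and Windows. The helper name install_downloaded_file and the demo are hypothetical and are not the library's actual code.

```python
import os
import tempfile


def install_downloaded_file(swap_path, final_path):
    """Move swap_path over final_path, overwriting any existing file."""
    # os.replace overwrites the destination on both POSIX and Windows,
    # whereas os.rename raises FileExistsError on Windows if it exists.
    os.replace(swap_path, final_path)


def demo():
    """Create an old list and a freshly 'downloaded' one, then swap them."""
    with tempfile.TemporaryDirectory() as tmp:
        final_path = os.path.join(tmp, "public_suffix_list.dat")
        swap_path = final_path + ".swp"
        with open(final_path, "w") as f:
            f.write("old")
        with open(swap_path, "w") as f:
            f.write("new")
        install_downloaded_file(swap_path, final_path)
        with open(final_path) as f:
            content = f.read()
        # Report the installed content and whether the swap file remains.
        return content, os.path.exists(swap_path)
```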
Could you please tag the source? This allows distributions to get the complete source from GitHub if they want.
Thanks
If there is a rule like:
*.abc.com
I would expect that, given substuf.def.abc.com, the public suffix would be def.abc.com.
from publicsuffixlist import PublicSuffixList
# RULES TESTED:
# *.awdev.ca
# *.advisor.ws
#
# *.compute.amazonaws.com
# *.compute-1.amazonaws.com
# *.compute.amazonaws.com.cn
#
# *.elb.amazonaws.com
# *.elb.amazonaws.com.cn
psl = PublicSuffixList()
domains = [
    'test.awdev.ca',
    'test.advisor.ws',
    'test.compute.amazonaws.com',
    'test.compute-1.amazonaws.com',
    'test.compute.amazonaws.com.cn',
    'test.elb.amazonaws.com',
    'test.amazonaws.com.cn',
    # add another level and it gets weird
    'sub.test.awdev.ca',
    'sub.test.advisor.ws',
    'sub.test.compute.amazonaws.com',
    'sub.test.compute-1.amazonaws.com',
    'sub.test.compute.amazonaws.com.cn',
    'sub.test.elb.amazonaws.com',
    'sub.test.amazonaws.com.cn',
]
output = [(d, psl.privatesuffix(d)) for d in domains]
for name, suffix in output:
    print(f'{name} -> {suffix}')
Output from the run:
test.awdev.ca -> None
test.advisor.ws -> None
test.compute.amazonaws.com -> None
test.compute-1.amazonaws.com -> None
test.compute.amazonaws.com.cn -> None
test.elb.amazonaws.com -> None
test.amazonaws.com.cn -> amazonaws.com.cn
sub.test.awdev.ca -> sub.test.awdev.ca
sub.test.advisor.ws -> sub.test.advisor.ws
sub.test.compute.amazonaws.com -> sub.test.compute.amazonaws.com
sub.test.compute-1.amazonaws.com -> sub.test.compute-1.amazonaws.com
sub.test.compute.amazonaws.com.cn -> sub.test.compute.amazonaws.com.cn
sub.test.elb.amazonaws.com -> sub.test.elb.amazonaws.com
sub.test.amazonaws.com.cn -> amazonaws.com.cn
I would have expected the first set to return the domains unchanged and the second set to return the domain minus the sub. part.
In either case, the behavior is inconsistent for 2 reasons:
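For reference, here is a minimal sketch of how wildcard rules are matched under the algorithm described on publicsuffix.org, as I understand it: a * matches exactly one label, the longest matching rule wins, and the private suffix is the public suffix plus one more label. This is not the library's implementation. Exception rules (those starting with !) are left out, and the RULES list is an illustrative subset chosen for this example.

```python
# Illustrative subset of rules; the real list is much larger.
RULES = ["com", "ca", "*.abc.com", "*.awdev.ca"]


def public_suffix(domain, rules=RULES):
    """Return the public suffix of domain using longest-rule-wins matching."""
    labels = domain.lower().split(".")
    best = 1  # default rule "*": the last label is a public suffix
    for rule in rules:
        rule_labels = rule.split(".")
        if len(rule_labels) > len(labels):
            continue  # a rule cannot match a name shorter than itself
        tail = labels[-len(rule_labels):]
        # "*" matches any single label; other labels must match exactly.
        if all(r == "*" or r == l for r, l in zip(rule_labels, tail)):
            best = max(best, len(rule_labels))
    return ".".join(labels[-best:])


def private_suffix(domain, rules=RULES):
    """Return the public suffix plus one label, or None if there is none."""
    labels = domain.lower().split(".")
    n = len(public_suffix(domain, rules).split("."))
    if len(labels) <= n:
        return None  # the name itself is a public suffix
    return ".".join(labels[-(n + 1):])
```

Under these assumptions, public_suffix("substuf.def.abc.com") gives def.abc.com, private_suffix("test.awdev.ca") gives None because the wildcard makes the whole name a public suffix, and private_suffix("sub.test.awdev.ca") gives the name unchanged.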
cloudfront.net is a public suffix that belongs to Amazon.
But before that suffix was registered, Amazon already held the domain cloudfront under the TLD .net.
This makes it confusing to determine the root domain of *.cloudfront.net.
Examples:
In [164]: ps.privatesuffix('d2os3n5ieuk9g5.cloudfront.net')
Out[164]: 'd2os3n5ieuk9g5.cloudfront.net'
In [165]: ps.privatesuffix('a286330aad4e096be6cdda229527774f4.profile.tlv50-c1.cloudfront.net')
Out[165]: 'tlv50-c1.cloudfront.net'
We know that every root domain has an NS record, so let's check.
dig d2os3n5ieuk9g5.cloudfront.net NS
; <<>> DiG 9.10.6 <<>> d2os3n5ieuk9g5.cloudfront.net NS
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1735
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 3
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4000
;; QUESTION SECTION:
;d2os3n5ieuk9g5.cloudfront.net. IN NS
;; ANSWER SECTION:
d2os3n5ieuk9g5.cloudfront.net. 830 IN NS ns-1961.awsdns-53.co.uk.
d2os3n5ieuk9g5.cloudfront.net. 830 IN NS ns-1525.awsdns-62.org.
d2os3n5ieuk9g5.cloudfront.net. 830 IN NS ns-765.awsdns-31.net.
d2os3n5ieuk9g5.cloudfront.net. 830 IN NS ns-224.awsdns-28.com.
;; ADDITIONAL SECTION:
ns-1961.awsdns-53.co.uk. 2488 IN A 205.251.199.169
ns-1525.awsdns-62.org. 8341 IN A 205.251.197.245
;; Query time: 36 msec
;; SERVER: 10.95.44.53#53(10.95.44.53)
;; WHEN: Wed Sep 09 13:12:46 CST 2020
;; MSG SIZE rcvd: 227
dig tlv50-c1.cloudfront.net NS
; <<>> DiG 9.10.6 <<>> tlv50-c1.cloudfront.net NS
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 868
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4000
;; QUESTION SECTION:
;tlv50-c1.cloudfront.net. IN NS
;; AUTHORITY SECTION:
cloudfront.net. 59 IN SOA ns-418.awsdns-52.com. hostmaster.cloudfront.net. 1377556270 16384 2048 1048576 60
;; Query time: 1018 msec
;; SERVER: 10.95.44.53#53(10.95.44.53)
;; WHEN: Wed Sep 09 13:13:56 CST 2020
;; MSG SIZE rcvd: 119
nslookup a286330aad4e096be6cdda229527774f4.profile.tlv50-c1.cloudfront.net 8.8.8.8
Server: 8.8.8.8
Address: 8.8.8.8#53
Non-authoritative answer:
Name: a286330aad4e096be6cdda229527774f4.profile.tlv50-c1.cloudfront.net
Address: 13.226.6.197
Name: a286330aad4e096be6cdda229527774f4.profile.tlv50-c1.cloudfront.net
Address: 13.226.6.231
Name: a286330aad4e096be6cdda229527774f4.profile.tlv50-c1.cloudfront.net
Address: 13.226.6.22
Name: a286330aad4e096be6cdda229527774f4.profile.tlv50-c1.cloudfront.net
Address: 13.226.6.45
tlv50-c1.cloudfront.net has no NS record, but a286330aad4e096be6cdda229527774f4.profile.tlv50-c1.cloudfront.net has an A record,
so the root domain of a286330aad4e096be6cdda229527774f4.profile.tlv50-c1.cloudfront.net is cloudfront.net.
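The reasoning above can be sketched as a walk up the labels until a name with NS records is found. This is only an illustration of the argument, not something the library does. The function name closest_delegated_name is hypothetical, and the lookup is passed in as a callable so that no network access is needed. The stub set below reflects the dig output above, plus the assumption that cloudfront.net itself has NS records because it answered with an SOA.

```python
# Names assumed to have NS records. The first comes from the dig output;
# cloudfront.net is an assumption based on the SOA in the authority section.
NAMES_WITH_NS = {"d2os3n5ieuk9g5.cloudfront.net", "cloudfront.net"}


def closest_delegated_name(domain, has_ns):
    """Return the longest suffix of domain for which has_ns() is true."""
    labels = domain.rstrip(".").lower().split(".")
    # Try the full name first, then drop one leading label at a time.
    # The bare TLD is never tried.
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if has_ns(candidate):
            return candidate
    return None


def stub_has_ns(name):
    """Stand-in for a real DNS NS query, backed by the set above."""
    return name in NAMES_WITH_NS
```

With a real resolver, has_ns would issue an NS query for each candidate. With the stub, the long profile name resolves to cloudfront.net, while d2os3n5ieuk9g5.cloudfront.net resolves to itself.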
Hi,
in the readme you link to https://mxr.mozilla.org/mozilla-central/source/netwerk/test/unit/data/test_psl.txt?raw=1 but that no longer exists.
The domain does not resolve anymore.
I reside in UTC+03. When I use the update.py script, the Last-Modified timestamp is converted with an offset equal to my timezone:
>>> import time
>>> from email.utils import parsedate
>>> lastmod = "Thu, 28 May 2020 16:40:36 GMT"
>>> parsedate(lastmod)
(2020, 5, 28, 16, 40, 36, 0, 1, -1) # <-- ok
>>> time.mktime(parsedate(lastmod))
1590673236.0 # <-- not ok! 3 hours offset (caused by my TZ)
# 1590673236 is "Thursday, May 28, 2020 1:40:36 PM GMT" (notice the 3 hours difference, caused by my TZ)
# should be 1590684036
A resolution for this is to replace time.mktime() with calendar.timegm().
Reference: https://docs.python.org/3/library/time.html#index-4
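A minimal sketch of the suggested fix, assuming the header value is an HTTP date in GMT as in the example above. The helper name http_date_to_epoch is hypothetical and is not taken from update.py.

```python
import calendar
from email.utils import parsedate


def http_date_to_epoch(value):
    """Convert an HTTP date string in GMT to a Unix timestamp."""
    parsed = parsedate(value)
    if parsed is None:
        raise ValueError(f"unparseable date: {value!r}")
    # calendar.timegm interprets the time tuple as UTC regardless of the
    # local timezone; time.mktime would interpret it as local time.
    return calendar.timegm(parsed)


print(http_date_to_epoch("Thu, 28 May 2020 16:40:36 GMT"))  # 1590684036
```

The result does not depend on the TZ setting of the machine running it.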
psl.is_public() is broken for upper case input with 2 or more labels.
psl.is_public("Jp") # => True
psl.is_public("Co.jp") # => False
A TLD-only domain returned the right value only by accident. Related to #20.
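One way to make such lookups case-insensitive is to normalize the input once before comparing it against the rules. The sketch below is not the library's code. It handles exact-match rules only, and the PUBLIC_SUFFIXES set is an illustrative subset.

```python
# Illustrative subset of exact-match rules, stored in lower case.
PUBLIC_SUFFIXES = {"jp", "co.jp"}


def is_public(domain, suffixes=PUBLIC_SUFFIXES):
    """Return True if domain is itself a public suffix, ignoring case."""
    # Normalize once up front so that every code path, including any
    # TLD-only shortcut, sees the same lower-cased name.
    normalized = domain.rstrip(".").lower()
    return normalized in suffixes
```

With this normalization, "Jp" and "Co.jp" are both reported as public, and "Example.co.jp" is not.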
Try this example, as I believe there is a bug with a dash. Note that all I did was change "compute-1" to "compute1" and then it works as expected.
>>> from publicsuffixlist import PublicSuffixList
>>> psl = PublicSuffixList()
>>> psl.privatesuffix('ec2-107-21-74-29.compute-1.amazonaws.com')
>>> psl.publicsuffix('ec2-107-21-74-29.compute-1.amazonaws.com')
'ec2-107-21-74-29.compute-1.amazonaws.com'
>>> psl.publicsuffix('ec2-107-21-74-29.compute1.amazonaws.com')
'com'
>>> psl.privatesuffix('ec2-107-21-74-29.compute1.amazonaws.com')
'amazonaws.com'
publicsuffix() in 0.7.14 returns non-lower suffix for TLDs.
psl = publicsuffixlist.PublicSuffixList()
psl.publicsuffix("example.COM") # => "com"
psl.publicsuffix("COM") # => "COM"
The shortcut code path for TLD-only domains should return the lowercased value for consistency.
Hello,
Thank you for this project. If I understand the purpose of these methods correctly, then I believe the parser is pulling the incorrect information. It's my understanding that eTLDs (effective top-level domains), where an organization could register a private domain, such as ".com" and "com.uk", would be "public suffixes". The domains that someone would register there, such as "google.com" and "google.com.uk", would both be "private suffixes". However, that isn't what the tool produces.
>>> PublicSuffixList().is_private("com.uk")  # <- should be False
True
>>> PublicSuffixList().is_private("com")
False
Furthermore if I try to retrieve the public and private suffixes I get incorrect data as well.
>>> PublicSuffixList().publicsuffix("google.com.uk")  # <- should be com.uk
'uk'
>>> PublicSuffixList().privatesuffix("google.com.uk")  # <- should be google.com.uk
'com.uk'
>>> PublicSuffixList().privatesuffix("google.com")  # <- should be google.com and is correct
'google.com'