I am a cyber security researcher and programmer.
Do you want to be one too? Check out my advice for learning hacking and programming.
You can support my work with a few bucks, here.
Incredibly fast crawler designed for OSINT.
License: GNU General Public License v3.0
Collecting /ads.txt and HTTPS certificate info may be useful.
OS: Arch Linux
Python Version: 3.7
Python Script:
from photon import crawl
data = crawl("http://stackoverflow.com", timeout=5)
print(data)
Exception Thrown
Exception in thread Thread-38:
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "/usr/lib/python3.7/site-packages/urllib3/connectionpool.py", line 380, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.7/http/client.py", line 1321, in getresponse
response.begin()
File "/usr/lib/python3.7/http/client.py", line 296, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.7/http/client.py", line 257, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
File "/usr/lib/python3.7/ssl.py", line 1049, in recv_into
return self.read(nbytes, buffer)
File "/usr/lib/python3.7/ssl.py", line 908, in read
return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/requests/adapters.py", line 445, in send
timeout=timeout
File "/usr/lib/python3.7/site-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/usr/lib/python3.7/site-packages/urllib3/util/retry.py", line 367, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/lib/python3.7/site-packages/urllib3/packages/six.py", line 686, in reraise
raise value
File "/usr/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/usr/lib/python3.7/site-packages/urllib3/connectionpool.py", line 386, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/usr/lib/python3.7/site-packages/urllib3/connectionpool.py", line 306, in _raise_timeout
raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='stackoverflow.com', port=443): Read timed out. (read timeout=5)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 865, in run
self._target(*self._args, **self._kwargs)
File "/home/$USER/.local/lib/python3.7/site-packages/photon/photon.py", line 184, in extractor
response = requester(url, delay, domain_name, user_agents, cookie, timeout) # make request to the url
File "/home/$USER/.local/lib/python3.7/site-packages/photon/photon.py", line 56, in requester
response = get(url, cookies=cookie, headers=headers, verify=False, timeout=timeout, stream=True)
File "/usr/lib/python3.7/site-packages/requests/api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "/usr/lib/python3.7/site-packages/requests/api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/lib/python3.7/site-packages/requests/sessions.py", line 512, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python3.7/site-packages/requests/sessions.py", line 622, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python3.7/site-packages/requests/adapters.py", line 526, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='stackoverflow.com', port=443): Read timed out. (read timeout=5)
Exception in thread Thread-43:
(identical traceback to Thread-38 above, ending in the same requests.exceptions.ReadTimeout)
^CTraceback (most recent call last):
File "a.py", line 3, in <module>
data = crawl("https://stackoverflow.com", timeout=5)
File "/home/$USER/.local/lib/python3.7/site-packages/photon/photon.py", line 296, in crawl
flash(extractor, links, threads, delay, domain_name, user_agents, cookie, timeout, regex, keys, only_urls, main_url)
File "/home/$USER/.local/lib/python3.7/site-packages/photon/photon.py", line 255, in flash
threader(function, delay, threads, domain_name, user_agents, cookie, timeout, regex, keys, only_urls, main_url, splitted)
File "/home/$USER/.local/lib/python3.7/site-packages/photon/photon.py", line 242, in threader
thread.join()
File "/usr/lib/python3.7/threading.py", line 1032, in join
self._wait_for_tstate_lock()
File "/usr/lib/python3.7/threading.py", line 1048, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
KeyboardInterrupt
^CException ignored in: <module 'threading' from '/usr/lib/python3.7/threading.py'>
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 1273, in _shutdown
t.join()
File "/usr/lib/python3.7/threading.py", line 1032, in join
self._wait_for_tstate_lock()
File "/usr/lib/python3.7/threading.py", line 1048, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
KeyboardInterrupt
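These per-thread timeouts could be contained inside the worker instead of escaping to stderr; a minimal stdlib sketch of that idea (safe_fetch is a hypothetical helper, not Photon's actual requester):

```python
import socket
import urllib.request
import urllib.error

def safe_fetch(url, timeout=5):
    """Fetch a URL but swallow timeouts and connection errors instead
    of letting the exception escape the worker thread (sketch only)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except (urllib.error.URLError, socket.timeout, ValueError):
        return None  # treat slow/unreachable pages as an empty result
```

With requests, the equivalent would be catching requests.exceptions.RequestException around the get() call in requester().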
https://github.com/s0md3v/Photon/blob/master/photon.py#L313 and
https://github.com/s0md3v/Photon/blob/master/photon.py#L478 do not work as expected...
$ python3 -c "print('a' or 'e' or 'i' or 'o' or 'u')" --> a
By or-ing these strings together we only ever get the first truthy value, which is 'a' above, or '.png' on line 313.
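A membership test does what the or-chain was presumably meant to do; for example (the url value here is just illustrative):

```python
# `or` returns its first truthy operand, so this is always '.png':
always_png = '.png' or '.jpg' or '.gif'

# A correct extension check compares the URL against each suffix:
url = 'https://example.com/logo.jpg'
is_static = url.lower().endswith(('.png', '.jpg', '.gif'))
print(always_png, is_static)
```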
def threader(function, *urls):
    threads = []  # list of threads
    urls = urls[0]  # because urls is a tuple
    for url in urls:  # iterating over urls
        task = threading.Thread(target=function, args=(url,))
        threads.append(task)
    # start threads
    for thread in threads:
        thread.start()
    # wait for all threads to complete their work
    for thread in threads:
        thread.join()
    # delete threads
    del threads[:]
So I was just crawling my own website and I noticed a high number of false positives in the results saved in the links.txt file.
Contents of links.txt
from running
$ python photon.py -u https://site.com -t 4
https://site.com/Facebook_Forum
https://site.com/about us
https://site.com/EmailAccount
https://site.com/YouTube channel
https://site.com/Facebook Forum
https://site.com
https://site.com/prev
https://site.com/GitHub
https://site.com/Youtube Channel
https://site.com/error_footer
https://site.com/
https://site.com/YouTube_Channel
https://site.com/Facebook_page
https://site.com/cdn-cgi/l/email-protection
https://site.com/next
https://site.com/Projects
https://site.com/facebook page
https://site.com/index.html
https://site.com/shit
https://site.com/contact
https://site.com/YouTube_channel
About 95% of these were non-existent URLs.
Is there any way to fix this, or an explanation of why it's happening?
This is the issue if you call the script from /home, for instance:
The two highlighted lines are the output of print(os.getcwd()) and print(sys.path[0]) respectively. As you can see, the second line is what will make the script work if you call it from anywhere.
Fix: on line 182: https://github.com/s0md3v/Photon/blob/master/photon.py#L182
Change:
with open(os.getcwd() +
to:
with open(sys.path[0] +
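A quick way to see the difference (run the script from a directory other than the one containing it; the paths in the comments are illustrative):

```python
import os
import sys

# os.getcwd() is wherever the user happened to be when invoking the
# script; sys.path[0] is the directory containing the script itself,
# so files opened relative to it resolve no matter where you call from.
print(os.getcwd())   # e.g. /home
print(sys.path[0])   # e.g. /opt/Photon
```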
Questions?
On the wiki page https://github.com/s0md3v/Photon/wiki/Compatibility-&-Dependencies, the tld module should be added under Dependencies.
py2 photon.py -u http://xxxx/login.jsp --wayback
____ __ __
/ __ \/ /_ ____ / /_____ ____
/ /_/ / __ \/ __ \/ __/ __ \/ __ \
/ ____/ / / / /_/ / /_/ /_/ / / / /
/_/ /_/ /_/\____/\__/\____/_/ /_/ v1.1.5
[~] Fetching URLs from archive.org
[+] Retrieved -1 URLs from archive.org
[~] Level 1: 1 URLs
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "photon.py", line 435, in extractor
response = requester(url) # make request to the url
File "photon.py", line 285, in requester
return normal(url)
File "photon.py", line 239, in normal
response = get(url, cookies=cook, headers=finalHeaders, verify=False, timeout=timeout, stream=True)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 65, in get
return request('get', url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/safe_mode.py", line 39, in wrapped
return function(method, url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 51, in request
return session.request(method=method, url=url, **kwargs)
TypeError: request() got an unexpected keyword argument 'stream'
Hi,
First, good job !
Is it possible to build a new txt file with all the linking information?
Something like this:
Source \t Target \t Type \t Nofollow
http://ndd.tld/A \t http://ndd.tld/B \t AHREF \t TRUE
http://ndd.tld/A \t http://ndd.tld/B.jpg \t AHREF \t TRUE
The purpose is to have an idea of how pages are related to each other within the website.
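Writing such an edge list would be a small addition; a sketch with the csv module, using hypothetical edge records collected during the crawl:

```python
import csv

# Hypothetical link records: (source, target, type, nofollow)
edges = [
    ('http://ndd.tld/A', 'http://ndd.tld/B', 'AHREF', True),
    ('http://ndd.tld/A', 'http://ndd.tld/B.jpg', 'AHREF', True),
]

with open('links_graph.txt', 'w', newline='') as f:
    out = csv.writer(f, delimiter='\t')
    out.writerow(('Source', 'Target', 'Type', 'Nofollow'))
    for source, target, link_type, nofollow in edges:
        out.writerow((source, target, link_type, str(nofollow).upper()))
```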
As already mentioned in #98, the setup.py file is missing from the repo.
If there is one around, then I would like to suggest adding install_requires and entry_points.
With entry_points, one could launch Photon with photon without caring about the interpreter.
It looks like it would require some changes in photon.py for the access of at least user-agents.txt:

else:
    here = os.path.abspath(os.path.dirname(__file__))
    with open('{}/core/user-agents.txt'.format(here), 'r') as uas:
        user_agents = [agent.strip('\n') for agent in uas]
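For reference, a hypothetical setup.py along those lines might look like this (it assumes photon.py exposes a main() function, which would itself require a small change):

```python
from setuptools import setup

setup(
    name='photon',
    version='1.1.5',
    py_modules=['photon'],
    install_requires=['requests', 'tld'],
    entry_points={
        'console_scripts': [
            'photon = photon:main',  # lets users run Photon as plain `photon`
        ],
    },
)
```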
Would it be possible to include a non colored mode?
I know it's such a strange edge case, but Pythonista on iOS doesn’t support colors in the output (yet) and it just displays this instead of the colored Photon text
�[91m ____ __ __ / �[1;97m__�[91m \/ /_ ____ / /_____ ____ / �[1;97m/_/�[91m / __ \/ �[1;97m__�[91m \/ __/ �[1;97m__�[91m \/ __ \ / ____/ / / / �[1;97m/_/�[91m / /_/ �[1;97m/_/�[91m / / / / /_/ /_/ /_/\____/\__/\____/_/ /_/ �[1;m
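A non-colored mode could be as small as stripping the ANSI escape sequences before printing; a sketch:

```python
import re

# Matches ANSI color escape sequences such as \x1b[91m and \x1b[1;97m.
ANSI_RE = re.compile(r'\x1b\[[0-9;]*m')

def strip_colors(text):
    """Return text with all ANSI color codes removed."""
    return ANSI_RE.sub('', text)

print(strip_colors('\x1b[91mPhoton\x1b[0m'))  # prints: Photon
```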
The approach of splitting the list is a good method, as explained in your README, but wouldn't a Queue accomplish the same thing without all the list-splitting? You'd still only access unique items across workers.
Just wondering if there's a reason you didn't do this; I was going to put in a PR for exactly this, but won't bother if you're avoiding it for some reason.
Thanks!
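For comparison, a queue.Queue version of the worker pattern might look like this (crawl_with_queue and handle are hypothetical names, not Photon's API):

```python
import queue
import threading

def crawl_with_queue(urls, worker_count=4, handle=print):
    """Each worker pulls from a shared queue, so every URL is processed
    exactly once without pre-splitting the list (sketch only)."""
    q = queue.Queue()
    for url in urls:
        q.put(url)

    def worker():
        while True:
            try:
                url = q.get_nowait()
            except queue.Empty:
                return  # queue drained: worker exits
            try:
                handle(url)
            finally:
                q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```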
Traceback (most recent call last):
File "photon.py", line 382, in
f.write(x + '\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 137: ordinal not in range(128)
Fix:
change f.write(x + '\n')
to f.write(x.encode('utf-8') + '\n')
I used python3 photon.py --url http://x.x.x.x --level 1 --only-url
and I got a list of 103 internal URLs.
All the URLs use the following pattern: http://x.x.x.x/?r=[redirection_token].
Having this list alone is pretty useless; what is interesting is getting the redirection value (for example, contained in the Location header after an HTTP 302 or 303 code).
There should be an option to store the redirection value instead of the raw URL when a redirect HTTP code is hit.
This could be implemented with something like the following (sketch; note that seeing the Location header requires making the request with allow_redirects=False, and store() is a placeholder):

def check_http_code_status(response):
    if response.status_code == 200:
        store(response.url)
    elif response.status_code in (301, 302, 303):
        store(response.headers['Location'])
    elif response.status_code == 404:
        pass  # do nothing
Hi
After using the option --update I get this error:
Traceback (most recent call last):
File "photon.py", line 187, in
domain = topLevel(main_url)
File "photon.py", line 182, in topLevel
toplevel = tld.get_fld(host, fix_protocol=True)
AttributeError: 'module' object has no attribute 'get_fld'
Any clue?
I installed it on another system as well and it's still the same.
Changing the target: same.
Deleting everything and cloning again: same.
OS: Kali3
Thanks
But I want to know: how can I download just the pictures from a website?
Right now, if the user specifies -o /root, won't the program recursively delete /root because it exists (line 640)?
That is why in my pull I changed the w+ to w and only modified the files you wrote in the first place.
I suggest removing the part about recursively removing the directory structure and instead just overwriting your files if the tool is run twice.
Can you please make an API for it so that we can use it in our projects?
I would like to limit results to one domain only or to match some url regex.
The output text files currently don't have a newline at the end, so it messes up the terminal if you cat them, or if I try to cat two files together the first line of the second file is on the same line as the last line of the first file. Simple fix:
Add "f.write('\n')" after these two lines:
https://github.com/s0md3v/Photon/blob/master/photon.py#L494
https://github.com/s0md3v/Photon/blob/master/photon.py#L490
def writer(datasets, dataset_names, output_dir):
    for dataset, dataset_name in zip(datasets, dataset_names):
        if dataset:
            filepath = output_dir + '/' + dataset_name + '.txt'
            if python3:
                with open(filepath, 'w+', encoding='utf8') as f:
                    f.write(str('\n'.join(dataset)))
                    f.write('\n')
            else:
                with open(filepath, 'w+') as f:
                    joined = '\n'.join(dataset)
                    f.write(str(joined.encode('utf-8')))
                    f.write('\n')
thanks
Dear Team,
Tool is awesome, but when I tried running your command like:
Python photon.py http://site.com --delay=1.5
I got result like this one:
____ __ __
/ __ \/ /_ ____ / /_____ ____
/ /_/ / __ \/ __ \/ __/ __ \/ __ \
/ ____/ / / / /_/ / /_/ /_/ / / / /
/_/ /_/ /_/\____/\__/\____/_/ /_/
[!] Links to crawl: 108
[!] Time required: ~5 minutes
[+] Total URLs found: 780
[+] Fuzzable URLs found: 149
[+] JavaScript files found: 126
[~] Scanning JavaScript files for endpoints
[!] Time required: ~1 minute
[+] Enpoints found: 346
mkdir: missing operand
Try 'mkdir --help' for more information.
[+] Results saved in directory
I don't know where the results are being saved.
Can you please update the README file?
Thanks.
That's a small one I've spotted browsing your code...
Considering the following block of code:
with open('%s/links.txt' % name, 'w+') as f:
    for x in storage:
        f.write(x + '\n')
f.close()
the f.close()
instruction is redundant, because if you're using the with
statement, the file is already closed, as shown in the Python documentation: https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files
>>> with open('workfile') as f:
... read_data = f.read()
>>> f.closed
True
Okay, so I was going through the Photon library and I found some areas of improvement:
1. A --plugin <name of plugin> / --plugin=<name of plugin> option is a good idea. Why? Like, if there are 100 plugins, will you serve 100 arguments, one for each plugin? Sick idea, isn't it?
2. Make exporter a part of the main build, not a plugin. Plugins should be strictly restricted to extending features and enrichment of capabilities only. Reconnaissance is way too broad for Photon so far as the available plugins are concerned.
3. A -v/--verbose flag for Photon. Sometimes just that small output of what it has crawled doesn't satisfy. I have done it!
An option to supply a dictionary of strings to brute force at each level of the directory tree would be great.
It's not uncommon to see juicy dirs that are not linked by the rest of the app.
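A sketch of what such an option could do at each level (brute_dirs is a hypothetical helper; a real implementation would reuse Photon's requester and threading):

```python
import socket
import urllib.request
import urllib.error

def brute_dirs(base_url, wordlist, timeout=5):
    """Probe each candidate path under base_url and collect the ones
    that answer with something other than 404 (sketch only)."""
    found = []
    for word in wordlist:
        url = base_url.rstrip('/') + '/' + word
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                found.append((url, resp.status))
        except urllib.error.HTTPError as err:
            if err.code != 404:
                found.append((url, err.code))  # e.g. 403: exists but forbidden
        except (urllib.error.URLError, socket.timeout, ValueError):
            continue  # unreachable host: skip
    return found
```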
Hey great tool!
I was wondering if you could add an export-to-CSV option?
Hello there, nice tool! Here's something I found.
If I enter:
python3 photon.py -u "https://aaa.example.com"
python3 photon.py -u "https://bbb.example.com"
python3 photon.py -u "https://ccc.example.com"
The results folder is saved with the name "example", but this folder is overwritten with every subdomain's results. I recommend adding the full subdomain and domain name to the folder name so the result folders look like:
aaa.example.com
bbb.example.com
ccc.example.com
This will help when testing multiple subdomains.
PS: Adding an option to load multiple domains and subdomains, like -u "https://aaa.example.com,https://bbb.example.com,https://ccc.example.com", would be nice.
Traceback (most recent call last):
File "photon.py", line 14, in
from requests import get, post
ModuleNotFoundError: No module named 'requests'
The list of user agents can be easily edited in the source code but a command line option to specify one specific user agent string would be nice.
I use Photon on my own server and want to filter Photon's requests out of my log files.
Save the amazonaws buckets found during crawling into a file.
Traceback (most recent call last):
File "C:\Python37\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
chunked=chunked)
File "C:\Python37\lib\site-packages\urllib3\connectionpool.py", line 343, in _make_request
self._validate_conn(conn)
File "C:\Python37\lib\site-packages\urllib3\connectionpool.py", line 849, in _validate_conn
conn.connect()
File "C:\Python37\lib\site-packages\urllib3\connection.py", line 356, in connect
ssl_context=context)
File "C:\Python37\lib\site-packages\urllib3\util\ssl_.py", line 359, in ssl_wrap_socket
return context.wrap_socket(sock, server_hostname=server_hostname)
File "C:\Python37\lib\ssl.py", line 412, in wrap_socket
session=session
File "C:\Python37\lib\ssl.py", line 850, in _create
self.do_handshake()
File "C:\Python37\lib\ssl.py", line 1108, in do_handshake
self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1045)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python37\lib\site-packages\requests\adapters.py", line 445, in send
timeout=timeout
File "C:\Python37\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "C:\Python37\lib\site-packages\urllib3\util\retry.py", line 398, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.xxxx.com', port=443): Max retries exceeded with url: /robots.txt (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1045)')))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "photon.py", line 413, in
zap(main_url)
File "photon.py", line 235, in zap
response = get(url + '/robots.txt').text # makes request to robots.txt
File "C:\Python37\lib\site-packages\requests\api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "C:\Python37\lib\site-packages\requests\api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Python37\lib\site-packages\requests\sessions.py", line 512, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python37\lib\site-packages\requests\sessions.py", line 644, in send
history = [resp for resp in gen] if allow_redirects else []
File "C:\Python37\lib\site-packages\requests\sessions.py", line 644, in
history = [resp for resp in gen] if allow_redirects else []
File "C:\Python37\lib\site-packages\requests\sessions.py", line 222, in resolve_redirects
**adapter_kwargs
File "C:\Python37\lib\site-packages\requests\sessions.py", line 622, in send
r = adapter.send(request, **kwargs)
File "C:\Python37\lib\site-packages\requests\adapters.py", line 511, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='www.xxxx.com', port=443): Max retries exceeded with url: /robots.txt (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1045)')))
Line 169 errors out if you run Photon against an IP. The easiest fix might be to just add a try/except, but there is probably a more elegant solution.
I'm pretty sure this was working before.
root@kali:/opt/Photon# python /opt/Photon/photon.py -u http://192.168.0.213:80
____ __ __
/ __ \/ /_ ____ / /_____ ____
/ /_/ / __ \/ __ \/ __/ __ \/ __ \
/ ____/ / / / /_/ / /_/ /_/ / / / /
/_/ /_/ /_/\____/\__/\____/_/ /_/ v1.1.1
Traceback (most recent call last):
File "/opt/Photon/photon.py", line 169, in <module>
domain = get_fld(host, fix_protocol=True) # Extracts top level domain out of the host
File "/usr/local/lib/python2.7/dist-packages/tld/utils.py", line 387, in get_fld
search_private=search_private
File "/usr/local/lib/python2.7/dist-packages/tld/utils.py", line 339, in process_url
raise TldDomainNotFound(domain_name=domain_name)
tld.exceptions.TldDomainNotFound: Domain 192.168.0.213 didn't match any existing TLD name!
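One way to guard that call, as a sketch: detect bare-IP hosts with the stdlib ipaddress module and skip TLD extraction for them (is_ip_host is an illustrative name, not Photon's code):

```python
import ipaddress
from urllib.parse import urlparse

def is_ip_host(url):
    """True when the URL's host is a bare IP address, in which case
    get_fld() would raise TldDomainNotFound and should be skipped."""
    host = urlparse(url).hostname or ''
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

print(is_ip_host('http://192.168.0.213:80'))    # True
print(is_ip_host('https://stackoverflow.com'))  # False
```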
argv: python photon.py -u https://.... -c PHPSESSID=.... -t 150 -l 10
Traceback:
Traceback (most recent call last):
File "/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "photon.py", line 408, in extractor
response = requester(url) # make request to the url
File "photon.py", line 258, in requester
return normal(url)
File "photon.py", line 212, in normal
response = get(url, cookies=cook, headers=headers, verify=False, timeout=timeout, stream=True)
File "/Users/admin/bin/tools/Photon/venv3/lib/python2.7/site-packages/requests/api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "/Users/admin/bin/tools/Photon/venv3/lib/python2.7/site-packages/requests/api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "/Users/admin/bin/tools/Photon/venv3/lib/python2.7/site-packages/requests/sessions.py", line 498, in request
prep = self.prepare_request(req)
File "/Users/admin/bin/tools/Photon/venv3/lib/python2.7/site-packages/requests/sessions.py", line 419, in prepare_request
cookies = cookiejar_from_dict(cookies)
File "/Users/admin/bin/tools/Photon/venv3/lib/python2.7/site-packages/requests/cookies.py", line 522, in cookiejar_from_dict
cookiejar.set_cookie(create_cookie(name, cookie_dict[name]))
TypeError: string indices must be integers, not str
Full output:
____ __ __
/ __ \/ /_ ____ / /_____ ____
/ /_/ / __ \/ __ \/ __/ __ \/ __ \
/ ____/ / / / /_/ / /_/ /_/ / / / /
/_/ /_/ /_/\____/\__/\____/_/ /_/ v1.1.4
Level 1: 1 URLs
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "photon.py", line 408, in extractor
response = requester(url) # make request to the url
File "photon.py", line 258, in requester
return normal(url)
File "photon.py", line 212, in normal
response = get(url, cookies=cook, headers=headers, verify=False, timeout=timeout, stream=True)
File "/Users/admin/bin/tools/Photon/venv3/lib/python2.7/site-packages/requests/api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "/Users/admin/bin/tools/Photon/venv3/lib/python2.7/site-packages/requests/api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "/Users/admin/bin/tools/Photon/venv3/lib/python2.7/site-packages/requests/sessions.py", line 498, in request
prep = self.prepare_request(req)
File "/Users/admin/bin/tools/Photon/venv3/lib/python2.7/site-packages/requests/sessions.py", line 419, in prepare_request
cookies = cookiejar_from_dict(cookies)
File "/Users/admin/bin/tools/Photon/venv3/lib/python2.7/site-packages/requests/cookies.py", line 522, in cookiejar_from_dict
cookiejar.set_cookie(create_cookie(name, cookie_dict[name]))
TypeError: string indices must be integers, not str
Progress: 1/1
Crawling 0 JavaScript files
--------------------------------------------------
Internal: 1
--------------------------------------------------
Total requests made: 1
Total time taken: 0 minutes 0 seconds
Requests per second: 1
Results saved in ... directory
No matter what arguments I pass, be it -l or -t, it still does the same thing.
Hmm, PyPI tells me that the latest release is 1.1.9. The changelog says it's 1.1.4 (I assume the changelog wasn't updated) and the latest release on GitHub is 1.1.5.
Also, it seems that the setup.py file is missing.
Usually I don't care where the package source comes from (the Fedora packaging guidelines don't make a statement about that), but it would be nice if it didn't change too often.
It's nothing to do with bugs, but a suggestion to make your work easier: kindly use randua.
ImportError: No module named mechanize
Hi,
Would you like a Docker image for your project? I can do the PR if so.
Regards,
Stephen
keys = set() # high entropy strings, prolly secret keys
line 130 in photon.py
I do not understand
Not really a program issue, there are just no clear instructions.
I made some changes to the Docker image:
- COPY only the needed files.
- COPY requirements.txt first, to avoid running pip install on each build if requirements.txt hasn't changed (layer caching).
- Set a WORKDIR.
It brings the size of the image down to 87.5 MB while being a lot faster to build.
I updated in consequence:
- .travis.yml: added Python 3.7, as it's the version used by the Docker image (python:3-alpine).
- readme.md: updated the Docker chapter.
Although I read in the contribution guidelines that non-code-related PRs will not be merged. So here's my issue!
You can check my changes at etienne-napoleone/Photon.
Line 361 in 73e9538
If a website uses protocol-relative URLs (//example.com/foo), Photon mistakenly detects internal links as external. To fix it, you can compare links based on netloc, for example:
change:
if link.startswith(main_url):
to:
if urlparse(link).netloc == urlparse(main_url).netloc:
Also consider that the same thing may happen in the case of a www sub-domain.
Hello,
If you want a free & opensource proxies network, you can add proxy support for Scrapoxy (http://scrapoxy.io/)
Keep me in touch if you're interested :)
Fabien.
What's the difference among:
-u example.com
-u www.example.com
-u http://example.com
-u https://example.com
-u http://www.example.com
-u https://www.example.com
It looks like their outputs are all saved under the same example.com/ directory, but the content will be different for each case?
You're ignoring SSL verification:
/usr/local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py:843: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
You can fix this by adding the following to the top of the file (or wherever you feel like putting it):
import warnings
warnings.filterwarnings("ignore")
Now, this is bad practice for obvious reasons; however, it makes sense why you're doing it. The above solution is the quickest and simplest way to solve the problem and keep the warnings from being annoying.
An example with and without:
If I supply a URL that is in, let's say, Russian and try to extract all the data:
____ __ __
/ __ \/ /_ ____ / /_____ ____
/ /_/ / __ \/ __ \/ __/ __ \/ __ \
/ ____/ / / / /_/ / /_/ /_/ / / / /
/_/ /_/ /_/\____/\__/\____/_/ /_/
URLs retrieved from robots.txt: 5
Level 1: 6 URLs
Progress: 6/6
Level 2: 35 URLs
Progress: 35/35
Level 3: 7 URLs
Progress: 7/7
Crawling 7 JavaScript files
Progress: 7/7
Traceback (most recent call last):
File "photon.py", line 429, in <module>
f.write(x + '\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u200b' in position 7: ordinal not in range(128)
The easiest solution would be to just ignore things like this and continue, with a warning to the user that it was ignored.
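Another option, as a sketch: open the output file with an explicit encoding via io.open, which sidesteps the implicit ASCII codec under Python 2 and behaves the same under Python 3 (the storage list here is illustrative data):

```python
import io

# Illustrative crawl results containing non-ASCII characters
# (U+2019 right single quote, U+200B zero-width space).
storage = [u'right\u2019s quote', u'zero\u200bwidth']

# io.open with an explicit encoding encodes every character, so these
# strings no longer raise UnicodeEncodeError on write.
with io.open('links.txt', 'w', encoding='utf-8') as f:
    for x in storage:
        f.write(x + u'\n')
```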