I am a cyber security researcher and programmer.
Do you want to be one too? Check out my advice for learning hacking and programming.
You can support my work with a few bucks, here.
Incredibly fast crawler designed for OSINT.
License: GNU General Public License v3.0
Collecting /ads.txt and HTTPS certificate info may be useful.
OS: Arch Linux
Python Version: 3.7
Python Script:
from photon import crawl
data = crawl("http://stackoverflow.com", timeout=5)
print(data)
Exception Thrown
Exception in thread Thread-38:
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "/usr/lib/python3.7/site-packages/urllib3/connectionpool.py", line 380, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.7/http/client.py", line 1321, in getresponse
response.begin()
File "/usr/lib/python3.7/http/client.py", line 296, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.7/http/client.py", line 257, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
File "/usr/lib/python3.7/ssl.py", line 1049, in recv_into
return self.read(nbytes, buffer)
File "/usr/lib/python3.7/ssl.py", line 908, in read
return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.7/site-packages/requests/adapters.py", line 445, in send
timeout=timeout
File "/usr/lib/python3.7/site-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/usr/lib/python3.7/site-packages/urllib3/util/retry.py", line 367, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/lib/python3.7/site-packages/urllib3/packages/six.py", line 686, in reraise
raise value
File "/usr/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/usr/lib/python3.7/site-packages/urllib3/connectionpool.py", line 386, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/usr/lib/python3.7/site-packages/urllib3/connectionpool.py", line 306, in _raise_timeout
raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='stackoverflow.com', port=443): Read timed out. (read timeout=5)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 865, in run
self._target(*self._args, **self._kwargs)
File "/home/$USER/.local/lib/python3.7/site-packages/photon/photon.py", line 184, in extractor
response = requester(url, delay, domain_name, user_agents, cookie, timeout) # make request to the url
File "/home/$USER/.local/lib/python3.7/site-packages/photon/photon.py", line 56, in requester
response = get(url, cookies=cookie, headers=headers, verify=False, timeout=timeout, stream=True)
File "/usr/lib/python3.7/site-packages/requests/api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "/usr/lib/python3.7/site-packages/requests/api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/lib/python3.7/site-packages/requests/sessions.py", line 512, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python3.7/site-packages/requests/sessions.py", line 622, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python3.7/site-packages/requests/adapters.py", line 526, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='stackoverflow.com', port=443): Read timed out. (read timeout=5)
Exception in thread Thread-43:
(identical traceback to Thread-38 above, ending in the same requests.exceptions.ReadTimeout)
^CTraceback (most recent call last):
File "a.py", line 3, in <module>
data = crawl("https://stackoverflow.com", timeout=5)
File "/home/$USER/.local/lib/python3.7/site-packages/photon/photon.py", line 296, in crawl
flash(extractor, links, threads, delay, domain_name, user_agents, cookie, timeout, regex, keys, only_urls, main_url)
File "/home/$USER/.local/lib/python3.7/site-packages/photon/photon.py", line 255, in flash
threader(function, delay, threads, domain_name, user_agents, cookie, timeout, regex, keys, only_urls, main_url, splitted)
File "/home/$USER/.local/lib/python3.7/site-packages/photon/photon.py", line 242, in threader
thread.join()
File "/usr/lib/python3.7/threading.py", line 1032, in join
self._wait_for_tstate_lock()
File "/usr/lib/python3.7/threading.py", line 1048, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
KeyboardInterrupt
^CException ignored in: <module 'threading' from '/usr/lib/python3.7/threading.py'>
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 1273, in _shutdown
t.join()
File "/usr/lib/python3.7/threading.py", line 1032, in join
self._wait_for_tstate_lock()
File "/usr/lib/python3.7/threading.py", line 1048, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
KeyboardInterrupt
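These per-thread timeouts could be contained inside the worker instead of escaping to stderr; a minimal stdlib sketch of that idea (safe_fetch is a hypothetical helper, not Photon's actual requester):

```python
import socket
import urllib.request
import urllib.error

def safe_fetch(url, timeout=5):
    """Fetch a URL but swallow timeouts and connection errors instead
    of letting the exception escape the worker thread (sketch only)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except (urllib.error.URLError, socket.timeout, ValueError):
        return None  # treat slow/unreachable pages as an empty result
```

With requests, the equivalent would be catching requests.exceptions.RequestException around the get() call in requester().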
https://github.com/s0md3v/Photon/blob/master/photon.py#L313 and
https://github.com/s0md3v/Photon/blob/master/photon.py#L478 do not work as expected...
$ python3 -c "print('a' or 'e' or 'i' or 'o' or 'u')" --> a
By or-ing these strings together we only ever get the first truthy value, which is 'a' above, or '.png' on line 313.
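A membership test does what the or-chain was presumably meant to do; for example (the url value here is just illustrative):

```python
# `or` returns its first truthy operand, so this is always '.png':
always_png = '.png' or '.jpg' or '.gif'

# A correct extension check compares the URL against each suffix:
url = 'https://example.com/logo.jpg'
is_static = url.lower().endswith(('.png', '.jpg', '.gif'))
print(always_png, is_static)
```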
def threader(function, *urls):
    threads = []  # list of threads
    urls = urls[0]  # because urls is a tuple
    for url in urls:  # iterating over urls
        task = threading.Thread(target=function, args=(url,))
        threads.append(task)
    # start threads
    for thread in threads:
        thread.start()
    # wait for all threads to complete their work
    for thread in threads:
        thread.join()
    # delete threads
    del threads[:]
So I was just crawling my own website and I noticed a high number of false positives in the results saved in the links.txt file.
Contents of links.txt
from running
$ python photon.py -u https://site.com -t 4
https://site.com/Facebook_Forum
https://site.com/about us
https://site.com/EmailAccount
https://site.com/YouTube channel
https://site.com/Facebook Forum
https://site.com
https://site.com/prev
https://site.com/GitHub
https://site.com/Youtube Channel
https://site.com/error_footer
https://site.com/
https://site.com/YouTube_Channel
https://site.com/Facebook_page
https://site.com/cdn-cgi/l/email-protection
https://site.com/next
https://site.com/Projects
https://site.com/facebook page
https://site.com/index.html
https://site.com/shit
https://site.com/contact
https://site.com/YouTube_channel
About 95% of these were non-existent URLs.
Is there any way to fix this, or an explanation of why it's happening?
This is the issue if you call the script from /home, for instance:
The two highlighted lines are the output of print(os.getcwd()) and print(sys.path[0]) respectively. As you can see, the second line is what will make the script work if you call it from anywhere.
Fix: on line 182: https://github.com/s0md3v/Photon/blob/master/photon.py#L182
Change:
with open(os.getcwd() +
to:
with open(sys.path[0] +
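A quick way to see the difference (run the script from a directory other than the one containing it; the paths in the comments are illustrative):

```python
import os
import sys

# os.getcwd() is wherever the user happened to be when invoking the
# script; sys.path[0] is the directory containing the script itself,
# so files opened relative to it resolve no matter where you call from.
print(os.getcwd())   # e.g. /home
print(sys.path[0])   # e.g. /opt/Photon
```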
Questions?
On the wiki page https://github.com/s0md3v/Photon/wiki/Compatibility-&-Dependencies, the tld module should be added under Dependencies.
py2 photon.py -u http://xxxx/login.jsp --wayback
____ __ __
/ __ \/ /_ ____ / /_____ ____
/ /_/ / __ \/ __ \/ __/ __ \/ __ \
/ ____/ / / / /_/ / /_/ /_/ / / / /
/_/ /_/ /_/\____/\__/\____/_/ /_/ v1.1.5
[~] Fetching URLs from archive.org
[+] Retrieved -1 URLs from archive.org
[~] Level 1: 1 URLs
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "photon.py", line 435, in extractor
response = requester(url) # make request to the url
File "photon.py", line 285, in requester
return normal(url)
File "photon.py", line 239, in normal
response = get(url, cookies=cook, headers=finalHeaders, verify=False, timeout=timeout, stream=True)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 65, in get
return request('get', url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/safe_mode.py", line 39, in wrapped
return function(method, url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 51, in request
return session.request(method=method, url=url, **kwargs)
TypeError: request() got an unexpected keyword argument 'stream'
Hi,
First, good job !
Is it possible to build a new txt file with all the linking information?
Something like this:
Source \t Target \t Type \t Nofollow
http://ndd.tld/A \t http://ndd.tld/B \t AHREF \t TRUE
http://ndd.tld/A \t http://ndd.tld/B.jpg \t AHREF \t TRUE
The purpose is to have an idea of how pages are related to each other within the website.
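Writing such an edge list would be a small addition; a sketch with the csv module, using hypothetical edge records collected during the crawl:

```python
import csv

# Hypothetical link records: (source, target, type, nofollow)
edges = [
    ('http://ndd.tld/A', 'http://ndd.tld/B', 'AHREF', True),
    ('http://ndd.tld/A', 'http://ndd.tld/B.jpg', 'AHREF', True),
]

with open('links_graph.txt', 'w', newline='') as f:
    out = csv.writer(f, delimiter='\t')
    out.writerow(('Source', 'Target', 'Type', 'Nofollow'))
    for source, target, link_type, nofollow in edges:
        out.writerow((source, target, link_type, str(nofollow).upper()))
```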
As already mentioned in #98, the setup.py file is missing from the repo.
If there is one around, then I would like to suggest adding install_requires and entry_points.
With entry_points, one could launch Photon with photon without caring about the interpreter.
It looks like it would require some changes in photon.py for the access of at least user-agents.txt:

else:
    here = os.path.abspath(os.path.dirname(__file__))
    with open('{}/core/user-agents.txt'.format(here), 'r') as uas:
        user_agents = [agent.strip('\n') for agent in uas]
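For reference, a hypothetical setup.py along those lines might look like this (it assumes photon.py exposes a main() function, which would itself require a small change):

```python
from setuptools import setup

setup(
    name='photon',
    version='1.1.5',
    py_modules=['photon'],
    install_requires=['requests', 'tld'],
    entry_points={
        'console_scripts': [
            'photon = photon:main',  # lets users run Photon as plain `photon`
        ],
    },
)
```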
Would it be possible to include a non colored mode?
I know it's such a strange edge case, but Pythonista on iOS doesn’t support colors in the output (yet) and it just displays this instead of the colored Photon text
�[91m ____ __ __ / �[1;97m__�[91m \/ /_ ____ / /_____ ____ / �[1;97m/_/�[91m / __ \/ �[1;97m__�[91m \/ __/ �[1;97m__�[91m \/ __ \ / ____/ / / / �[1;97m/_/�[91m / /_/ �[1;97m/_/�[91m / / / / /_/ /_/ /_/\____/\__/\____/_/ /_/ �[1;m
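A non-colored mode could be as small as stripping the ANSI escape sequences before printing; a sketch:

```python
import re

# Matches ANSI color escape sequences such as \x1b[91m and \x1b[1;97m.
ANSI_RE = re.compile(r'\x1b\[[0-9;]*m')

def strip_colors(text):
    """Return text with all ANSI color codes removed."""
    return ANSI_RE.sub('', text)

print(strip_colors('\x1b[91mPhoton\x1b[0m'))  # prints: Photon
```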
The approach of splitting the list is a good method, as explained in your README, but wouldn't a Queue accomplish the same thing without all the list-splitting? You'd still only access unique items across workers.
Just wondering if there's a reason you didn't do this; I was going to put in a PR for exactly this, but won't bother if you're avoiding it for some reason.
Thanks!
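For comparison, a queue.Queue version of the worker pattern might look like this (crawl_with_queue and handle are hypothetical names, not Photon's API):

```python
import queue
import threading

def crawl_with_queue(urls, worker_count=4, handle=print):
    """Each worker pulls from a shared queue, so every URL is processed
    exactly once without pre-splitting the list (sketch only)."""
    q = queue.Queue()
    for url in urls:
        q.put(url)

    def worker():
        while True:
            try:
                url = q.get_nowait()
            except queue.Empty:
                return  # queue drained: worker exits
            try:
                handle(url)
            finally:
                q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```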
Traceback (most recent call last):
File "photon.py", line 382, in
f.write(x + '\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 137: ordinal not in range(128)
Fix:
change f.write(x + '\n')
to f.write(x.encode('utf-8') + '\n')
I used python3 photon.py --url http://x.x.x.x --level 1 --only-url
and I got a list of 103 internal URLs.
All the URLs use the following pattern: http://x.x.x.x/?r=[redirection_token].
Having this list alone is pretty useless; what is interesting is getting the redirection value (for example, contained in the Location header after an HTTP 302 or 303 code).
There should be an option to store the redirection value instead of the raw URL when a redirect HTTP code is hit.
This could be implemented with something like the following (sketch; note that seeing the Location header requires making the request with allow_redirects=False, and store() is a placeholder):

def check_http_code_status(response):
    if response.status_code == 200:
        store(response.url)
    elif response.status_code in (301, 302, 303):
        store(response.headers['Location'])
    elif response.status_code == 404:
        pass  # do nothing
Hi
After using the option --update I get this error:
Traceback (most recent call last):
File "photon.py", line 187, in
domain = topLevel(main_url)
File "photon.py", line 182, in topLevel
toplevel = tld.get_fld(host, fix_protocol=True)
AttributeError: 'module' object has no attribute 'get_fld'
Any clue?
I installed it on another system as well and it's still the same.
Changing the target: same.
Deleting everything and cloning again: same.
OS: Kali3
Thanks
But I want to know: how can I download just the pictures from a website?
Right now, if the user specifies -o /root, won't the program recursively delete /root because it exists (line 640)?
That is why in my pull I changed the w+ to w and only modified the files you wrote in the first place.
I suggest removing the part about recursively removing the directory structure and instead just overwriting your files if the tool is run twice.
Can you please make an API for it so that we can use it in our projects?
I would like to limit results to one domain only or to match some url regex.
The output text files currently don't have a newline at the end, so it messes up the terminal if you cat them, or if I try to cat two files together the first line of the second file is on the same line as the last line of the first file. Simple fix:
Add "f.write('\n')" after these two lines:
https://github.com/s0md3v/Photon/blob/master/photon.py#L494
https://github.com/s0md3v/Photon/blob/master/photon.py#L490
def writer(datasets, dataset_names, output_dir):
    for dataset, dataset_name in zip(datasets, dataset_names):
        if dataset:
            filepath = output_dir + '/' + dataset_name + '.txt'
            if python3:
                with open(filepath, 'w+', encoding='utf8') as f:
                    f.write(str('\n'.join(dataset)))
                    f.write('\n')
            else:
                with open(filepath, 'w+') as f:
                    joined = '\n'.join(dataset)
                    f.write(str(joined.encode('utf-8')))
                    f.write('\n')
thanks
Dear Team,
Tool is awesome, but when I tried running your command like:
Python photon.py http://site.com --delay=1.5
I got result like this one:
____ __ __
/ __ \/ /_ ____ / /_____ ____
/ /_/ / __ \/ __ \/ __/ __ \/ __ \
/ ____/ / / / /_/ / /_/ /_/ / / / /
/_/ /_/ /_/\____/\__/\____/_/ /_/
[!] Links to crawl: 108
[!] Time required: ~5 minutes
[+] Total URLs found: 780
[+] Fuzzable URLs found: 149
[+] JavaScript files found: 126
[~] Scanning JavaScript files for endpoints
[!] Time required: ~1 minute
[+] Enpoints found: 346
mkdir: missing operand
Try 'mkdir --help' for more information.
[+] Results saved in directory
I don't know where the results are being saved.
Can you please update the README file?
Thanks.
That's a small one I've spotted browsing your code...
Considering the following block of code:
with open('%s/links.txt' % name, 'w+') as f:
    for x in storage:
        f.write(x + '\n')
f.close()
the f.close()
instruction is redundant, because if you're using the with
statement, the file is already closed, as shown in the Python documentation: https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files
>>> with open('workfile') as f:
... read_data = f.read()
>>> f.closed
True
Okay, so I was going through the Photon library and I found some areas of improvement:
1. A --plugin <name of plugin> / --plugin=<name of plugin> option is a good idea. Why? Like, if there are 100 plugins, will you serve 100 arguments, one for each plugin? Sick idea, isn't it?
2. Make exporter a part of the main build, not a plugin. Plugins should be strictly restricted to extending features and enrichment of capabilities only. Reconnaissance is way too broad for Photon so far as the available plugins are concerned.
3. A -v/--verbose flag for Photon. Sometimes just that small output of what it has crawled doesn't satisfy. I have done it!
An option to supply a dictionary of strings to brute force at each level of the directory tree would be great.
It's not uncommon to see juicy dirs that are not linked by the rest of the app.
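A sketch of what such an option could do at each level (brute_dirs is a hypothetical helper; a real implementation would reuse Photon's requester and threading):

```python
import socket
import urllib.request
import urllib.error

def brute_dirs(base_url, wordlist, timeout=5):
    """Probe each candidate path under base_url and collect the ones
    that answer with something other than 404 (sketch only)."""
    found = []
    for word in wordlist:
        url = base_url.rstrip('/') + '/' + word
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                found.append((url, resp.status))
        except urllib.error.HTTPError as err:
            if err.code != 404:
                found.append((url, err.code))  # e.g. 403: exists but forbidden
        except (urllib.error.URLError, socket.timeout, ValueError):
            continue  # unreachable host: skip
    return found
```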
Hey great tool!
I was wondering if you could add an export-to-CSV option?
Hello there, nice tool! Here's something I found.
If I enter:
python3 photon.py -u "https://aaa.example.com"
python3 photon.py -u "https://bbb.example.com"
python3 photon.py -u "https://ccc.example.com"
The results folder is saved with the name "example", but this folder is overwritten with every subdomain's results. I recommend adding the full subdomain and domain name to the folder name so the result folders look like:
aaa.example.com
bbb.example.com
ccc.example.com
This will help when testing multiple subdomains.
PS: Adding an option to load multiple domains and subdomains, like -u "https://aaa.example.com,https://bbb.example.com,https://ccc.example.com", would be nice.
Traceback (most recent call last):
File "photon.py", line 14, in
from requests import get, post
ModuleNotFoundError: No module named 'requests'
The list of user agents can be easily edited in the source code but a command line option to specify one specific user agent string would be nice.
I use Photon on my own server and want to filter Photon's requests out of my log files.
Save the amazonaws buckets found during crawling into a file.
Traceback (most recent call last):
File "C:\Python37\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
chunked=chunked)
File "C:\Python37\lib\site-packages\urllib3\connectionpool.py", line 343, in _make_request
self._validate_conn(conn)
File "C:\Python37\lib\site-packages\urllib3\connectionpool.py", line 849, in _validate_conn
conn.connect()
File "C:\Python37\lib\site-packages\urllib3\connection.py", line 356, in connect
ssl_context=context)
File "C:\Python37\lib\site-packages\urllib3\util\ssl_.py", line 359, in ssl_wrap_socket
return context.wrap_socket(sock, server_hostname=server_hostname)
File "C:\Python37\lib\ssl.py", line 412, in wrap_socket
session=session
File "C:\Python37\lib\ssl.py", line 850, in _create
self.do_handshake()
File "C:\Python37\lib\ssl.py", line 1108, in do_handshake
self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1045)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python37\lib\site-packages\requests\adapters.py", line 445, in send
timeout=timeout
File "C:\Python37\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "C:\Python37\lib\site-packages\urllib3\util\retry.py", line 398, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.xxxx.com', port=443): Max retries exceeded with url: /robots.txt (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1045)')))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "photon.py", line 413, in
zap(main_url)
File "photon.py", line 235, in zap
response = get(url + '/robots.txt').text # makes request to robots.txt
File "C:\Python37\lib\site-packages\requests\api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "C:\Python37\lib\site-packages\requests\api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Python37\lib\site-packages\requests\sessions.py", line 512, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python37\lib\site-packages\requests\sessions.py", line 644, in send
history = [resp for resp in gen] if allow_redirects else []
File "C:\Python37\lib\site-packages\requests\sessions.py", line 644, in
history = [resp for resp in gen] if allow_redirects else []
File "C:\Python37\lib\site-packages\requests\sessions.py", line 222, in resolve_redirects
**adapter_kwargs
File "C:\Python37\lib\site-packages\requests\sessions.py", line 622, in send
r = adapter.send(request, **kwargs)
File "C:\Python37\lib\site-packages\requests\adapters.py", line 511, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='www.xxxx.com', port=443): Max retries exceeded with url: /robots.txt (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1045)')))
Line 169 errors out if you run Photon against an IP. The easiest fix might be to just add a try/except, but there is probably a more elegant solution.
I'm pretty sure this was working before.
root@kali:/opt/Photon# python /opt/Photon/photon.py -u http://192.168.0.213:80
____ __ __
/ __ \/ /_ ____ / /_____ ____
/ /_/ / __ \/ __ \/ __/ __ \/ __ \
/ ____/ / / / /_/ / /_/ /_/ / / / /
/_/ /_/ /_/\____/\__/\____/_/ /_/ v1.1.1
Traceback (most recent call last):
File "/opt/Photon/photon.py", line 169, in <module>
domain = get_fld(host, fix_protocol=True) # Extracts top level domain out of the host
File "/usr/local/lib/python2.7/dist-packages/tld/utils.py", line 387, in get_fld
search_private=search_private
File "/usr/local/lib/python2.7/dist-packages/tld/utils.py", line 339, in process_url
raise TldDomainNotFound(domain_name=domain_name)
tld.exceptions.TldDomainNotFound: Domain 192.168.0.213 didn't match any existing TLD name!
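One way to guard that call, as a sketch: detect bare-IP hosts with the stdlib ipaddress module and skip TLD extraction for them (is_ip_host is an illustrative name, not Photon's code):

```python
import ipaddress
from urllib.parse import urlparse

def is_ip_host(url):
    """True when the URL's host is a bare IP address, in which case
    get_fld() would raise TldDomainNotFound and should be skipped."""
    host = urlparse(url).hostname or ''
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

print(is_ip_host('http://192.168.0.213:80'))    # True
print(is_ip_host('https://stackoverflow.com'))  # False
```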
argv: python photon.py -u https://.... -c PHPSESSID=.... -t 150 -l 10
Traceback:
Traceback (most recent call last):
File "/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "photon.py", line 408, in extractor
response = requester(url) # make request to the url
File "photon.py", line 258, in requester
return normal(url)
File "photon.py", line 212, in normal
response = get(url, cookies=cook, headers=headers, verify=False, timeout=timeout, stream=True)
File "/Users/admin/bin/tools/Photon/venv3/lib/python2.7/site-packages/requests/api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "/Users/admin/bin/tools/Photon/venv3/lib/python2.7/site-packages/requests/api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "/Users/admin/bin/tools/Photon/venv3/lib/python2.7/site-packages/requests/sessions.py", line 498, in request
prep = self.prepare_request(req)
File "/Users/admin/bin/tools/Photon/venv3/lib/python2.7/site-packages/requests/sessions.py", line 419, in prepare_request
cookies = cookiejar_from_dict(cookies)
File "/Users/admin/bin/tools/Photon/venv3/lib/python2.7/site-packages/requests/cookies.py", line 522, in cookiejar_from_dict
cookiejar.set_cookie(create_cookie(name, cookie_dict[name]))
TypeError: string indices must be integers, not str
Full output:
____ __ __
/ __ \/ /_ ____ / /_____ ____
/ /_/ / __ \/ __ \/ __/ __ \/ __ \
/ ____/ / / / /_/ / /_/ /_/ / / / /
/_/ /_/ /_/\____/\__/\____/_/ /_/ v1.1.4
Level 1: 1 URLs
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "photon.py", line 408, in extractor
response = requester(url) # make request to the url
File "photon.py", line 258, in requester
return normal(url)
File "photon.py", line 212, in normal
response = get(url, cookies=cook, headers=headers, verify=False, timeout=timeout, stream=True)
File "/Users/admin/bin/tools/Photon/venv3/lib/python2.7/site-packages/requests/api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "/Users/admin/bin/tools/Photon/venv3/lib/python2.7/site-packages/requests/api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "/Users/admin/bin/tools/Photon/venv3/lib/python2.7/site-packages/requests/sessions.py", line 498, in request
prep = self.prepare_request(req)
File "/Users/admin/bin/tools/Photon/venv3/lib/python2.7/site-packages/requests/sessions.py", line 419, in prepare_request
cookies = cookiejar_from_dict(cookies)
File "/Users/admin/bin/tools/Photon/venv3/lib/python2.7/site-packages/requests/cookies.py", line 522, in cookiejar_from_dict
cookiejar.set_cookie(create_cookie(name, cookie_dict[name]))
TypeError: string indices must be integers, not str
Progress: 1/1
Crawling 0 JavaScript files
--------------------------------------------------
Internal: 1
--------------------------------------------------
Total requests made: 1
Total time taken: 0 minutes 0 seconds
Requests per second: 1
Results saved in ... directory
No matter what arguments I pass, be it -l or -t, it still does the same thing.
Hmm, PyPI tells me that the latest release is 1.1.9. The changelog says it's 1.1.4 (I assume the changelog wasn't updated) and the latest release on GitHub is 1.1.5.
Also, it seems that the setup.py file is missing.
Usually I don't care where the package source comes from (the Fedora packaging guidelines don't make a statement about that), but it would be nice if it didn't change too often.
It's nothing to do with bugs, but a suggestion to make your work easier: kindly use randua.
ImportError: No module named mechanize
Hi,
Would you like a Docker image for your project? I can do the PR if so.
Regards,
Stephen
keys = set() # high entropy strings, prolly secret keys
line 130 in photon.py
I do not understand
Not really a program issue, there are just no clear instructions.
I made some changes to the Docker image:
- COPY only the needed files.
- COPY requirements.txt first, to avoid running pip install on each build if requirements.txt hasn't changed (layer caching).
- Set a WORKDIR.
It brings the size of the image down to 87.5 MB while being a lot faster to build.
I updated in consequence:
- .travis.yml: added Python 3.7, as it's the version used by the Docker image (python:3-alpine).
- readme.md: updated the Docker chapter.
Although I read in the contribution guidelines that non-code-related PRs will not be merged. So here's my issue!
You can check my changes at etienne-napoleone/Photon.
Line 361 in 73e9538
If a website uses protocol-relative URLs (//example.com/foo), Photon mistakenly detects internal links as external. To fix it, you can compare links based on netloc, for example:
change:
if link.startswith(main_url):
to:
if urlparse(link).netloc == urlparse(main_url).netloc:
Also consider that the same thing may happen in the case of a www sub-domain.
Hello,
If you want a free & opensource proxies network, you can add proxy support for Scrapoxy (http://scrapoxy.io/)
Keep me in touch if you're interested :)
Fabien.
What's the difference among:
-u example.com
-u www.example.com
-u http://example.com
-u https://example.com
-u http://www.example.com
-u https://www.example.com
It looks like their outputs are all saved under the same example.com/ directory, but the content will be different for each case?
You're ignoring SSL verification:
/usr/local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py:843: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
You can fix this by adding the following to the top of the file (or wherever you feel like putting it):
import warnings
warnings.filterwarnings("ignore")
Now, this is bad practice for obvious reasons; however, it makes sense why you're doing it. The above solution is the quickest and simplest way to solve the problem and keep the warnings from being annoying.
An example with and without:
If I supply a URL that is in, let's say, Russian and try to extract all the data:
____ __ __
/ __ \/ /_ ____ / /_____ ____
/ /_/ / __ \/ __ \/ __/ __ \/ __ \
/ ____/ / / / /_/ / /_/ /_/ / / / /
/_/ /_/ /_/\____/\__/\____/_/ /_/
URLs retrieved from robots.txt: 5
Level 1: 6 URLs
Progress: 6/6
Level 2: 35 URLs
Progress: 35/35
Level 3: 7 URLs
Progress: 7/7
Crawling 7 JavaScript files
Progress: 7/7
Traceback (most recent call last):
File "photon.py", line 429, in <module>
f.write(x + '\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u200b' in position 7: ordinal not in range(128)
The easiest solution would be to just ignore things like this and continue, with a warning to the user that it was ignored.
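Another option, as a sketch: open the output file with an explicit encoding via io.open, which sidesteps the implicit ASCII codec under Python 2 and behaves the same under Python 3 (the storage list here is illustrative data):

```python
import io

# Illustrative crawl results containing non-ASCII characters
# (U+2019 right single quote, U+200B zero-width space).
storage = [u'right\u2019s quote', u'zero\u200bwidth']

# io.open with an explicit encoding encodes every character, so these
# strings no longer raise UnicodeEncodeError on write.
with io.open('links.txt', 'w', encoding='utf-8') as f:
    for x in storage:
        f.write(x + u'\n')
```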