
jake-bickle / arachnid


An OSINT tool to find data leaks on a targeted website

License: GNU General Public License v3.0

Python 9.92% HTML 0.32% CSS 59.57% JavaScript 27.63% Hack 0.53% PHP 2.03%

arachnid's People

Contributors

jake-bickle, tobinshields


arachnid's Issues

Unable to install Arachnid

$ pip install arachnid-spider
Collecting arachnid-spider
Could not find a version that satisfies the requirement arachnid-spider (from versions: )
No matching distribution found for arachnid-spider

Fresh installs of PHP 7.3 and Apache 2 were made shortly before attempting to install Arachnid via pip; I'm not sure which requirement isn't being met.

Requests do not timeout

Each request that is made is missing a timeout parameter, which can leave the crawler hanging indefinitely.

In addition, a warning must be issued to the user any time a request times out.
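
A hedged sketch of the idea, assuming plain requests calls; the wrapper name and the 30-second default are illustrative choices, not values taken from Arachnid's code:

import warnings

import requests

def fetch(url, timeout=30):
    """GET a URL with a timeout so the crawler can never hang indefinitely."""
    try:
        # `timeout` bounds both the connect and read phases of the request.
        return requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Surface the problem to the user instead of stalling silently.
        warnings.warn(f"Request to {url} timed out after {timeout} seconds")
        return None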

Need documentation

We need to find a place to host user documentation and write it for each option that Arachnid provides.

Also, perhaps an advanced section on using the JSON output to create your own output application.
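
As a rough illustration of that advanced idea, a custom output application could be as small as the sketch below; the file name and the "pages"/"url" keys are hypothetical, since the real schema is exactly what the documentation would need to define:

import json

# Hypothetical file name and schema; the real layout is what the docs would specify.
with open("arachnid_output.json") as f:
    results = json.load(f)

# Example of a custom output application: print every crawled URL.
for page in results.get("pages", []):
    print(page.get("url"))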

Create unit tests

Creating unit tests is extremely important in ensuring Arachnid is stable. Stability matters for a long-lasting crawl, as a crash could lead to lost data and a lot of wasted time (though issues #29 and #27 aim to mitigate this). Unfortunately, tests have fallen by the wayside in this project. Nearly all of the modules have no unit tests, and the existing tests are broken.
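
As a starting point, a pytest layout could look something like the hypothetical sketch below; the Scheduler class name, its no-argument constructor, and the expected behaviour are assumptions, not the current API:

# tests/test_scheduler.py -- hypothetical example; adjust to the real Scheduler API
from arachnid.crawler.scheduler import Scheduler  # assumed import path and class name


def test_next_url_on_empty_schedule():
    # Assumption: an empty schedule yields no URL rather than raising.
    schedule = Scheduler()
    assert schedule.next_url() is None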

Crawler gets stuck on pages that generate more HTML pages

EDIT: For those experiencing this problem, read about ways to fix it here.

Arachnid has no mechanism in place to stop crawling pages that continuously generate more pages. For example, a web calendar may have a "next month" button that one could theoretically keep navigating to, moving forward hundreds of years.

This is a common bot trap. The issue is that the crawler is stuck continuously navigating to new pages and gathering useless data. It also takes a long or indefinite amount of time for the scheduler to provide a new, unique URL that navigates the crawler away from the page.

Methods attempted so far:

The AOPIC algorithm

The algorithm

The official paper is located here.
Explained in layman's terms here.

The issue

This algorithm is designed for incremental crawlers, not for snapshot crawlers. However, it naturally pushes the evil pages towards the end of the crawl. If we could think of a way to modify this algorithm to detect when the crawler is stuck on an evil page, this would be our solution.

I pushed the branch "AOPIC" with my implementation of the algorithm. The file of interest is here. Feel free to fork it and mess around with the algorithm.
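
For reference, a minimal sketch of the credit accounting that AOPIC describes; it omits the virtual page and taxation step from the paper, and it is not the code on the AOPIC branch:

from collections import defaultdict

cash = defaultdict(float)     # credit each URL currently holds
history = defaultdict(float)  # total credit each URL has accumulated over time

def seed(urls):
    """Give every seed URL an equal share of the initial credit."""
    for url in urls:
        cash[url] = 1.0 / len(urls)

def crawl_step(fetch_links):
    """Crawl the richest URL and split its credit among its out-links."""
    url = max(cash, key=cash.get)
    credit, cash[url] = cash[url], 0.0
    history[url] += credit
    links = fetch_links(url)          # caller-supplied: returns the page's out-links
    if links:
        share = credit / len(links)
        for link in links:
            cash[link] += share
    return url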

The URLDiffFilter

The algorithm

The DiffFilter checks an arbitrary number X of previously crawled URL strings for a "significant difference." A significant difference is when 3 or more characters differ between two strings.
If the current URL does not have a significant difference from any of the previous X URLs, it will not be added to the schedule.
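
An illustrative sketch of that filtering rule; the character-by-character comparison shown here is an assumption about how "differ" is measured, not the implementation in the repository:

def is_significant_difference(url_a, url_b, threshold=3):
    """Return True if the two URLs differ in `threshold` or more characters."""
    # Characters beyond the end of the shorter URL count as differences.
    diffs = abs(len(url_a) - len(url_b))
    diffs += sum(1 for a, b in zip(url_a, url_b) if a != b)
    return diffs >= threshold

def should_schedule(current_url, recent_urls):
    """Schedule the URL only if it differs significantly from at least one recent URL."""
    if not recent_urls:
        return True
    return any(is_significant_difference(current_url, seen) for seen in recent_urls)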

The issue

This simple algorithm breaks quite easily when an evil page generates multiple evil pages whose URLs do have a significant difference. There is also some potential for false positives.

php.finder.prompt_for_php_path() does not sanitize user input

This function prompts the user for a path to a php executable. The input is not sanitized, which allows malformed input to cause errors.

The function must be safe (it must not crash or raise unhandled errors) when given any of the following (see the sketch after this list):

  • User inputting a path in an area of the computer where access is denied
  • User inputting nothing
  • User inputting a directory
  • User inputting a path that doesn't exist
  • User inputting an executable that is not php
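
A hedged sketch of validation that could cover those cases; the function name and the --version probe are illustrative, not the existing prompt_for_php_path() code:

import os
import subprocess

def looks_like_php_executable(path):
    """Return True only if `path` points to a runnable php binary."""
    if not path:
        return False                                  # user entered nothing
    if not os.path.exists(path):
        return False                                  # path doesn't exist
    if os.path.isdir(path):
        return False                                  # user gave a directory
    if not os.access(path, os.X_OK):
        return False                                  # permission denied / not executable
    try:
        result = subprocess.run([path, "--version"], capture_output=True, timeout=5)
    except (OSError, subprocess.TimeoutExpired):
        return False                                  # cannot be executed at all
    return result.returncode == 0 and b"PHP" in result.stdout   # reject non-php executables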

Issue Warning for SSL Errors due to Dated Cipher Suites

Attempting to execute the following

import requests
requests.get("https://mail.calcharter.org")

Results in the following error

Traceback (most recent call last):
  File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 472, in wrap_socket
    cnx.do_handshake()
  File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1915, in do_handshake
    self._raise_ssl_error(self._ssl, result)
  File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1640, in _raise_ssl_error
    raise SysCallError(-1, "Unexpected EOF")
OpenSSL.SSL.SysCallError: (-1, 'Unexpected EOF')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/connectionpool.py", line 603, in urlopen
    chunked=chunked)
  File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/connectionpool.py", line 344, in _make_request
    self._validate_conn(conn)
  File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/connectionpool.py", line 843, in _validate_conn
    conn.connect()
  File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/connection.py", line 370, in connect
    ssl_context=context)
  File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/util/ssl_.py", line 355, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 478, in wrap_socket
    raise ssl.SSLError('bad handshake: %r' % e)
ssl.SSLError: ("bad handshake: SysCallError(-1, 'Unexpected EOF')",)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/connectionpool.py", line 641, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/util/retry.py", line 399, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='mail.calcharter.org', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')")))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/jacobbickle/PycharmProjects/Arachnid/arachnid/__main__.py", line 3, in <module>
    main()
  File "/Users/jacobbickle/PycharmProjects/Arachnid/arachnid/arachnid.py", line 191, in main
    crawl()
  File "/Users/jacobbickle/PycharmProjects/Arachnid/arachnid/arachnid.py", line 175, in crawl
    while c.crawl_next():
  File "/Users/jacobbickle/PycharmProjects/Arachnid/arachnid/crawler/crawler.py", line 66, in crawl_next
    c_url = self.schedule.next_url()
  File "/Users/jacobbickle/PycharmProjects/Arachnid/arachnid/crawler/scheduler.py", line 85, in next_url
    self._fuzz_for_domainblocks()
  File "/Users/jacobbickle/PycharmProjects/Arachnid/arachnid/crawler/scheduler.py", line 113, in _fuzz_for_domainblocks
    r = requests.head(sub_to_check.get_url(), headers=self.headers)
  File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/requests/api.py", line 101, in head
    return request('head', url, **kwargs)
  File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/requests/adapters.py", line 514, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='mail.calcharter.org', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')")))

After some research, I found a similar situation here. After trying some possible solutions, it appears the failure is due to cipher suites that the requests library no longer supports.

It's possible to add these cipher suites manually, but there are likely too many edge cases around the internet to keep track of. I decided it'd be best to only use the cipher suites that requests supports. Because of this decision, Arachnid can't access such websites and must notify the user with a warning.
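
A minimal sketch of that behaviour, wrapping the requests.head call seen in the traceback above; the wrapper name and warning wording are illustrative:

import warnings

import requests

def safe_head(url, headers=None, timeout=30):
    """HEAD a URL, downgrading SSL handshake failures to a warning."""
    try:
        return requests.head(url, headers=headers, timeout=timeout)
    except requests.exceptions.SSLError:
        warnings.warn(f"Skipping {url}: SSL handshake failed; the server may only "
                      "offer cipher suites that requests no longer supports")
        return None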

Need a visualized output

Arachnid needs a helpful and visually pleasing output. It will utilize the JSON file supplied by the crawler.

Potential dependency conflicts between arachnid-spider and chardet

Hi, as shown in the following full dependency graph of arachnid-spider, arachnid-spider requires chardet *, while the installed version of requests (2.22.0) requires chardet <3.1.0,>=3.0.2.

According to Pip's “first found wins” installation strategy, chardet 3.0.4 is the actually installed version.

Although the first found package version chardet 3.0.4 just satisfies the later dependency constraint (chardet <3.1.0,>=3.0.2), it will lead to a build failure once developers release a newer version of chardet, since the unpinned chardet * requirement would then resolve to a version outside requests' allowed range.
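
One way to head off the conflict would be for arachnid-spider to mirror requests' constraint instead of leaving chardet unpinned; the excerpt below is a suggestion, not the project's current setup.py:

# setup.py (excerpt) -- suggested pin that stays inside requests' declared range
install_requires = [
    "requests>=2.1.0",
    "chardet>=3.0.2,<3.1.0",   # match requests' constraint so pip cannot pick a conflicting version
    # ...remaining dependencies unchanged...
]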

Dependency tree:

arachnid-spider - 0.9.4
| +- beautifulsoup4(install version:4.8.1 version range:*)
| | +- soupsieve(install version:1.9.5 version range:>=1.2)
| +- chardet(install version:3.0.4 version range:*)
| +- ndg-httpsclient(install version:0.5.1 version range:*)
| | +- pyasn1(install version:0.4.8 version range:>=0.1.1)
| | +- pyopenssl(install version:19.1.0 version range:*)
| | | +- cryptography(install version:2.8 version range:>=2.8)
| | | +- six(install version:1.13.0 version range:>=1.5.2)
| +- pyasn1(install version:0.4.8 version range:*)
| +- pyopenssl(install version:19.1.0 version range:*)
| | +- cryptography(install version:2.8 version range:>=2.8)
| | +- six(install version:1.13.0 version range:>=1.5.2)
| +- requests(install version:2.22.0 version range:*)
| | +- certifi(install version:2019.9.11 version range:>=2017.4.17)
| | +- chardet(install version:3.0.4 version range:<3.1.0,>=3.0.2)
| | +- idna(install version:2.8 version range:>=2.5,<2.9)
| | +- urllib3(install version:1.25.7 version range:<1.26,>=1.21.1)
| +- tldextract(install version:2.2.2 version range:*)
| | +- idna(install version:2.8 version range:*)
| | +- requests(install version:2.22.0 version range:>=2.1.0)
| | | +- certifi(install version:2019.9.11 version range:>=2017.4.17)
| | | +- chardet(install version:3.0.4 version range:<3.1.0,>=3.0.2)
| | | +- idna(install version:2.8 version range:>=2.5,<2.9)
| | | +- urllib3(install version:1.25.7 version range:<1.26,>=1.21.1)
| | +- requests-file(install version:1.4.3 version range:>=1.4)
| | | +- requests(install version:2.22.0 version range:>=1.0.0)
| | | | +- certifi(install version:2019.9.11 version range:>=2017.4.17)
| | | | +- chardet(install version:3.0.4 version range:<3.1.0,>=3.0.2)
| | | | +- idna(install version:2.8 version range:>=2.5,<2.9)
| | | | +- urllib3(install version:1.25.7 version range:<1.26,>=1.21.1)
| | | +- six(install version:1.13.0 version range:*)
| | +- setuptools(install version:42.0.1 version range:*)

Thanks for your attention.
Best,
Neolith

"on_fuzzed" data may be inaccurate when user does not use '-S' or '-F' options

Right now, the "on_fuzzed" boolean value for each page is automatically false if -S or -F is not supplied even though the page could vary well be on a fuzz list.
Arachnid ought to check the fuzz data to ensure that a given page exists on the fuzz list despite whether the user wants to fuzz for pages and subdomains or not.
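
A hedged sketch of the proposed behaviour; the function and the fuzz-list parameters are hypothetical names, not Arachnid's existing API:

def compute_on_fuzzed(page_path, subdomain, page_fuzz_list, subdomain_fuzz_list):
    """Report whether the page or subdomain appears on a fuzz list,
    regardless of whether -F or -S was supplied on the command line."""
    return page_path in page_fuzz_list or subdomain in subdomain_fuzz_list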
