jake-bickle / arachnid
An OSINT tool to find data leaks on a targeted website
License: GNU General Public License v3.0
$ pip install arachnid-spider
Collecting arachnid-spider
Could not find a version that satisfies the requirement arachnid-spider (from versions: )
No matching distribution found for arachnid-spider
Fresh versions of php7.3 and apache2 were installed shortly before attempting to install Arachnid via pip; I'm not sure which requirements aren't being met.
Each request that is made is missing a timeout parameter, which can leave the crawler hanging indefinitely. In addition, a warning should be issued to the user any time a request times out.
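A minimal sketch of how that could look, assuming the crawler issues requests through a single helper (the helper name and the 30-second default below are hypothetical, not Arachnid's actual API):

```python
import warnings
import requests

REQUEST_TIMEOUT = 30  # seconds; hypothetical default, tune as needed

def fetch(url, headers=None):
    """Fetch a URL with a timeout so the crawler can never hang indefinitely."""
    try:
        return requests.get(url, headers=headers, timeout=REQUEST_TIMEOUT)
    except requests.exceptions.Timeout:
        # Surface the problem instead of failing silently.
        warnings.warn(f"Request to {url} timed out after {REQUEST_TIMEOUT}s; skipping.")
        return None
```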
Currently, if the same social media URL appears more than once on a page (think sidebar and footer), both occurrences are recorded in the JSON.
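One way to fix this is to deduplicate before writing to the page record; a small sketch, assuming the page record is a dict holding a "social_media" list (both names are hypothetical):

```python
def record_social_media(page, found_urls):
    """Add each social media URL to the page record only once, preserving discovery order."""
    recorded = set(page.setdefault("social_media", []))
    for url in found_urls:
        if url not in recorded:
            page["social_media"].append(url)
            recorded.add(url)
```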
Do not request any URLs with fragments (#)
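Fragments could be stripped at a single normalization chokepoint before URLs are scheduled; a sketch using only the standard library:

```python
from urllib.parse import urldefrag

def strip_fragment(url):
    """Return the URL without its fragment, so 'page.html#section' and 'page.html' schedule identically."""
    clean, _fragment = urldefrag(url)
    return clean
```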
find_all_phones will find "123-123-1234" out of "123-123-12345" and "1234567890" out of "1234567890123"
I'm sure there are other similar scenarios where this will occur as well.
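One hedged fix is to anchor the pattern so a match cannot sit inside a longer run of digits; the pattern below is a simplified illustration (US-style hyphenated and 10-digit forms only), not Arachnid's actual regex:

```python
import re

# Reject candidates embedded in longer digit runs, so "123-123-12345" and
# "1234567890123" produce no match instead of a truncated one.
PHONE_RE = re.compile(r"(?<!\d)(?:\d{3}-\d{3}-\d{4}|\d{10})(?!\d)")

def find_all_phones(text):
    return PHONE_RE.findall(text)

# find_all_phones("call 123-123-1234")  -> ["123-123-1234"]
# find_all_phones("123-123-12345")      -> []
```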
We need to find a place for user documentation and write it for each option that Arachnid provides.
Also, perhaps an advanced section on using the JSON output to create your own output application.
Creating unit tests is extremely important in ensuring Arachnid is stable. Stability matters for a long-lasting crawl, as a crash could lead to lost data and a lot of wasted time (though issues #29 and #27 are looking to solve this). Unfortunately, tests have fallen by the wayside in this project. Nearly all of the modules have no unit tests, and the existing tests are broken.
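As a starting point, even a handful of small pytest cases around the pure parsing functions would help; a sketch (the import path below is an assumption, adjust it to wherever find_all_phones actually lives):

```python
# test_parsing.py -- minimal pytest examples
from arachnid.crawler import responseparser  # hypothetical module path for find_all_phones

def test_finds_simple_phone_number():
    assert responseparser.find_all_phones("call 123-123-1234 today") == ["123-123-1234"]

def test_does_not_match_inside_longer_digit_run():
    # Captures the desired behavior from the phone-number issue above; expected to fail today.
    assert responseparser.find_all_phones("123-123-12345") == []
```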
The output holds a special area where Arachnid can surface warnings about situations it cannot handle. It needs a way to issue those warnings to the PHP output.
EDIT: For those experiencing this problem, read about ways to fix it here.
Arachnid has no mechanism in place that stops it from crawling pages that continuously generate more pages. For example, a web calendar may have a "next month" button that one could theoretically follow forever, navigating hundreds of years into the future.
This is a common bot trap. The issue is that the crawler gets stuck continuously navigating to new pages and gathering useless data, and it can take a long or even indefinite amount of time for the scheduler to provide a new, unique URL that navigates the crawler away from the trap.
Methods attempted so far:
AOPIC (Adaptive On-line Page Importance Computation). The official paper is located here, and it is explained in layman's terms here.
This algorithm is designed for incremental crawlers, not snapshot crawlers; however, it naturally pushes the evil pages towards the end of the crawl. If we could think of a way to modify it to detect when the crawler is stuck on an evil page, this would be our solution.
I pushed the branch "AOPIC" with my implementation of the algorithm. The file of interest is here. Feel free to fork it and mess around with the algorithm.
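For reference, here is a bare-bones sketch of the AOPIC credit idea (a toy model, not the implementation on the AOPIC branch): every page holds some "cash", crawling a page distributes its cash evenly among its outlinks, and the scheduler always picks the page holding the most cash, so evil pages that keep minting near-duplicate children accumulate credit slowly and drift toward the end of the crawl.

```python
from collections import defaultdict

class AopicScheduler:
    """Toy AOPIC-style scheduler: pick the URL holding the most cash,
    then split that cash evenly among the links found on it."""

    def __init__(self, seed_urls, initial_cash=1.0):
        self.cash = defaultdict(float)
        self.history = defaultdict(float)  # total credit ever received per URL
        for url in seed_urls:
            self.cash[url] = initial_cash

    def next_url(self):
        """Return the uncrawled URL with the most accumulated cash, or None when done."""
        if not self.cash:
            return None
        return max(self.cash, key=self.cash.get)

    def record_crawl(self, url, outlinks):
        """Bank the page's cash into its history and redistribute it to its outlinks."""
        amount = self.cash.pop(url, 0.0)
        self.history[url] += amount
        if not outlinks:
            return
        share = amount / len(outlinks)
        for link in outlinks:
            self.cash[link] += share
```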
The DiffFilter checks an arbitrary number X of previously crawled URL strings for a "significant difference", defined as 3 or more characters differing between two strings.
If the current URL does not have a significant difference from any of the previous X URLs, it will not be added to the schedule.
This simple algorithm breaks quite easily when an evil page generates multiple evil pages whose URLs do have a significant difference, and there is also a small potential for false positives.
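A minimal sketch of the DiffFilter idea (the class name, window size, and threshold below are illustrative):

```python
class DiffFilter:
    """Reject a URL when it lacks a significant difference from every one of the last X URLs."""

    def __init__(self, window_size=10, min_diff=3):
        self.window_size = window_size
        self.min_diff = min_diff
        self.recent = []

    @staticmethod
    def char_difference(a, b):
        # Count positions where the strings differ, plus the difference in length.
        diffs = sum(1 for x, y in zip(a, b) if x != y)
        return diffs + abs(len(a) - len(b))

    def should_schedule(self, url):
        significant = not self.recent or any(
            self.char_difference(url, prev) >= self.min_diff for prev in self.recent
        )
        if significant:
            self.recent.append(url)
            self.recent = self.recent[-self.window_size:]
        return significant
```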
There are many placeholder texts for each attribute when using --help. They must be written concisely and get the point across.
This function prompts the user for a path to a PHP executable. The input is not sanitized, allowing the user to cause errors.
The function must be safe (not crash or cause unhandled errors) no matter what input is supplied.
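A hedged sketch of one way to validate the input before accepting it (the function name and prompt text are illustrative, not Arachnid's actual code):

```python
import shutil

def prompt_for_php_path():
    """Keep prompting until the user supplies something that resolves to an executable PHP binary."""
    while True:
        raw = input("Path to PHP executable: ").strip()
        if not raw:
            print("Please enter a path.")
            continue
        resolved = shutil.which(raw)  # handles both bare commands ("php") and full paths
        if resolved is None:
            print(f"{raw!r} is not an executable file, try again.")
            continue
        return resolved
```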
When crawling a large website, it is not feasible to store all the data in memory. For scalability, it makes sense to store all of the data on disk instead.
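One common approach, sketched below, is to push each crawl record into a SQLite database as it is produced rather than accumulating everything in memory (the table and column names are illustrative):

```python
import json
import sqlite3

class CrawlStore:
    """Persist each crawled page to disk immediately instead of holding it in memory."""

    def __init__(self, path="arachnid_crawl.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, data TEXT)")

    def save_page(self, url, page_data):
        self.conn.execute(
            "INSERT OR REPLACE INTO pages (url, data) VALUES (?, ?)",
            (url, json.dumps(page_data)),
        )
        self.conn.commit()

    def export_json(self, out_path):
        """Rebuild the final JSON report from disk once the crawl is finished."""
        pages = {url: json.loads(data)
                 for url, data in self.conn.execute("SELECT url, data FROM pages")}
        with open(out_path, "w") as f:
            json.dump(pages, f, indent=2)
```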
Attempting to execute the following
import requests
requests.get("https://mail.calcharter.org")
Results in the following error
Traceback (most recent call last):
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 472, in wrap_socket
cnx.do_handshake()
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1915, in do_handshake
self._raise_ssl_error(self._ssl, result)
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1640, in _raise_ssl_error
raise SysCallError(-1, "Unexpected EOF")
OpenSSL.SSL.SysCallError: (-1, 'Unexpected EOF')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/connectionpool.py", line 603, in urlopen
chunked=chunked)
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/connectionpool.py", line 344, in _make_request
self._validate_conn(conn)
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/connectionpool.py", line 843, in _validate_conn
conn.connect()
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/connection.py", line 370, in connect
ssl_context=context)
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/util/ssl_.py", line 355, in ssl_wrap_socket
return context.wrap_socket(sock, server_hostname=server_hostname)
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 478, in wrap_socket
raise ssl.SSLError('bad handshake: %r' % e)
ssl.SSLError: ("bad handshake: SysCallError(-1, 'Unexpected EOF')",)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/connectionpool.py", line 641, in urlopen
_stacktrace=sys.exc_info()[2])
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/util/retry.py", line 399, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='mail.calcharter.org', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')")))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/jacobbickle/PycharmProjects/Arachnid/arachnid/__main__.py", line 3, in <module>
main()
File "/Users/jacobbickle/PycharmProjects/Arachnid/arachnid/arachnid.py", line 191, in main
crawl()
File "/Users/jacobbickle/PycharmProjects/Arachnid/arachnid/arachnid.py", line 175, in crawl
while c.crawl_next():
File "/Users/jacobbickle/PycharmProjects/Arachnid/arachnid/crawler/crawler.py", line 66, in crawl_next
c_url = self.schedule.next_url()
File "/Users/jacobbickle/PycharmProjects/Arachnid/arachnid/crawler/scheduler.py", line 85, in next_url
self._fuzz_for_domainblocks()
File "/Users/jacobbickle/PycharmProjects/Arachnid/arachnid/crawler/scheduler.py", line 113, in _fuzz_for_domainblocks
r = requests.head(sub_to_check.get_url(), headers=self.headers)
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/requests/api.py", line 101, in head
return request('head', url, **kwargs)
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/requests/api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/requests/adapters.py", line 514, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='mail.calcharter.org', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')")))
After some research, I've found a similar situation here. After trying out some possible solutions, I've found that it's due to cipher suites that the requests library no longer supports.
It's possible to add these cipher suites manually, but there are likely too many edge cases around the internet to keep track of. I decided that it would be best to only use the cipher suites that requests supports. Because of this decision, Arachnid can't access the website and must notify the user with a warning.
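In practice that likely amounts to catching the SSL failure around each request and downgrading it to a warning, roughly like this (the helper name is illustrative):

```python
import warnings
import requests

def safe_head(url, headers=None, timeout=30):
    """Issue a HEAD request, but turn unsupported-cipher handshake failures into a warning."""
    try:
        return requests.head(url, headers=headers, timeout=timeout)
    except requests.exceptions.SSLError as e:
        warnings.warn(f"Could not establish a TLS connection to {url} ({e}); skipping this host.")
        return None
```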
Arachnid needs a helpful and visually pleasing output. It will utilize the JSON file that is supplied by the crawler.
Hi, as shown in the following full dependency graph of arachnid-spider, arachnid-spider requires chardet * (any version), while the installed version of requests (2.22.0) requires chardet >=3.0.2,<3.1.0.
According to pip's "first found wins" installation strategy, chardet 3.0.4 is the version that actually gets installed.
Although that first-found version (chardet 3.0.4) happens to satisfy the stricter constraint (chardet >=3.0.2,<3.1.0), the unpinned requirement will resolve to a newer chardet as soon as one is released, violating requests' constraint and leading to a build failure.
arachnid-spider - 0.9.4
| +- beautifulsoup4(install version:4.8.1 version range:*)
| | +- soupsieve(install version:1.9.5 version range:>=1.2)
| +- chardet(install version:3.0.4 version range:*)
| +- ndg-httpsclient(install version:0.5.1 version range:*)
| | +- pyasn1(install version:0.4.8 version range:>=0.1.1)
| | +- pyopenssl(install version:19.1.0 version range:*)
| | | +- cryptography(install version:2.8 version range:>=2.8)
| | | +- six(install version:1.13.0 version range:>=1.5.2)
| +- pyasn1(install version:0.4.8 version range:*)
| +- pyopenssl(install version:19.1.0 version range:*)
| | +- cryptography(install version:2.8 version range:>=2.8)
| | +- six(install version:1.13.0 version range:>=1.5.2)
| +- requests(install version:2.22.0 version range:*)
| | +- certifi(install version:2019.9.11 version range:>=2017.4.17)
| | +- chardet(install version:3.0.4 version range:<3.1.0,>=3.0.2)
| | +- idna(install version:2.8 version range:>=2.5,<2.9)
| | +- urllib3(install version:1.25.7 version range:<1.26,>=1.21.1)
| +- tldextract(install version:2.2.2 version range:*)
| | +- idna(install version:2.8 version range:*)
| | +- requests(install version:2.22.0 version range:>=2.1.0)
| | | +- certifi(install version:2019.9.11 version range:>=2017.4.17)
| | | +- chardet(install version:3.0.4 version range:<3.1.0,>=3.0.2)
| | | +- idna(install version:2.8 version range:>=2.5,<2.9)
| | | +- urllib3(install version:1.25.7 version range:<1.26,>=1.21.1)
| | +- requests-file(install version:1.4.3 version range:>=1.4)
| | | +- requests(install version:2.22.0 version range:>=1.0.0)
| | | | +- certifi(install version:2019.9.11 version range:>=2017.4.17)
| | | | +- chardet(install version:3.0.4 version range:<3.1.0,>=3.0.2)
| | | | +- idna(install version:2.8 version range:>=2.5,<2.9)
| | | | +- urllib3(install version:1.25.7 version range:<1.26,>=1.21.1)
| | | +- six(install version:1.13.0 version range:*)
| | +- setuptools(install version:42.0.1 version range:*)
Thanks for your attention.
Best,
Neolith
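One way to guard against this is to mirror requests' constraint in Arachnid's own packaging metadata instead of leaving chardet unpinned; for example, assuming a standard setuptools setup.py (which may differ from Arachnid's actual packaging):

```python
# setup.py (excerpt) -- pin chardet to the range requests 2.22.0 accepts
from setuptools import setup, find_packages

setup(
    name="arachnid-spider",
    packages=find_packages(),
    install_requires=[
        "requests>=2.1.0",
        "chardet>=3.0.2,<3.1.0",  # match requests' constraint instead of an unpinned "chardet *"
        # ... other dependencies unchanged
    ],
)
```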
Right now, the "on_fuzzed" boolean value for each page is automatically false if -S or -F is not supplied even though the page could vary well be on a fuzz list.
Arachnid ought to check the fuzz data to ensure that a given page exists on the fuzz list despite whether the user wants to fuzz for pages and subdomains or not.
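A rough sketch of that check, assuming the fuzz lists are plain wordlists of paths and subdomain labels (the file format and function names here are hypothetical):

```python
from urllib.parse import urlparse

def load_wordlist(path):
    """Load a fuzz wordlist into a set of lowercase entries."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def is_on_fuzz_list(url, page_wordlist, subdomain_wordlist):
    """Return True if the URL's path or subdomain appears in the fuzz wordlists,
    regardless of whether -S/-F fuzzing was actually requested."""
    parsed = urlparse(url)
    path = parsed.path.strip("/").lower()
    subdomain = parsed.hostname.split(".")[0].lower() if parsed.hostname else ""
    return path in page_wordlist or subdomain in subdomain_wordlist
```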