jake-bickle / arachnid
An OSINT tool to find data leaks on a targeted website
License: GNU General Public License v3.0
$ pip install arachnid-spider
Collecting arachnid-spider
Could not find a version that satisfies the requirement arachnid-spider (from versions: )
No matching distribution found for arachnid-spider
Fresh versions of php7.3 and apache2 were installed shortly before attempting to install Arachnid via pip; I'm not sure which requirements aren't being met.
Each request that is made is missing a timeout parameter, which can leave the crawler hanging indefinitely. In addition, a warning should be issued to the user any time a request times out.
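A minimal sketch of how that could look, assuming the crawler issues requests through a single helper (the helper name and the 30-second default below are hypothetical, not Arachnid's actual API):

```python
import warnings
import requests

REQUEST_TIMEOUT = 30  # seconds; hypothetical default, tune as needed

def fetch(url, headers=None):
    """Fetch a URL with a timeout so the crawler can never hang indefinitely."""
    try:
        return requests.get(url, headers=headers, timeout=REQUEST_TIMEOUT)
    except requests.exceptions.Timeout:
        # Surface the problem instead of failing silently.
        warnings.warn(f"Request to {url} timed out after {REQUEST_TIMEOUT}s; skipping.")
        return None
```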
Currently, if the same social media URL appears more than once on a page (think sidebar and footer), both occurrences are recorded in the JSON.
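One way to fix this is to deduplicate before writing to the page record; a small sketch, assuming the page record is a dict holding a "social_media" list (both names are hypothetical):

```python
def record_social_media(page, found_urls):
    """Add each social media URL to the page record only once, preserving discovery order."""
    recorded = set(page.setdefault("social_media", []))
    for url in found_urls:
        if url not in recorded:
            page["social_media"].append(url)
            recorded.add(url)
```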
Do not request any URLs with fragments (#)
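Fragments could be stripped at a single normalization chokepoint before URLs are scheduled; a sketch using only the standard library:

```python
from urllib.parse import urldefrag

def strip_fragment(url):
    """Return the URL without its fragment, so 'page.html#section' and 'page.html' schedule identically."""
    clean, _fragment = urldefrag(url)
    return clean
```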
find_all_phones will find "123-123-1234" out of "123-123-12345" and "1234567890" out of "1234567890123"
I'm sure there are other similar scenarios where this will occur as well.
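One hedged fix is to anchor the pattern so a match cannot sit inside a longer run of digits; the pattern below is a simplified illustration (US-style hyphenated and 10-digit forms only), not Arachnid's actual regex:

```python
import re

# Reject candidates embedded in longer digit runs, so "123-123-12345" and
# "1234567890123" produce no match instead of a truncated one.
PHONE_RE = re.compile(r"(?<!\d)(?:\d{3}-\d{3}-\d{4}|\d{10})(?!\d)")

def find_all_phones(text):
    return PHONE_RE.findall(text)

# find_all_phones("call 123-123-1234")  -> ["123-123-1234"]
# find_all_phones("123-123-12345")      -> []
```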
We need to find a place for user documentation and write it for each option that Arachnid provides.
Also, perhaps an advanced section on using the JSON output to create your own output application.
Creating unit tests is extremely important in ensuring Arachnid is stable. Stability matters for a long-lasting crawl, as a crash could lead to lost data and a lot of wasted time (though issues #29 and #27 are looking to solve this). Unfortunately, tests have fallen by the wayside in this project. Nearly all of the modules have no unit tests, and the existing tests are broken.
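As a starting point, even a handful of small pytest cases around the pure parsing functions would help; a sketch (the import path below is an assumption, adjust it to wherever find_all_phones actually lives):

```python
# test_parsing.py -- minimal pytest examples
from arachnid.crawler import responseparser  # hypothetical module path for find_all_phones

def test_finds_simple_phone_number():
    assert responseparser.find_all_phones("call 123-123-1234 today") == ["123-123-1234"]

def test_does_not_match_inside_longer_digit_run():
    # Captures the desired behavior from the phone-number issue above; expected to fail today.
    assert responseparser.find_all_phones("123-123-12345") == []
```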
The output holds a special area where Arachnid can surface warnings about situations it cannot handle. It needs a way to issue those warnings to the PHP output.
EDIT: For those experiencing this problem, read about ways to fix it here.
Arachnid has no mechanism in place that stops it from crawling pages that continuously generate more pages. For example, a web calendar may have a "next month" button that one could theoretically follow forever, navigating hundreds of years into the future.
This is a common bot trap. The issue is that the crawler gets stuck continuously navigating to new pages and gathering useless data, and it can take a long or even indefinite amount of time for the scheduler to provide a new, unique URL that navigates the crawler away from the trap.
Methods attempted so far:
AOPIC (Adaptive On-line Page Importance Computation). The official paper is located here, and it is explained in layman's terms here.
This algorithm is designed for incremental crawlers, not snapshot crawlers; however, it naturally pushes the evil pages towards the end of the crawl. If we could think of a way to modify it to detect when the crawler is stuck on an evil page, this would be our solution.
I pushed the branch "AOPIC" with my implementation of the algorithm. The file of interest is here. Feel free to fork it and mess around with the algorithm.
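For reference, here is a bare-bones sketch of the AOPIC credit idea (a toy model, not the implementation on the AOPIC branch): every page holds some "cash", crawling a page distributes its cash evenly among its outlinks, and the scheduler always picks the page holding the most cash, so evil pages that keep minting near-duplicate children accumulate credit slowly and drift toward the end of the crawl.

```python
from collections import defaultdict

class AopicScheduler:
    """Toy AOPIC-style scheduler: pick the URL holding the most cash,
    then split that cash evenly among the links found on it."""

    def __init__(self, seed_urls, initial_cash=1.0):
        self.cash = defaultdict(float)
        self.history = defaultdict(float)  # total credit ever received per URL
        for url in seed_urls:
            self.cash[url] = initial_cash

    def next_url(self):
        """Return the uncrawled URL with the most accumulated cash, or None when done."""
        if not self.cash:
            return None
        return max(self.cash, key=self.cash.get)

    def record_crawl(self, url, outlinks):
        """Bank the page's cash into its history and redistribute it to its outlinks."""
        amount = self.cash.pop(url, 0.0)
        self.history[url] += amount
        if not outlinks:
            return
        share = amount / len(outlinks)
        for link in outlinks:
            self.cash[link] += share
```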
The DiffFilter checks an arbitrary number X of previously crawled URL strings for a "significant difference", defined as 3 or more characters differing between two strings.
If the current URL does not have a significant difference from any of the previous X URLs, it will not be added to the schedule.
This simple algorithm breaks quite easily when an evil page generates multiple evil pages whose URLs do have a significant difference, and there is also a small potential for false positives.
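A minimal sketch of the DiffFilter idea (the class name, window size, and threshold below are illustrative):

```python
class DiffFilter:
    """Reject a URL when it lacks a significant difference from every one of the last X URLs."""

    def __init__(self, window_size=10, min_diff=3):
        self.window_size = window_size
        self.min_diff = min_diff
        self.recent = []

    @staticmethod
    def char_difference(a, b):
        # Count positions where the strings differ, plus the difference in length.
        diffs = sum(1 for x, y in zip(a, b) if x != y)
        return diffs + abs(len(a) - len(b))

    def should_schedule(self, url):
        significant = not self.recent or any(
            self.char_difference(url, prev) >= self.min_diff for prev in self.recent
        )
        if significant:
            self.recent.append(url)
            self.recent = self.recent[-self.window_size:]
        return significant
```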
There are many placeholder texts for each attribute when using --help. They must be written concisely and get the point across.
This function prompts the user for a path to a PHP executable. The input is not sanitized, allowing the user to cause errors.
The function must be safe (not crash or cause unhandled errors) no matter what input is supplied.
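A hedged sketch of one way to validate the input before accepting it (the function name and prompt text are illustrative, not Arachnid's actual code):

```python
import shutil

def prompt_for_php_path():
    """Keep prompting until the user supplies something that resolves to an executable PHP binary."""
    while True:
        raw = input("Path to PHP executable: ").strip()
        if not raw:
            print("Please enter a path.")
            continue
        resolved = shutil.which(raw)  # handles both bare commands ("php") and full paths
        if resolved is None:
            print(f"{raw!r} is not an executable file, try again.")
            continue
        return resolved
```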
When crawling a large website, it is not feasible to store all the data in memory. For scalability, it makes sense to store all of the data on disk instead.
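One common approach, sketched below, is to push each crawl record into a SQLite database as it is produced rather than accumulating everything in memory (the table and column names are illustrative):

```python
import json
import sqlite3

class CrawlStore:
    """Persist each crawled page to disk immediately instead of holding it in memory."""

    def __init__(self, path="arachnid_crawl.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, data TEXT)")

    def save_page(self, url, page_data):
        self.conn.execute(
            "INSERT OR REPLACE INTO pages (url, data) VALUES (?, ?)",
            (url, json.dumps(page_data)),
        )
        self.conn.commit()

    def export_json(self, out_path):
        """Rebuild the final JSON report from disk once the crawl is finished."""
        pages = {url: json.loads(data)
                 for url, data in self.conn.execute("SELECT url, data FROM pages")}
        with open(out_path, "w") as f:
            json.dump(pages, f, indent=2)
```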
Attempting to execute the following
import requests
requests.get("https://mail.calcharter.org")
Results in the following error
Traceback (most recent call last):
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 472, in wrap_socket
cnx.do_handshake()
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1915, in do_handshake
self._raise_ssl_error(self._ssl, result)
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1640, in _raise_ssl_error
raise SysCallError(-1, "Unexpected EOF")
OpenSSL.SSL.SysCallError: (-1, 'Unexpected EOF')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/connectionpool.py", line 603, in urlopen
chunked=chunked)
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/connectionpool.py", line 344, in _make_request
self._validate_conn(conn)
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/connectionpool.py", line 843, in _validate_conn
conn.connect()
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/connection.py", line 370, in connect
ssl_context=context)
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/util/ssl_.py", line 355, in ssl_wrap_socket
return context.wrap_socket(sock, server_hostname=server_hostname)
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 478, in wrap_socket
raise ssl.SSLError('bad handshake: %r' % e)
ssl.SSLError: ("bad handshake: SysCallError(-1, 'Unexpected EOF')",)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/connectionpool.py", line 641, in urlopen
_stacktrace=sys.exc_info()[2])
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/urllib3/util/retry.py", line 399, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='mail.calcharter.org', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')")))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/jacobbickle/PycharmProjects/Arachnid/arachnid/__main__.py", line 3, in <module>
main()
File "/Users/jacobbickle/PycharmProjects/Arachnid/arachnid/arachnid.py", line 191, in main
crawl()
File "/Users/jacobbickle/PycharmProjects/Arachnid/arachnid/arachnid.py", line 175, in crawl
while c.crawl_next():
File "/Users/jacobbickle/PycharmProjects/Arachnid/arachnid/crawler/crawler.py", line 66, in crawl_next
c_url = self.schedule.next_url()
File "/Users/jacobbickle/PycharmProjects/Arachnid/arachnid/crawler/scheduler.py", line 85, in next_url
self._fuzz_for_domainblocks()
File "/Users/jacobbickle/PycharmProjects/Arachnid/arachnid/crawler/scheduler.py", line 113, in _fuzz_for_domainblocks
r = requests.head(sub_to_check.get_url(), headers=self.headers)
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/requests/api.py", line 101, in head
return request('head', url, **kwargs)
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/requests/api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/Users/jacobbickle/.local/share/virtualenvs/Arachnid-do6dGE4u/lib/python3.7/site-packages/requests/adapters.py", line 514, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='mail.calcharter.org', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')")))
After some research, I've found a similar situation here. After trying out some possible solutions, I've found that it's due to cipher suites that the requests library no longer supports.
It's possible to add these cipher suites manually, but there are likely too many edge cases around the internet to keep track of. I decided that it would be best to only use the cipher suites that requests supports. Because of this decision, Arachnid can't access the website and must notify the user with a warning.
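In practice that likely amounts to catching the SSL failure around each request and downgrading it to a warning, roughly like this (the helper name is illustrative):

```python
import warnings
import requests

def safe_head(url, headers=None, timeout=30):
    """Issue a HEAD request, but turn unsupported-cipher handshake failures into a warning."""
    try:
        return requests.head(url, headers=headers, timeout=timeout)
    except requests.exceptions.SSLError as e:
        warnings.warn(f"Could not establish a TLS connection to {url} ({e}); skipping this host.")
        return None
```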
Arachnid needs a helpful and visually pleasing output. It will utilize the JSON file that is supplied by the crawler.
Hi, as shown in the following full dependency graph of arachnid-spider, arachnid-spider requires chardet * (any version), while the installed version of requests (2.22.0) requires chardet >=3.0.2,<3.1.0.
According to pip's "first found wins" installation strategy, chardet 3.0.4 is the version that actually gets installed.
Although that first-found version (chardet 3.0.4) happens to satisfy the stricter constraint (chardet >=3.0.2,<3.1.0), the unpinned requirement will resolve to a newer chardet as soon as one is released, violating requests' constraint and leading to a build failure.
arachnid-spider - 0.9.4
| +- beautifulsoup4(install version:4.8.1 version range:*)
| | +- soupsieve(install version:1.9.5 version range:>=1.2)
| +- chardet(install version:3.0.4 version range:*)
| +- ndg-httpsclient(install version:0.5.1 version range:*)
| | +- pyasn1(install version:0.4.8 version range:>=0.1.1)
| | +- pyopenssl(install version:19.1.0 version range:*)
| | | +- cryptography(install version:2.8 version range:>=2.8)
| | | +- six(install version:1.13.0 version range:>=1.5.2)
| +- pyasn1(install version:0.4.8 version range:*)
| +- pyopenssl(install version:19.1.0 version range:*)
| | +- cryptography(install version:2.8 version range:>=2.8)
| | +- six(install version:1.13.0 version range:>=1.5.2)
| +- requests(install version:2.22.0 version range:*)
| | +- certifi(install version:2019.9.11 version range:>=2017.4.17)
| | +- chardet(install version:3.0.4 version range:<3.1.0,>=3.0.2)
| | +- idna(install version:2.8 version range:>=2.5,<2.9)
| | +- urllib3(install version:1.25.7 version range:<1.26,>=1.21.1)
| +- tldextract(install version:2.2.2 version range:*)
| | +- idna(install version:2.8 version range:*)
| | +- requests(install version:2.22.0 version range:>=2.1.0)
| | | +- certifi(install version:2019.9.11 version range:>=2017.4.17)
| | | +- chardet(install version:3.0.4 version range:<3.1.0,>=3.0.2)
| | | +- idna(install version:2.8 version range:>=2.5,<2.9)
| | | +- urllib3(install version:1.25.7 version range:<1.26,>=1.21.1)
| | +- requests-file(install version:1.4.3 version range:>=1.4)
| | | +- requests(install version:2.22.0 version range:>=1.0.0)
| | | | +- certifi(install version:2019.9.11 version range:>=2017.4.17)
| | | | +- chardet(install version:3.0.4 version range:<3.1.0,>=3.0.2)
| | | | +- idna(install version:2.8 version range:>=2.5,<2.9)
| | | | +- urllib3(install version:1.25.7 version range:<1.26,>=1.21.1)
| | | +- six(install version:1.13.0 version range:*)
| | +- setuptools(install version:42.0.1 version range:*)
Thanks for your attention.
Best,
Neolith
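One way to guard against this is to mirror requests' constraint in Arachnid's own packaging metadata instead of leaving chardet unpinned; for example, assuming a standard setuptools setup.py (which may differ from Arachnid's actual packaging):

```python
# setup.py (excerpt) -- pin chardet to the range requests 2.22.0 accepts
from setuptools import setup, find_packages

setup(
    name="arachnid-spider",
    packages=find_packages(),
    install_requires=[
        "requests>=2.1.0",
        "chardet>=3.0.2,<3.1.0",  # match requests' constraint instead of an unpinned "chardet *"
        # ... other dependencies unchanged
    ],
)
```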
Right now, the "on_fuzzed" boolean value for each page is automatically false if -S or -F is not supplied even though the page could vary well be on a fuzz list.
Arachnid ought to check the fuzz data to ensure that a given page exists on the fuzz list despite whether the user wants to fuzz for pages and subdomains or not.
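A rough sketch of that check, assuming the fuzz lists are plain wordlists of paths and subdomain labels (the file format and function names here are hypothetical):

```python
from urllib.parse import urlparse

def load_wordlist(path):
    """Load a fuzz wordlist into a set of lowercase entries."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def is_on_fuzz_list(url, page_wordlist, subdomain_wordlist):
    """Return True if the URL's path or subdomain appears in the fuzz wordlists,
    regardless of whether -S/-F fuzzing was actually requested."""
    parsed = urlparse(url)
    path = parsed.path.strip("/").lower()
    subdomain = parsed.hostname.split(".")[0].lower() if parsed.hostname else ""
    return path in page_wordlist or subdomain in subdomain_wordlist
```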