18f / domain-scan
A lightweight pipeline, locally or in Lambda, for scanning things like HTTPS, third party service use, and web accessibility.
License: Other
I know we're all busy, but I figured it's at least worth starting the discussion around this repo's lack of tests. Right now, only a Python linter (?) is run.
I'll take a first stab at what I think should be tested:
The ./scan command, with unit tests (throw all different args at it, see if it still works).

@konklone, is this where the other scanners will live, or is domain-scan going to remain focused on SSL scanning?
--xml_out can now print to STDOUT, which means we can clean up the file management code in the sslyze scanner.
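If that lands, the scanner-side invocation could be as simple as capturing stdout. A sketch, assuming sslyze accepts - as the --xml_out target (worth verifying against the sslyze release notes) — not the actual scanner code:

```python
import subprocess

def build_command(domain, sslyze_path="sslyze"):
    # "--xml_out=-" assumes sslyze treats "-" as stdout; verify on your version.
    return [sslyze_path, "--regular", "--xml_out=-", domain]

def scan_xml(domain):
    """Capture sslyze's XML report directly, with no temp files to clean up."""
    result = subprocess.run(build_command(domain), capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else None
```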
I don't have enough information to report a bug yet. When I checked on our server, the sslyze scans had stalled out with 9 in-flight, with a bunch of defunct
sslyze processes. These were the domains:
nces.ed.gov
autodiscover.ors.od.nih.gov
stg-reg2.hcia.cms.gov
vpn1.cjis.gov
portcullis.nlrb.gov
safesupportivelearning.ed.gov
my.uscis.gov
www.educationusa.state.gov
pittsburgh.feb.gov
But when I ran a scan using sslyze on all 9 of those in a row, using --serial, none of them stalled out. So I'm not totally sure how to reproduce this.
When I use
sudo ./gather censys,dap,
--suffix=.gov
--censys_id=id
--censys_key=key
--dap=https://analytics.usa.gov/data/live/sites-extended.csv
--parents=https://raw.githubusercontent.com/GSA/data/gh-pages/dotgov-domains/current-federal.csv
I get this error:
Done fetching from API.
Results written to CSV.
rootk@ubuntu:~/domain-scan-master$ ./st2.sh
Fetching up to 100 records, starting at page 1.
[1] Cached page.
[] Gatherer not found, or had an error during loading.
ERROR: <class 'ImportError'>
No module named 'gatherers.'
What is the problem?
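One thing worth checking: the empty name in "[] Gatherer not found" and "No module named 'gatherers.'" is consistent with the trailing comma in censys,dap, — splitting that argument produces an empty gatherer name, which gather then tries to import as a module. A small illustration (not the actual gather code):

```python
def split_gatherers(arg):
    # A naive comma split keeps the empty trailing entry, which would make
    # gather try to import a module literally named "gatherers."
    return arg.split(",")

def split_gatherers_safe(arg):
    # Dropping blank entries tolerates a trailing comma.
    return [name for name in arg.split(",") if name.strip()]
```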
Observing this sslyze error during scans:
Traceback (most recent call last):
File "/opt/scan/domain-scan/scan", line 120, in process_scan
rows = list(scanner.scan(domain, options))
File "/opt/scan/domain-scan/scanners/sslyze.py", line 75, in scan
data = parse_sslyze(xml)
File "/opt/scan/domain-scan/scanners/sslyze.py", line 205, in parse_sslyze
issuer = certificates[-1].select_one("issuer commonName")
IndexError: list index out of range
These appear to happen after a long timeout, suggesting that there could be a connection/timeout error that results in no certificate data being available.
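Until the underlying connection/timeout behavior is understood, parse_sslyze could guard against an empty certificate list instead of indexing blindly. A sketch, not the actual scanner code — certificates here stands in for whatever the XML parse returned:

```python
def issuer_common_name(certificates):
    """Return the issuer commonName from the last cert in the chain, or None
    when the scan produced no certificate data (e.g. after a timeout)."""
    if not certificates:
        return None
    issuer = certificates[-1].select_one("issuer commonName")
    return issuer.text if issuer is not None else None
```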
I'm trying to call pshtt as a Docker container, which results in something like the following command being executed by domain-scan:
docker run --rm -e USER_ID=1042 -e GROUP_ID=1042 -v $(pwd):/data dockerpulse_c-pshtt rijksoverheid.nl
This fails when a scanner tries this command using try_command() [code], because it runs which on the whole command, including parameters. This could be fixed if only the actual executable of the command were passed to which, e.g.:
subprocess.check_call(["which", command.split(' ')[0]], ...)
I'm not sure if this has any implications, so I made an issue out of this instead of a pull request.
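A sketch of that fix, with try_command standing in for the real helper in the scanners' utils (and shlex.split used instead of a plain space split, so quoted arguments survive):

```python
import shlex
import subprocess

def try_command(command):
    """Check that a command's executable exists, ignoring its arguments.

    `command` may be a full invocation like
    "docker run --rm -v $(pwd):/data image domain.tld"; only the first
    token is what `which` should resolve.
    """
    executable = shlex.split(command)[0]
    try:
        subprocess.check_call(
            ["which", executable],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return True
    except (subprocess.CalledProcessError, FileNotFoundError):
        return False
```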
@jtexnl's script, saving here for posterity:
import collections
import csv
import re
import json
def readData(inputFile):
    outList = []
    with open(inputFile, 'rU') as infile:
        reader = csv.reader(infile)
        firstRow = True
        for row in reader:
            if firstRow == True:
                firstRow = False
                continue
            else:
                outList.append(row)
    return outList

def writeJson(inputData, fileName):
    with open(fileName, 'w+') as outfile:
        json.dump(inputData, outfile, indent = 4)
def makeAgencyOutput(inputList, errorDict, errorTypeDict):
    output = []
    for row in inputList:
        subSet = row[1]
        subDict = collections.OrderedDict({})
        subDict['Agency'] = row[0]
        subDict['Errors'] = errorDict[row[0]]
        for key, value in errorTypeDict.items():
            k = key
            try:
                subDict[k] = subSet[value]
            except KeyError:
                subDict[k] = 0
            except TypeError:
                subDict[k] = 0
        output.append(subDict)
    return output
def getKey(item):
    return item[0]

def trimErrorField(errorField):
    pieces = re.split('.*(Guideline)', errorField)
    shortened = pieces[-1]
    pieces = shortened.split('.')
    num = pieces[0]
    return num

def categorize(dataset, referenceDict, colNum, altName):
    for row in dataset:
        if row[colNum] in referenceDict.keys():
            row.append(referenceDict[row[colNum]])
        else:
            row.append(altName)
    return dataset

def countDict(dataset, colIndex):
    output = {}
    for row in dataset:
        if row[colIndex] in output:
            output[row[colIndex]] += 1
        else:
            output[row[colIndex]] = 1
    return output
#Read in a11y.csv for errors and domains.csv for agencies
ally1 = readData('a11y.csv')
domains = readData('domains.csv')
#need to remove ussm.gov, whistleblower.gov, and safeocs.gov from ally due to discrepancies between the datasets. Solve at some point
ally = []
for row in ally1:
    if row[0] != 'safeocs.gov' and row[0] != 'whistleblower.gov' and row[0] != 'ussm.gov':
        ally.append(row)
#Truncate the a11y file so that it's a bit more manageable. Need the domain name [0] and the principle [4]
main = []
for row in ally:
    main.append([row[0], trimErrorField(row[4])])
#Add the information on the agency [1] and branch [2]
for error in main:
    for domain in domains:
        if error[0] == domain[0].lower():
            error.append(domain[1])
            error.append(domain[2])
#Dictionaries; branches = branch lookup, errorCats = error category lookup
branches = {"Library of Congress":"Legislative","The Legislative Branch (Congress)":"Legislative",
            "Government Printing Office":"Legislative","Congressional Office of Compliance":"Legislative",
            "The Judicial Branch (Courts)":"Judicial"}
errorCats = {'1_4':'Color Contrast Error', '1_1':'Alt Tag Error', '4_1':'HTML/Attribute Error', '1_3':'Form Error'}
#define branches for the 'main' and 'domains' sets, define error categories for 'main'
main = categorize(main, branches, -1, 'Executive')
domains = categorize(domains, branches, 2, 'Executive')
main = categorize(main, errorCats, 1, 'Other Error')
totalErrorsByDomain = countDict(main, 0)
totalErrorsByAgency = countDict(main, 3)
#create dict of base vs. canonical domains
canonicals = {}
for row in ally:
    try:
        if row[0] in canonicals.keys():
            continue
        else:
            canonicals[row[0]] = row[1]
    except KeyError:
        continue
noErrors = []
errors = []
for domain in domains:
    if not domain[0].lower() in totalErrorsByDomain.keys():
        noErrors.append(domain)
    else:
        errors.append(domain)
for row in noErrors:
    row.append(0)
    row.append({})
    try:
        if row[0] in canonicals.keys():
            row.append('http://' + canonicals[row[0].lower()])
        else:
            row.append('http://' + row[0].lower())
    except TypeError:
        continue
for row in errors:
    row.append(totalErrorsByDomain[row[0].lower()])
    subset = []
    for line in main:
        if line[0] == row[0].lower():
            subset.append(line)
    errorDict = countDict(subset, -1)
    row.append(errorDict)
    try:
        if row[0] in canonicals.keys():
            row.append('http://' + canonicals[row[0].lower()])
        else:
            row.append('http://' + row[0].lower())
    except TypeError:
        continue
domains = errors + noErrors
domains = sorted(domains, key = getKey)
dictList = []
for row in domains:
    subDict = collections.OrderedDict({})
    subDict['agency'] = row[2]
    subDict['branch'] = row[5]
    subDict['canonical'] = row[8]
    subDict['domain'] = row[0].lower()
    subDict['errors'] = row[6]
    subDict['errorlist'] = row[7]
    dictList.append(subDict)
finalDict = {}
finalDict['data'] = dictList
writeJson(finalDict, 'domains.json')
agencyList = []
for row in main:
    if row[3] in agencyList:
        continue
    else:
        agencyList.append(row[3])
agencyErrorSets = []
for agency in agencyList:
    subList = []
    sub = {}
    for row in main:
        if row[3] == agency:
            if row[-1] in sub:
                sub[row[-1]] += 1
            else:
                sub[row[-1]] = 1
    subList.append(agency)
    subList.append(sub)
    agencyErrorSets.append(subList)
errorTypes = {'Color Contrast Errors':'Color Contrast Error', 'HTML/Attribute Errors':'HTML/Attribute Error',
              'Form Errors':'Form Error', 'Alt Tag Errors':'Alt Tag Error', 'Other Errors':'Other Error'}
#agencyErrorDict was undefined here; totalErrorsByAgency (errors keyed by agency) appears to be the intended argument
output = makeAgencyOutput(agencyErrorSets, totalErrorsByAgency, errorTypes)
finalOutput = {}
finalOutput['data'] = output
writeJson(finalOutput, 'agencies.json')
The current instructions aren't clear about where/how to set the environment variables required for local dev.
With a note to the user that Python 3 is required.
The SSL Labs API will stop automatically guessing whether the domain needs a www prefix in the next version:
http://sourceforge.net/p/ssllabs/mailman/message/34661550/
We do make a best guess at the "canonical" form of a domain in the inspect step, using site-inspector, so we can use this to submit the right endpoint.
That said, since that canonical prefix detection is buggy (and arguably has been giving us incomplete data anyway), we may be better off submitting either or both of the root and www prefixes, based on whether or not we detect HTTPS as available on that endpoint. That will leave less room for bugs and give us more data.
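A sketch of what that endpoint selection might look like — https_live here is a hypothetical mapping of hostname to whether we detected HTTPS on it, not an existing structure in the codebase:

```python
def endpoints_to_submit(domain, https_live):
    """Pick which hostnames to send to the SSL Labs API.

    Submit each prefix (root and www) that responded over HTTPS;
    fall back to the root if neither did, so we still get a result.
    """
    candidates = [domain, "www." + domain]
    live = [host for host in candidates if https_live.get(host)]
    return live or [domain]
```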
$ docker-compose run scan particulier.api.gouv.fr --scan=tls
Results written to CSV.
But the file results/tls.csv is empty:
$ cat results/tls.csv
Domain,Base Domain,Grade,Signature Algorithm,Key Type,Key Size,Forward Secrecy,OCSP Stapling,Fallback SCSV,RC4,SSLv3,TLSv1.2,SPDY,Requires SNI,HTTP/2
But it works well in the web interface.
This command works: docker-compose run scan geo.api.gouv.fr --scan=tls
Nothing is printed on stderr. Do you know where the problem is?
Thanks to #132!
But when I run this command:
./gather censys
--suffix=.gov
--censys_id=id
--censys_key=key
--start=1
--end=2
--delay=5
--debug
Traceback (most recent call last):
File "./gather", line 6, in <module>
import requests
ImportError: No module named 'requests'
I am getting this error. What is the problem? I already tried pip install requests.
Perhaps this is a Docker-ism that I'm not as familiar with, but is there a reason to pin pshtt to a specific version, rather than leaving the version off and getting the latest from PyPI? Is this something to optimize container image building?
Line 139 in 8a002b5
The following domains break the a11y scan such that I have to stop it, remove the domain, and restart the scan all over again.
Two problems result:
afadvantage.gov
ama.gov
banknet.gov
biomassboard.gov
broadband.gov
dea.gov
disasterhousing.gov
export.gov
flightschoolcandidates.gov
grantsolutions.gov
gsaadvantage.gov
gsaauctions.gov
hrsa.gov
hydrogen.gov
idmanagement.gov
invasivespecies.gov
myfdicinsurance.gov
nationalbank.gov
nationalbanknet.gov
nationalhousing.gov
nationalhousinglocator.gov
nhl.gov
nls.gov
onhir.gov
pay.gov
realestatesales.gov
safetyact.gov
sciencebase.gov
segurosocial.gov
selectusa.gov
stopfakes.gov
tvaoig.gov
usdebitcard.gov
These need major refactoring.
The first script formats the a11y CSV into a JSON format:
require 'bundler/setup'
require 'pry'
require 'csv'
require 'json'
require 'parallel'
def get_scan_error row_hash
  {
    "code" => row_hash["code"],
    "typeCode" => row_hash["typeCode"],
    "message" => row_hash["message"],
    "context" => row_hash["context"],
    "selector" => row_hash["selector"],
    "type" => row_hash["typeCode"] == "1" ? "error" : "other"
  }
end
Dir.chdir(File.dirname(__FILE__))
csv_scan = File.read('../data/a11y-8-4-2016-no-2_csv.csv')
inspect_domains = File.read('../data/inspect-domains.csv')
domains = {}
# create domains hash with just domains from inspect file
CSV.parse(inspect_domains, headers: true) do |row|
  row_hash = row.to_hash
  if row_hash["Live"] != "False"
    domains[row_hash["Domain"]] = {
      "Domain Name" => row_hash["Domain"],
      "scan" => []
    }
  end
end
# go through, get each error, add to scan output
CSV.parse(csv_scan, headers: true) do |row|
  row_hash = row.to_hash
  if !domains[row_hash["Domain"]]
    domains[row_hash["Domain"]] = {
      "Domain Name" => row_hash["Domain"],
      "scan" => [get_scan_error(row_hash)]
    }
  else
    domains[row_hash["Domain"]]["scan"] << get_scan_error(row_hash)
  end
end
combined_domains = []
domains.each do |domain|
  combined_domains << domain[1]
end
File.open("../data/a11y-8-4-2016-no-2_csv.json","w") do |f|
  f.write(combined_domains.to_json)
end
The second script takes that JSON and makes the 3 files needed for Pulse:
require 'bundler/setup'
require 'pry'
require 'csv'
require 'json'
def total_errors domain
  errors = domain["scan"].select{|row|
    row["type"] == "error"
  }
  errors.length
end
def get_branch domain, sample
  url = domain["Domain Name"].downcase
  puts "Get Branch url = #{url}"
  branch = ""
  sample["data"].each do |sample|
    if sample["domain"] == url
      branch = sample["branch"]
    end
  end
  puts "Branch = #{branch}"
  branch
end

def get_agency domain, sample
  url = domain["Domain Name"].downcase
  puts "Get Agency url = #{url}"
  agency = ""
  sample["data"].each do |sample|
    if sample["domain"] == url
      agency = sample["agency"]
    end
  end
  puts "Agency = #{agency}"
  agency
end
def get_error_cat_count domain
  errorlist = {
    "Alt Tag Errors" => 0,
    "Color Contrast Errors" => 0,
    "Form Errors" => 0,
    "HTML/Attribute Errors" => 0,
    "Other Errors" => 0
  }
  codes = {
    "1_4." => "Color Contrast Errors"
  }
  domain["scan"].each do |error|
    if error["code"].include? "1_4."
      errorlist["Color Contrast Errors"] = errorlist["Color Contrast Errors"] + 1
    elsif error["code"].include? "1_1."
      errorlist["Alt Tag Errors"] = errorlist["Alt Tag Errors"] + 1
    elsif error["code"].include? "4_1."
      errorlist["HTML/Attribute Errors"] = errorlist["HTML/Attribute Errors"] + 1
    elsif error["code"].include? "1_3."
      errorlist["Form Errors"] = errorlist["Form Errors"] + 1
    else
      errorlist["Other Errors"] = errorlist["Other Errors"] + 1
    end
  end
  errorlist
end
def get_cat_errors domain
  errorlist = {
    "Alt Tag Errors" => [],
    "Color Contrast Errors" => [],
    "Form Errors" => [],
    "HTML/Attribute Errors" => [],
    "Other Errors" => []
  }
  domain["scan"].each do |error|
    if error["code"].include? "1_4."
      errorlist["Color Contrast Errors"] << error
    elsif error["code"].include? "1_1."
      errorlist["Alt Tag Errors"] << error
    elsif error["code"].include? "4_1."
      errorlist["HTML/Attribute Errors"] << error
    elsif error["code"].include? "1_3."
      errorlist["Form Errors"] << error
    else
      errorlist["Other Errors"] << error
    end
  end
  errorlist
end
Dir.chdir(File.dirname(__FILE__))
scans = File.read('../data/a11y-8-4-2016-no-2_csv.json')
domains_sample = JSON.parse(File.read("../data/domains-sample.json"))
error_cats = JSON.parse(File.read('../config/error_cat.json'))
puts domains_sample["data"].length
scans = JSON.parse(scans)
all_errors_count = 0
domains = {}
domains["data"] = []
a11y = {}
a11y["data"] = {}
scans.each do |scan|
  puts scan["Domain Name"]
  puts "Total Errors = #{total_errors scan}"
  puts "Branch = #{get_branch(scan, domains_sample)}"
  all_errors_count += total_errors scan
  domains["data"] << {
    "agency": get_agency(scan, domains_sample),
    "branch": get_branch(scan, domains_sample),
    "canonical": "http://#{scan["Domain Name"].downcase}",
    "domain": scan["Domain Name"].downcase,
    "errors": total_errors(scan),
    "errorlist": get_error_cat_count(scan)
  }
  a11y["data"][scan["Domain Name"].downcase] = get_cat_errors(scan)
end
agencies = {}
agency_hash = {}
domains["data"].each do |domain|
  # the entries pushed above use symbol keys ("agency": ...), so look up with
  # symbols consistently; the original mixed domain["agency"] and domain[:agency],
  # and the string lookups always returned nil
  if agency_hash[domain[:agency]]
    agency = agency_hash[domain[:agency]]
    domain_error_list = domain[:errorlist]
    agency["Average Errors per Page"] += domain[:errors]
    agency["Alt Tag Errors"] += domain_error_list["Alt Tag Errors"]
    agency["HTML/Attribute Errors"] += domain_error_list["HTML/Attribute Errors"]
    agency["Form Errors"] += domain_error_list["Form Errors"]
    agency["Color Contrast Errors"] += domain_error_list["Color Contrast Errors"]
    agency["Other Errors"] += domain_error_list["Other Errors"]
  else
    agency_hash[domain[:agency]] = {}
    agency = agency_hash[domain[:agency]]
    domain_error_list = domain[:errorlist]
    # binding.pry
    agency["Agency"] = domain[:agency]
    agency["Average Errors per Page"] = domain[:errors]
    agency["Alt Tag Errors"] = domain_error_list["Alt Tag Errors"]
    agency["HTML/Attribute Errors"] = domain_error_list["HTML/Attribute Errors"]
    agency["Form Errors"] = domain_error_list["Form Errors"]
    agency["Color Contrast Errors"] = domain_error_list["Color Contrast Errors"]
    agency["Other Errors"] = domain_error_list["Other Errors"]
  end
  if domain[:agency] == "Department of Defense"
    puts agency_hash["Department of Defense"]
  end
end
agencies["data"] = agency_hash.map{|agency|
  agency[1]
}
puts domains["data"].first
puts "Total Errors #{all_errors_count}"
File.open("../data/domains.json","w") do |f|
  f.write(domains.to_json)
end
File.open("../data/a11y.json","w") do |f|
  f.write(a11y.to_json)
end
File.open("../data/agencies.json","w") do |f|
  f.write(agencies.to_json)
end
Starting new HTTPS connection (1): www.censys.io
https://www.censys.io:443 "GET /api/v1/account HTTP/1.1" 200 243
Censys query:
SELECT parsed.subject.common_name, parsed.extensions.subject_alt_name.dns_names from FLATTEN([certificates.certificates], parsed.extensions.subject_alt_name.dns_names) where parsed.subject.common_name LIKE "%.gov" OR parsed.extensions.subject_alt_name.dns_names LIKE "%.gov";
Kicking off SQL query job.
https://www.censys.io:443 "POST /api/v1/export HTTP/1.1" 403 115
Traceback (most recent call last):
File "/home/user/domain-scan/gatherers/censys.py", line 194, in export_mode
job = export_api.new_job(query, format='csv', flatten=True)
File "/home/user/domain-scan/censys/export.py", line 25, in new_job
return self._post("export", data=data)
File "/home/user/domain-scan/censys/base.py", line 111, in _post
return self._make_call(self._session.post, endpoint, args, data)
File "/home/user/domain-scan/censys/base.py", line 105, in _make_call
const=const)
censys.base.CensysUnauthorizedException: 403 (unauthorized): Unauthorized. You do not have access to this service.
Censys error, aborting.
Downloading results of SQL query.
Traceback (most recent call last):
File "./gather", line 175, in
run(options)
File "./gather", line 73, in run
for domain in gatherer.gather(suffix, options, extra):
File "/home/user/domain-scan/gatherers/censys.py", line 66, in gather
hostnames_map = export_mode(suffix, options, uid, api_key)
File "/home/user/domain-scan/gatherers/censys.py", line 231, in export_mode
utils.download(results_url, download_file)
File "/home/user/domain-scan/scanners/utils.py", line 34, in download
filename, headers = urllib.request.urlretrieve(url, destination)
File "/usr/lib/python3.4/urllib/request.py", line 184, in urlretrieve
url_type, path = splittype(url)
File "/usr/lib/python3.4/urllib/parse.py", line 857, in splittype
match = _typeprog.match(url)
TypeError: expected string or buffer
How can I run this?
At the end of the scan, --debug or not, list any errors that gave an invalid: true response in their cached data. Possibly include it in the meta.json file for the scan, too. It should be easy to find these and re-scan them.
When the version of pshtt that is loaded by domain-scan lands in a ConnectionError or RequestException exception case in its basic_check method, it dumps an error and returns an empty string where a JSON object representing the domain's test results should be. This ripples through sslyze and errors out of the domain-scan pshtt invocation altogether (i.e., "Bad news scanning"), and never writes an entry to the results. Here's an example (domain redacted; hit me up for a live example):
› docker-compose run scan domain.tld --scan=pshtt --debug --force
[domain.tld][pshtt]
/opt/pyenv/versions/2.7.11/bin/pshtt domain.tld
Failed to connect.
Certificate did not match expected hostname: domain.tld. Certificate: {[certificate chain information]}
Traceback (most recent call last):
File "/opt/pyenv/versions/2.7.11/bin/pshtt", line 9, in <module>
load_entry_point('pshtt==0.1.5', 'console_scripts', 'pshtt')()
File "/opt/pyenv/versions/2.7.11/lib/python2.7/site-packages/pshtt/cli.py", line 54, in main
results = pshtt.inspect_domains(domains, options)
File "/opt/pyenv/versions/2.7.11/lib/python2.7/site-packages/pshtt/pshtt.py", line 882, in inspect_domains
results.append(inspect(domain))
File "/opt/pyenv/versions/2.7.11/lib/python2.7/site-packages/pshtt/pshtt.py", line 63, in inspect
basic_check(domain.https)
File "/opt/pyenv/versions/2.7.11/lib/python2.7/site-packages/pshtt/pshtt.py", line 173, in basic_check
https_check(endpoint)
File "/opt/pyenv/versions/2.7.11/lib/python2.7/site-packages/pshtt/pshtt.py", line 331, in https_check
cert_plugin_result = cert_plugin.process_task(server_info, 'certinfo_basic')
File "/opt/pyenv/versions/2.7.11/lib/python2.7/site-packages/sslyze/plugins/certificate_info_plugin.py", line 115, in process_task
if scan_command.custom_ca_file:
AttributeError: 'str' object has no attribute 'custom_ca_file'
Error running eval "$(pyenv init -)" && pyenv shell 2.7.11 && /opt/pyenv/versions/2.7.11/bin/pshtt domain.tld --json --user-agent "github.com/18f/domain-scan, pshtt.py" --timeout 30 --preload-cache ./cache/preload-list.json.
Bad news scanning, sorry!
Results written to CSV.
That last bit is a lie... Nothing is written because the raw data out of the pshtt invocation is None.
This ultimately is an issue with the way exception cases in pshtt are ordered in the overall flow of logic, and I'll open up an issue there and eventually offer a solution. The reason I'm bringing it up here, is because pshtt will produce a report with these exception cases properly reflected (i.e., as failing), but domain-scan just ends up dropping them from the report all together, which can be confusing as anything when your target list of 12K domains only results in a results.csv of 11,999 rows (gah!). So this is more just an "awareness" issue.
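Until the exception ordering in pshtt is fixed, domain-scan could at least keep the dropped domains visible in the output. A sketch, with to_rows standing in for the real result handling (not the actual scanner code):

```python
def to_rows(domain, raw):
    """Convert raw pshtt output to result rows, keeping failed domains visible.

    When pshtt returns nothing (raw is None or an empty string), emit a
    placeholder row marked as errored rather than silently dropping the
    domain from results.csv.
    """
    if not raw:
        return [[domain, "error", "no data returned by pshtt"]]
    return [[domain, "ok", raw]]
```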
We're replacing site-inspector with pshtt. Eventually, Pulse will stop including the inspect scan in its weekly run. The pa11y scanner should rely on pshtt results instead of inspect results.
Doing this all in serial is a big waste, especially since the requests are hitting different domains. Parallelizing the tasks won't increase load on any scanned site, but will drastically speed up completion time.
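A minimal sketch of that parallelization with a thread pool; scan_domain is a hypothetical per-domain task, and since each request targets a different domain, concurrency adds no per-site load:

```python
from concurrent.futures import ThreadPoolExecutor

def scan_all(domains, scan_domain, workers=10):
    """Run scan_domain over every domain concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scan_domain, domains))
```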
The overall process here should be very light, and less tied to any one programming language.
https://github.com/benbalter/site-inspector-ruby#command-line-usage
It needs to be modified to output JSON, though, instead of key-value line by line.
In the current workflow, in step 5, I run the following command: docker-compose run scan domains.csv --scan=inspect,a11y --debug. However, if I am using a domains.csv that has been derived straight from recent DAP results, there's no need to run the inspect command (which adds a decent bit of time to the scan). It would be faster if I could just run the a11y scan without the inspect scan: docker-compose run scan domains.csv --scan=a11y --debug.
This does not work though, as it seems that the a11y scan depends on the inspect scan having already been run. You can see the error message below.
It would be handy to be able to run the a11y scan without it needing the inspect scan cache results.
[youthrules.gov][a11y]
Traceback (most recent call last):
File "./scan", line 120, in process_scan
rows = list(scanner.scan(domain, options))
File "/home/scanner/scanners/a11y.py", line 197, in scan
inspect_data = get_from_inspect_cache(domain)
File "/home/scanner/scanners/a11y.py", line 23, in get_from_inspect_cache
inspect_raw = open(inspect_cache).read()
FileNotFoundError: [Errno 2] No such file or directory: './cache/inspect/youthrules.gov.json'
This looks promising:
http://nabla-c0d3.github.io/blog/2016/02/01/sslyze-0.13.3-released/
When it comes to HSTS preloading, we're measuring the qualities of the domain, not the site at the root of the domain. So, if a domain appears in the list but doesn't include subdomains, for the purpose of our measurement, it should be considered not preloaded.
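That rule is easy to encode. A sketch, assuming entries shaped like Chromium's preload JSON ("name", "include_subdomains", "mode" — worth verifying against the current file format):

```python
def is_preloaded(domain, preload_entries):
    """True only when the domain is in the list with subdomain coverage.

    An entry without include_subdomains covers only the host at the domain
    root, so for measuring the *domain* we treat it as not preloaded.
    """
    for entry in preload_entries:
        if entry.get("name") == domain:
            return bool(entry.get("include_subdomains")) and \
                entry.get("mode") == "force-https"
    return False
```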
This seems straightforward enough:
https://censys.io/api/v1/docs/export
I think our hand is forced by the maximum page limit recently enforced in Censys' search API, but the export API is also probably orders of magnitude quicker than the search API for the same data.
Note: this is a potentially big task, that should be broken into smaller tasks/stages. But also, there is value to starting with a naive, simple crawler and leveling it up in stages.
Either baked into domain-scan, or finding/making a separate tool that does this. We could also potentially use Common Crawl data.
But the basic need is to gather domains through web crawling, as this is a fertile source for hostnames that do not appear in Censys.io. For .gov, both Censys and the LOC's web crawl (the End of Term Archive) each had ~50% of unique domains not found through any other public method. The LOC crawl data, performed in late 2016, is getting more stale by the month, and also won't be helpful for non-USG sources.
In step 6 of the a11y scanning process, the process_a11y.py script is not factoring in domains that have no errors since it is just building from the results of the a11y.csv file generated in step 5, however it needs to.
Imagine an executive branch agency with three active, non-redirecting domains. After step 5 completes, there are only error results for two domains (either because the third domain did not scan successfully or because no errors were detected). The problem is that step 6 computes based on the a11y.csv file of individual error results and does not factor in the total domain set that it should be considering.
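The fix amounts to iterating over the full domain set and defaulting missing domains to zero errors, rather than building agency totals only from rows present in a11y.csv. A sketch with hypothetical inputs (not the actual process_a11y.py code):

```python
def agency_error_counts(domains, error_counts):
    """Total errors per agency across *all* active domains.

    domains: list of (domain, agency) pairs — the full set from step 3.
    error_counts: dict of domain -> error count built from a11y.csv; domains
    absent from it (scanned clean, or failed to scan) count as zero, so
    every agency's full domain set factors into its totals.
    """
    totals = {}
    for domain, agency in domains:
        totals[agency] = totals.get(agency, 0) + error_counts.get(domain, 0)
    return totals
```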
Expand that script, which runs first in the a11y scan process, to have the resulting inspect.csv include columns for Agency and Branch.
Agency could carry over from the domains.csv file that inspect.py is running on.
Branch - I would think that this could be done using the same method that is applied in this later script.
We would need to ensure that these changes do not adversely impact the workflows for the HTTPS or DAP sections. cc @konklone
The benefit of making these changes is that it would make resolving #101 and, to a degree, #102 easier.
$ docker --version
Docker version 1.9.1, build a34a1d5
$ docker-compose up
Building scan
Step 1 : FROM ubuntu:14.04.3
---> 6cc0fc2a5ee3
Step 2 : MAINTAINER V. David Zvenyach <[email protected]>
---> Using cache
---> 9c7124f58945
Step 3 : RUN apt-get update -qq && apt-get install -qq --yes --no-install-recommends --no-install-suggests build-essential=11.6ubuntu6 curl=7.35.0-1ubuntu2.5 git=1:1.9.1-1ubuntu0.1 libc6-dev=2.19-0ubuntu6.6 libfontconfig1=2.11.0-0ubuntu4.1 libreadline-dev=6.3-4ubuntu2 libssl-dev=1.0.1f-1ubuntu2.15 libssl-doc=1.0.1f-1ubuntu2.15 libxml2-dev=2.9.1+dfsg1-3ubuntu4.4 libxslt1-dev=1.1.28-2build1 libyaml-dev=0.1.4-3ubuntu3.1 make=3.81-8.2ubuntu3 nodejs=0.10.25~dfsg2-2ubuntu1 npm=1.3.10~dfsg-1 python3-dev=3.4.0-0ubuntu2 python3-pip=1.5.4-1ubuntu3 unzip=6.0-9ubuntu1.3 wget=1.15-1ubuntu1.14.04.1 zlib1g-dev=1:1.2.8.dfsg-1ubuntu1 autoconf=2.69-6 automake=1:1.14.1-2ubuntu1 bison=2:3.0.2.dfsg-2 gawk=1:4.0.1+dfsg-2.1ubuntu2 libffi-dev=3.1~rc1+r3.0.13-12 libgdbm-dev=1.8.3-12build1 libncurses5-dev=5.9+20140118-1ubuntu1 libsqlite3-dev=3.8.2-1ubuntu2.1 libtool=2.4.2-1.7ubuntu1 pkg-config=0.26-1ubuntu4 sqlite3=3.8.2-1ubuntu2.1 && apt-get clean && rm -rf /var/lib/apt/lists/*
---> Running in eb3383000cf7
E: Version '7.35.0-1ubuntu2.5' for 'curl' was not found
E: Version '1:1.9.1-1ubuntu0.1' for 'git' was not found
E: Version '1.0.1f-1ubuntu2.15' for 'libssl-dev' was not found
E: Version '1.0.1f-1ubuntu2.15' for 'libssl-doc' was not found
E: Version '2.9.1+dfsg1-3ubuntu4.4' for 'libxml2-dev' was not found
E: Version '6.0-9ubuntu1.3' for 'unzip' was not found
ERROR: Service 'scan' failed to build: The command '/bin/sh -c apt-get update -qq && apt-get install -qq --yes --no-install-recommends --no-install-suggests build-essential=11.6ubuntu6 curl=7.35.0-1ubuntu2.5 git=1:1.9.1-1ubuntu0.1 libc6-dev=2.19-0ubuntu6.6 libfontconfig1=2.11.0-0ubuntu4.1 libreadline-dev=6.3-4ubuntu2 libssl-dev=1.0.1f-1ubuntu2.15 libssl-doc=1.0.1f-1ubuntu2.15 libxml2-dev=2.9.1+dfsg1-3ubuntu4.4 libxslt1-dev=1.1.28-2build1 libyaml-dev=0.1.4-3ubuntu3.1 make=3.81-8.2ubuntu3 nodejs=0.10.25~dfsg2-2ubuntu1 npm=1.3.10~dfsg-1 python3-dev=3.4.0-0ubuntu2 python3-pip=1.5.4-1ubuntu3 unzip=6.0-9ubuntu1.3 wget=1.15-1ubuntu1.14.04.1 zlib1g-dev=1:1.2.8.dfsg-1ubuntu1 autoconf=2.69-6 automake=1:1.14.1-2ubuntu1 bison=2:3.0.2.dfsg-2 gawk=1:4.0.1+dfsg-2.1ubuntu2 libffi-dev=3.1~rc1+r3.0.13-12 libgdbm-dev=1.8.3-12build1 libncurses5-dev=5.9+20140118-1ubuntu1 libsqlite3-dev=3.8.2-1ubuntu2.1 libtool=2.4.2-1.7ubuntu1 pkg-config=0.26-1ubuntu4 sqlite3=3.8.2-1ubuntu2.1 && apt-get clean && rm -rf /var/lib/apt/lists/*' returned a non-zero code: 100
We should move the scripts/ directory, and the associated unit tests, into the Pulse repo. They are not part of domain-scan's scope.
As it currently stands, I don't think there is a way to use the gather tool from within the Docker images built from this repo. I think it's worth creating a Dockerfile for building an image used to gather hostnames before scanning. I would love some feedback on this idea, and would be more than happy to help out if it is something that is wanted.
Right now, the following edits are manually made during the a11y scan process. We should go through and change the scripts and scans to address these so that I no longer need to manually make them:
Have gotten a handful of these notifications:
There seems to have been an issue with your Automated Build "18fgsa/domain-scan" (VCS repository: 18F/domain-scan) during the build step. You can find more information on
https://hub.docker.com/r/18fgsa/domain-scan/builds/bjafkmaqrftff5tgaytnwgb/
The README doesn't explain how to get the results from running docker-compose run scan. Are they downloaded locally, or do they have to be fetched from the Docker image somehow?
As in #36, update the scanner to take a URL as the primary input CSV, not just a local file.
It'd be nice to be able to easily scan domains for their SMTP configuration as well, especially given DROWN. I'll be working on this.
The lambda/remote_build.sh script has the commands I use to build the domain-scan Lambda environment, but it's not repeatable, and rebuilds require me to copy/paste manual subsets of the instructions.
This is going to become more of a burden over time, as any updates to dependencies will require a rebuild to capture these changes (and pshtt itself is likely to keep rapidly improving in very relevant ways), followed by a re-upload to Lambda.
{
  "ignore": [
    "notice",
    "warning",
    "WCAG2AA.Principle1.Guideline1_4.1_4_3.G18.BgImage",
    "WCAG2AA.Principle1.Guideline1_4.1_4_3.G18.Abs",
    "WCAG2AA.Principle1.Guideline1_4.1_4_3.G145.Abs",
    "WCAG2AA.Principle3.Guideline3_1.3_1_1.H57.2",
    "WCAG2AA.Principle3.Guideline3_1.3_1_1.H57.3",
    "WCAG2AA.Principle3.Guideline3_1.3_1_2.H58.1",
    "WCAG2AA.Principle4.Guideline4_1.4_1_1.F77"
  ]
}
In step 5 of the a11y scanning process, the a11y.py scan is not excluding individual errors that are listed in the ignore list. Right now, to get by, I go back in and hand remove them from the a11y.csv file that is generated after step 5, but this is laborious and error prone.
Notices and warnings are correctly excluded but not the individual errors. It's as if I hadn't included them there.
I suspect that this comes from them being improperly formatted or referenced, though I don't know how. Here's some documentation that I've found:
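As a stopgap for the hand-editing, the generated a11y.csv could be post-filtered against the same ignore list. A sketch, where code_column is an assumption about which a11y.csv column holds the full WCAG2AA code — adjust to the real layout:

```python
def filter_ignored(rows, ignore, code_column=4):
    """Drop rows whose error code appears in the pa11y ignore list."""
    ignored = set(ignore)
    return [row for row in rows if row[code_column] not in ignored]
```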
In step 6 of the a11y scanning process, the process_a11y.py script is not removing inactive and redirecting domains automatically. Right now, to get by, I go in and hand remove them from the domains.csv file as an extra part of step 3 of the a11y scanning process , but this is laborious and error prone.
I don't believe that this is broken functionality so much as functionality that has never existed. It'll be helpful to automate this. It doesn't actually matter at what step this occurs, so long as the domains from these other branches are not present in the final files generated in step 6.
Get the Chrome preload list from source control somehow, and measure whether the domain is actually in there. This can help eliminate the gap between what we think is preload-ready and what actually made it.
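A sketch of pulling the list from Chromium's source tree. The URL shape and gitiles' base64 ?format=TEXT behavior are assumptions to verify; the file also mixes // comments into its JSON, which need stripping before parsing:

```python
import base64
import json
import re
import urllib.request

# Chromium's preload list in source control; gitiles returns base64 with
# ?format=TEXT. Both details are assumptions — check the current tree.
PRELOAD_URL = (
    "https://chromium.googlesource.com/chromium/src/+/main/"
    "net/http/transport_security_state_static.json?format=TEXT"
)

def parse_preload(raw_json):
    """Parse the preload file, which is JSON with // comment lines mixed in."""
    stripped = re.sub(r'^\s*//.*$', '', raw_json, flags=re.MULTILINE)
    return {entry["name"] for entry in json.loads(stripped)["entries"]}

def fetch_preload_names():
    with urllib.request.urlopen(PRELOAD_URL) as response:
        raw = base64.b64decode(response.read()).decode("utf-8")
    return parse_preload(raw)
```

With the actual set of names in hand, comparing it against what we believe is preload-ready becomes a set difference.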
There are no more defunct processes after #151, and our bulk scans go for much longer before they become an issue, but eventually they do just get stuck. Or as this says:
"If a process is forked while an io lock is being held, the child process will deadlock on the next call to flush."
After looking at the stuck processes' trace with gdb, I'm convinced I'm facing the same issue described here:
https://stackoverflow.com/questions/39884898/large-amount-of-multiprocessing-process-causing-deadlock
And that this bug, opened in 2009 and still quite actively discussed in October 2017, is the cause:
https://bugs.python.org/issue6721
The folks on that bug thread seem to be converging on a fix that is specific to logging calls and buffered IO, which I suspect would be enough to fix our case. There's also some related discussion on this bug, with Guido indicating he believes something should be done for the GC interrupt case.
There are a few ways to work around this I can think of:
- Use the SynchronousScanner instead of the ConcurrentScanner. However, this is both much slower and results in a distinct memory leak, as noted in #151.
- It may be sslyze that is the problem, which would mean fixing it upstream. Workers at the domain-scan level are done as threads via a ThreadPoolExecutor, whereas I believe it's SSLyze's ConcurrentScanner that forks off processes. So I may have limited control here. The only reference to using stdout in SSLyze's core is this emergency shutdown message, so I am not sure where in SSLyze this might be happening.
- Install a patched version of the Python module from a fork (pip supports github repo syntax).

Right now, I'm leaning toward the 3rd option, the Python module. I'll try it out and see how it goes.
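One more angle, if fork inheritance really is the trigger: CPython's "spawn" start method launches workers in a fresh interpreter instead of fork()ing, so a child can never inherit a held IO or logging lock. Whether SSLyze's ConcurrentScanner can be made to use it is an open question; this is just the general shape, with a placeholder worker:

```python
import multiprocessing as mp

def square(n):
    # Placeholder for per-domain scan work.
    return n * n

if __name__ == "__main__":
    # "spawn" starts each worker in a fresh interpreter, so a lock held
    # by the parent at fork time can never deadlock the child.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        print(pool.map(square, range(5)))
```

The tradeoff is slower worker startup, since each child re-imports the module.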
Looks like a few of them ran, but then I got an error.
[acus.gov]
[acus.gov]
Fetched, cached.
[achp.gov]
[achp.gov]
Fetched, cached.
[preserveamerica.gov]
[preserveamerica.gov]
Fetched, cached.
[adf.gov]
[adf.gov]
Fetched, cached.
[usadf.gov]
[usadf.gov]
Fetched, cached.
[abmc.gov]
[abmc.gov]
Fetched, cached.
[amtrakoig.gov]
[amtrakoig.gov]
Fetched, cached.
[arc.gov]
[arc.gov]
Fetched, cached.
[afrh.gov]
[afrh.gov]
Fetched, cached.
[cia.gov]
[cia.gov]
Fetched, cached.
[ic.gov]
[ic.gov]
/Library/Ruby/Gems/2.0.0/gems/site-inspector-1.0.0/lib/site-inspector/headers.rb:24:in `strict_transport_security': undefined method `[]' for nil:NilClass (NoMethodError)
from /Library/Ruby/Gems/2.0.0/gems/site-inspector-1.0.0/lib/site-inspector/headers.rb:10:in `strict_transport_security?'
from ./https-scan.rb:137:in `domain_details'
from ./https-scan.rb:105:in `check_domain'
from ./https-scan.rb:51:in `block (2 levels) in go'
from /System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/csv.rb:1716:in `each'
from /System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/csv.rb:1120:in `block in foreach'
from /System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/csv.rb:1266:in `open'
from /System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/csv.rb:1119:in `foreach'
from ./https-scan.rb:31:in `block in go'
from /System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/csv.rb:1266:in `open'
from ./https-scan.rb:17:in `go'
from ./https-scan.rb:152:in
They currently use the inspect scanner (to determine the "canonical" URL); let's have them use pshtt instead.
Hey! Just checking out this repo and noticed that the Dockerfile as-is doesn't build for me:
docker@boot2docker:~/domain-scan$ docker run 5740b16bdfb7
Traceback (most recent call last):
File "/tmp/scan", line 6, in <module>
from scanners import utils
File "/tmp/scanners/utils.py", line 10, in <module>
import strict_rfc3339
ImportError: No module named 'strict_rfc3339'
Running in boot2docker 1.6.2, and docker version yields:
docker@boot2docker:~/domain-scan$ docker version
Client version: 1.6.2
Client API version: 1.18
Go version (client): go1.4.2
Git commit (client): 7c8fca2
OS/Arch (client): linux/amd64
Server version: 1.6.2
Server API version: 1.18
Go version (server): go1.4.2
Git commit (server): 7c8fca2
OS/Arch (server): linux/amd64
It should be easy to compare the results of one output to another, using diff tools, without worrying about the order that all the asynchronous parallelized tasks happened to complete in.
The various output .csv's should be ordered alphabetically, by domain name, with the header intact at the top. Hopefully this can be done more-or-less in place in a quick, streaming way.
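A minimal sketch of that post-processing step, assuming the domain is the first CSV column. It sorts in memory rather than truly streaming, which should still be quick at .gov-list scale:

```python
import csv
import io

def sort_by_domain(csv_text):
    """Return the CSV with the header intact and body rows sorted by domain."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    # Sort on the first column, which is assumed to hold the domain name.
    body.sort(key=lambda row: row[0])
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(body)
    return out.getvalue()
```

With deterministic ordering, two scan runs diff cleanly regardless of which parallel task finished first.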
Just took a look at the Dockerfile for the first time, and was surprised to see how much is in there. I guess it's because the various tools being used all have different dependencies?
Having multiple languages in a single Dockerfile is an antipattern (IMHO), and I think the setup for each scanner could be a lot simpler if you isolated each tool to its own Dockerfile. These could then be run independently, or via a domain-scan Dockerfile that calls out to docker run <scanner> and then stitches the results together.
I got this idea from the architecture of the Code Climate CLI, so you could look there for inspiration if you're interested in pursuing this.
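For what it's worth, the orchestration side could be as thin as building one docker run invocation per scanner. The domain-scan/<scanner> image naming and the flags here are purely hypothetical conventions, not anything the repo defines:

```python
def scanner_command(scanner, domains_path, output_dir):
    """Build a `docker run` invocation for one isolated scanner image.

    Assumes a hypothetical domain-scan/<scanner> image per tool, each
    writing its results into a shared mounted /output directory.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{output_dir}:/output",
        f"domain-scan/{scanner}",
        domains_path,
    ]
```

The top-level process would run one of these per scanner (via subprocess) and then merge the per-scanner CSVs.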
Ideas here:
In step 6 of the a11y scanning process, the process_a11y.py script is not removing domains from the legislative, judicial, and 'non-federal' branches automatically. Right now, to get by, I go in and hand-remove them from the domains.csv file as an extra part of step 3 of the a11y scanning process, but this is laborious and error prone.
I don't believe that this is broken functionality so much as functionality that has never existed. It'll be helpful to automate this. It doesn't actually matter at what step this occurs, so long as the domains from these other branches are not present in the final files generated in step 6.
Here is the list of agencies that shouldn't be included in the a11y scan.
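A sketch of automating that, assuming the current-federal.csv convention where the domain type encodes the branch (e.g. "Federal Agency - Executive"); the column name and value format are assumptions to verify against the actual file:

```python
import csv
import io

def keep_executive_only(csv_text, type_column="Domain Type"):
    """Drop legislative, judicial, and non-federal rows from a domain list."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if "executive" in row.get(type_column, "").lower()
    ]
```

An agency-name blocklist built from the list above would work too; filtering on branch is just less to maintain by hand.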