18f / domain-scan

A lightweight pipeline, locally or in Lambda, for scanning things like HTTPS, third party service use, and web accessibility.

License: Other

Python 88.90% Shell 2.81% JavaScript 6.41% Dockerfile 1.87%

domain-scan's Issues

Tests

I know we're all busy, but I figured it's at least worth starting the discussion around this repo's lack of tests. Right now, only a Python linter (?) is run.

I'll take a first stab at what I think should be tested:

  • each scanner with unit tests
  • the ./scan command with unit tests (throw all different args at it, see if it still works)
  • an integration test for a big scan
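
A minimal pytest sketch for the first bullet, assuming a scanner module exposes a scan(domain, options) generator as the tracebacks elsewhere in this tracker suggest; the noop scanner name and the row shape are placeholders, not the real API:

from scanners import noop  # hypothetical scanner module, used for illustration only


def test_scanner_yields_rows():
    # The options dict mirrors what ./scan would pass through; contents assumed.
    rows = list(noop.scan("example.gov", {"debug": False}))
    assert rows, "a live domain should produce at least one result row"


def test_scanner_handles_unreachable_domain():
    # Scanners should fail soft (empty result), not raise, for dead domains.
    rows = list(noop.scan("unreachable.example.gov", {"debug": False}))
    assert rows == []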

sslyze appearing to stall out on some domains

I don't have enough information to report a bug yet. When I checked on our server, the sslyze scans had stalled out with 9 in-flight, with a bunch of defunct sslyze processes. These were the domains:

nces.ed.gov
autodiscover.ors.od.nih.gov
stg-reg2.hcia.cms.gov
vpn1.cjis.gov
portcullis.nlrb.gov
safesupportivelearning.ed.gov
my.uscis.gov
www.educationusa.state.gov
pittsburgh.feb.gov

But when I ran a scan using sslyze on all 9 of those in a row, using --serial, none of them stalled out. So I'm not totally sure how to reproduce this.

No module named 'gatherers.'

When I use

sudo ./gather censys,dap,
--suffix=.gov
--censys_id=id
--censys_key=key
--dap=https://analytics.usa.gov/data/live/sites-extended.csv
--parents=https://raw.githubusercontent.com/GSA/data/gh-pages/dotgov-domains/current-federal.csv

this error occurs:

Done fetching from API.
Results written to CSV.
rootk@ubuntu:~/domain-scan-master$ ./st2.sh
Fetching up to 100 records, starting at page 1.
[1] Cached page.
[] Gatherer not found, or had an error during loading.
ERROR: <class 'ImportError'>

No module named 'gatherers.' 

What is the problem?
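
The empty brackets in the "[] Gatherer not found" line and the trailing dot in 'gatherers.' suggest the trailing comma in censys,dap, is being split into an empty gatherer name. A minimal sketch of the likely failure, assuming ./gather loads gatherers by name with importlib (the real loading code may differ):

import importlib

# Run from the domain-scan checkout so the gatherers package is importable.
names = "censys,dap,".split(",")
print(names)   # -> ['censys', 'dap', ''], note the empty final name

for name in names:
    try:
        importlib.import_module("gatherers.%s" % name)
    except ImportError as exc:
        # With name == '', this tries to import "gatherers." and fails with
        # "No module named 'gatherers.'", matching the output above.
        print("[%s] Gatherer not found, or had an error during loading." % name)
        print("ERROR:", type(exc))

Dropping the trailing comma (./gather censys,dap ...) should avoid the error.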

sslyze error where (apparently) no certificates are delivered

Observing this sslyze error during scans:

Traceback (most recent call last):

  File "/opt/scan/domain-scan/scan", line 120, in process_scan
    rows = list(scanner.scan(domain, options))

  File "/opt/scan/domain-scan/scanners/sslyze.py", line 75, in scan
    data = parse_sslyze(xml)

  File "/opt/scan/domain-scan/scanners/sslyze.py", line 205, in parse_sslyze
    issuer = certificates[-1].select_one("issuer commonName")

IndexError: list index out of range

These appear to happen after a long timeout, suggesting that there could be a connection/timeout error that results in no certificate data being available.
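
A sketch of a defensive fix in parse_sslyze(), assuming the XML is parsed with BeautifulSoup-style selectors as the traceback implies; the selector strings and field names here are illustrative:

def parse_certificates(xml):
    # Selector is illustrative; the real parse_sslyze() selects the received
    # certificate chain out of sslyze's XML output.
    certificates = xml.select("certificateChain certificate")
    if not certificates:
        # A connection or timeout failure can leave no certificate data at all;
        # return empty fields instead of hitting IndexError on certificates[-1].
        return {"issuer": None}
    issuer = certificates[-1].select_one("issuer commonName")
    return {"issuer": issuer.text if issuer else None}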

Change the way `try_command()` checks command

I'm trying to call pshtt as a docker container, which results in something like the following command being executed by domain-scan:

docker run --rm -e USER_ID=1042 -e GROUP_ID=1042 -v $(pwd):/data dockerpulse_c-pshtt rijksoverheid.nl

This fails when a scanner tries this command using `try_command()`, because it runs `which` on the whole command, including parameters. This could be 'fixed' by passing only the actual executable of the command to `which`, e.g.:

subprocess.check_call(["which", command.split(' ')[0]], ...)

I'm not sure if this has any implications, so I made an issue out of this instead of a pull request.
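
For context, a fuller sketch of try_command() with that change applied; the function's real signature and logging are assumptions:

import subprocess


def try_command(command):
    try:
        # Only check the executable itself, not its arguments, so a command like
        # "docker run --rm ... dockerpulse_c-pshtt rijksoverheid.nl" passes the check.
        executable = command.split(" ")[0]
        subprocess.check_call(["which", executable],
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
        return True
    except subprocess.CalledProcessError:
        return False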

Tindel's data wrangling script

@jtexnl's script, saving here for posterity:

import collections
import csv
import json
import re

def readData(inputFile):
    outList = []
    with open(inputFile, 'rU') as infile:
        reader = csv.reader(infile)
        firstRow = True
        for row in reader:
            if firstRow == True:
                firstRow = False
                continue
            else:
                outList.append(row)
    return outList

def writeJson(inputData, fileName):
    with open(fileName, 'w+') as outfile:
        json.dump(inputData, outfile, indent = 4)

def makeAgencyOutput(inputList, errorDict, errorTypeDict):
    output = []
    for row in inputList:
        subSet = row[1]
        subDict = collections.OrderedDict({})
        subDict['Agency'] = row[0]
        subDict['Errors'] = errorDict[row[0]]
        for key, value in errorTypeDict.items():
            k = key
            try:
                subDict[k] = subSet[value]
            except KeyError:
                subDict[k] = 0
            except TypeError:
                subDict[k] = 0
        output.append(subDict)
    return output

def getKey(item): 
    return item[0]

def trimErrorField(errorField):
    pieces = re.split('.*(Guideline)', errorField)
    shortened = pieces[-1]
    pieces = shortened.split('.')
    num = pieces[0]
    return num

def categorize(dataset, referenceDict, colNum, altName):
    for row in dataset:
        if row[colNum] in referenceDict.keys():
            row.append(referenceDict[row[colNum]])
        else:
            row.append(altName)
    return dataset

def countDict(dataset, colIndex):
    output = {}
    for row in dataset:
        if row[colIndex] in output:
            output[row[colIndex]] += 1
        else:
            output[row[colIndex]] = 1
    return output

#Read in a11y.csv for errors and domains.csv for agencies
ally1 = readData('a11y.csv')
domains = readData('domains.csv')
#need to remove ussm.gov, whistleblower.gov, and safeocs.gov from ally due to discrepancies between the datasets. Solve at some point
ally = []
for row in ally1:
    if row[0] != 'safeocs.gov' and row[0] != 'whistleblower.gov' and row[0] != 'ussm.gov':
        ally.append(row)

#Truncate the a11y file so that it's a bit more manageable. Need the domain name [0] and the principle [4]
main = []
for row in ally:
    main.append([row[0], trimErrorField(row[4])])

#Add the information on the agency [1] and branch [2]
for error in main:
    for domain in domains:
        if error[0] == domain[0].lower():
            error.append(domain[1])
            error.append(domain[2])

#Dictionaries; branches = branch lookup, errorCats = error category lookup
branches = {"Library of Congress":"Legislative","The Legislative Branch (Congress)":"Legislative",
"Government Printing Office":"Legislative","Congressional Office of Compliance":"Legislative",
"The Judicial Branch (Courts)":"Judicial"}
errorCats = {'1_4':'Color Contrast Error', '1_1':'Alt Tag Error', '4_1':'HTML/Attribute Error', '1_3':'Form Error'}

#define branches for the 'main' and 'domains' sets, define error categories for 'main'
main = categorize(main, branches, -1, 'Executive')
domains = categorize(domains, branches, 2, 'Executive')
main = categorize(main, errorCats, 1, 'Other Error')

totalErrorsByDomain = countDict(main, 0)
totalErrorsByAgency = countDict(main, 3)

#create dict of base vs. canonical domains
canonicals = {}
for row in ally:
    try:
        if row[0] in canonicals.keys():
            continue
        else:
            canonicals[row[0]] = row[1]
    except KeyError:
        continue


noErrors = []
errors = []
for domain in domains:
    if not domain[0].lower() in totalErrorsByDomain.keys():
        noErrors.append(domain)
    else:
        errors.append(domain)

for row in noErrors:
    row.append(0)
    row.append({})
    try:
        if row[0] in canonicals.keys():
            row.append('http://' + canonicals[row[0].lower()])
        else:
            row.append('http://' + row[0].lower())
    except TypeError:
        continue

for row in errors:
    row.append(totalErrorsByDomain[row[0].lower()])
    subset = []
    for line in main:
        if line[0] == row[0].lower():
            subset.append(line)
    errorDict = countDict(subset, -1)
    row.append(errorDict)
    try:
        if row[0] in canonicals.keys():
            row.append('http://' + canonicals[row[0].lower()])
        else:
            row.append('http://' + row[0].lower())
    except TypeError:
        continue

domains = errors + noErrors
domains = sorted(domains, key = getKey)

dictList = []
for row in domains:
    subDict = collections.OrderedDict({})
    subDict['agency'] = row[2]
    subDict['branch'] = row[5]
    subDict['canonical'] = row[8]
    subDict['domain'] = row[0].lower()
    subDict['errors'] = row[6]
    subDict['errorlist'] = row[7]
    dictList.append(subDict)

finalDict = {}
finalDict['data'] = dictList

writeJson(finalDict, 'domains.json')

agencyList = []
for row in main:
    if row[3] in agencyList:
        continue
    else:
        agencyList.append(row[3])

agencyErrorSets = []
for agency in agencyList:
    subList = []
    sub = {}
    for row in main:
        if row[3] == agency:
            if row[-1] in sub:
                sub[row[-1]] += 1
            else:
                sub[row[-1]] = 1
    subList.append(agency)
    subList.append(sub)
    agencyErrorSets.append(subList)

errorTypes = {'Color Contrast Errors':'Color Contrast Error', 'HTML/Attribute Errors':'HTML/Attribute Error',
'Form Errors':'Form Error', 'Alt Tag Errors':'Alt Tag Error', 'Other Errors':'Other Error'}

# Note: the original called this with an undefined `agencyErrorDict`;
# totalErrorsByAgency (computed above) appears to be the intended argument.
output = makeAgencyOutput(agencyErrorSets, totalErrorsByAgency, errorTypes)
finalOutput = {}
finalOutput['data'] = output

writeJson(finalOutput, 'agencies.json')

Submit either/both HTTPS-enabled endpoints for a domain to ssllabs-scan

The SSL Labs API will stop automatically guessing whether the domain needs a www prefix in the next version:

http://sourceforge.net/p/ssllabs/mailman/message/34661550/

We do make a best guess at the "canonical" form of a domain in the inspect step, using site-inspector, so we can use this to submit the right endpoint.

That said, since that canonical prefix detection is buggy (and arguably has been giving us incomplete data anyway), we may be better off submitting either or both of the root and www prefixes, based on whether we detect HTTPS as available on each endpoint. That will leave less room for bugs and give us more data.
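
As a sketch of that selection logic (https_reachable() is a hypothetical stand-in for whatever HTTPS detection the inspect step already does):

def endpoints_to_submit(domain, https_reachable):
    candidates = [domain, "www." + domain]
    live = [host for host in candidates if https_reachable(host)]
    # If neither endpoint obviously answers HTTPS, fall back to the bare domain
    # so SSL Labs still gets something to evaluate.
    return live or [domain]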

Nothing is returned for a specific domain name

$ docker-compose run scan particulier.api.gouv.fr --scan=tls
Results written to CSV.

But the file results/tls.csv is empty

$ cat results/tls.csv 
Domain,Base Domain,Grade,Signature Algorithm,Key Type,Key Size,Forward Secrecy,OCSP Stapling,Fallback SCSV,RC4,SSLv3,TLSv1.2,SPDY,Requires SNI,HTTP/2

But it works well in the interface.

This command docker-compose run scan geo.api.gouv.fr --scan=tls works.

Nothing is printed on stderr. Do you know where the problem is?

ImportError: No module named 'requests'

Thanks to #132!

But when I run this command

./gather censys 
--suffix=.gov 
--censys_id=id 
--censys_key=key 
--start=1 
--end=2 
--delay=5 
--debug

Traceback (most recent call last):
  File "./gather", line 6, in <module>
    import requests
ImportError: No module named 'requests'

I am getting this error. What is the problem?

I already tried pip install requests.

Pshtt version

Perhaps this is a Docker-ism that I'm not as familiar with, but is there a reason to pin pshtt to a specific version, rather than leaving the version off and getting the latest from PyPI? Is this something to optimize container image building?

RUN pip3 install pshtt==0.2.1

a11y scanner freezes on certain domains

The following domains break the a11y scan such that I have to stop it, remove the domain, and restart the scan all over again.

Two problems result:

  • We don't get any a11y scan results for these domains.
  • Having to restart the scan significantly adds to the time and effort that goes into it.
afadvantage.gov
ama.gov
banknet.gov
biomassboard.gov
broadband.gov
dea.gov
disasterhousing.gov
export.gov
flightschoolcandidates.gov
grantsolutions.gov
gsaadvantage.gov
gsaauctions.gov
hrsa.gov
hydrogen.gov
idmanagement.gov
invasivespecies.gov
myfdicinsurance.gov
nationalbank.gov
nationalbanknet.gov
nationalhousing.gov
nationalhousinglocator.gov
nhl.gov
nls.gov
onhir.gov
pay.gov
realestatesales.gov
safetyact.gov
sciencebase.gov
segurosocial.gov
selectusa.gov
stopfakes.gov
tvaoig.gov
usdebitcard.gov

Converts a11y scan results to a format Pulse can use

These need major refactoring.

Formats the CSV into a JSON format:

require 'bundler/setup'
require 'pry'
require 'csv'
require 'json'
require 'parallel'

def get_scan_error row_hash
  {
    "code" => row_hash["code"], 
    "typeCode" => row_hash["typeCode"],
    "message" => row_hash["message"],
    "context" => row_hash["context"],
    "selector" => row_hash["selector"],
    "type" => row_hash["typeCode"] == "1" ? "error" : "other"
  }
end

Dir.chdir(File.dirname(__FILE__))

csv_scan = File.read('../data/a11y-8-4-2016-no-2_csv.csv')
inspect_domains = File.read('../data/inspect-domains.csv')
domains = {}

# create domains hash with just domains from inspect file
CSV.parse(inspect_domains, headers: true) do |row|
  row_hash = row.to_hash
  if row_hash["Live"] != "False"
    domains[row_hash["Domain"]] = {
      "Domain Name" => row_hash["Domain"],
      "scan" => []
    }
  end
end

#go through get each error, add to scan output
CSV.parse(csv_scan, headers: true) do |row|
  row_hash  = row.to_hash
  if !domains[row_hash["Domain"]]
    domains[row_hash["Domain"]] = {
      "Domain Name" => row_hash["Domain"],
      "scan" => [get_scan_error(row_hash)]
    }
  else
    domains[row_hash["Domain"]]["scan"] << get_scan_error(row_hash)
  end
end

combined_domains = []
domains.each do |domain|
  combined_domains << domain[1]
end

File.open("../data/a11y-8-4-2016-no-2_csv.json","w") do |f|
  f.write(combined_domains.to_json)
end

Takes that JSON and makes the three files needed for Pulse:

require 'bundler/setup'
require 'pry'
require 'csv'
require 'json'

def total_errors domain
  errors = domain["scan"].select{|row|
    row["type"] == "error"
  }
  errors.length
end

def get_branch domain, sample
  url = domain["Domain Name"].downcase
  puts "Get Branch url = #{url}"
  branch = ""
  sample["data"].each do |sample|
    if sample["domain"] == url
      branch = sample["branch"]
    end
  end
  puts "Branch = #{branch}"
  branch
end

def get_agency domain, sample
  url = domain["Domain Name"].downcase
  puts "Get Branch url = #{url}"
  agency = ""
  sample["data"].each do |sample|
    if sample["domain"] == url
      agency = sample["agency"]
    end
  end
  puts "Agency = #{agency}"
  agency
end

def get_error_cat_count domain
  errorlist = {
    "Alt Tag Errors" => 0,
    "Color Contrast Errors" => 0,
    "Form Errors" => 0,
    "HTML/Attribute Errors" => 0,
    "Other Errors" => 0
  }
  codes = {
    "1_4." => "Color Contrast Errors"
  }
  domain["scan"].each do |error|
    if error["code"].include? "1_4."
      errorlist["Color Contrast Errors"] = errorlist["Color Contrast Errors"] + 1
    elsif error["code"].include? "1_1."
      errorlist["Alt Tag Errors"] = errorlist["Alt Tag Errors"] + 1
    elsif error["code"].include? "4_1."
      errorlist["HTML/Attribute Errors"] = errorlist["HTML/Attribute Errors"] + 1
    elsif error["code"].include? "1_3."
      errorlist["Form Errors"] = errorlist["Form Errors"] + 1
    else
      errorlist["Other Errors"] = errorlist["Other Errors"] + 1
    end
  end
  errorlist
end

def get_cat_errors domain
  errorlist = {
    "Alt Tag Errors" => [],
    "Color Contrast Errors" => [],
    "Form Errors" => [],
    "HTML/Attribute Errors" => [],
    "Other Errors" => []
  }
  domain["scan"].each do |error|
    if error["code"].include? "1_4."
      errorlist["Color Contrast Errors"] << error
    elsif error["code"].include? "1_1."
      errorlist["Alt Tag Errors"] << error
    elsif error["code"].include? "4_1."
      errorlist["HTML/Attribute Errors"] << error
    elsif error["code"].include? "1_3."
      errorlist["Form Errors"] << error
    else
      errorlist["Other Errors"] << error
    end
  end
  errorlist
end



Dir.chdir(File.dirname(__FILE__))

scans = File.read('../data/a11y-8-4-2016-no-2_csv.json')

domains_sample = JSON.parse(File.read("../data/domains-sample.json"))
error_cats = JSON.parse(File.read('../config/error_cat.json'))

puts domains_sample["data"].length

scans = JSON.parse(scans)
all_errors_count = 0

domains = {}
domains["data"] = []

a11y = {}
a11y["data"] = {}
scans.each do |scan|
  puts scan["Domain Name"]
  puts "Total Errors = #{total_errors scan}"
  puts "Branch = #{get_branch(scan, domains_sample)}"
  all_errors_count += total_errors scan

  domains["data"] << {
    "agency": get_agency(scan, domains_sample),
     "branch": get_branch(scan, domains_sample),
     "canonical": "http://#{scan["Domain Name"].downcase}",
     "domain": scan["Domain Name"].downcase,
     "errors": total_errors(scan),
     "errorlist": get_error_cat_count(scan)
  }
  a11y["data"][scan["Domain Name"].downcase] = get_cat_errors(scan)
end

agencies = {}
agency_hash = {}

domains["data"].each do |domain|
  if agency_hash[domain["agency"]]
    agency = agency_hash[domain["agency"]]
    domain_error_list = domain["errorlist"]
    agency["Average Errors per Page"] += domain["errors"]
    agency["Alt Tag Errors"] += domain_error_list["Alt Tag Errors"]
    agency["HTML/Attribute Errors"] += domain_error_list["HTML/Attribute Errors"]
    agency["Form Errors"] += domain_error_list["Form Errors"]
    agency["Color Contrast Errors"] += domain_error_list["Color Contrast Errors"]
    agency["Other Errors"] += domain_error_list["Other Errors"]
  else
    agency_hash[domain[:agency]] = {}

    agency = agency_hash[domain[:agency]]
    domain_error_list = domain[:errorlist]
    # binding.pry
    agency["Agency"] = domain[:agency]
    agency["Average Errors per Page"] = domain[:errors]
    agency["Alt Tag Errors"] = domain_error_list["Alt Tag Errors"]
    agency["HTML/Attribute Errors"] = domain_error_list["HTML/Attribute Errors"]
    agency["Form Errors"] = domain_error_list["Form Errors"]
    agency["Color Contrast Errors"] = domain_error_list["Color Contrast Errors"]
    agency["Other Errors"] = domain_error_list["Other Errors"]
  end
  if domain["agency"] == "Department of Defense"
    puts agency_hash["Department of Defense"]
  end
end

agencies["data"] = agency_hash.map{|agency| 
  agency[1]
}



puts domains["data"].first

puts "Total Errors #{all_errors_count}"


File.open("../data/domains.json","w") do |f|
  f.write(domains.to_json)
end

File.open("../data/a11y.json","w") do |f|
  f.write(a11y.to_json)
end

File.open("../data/agencies.json","w") do |f|
  f.write(agencies.to_json)
end

Gracefully handle unauthenticated use of Censys.io Export API

Starting new HTTPS connection (1): www.censys.io
https://www.censys.io:443 "GET /api/v1/account HTTP/1.1" 200 243
Censys query:
SELECT parsed.subject.common_name, parsed.extensions.subject_alt_name.dns_names from FLATTEN([certificates.certificates], parsed.extensions.subject_alt_name.dns_names) where parsed.subject.common_name LIKE "%.gov" OR parsed.extensions.subject_alt_name.dns_names LIKE "%.gov";

Kicking off SQL query job.
https://www.censys.io:443 "POST /api/v1/export HTTP/1.1" 403 115
Traceback (most recent call last):
  File "/home/user/domain-scan/gatherers/censys.py", line 194, in export_mode
    job = export_api.new_job(query, format='csv', flatten=True)
  File "/home/user/domain-scan/censys/export.py", line 25, in new_job
    return self._post("export", data=data)
  File "/home/user/domain-scan/censys/base.py", line 111, in _post
    return self._make_call(self._session.post, endpoint, args, data)
  File "/home/user/domain-scan/censys/base.py", line 105, in _make_call
    const=const)
censys.base.CensysUnauthorizedException: 403 (unauthorized): Unauthorized. You do not have access to this service.

Censys error, aborting.
Downloading results of SQL query.
Traceback (most recent call last):
  File "./gather", line 175, in <module>
    run(options)
  File "./gather", line 73, in run
    for domain in gatherer.gather(suffix, options, extra):
  File "/home/user/domain-scan/gatherers/censys.py", line 66, in gather
    hostnames_map = export_mode(suffix, options, uid, api_key)
  File "/home/user/domain-scan/gatherers/censys.py", line 231, in export_mode
    utils.download(results_url, download_file)
  File "/home/user/domain-scan/scanners/utils.py", line 34, in download
    filename, headers = urllib.request.urlretrieve(url, destination)
  File "/usr/lib/python3.4/urllib/request.py", line 184, in urlretrieve
    url_type, path = splittype(url)
  File "/usr/lib/python3.4/urllib/parse.py", line 857, in splittype
    match = _typeprog.match(url)
TypeError: expected string or buffer

How can I run this?
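
The 403 means the account has no Export API access, so export_mode() logs "Censys error, aborting." but still tries to download a results URL that was never set, which is the TypeError above. A sketch of a more graceful abort, with the structure assumed from the traceback:

# Per the traceback, this exception lives in the repo's bundled censys/base.py.
from censys.base import CensysUnauthorizedException


def start_export_job(export_api, query):
    try:
        return export_api.new_job(query, format='csv', flatten=True)
    except CensysUnauthorizedException:
        print("Censys account does not have Export API access; skipping export gathering.")
        return None

# Caller sketch: only attempt the download when a job was actually created.
# job = start_export_job(export_api, query)
# if job is None:
#     return []   # nothing gathered, but no crash in utils.download()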

Report on any errors found during the process

At the end of the scan, --debug or not, list any errors that gave an invalid: true response in their cached data. Possibly include it in the meta.json file for the scan, too. It should be easy to find these and re-scan them.
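
A sketch of what that end-of-run report could look like, assuming error responses are cached as JSON with "invalid": true under ./cache/<scanner>/<domain>.json; the paths and structure are assumptions:

import glob
import json


def invalid_cache_entries(cache_dir="./cache"):
    errors = []
    for path in glob.glob("%s/*/*.json" % cache_dir):
        try:
            with open(path) as f:
                data = json.load(f)
        except ValueError:
            continue
        if isinstance(data, dict) and data.get("invalid") is True:
            errors.append(path)
    return errors

Those paths could be printed at the end of ./scan and also written into the run's meta.json, making re-scans of just the failures straightforward.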

pshtt scan exception case drops record from report

When the version of pshtt that is loaded by domain-scan lands in a ConnectionError or RequestException exception case in its basic_check method, it dumps an error and returns an empty string where a JSON object representing the domain's test results should be. This ripples through sslyze and errors out of the domain-scan pshtt invocation altogether (i.e., "Bad news scanning"), and never writes an entry to the results. Here's an example (domain redacted; hit me up for a live example):

› docker-compose run scan domain.tld --scan=pshtt --debug --force
[domain.tld][pshtt]
	 /opt/pyenv/versions/2.7.11/bin/pshtt domain.tld
Failed to connect.
Certificate did not match expected hostname: domain.tld. Certificate: {[certificate chain information]}
Traceback (most recent call last):
  File "/opt/pyenv/versions/2.7.11/bin/pshtt", line 9, in <module>
    load_entry_point('pshtt==0.1.5', 'console_scripts', 'pshtt')()
  File "/opt/pyenv/versions/2.7.11/lib/python2.7/site-packages/pshtt/cli.py", line 54, in main
    results = pshtt.inspect_domains(domains, options)
  File "/opt/pyenv/versions/2.7.11/lib/python2.7/site-packages/pshtt/pshtt.py", line 882, in inspect_domains
    results.append(inspect(domain))
  File "/opt/pyenv/versions/2.7.11/lib/python2.7/site-packages/pshtt/pshtt.py", line 63, in inspect
    basic_check(domain.https)
  File "/opt/pyenv/versions/2.7.11/lib/python2.7/site-packages/pshtt/pshtt.py", line 173, in basic_check
    https_check(endpoint)
  File "/opt/pyenv/versions/2.7.11/lib/python2.7/site-packages/pshtt/pshtt.py", line 331, in https_check
    cert_plugin_result = cert_plugin.process_task(server_info, 'certinfo_basic')
  File "/opt/pyenv/versions/2.7.11/lib/python2.7/site-packages/sslyze/plugins/certificate_info_plugin.py", line 115, in process_task
    if scan_command.custom_ca_file:
AttributeError: 'str' object has no attribute 'custom_ca_file'
Error running eval "$(pyenv init -)" && pyenv shell 2.7.11 && /opt/pyenv/versions/2.7.11/bin/pshtt domain.tld --json --user-agent "github.com/18f/domain-scan, pshtt.py" --timeout 30 --preload-cache ./cache/preload-list.json.
	Bad news scanning, sorry!
Results written to CSV.

That last bit is a lie... Nothing is written because the raw data out of the pshtt invocation is None.

This ultimately is an issue with the way exception cases in pshtt are ordered in the overall flow of logic, and I'll open up an issue there and eventually offer a solution. The reason I'm bringing it up here is that pshtt will produce a report with these exception cases properly reflected (i.e., as failing), but domain-scan just ends up dropping them from the report altogether, which can be confusing as anything when your target list of 12K domains only results in a results.csv of 11,999 rows (gah!). So this is more just an "awareness" issue.
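
A sketch of a fail-soft guard on the domain-scan side, so the domain still appears in results even when pshtt produces no output; the scan() shape, wrapper, and row layout are assumptions:

import json
import subprocess


def run_pshtt(domain, options):
    # Hypothetical wrapper around the pshtt CLI; --json and --timeout mirror the log above.
    result = subprocess.run(["pshtt", domain, "--json", "--timeout", "30"],
                            capture_output=True, text=True)
    return result.stdout


def scan(domain, options):
    raw = run_pshtt(domain, options)
    if not raw.strip():
        # pshtt crashed or printed nothing: emit a placeholder row instead of
        # silently dropping the domain from results.csv.
        yield [domain, "error", "pshtt returned no output"]
        return
    data = json.loads(raw)[0]
    yield [domain, data.get("Live")]  # field names illustrative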

remove need for a11y.py to run alongside inspect.py

In the current workflow, in step 5, I run the following command: docker-compose run scan domains.csv --scan=inspect,a11y --debug. However, if I am using a domains.csv that has been derived straight from recent DAP results, there's no need to run the inspect command (which adds a decent bit of time to the scan). It would be faster if I could just run the a11y scan without the inspect scan: `docker-compose run scan domains.csv --scan=a11y --debug`.

This does not work though, as it seems that the a11y scan depends on the inspect scan having already been run. You can see the error message below.

It would be handy to be able to run the a11y scan without it needing the inspect scan cache results.


[youthrules.gov][a11y]
Traceback (most recent call last):

  File "./scan", line 120, in process_scan
    rows = list(scanner.scan(domain, options))

  File "/home/scanner/scanners/a11y.py", line 197, in scan
    inspect_data = get_from_inspect_cache(domain)

  File "/home/scanner/scanners/a11y.py", line 23, in get_from_inspect_cache
    inspect_raw = open(inspect_cache).read()

FileNotFoundError: [Errno 2] No such file or directory: './cache/inspect/youthrules.gov.json'
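
A sketch of a fallback in scanners/a11y.py that would let --scan=a11y run on its own; the cache path comes from the traceback, while the inspect data shape here is an assumption:

import json
import os


def get_from_inspect_cache(domain):
    inspect_cache = "./cache/inspect/%s.json" % domain
    if not os.path.exists(inspect_cache):
        # No inspect run available: assume the domain is live at http://<domain>
        # so the a11y scan can proceed without --scan=inspect. Field names are
        # assumed, not the real inspect cache schema.
        return {"live": True, "canonical": "http://%s" % domain}
    with open(inspect_cache) as f:
        return json.loads(f.read())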

A recursive web crawler to gather domains

Note: this is a potentially big task that should be broken into smaller tasks/stages. But there is also value in starting with a naive, simple crawler and leveling it up in stages.

Either baked into domain-scan, or finding/making a separate tool that does this. We could also potentially use Common Crawl data.

But the basic need is to gather domains through web crawling, as this is a fertile source for hostnames that do not appear in Censys.io. For .gov, Censys and the LOC's web crawl (the End of Term Archive) each had ~50% of unique domains not found through any other public method. The LOC crawl data, performed in late 2016, is getting more stale by the month, and also won't be helpful for non-USG sources.
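
A deliberately naive sketch of that first stage: a breadth-first crawl that only records hostnames ending in a given suffix. Library choices and limits here are illustrative, not a settled design:

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl_hostnames(seed_urls, suffix=".gov", max_pages=500):
    seen_pages, hostnames = set(), set()
    queue = deque(seed_urls)
    while queue and len(seen_pages) < max_pages:
        url = queue.popleft()
        if url in seen_pages:
            continue
        seen_pages.add(url)
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        for link in BeautifulSoup(response.text, "html.parser").find_all("a", href=True):
            absolute = urljoin(url, link["href"])
            host = urlparse(absolute).hostname or ""
            if host.endswith(suffix):
                hostnames.add(host)
                queue.append(absolute)
    return sorted(hostnames)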

process_a11y.py script is not factoring in domains with no errors but it should

In step 6 of the a11y scanning process, the process_a11y.py script is not factoring in domains that have no errors, since it is just building from the results of the a11y.csv file generated in step 5; however, it needs to.

Imagine an executive branch agency with three active, non-redirecting domains. After step 5 completes, there are only error results for two domains (either because the third domain did not scan successfully or because no errors were detected). The problem is that step 6 computes based on the a11y.csv file of individual error results and does not factor in the total domain set that it should be considering.
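
A sketch of the missing merge step, assuming process_a11y.py already has the full domain list from domains.csv and a per-domain error count built from a11y.csv; names are illustrative:

def merge_domains(all_domains, error_counts):
    # Give every active, non-redirecting domain a row, defaulting to 0 errors
    # when it never appears in a11y.csv.
    return [{"domain": domain, "errors": error_counts.get(domain, 0)}
            for domain in all_domains]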

Expand the `inspect.py` script

Expand that script, which runs first in the a11y scan process, so that the resulting inspect.csv includes columns for Agency and Branch.

  • Agency could carry over from the domains.csv file that inspect.py is running on.
  • Branch - I would think that this could be done using the same method that is applied in this later script.

We would need to ensure that these changes do not adversely impact the workflows for the HTTPS or DAP sections. cc @konklone

The benefit of making these changes is that it would make #101 and, to a degree, #102 easier to resolve.

docker-compose up fails

$ docker --version
Docker version 1.9.1, build a34a1d5
$ docker-compose up                                    
Building scan
Step 1 : FROM ubuntu:14.04.3
 ---> 6cc0fc2a5ee3
Step 2 : MAINTAINER V. David Zvenyach <[email protected]>
 ---> Using cache
 ---> 9c7124f58945
Step 3 : RUN apt-get update         -qq     && apt-get install         -qq         --yes         --no-install-recommends         --no-install-suggests       build-essential=11.6ubuntu6       curl=7.35.0-1ubuntu2.5       git=1:1.9.1-1ubuntu0.1       libc6-dev=2.19-0ubuntu6.6       libfontconfig1=2.11.0-0ubuntu4.1       libreadline-dev=6.3-4ubuntu2       libssl-dev=1.0.1f-1ubuntu2.15       libssl-doc=1.0.1f-1ubuntu2.15       libxml2-dev=2.9.1+dfsg1-3ubuntu4.4       libxslt1-dev=1.1.28-2build1       libyaml-dev=0.1.4-3ubuntu3.1       make=3.81-8.2ubuntu3       nodejs=0.10.25~dfsg2-2ubuntu1       npm=1.3.10~dfsg-1       python3-dev=3.4.0-0ubuntu2       python3-pip=1.5.4-1ubuntu3       unzip=6.0-9ubuntu1.3       wget=1.15-1ubuntu1.14.04.1       zlib1g-dev=1:1.2.8.dfsg-1ubuntu1       autoconf=2.69-6       automake=1:1.14.1-2ubuntu1       bison=2:3.0.2.dfsg-2       gawk=1:4.0.1+dfsg-2.1ubuntu2       libffi-dev=3.1~rc1+r3.0.13-12       libgdbm-dev=1.8.3-12build1       libncurses5-dev=5.9+20140118-1ubuntu1       libsqlite3-dev=3.8.2-1ubuntu2.1       libtool=2.4.2-1.7ubuntu1       pkg-config=0.26-1ubuntu4       sqlite3=3.8.2-1ubuntu2.1     && apt-get clean     && rm -rf /var/lib/apt/lists/*
 ---> Running in eb3383000cf7
E: Version '7.35.0-1ubuntu2.5' for 'curl' was not found
E: Version '1:1.9.1-1ubuntu0.1' for 'git' was not found
E: Version '1.0.1f-1ubuntu2.15' for 'libssl-dev' was not found
E: Version '1.0.1f-1ubuntu2.15' for 'libssl-doc' was not found
E: Version '2.9.1+dfsg1-3ubuntu4.4' for 'libxml2-dev' was not found
E: Version '6.0-9ubuntu1.3' for 'unzip' was not found
ERROR: Service 'scan' failed to build: The command '/bin/sh -c apt-get update         -qq     && apt-get install         -qq         --yes         --no-install-recommends         --no-install-suggests       build-essential=11.6ubuntu6       curl=7.35.0-1ubuntu2.5       git=1:1.9.1-1ubuntu0.1       libc6-dev=2.19-0ubuntu6.6       libfontconfig1=2.11.0-0ubuntu4.1       libreadline-dev=6.3-4ubuntu2       libssl-dev=1.0.1f-1ubuntu2.15       libssl-doc=1.0.1f-1ubuntu2.15       libxml2-dev=2.9.1+dfsg1-3ubuntu4.4       libxslt1-dev=1.1.28-2build1       libyaml-dev=0.1.4-3ubuntu3.1       make=3.81-8.2ubuntu3       nodejs=0.10.25~dfsg2-2ubuntu1       npm=1.3.10~dfsg-1       python3-dev=3.4.0-0ubuntu2       python3-pip=1.5.4-1ubuntu3       unzip=6.0-9ubuntu1.3       wget=1.15-1ubuntu1.14.04.1       zlib1g-dev=1:1.2.8.dfsg-1ubuntu1       autoconf=2.69-6       automake=1:1.14.1-2ubuntu1       bison=2:3.0.2.dfsg-2       gawk=1:4.0.1+dfsg-2.1ubuntu2       libffi-dev=3.1~rc1+r3.0.13-12       libgdbm-dev=1.8.3-12build1       libncurses5-dev=5.9+20140118-1ubuntu1       libsqlite3-dev=3.8.2-1ubuntu2.1       libtool=2.4.2-1.7ubuntu1       pkg-config=0.26-1ubuntu4       sqlite3=3.8.2-1ubuntu2.1     && apt-get clean     && rm -rf /var/lib/apt/lists/*' returned a non-zero code: 100

Gathering hostnames via Docker

As it currently stands, I don't think there is a way to use the gather tool from within the Docker images built from this repo? I think it's worth creating a Dockerfile for building an image used to gather hostnames before scanning. I would love some feedback on this idea, and would be more than happy to help out if it is something that is wanted.

Finish integrating semantic changes to a11y scans

Right now, the following edits are manually made during the a11y scan process. We should go through and change the scripts and scans to address these so that I no longer need to manually make them:

  • Alt Text => Missing Image Description - issue
  • add http:// to canonical domains - issue
  • "errors" - "initial findings" - issue
  • agency -> Agency - issue

Build scripts for remote compilation of dependencies for domain-scan.zip

The lambda/remote_build.sh script has the commands I use to build the domain-scan Lambda environment, but it's not repeatable, and rebuilds require me to copy/paste manual subsets of the instructions.

This is going to become more of a burden over time, as any updates to dependencies will require a rebuild to capture these changes (and pshtt itself is likely to keep rapidly improving in very relevant ways), followed by a re-upload to Lambda.

For a11y, include the following options as default

{
  "ignore": [
    "notice",
    "warning",
    "WCAG2AA.Principle1.Guideline1_4.1_4_3.G18.BgImage",
    "WCAG2AA.Principle1.Guideline1_4.1_4_3.G18.Abs",
    "WCAG2AA.Principle1.Guideline1_4.1_4_3.G145.Abs",
    "WCAG2AA.Principle3.Guideline3_1.3_1_1.H57.2",
    "WCAG2AA.Principle3.Guideline3_1.3_1_1.H57.3",
    "WCAG2AA.Principle3.Guideline3_1.3_1_2.H58.1",
    "WCAG2AA.Principle4.Guideline4_1.4_1_1.F77"
  ]
}

a11y.py scan is not ignoring individual errors

In step 5 of the a11y scanning process, the a11y.py scan is not excluding individual errors that are listed in the ignore list. Right now, to get by, I go back in and hand-remove them from the a11y.csv file that is generated after step 5, but this is laborious and error-prone.

Notices and warnings are correctly excluded but not the individual errors. It's as if I hadn't included them there.

I suspect that this comes from them being improperly formatted or referenced, though I don't know how. Here's some documentation that I've found:

process_a11y.py script is not removing inactive, redirecting domains automatically

In step 6 of the a11y scanning process, the process_a11y.py script is not removing inactive and redirecting domains automatically. Right now, to get by, I go in and hand-remove them from the domains.csv file as an extra part of step 3 of the a11y scanning process, but this is laborious and error-prone.

I don't believe that this is broken functionality so much as functionality that has never existed. It'll be helpful to automate this. It doesn't actually matter at what step this occurs, so long as the domains from these other branches are not present in the final files generated in step 6.

sslyze calls deadlocking due to combination of threads/processes/logging

There are no more defunct processes after #151, and our bulk scans go for much longer before they become an issue, but eventually they do just get stuck. Or as this says:

If a process is forked while an io lock is being held, the child process will deadlock on the next call to flush.

After looking at the stuck processes' trace with gdb, I'm convinced I'm facing the same issue described here:
https://stackoverflow.com/questions/39884898/large-amount-of-multiprocessing-process-causing-deadlock

And that this bug, opened in 2009 and still quite actively discussed in October 2017, is the cause:
https://bugs.python.org/issue6721

The folks on that bug thread seem to be converging on a fix that is specific to logging calls and buffered IO, which I suspect would be enough to fix our case. There's also some related discussion on this bug, with Guido indicating he believes something should be done for the GC interrupt case.

There are a few ways to work around this I can think of:

  • Use sslyze's SynchronousScanner instead of the ConcurrentScanner. However, this is both much slower and results in a distinct memory leak, as noted in #151.
  • Don't ever write to stdout from the child processes. However, this exposes some of my not-totally-complete understanding here -- I am not sure whether it's my own logging calls or something inside sslyze that is the problem. Workers at the domain-scan level are done as threads via a ThreadPoolExecutor, whereas I believe it's SSLyze's ConcurrentScanner that forks off processes. So I may have limited control here. The only reference to using stdout in SSLyze's core is this emergency shutdown message, so I am not sure where in SSLyze this might be happening.
  • This Python module that lets you register an after-fork hook that clears up any held locks the child copied over. It's not on PyPI and would need to be installed from the repo (pip supports GitHub repo syntax).

Right now, I'm leaning toward the 3rd option, the Python module. I'll try it out and see how it goes.
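
For reference, a sketch of the same after-fork idea using only the standard library; os.register_at_fork exists in Python 3.7+, so the environment described here would still need the external module, and resetting logging's private lock is a workaround rather than a supported API:

import logging
import os
import threading


def _reinit_logging_locks():
    # Replace any locks a forked child may have inherited in a held state.
    logging._lock = threading.RLock()
    for handler in logging.getLogger().handlers:
        handler.createLock()


os.register_at_fork(after_in_child=_reinit_logging_locks)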

undefined method `[]' for nil:NilClass (NoMethodError)

Looks like a few of them ran, but then I got an error.

[acus.gov]
[acus.gov]
Fetched, cached.
[achp.gov]
[achp.gov]
Fetched, cached.
[preserveamerica.gov]
[preserveamerica.gov]
Fetched, cached.
[adf.gov]
[adf.gov]
Fetched, cached.
[usadf.gov]
[usadf.gov]
Fetched, cached.
[abmc.gov]
[abmc.gov]
Fetched, cached.
[amtrakoig.gov]
[amtrakoig.gov]
Fetched, cached.
[arc.gov]
[arc.gov]
Fetched, cached.
[afrh.gov]
[afrh.gov]
Fetched, cached.
[cia.gov]
[cia.gov]
Fetched, cached.
[ic.gov]
[ic.gov]
/Library/Ruby/Gems/2.0.0/gems/site-inspector-1.0.0/lib/site-inspector/headers.rb:24:in `strict_transport_security': undefined method `[]' for nil:NilClass (NoMethodError)
	from /Library/Ruby/Gems/2.0.0/gems/site-inspector-1.0.0/lib/site-inspector/headers.rb:10:in `strict_transport_security?'
	from ./https-scan.rb:137:in `domain_details'
	from ./https-scan.rb:105:in `check_domain'
	from ./https-scan.rb:51:in `block (2 levels) in go'
	from /System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/csv.rb:1716:in `each'
	from /System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/csv.rb:1120:in `block in foreach'
	from /System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/csv.rb:1266:in `open'
	from /System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/csv.rb:1119:in `foreach'
	from ./https-scan.rb:31:in `block in go'
	from /System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/csv.rb:1266:in `open'
	from ./https-scan.rb:17:in `go'
	from ./https-scan.rb:152:in `<main>'

Dockerfile needs an update?

Hey! Just checking out this repo and noticed that Dockerfile as-is doesn't build for me:

docker@boot2docker:~/domain-scan$ docker run 5740b16bdfb7
Traceback (most recent call last):
  File "/tmp/scan", line 6, in <module>
    from scanners import utils
  File "/tmp/scanners/utils.py", line 10, in <module>
    import strict_rfc3339
ImportError: No module named 'strict_rfc3339'

Running in boot2docker 1.6.2, and docker version yields:

docker@boot2docker:~/domain-scan$ docker version
Client version: 1.6.2
Client API version: 1.18
Go version (client): go1.4.2
Git commit (client): 7c8fca2
OS/Arch (client): linux/amd64
Server version: 1.6.2
Server API version: 1.18
Go version (server): go1.4.2
Git commit (server): 7c8fca2
OS/Arch (server): linux/amd64

Sort output results alphabetically by domain

It should be easy to compare the results of one output to another, using diff tools, without worrying about the order that all the asynchronous parallelized tasks happened to complete in.

The various output .csv's should be ordered alphabetically, by domain name, with the header intact at the top. Hopefully this can be done more-or-less in place in a quick, streaming way.
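
A sketch of an in-place sort that keeps the header at the top; it reads the whole file into memory rather than streaming, which is fine at current result sizes but not the streaming approach hoped for above:

import csv


def sort_results_csv(path):
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = sorted(reader, key=lambda row: row[0])  # column 0 is Domain
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)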

Suggestion: Split up the Dockerfile

Just took a look at the Dockerfile for the first time, and was surprised to see how much is in there. I guess it's because the various tools being used all have different dependencies?

Having multiple languages in a single Dockerfile is an antipattern (IMHO), and I think the setup for each scanner could be a lot simpler if you isolated each tool to its own Dockerfile. These could then be run independently, or via a domain-scan Dockerfile that calls out to docker run <scanner> and then stitches the results together.

I got this idea from the architecture of the Code Climate CLI, so you could look there for inspiration if you're interested in pursuing this.

process_a11y.py script is not removing other branches automatically

In step 6 of the a11y scanning process, the process_a11y.py script is not removing domains from the legislative, judicial, and 'non-federal' branches automatically. Right now, to get by, I go in and hand-remove them from the domains.csv file as an extra part of step 3 of the a11y scanning process, but this is laborious and error-prone.

I don't believe that this is broken functionality so much as functionality that has never existed. It'll be helpful to automate this. It doesn't actually matter at what step this occurs, so long as the domains from these other branches are not present in the final files generated in step 6.

Here is the list of agencies that shouldn't be included in the a11y scan.
