lanzay / commonspeak2

This project forked from assetnote/commonspeak2

Leverages publicly available datasets from Google BigQuery to generate content discovery and subdomain wordlists

License: Apache License 2.0

Go 92.87% TSQL 7.13%

Commonspeak2

Commonspeak2 leverages publicly available datasets from Google BigQuery to generate content discovery and subdomain wordlists.

As these datasets are updated on a regular basis, the wordlists generated via Commonspeak2 reflect the current technologies used on the web.

By using the Golang client for BigQuery, we can stream the data and process it very quickly. The future of this project will revolve around improving the quality of wordlists generated by creating automated filters and substitution functions.

Let's turn creating wordlists from a manual task into a reproducible and reliable science with BigQuery.

I just want the wordlists...

We will update the commonspeak2-wordlists repo with any wordlists generated by the Commonspeak2 tool.

More infrastructure will be developed to deliver wordlists continuously and this section will be updated in the future.

Instructions & Usage

If you're compiling or running Commonspeak2 from source:

If you're using the pre-built binaries:

  • Download the newest release here

Upon completing the above steps, Commonspeak2 can be used in the following ways:

Subdomains

Currently subdomains are extracted from HackerNews and HTTPArchive's latest scans. Unlike the previous revision of Commonspeak, the datasets and queries have been optimised to contain valid data that occurs often in the wild.

⟩ ./commonspeak2 --project crunchbox-160315 --credentials credentials.json subdomains -o subdomains.txt

INFO[0000] Generated SQL template for HackerNews.        Mode=Subdomains
INFO[0000] Generated SQL template for HTTPArchive.       Mode=Subdomains
INFO[0000] Executing BigQuery SQL... this could take some time.  Mode=Subdomains Source=hackernews
INFO[0019] Total rows extracted 71415.                   Mode=Subdomains Silent=false Source=hackernews Verbose=false
INFO[0019] Executing BigQuery SQL... this could take some time.  Mode=Subdomains Source=httparchive
INFO[0075] Total rows extracted 484701.                  Mode=Subdomains Silent=false Source=httparchive Verbose=false
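
To illustrate the idea behind this mode, here is a minimal Go sketch that pulls subdomain labels out of a list of URLs and ranks them by how often they occur. It is not the SQL that Commonspeak2 runs against BigQuery. The function names (subdomainLabels, rankSubdomains) are hypothetical, and the sketch assumes the registrable domain is always the last two labels, which is wrong for suffixes such as co.uk.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
	"strings"
)

// subdomainLabels returns the labels left of the registrable domain.
// Assumption: the registrable domain is the last two labels (example.com).
// Hosts under multi-part suffixes (co.uk) and IP addresses are not handled.
func subdomainLabels(rawURL string) []string {
	u, err := url.Parse(rawURL)
	if err != nil || u.Hostname() == "" {
		return nil
	}
	labels := strings.Split(strings.ToLower(u.Hostname()), ".")
	if len(labels) <= 2 {
		return nil
	}
	return labels[:len(labels)-2]
}

// rankSubdomains counts every label across all URLs and returns the labels
// most frequent first, breaking ties alphabetically.
func rankSubdomains(urls []string) []string {
	counts := map[string]int{}
	for _, raw := range urls {
		for _, label := range subdomainLabels(raw) {
			counts[label]++
		}
	}
	ranked := make([]string, 0, len(counts))
	for label := range counts {
		ranked = append(ranked, label)
	}
	sort.Slice(ranked, func(i, j int) bool {
		if counts[ranked[i]] != counts[ranked[j]] {
			return counts[ranked[i]] > counts[ranked[j]]
		}
		return ranked[i] < ranked[j]
	})
	return ranked
}

func main() {
	urls := []string{
		"https://api.example.com/v1/users",
		"http://api.example.org/",
		"https://staging.shop.example.net/cart",
		"https://example.com/",
	}
	// Prints: api, shop, staging (one per line).
	for _, label := range rankSubdomains(urls) {
		fmt.Println(label)
	}
}
```

Ranking by frequency is what keeps the list focused on names that occur often in the wild; a production version would likely rely on a public suffix list rather than the two-label assumption.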

Words with extensions

Using a single query on GitHub's dataset, we can extract every path filtered by file extension. This can be done with:

⟩ ./commonspeak2 --project crunchbox-160315 --credentials credentials.json ext-wordlist -e jsp -l 100000 -o jsp.txt

INFO[0000] Executing BigQuery SQL... this could take some time.  Extensions=jsp Limit=100000 Mode=WordsWithExt Source=Github
INFO[0013] Total rows extracted 100000.                  Mode=WordsWithExt Source=Github

Any set of extensions can be passed via the -e flag as a comma-separated list, e.g. -e aspx,php,html,js.
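
As a rough illustration of filtering by extension, the Go sketch below parses a comma-separated value like the one given to -e and keeps only the paths whose extension is in that set. Commonspeak2 itself does this filtering in its BigQuery query; parseExtensions and filterByExtension are hypothetical names used only for this example.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// parseExtensions turns a value such as "aspx, .PHP,html" into a
// lower-cased set without leading dots. Empty entries are skipped.
func parseExtensions(flagValue string) map[string]bool {
	exts := map[string]bool{}
	for _, part := range strings.Split(flagValue, ",") {
		ext := strings.ToLower(strings.TrimPrefix(strings.TrimSpace(part), "."))
		if ext != "" {
			exts[ext] = true
		}
	}
	return exts
}

// filterByExtension keeps the paths whose file extension is in exts,
// preserving input order.
func filterByExtension(paths []string, exts map[string]bool) []string {
	var kept []string
	for _, p := range paths {
		ext := strings.ToLower(strings.TrimPrefix(path.Ext(p), "."))
		if exts[ext] {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	exts := parseExtensions("aspx,php,html,js")
	paths := []string{"admin/login.php", "static/app.js", "README.md", "WEB-INF/index.jsp"}
	// Prints admin/login.php and static/app.js.
	for _, p := range filterByExtension(paths, exts) {
		fmt.Println(p)
	}
}
```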

Deleted files

Using GitHub's commits dataset, we can extract what may be files that developers decided to delete from their public repositories. These files may contain sensitive data. This can be done with:

⟩ ./commonspeak2 --project crunchbox-160315 --credentials credentials.json deleted-files -l 50000 -o deleted.txt

INFO[0000] Executing BigQuery SQL... this could take some time.  Limit=50000 Mode=DeletedFiles Source=Github
INFO[0013] Total rows extracted 50000.                  Mode=DeletedFiles Source=Github

Features in Active Development

Feel free to send pull requests to complete the features below, add datasets or improve the architecture of this project. Thank you!

Routes Based Extraction

We can create SQL statements that cover routing patterns in almost any web framework. For now, we support extracting paths from the following web frameworks:

  • Rails [working implementation ✅]
  • NodeJS [to be implemented ❎]
  • Tomcat [to be implemented ❎]

This data can be extracted using the following command:

⟩ ./commonspeak2 --project crunchbox-160315 --credentials credentials.json routes --frameworks rails -l 100000 -o rails-routes.txt

WARNING: running the above query is expensive (over $20 per framework). Commonspeak2 will prompt you to confirm that this is OK. To skip this prompt, use the --silent flag.

When this is run for Rails routes, Commonspeak2 does the following:

  1. Pulls Rails routes from config/routes.rb using a regular expression and the latest GitHub dataset.
  2. Processes the data, converts it into paths and does contextual replacements to make each path valid (e.g. converting /:id to /1234).
  3. Normalizes each path, then saves the results to disk once all processing is complete.
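
The three steps above can be sketched in Go roughly as follows. This is a simplified illustration, not Commonspeak2's implementation: the regular expressions, the fixed replacement value 1234, and the function names (extractRoutes, toPath, buildWordlist) are assumptions, and only routes declared with a quoted literal path are recognised.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
)

// routePattern matches simple declarations such as: get '/users/:id'
// Assumption: only single- or double-quoted literal paths are handled.
var routePattern = regexp.MustCompile(`(?m)^\s*(?:get|post|put|patch|delete|match)\s+['"]([^'"]+)['"]`)

// paramPattern matches dynamic segments such as :id or :user_id.
var paramPattern = regexp.MustCompile(`:[A-Za-z_][A-Za-z0-9_]*`)

// extractRoutes is step 1: pull raw route strings out of routes.rb content.
func extractRoutes(routesRb string) []string {
	var routes []string
	for _, m := range routePattern.FindAllStringSubmatch(routesRb, -1) {
		routes = append(routes, m[1])
	}
	return routes
}

// toPath covers steps 2 and 3 for one route: replace dynamic segments with
// a sample value, then normalise (leading slash, no duplicate slashes).
func toPath(route string) string {
	replaced := paramPattern.ReplaceAllString(route, "1234")
	return path.Clean("/" + replaced)
}

// buildWordlist runs the whole pipeline and drops duplicate paths.
func buildWordlist(routesRb string) []string {
	seen := map[string]bool{}
	var out []string
	for _, r := range extractRoutes(routesRb) {
		p := toPath(r)
		if !seen[p] {
			seen[p] = true
			out = append(out, p)
		}
	}
	return out
}

func main() {
	routesRb := `
Rails.application.routes.draw do
  get '/users/:id', to: 'users#show'
  post "admin//reports/:report_id/export", to: 'reports#export'
  get '/users/:user_id'
end
`
	// Prints /users/1234 and /admin/reports/1234/export.
	for _, p := range buildWordlist(routesRb) {
		fmt.Println(p)
	}
}
```

Deduplicating after substitution matters here because distinct routes such as /users/:id and /users/:user_id collapse to the same path once the placeholder is replaced.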

Scheduled Wordlist Generation

This planned feature will use a cron-like system so that wordlist generation from BigQuery can happen continuously.

When this feature is introduced, the --schedule parameter can be added to any of the existing commands covered in this README, like so:

⟩ ./commonspeak2 --project crunchbox-160315 --credentials credentials.json --schedule weekly routes --frameworks nodejs,tomcat -l 100000 -o nodejs-tomcat-routes.txt

The above command will run the BigQuery query weekly and save the output to ./nodejs-tomcat-routes.txt.

Substitutions and Alterations

Generate smart substitutions and alterations for the datasets where they make sense. For example, converting a route such as /admin/users/:id to /admin/users/1234 (contextually aware that the placeholder is a number).
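
Since this feature is still in development, the following Go sketch only illustrates what a contextually aware substitution could look like: the replacement value is chosen from the placeholder's name. The rules, the sample values, and the names sampleValue and substitute are all hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// placeholder matches dynamic segments such as :id or :comment_id.
var placeholder = regexp.MustCompile(`:[A-Za-z_][A-Za-z0-9_]*`)

// sampleValue picks a plausible value based on the placeholder name.
// These rules and values are illustrative assumptions only.
func sampleValue(name string) string {
	n := strings.ToLower(name)
	switch {
	case strings.Contains(n, "uuid"):
		return "123e4567-e89b-12d3-a456-426614174000"
	case n == "id" || strings.HasSuffix(n, "_id"):
		return "1234"
	case strings.Contains(n, "slug") || strings.Contains(n, "name"):
		return "example"
	case n == "format":
		return "json"
	default:
		return "test"
	}
}

// substitute replaces every placeholder in route with its sample value.
func substitute(route string) string {
	return placeholder.ReplaceAllStringFunc(route, func(m string) string {
		return sampleValue(strings.TrimPrefix(m, ":"))
	})
}

func main() {
	routes := []string{
		"/admin/users/:id",
		"/posts/:slug/comments/:comment_id",
		"/files/:file_uuid.:format",
	}
	for _, r := range routes {
		fmt.Println(substitute(r))
	}
}
```

With these rules, /admin/users/:id becomes /admin/users/1234 and /posts/:slug/comments/:comment_id becomes /posts/example/comments/1234.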

Credits

Shubham Shah @infosec_au

Michael Gianarakis @mgianarakis

License

   Copyright 2018 Assetnote

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.

Assetnote Pty. Ltd. - Twitter @assetnote
