GCA McScrapy

McScrapy is a fast scraper that aims to copy an entire website, as accurately as possible, into a static site so that the site can be served securely.

Genesis

To maintain and enhance GCA's secure operating environment, a secure website, nearly immune to compromise, was established in September 2016, and the DMARC micro-site was established in October 2016. Both websites were built with WordPress so that content creators could easily modify the text on the sites.

While WordPress is a popular blogging platform, it is by its nature prone to compromise. WordPress dynamically composes web pages using PHP and JavaScript, which carries a high risk of bugs and security vulnerabilities that can serve as a vector for compromise. Because multiple parties need to create and update content on the sites, the decision was made to secure them by scraping all of the dynamic content into static sites.

How McScrapy Works

The foundation is a scraping tool that attempts to scrape every piece of a website so that it functions as completely as possible as a static clone, removing the potential security issues of third-party services and unnecessary requests. HTML pages are scanned thoroughly for URLs embedded in element attributes, including but not limited to: href attributes, img element src attributes, the contents of style tags, and inline styles. If any CSS files are found, their contents are also scanned for potential resources. As resources (JS, CSS, images, PDFs, etc.) are found, they are downloaded from those URLs and saved relative to the path portion of the URL, mimicking the original structure of the website being scraped.
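To make the attribute-scanning step concrete, here is a minimal, hypothetical Go sketch (not McScrapy's actual code) that parses a page with the golang.org/x/net/html package and collects the href and src attribute values a scraper would consider downloading. The collectURLs helper and the example.com target are assumptions made for the example.

package main

import (
	"fmt"
	"net/http"

	"golang.org/x/net/html"
)

// collectURLs walks the parsed HTML tree and gathers the values of href and
// src attributes, two of the places McScrapy is described as scanning.
// Scanning style tags, inline styles, and CSS files for url(...) references
// is omitted here for brevity.
func collectURLs(n *html.Node, out *[]string) {
	if n.Type == html.ElementNode {
		for _, a := range n.Attr {
			if a.Key == "href" || a.Key == "src" {
				*out = append(*out, a.Val)
			}
		}
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		collectURLs(c, out)
	}
}

func main() {
	// example.com is a placeholder target for this sketch.
	resp, err := http.Get("https://example.com")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	doc, err := html.Parse(resp.Body)
	if err != nil {
		panic(err)
	}

	var urls []string
	collectURLs(doc, &urls)
	for _, u := range urls {
		fmt.Println(u)
	}
}

Each collected URL would then be fetched and written to disk under the path portion of the URL, preserving the structure of the original site.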

McScrapy includes the ability to debug a scrape, cache files as they are saved, ignore a website's robots.txt restrictions, specify a maximum recursion depth when scanning HTML pages, and scrape using a specific user agent. These features can be used in any combination, for example, to reduce scan times, acquire more resources, or scrape mobile sites (see the combined example under Flags below).

Once a scrape has finished, a preview function is available to test the scraped website for completeness and general asset availability. Using a generic file server, the preview function hosts the now-static clone of the original website, including dynamic routing of HTML pages.
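As an illustration of the preview idea (a generic file server with HTML fallback routing), the following hypothetical Go sketch serves a scraped directory and maps extension-less paths to matching .html files. It is not the project's implementation; the ./sites/example.com path is a placeholder, and the address and port match the defaults documented in the Preview section below.

package main

import (
	"log"
	"net/http"
	"os"
	"path/filepath"
	"strings"
)

// htmlFallback wraps a standard file server: extension-less paths such as
// /about are rewritten to /about.html when that file exists, approximating
// the routing rule described in the Preview section. Directory paths fall
// through to the file server, which serves their index.html by default.
func htmlFallback(root string) http.Handler {
	fs := http.FileServer(http.Dir(root))
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		p := r.URL.Path
		if !strings.HasSuffix(p, "/") && filepath.Ext(p) == "" {
			candidate := p + ".html"
			if _, err := os.Stat(filepath.Join(root, filepath.FromSlash(candidate))); err == nil {
				r.URL.Path = candidate
			}
		}
		fs.ServeHTTP(w, r)
	})
}

func main() {
	// ./sites/example.com is a placeholder for a scraped site directory.
	log.Fatal(http.ListenAndServe("127.0.0.1:8000", htmlFallback("./sites/example.com")))
}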

Build

To build the application, simply run make in the root directory. Alternatively, run:

go build -o bin/mcscrapy github.com/GlobalCyberAlliance.org/McScrapy/cmd/mcscrapy

Scrape

To scrape a website, run:

mcscrapy scrape [domain]
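For example, to scrape a single site with the default settings (example.com is a placeholder domain, not one from the project):

mcscrapy scrape example.com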

Flags

-c --cache Specify where requests are cached as files.

-d --debug Output debug logs.

-i --ignore-robots Ignore restrictions set by a host's robots.txt file.

-m --max-depth Set the maximum recursion depth for visited URLs. Leave blank to allow unlimited depth.

-o --output-dir Output scraped websites to a specific directory.

-u --user-agent Set the user agent used by the scraper.

-v --verbose Output verbose logs.
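As a sketch of how these flags can be combined, the command below ignores robots.txt, limits recursion to two levels, writes output under ./sites, scrapes with a mobile user agent, and prints verbose logs. The domain, output directory, and user-agent string are placeholder values, not ones taken from the project:

mcscrapy scrape example.com -i -m 2 -o ./sites -u "Mozilla/5.0 (iPhone; CPU iPhone OS 15_0 like Mac OS X)" -v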

Preview

Preview a scraped website with a built-in web server. Paths ending in / or without a file extension are served as HTML files by default.

mcscrapy preview [path_to_site_directory]

Flags

-a --address Set the address of the preview. Default: 127.0.0.1

-p --port Set the port of the preview. Default: 8000
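For example, to preview a scraped site on all interfaces on port 8080 (the directory path is a placeholder):

mcscrapy preview ./sites/example.com -a 0.0.0.0 -p 8080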

License

This repository is licensed under the Apache License version 2.0.

Some of the project's dependencies may be under different licenses.
