scalastic / aws-documentation-scraper

Scraper for the AWS documentation website. It crawls two levels of pages and exports the documentation to a JSON file. Unique IDs are also generated for array items so that HTML UI components consuming the generated JSON can reference them.

License: MIT License

Scala 100.00%
aws documentation scraper json upickle html2json

AWS Documentation Scraper

Are you struggling to navigate the vast and intricate world of AWS documentation? AWS provides a wealth of services with esoteric names that can be daunting for newcomers. Fortunately, the AWS documentation website is XML-encoded, which means we can extract the data from each page and build our own comprehensive documentation in JSON format. The result is easy to publish on your own website or application.

Requirements

Before you begin, ensure you have sbt installed (which itself requires a JDK), since the application is launched with sbt run as described under Usage.

Usage

  1. Launch the application by running the following command:
     sbt run
  2. It generates 3 files in the ./data folder:
    • root-documentation.ser, which is the serialized data from the root documentation page of the AWS website (currently https://docs.aws.amazon.com),

    • full-documentation.ser that contains all the serialized AWS documentation from the root page and all its associated pages,

    • full-documentation.json which is the resulting file containing all the AWS documentation in JSON format.

Note:

The two serialized files are used for caching data, so don't forget to remove them if you need fresh and up-to-date data from AWS.
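The note above can be scripted. The following is a minimal Node.js sketch, assuming the ./data folder and the two .ser file names listed earlier; clearCache is a hypothetical helper and is not part of the project.

```javascript
const fs = require("fs");
const path = require("path");

// Cache file names taken from the list above; the JSON output is left alone.
const CACHE_FILES = ["root-documentation.ser", "full-documentation.ser"];

// Hypothetical helper: deletes the cache files found in dataDir
// and returns the names of the files it actually removed.
function clearCache(dataDir = "./data") {
  const removed = [];
  for (const name of CACHE_FILES) {
    const file = path.join(dataDir, name);
    if (fs.existsSync(file)) {
      fs.rmSync(file);
      removed.push(name);
    }
  }
  return removed;
}
```

Calling clearCache() before sbt run should then force the next run to fetch fresh data, provided the scraper only relies on these two files for caching.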

Expected result

You will obtain a JSON file with a structure similar to the following:

{
  "title":"AWS Documentation",
  "subtitle":"Guides and API References",
  "abstract":"Find user guides, developer guides, API references, tutorials, and more.",
  "panels":[
    {
      "services":{
        "service":[
          {
            "prefix":"Amazon",
            "name":"EC2",
            "href":{
              "title":"Amazon Elastic Compute Cloud Documentation",
              "short-title":"Amazon EC2",
              "abstract":"Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable computing capacity—literally, servers in Amazon's data centers—that you use to build and host your software systems.",
              "sections":[
                {
                  "tiles":{
                    "tile":[
                      {
                        "href":"https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/",
                        "abstract":"Use Amazon EC2 to configure, launch, and manage virtual servers in the AWS cloud.",
                        "more-links":"",
                        "title":"User Guide for Linux Instances",
                        "id":"main-panels0-services-service0-href-sections0-tiles-tile0",
                        "locale":"en_us",
                        "pdf":"https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-ug.pdf",
                        "kindle":"https://www.amazon.com/dp/B076452RSZ",
                        "github":"https://github.com/awsdocs/amazon-ec2-user-guide/tree/master/doc_source"
.../...
  • Unique IDs with id keys are generated on array items. These will be useful when parsing data and integrating with web UI components.
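The sample ID above (main-panels0-services-service0-href-sections0-tiles-tile0) reads like the path of keys leading to the item, with each array index appended to the key of its array. The sketch below reproduces that pattern in JavaScript. It is an illustration inferred from that single sample, not the scraper's actual Scala code, and assignIds is a hypothetical name.

```javascript
// Hypothetical helper, inferred from the sample ID; not the scraper's own code.
// Every object stored in an array gets an "id" built from the keys leading to it,
// with the item's index appended to the key of the array that holds it.
function assignIds(node, path = "main") {
  if (Array.isArray(node)) {
    node.forEach((item, index) => {
      if (item !== null && typeof item === "object") {
        const itemPath = `${path}${index}`;
        if (!Array.isArray(item)) item.id = itemPath;
        assignIds(item, itemPath);
      }
    });
  } else if (node !== null && typeof node === "object") {
    for (const [key, child] of Object.entries(node)) {
      // Only descend into nested objects and arrays; scalars carry no IDs.
      if (child !== null && typeof child === "object") {
        assignIds(child, `${path}-${key}`);
      }
    }
  }
  return node;
}

// Trimmed-down version of the structure shown above.
const sample = {
  panels: [{
    services: {
      service: [{
        name: "EC2",
        href: { sections: [{ tiles: { tile: [{ title: "User Guide for Linux Instances" }] } }] }
      }]
    }
  }]
};

assignIds(sample);
console.log(sample.panels[0].services.service[0].href.sections[0].tiles.tile[0].id);
// prints "main-panels0-services-service0-href-sections0-tiles-tile0"
```

Path-based IDs of this kind stay unique across the whole document, which is what lets the accordion components address each panel and service individually.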

Example of rendering

You can render this JSON data on a Jekyll website using the client-side json2html library, along with jQuery and Bootstrap for querying and displaying the JSON as HTML components.

Here are the necessary imports for rendering:

<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.0.1/dist/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-+0n0xVW2eSR5OomGNYDnhzAbDsOXxcvSN1TPprVMTNDbiYZCxYbOOl7+AMvyTG2x" crossorigin="anonymous">
<script src="https://code.jquery.com/jquery-3.6.0.min.js" integrity="sha256-/xUj+3OJU5yExlq6GSYGSHk7tPXikynS7ogEvDej/m4=" crossorigin="anonymous"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/json2html/2.1.0/json2html.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.0.1/dist/js/bootstrap.bundle.min.js" integrity="sha384-gtEjrD/SeCtmISkJkNUaaKMoLD0//ElJ19smozuHV6z3Iehds+3Ulb9Bn9Plx0x4" crossorigin="anonymous"></script>

Additionally, here is the dedicated json2html rendering script:

  $.getJSON( "{{ site.url }}{{ site.baseurl }}/assets/full-documentation.json", function( data ) {
    
    json2html.component.add('main-header',
    {'<>':'section','html':[
        {'<>':'h1','text':'${title}','class':'text-muted'},
        {'<>':'p','text':'${abstract}','class':'lead'},
        {'[]':'panels','obj':function(){return(this.panels)}}
    ]});
    
    json2html.component.add('panels',
    {'<>':'section','html':[
      {'<>':'header','html':[
        {'<>':'div','class':'accordion accordion-flush','id':'accordion-panels'},
        {'<>':'div','class':'accordion-item','html':[
          {'<>':'h2','class':'accordion-header','id':'heading-${id}','html':[
            {'<>':'button','text':'${title}','class':'accordion-button collapsed','type':'button','data-bs-toggle':'collapse','data-bs-target':'#collapse-${id}','aria-expanded':'false','aria-controls':'collapse-${id}'}
          ]},
          {'<>':'div','id':'collapse-${id}','class':'accordion-collapse collapse','aria-labelledby':'heading-${id}','data-bs-parent':'#accordion-panels','html':[
            {'<>':'div','class':'accordion-body'},
              {'<>':'div','class':'row g-2','html':[
                {'[]':'service','obj':function(){return(this.services.service)}}
              ]}
          ]}
        ]}
      ]}
    ]});

    json2html.component.add('service',
      {'<>':'div','class':'col-6','html':[
        {'<>':'div','class':'p-3 border bg-light','html':[
          {'<>':'section','html':[
          {'<>':'header','html':[
            {'<>':'div','class':'accordion accordion-flush','id':'accordion-service'},
            {'<>':'div','class':'accordion-item','html':[
              {'<>':'h3','class':'accordion-header','id':'heading-${id}','html':[
                {'<>':'button','class':'accordion-button collapsed','type':'button','data-bs-toggle':'collapse','data-bs-target':'#collapse-${id}','aria-expanded':'false','aria-controls':'collapse-${id}','html':[
                    {'<>':'span','class':'text-muted','html':'${prefix}'},
                    {'html':'&nbsp;${name}'}
                  ]},
              ]},
              {'<>':'div','id':'collapse-${id}','class':'accordion-collapse collapse','aria-labelledby':'heading-${id}','data-bs-parent':'#accordion-service','html':[
                {'<>':'div','class':'accordion-body','html':[
                  {'[]':'service-href','obj':function(){return(this.href)}}
                ]}
              ]}
            ]}
          ]}
        ]}
      ]}
    ]});

    json2html.component.add('service-href',
      {'<>':'div','class':'card','html':[
        {'<>':'div','class':'card-header','text':'${abstract}'},
        {'[]':'sections','obj':function(){return(this.sections)}}
      ]}
    );

    json2html.component.add('sections',
      {'html':[
        {'<>':'div','class':'card-body','html':[
          {'<>':'h4','class':'card-title','html':'${title}'}
        ]},
        {'[]':'tiles','obj':function(){return(this.tiles)}}
      ]}
    );

    json2html.component.add('tiles',
      {'[]':'tile','obj':function(){return(this.tile)}}
    );

    json2html.component.add('tile',
      {'<>':'div','class':'card text-dark bg-light mb-3','html':[
        {'<>':'h5','class':'card-header','text':'${title}'},
        {'<>':'div','class':'card-body','html':[
          {'<>':'p','class':'card-text','html':'${abstract}'},
          {'<>':'div','class':'d-flex flex-row mb-3 justify-content-evenly','html':[
            {'[]':'amazon','obj':function(){return(this)}},
            {'[]':'pdf','obj':function(){return(this)}},
            {'[]':'github','obj':function(){return(this)}}
          ]}
        ]}
      ]}
    );

    json2html.component.add('amazon',
      {'<>':'div','class':function(){if(!!this.href) return("p-2"); else return("visually-hidden");},'html':[
        {'<>':'a','rel':'noopener noreferrer nofollow','href':'${href}','data-bs-toggle':'tooltip','data-bs-placement':'top','title':'More on AWS website','html':[
          {'<>':'span','html':[
            {'<>':'svg','width':'22.5','height':'18','xmlns':'http://www.w3.org/2000/svg','viewBox':'0 0 640 512','html':[
              {'<>':'path', 'd':'M180.41 203.01c-.72 22.65 10.6 32.68 10.88 39.05a8.164 8.164 0 0 1-4.1 6.27l-12.8 8.96a10.66 10.66 0 0 1-5.63 1.92c-.43-.02-8.19 1.83-20.48-25.61a78.608 78.608 0 0 1-62.61 29.45c-16.28.89-60.4-9.24-58.13-56.21-1.59-38.28 34.06-62.06 70.93-60.05 7.1.02 21.6.37 46.99 6.27v-15.62c2.69-26.46-14.7-46.99-44.81-43.91-2.4.01-19.4-.5-45.84 10.11-7.36 3.38-8.3 2.82-10.75 2.82-7.41 0-4.36-21.48-2.94-24.2 5.21-6.4 35.86-18.35 65.94-18.18a76.857 76.857 0 0 1 55.69 17.28 70.285 70.285 0 0 1 17.67 52.36l-.01 69.29zM93.99 235.4c32.43-.47 46.16-19.97 49.29-30.47 2.46-10.05 2.05-16.41 2.05-27.4-9.67-2.32-23.59-4.85-39.56-4.87-15.15-1.14-42.82 5.63-41.74 32.26-1.24 16.79 11.12 31.4 29.96 30.48zm170.92 23.05c-7.86.72-11.52-4.86-12.68-10.37l-49.8-164.65c-.97-2.78-1.61-5.65-1.92-8.58a4.61 4.61 0 0 1 3.86-5.25c.24-.04-2.13 0 22.25 0 8.78-.88 11.64 6.03 12.55 10.37l35.72 140.83 33.16-140.83c.53-3.22 2.94-11.07 12.8-10.24h17.16c2.17-.18 11.11-.5 12.68 10.37l33.42 142.63L420.98 80.1c.48-2.18 2.72-11.37 12.68-10.37h19.72c.85-.13 6.15-.81 5.25 8.58-.43 1.85 3.41-10.66-52.75 169.9-1.15 5.51-4.82 11.09-12.68 10.37h-18.69c-10.94 1.15-12.51-9.66-12.68-10.75L328.67 110.7l-32.78 136.99c-.16 1.09-1.73 11.9-12.68 10.75h-18.3zm273.48 5.63c-5.88.01-33.92-.3-57.36-12.29a12.802 12.802 0 0 1-7.81-11.91v-10.75c0-8.45 6.2-6.9 8.83-5.89 10.04 4.06 16.48 7.14 28.81 9.6 36.65 7.53 52.77-2.3 56.72-4.48 13.15-7.81 14.19-25.68 5.25-34.95-10.48-8.79-15.48-9.12-53.13-21-4.64-1.29-43.7-13.61-43.79-52.36-.61-28.24 25.05-56.18 69.52-55.95 12.67-.01 46.43 4.13 55.57 15.62 1.35 2.09 2.02 4.55 1.92 7.04v10.11c0 4.44-1.62 6.66-4.87 6.66-7.71-.86-21.39-11.17-49.16-10.75-6.89-.36-39.89.91-38.41 24.97-.43 18.96 26.61 26.07 29.7 26.89 36.46 10.97 48.65 12.79 63.12 29.58 17.14 22.25 7.9 48.3 4.35 55.44-19.08 37.49-68.42 34.44-69.26 34.42zm40.2 104.86c-70.03 51.72-171.69 79.25-258.49 79.25A469.127 469.127 0 0 1 2.83 327.46c-6.53-5.89-.77-13.96 7.17-9.47a637.37 637.37 0 0 0 316.88 84.12 630.22 630.22 0 0 0 241.59-49.55c11.78-5 21.77 7.8 10.12 16.38zm29.19-33.29c-8.96-11.52-59.28-5.38-81.81-2.69-6.79.77-7.94-5.12-1.79-9.47 40.07-28.17 105.88-20.1 113.44-10.63 7.55 9.47-2.05 75.41-39.56 106.91-5.76 4.87-11.27 2.3-8.71-4.1 8.44-21.25 27.39-68.49 18.43-80.02z'}
            ]}
          ]}
        ]}
      ]});

    json2html.component.add('pdf',
      {'<>':'div','class':function(){if(!!this.pdf) return("p-2"); else return("visually-hidden");},'html':[
        {'<>':'a','rel':'noopener noreferrer nofollow','href':'${pdf}','data-bs-toggle':'tooltip','data-bs-placement':'top','title':'Download PDF','html':[
          {'<>':'span','html':[
            {'<>':'svg','width':'13.5','height':'18','xmlns':'http://www.w3.org/2000/svg','viewBox':'0 0 384 512','html':[
              {'<>':'path','d':'M369.9 97.9L286 14C277 5 264.8-.1 252.1-.1H48C21.5 0 0 21.5 0 48v416c0 26.5 21.5 48 48 48h288c26.5 0 48-21.5 48-48V131.9c0-12.7-5.1-25-14.1-34zM332.1 128H256V51.9l76.1 76.1zM48 464V48h160v104c0 13.3 10.7 24 24 24h104v288H48zm250.2-143.7c-12.2-12-47-8.7-64.4-6.5-17.2-10.5-28.7-25-36.8-46.3 3.9-16.1 10.1-40.6 5.4-56-4.2-26.2-37.8-23.6-42.6-5.9-4.4 16.1-.4 38.5 7 67.1-10 23.9-24.9 56-35.4 74.4-20 10.3-47 26.2-51 46.2-3.3 15.8 26 55.2 76.1-31.2 22.4-7.4 46.8-16.5 68.4-20.1 18.9 10.2 41 17 55.8 17 25.5 0 28-28.2 17.5-38.7zm-198.1 77.8c5.1-13.7 24.5-29.5 30.4-35-19 30.3-30.4 35.7-30.4 35zm81.6-190.6c7.4 0 6.7 32.1 1.8 40.8-4.4-13.9-4.3-40.8-1.8-40.8zm-24.4 136.6c9.7-16.9 18-37 24.7-54.7 8.3 15.1 18.9 27.2 30.1 35.5-20.8 4.3-38.9 13.1-54.8 19.2zm131.6-5s-5 6-37.3-7.8c35.1-2.6 40.9 5.4 37.3 7.8z'}
            ]}
          ]}
        ]}
      ]});

    json2html.component.add('github',
      {'<>':'div','class':function(){if(!!this.github) return("p-2"); else return("visually-hidden");},'html':[
        {'<>':'a','rel':'noopener noreferrer nofollow','href':'${github}','data-bs-toggle':'tooltip','data-bs-placement':'top','title':'See source code on Github','html':[
          {'<>':'span','html':[
            {'<>':'svg','width':'17.5','height':'18','xmlns':'http://www.w3.org/2000/svg','viewBox':'0 0 496 512','html':[
              {'<>':'path','d':'M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z'}
            ]}
          ]}
        ]}
      ]});
    
    let template = [
      {'[]':'main-header'}];
    
    $('.container-fluid').json2html(data,template);
    var tooltipTriggerList = [].slice.call(document.querySelectorAll('[data-bs-toggle="tooltip"]'));
    var tooltipList = tooltipTriggerList.map(function (tooltipTriggerEl) {
      return new bootstrap.Tooltip(tooltipTriggerEl)
    })
  });

You can view the resulting page here: https://scalastic.io/en/aws-documentation/.
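Outside the browser, the same structure can be walked with plain JavaScript, for example to check the generated file before publishing it. The sketch below assumes the panels → services.service → href.sections → tiles.tile nesting shown in the sample JSON; collectTiles and the shape of the returned rows are hypothetical.

```javascript
// Hypothetical helper: flattens the nested documentation into one row per tile.
// Missing levels are treated as empty so that incomplete entries are skipped.
function collectTiles(doc) {
  const rows = [];
  for (const panel of doc.panels ?? []) {
    for (const service of panel.services?.service ?? []) {
      // "Amazon" + "EC2" -> "Amazon EC2"; an absent prefix is simply dropped.
      const serviceName = [service.prefix, service.name].filter(Boolean).join(" ");
      for (const section of service.href?.sections ?? []) {
        for (const tile of section.tiles?.tile ?? []) {
          rows.push({
            service: serviceName,
            title: tile.title,
            href: tile.href,
            pdf: tile.pdf || null
          });
        }
      }
    }
  }
  return rows;
}

// Trimmed-down sample mirroring the expected result shown earlier.
const doc = {
  title: "AWS Documentation",
  panels: [{
    services: {
      service: [{
        prefix: "Amazon",
        name: "EC2",
        href: {
          sections: [{
            tiles: {
              tile: [{
                title: "User Guide for Linux Instances",
                href: "https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/",
                pdf: "https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-ug.pdf"
              }]
            }
          }]
        }
      }]
    }
  }]
};

const tiles = collectTiles(doc);
console.log(tiles.length, tiles[0].service);
// prints: 1 Amazon EC2
```

With the real file, the doc object would come from JSON.parse on the contents of full-documentation.json.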

Contributing to aws-documentation-scraper

There may be cases where the tool doesn't scrape all the content, such as when encapsulated XML is incomplete, or when some content relies on pure HTML that the tool cannot scrape. If you want to contribute to aws-documentation-scraper, follow these steps:

  1. Fork this repository.
  2. Create a branch with a clear name: git checkout -b <branch_name>.
  3. Make your changes and commit them: git commit -m '<commit_message>'
  4. Push to the original branch: git push origin <project_name>/
  5. Create a pull request.

For more information, see GitHub's documentation on creating a pull request.

License

This project is licensed under the MIT License.

Contributors

jeanjerome
