read-gooder-wikiparse

Parsing Wikipedia content into plain text and creating small reading comprehension questions

Running

usage: simplify_wiki_html.py [-h] [-t TITLE | -f FILE]

Convert a wikipedia page into OpenMind JSON format

optional arguments:
  -h, --help            show this help message and exit
  -t TITLE, --title TITLE
                        fetch the wikipedia article with this title from the
                        Wikimedia REST API
  -f FILE, --file FILE  convert the local HTML file at this path

The -t option fetches the HTML version of the Wikipedia article from the Wikimedia REST API and therefore requires internet connectivity. For development, it's probably nicer to use the -f option, which works offline on a saved HTML file.
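
For reference, the usage text above can be reproduced with a small argparse setup. This is only a sketch of how such a command line might be declared, not the actual contents of simplify_wiki_html.py; the function name build_parser and the sample file name train.html are made up for illustration.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the usage text (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="simplify_wiki_html.py",
        description="Convert a wikipedia page into OpenMind JSON format",
    )
    # -t and -f are alternatives, so they share a mutually exclusive group.
    source = parser.add_mutually_exclusive_group()
    source.add_argument(
        "-t", "--title",
        help="fetch the wikipedia article with this title from the "
             "Wikimedia REST API",
    )
    source.add_argument(
        "-f", "--file",
        help="convert the local HTML file at this path",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for the example.
# "train.html" is a placeholder path, not a file shipped with the project.
args = build_parser().parse_args(["-f", "train.html"])
print(args.title, args.file)
```

Passing both -t and -f to a parser built this way makes argparse exit with an error, which matches the `[-t TITLE | -f FILE]` notation in the usage line.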

Dependencies

nltk

Install nltk and nltk data

Stanford Parser

Text is parsed using the Stanford parser. Follow the instructions to install the Stanford parser, and use it through the nltk interface.

Wikimedia REST API

This utility relies on the Wikimedia REST API. In particular, it uses the HTML endpoint, which returns the latest HTML for a given Wikipedia page title.
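
As a rough illustration of that fetch, the sketch below builds the endpoint URL for a title and downloads the page using only the standard library. It assumes the English Wikipedia endpoint at /api/rest_v1/page/html/; the helper names build_html_url and fetch_html and the User-Agent string are invented for this example and are not taken from the project.

```python
import urllib.parse
import urllib.request

# Assumption: the English Wikipedia REST API HTML endpoint.
REST_HTML_BASE = "https://en.wikipedia.org/api/rest_v1/page/html/"


def build_html_url(title):
    """Return the HTML endpoint URL for a page title (hypothetical helper)."""
    # Page titles use underscores for spaces; percent-encode everything else.
    return REST_HTML_BASE + urllib.parse.quote(title.replace(" ", "_"), safe="")


def fetch_html(title, timeout=10):
    """Download the latest HTML for a title. Needs internet connectivity."""
    request = urllib.request.Request(
        build_html_url(title),
        # Placeholder User-Agent; a real client should identify itself.
        headers={"User-Agent": "wikiparse-example/0.1"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


print(build_html_url("Rail transport"))
```

Only build_html_url runs here; calling fetch_html("Train") performs the network request, which is why the -f option is handier during development.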

JSON Format

I'll try to document this in a couple of ways to see which one makes more sense.

Grammar-like documentation

<document> ::=
	"header": STRING,
	"sections": <sections> | <paragraphs>

<section> ::=
	?"header": STRING,
	<paragraphs> | <section> | <paragraphs>,<section>

<paragraphs> ::=
	"sentences": [<sentences>]

<sentences> ::=
	[<sentence>]

<sentence> ::=
	"num_words": INT,
	"sentence_parts": [<sentence_parts>]

<sentence_parts> ::=
	"indent": INT,
	"text": STRING

English-like documentation

{
    "header": "Train", # Title of document
    "section": { # Sections are made up of paragraphs or subsections or both
        "paragraphs": [ # Paragraphs is a list of paragraph
            { # A paragraph
                "sentences": [ # Sentences is a list of sentence
                    { # sentence has num_words and a list of sentence_parts
                        "num_words": 26, 
                        "sentence_parts": [
                            { # sentence_parts have an indent amount and text
                                "indent": 0, 
                                "text": "A train is a"
                            }, 
                            {
                                "indent": 0, 
                                "text": "form of rail transport"
                            }, 
                            {
                                "indent": 0, 
                                "text": "consisting of a series"
                            },
                            .
                            .
                            .
                        ] # End sentence_parts
                    } # End sentence
                ] # End sentences
            }, # End paragraph
            .
            .
            .
        ], # End paragraphs
        "section": [
            {
                "header": "Types", # Title of the section
                "paragraphs": [ # Paragraphs in that section
                    {
                        "sentences": [
                            {
                                "num_words": 12, 
                                "sentence_parts": [
                                    {
                                        "indent": 0, 
                                        "text": "There are various types"
                                    }, 
                                    {
                                        "indent": 0, 
                                        "text": "of trains that are"
                                    }, 
                                    {
                                        "indent": 0, 
                                        "text": "designed for particular purposes."
                                    }
                                ]
                            }
                        ]
                    }
                ]
            }
        ]
    }
}
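
To show how a consumer might read this structure, here is a minimal sketch that rebuilds one sentence from its sentence_parts and compares the result with num_words. It assumes that parts are joined with single spaces and that num_words counts whitespace-separated words; both assumptions hold for the sample sentence above but are not confirmed by the project. The helper name sentence_text is made up.

```python
def sentence_text(sentence):
    """Join the text of every sentence part with single spaces (assumed)."""
    return " ".join(part["text"] for part in sentence["sentence_parts"])


# The "Types" sentence from the example document above.
sentence = {
    "num_words": 12,
    "sentence_parts": [
        {"indent": 0, "text": "There are various types"},
        {"indent": 0, "text": "of trains that are"},
        {"indent": 0, "text": "designed for particular purposes."},
    ],
}

text = sentence_text(sentence)
print(text)

# Compare the stored count with a whitespace-based word count.
print(len(text.split()) == sentence["num_words"])
```

The same helper can be applied to every entry of a paragraph's "sentences" list to recover plain text for that paragraph.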

Contributors

blpercha, zimmeee
