tom-stack3 / wikigraph

Visualizing the "Getting to Philosophy" phenomenon: clicking the first link in the main text of a Wikipedia article, and then repeating the process, will lead to the 'Philosophy' article.

License: Apache License 2.0

Python 100.00%
getting-to-philosophy graphviz wikipedia-scraper

Wiki Graph - Getting to Philosophy

A script created to check the "Getting to Philosophy" phenomenon:
Clicking on the first link in the main text of an English Wikipedia article, and then repeating the process for subsequent articles, leads to the Philosophy article.

You are welcome to read more about this interesting phenomenon here: https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy

I created a Python script which "plays" this "game" and generates a graph showing the paths created by clicking the first link in each Wikipedia article.
For example, the following graph was generated from these starting pages:
"Russia", "Space", "Coronavirus", "Art", "LeBron James", "Real Madrid" and "Formula One".
(to view raw: 7_0.svg, 7_0.pdf)

Installation

  1. pip install -r requirements.txt
  2. Install Graphviz from here: https://www.graphviz.org/download/ and make sure that the directory containing the dot executable is on your system's PATH.

Libraries used:

  • bs4 to scrape information from web pages easily.
  • wikipedia to access and parse data from Wikipedia.
  • Graphviz to create and render graphs.

Running the script

There are three options to run:

draw_pages.py:

Gets a list of Wikipedia article names to draw. The script finds the closest article to each name entered and draws the graph for the pages chosen.

e.g:
python draw_pages.py formula 1, Nervous system, Road Bicycle Racing, minerals, baseball, cafe

Results in the following graph: (to view raw: 6_1.svg , 6_1.pdf)

6 Formula One+Nervous system+Road bicycle racing.svg
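
The example command separates article names with commas, while the shell splits arguments on spaces. One way to recover the names is to join the arguments and split on commas. This is only a sketch of that idea: parse_titles is a hypothetical helper, and the actual parsing in draw_pages.py may differ.

```python
import sys
from typing import List


def parse_titles(argv: List[str]) -> List[str]:
    """Join space-split arguments and split them again on commas.

    Hypothetical helper for illustration; not taken from draw_pages.py.
    """
    joined = " ".join(argv)
    # Drop surrounding whitespace and ignore empty entries (e.g. a trailing comma).
    return [name.strip() for name in joined.split(",") if name.strip()]


if __name__ == "__main__":
    # Everything after the script name is treated as the list of article names.
    print(parse_titles(sys.argv[1:]))
```

With the command above, this yields six names, starting with "formula 1" and "Nervous system".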

draw_random.py:

Gets a list of integers. For each number in the arguments, the script generates and draws a graph with randomly chosen articles. Each integer corresponds to the number of random articles in a drawing.

e.g:
python draw_random.py 10 18

Results in two graphs: one starting from 10 randomly chosen Wikipedia articles, and one starting from 18.

draw_handpicked_pages.py:

Gets the number of pages to draw. The script then lets the user choose each Wikipedia article manually (using console I/O). After all the pages are chosen, the script generates the graph from the chosen articles.

e.g:
python draw_handpicked_pages.py 8

Lets the user choose 8 Wikipedia articles, and then draws the graph for them.

How do we decide what to click on?

Following the chain consists of:

  • Clicking on the first non-parenthesized, non-italicized link.
  • Ignoring external links, links to the current page, or red links (links to non-existent pages).
  • Stopping when reaching "Philosophy", a page with no links or a page that does not exist, or when a loop occurs.
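
The stopping rules above can be sketched as a small loop. This is a minimal illustration, not the project's code: follow_chain and first_link are hypothetical names, and the toy link table is made-up data standing in for Wikipedia.

```python
from typing import Callable, List, Optional, Tuple

TARGET = "Philosophy"


def follow_chain(
    start: str, first_link: Callable[[str], Optional[str]]
) -> Tuple[List[str], str]:
    """Follow first links from `start` until one of the stop conditions holds.

    `first_link(title)` is a hypothetical callback returning the title of the
    first valid link on a page, or None when the page has no valid link or
    does not exist.
    """
    path = [start]
    seen = {start}
    current = start
    while current != TARGET:
        nxt = first_link(current)
        if nxt is None:
            return path, "dead end"
        path.append(nxt)
        if nxt in seen:
            # The chain came back to a page it already visited.
            return path, "loop"
        seen.add(nxt)
        current = nxt
    return path, "reached Philosophy"


# Made-up link table for illustration only (not real Wikipedia first links).
TOY_LINKS = {"A": "B", "B": "Philosophy", "C": "D", "D": "C", "E": None}


def toy_first_link(title: str) -> Optional[str]:
    return TOY_LINKS.get(title)


print(follow_chain("A", toy_first_link))  # (['A', 'B', 'Philosophy'], 'reached Philosophy')
print(follow_chain("C", toy_first_link))  # (['C', 'D', 'C'], 'loop')
print(follow_chain("E", toy_first_link))  # (['E'], 'dead end')
```

Keeping the visited titles in a set makes the loop check cheap, while the list preserves the order needed to draw the path.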

The function that decides which link to click is is_href_valid(), a method of the WikiPage class in WikiPage.py. It receives an HTML link tag (one carrying an href), parsed with bs4 (BeautifulSoup), and decides whether it is valid to click on. If the link is valid it returns True, otherwise False.
You can read the function for the full set of checks, but in general it verifies the following:

  1. It is indeed a link to a Wikipedia article. Meaning it is not an external link to somewhere outside Wikipedia.

  2. It is not a link enclosed in parentheses (round brackets).
    For example, in Epistemology the first link clicked should not be (🔊listen), Greek, or ἐπιστήμη. The right link to click is branch of philosophy.

  3. It is not a side-comment, meaning the link is not in the following tags:

    1. italicized (<i>)
    2. smaller text (<small>)
    3. superscript text (<sup>)
  4. It is not a link to a disambiguation page.
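
The checks above can be illustrated with a short, dependency-free sketch. The project itself uses bs4; this example swaps in the standard library's html.parser so it runs as-is. FirstLinkFinder and find_first_link are hypothetical names, the href patterns are assumptions about Wikipedia's markup, and the check for links to the current page is left out.

```python
from html.parser import HTMLParser
from typing import Optional

# Tags treated as "side comments" whose links are skipped.
SKIPPED_TAGS = {"i", "small", "sup"}


class FirstLinkFinder(HTMLParser):
    """Record the first link that passes the checks listed above."""

    def __init__(self) -> None:
        super().__init__()
        self.paren_depth = 0  # parentheses currently open in the running text
        self.skip_depth = 0   # how many <i>/<small>/<sup> tags are open
        self.first_link: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag == "a" and self.first_link is None:
            href = dict(attrs).get("href") or ""
            if self.paren_depth == 0 and self.skip_depth == 0 and self.is_article(href):
                self.first_link = href

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        for ch in data:
            if ch == "(":
                self.paren_depth += 1
            elif ch == ")" and self.paren_depth > 0:
                self.paren_depth -= 1

    @staticmethod
    def is_article(href: str) -> bool:
        # Assumption: article links start with /wiki/, which excludes
        # external links and red links.
        if not href.startswith("/wiki/"):
            return False
        # Assumption: a colon marks a namespace page such as Help: or File:.
        if ":" in href:
            return False
        # Crude disambiguation check based on the title suffix.
        return not href.endswith("_(disambiguation)")


def find_first_link(html: str) -> Optional[str]:
    finder = FirstLinkFinder()
    finder.feed(html)
    return finder.first_link


# Hand-written snippet loosely modeled on the Epistemology example above;
# it is not the real page markup and the hrefs are illustrative.
SAMPLE = (
    '<p><b>Epistemology</b> (<a href="/wiki/Help:IPA">listen</a>; from '
    '<a href="/wiki/Ancient_Greek">Greek</a> '
    '<i><a href="/wiki/Episteme">episteme</a></i>) is the '
    '<a href="/wiki/Branch_of_philosophy">branch of philosophy</a> ...</p>'
)
print(find_first_link(SAMPLE))  # /wiki/Branch_of_philosophy
```

Tracking depth counters rather than inspecting each link's ancestors keeps the sketch to a single pass over the HTML.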

How is the graph generated?

To generate the graph, I used a very convenient open-source library I found called Graphviz.

Output formats

The Graphviz library supports many output formats (see their documentation). In this project I preferred to use .SVG and .PDF files. Both are vector formats, so they preserve quality when zooming in.

One advantage of .SVG over .PDF files is that it allows adding URL links onto nodes, a feature which I found very useful. Consequently, the nodes of the graphs in the .SVG files are clickable and lead to the Wikipedia page they represent.
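
As a rough illustration of clickable nodes, the sketch below writes Graphviz DOT text by hand and sets the URL attribute on each node, which Graphviz turns into a hyperlink in SVG output. The build_dot helper, the page titles, and the URL scheme are assumptions for the example; the project may build its graphs differently.

```python
from typing import List, Sequence, Tuple

WIKI_BASE = "https://en.wikipedia.org/wiki/"  # assumed URL scheme for article links


def quote(text: str) -> str:
    """Wrap text in double quotes for DOT, escaping embedded double quotes."""
    return '"' + text.replace('"', '\\"') + '"'


def build_dot(edges: Sequence[Tuple[str, str]]) -> str:
    """Return DOT source where each node carries a URL attribute."""
    # Collect node titles in first-seen order.
    nodes: List[str] = []
    for src, dst in edges:
        for title in (src, dst):
            if title not in nodes:
                nodes.append(title)

    lines = ["digraph wikigraph {"]
    for title in nodes:
        url = WIKI_BASE + title.replace(" ", "_")
        lines.append(f"    {quote(title)} [URL={quote(url)}];")
    for src, dst in edges:
        lines.append(f"    {quote(src)} -> {quote(dst)};")
    lines.append("}")
    return "\n".join(lines)


# Made-up path for illustration; not an actual first-link chain.
print(build_dot([("Page A", "Page B"), ("Page B", "Philosophy")]))
```

Saving the output as graph.dot and running `dot -Tsvg graph.dot -o graph.svg` produces an SVG in which each node links to its URL.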

Loops found 😯

Of course, the "Getting to Philosophy" phenomenon doesn't happen in 100% of cases: some chains get stuck in a loop and never reach Philosophy. Some interesting loops of Wikipedia articles I found:

So... what does Philosophy lead to?

As surprising as it sounds, Philosophy also leads to Philosophy 🥳🥳
You can see its path here: philosophy path
(to view raw: philosophy.svg , philosophy.pdf)

Examples:

Check out the folder output_examples for some examples of generated graphs.

Created by Tommy Zaft
