
devsinghsachan / wikipedia_parser


This project forked from jonathanraiman/wikipedia_parser


Parses Wikipedia dumps and extracts links and page types.


Wikipedia Parser

Wikipedia Parser lets you parse .bz2 and .xml dumps of Wikipedia articles without first decompressing the .bz2 file. The parser works as an enumerator over the pages, stream-parsing the dump one page at a time (reading a full dump can take several hours).

Usage

To get started, run gem install wikipedia_parser, download a dump from the Wikipedia dumps page, and then you can:

require 'wikipedia_parser'

parser = WikiParser.new :path => File.dirname(__FILE__)+"/enwiki.bz2" # path to wikipedia dump

loop do
	page = parser.get_next_page
	break unless page # no pages left in the dump
	puts page.title
	puts page.internal_links
end
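Because get_next_page returns a falsy value once the dump is exhausted, the loop above can also be wrapped in a standard Ruby Enumerator, which makes methods such as lazy, map, and first available. The following is a minimal sketch and not part of the gem: page_enumerator, StubParser, and StubPage are hypothetical names, and the stub stands in for a real WikiParser so the example runs without a dump.

```ruby
# Hypothetical helper (not part of wikipedia_parser): wraps any object that
# responds to get_next_page in an Enumerator that stops at the first nil/false.
def page_enumerator(parser)
  Enumerator.new do |yielder|
    while (page = parser.get_next_page)
      yielder << page
    end
  end
end

# Stand-ins so the sketch runs without a real dump.
StubPage = Struct.new(:title, :internal_links)

class StubParser
  def initialize(pages)
    @pages = pages.dup
  end

  # Assumed behaviour: returns the next page, or nil when none are left.
  def get_next_page
    @pages.shift
  end
end

stub = StubParser.new([
  StubPage.new("Anarchism", []),
  StubPage.new("Autism", []),
  StubPage.new("Ruby", [])
])

# Pages are pulled one at a time, so only the first two are read here.
first_titles = page_enumerator(stub).lazy.map(&:title).first(2)
puts first_titles.inspect
```

With a real dump, passing the WikiParser instance instead of the stub should behave the same way, assuming get_next_page keeps returning a falsy value at the end of the file.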

Each internal link contains the title and the URI of the destination page:

loop do
	page = parser.get_next_page
	break unless page # no pages left in the dump
	page.internal_links.each do |link|
		# the link title is looked up by the page's language
		puts link[:title][page.language] + " points to " + link[:uri]
	end
end
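The indexing used above suggests that each link is a hash whose :title entry is itself keyed by language, next to a :uri string. Under that assumption, the links can be collected into a simple adjacency list. This is only a sketch: LinkedPage and link_graph are hypothetical names, and the sample titles, language key, and URIs are made up for illustration.

```ruby
# Stand-in for a parsed page; a real page object comes from WikiParser.
LinkedPage = Struct.new(:title, :language, :internal_links)

# Hypothetical helper: maps each page title to the titles it links to.
def link_graph(pages)
  graph = Hash.new { |hash, key| hash[key] = [] }
  pages.each do |page|
    page.internal_links.each do |link|
      graph[page.title] << link[:title][page.language]
    end
  end
  graph
end

# Made-up sample data following the assumed link structure.
sample_pages = [
  LinkedPage.new("Ruby", "en", [
    { :title => { "en" => "Yukihiro Matsumoto" }, :uri => "Yukihiro_Matsumoto" },
    { :title => { "en" => "Perl" }, :uri => "Perl" }
  ]),
  LinkedPage.new("Perl", "en", [
    { :title => { "en" => "Larry Wall" }, :uri => "Larry_Wall" }
  ])
]

graph = link_graph(sample_pages)
graph.each { |source, targets| puts "#{source} -> #{targets.join(', ')}" }
```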

For larger pages, such as Barack Obama or United States, reading all the internal links takes a long time. If you are only interested in certain pages, say those that start with the letter A, you can skip the pages you don't want without incurring the cost of reading the entire article, since the title appears at the top of each page's XML. Here's one way of implementing this check:

loop do
	page = parser.get_next_page :until => "title" # or "id" or "redirect" (boolean)
	break unless page # no pages left in the dump
	if page.title =~ /^[aA]/ # starts with A
		page.finish_processing # reads the remainder of the nodes.
		page.internal_links.each do |link|
			puts link[:title][page.language] + " points to " + link[:uri]
		end
	end
end
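The same pattern can be factored into a small helper that fully processes only the pages whose title matches. The sketch below uses stand-in classes so it runs without a dump: each_matching_page, LazyStubPage, and HeaderOnlyStubParser are hypothetical names, and the stub simply records whether finish_processing was called, to show that skipped pages never pay that cost.

```ruby
# Stand-in page: records whether the expensive remainder was requested.
class LazyStubPage
  attr_reader :title, :finished

  def initialize(title)
    @title = title
    @finished = false
  end

  def finish_processing
    @finished = true
  end
end

# Stand-in parser: accepts the :until option like the example above,
# but ignores it and just hands out the prepared pages.
class HeaderOnlyStubParser
  def initialize(pages)
    @pages = pages.dup
  end

  def get_next_page(options = {})
    @pages.shift
  end
end

# Hypothetical helper: reads only up to the title, then finishes and yields
# the page if the title matches the given pattern.
def each_matching_page(parser, pattern)
  while (page = parser.get_next_page(:until => "title"))
    next unless page.title =~ pattern
    page.finish_processing
    yield page
  end
end

all_pages = %w[Anarchism Ruby apple Zebra].map { |title| LazyStubPage.new(title) }
matched = []
each_matching_page(HeaderOnlyStubParser.new(all_pages), /^[aA]/) do |page|
  matched << page.title
end

puts matched.inspect
puts "fully processed: #{all_pages.count(&:finished)} of #{all_pages.size}"
```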

Testing

Simply run:

rake test

Issues

On Mac and Linux bzip2 is included, but it appears that on Windows a separate bzip2 reader is needed.
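Since the parser also accepts .xml dumps, one possible workaround on Windows is to decompress the dump once with an external tool and point the parser at the resulting .xml file. The sketch below only picks the path. dump_path is a hypothetical helper, the file names are assumptions, and Gem.win_platform? is the RubyGems platform check.

```ruby
require 'rubygems'

# Hypothetical helper: uses the compressed dump by default, but falls back to
# an already-decompressed .xml file on Windows (file names are assumptions).
def dump_path(directory, windows = Gem.win_platform?)
  file_name = windows ? "enwiki.xml" : "enwiki.bz2"
  File.join(directory, file_name)
end

puts dump_path("dumps", false)
puts dump_path("dumps", true)

# With the gem installed, the result would be passed on as:
#   parser = WikiParser.new :path => dump_path(File.dirname(__FILE__))
```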
