This project forked from larsmans/wikiassoc

Link structure mining in the Wikipedia

License: GNU General Public License v3.0

Wikiassoc

Wikiassoc is a tool for generating term associations by analyzing the link structure of Wikipedia (or any other wiki based on the MediaWiki software). You put in a Wikipedia database dump, and get out a table of terms (article titles) and the most strongly related terms.

To build and install Wikiassoc, you need a fairly modern C++ compiler (tested with GCC 4.3 and Open64 4.2.2.2), the Boost libraries (specifically Boost.IOStreams and Boost.Regex), zlib, bzlib and the GNU autotools (autoconf and automake).

It is highly advisable to use

as Wikiassoc will be very slow or consume huge amounts of memory without these.

Wikiassoc uses the GNU build tools. Enter

./prepare
./configure && make && make install

or see the file INSTALL for more detailed instructions.

Usage

To compile an associative thesaurus with Wikiassoc, first download some files from the Wikimedia dump repository. For example, say you want word associations in Latin. Fetch the files

lawiki-YYYYMMDD-page.sql.gz
lawiki-YYYYMMDD-pagelinks.sql.gz

and run the Wikiassoc program as

wikiassoc lawiki-YYYYMMDD-page.sql.gz lawiki-YYYYMMDD-pagelinks.sql.gz \
  | gzip -c > lawiki-associations.gz

(Using gzip is highly recommended, as Wikiassoc produces a lot of output.)

Wikiassoc writes a progress log to stderr. Note that it needs a lot of memory: on the larger Wikipedias, 12 GB or more.

lawiki-associations.gz contains a text file that lists each term on its own line, followed by that term's associations on indented lines:

Astronomia
    Scientia
    Physica
    Universum
    Galaxias
    Planeta
    Geologia
    Mathematica
    Stella
    Luna
    Terra
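
The listing above is easy to load into a lookup table. The following Python sketch is not part of Wikiassoc; the function names are hypothetical, and it assumes that every unindented line is a term, that every line starting with whitespace is an association of the preceding term, and that the file is UTF-8. Check the manpage before relying on any of this.

```python
import gzip
from typing import Dict, Iterable, List


def parse_associations(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group indented lines under the preceding unindented term.

    Assumption: indentation is the only thing that separates a term
    from its associations. Associations keep their file order.
    """
    table: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        if line[0] in " \t":
            # Indented line: an association of the current term.
            if current is not None:
                table[current].append(line.strip())
        else:
            # Unindented line: start of a new term.
            current = line.strip()
            table.setdefault(current, [])
    return table


def load_associations(path: str) -> Dict[str, List[str]]:
    """Read a gzip-compressed association file (UTF-8 is an assumption)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return parse_associations(handle)


if __name__ == "__main__":
    demo = ["Astronomia\n", "    Scientia\n", "    Physica\n"]
    print(parse_associations(demo))
```

For the output of a large Wikipedia, the whole table may not fit comfortably in memory; in that case it is probably better to process one term at a time while iterating over the file.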

See the manpage for details (man wikiassoc).

How does it work?

For each article in the Wikipedia database dump, Wikiassoc looks at all the articles that can be reached by following at most two links. It then weights all these articles by a scheme called pf-ibf, or path frequency-inverse backlink frequency. For further explanation, refer to:

Nakayama, K., Hara, T. and Nishio, S. (2007) Wikipedia Mining for an Association Web Thesaurus Construction. In Proc. International Conference on Web Information Systems Engineering (WISE), pp. 322-334.
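
To give a feel for the idea, here is a small Python sketch of a pf-ibf style ranking on a toy link graph. It is an illustration under stated assumptions, not Wikiassoc's implementation and not necessarily the exact formula from the paper: each path of length 1 or 2 contributes 1/length to the path frequency, and the inverse backlink frequency is taken as log(N / backlinks). The names Graph, backlink_counts and rank_associations are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Toy link graph: article title -> set of titles it links to.
Graph = Dict[str, Set[str]]


def backlink_counts(graph: Graph) -> Dict[str, int]:
    """Count how many articles link to each article."""
    counts: Dict[str, int] = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return counts


def rank_associations(graph: Graph, term: str, n_articles: int,
                      top: int = 10) -> List[Tuple[str, float]]:
    """Rank articles reachable from `term` in at most two links.

    Assumed weighting: a path of length L adds 1/L to the path
    frequency; the score is path frequency * log(N / backlinks).
    """
    backlinks = backlink_counts(graph)
    path_freq: Dict[str, float] = defaultdict(float)
    for mid in graph.get(term, ()):
        if mid == term:
            continue  # ignore self-links
        path_freq[mid] += 1.0  # direct link, length 1
        for far in graph.get(mid, ()):
            if far != term and far != mid:
                path_freq[far] += 0.5  # two-link path, length 2
    scored = [
        (target, freq * math.log(n_articles / backlinks[target]))
        for target, freq in path_freq.items()
    ]
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top]


if __name__ == "__main__":
    toy = {"A": {"B", "C"}, "B": {"C"}, "C": set(), "D": {"C"}}
    for title, score in rank_associations(toy, "A", n_articles=len(toy)):
        print(f"{title}\t{score:.3f}")
```

In the toy graph, C is reachable from A along two paths, but it is also linked from almost every article, so the backlink term pushes it below B. This is the intended effect of the inverse backlink factor: heavily linked, generic articles are discounted.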
