chrip / wikiprep-esa

This project forked from faraday/wikiprep-esa
ESA implementation using Wikiprep output
Wikiprep-ESA

This is an effort to implement Explicit Semantic Analysis (ESA) as described in the paper "Wikipedia-based semantic interpretation for natural language processing", Gabrilovich, E. and Markovitch, S., 2009. You can find the paper at:
http://www.jair.org/media/2669/live-2669-4346-jair.pdf

This implementation consists of:

* scanData.py: reads Wikiprep output into a MySQL database. It creates the "article", "text" and "pagelinks" tables.
* addAnchors.py: adds anchor text to target articles.
* addRedirects.py: adds redirect text to target articles.

The scripts above can work on both the Wikiprep legacy format and the modern format (as in the Zemanta fork).

Evgeniy Gabrilovich provides a preprocessed dump of the 5 November 2005 snapshot of English Wikipedia. It is available at:
http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/wikipedia-051105-preprocessed.tar.bz2

With its current settings, the Python scripts of wikiprep-esa are ready to process this dump with --format=gabrilovich or --format=gl. If you need to process dumps in Zemanta formats, set the --format argument accordingly, e.g. --format=zemanta-modern, --format=zm or --format=modern.

The Wikiprep dump format can be one of the following:

1. Gabrilovich [gl, gabrilovich]
2. Zemanta legacy [zl, legacy, zemanta-legacy]
3. Zemanta modern [zm, modern, zemanta-modern]

After reading the preprocessed dump into the database and adding anchors and redirects, you need to use "esa-lucene" to perform indexing:

* ESAWikipediaIndexer: performs indexing with Lucene by feeding it article content from the database.
* WikipediaNormalSearcher: at this step, you can use this class to search the Lucene index. Keep in mind that at this point the implementation is not the same as Gabrilovich et al. (2009), since cosine normalization is term-based in Gabrilovich et al. but document-length based in Lucene. Additionally, pruning is not yet applied to the Lucene index as in Gabrilovich et al.
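To make the normalization difference concrete, the log-based TF.IDF weighting with cosine normalization over the TF.IDF weights (rather than Lucene's document-length norm) can be sketched in Python. This is a toy, in-memory illustration of the scheme described in Gabrilovich et al. (2009), not the project's actual code; the function name and data layout are hypothetical.

```python
import math

def esa_tfidf(docs):
    # docs: {doc_id: [token, ...]} -- hypothetical toy input; the real
    # pipeline reads article text from the MySQL database instead.
    n = len(docs)
    df = {}      # document frequency per term
    counts = {}  # raw term frequencies per document
    for doc_id, tokens in docs.items():
        c = {}
        for t in tokens:
            c[t] = c.get(t, 0) + 1
        counts[doc_id] = c
        for t in c:
            df[t] = df.get(t, 0) + 1
    vectors = {}
    for doc_id, c in counts.items():
        # log-based TF.IDF: (1 + log tf) * log(N / df)
        vec = {t: (1.0 + math.log(f)) * math.log(n / df[t])
               for t, f in c.items()}
        # cosine-normalize using the TF.IDF weights themselves;
        # Lucene's default would normalize by document length instead
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors[doc_id] = {t: w / norm for t, w in vec.items()}
    return vectors
```

The resulting per-document weights are what IndexModifier stores in the "tfidf" table in the real pipeline.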
However, the TF.IDF weighting scheme is the same (log-based) and is located in the ESASimilarity class.

* IndexModifier: reads term frequency vectors from the Lucene index and writes cosine-normalized TF.IDF values into the "tfidf" table in the database. This is done to apply the same normalization method used in Gabrilovich et al. (2009). [DEPRECATED]
* IndexPruner: prunes the concept vector of each term with a sliding window. By default, window_size = 100 and threshold = 0.05, as in Gabrilovich et al. (2009). You can modify these values in the IndexPruner class.
* ESASearcher: performs search and computes vectors using the resulting index in the database.
* TestESAVectors: produces and displays a regular feature vector.
* TestGeneralESAVectors: produces and displays a "Second Order Interpretation" vector filtered with the "Concept Generality Filter", as in Gabrilovich et al. (2009).

DEPENDENCIES

The Python scripts use MySQL-Python to access the database:
http://sourceforge.net/projects/mysql-python/

They also use PyStemmer, the project encapsulating the Python wrappers of Snowball; you can find further info at:
http://snowball.tartarus.org/download.php

The "esa-lucene" Java project, used for indexing, pruning etc., uses MySQL Connector/J to access the database, Lucene 3.0 for indexing, and Trove; these libraries are included in the project files.

MySQL Connector/J: http://www.mysql.com/downloads/connector/j/
Lucene 3.0: http://lucene.apache.org
Trove: http://trove4j.sourceforge.net/

USAGE

The following creates the pagelinks table and records incoming and outgoing link counts. [STANDARD]

    python scanLinks.py <hgw.xml file from Wikiprep dump>

(e.g. python scanLinks.py simplewiki/simplewiki-20110620-pages-articles.gum.xml)

You can provide a list of stop categories for your Wikipedia dump to help filter out irrelevant articles. A list for the 2005 dump of Gabrilovich et al. is provided in "2005_wiki_stop_categories.txt".
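The sliding-window pruning that IndexPruner applies to each term's concept vector can be sketched as follows. This is an illustrative reading of the scheme in Gabrilovich et al. (2009): sort the concepts by weight and truncate the vector where the drop across a window of concepts falls below a fraction of the highest weight. It is not a port of the actual Java class, and the function name is hypothetical.

```python
def prune_vector(scores, window_size=100, threshold=0.05):
    # scores: list of (concept_id, weight) pairs for one term.
    # Sort by weight, highest first.
    ranked = sorted(scores, key=lambda cw: cw[1], reverse=True)
    if len(ranked) <= window_size:
        return ranked
    highest = ranked[0][1]
    for i in range(len(ranked) - window_size):
        # Truncate at the first position where the weight drop across
        # the window is less than threshold * highest weight.
        if ranked[i][1] - ranked[i + window_size][1] < threshold * highest:
            return ranked[:i + 1]
    return ranked
```

With the defaults (window_size = 100, threshold = 0.05) this keeps only the head of each concept vector where weights are still falling steeply, which is what makes the inverted index in the database compact.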
Note that you should prepare your own, updated file for your Wikipedia dump if you are going to use stop category filtering. If you want to descend into and include all subtrees of these categories, you can use: [OPTIONAL]

    python scanCatHier.py <hgw.xml/gum.xml file from Wikiprep> <output file path> --stopcats=<stop category file>

[The commands below are all STANDARD]

    python scanData.py <hgw.xml/gum.xml file from Wikiprep dump> --format=<Wikiprep dump format> [--stopcats=<stop category file>]

(e.g. python scanData.py simplewiki/simplewiki-20110620-pages-articles.gum.xml --format=zm)

    python addAnchors.py <anchor_text file from Wikiprep dump> <a writeable folder> --format=<Wikiprep dump format>

(e.g. python addAnchors.py simplewiki/simplewiki-20110620-pages-articles.anchor_text anchor --format=zm)

    java -cp esa-lucene.jar edu.wiki.index.ESAWikipediaIndexer <Lucene index folder>

    java -cp esa-lucene.jar edu.wiki.modify.IndexModifier <Lucene index folder>

... or, if you have sufficient RAM (15 GB was enough to process the en-20090618 dump), try this instead:

    java -cp esa-lucene.jar edu.wiki.modify.MemIndexModifier <Lucene index folder>

IndexModifier sorts the TF-IDF vectors with the Unix sort utility, using the disk; MemIndexModifier handles the sorting in memory instead.

Then perform feature generation to test.

To generate regular features:

    java -cp esa-lucene.jar edu.wiki.demo.TestESAVectors

To generate features using only the more general links:

    java -cp esa-lucene.jar edu.wiki.demo.TestGeneralESAVectors
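As a rough illustration of what ESASearcher computes from the "tfidf" table at the end of this pipeline, here is a toy sketch of building an ESA interpretation vector for a bag of terms and comparing two such vectors by cosine similarity. The function names and the in-memory index layout are hypothetical; the real classes work against MySQL and the Lucene index.

```python
import math

def esa_vector(terms, inverted_index):
    # inverted_index: {term: {concept_id: tfidf_weight}} -- a toy stand-in
    # for the pruned "tfidf" table. The interpretation vector of a text is
    # the weighted sum of the concept vectors of its terms.
    vec = {}
    for t in terms:
        for concept, w in inverted_index.get(t, {}).items():
            vec[concept] = vec.get(concept, 0.0) + w
    return vec

def relatedness(v1, v2):
    # Cosine similarity between two concept vectors.
    dot = sum(w * v2.get(c, 0.0) for c, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Two texts are semantically related to the degree that their interpretation vectors activate the same Wikipedia concepts, which is the core idea of ESA.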