HiExpan

The source code used for automatic taxonomy construction method HiExpan, published in KDD 2018.

Requirements

Before running, you need to first install the required packages with the following command:

$ pip3 install -r requirements.txt

Also, a C++ compiler supporting C++11 is needed.

If you want to use our pre-processing code, including the corpus preprocessing and feature extraction pipelines, you also need to install SpaCy and gensim. See /src/corpusProcessing/README.md and /src/featureExtraction/README.md for details.

Step 1: Corpus pre-processing

To reuse our corpus pre-processing pipeline, you need to first create a data folder named $DATA at the project root, and then put your raw text corpus (each line represents a single document) under the $DATA/source/ folder, as shown below.

data/$DATA
└── source
    └── corpus.txt
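
For example, here is a minimal Python sketch that lays out this structure; the dataset name wiki and the sample documents are placeholders, not part of the pipeline:

import os

DATA = "wiki"  # placeholder dataset name; replace with your own
source_dir = os.path.join("data", DATA, "source")
os.makedirs(source_dir, exist_ok=True)

# Each line of corpus.txt is one raw-text document.
documents = [
    "HiExpan constructs a taxonomy from a raw text corpus.",
    "A taxonomy organizes entities into parent-child relations.",
]
with open(os.path.join(source_dir, "corpus.txt"), "w") as f:
    f.write("\n".join(documents) + "\n")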

Step 2: Feature extraction

You need to first transform the raw text corpus into a standard JSON format and then run the feature extraction code. The first step outputs two files, organized as follows:

data/$DATA
└── intermediate
    ├── sentences.json
    └── entity2id.txt

Based on these two files, the feature extraction pipeline will output all the feature files needed by the HiExpan model.

data/$DATA
└── intermediate
    ├── eidSkipgramCounts.txt
    ├── eidSkipgram2TFIDFStrength.txt
    ├── eidTypeCounts.txt
    ├── eidType2TFIDFStrength.txt
    ├── eid2embed.txt
    └── eidDocPairPPMI.txt

Explanation of each intermediate file

  1. entity2id.txt: each line has two columns (separated by a “\t” character) and represents one entity. The first column is the entity surface name (with underscores concatenating all words) and the second column is the entity id (which serves as the unique identifier for retrieving each entity’s features).
  2. eidSkipgramCounts.txt: each line has three columns (separated by a “\t” character). The first column is an entity id. The second column is a skipgram feature associated with this entity; in the skipgram, the occurrence position of the entity is replaced with the placeholder “__”. The third column is the co-occurrence count between this entity id and the skipgram. For example, the line “0 \t reconstructed __ from \t 2” means “the entity with id 0 appears twice in the context reconstructed __ from”.
  3. eidSkipgram2TFIDFStrength.txt: each line has four columns (separated by a “\t” character). The first and second columns are exactly the same as in eidSkipgramCounts.txt. The third and fourth columns are association strengths between the entity and the skipgram feature. Larger values in the third/fourth columns indicate a stronger association between the entity and the skipgram feature.
  4. eidTypeCounts.txt: each line has three columns (separated by a “\t” character). The first column is an entity id. The second column is a type feature (in the current version, the type is retrieved from Probase) associated with this entity. The third column is the probability that this entity has the corresponding type. For example, the line “2025 \t conditional simulation algorithm \t 0.251” means “the probability that the entity with id 2025 is of type conditional simulation algorithm is 0.251”.
  5. eidType2TFIDFStrength.txt: each line has four columns (separated by a “\t” character). The first and second columns are exactly the same as in eidTypeCounts.txt. The third and fourth columns are normalized probabilities. Larger values in the third/fourth columns indicate a stronger association between the entity and the type feature.
  6. eid2embed.txt: each line is the embedding of one entity. This file is not human readable.
  7. eidDocPairPPMI.txt: each line has three columns (separated by a “\t” character). The first and second columns are two entity ids. The third column is the Positive Pointwise Mutual Information (PPMI) between these two entities. Larger PPMI values indicate a stronger association between the two entities.
  8. linked_results.txt: each line has two columns (separated by a “\t” character). The first column is the entity surface name (no underscores) used as the Probase linking input. The second column is the linking result. If an entity cannot be linked, the second column is simply an empty list []. Otherwise, it is a list of tuples, where each tuple is (type name, linking probability); the linking probability indicates how likely it is that the entity has that type. By analyzing this file, we can easily determine how many entities are linkable to Probase. (A short parsing sketch for these files follows this list.)
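
As a quick sanity check on these formats, here is a minimal Python sketch that loads a few of the files above. The dataset name wiki is a placeholder, the column layouts follow the descriptions in this list, and we assume linked_results.txt sits in the same intermediate folder and stores its list column as a Python-style literal:

import ast
from collections import defaultdict

base = "data/wiki/intermediate"  # placeholder dataset name

# entity2id.txt: surface_name \t entity_id
eid2name = {}
with open(f"{base}/entity2id.txt") as f:
    for line in f:
        name, eid = line.rstrip("\n").split("\t")
        eid2name[int(eid)] = name

# eidSkipgramCounts.txt: entity_id \t skipgram \t count
skipgram_counts = defaultdict(dict)
with open(f"{base}/eidSkipgramCounts.txt") as f:
    for line in f:
        eid, skipgram, count = line.rstrip("\n").split("\t")
        skipgram_counts[int(eid)][skipgram] = int(count)

# eidDocPairPPMI.txt: entity_id \t entity_id \t PPMI score
ppmi = {}
with open(f"{base}/eidDocPairPPMI.txt") as f:
    for line in f:
        e1, e2, score = line.rstrip("\n").split("\t")
        ppmi[(int(e1), int(e2))] = float(score)

# linked_results.txt: surface_name \t list of (type, probability) tuples;
# assumes the second column parses as a Python literal, e.g. [] or
# [("algorithm", 0.25)]
linked = {}
with open(f"{base}/linked_results.txt") as f:
    for line in f:
        name, result = line.rstrip("\n").split("\t")
        linked[name] = ast.literal_eval(result)

print(f"{len(eid2name)} entities, {len(ppmi)} PPMI pairs loaded")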

Step 3: Taxonomy Construction

After obtaining all features for your corpus, you can specify your seed taxonomy in ./HiExpan-new/seedLoader.py and run the HiExpan model with the following commands:

$ cd ./HiExpan-new
$ python main.py -data $corpus_name -taxonPrefix $taxonPrefix
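
The exact seed format is defined in ./HiExpan-new/seedLoader.py. Purely as an illustration (the structure below is hypothetical, not seedLoader.py's actual API), a seed taxonomy is a small tree in which each node is a concept carrying a few seed entities:

# Hypothetical illustration only: the actual data structure expected by
# seedLoader.py is defined in that file. Conceptually, a seed taxonomy is
# a shallow tree whose nodes hold a handful of seed entity surface names.
seed_taxonomy = {
    "root": {
        "machine_learning": ["support_vector_machine", "decision_tree"],
        "data_mining": ["association_rule_mining", "sequential_pattern_mining"],
    }
}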

Evaluation Datasets

The original evaluation dataset link has expired; the evaluation dataset used in the original paper is now available on Google Drive.
