
mickeysjm / hiexpan

71 stars · 71 watchers · 18 forks · 55.78 MB

The source code for HiExpan, an automatic taxonomy construction method published at KDD 2018.

License: GNU General Public License v3.0

Python 30.86% Shell 8.40% Makefile 0.23% C++ 32.28% C 1.60% Java 11.47% Perl 14.74% Dockerfile 0.40%
taxonomy-construction

hiexpan's People

Contributors: mickeysjm


hiexpan's Issues

Chinese entity HiExpan issues

Hi Jiaming,
Thanks for your idea and code. When I ran the code on a Chinese corpus, I found some issues:

  • First, dependency parsing and part-of-speech tagging seem unnecessary in corpus preprocessing.

  • Second, the getCombinedWeightByFeatureMap function takes too much time when featuresOfSeed is large (the number of skip-gram patterns reaches hundreds of thousands). So I retained only the 600 features with the highest scores and standard length in "eidSkipgram2TFIDFStrength.txt" for each entity. This reduced the run time from 30 hours to 30 minutes, but it may reduce the accuracy of the combinedWeight score.

  • Third, the type feature is useless for Chinese, so I use an LDA model's score instead. I am now evaluating the effectiveness of this approach.

  • Finally, I could not find the code for the Taxonomy Global Optimization step. Where can I find it?
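The pruning described in the second point above can be sketched as follows. This is a minimal sketch: `prune_features` and its exact interface are assumptions based on the description, not code from the repository.

```python
from heapq import nlargest

def prune_features(feature2strength, k=600):
    """Keep only the k features with the highest TF-IDF strength for
    one entity, as described above (a sketch, not the repo's code)."""
    return dict(nlargest(k, feature2strength.items(), key=lambda kv: kv[1]))

# Tiny example: keep the 2 strongest of 3 features.
pruned = prune_features({"f1": 0.9, "f2": 0.1, "f3": 0.5}, k=2)  # -> {"f1": 0.9, "f3": 0.5}
```

Using a heap-based top-k selection keeps the pruning step at O(n log k) per entity, which matters when the feature map has hundreds of thousands of skip-gram patterns.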

Where are the computational bottlenecks in the current algorithm?

Thanks for your prompt reply to my last issue. Your method is great and very impressive.
When I ran the dblp and wiki datasets, the results were very good, but the computation takes too long. Since optimization may not have been a priority during the experiments, I would like to ask where it would be worth investigating optimizations. Thanks!
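One way to answer this question empirically is to profile a run with the standard library. `expensive_step` below is a hypothetical stand-in for a HiExpan hot spot, shown only to illustrate the approach:

```python
import cProfile
import io
import pstats

def expensive_step(n=100000):
    # Hypothetical stand-in for a hot spot such as the pairwise
    # feature-similarity computation; profile the real call the same way.
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
expensive_step()
profiler.disable()

# Report the five most expensive calls by cumulative time.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
```

Sorting by cumulative time surfaces the callers that dominate the run, which is usually the right first question for a 30-hour job.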

missing indent

Hi Jiaming,

In HiExpan/src/SetExpan-new/set_expan_standalone.py, lines 89-90 need to be indented; otherwise the redundant feature will not be filtered out. Thanks!
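The pitfall described here can be illustrated with a minimal sketch (hypothetical names, not the repository's code): the membership test must be indented inside the loop so it runs for every feature rather than once after the loop.

```python
def filter_redundant(features, redundant):
    """Illustrates the indentation pitfall noted above: the membership
    test must sit INSIDE the loop body so every feature is checked;
    dedent it and redundant features slip through unfiltered."""
    kept = []
    for feature in features:
        if feature not in redundant:  # correctly indented: runs per feature
            kept.append(feature)
    return kept

kept = filter_redundant(["a", "b", "c"], redundant={"b"})  # -> ["a", "c"]
```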

Best,
Jieyu

How to run HiExpan after corpus preprocessing and feature extraction

Hi, I want to run HiExpan to see how it works on the DBLP corpus. I use the test_dataset provided in the original HiExpan folder, and I have run the corpus preprocessing and feature extraction parts. When I try to run main.py in the HiExpan-new folder, it tells me: main.py: error: the following arguments are required: -data, -taxonPrefix. Is there any instruction on how to run that part? In addition, where should I put the seed taxonomy, and what should its format be? Thanks.

Better positions for extracting skip-gram features?

Hi Jiaming,

In the skip-gram feature extraction code (https://github.com/mickeystroller/HiExpan/blob/master/src/featureExtraction/extractSkipGramFeature.py), the possible skip-gram positions are set to [(-1, 1), (-2, 1), (-3, 1), (-1, 3), (-2, 2), (-1, 2)] (line 30). However, when the center word is the first word of a sentence, the position effectively becomes (0, 1) instead of (-1, 1), since there is no word before the center word. So maybe we should add positions like (0, 1) and (0, 2). Otherwise, some entities will have the "a _ problem" feature but not the "_ problem" feature, which may hurt later when "_ problem" becomes an important feature. Thanks!
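A minimal sketch of the proposed fix (hypothetical function name, not the repository's code): include (0, 1) and (0, 2) in the window list and skip any window that does not fit inside the sentence, rather than silently clamping it.

```python
def skipgram_positions(sent, center,
                       windows=((-1, 1), (-2, 1), (-3, 1), (-1, 3),
                                (-2, 2), (-1, 2), (0, 1), (0, 2))):
    """Sketch of the fix proposed above: enumerate skip-gram contexts
    around sent[center], skipping windows that fall outside the
    sentence, so "_ problem" stays distinct from "a _ problem"."""
    grams = []
    for left, right in windows:
        lo, hi = center + left, center + right
        if lo < 0 or hi >= len(sent):
            continue  # window falls outside the sentence; do not clamp
        grams.append(" ".join(sent[lo:center] + ["_"] + sent[center + 1:hi + 1]))
    return grams

# Center word is the first word of the sentence: only (0, 1) fits here.
grams = skipgram_positions(["problem", "arises"], 0)  # -> ["_ arises"]
```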

Best,
Jieyu

KeyError: 'united_states'

Hello, I would like to test HiExpan on the wiki corpus. After featureExtraction, I ran

~/HiExpan/src/HiExpan-new$ python3.6 main.py -data wiki

to test.
But after loading those files in wiki/intermediate, I got:

=== Finish loading data ...... ===
=== Start loading seed supervision ...... ===
Traceback (most recent call last):
  File "main.py", line 120, in <module>
    newNode = TreeNode(parent=rootNode, level=0, eid=ename2eid[children], ename=children,
KeyError: 'united_states'

It seems that united_states is not included in those entities. What could possibly be wrong?
Thank you.
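One defensive pattern for this failure (a sketch with hypothetical names, not a fix from the repository) is to check each seed name against the entity map before building tree nodes, so the error message points at the missing entity instead of raising a bare KeyError:

```python
def seed_eid(ename2eid, name):
    """Defensive seed lookup (sketch, hypothetical names): fail with a
    pointed message when a seed entity was not extracted into the
    entity map during feature extraction."""
    if name not in ename2eid:
        raise KeyError(
            "seed entity %r is missing from the entity map; check that "
            "feature extraction kept it (casing, underscores, frequency "
            "threshold)" % name
        )
    return ename2eid[name]

eid = seed_eid({"united_states": 42}, "united_states")  # -> 42
```

The usual cause is that the seed spelling does not match the surface form the extractor emitted, so surfacing the offending name early saves a debugging round-trip.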
