
mickeysjm / hiexpan

71 stars · 71 watchers · 18 forks · 55.78 MB

The source code for HiExpan, an automatic taxonomy construction method published at KDD 2018.

License: GNU General Public License v3.0

Python 30.86% Shell 8.40% Makefile 0.23% C++ 32.28% C 1.60% Java 11.47% Perl 14.74% Dockerfile 0.40%
taxonomy-construction

hiexpan's People

Contributors: mickeysjm


hiexpan's Issues

Chinese entity HiExpan issues

Hi Jiaming,
Thanks for your idea and code. When I ran the code on a Chinese corpus, I found some issues:

  • First, dependency parsing and part-of-speech tagging seem unnecessary in corpus preprocessing.

  • Second, the getCombinedWeightByFeatureMap function takes too much time when featuresOfSeed is large (the number of skip-gram patterns reaches hundreds of thousands). So I retained only the 600 features with the highest scores and standard length in "eidSkipgram2TFIDFStrength.txt" for each entity. This reduced the run time from 30 hours to 30 minutes, but it may reduce the accuracy of the combinedWeight score.

  • Third, the type feature is useless for Chinese, so I use an LDA model's score instead. I am now evaluating the effectiveness of this approach.

  • Finally, I could not find the code for the Taxonomy Global Optimization step. Where can I find it?
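The pruning described in the second point above can be sketched as follows. This is a minimal sketch: `prune_features` and its exact interface are assumptions based on the description, not code from the repository.

```python
from heapq import nlargest

def prune_features(feature2strength, k=600):
    """Keep only the k features with the highest TF-IDF strength for
    one entity, as described above (a sketch, not the repo's code)."""
    return dict(nlargest(k, feature2strength.items(), key=lambda kv: kv[1]))

# Tiny example: keep the 2 strongest of 3 features.
pruned = prune_features({"f1": 0.9, "f2": 0.1, "f3": 0.5}, k=2)  # -> {"f1": 0.9, "f3": 0.5}
```

Using a heap-based top-k selection keeps the pruning step at O(n log k) per entity, which matters when the feature map has hundreds of thousands of skip-gram patterns.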

Where are the computational bottlenecks in the current algorithm?

Thanks for your prompt reply to my last issue. Your method is great and very impressive.
When I ran the dblp and wiki datasets, the results were very good, but the computation takes too long. Since optimization may not have been a priority during the experiments, I would like to ask where it would be worth investigating optimizations. Thanks!
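One way to answer this question empirically is to profile a run with the standard library. `expensive_step` below is a hypothetical stand-in for a HiExpan hot spot, shown only to illustrate the approach:

```python
import cProfile
import io
import pstats

def expensive_step(n=100000):
    # Hypothetical stand-in for a hot spot such as the pairwise
    # feature-similarity computation; profile the real call the same way.
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
expensive_step()
profiler.disable()

# Report the five most expensive calls by cumulative time.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
```

Sorting by cumulative time surfaces the callers that dominate the run, which is usually the right first question for a 30-hour job.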

missing indent

Hi Jiaming,

In HiExpan/src/SetExpan-new/set_expan_standalone.py, lines 89-90 need to be indented; otherwise the redundant feature will not be filtered out. Thanks!
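The pitfall described here can be illustrated with a minimal sketch (hypothetical names, not the repository's code): the membership test must be indented inside the loop so it runs for every feature rather than once after the loop.

```python
def filter_redundant(features, redundant):
    """Illustrates the indentation pitfall noted above: the membership
    test must sit INSIDE the loop body so every feature is checked;
    dedent it and redundant features slip through unfiltered."""
    kept = []
    for feature in features:
        if feature not in redundant:  # correctly indented: runs per feature
            kept.append(feature)
    return kept

kept = filter_redundant(["a", "b", "c"], redundant={"b"})  # -> ["a", "c"]
```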

Best,
Jieyu

How to run HiExpan after corpus preprocessing and feature extraction

Hi, I want to run HiExpan to see how it works on the DBLP corpus. I use the test_dataset provided in the original HiExpan folder, and I have run the corpus preprocessing and feature extraction parts. When I try to run main.py in the HiExpan-new folder, it tells me: main.py: error: the following arguments are required: -data, -taxonPrefix. Is there any instruction on how to run that part? In addition, where should I put the seed taxonomy, and what should its format be? Thanks.

Better positions for extracting skip-gram features?

Hi Jiaming,

In the skip-gram feature extraction code (https://github.com/mickeystroller/HiExpan/blob/master/src/featureExtraction/extractSkipGramFeature.py), the possible skip-gram positions are set to [(-1, 1), (-2, 1), (-3, 1), (-1, 3), (-2, 2), (-1, 2)] (line 30). However, when the center word is the first word of a sentence, the position effectively becomes (0, 1) instead of (-1, 1), since there is no word before the center word. So maybe we should add positions like (0, 1) and (0, 2). Otherwise, some entities will have the "a _ problem" feature but not the "_ problem" feature, which may hurt later when "_ problem" becomes an important feature. Thanks!
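A minimal sketch of the proposed fix (hypothetical function name, not the repository's code): include (0, 1) and (0, 2) in the window list and skip any window that does not fit inside the sentence, rather than silently clamping it.

```python
def skipgram_positions(sent, center,
                       windows=((-1, 1), (-2, 1), (-3, 1), (-1, 3),
                                (-2, 2), (-1, 2), (0, 1), (0, 2))):
    """Sketch of the fix proposed above: enumerate skip-gram contexts
    around sent[center], skipping windows that fall outside the
    sentence, so "_ problem" stays distinct from "a _ problem"."""
    grams = []
    for left, right in windows:
        lo, hi = center + left, center + right
        if lo < 0 or hi >= len(sent):
            continue  # window falls outside the sentence; do not clamp
        grams.append(" ".join(sent[lo:center] + ["_"] + sent[center + 1:hi + 1]))
    return grams

# Center word is the first word of the sentence: only (0, 1) fits here.
grams = skipgram_positions(["problem", "arises"], 0)  # -> ["_ arises"]
```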

Best,
Jieyu

KeyError: 'united_states'

Hello, I would like to test HiExpan on the wiki corpus. After featureExtraction, I ran

~/HiExpan/src/HiExpan-new$ python3.6 main.py -data wiki

to test.
But after loading those files in wiki/intermediate, I got:

=== Finish loading data ...... ===
=== Start loading seed supervision ...... ===
Traceback (most recent call last):
  File "main.py", line 120, in <module>
    newNode = TreeNode(parent=rootNode, level=0, eid=ename2eid[children], ename=children,
KeyError: 'united_states'

It seems that united_states is not included in those entities. What could possibly be wrong?
Thank you.
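One defensive pattern for this failure (a sketch with hypothetical names, not a fix from the repository) is to check each seed name against the entity map before building tree nodes, so the error message points at the missing entity instead of raising a bare KeyError:

```python
def seed_eid(ename2eid, name):
    """Defensive seed lookup (sketch, hypothetical names): fail with a
    pointed message when a seed entity was not extracted into the
    entity map during feature extraction."""
    if name not in ename2eid:
        raise KeyError(
            "seed entity %r is missing from the entity map; check that "
            "feature extraction kept it (casing, underscores, frequency "
            "threshold)" % name
        )
    return ename2eid[name]

eid = seed_eid({"united_states": 42}, "united_states")  # -> 42
```

The usual cause is that the seed spelling does not match the surface form the extractor emitted, so surfacing the offending name early saves a debugging round-trip.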
