The description in README indicates that the word order of the contents is included in


The word order of the data contents is missing

CarlKilhart commented on August 17, 2024

The word order of the data contents is missing

from dbn.

Comments (4)

cezhang01 commented on August 17, 2024

Hi @CarlKilhart ,

Thank you for your interest in our work!

The current contents.txt contains the word sequences that remain after preprocessing: we removed stop words, punctuation, and other meaningless words. The vocabulary consists of the remaining words, and each document in contents.txt lists those words in the same order as in the original raw content.

For example, suppose the vocabulary is [welcome, new, york, best, city, ...] and the original raw content is "welcome to the best city new york!". After preprocessing, this document becomes [0, 3, 4, 1, 2]. Here "to" and "the" are removed because they are stop words, but the remaining 5 words (welcome, best, city, new, york) keep the order they had in the original raw content.
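The example above can be reproduced with a minimal sketch. The function name `encode`, the stop-word set, and the punctuation handling are assumptions made for illustration; this is not the repository's actual preprocessing code.

```python
import string

# Toy vocabulary taken from the example; list position is the word ID.
VOCABULARY = ["welcome", "new", "york", "best", "city"]
# Assumed stop-word list, just large enough for this example.
STOP_WORDS = {"to", "the"}


def encode(raw_text, vocabulary=VOCABULARY, stop_words=STOP_WORDS):
    """Map raw text to word IDs, dropping stop words but keeping word order."""
    word_to_id = {word: idx for idx, word in enumerate(vocabulary)}
    # Lowercase, strip punctuation, and split on whitespace.
    cleaned = raw_text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.split()
    # Keep only in-vocabulary, non-stop words, in their original order.
    return [word_to_id[t] for t in tokens if t not in stop_words and t in word_to_id]


print(encode("welcome to the best city new york!"))  # [0, 3, 4, 1, 2]
```

The key point is that the list comprehension iterates over the tokens in document order, so the IDs come out in the order of the raw text rather than sorted.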

Does this answer your question? Or do you mean that you need the original raw content of the documents (including stop words, punctuation, etc.)?


CarlKilhart commented on August 17, 2024

Maybe you uploaded the wrong version of contents.txt? Taking the first row of the ml dataset as an example, 3 16 17 28 34 36 39 45 46 85 111 150 150 151 192 192 192 192 200 201 217 218 269 306 328 351 377 476 477 488 507 623 723 762 898 947 1270 1347 1494 1587 1697 is clearly ordered by word ID, not by the sequence of the original raw content. I would appreciate it if you could check the data files.
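A quick way to check this observation is to test whether a row is non-decreasing in word ID. The helper name `is_sorted_by_id` is hypothetical, and the row is assumed to be space-separated integers, as quoted above.

```python
def is_sorted_by_id(word_ids):
    """Return True if the IDs never decrease from one position to the next."""
    return all(a <= b for a, b in zip(word_ids, word_ids[1:]))


# First row of the ml dataset, as quoted in this comment.
first_row = (
    "3 16 17 28 34 36 39 45 46 85 111 150 150 151 192 192 192 192 200 201 "
    "217 218 269 306 328 351 377 476 477 488 507 623 723 762 898 947 1270 "
    "1347 1494 1587 1697"
)
ids = [int(token) for token in first_row.split()]

print(is_sorted_by_id(ids))              # True: consistent with sorting by word ID
print(is_sorted_by_id([0, 3, 4, 1, 2]))  # False: original word order is kept
```

A single sorted row could happen by chance for a very short document, so running the check over every row of contents.txt gives a more reliable picture.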


cezhang01 commented on August 17, 2024

Hi @CarlKilhart ,

Thank you for the reminder!

The current datasets indeed don't have word order. But my model doesn't use word order for training either, so the current datasets are still valid and correct for reproducing the results in the paper.
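To illustrate why sorted rows can still be sufficient, here is a small sketch that assumes the model only consumes per-document word counts (a bag-of-words input). The thread states only that word order is not used; the count-based representation and the name `bag_of_words` are assumptions.

```python
from collections import Counter


def bag_of_words(word_ids):
    """Count how often each word ID occurs, ignoring position."""
    return Counter(word_ids)


ordered = [0, 3, 4, 1, 2]      # IDs in original word order
sorted_ids = sorted(ordered)   # the same IDs sorted by word ID

# Both rows yield identical counts, so an order-insensitive model sees no difference.
print(bag_of_words(ordered) == bag_of_words(sorted_ids))  # True
```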

I just processed the datasets again to obtain the word order. You can download all 5 datasets with word order, including the Web dataset, from the Google Drive link below: https://drive.google.com/file/d/10sGsStbutM-e1XfM8uDwP354YcXdpmgj/view?usp=sharing

Please note that for the Aminer dataset, I forgot how I preprocessed it last year. I recently rewrote the preprocessing code to produce the Aminer dataset, so the current dataset may deviate somewhat from the one uploaded to the GitHub repo.


cezhang01 commented on August 17, 2024

Hi @CarlKilhart ,

Did I answer your question clearly? If you have no more questions, could I close this issue?

