
Comments (4)

cezhang01 avatar cezhang01 commented on August 17, 2024

Hi @CarlKilhart ,

Thank you for your interest in our work!

The current contents.txt contains the sequences of words that remain after preprocessing: we removed stop words, punctuation, and other meaningless words. The vocabulary consists of the remaining words, and within each document these words keep the order they had in the original raw content.

For example, suppose the vocabulary is [welcome, new, york, best, city, ...] and the original raw content is "welcome to the best city new york!". After preprocessing, this document becomes [0, 3, 4, 1, 2]. Here "to" and "the" are removed because they are stop words, but the remaining 5 words (welcome, best, city, new, york) keep the order they had in the original raw content.
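The example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the preprocessing code used for the datasets: the stop-word set, the five-word vocabulary, the regex tokenizer, and the `encode` helper are all hypothetical.

```python
import re

# Hypothetical stop-word list and vocabulary, taken from the example only.
STOP_WORDS = {"to", "the"}
VOCABULARY = ["welcome", "new", "york", "best", "city"]
WORD_TO_ID = {word: idx for idx, word in enumerate(VOCABULARY)}


def encode(raw_text):
    """Map raw text to word IDs, keeping the original word order."""
    # Lowercase and keep alphabetic tokens only, which drops punctuation.
    tokens = re.findall(r"[a-z]+", raw_text.lower())
    # Drop stop words and out-of-vocabulary tokens without reordering.
    return [
        WORD_TO_ID[tok]
        for tok in tokens
        if tok not in STOP_WORDS and tok in WORD_TO_ID
    ]


print(encode("welcome to the best city new york!"))  # [0, 3, 4, 1, 2]
```

The list comprehension walks the tokens left to right, so the IDs come out in document order rather than sorted by ID.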

Does this answer your question? Or do you need the original raw content of the documents (including stop words, punctuation, etc.)?

from dbn.

CarlKilhart avatar CarlKilhart commented on August 17, 2024

Could you have uploaded the wrong version of contents.txt? Taking the first row of the ml dataset as an example, the sequence 3 16 17 28 34 36 39 45 46 85 111 150 150 151 192 192 192 192 200 201 217 218 269 306 328 351 377 476 477 488 507 623 723 762 898 947 1270 1347 1494 1587 1697 is clearly sorted by word ID, not by the order of the words in the original raw content. I would appreciate it if you could check the data files.
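A quick way to check this on any row is sketched below. It assumes each row of contents.txt is a whitespace-separated list of integer word IDs, as in the row quoted above; the helper name `is_sorted_by_id` is made up for this example.

```python
def is_sorted_by_id(row):
    """Return True if the word IDs in a row are in non-decreasing order."""
    ids = [int(tok) for tok in row.split()]
    # Compare each ID with its successor; repeated IDs are allowed.
    return all(a <= b for a, b in zip(ids, ids[1:]))


# First row of the ml dataset, as quoted in this comment.
first_row = (
    "3 16 17 28 34 36 39 45 46 85 111 150 150 151 192 192 192 192 200 201 "
    "217 218 269 306 328 351 377 476 477 488 507 623 723 762 898 947 1270 "
    "1347 1494 1587 1697"
)

print(is_sorted_by_id(first_row))    # True
print(is_sorted_by_id("0 3 4 1 2"))  # False
```

A single sorted row could occur by chance in a short document, so running the check over every row of the file gives a more convincing signal.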


cezhang01 avatar cezhang01 commented on August 17, 2024

Hi @CarlKilhart ,

Thank you for the reminder!

The current datasets indeed don't preserve word order. However, my model doesn't use word order for training either, so the current datasets are still valid for reproducing the results in the paper.
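To illustrate why a missing word order need not matter, here is a small sketch. It assumes the model only consumes per-document word counts (a bag-of-words representation); that is an assumption made for this example, since the comment above says only that word order is not used.

```python
from collections import Counter

# The same document in original order and sorted by word ID.
ordered_ids = [0, 3, 4, 1, 2]
sorted_ids = sorted(ordered_ids)

# Word counts ignore position, so both versions give the same counts.
same_counts = Counter(ordered_ids) == Counter(sorted_ids)
print(same_counts)  # True
```

Any model whose input is derived from these counts sees identical data either way, while a sequence model would not.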

I just processed the datasets again to obtain the word order. You can download all 5 datasets with word order, including the Web dataset, from the Google Drive link below: https://drive.google.com/file/d/10sGsStbutM-e1XfM8uDwP354YcXdpmgj/view?usp=sharing

Please note that for the Aminer dataset, I no longer remember exactly how I preprocessed it last year. I recently rewrote the preprocessing code to produce it, so the new version may deviate somewhat from the one uploaded to the GitHub repo.


cezhang01 avatar cezhang01 commented on August 17, 2024

Hi @CarlKilhart ,

Did this answer your question? If you have no more questions, may I close this issue?

