
mr2's Introduction

Obtain the Dataset

Usage

You can find the baselines and guidance at https://aistudio.baidu.com/datasetdetail/230144.

Notebook baseline: https://aistudio.baidu.com/projectdetail/6371316?sUid=2739660&shared=1&ts=1689818284921

baseline: https://aistudio.baidu.com/clusterprojectdetail/6549711

Data

Organization Structure

The dataset is organized as follows:

.
├── dataset_items_test.json
├── dataset_items_train.json
├── dataset_items_val.json
├── test
├── train
└── val

We have merged the English and Chinese datasets; you can split them if necessary.

Labels 0, 1, 2 represent non-rumor, rumor, and unverified, respectively.
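The label mapping and the language split mentioned above can be sketched as follows. This is a minimal sketch, not the authors' loader: it assumes each dataset_items_*.json file is a JSON object mapping a claim ID to a record, and that each record carries `label` and `language` fields. Those field names and the helper names are assumptions not confirmed by this README, so adjust them to the actual files.

```python
import json
from collections import Counter, defaultdict

# Label IDs as described in this README.
LABEL_NAMES = {0: "non-rumor", 1: "rumor", 2: "unverified"}


def load_items(path):
    """Load one split file, e.g. dataset_items_train.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_labels(items, label_key="label"):
    """Count claims per label name. `label_key` is an assumed field name."""
    counts = Counter()
    for record in items.values():
        counts[LABEL_NAMES[int(record[label_key])]] += 1
    return counts


def split_by_language(items, language_key="language"):
    """Group claims by language. `language_key` is an assumed field name."""
    groups = defaultdict(dict)
    for claim_id, record in items.items():
        groups[record[language_key]][claim_id] = record
    return dict(groups)


# Tiny in-memory example with made-up records (not real dataset content).
example_items = {
    "0": {"label": 0, "language": "en"},
    "1": {"label": 1, "language": "zh"},
    "2": {"label": 2, "language": "en"},
}
label_counts = count_labels(example_items)
by_language = split_by_language(example_items)
```

With real data, `load_items("dataset_items_train.json")` would replace `example_items`, provided the file layout matches the assumptions above.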

File Format

Images for each claim are stored in the img folder.

The img_html_news folder contains web pages and images retrieved based on the caption of each claim. The folder includes a direct_annotation.json file containing the following information:

{
      "img_link": "Link to the retrieved related image",
      "page_link": "Link to the retrieved web page",
      "domain": "Domain of the retrieved web page",
      "snippet": "Brief summary of the retrieved web page",
      "image_path": "Path to the retrieved image",
      "html_path": "Path to the retrieved web page",
      "page_title": "Title of the retrieved web page"
}
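One way to work with these fields is to wrap each entry in a small typed record. The sketch below only handles a single dictionary with the seven fields listed above; how such dictionaries are nested inside direct_annotation.json is not specified here, so the top-level layout is left to the reader. The class and function names are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class DirectEvidence:
    """One retrieved page/image entry, using the field names listed above."""
    img_link: str
    page_link: str
    domain: str
    snippet: str
    image_path: str
    html_path: str
    page_title: str


def parse_direct_record(record):
    """Build a DirectEvidence from one dictionary.

    Missing fields default to an empty string (an assumption, chosen so
    that incomplete entries do not raise).
    """
    names = [f.name for f in fields(DirectEvidence)]
    return DirectEvidence(**{name: record.get(name, "") for name in names})


# Made-up example values, not real dataset content.
example_record = {
    "img_link": "https://example.com/a.jpg",
    "page_link": "https://example.com/news",
    "domain": "example.com",
    "snippet": "Example summary",
    "image_path": "0.jpg",
    "html_path": "0.txt",
    "page_title": "Example title",
}
evidence = parse_direct_record(example_record)
```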

The inverse_search folder contains web pages found based on the images in the claims. The folder includes an inverse_annotation.json file containing the following information:

{
      "entities": "Entities in the image of the claim",
      "entities_scores": "Scores of the entities in the image of the claim",
      "best_guess_lbl": "The most likely description of the image in the claim",
      "all_fully_matched_captions": "",
      "all_partially_matched_captions": "",
      "fully_matched_no_text": "",
      // Each of the three fields above holds the web pages that were found, as a list. Each element in the list is a dictionary, formatted as follows:
            {
            "page_link": "Link to the retrieved web page",
            "image_link": "Link to the retrieved image",
            "html_path": "Path to the retrieved web page",
            "title": "Title of the retrieved web page"
            }
}
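Because the three page-list fields share the same element format, they can be flattened into one list for downstream processing. The following is a minimal sketch under that reading of the format. The function name is hypothetical, and treating a missing or empty field as "no pages found" is an assumption.

```python
# The three fields that hold lists of found web pages, per the format above.
PAGE_LIST_FIELDS = (
    "all_fully_matched_captions",
    "all_partially_matched_captions",
    "fully_matched_no_text",
)


def collect_pages(annotation):
    """Return (field_name, page_dict) pairs from the three page-list fields."""
    pages = []
    for field in PAGE_LIST_FIELDS:
        # A missing or empty field is treated as "no pages" (assumption).
        for page in annotation.get(field) or []:
            pages.append((field, page))
    return pages


# Made-up example values, not real dataset content.
example_annotation = {
    "all_fully_matched_captions": [
        {
            "page_link": "https://example.com/a",
            "image_link": "https://example.com/a.jpg",
            "html_path": "a.txt",
            "title": "Page A",
        }
    ],
    "all_partially_matched_captions": [],
    "fully_matched_no_text": [
        {
            "page_link": "https://example.com/b",
            "image_link": "https://example.com/b.jpg",
            "html_path": "b.txt",
            "title": "Page B",
        }
    ],
}
titles = [page["title"] for _, page in collect_pages(example_annotation)]
```

Keeping the field name alongside each page preserves whether a page was a full or partial match, which may matter when weighting evidence.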

Dataset Copyright

Our datasets are open-sourced for commercial and academic use under the CC-BY-SA 4.0 license.

mr2's People

Contributors

chenjz20thu


mr2's Issues

Propagation based model source code

How can I reproduce the results reported for the propagation-based baseline models? I cannot find the necessary source code for this part; please help out ASAP. Additionally, please provide the script to collect and store conversational threads for the English/Chinese dataset. Thanks in advance.

Dataset is difficult to use

The dataset contains a large number of html and npz files that are not clearly described, nor are the corresponding claim text, pictures, and dissemination structures. Would it be possible to provide a more detailed description of the dataset and how to use it? Thanks!
