rizzo98 / summarizing-long-form-document-with-rich-discourse-information

Implementation of Summarizing Long-Form Document with Rich Discourse Information

Python 100.00%

summarizing-long-form-document-with-rich-discourse-information's Introduction

How to produce summaries

  • Download the repository
  • Install the requirements
    pip install -r requirements.txt
  • Add your data to the data folder (it must follow the structure described under Data format)
  • Configure the pipeline (the default is ContentRanking + Bart)
  • Launch train.py
    python train.py

Data format

Data must be provided in JSON format with the following structure:

{
    article_id: str
    abstract_text: List[str]
    article_text: List[str]
    section_names: List[str]
    sections: List[List[str]]
}
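
A minimal sketch of how a record could be checked against the structure above before it is placed in the data folder. The field names and types are taken from the description; the helper name `validate_record` and the sample values are made up for illustration, and the sketch does not assume anything about how records are arranged inside a file.

```python
import json

# Field names and container types, as listed in the structure above.
EXPECTED_FIELDS = {
    "article_id": str,
    "abstract_text": list,
    "article_text": list,
    "section_names": list,
    "sections": list,
}


def validate_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be a {expected_type.__name__}")
    # "sections" is List[List[str]]: every section is itself a list.
    sections = record.get("sections")
    if isinstance(sections, list) and not all(isinstance(s, list) for s in sections):
        problems.append("sections should be a list of lists")
    return problems


# Made-up sample record, for illustration only.
example = {
    "article_id": "doc-1",
    "abstract_text": ["A one-sentence abstract."],
    "article_text": ["First sentence.", "Second sentence."],
    "section_names": ["Introduction"],
    "sections": [["First sentence.", "Second sentence."]],
}

# Round-trip through JSON text to mimic reading the record from a file.
record = json.loads(json.dumps(example))
problems = validate_record(record)
```

An empty `problems` list means the record has all five fields with the expected container types.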

Configuration

The main configuration file is config.json. It defines:

  • Device on which to run the summarizer
  • Modules pipeline
  • Output configuration:
    • name of the folder in which the log and the models will be saved
    • flags for saving models and log file
    • wandb: if null, wandb is not invoked. To set up wandb, set this field to:
      {
          "project":"Name of the project",
          "entity":"Account"
      }
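
The wandb behaviour described above (null disables it, otherwise a project and an entity are given) could be handled roughly as follows. This is a sketch, not the repository's code: the function name `resolve_wandb` is hypothetical, and it only relies on the fact that JSON `null` is loaded as Python `None`.

```python
import json


def resolve_wandb(wandb_cfg):
    """Return keyword arguments for wandb, or None when wandb is disabled."""
    if wandb_cfg is None:  # JSON null becomes None after json.loads
        return None
    missing = [key for key in ("project", "entity") if key not in wandb_cfg]
    if missing:
        raise ValueError(f"wandb config is missing: {', '.join(missing)}")
    return {"project": wandb_cfg["project"], "entity": wandb_cfg["entity"]}


# Two hypothetical values of the wandb field, written as JSON text.
disabled = resolve_wandb(json.loads("null"))
enabled = resolve_wandb(
    json.loads('{"project": "Name of the project", "entity": "Account"}')
)
```

When the result is not `None`, it has the shape accepted by `wandb.init(project=..., entity=...)`; the sketch itself does not import wandb.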

Modules pipeline

Each element of the model list has the following fields:

  • name: specify the name of the model class
  • config: specify the path of the configuration file (starting from ./config)
  • train: if true, train the model on the data specified in the module's config file
  • pretrained_model: path of the pretrained model
  • inference: if true, summarize the documents specified in the module's config file
  • from_previous: whether the module takes the output of the previous module as input (true) or reads data directly from the specified files (false). Setting it to true for the first module raises an exception.
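
As an illustration of the fields above, the sketch below builds a two-module pipeline and checks the rule about `from_previous`. The module names follow the default pipeline mentioned earlier (ContentRanking + Bart), but the exact class names, the config paths, the use of `None` for `pretrained_model`, and the exception type are assumptions; `check_pipeline` is a hypothetical helper.

```python
REQUIRED_FIELDS = (
    "name", "config", "train", "pretrained_model", "inference", "from_previous",
)


def check_pipeline(modules):
    """Raise ValueError if a module entry is incomplete or wrongly chained."""
    for index, module in enumerate(modules):
        missing = [field for field in REQUIRED_FIELDS if field not in module]
        if missing:
            raise ValueError(f"module {index} is missing: {', '.join(missing)}")
    # The first module has no predecessor, so it cannot read from one.
    if modules and modules[0]["from_previous"]:
        raise ValueError("from_previous must be false for the first module")


# Hypothetical entries; names and paths are placeholders.
pipeline = [
    {
        "name": "ContentRanking",
        "config": "content_ranking.json",
        "train": True,
        "pretrained_model": None,
        "inference": True,
        "from_previous": False,
    },
    {
        "name": "Bart",
        "config": "bart.json",
        "train": True,
        "pretrained_model": None,
        "inference": True,
        "from_previous": True,
    },
]

check_pipeline(pipeline)
```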

Module config file

Each module needs its own configuration file.
Each configuration file must define:

  • params: all the parameters of the model
  • tokenizer: the tokenizer class and all the tokenizer params
  • train: the trainer class
    • training_dataset: the dataset class, the dataset params and the train data path
    • validation_dataset: the validation dataset class, its params and the validation data path
    • training_dataloader: the training dataloader and its params
    • training_epochs: the number of epochs to train for
    • optimizer: the optimizer class and its params
    • loss: the loss class and its params
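
A quick way to catch an incomplete module config is to compare it against the keys listed above. The sketch assumes that the six training-related entries are nested under `train`, as the indentation of the list suggests; the shape of each value is not specified here, so only key presence is checked. `missing_keys` is a hypothetical helper.

```python
REQUIRED_TOP_LEVEL = ("params", "tokenizer", "train")
REQUIRED_TRAIN = (
    "training_dataset",
    "validation_dataset",
    "training_dataloader",
    "training_epochs",
    "optimizer",
    "loss",
)


def missing_keys(module_config):
    """Return the required keys that are absent from a module config dict."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in module_config]
    train = module_config.get("train")
    if isinstance(train, dict):
        # Assumption: the training entries live inside the "train" object.
        missing += [f"train.{key}" for key in REQUIRED_TRAIN if key not in train]
    return missing


# Deliberately incomplete config, with placeholder values.
partial_config = {
    "params": {},
    "train": {"training_epochs": 1, "optimizer": {}},
}
report = missing_keys(partial_config)
```

Here `report` lists `tokenizer` plus the four absent `train.*` entries.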

How to add a module

How a module works

Each module has its own dataset and can read data either from a file (in the correct format) or from the output of the previous module in the pipeline. The latter is made possible by the wrapper class, which converts the standard dataset format into the format specific to the module.
At inference time, each module produces its output in the standard dataset format.
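
The wrapper idea can be sketched as a thin adapter around the previous module's output. Everything named here is hypothetical: the class `PreviousOutputWrapper`, the converter `to_flat_text`, and the assumption that the standard format resembles the JSON records described under Data format. The repository's actual wrapper class may look quite different.

```python
class PreviousOutputWrapper:
    """Expose records in standard format as module-specific items."""

    def __init__(self, records, convert):
        self._records = list(records)  # output of the previous module
        self._convert = convert        # standard format -> module format

    def __len__(self):
        return len(self._records)

    def __getitem__(self, index):
        # Conversion happens lazily, one item at a time.
        return self._convert(self._records[index])


def to_flat_text(record):
    """Example converter: join the article sentences into one string."""
    return {"id": record["article_id"], "text": " ".join(record["article_text"])}


# Made-up output of a previous module.
previous_output = [
    {"article_id": "doc-1", "article_text": ["First sentence.", "Second sentence."]},
]
wrapped = PreviousOutputWrapper(previous_output, to_flat_text)
first_item = wrapped[0]
```

Keeping the conversion in a separate function means a module that reads directly from files and one that reads from the previous module can share the same dataset code.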
