
Table structure recognition dataset of the paper: Complicated Table Structure Recognition

Home Page: https://arxiv.org/pdf/1908.04729.pdf

License: MIT License


SciTSR

Introduction

SciTSR is a large-scale table structure recognition dataset, which contains 15,000 tables in PDF format and their corresponding structure labels obtained from LaTeX source files.

Download link is here.

There are 15,000 examples in total, which we split into 12,000 for training and 3,000 for testing. We also provide a test subset that contains only complicated tables, called SciTSR-COMP. The indices of SciTSR-COMP are stored in SciTSR-COMP.list.

The statistics of the SciTSR dataset are as follows:

|                      | Train  | Test  |
| -------------------- | ------ | ----- |
| # Tables             | 12,000 | 3,000 |
| # Complicated tables | 2,885  | 716   |

Format and Example

The directory tree structure is as follows:

SciTSR
├── SciTSR-COMP.list
├── test
│   ├── chunk
│   ├── img
│   ├── pdf
│   └── structure
└── train
    ├── chunk
    ├── img
    ├── pdf
    ├── rel
    └── structure

The input PDF files are stored in pdf, and the structure labels are stored in the structure directory.

For convenience, we also provide the input in image format in img; the images are converted from the PDFs by pdfcairo.

We also provide the extracted chunks in chunk, which are pre-processed by Tabby.

For the training data, we additionally provide our constructed relation labels in rel for our GraphTSR model; they are generated by matching chunks against the texts of the structure labels.

Note that our pre-processed chunk and relation data may contain noise. The original input files are in PDF.
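Given the layout above, the files belonging to one example can be located by its ID. The helper below is an illustrative sketch, not part of the dataset tooling; note that the README does not state the image file extension, so .png is an assumption:

```python
import os

def example_paths(root, split, ex_id):
    """Build the paths of all files belonging to one example ID (hypothetical helper)."""
    paths = {
        "chunk": os.path.join(root, split, "chunk", ex_id + ".chunk"),
        "img": os.path.join(root, split, "img", ex_id + ".png"),  # extension assumed
        "pdf": os.path.join(root, split, "pdf", ex_id + ".pdf"),
        "structure": os.path.join(root, split, "structure", ex_id + ".json"),
    }
    if split == "train":  # relation labels exist only for the training split
        paths["rel"] = os.path.join(root, split, "rel", ex_id + ".rel")
    return paths
```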

Text Chunks

File: chunk/[ID].chunk

The pos array contains the x1, x2, y1 and y2 coordinates (in the PDF coordinate system) of the chunk.

{"chunks": [
  {
    "pos": [
      147.96600341796875,
      205.49998474121094,
      475.7929992675781,
      480.4206237792969
    ],
    "text": "Probability"
  },
  {
    "pos": [
      217.45510864257812,
      290.6802673339844,
      475.7929992675781,
      480.4206237792969
    ],
    "text": "Generated Text"
  },
  ...
 ]}
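As a quick sanity check of this format, the snippet below (an illustrative sketch, not part of the dataset tooling) parses the pos array of each chunk and derives its bounding-box size:

```python
import json

def load_chunks(json_str):
    """Parse a .chunk file's JSON and return (text, width, height) per chunk."""
    chunks = json.loads(json_str)["chunks"]
    out = []
    for c in chunks:
        x1, x2, y1, y2 = c["pos"]  # pos = [x1, x2, y1, y2] in PDF coordinates
        out.append((c["text"], x2 - x1, y2 - y1))
    return out

# The first chunk from the example above.
example = '{"chunks": [{"pos": [147.966, 205.5, 475.793, 480.42], "text": "Probability"}]}'
for text, width, height in load_chunks(example):
    print(text, round(width, 1), round(height, 1))
```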

Relations

File: rel/[ID].rel

A line of the form CHUNK_ID_1 CHUNK_ID_2 RELATION_ID:NUM_BLANK states that the relation between the CHUNK_ID_1-th chunk and the CHUNK_ID_2-th chunk is RELATION_ID, with NUM_BLANK blank cells between them. For RELATION_ID, 1 and 2 represent horizontal and vertical adjacency, respectively.

0 1 1:0
1 2 1:0
0 9 2:0
...
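Each such line can be parsed with a couple of string splits. The following function is an illustrative sketch, not part of the dataset tooling:

```python
def parse_rel_line(line):
    """Parse one .rel line into (chunk_id_1, chunk_id_2, relation_id, num_blank)."""
    c1, c2, rel = line.split()
    rel_id, num_blank = rel.split(":")
    return int(c1), int(c2), int(rel_id), int(num_blank)

# "0 9 2:0": chunks 0 and 9 are vertically adjacent (relation 2), no blank cells between.
print(parse_rel_line("0 9 2:0"))
```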

Structure Labels

File: structure/[ID].json

A table is stored as a list of cells. For each cell, we provide its original TeX code, its content (split by spaces), and its position in the table (start/end row/column numbers, starting from 0).

{"cells": [
  {
    "id": 21,
    "tex": "959",
    "content": [
      "959"
    ],
    "start_row": 5,
    "end_row": 5,
    "start_col": 1,
    "end_col": 1
  },
  {
    "id": 1,
    "tex": "Training set",
    "content": [
      "Training",
      "set"
    ],
    "start_row": 0,
    "end_row": 0,
    "start_col": 1,
    "end_col": 1
  },
  ...
]}
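Since the row/column indices are 0-based, the overall table shape can be recovered from the cell list. A minimal sketch (an illustrative helper, using the two example cells above):

```python
def table_shape(cells):
    """Number of rows and columns implied by the cells' 0-based start/end indices."""
    n_rows = max(c["end_row"] for c in cells) + 1
    n_cols = max(c["end_col"] for c in cells) + 1
    return n_rows, n_cols

cells = [
    {"id": 21, "start_row": 5, "end_row": 5, "start_col": 1, "end_col": 1},
    {"id": 1, "start_row": 0, "end_row": 0, "start_col": 1, "end_col": 1},
]
print(table_shape(cells))  # (6, 2) for this two-cell excerpt
```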

Implementation Details

Features

The code for the vertex and edge features is in ./scitsr/graph.py.

You can get vertex features by Vertex(vid, chunk, tab_h, tab_w).features and edge features by Edge(vertex1, vertex2).features.

tab_h and tab_w denote the height (y-axis) and width (x-axis) of the table.

See ./scitsr/graph.py for more details.

Evaluation

In the evaluation procedure, a table is first converted to a list of horizontally/vertically adjacent relations; the output relations are then compared against the ground-truth relations.

We release the evaluation scripts for comparing horizontally and vertically adjacent relations. In the following example (./examples/eval.py), we show how to use the scripts to calculate precision/recall/F1 for an output table.

import json

from scitsr.eval import json2Relations, eval_relations

# json_path should point to a structure label file, e.g. under structure/
with open(json_path) as fp: json_obj = json.load(fp)
# convert the structure labels (a table in json format) to a list of relations
ground_truth_relations = json2Relations(json_obj, splitted_content=True)
# your_relations should be a List of Relation.
# Here we directly use the ground truth relations in the example.
your_relations = ground_truth_relations
precision, recall = eval_relations(
  gt=[ground_truth_relations], res=[your_relations], cmp_blank=True)
# F1 derived from the returned precision and recall
f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0

Note: Your output tables should be represented as List[Relation]. You can also store a table as a Table object and then convert it to List[Relation] by using scitsr.eval.Table2Relations.

Citation

Please cite the paper if you find the resources useful.

@article{chi2019complicated,
  title={Complicated Table Structure Recognition},
  author={Chi, Zewen and Huang, Heyan and Xu, Heng-Da and Yu, Houjin and Yin, Wanxuan and Mao, Xian-Ling},
  journal={arXiv preprint arXiv:1908.04729},
  year={2019}
}

