d2klab / harvestingtablesinthewild

Table Collector for harvesting web tables in the wild, from the Common Crawl, MediaWiki, and the Web at large. Tables are converted to JSON and stored in ArangoDB, then semantically annotated (interpreted).

Home Page: https://tablecollector.tools.eurecom.fr/

Python 62.61% Dockerfile 1.96% Shell 1.56% Go 0.33% TeX 33.54%
Topics: corpus, interpretation, scrapping, tables

harvestingtablesinthewild's People

Contributors

ehrhart, jacksgt, rohitshubham

harvestingtablesinthewild's Issues

Data Format

We should find a common data format in which we can store extracted, parsed and augmented tables.
For now - to keep things simple - we are just using JSON data with the following format:

    from datetime import datetime

    obj = {
        'url': url,
        'timestamp': datetime.now().isoformat(),  # ISO 8601
        'page_title': page_title,
        'table': table,
    }

    # Example table:
    #
    #   Country  Size
    #   Finland  5.5
    #   Germany  80
    #   Norway
    #   USA      506
    #
    # table = {
    #     "Country": ["Finland", "Germany", "Norway", "USA"],
    #     "Size": ["5.5", "80", "", "506"],
    # }

In this thread we can collect some ideas for other data storage formats.

Extend data format

The data format should be extended with the following fields:

  • tableOrientation
  • language
  • columnNumber
  • rowNumber
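A minimal sketch of what an extended record could look like, building on the format above; the concrete field values shown here are illustrative assumptions, not actual output:

    obj = {
        'url': url,
        'timestamp': datetime.now().isoformat(),  # ISO 8601
        'page_title': page_title,
        'table': table,
        # proposed extensions (illustrative values):
        'tableOrientation': 'HORIZONTAL',
        'language': 'en',        # e.g. detected page or table language
        'columnNumber': 2,
        'rowNumber': 4,
    }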

Character normalization

As Eelis mentioned:

I've seen some utf-8 characters being used in place of similar looking ascii characters. We might want to do some character normalization before this. Example package: https://github.com/woodgern/confusables

Originally posted by @EelisK in #22 (comment)

So we should look into character normalization at least for get_term_set, but also consider whether the regular table cells and headers need to be normalized.
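As a baseline, standard-library Unicode normalization plus an explicit replacement map could look like the sketch below; this is an illustration rather than the confusables package itself, and the mapping entries are assumptions:

    import unicodedata

    # Illustrative homoglyph replacements; a real map (or the confusables
    # package linked above) would cover many more characters.
    HOMOGLYPHS = {
        '\u2019': "'",   # right single quotation mark -> apostrophe
        '\u00a0': ' ',   # non-breaking space -> plain space
        '\u2013': '-',   # en dash -> hyphen
    }

    def normalize_text(s: str) -> str:
        s = unicodedata.normalize('NFKC', s)  # fold compatibility characters
        for src, dst in HOMOGLYPHS.items():
            s = s.replace(src, dst)
        return s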

Implement table type and header position detection

Currently, we have no detection for the type of table ("tableType") we are crawling.
Also, the "headerPosition" field is currently hardcoded:

header_position="FIRST_ROW", # TODO: hardcoded
table_type="RELATION", # TODO: hardcoded

For inspiration, see:
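As a starting point, a simple heuristic sketch (hypothetical, not the repo's implementation) could flag the first row as a header when its cells are non-numeric while the rows below contain numbers; the "NONE" fallback value is an assumption:

    def is_numeric(cell: str) -> bool:
        try:
            float(cell.replace(',', ''))
            return True
        except ValueError:
            return False

    def guess_header_position(rows: list[list[str]]) -> str:
        # Heuristic: header rows usually contain no numeric cells,
        # while the data rows below them do.
        if len(rows) < 2:
            return "NONE"  # assumed fallback value
        first_row_numeric = any(is_numeric(c) for c in rows[0] if c)
        body_numeric = any(is_numeric(c) for row in rows[1:] for c in row if c)
        if not first_row_numeric and body_numeric:
            return "FIRST_ROW"
        return "NONE"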

Data collection strategy

As discussed in our last meeting, we agree that it does not make sense to randomly crawl every possible website on the web, which would most likely accrue a lot of junk.

Instead, we should come up with a strategy that biases the selection of websites towards high-quality data sources.

Ingestion service

We need to take items from the Kafka message queue and store them in the ArangoDB database.
Initial work for this has been done in:
#11
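A rough sketch of such a consumer using kafka-python and python-arango; the topic, database, collection, and connection details are assumptions, not the repo's actual configuration:

    import json

    from arango import ArangoClient
    from kafka import KafkaConsumer

    # Assumed topic and broker address; adjust to the actual deployment.
    consumer = KafkaConsumer(
        'harvested-tables',
        bootstrap_servers='localhost:9092',
        value_deserializer=lambda v: json.loads(v.decode('utf-8')),
    )

    client = ArangoClient(hosts='http://localhost:8529')
    db = client.db('tables', username='root', password='')
    collection = db.collection('raw_tables')

    for message in consumer:
        # Each message value is one JSON table record (see the data format issue).
        collection.insert(message.value)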

Setup Message queue

Add a message queue for the ingestion of the harvested tables into the database. We have chosen Apache Kafka for this purpose.
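For reference, the producer side of a harvester could push table records like this minimal kafka-python sketch; the topic name, broker address, and serialization are assumptions:

    import json

    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers='localhost:9092',
        value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    )

    # obj is a table record in the JSON format described above.
    producer.send('harvested-tables', obj)
    producer.flush()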
