d2klab / harvestingtablesinthewild

Table Collector for harvesting web tables in the wild, from the Common Crawl, MediaWiki, and the Web at large. Tables are converted to JSON and stored in ArangoDB, then semantically annotated (interpreted).

Home Page: https://tablecollector.tools.eurecom.fr/

Python 62.61% Dockerfile 1.96% Shell 1.56% Go 0.33% TeX 33.54%
Topics: corpus, interpretation, scrapping, tables

harvestingtablesinthewild's People

Contributors

ehrhart, jacksgt, rohitshubham

harvestingtablesinthewild's Issues

Data Format

We should find a common data format in which we can store extracted, parsed and augmented tables.
For now - to keep things simple - we are just using JSON data with the following format:

    from datetime import datetime

    obj = {
        'url': url,
        'timestamp': datetime.now().isoformat(),  # ISO 8601
        'page_title': page_title,
        'table': table,
    }

    # Example table:
    #
    #   Country  Size
    #   Finland  5.5
    #   Germany  80
    #   Norway
    #   USA      506
    #
    # table = {
    #     "Country": ["Finland", "Germany", "Norway", "USA"],
    #     "Size": ["5.5", "80", "", "506"],
    # }

In this thread we can collect some ideas for other data storage formats.

Extend data format

The data format should be extended with the following fields:

  • tableOrientation
  • language
  • columnNumber
  • rowNumber
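A minimal sketch of what an extended record could look like, building on the format above; the concrete field values shown here are illustrative assumptions, not actual output:

    obj = {
        'url': url,
        'timestamp': datetime.now().isoformat(),  # ISO 8601
        'page_title': page_title,
        'table': table,
        # proposed extensions (illustrative values):
        'tableOrientation': 'HORIZONTAL',
        'language': 'en',        # e.g. detected page or table language
        'columnNumber': 2,
        'rowNumber': 4,
    }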

Character normalization

As Eelis mentioned:

I've seen some utf-8 characters being used in place of similar looking ascii characters. We might want to do some character normalization before this. Example package: https://github.com/woodgern/confusables

Originally posted by @EelisK in #22 (comment)

So we should look into character normalization at least for get_term_set, but also consider whether the regular table cells and headers need to be normalized.
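As a baseline, standard-library Unicode normalization plus an explicit replacement map could look like the sketch below; this is an illustration rather than the confusables package itself, and the mapping entries are assumptions:

    import unicodedata

    # Illustrative homoglyph replacements; a real map (or the confusables
    # package linked above) would cover many more characters.
    HOMOGLYPHS = {
        '\u2019': "'",   # right single quotation mark -> apostrophe
        '\u00a0': ' ',   # non-breaking space -> plain space
        '\u2013': '-',   # en dash -> hyphen
    }

    def normalize_text(s: str) -> str:
        s = unicodedata.normalize('NFKC', s)  # fold compatibility characters
        for src, dst in HOMOGLYPHS.items():
            s = s.replace(src, dst)
        return s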

Implement table type and header position detection

Currently, we have no detection for the type of table ("tableType") we are crawling.
Also, the "headerPosition" field is currently hardcoded:

header_position="FIRST_ROW", # TODO: hardcoded
table_type="RELATION", # TODO: hardcoded

For inspiration, see:
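As a starting point, a simple heuristic sketch (hypothetical, not the repo's implementation) could flag the first row as a header when its cells are non-numeric while the rows below contain numbers; the "NONE" fallback value is an assumption:

    def is_numeric(cell: str) -> bool:
        try:
            float(cell.replace(',', ''))
            return True
        except ValueError:
            return False

    def guess_header_position(rows: list[list[str]]) -> str:
        # Heuristic: header rows usually contain no numeric cells,
        # while the data rows below them do.
        if len(rows) < 2:
            return "NONE"  # assumed fallback value
        first_row_numeric = any(is_numeric(c) for c in rows[0] if c)
        body_numeric = any(is_numeric(c) for row in rows[1:] for c in row if c)
        if not first_row_numeric and body_numeric:
            return "FIRST_ROW"
        return "NONE"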

Data collection strategy

As discussed in our last meeting, we agree that it does not make sense to randomly crawl every possible website on the web, which would most likely accrue a lot of junk.

Instead, we should come up with a strategy that biases the selection of websites towards high-quality data sources.

Ingestion service

We need to take items from the Kafka message queue and store them in the ArangoDB database.
Initial work for this has been done in:
#11
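A rough sketch of such a consumer using kafka-python and python-arango; the topic, database, collection, and connection details are assumptions, not the repo's actual configuration:

    import json

    from arango import ArangoClient
    from kafka import KafkaConsumer

    # Assumed topic and broker address; adjust to the actual deployment.
    consumer = KafkaConsumer(
        'harvested-tables',
        bootstrap_servers='localhost:9092',
        value_deserializer=lambda v: json.loads(v.decode('utf-8')),
    )

    client = ArangoClient(hosts='http://localhost:8529')
    db = client.db('tables', username='root', password='')
    collection = db.collection('raw_tables')

    for message in consumer:
        # Each message value is one JSON table record (see the data format issue).
        collection.insert(message.value)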

Setup Message queue

Add a message queue for the ingestion of the harvested tables into the database. We have chosen Apache Kafka for this purpose.
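For reference, the producer side of a harvester could push table records like this minimal kafka-python sketch; the topic name, broker address, and serialization are assumptions:

    import json

    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers='localhost:9092',
        value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    )

    # obj is a table record in the JSON format described above.
    producer.send('harvested-tables', obj)
    producer.flush()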
