benyeoh / grownup

GROWN+UP: A "Graph Representation Of a Webpage" Network + Utilizing Pre-Training

License: Apache License 2.0

Dockerfile 1.80% Shell 4.45% Python 91.67% Scala 2.07% PureBasic 0.02%

grownup's Introduction

GROWN+UP: A "Graph Representation Of a Webpage" Network Utilizing Pre-Training

This is the official repo of GROWN+UP and the accompanying benchmarks published in the CIKM '22 proceedings. The latest preprint can be found on arXiv.

Pre-requisites

The hardware / software requirements are:

  • Ubuntu 18.04 or newer
  • Docker 19.03 or newer with GPU support
    • (Optional but recommended)
  • NVIDIA GPU with CUDA 11.2.1 support (GPU driver version: >=460.32.03)
    • Typically, you don't need to install the CUDA libraries yourself if you use Docker
  • Git LFS
    • This is essential when you clone or pull from this repo, since some large files (i.e., pretrained weights) are stored in LFS.
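
As a quick sanity check before cloning, the sketch below looks for the tools listed above and compares the installed NVIDIA driver against the minimum version stated in the list. It is not part of this repo: the function names and the tool-to-executable mapping are assumptions made for illustration, and it only checks that the executables are on `PATH`, not that Docker has GPU support enabled.

```python
import shutil
import subprocess

# Minimum driver version taken from the requirements list above.
MIN_DRIVER = (460, 32, 3)

# Assumed mapping of prerequisite name -> executable expected on PATH.
REQUIRED_TOOLS = {
    "Docker": "docker",
    "Git LFS": "git-lfs",
    "NVIDIA driver": "nvidia-smi",
}


def find_tools(tools=REQUIRED_TOOLS):
    """Return {name: path or None} for each required executable."""
    return {name: shutil.which(exe) for name, exe in tools.items()}


def parse_version(text):
    """Turn a dotted version string such as '460.32.03' into (460, 32, 3)."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_version():
    """Ask nvidia-smi for the driver version; return None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
        lines = out.strip().splitlines()
        return parse_version(lines[0]) if lines else None
    except (subprocess.CalledProcessError, OSError, ValueError):
        return None


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
    version = driver_version()
    if version is None:
        print("Driver version: could not be determined")
    else:
        print("Driver version OK:", version >= MIN_DRIVER)
```

Versions are compared as integer tuples so that, for example, `470.57.02` correctly ranks above `460.32.03`.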

Introduction

The top level folder structure of this repo consists of:

  1. boilerplate-removal Webpage boilerplate removal benchmarks for GROWN+UP as well as other baselines mentioned in the paper.
  2. genre-classification Webpage genre classification benchmarks for GROWN+UP
  3. klassterfork A subset of an ML framework, built on TensorFlow v2.5, containing the GROWN+UP model components and other ML training necessities needed to reproduce the results
  4. pre-training TODO

For more details, please consult the README.md in the appropriate folders.

Citation

To cite this work, please use the following BibTeX entry:

@inproceedings{10.1145/3511808.3557340,
author = {Yeoh, Benedict and Wang, Huijuan},
title = {GROWN+UP: A ``Graph Representation Of a Webpage'' Network Utilizing Pre-Training},
year = {2022},
isbn = {9781450392365},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3511808.3557340},
doi = {10.1145/3511808.3557340},
booktitle = {Proceedings of the 31st ACM International Conference on Information & Knowledge Management},
pages = {2372–2382},
numpages = {11},
keywords = {web genre classification, webpage, boilerplate removal, feature extractor, self-supervised, graph neural network, backbone, pre-training},
location = {Atlanta, GA, USA},
series = {CIKM '22}
}

grownup's People

Contributors

benyeoh

Forkers

lzy-lsf

grownup's Issues

I received a message indicating that the quota has been exceeded when using git lfs.

Use `git lfs logs last` to view the log.
batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
Failed to fetch some objects from 'https://github.com/benyeoh/grownup.git/info/lfs'

Git LFS is not well suited to open-source software that many people download.
I suggest uploading the models to Hugging Face or providing them through another storage service.
Thank you.
