FitLayout/2 - Web Page Analysis Framework

(c) 2015-2020 Radek Burget ([email protected])

FitLayout/2 is an extensible Java framework for web document rendering, modeling and analysis. It provides the following features:

  • A unified model for rendered web page representation in Java. The model describes the page at the level of individual boxes generated by a rendering engine and is independent of the source page format. It is suitable for further analysis, for example by page segmentation algorithms.
  • Page renderers for creating page models from source documents. Currently, two renderers are available:
    • A full-featured Chromium-based renderer - it allows rendering any web page including complex and dynamic web pages with JavaScript. It requires a separate backend based on Node.js that contains the Chromium web browser under the hood.
    • A simple built-in CSSBox-based renderer - it is a pure Java renderer with no additional dependencies that is suitable for quick rendering of simple web pages. It renders HTML+CSS pages and PDF documents and it may be faster for shorter documents because no external browser needs to be started. However, it does not support dynamic pages with JavaScript or complex CSS layouts.
  • Page segmentation algorithms for performing page segmentation on the rendered pages and the corresponding area tree model that describes the segmentation result. Currently, the following page segmentation methods are available:
    • VIPS - Vision-based page segmentation
    • BCS - Block clustering segmentation
    • Visual area grouping - A basic but configurable bottom-up segmentation method
      The framework provides all necessary tools and data structures for easily implementing more page segmentation algorithms.
  • RDF-based storage for keeping the rendered pages, segmentation results and other artifacts in a common repository. The rendered pages and the segmentation results are described using the prepared ontologies and may be stored or shared. This allows analyzing the rendered pages repeatedly with no need to re-render them. Moreover, it allows easy annotation of any part of the pages with different metadata and querying the page contents using SPARQL.
  • Infrastructure that puts everything together. It allows automation of the web document analysis process and invocation of the individual steps.

Documentation

Detailed documentation is available from the project Wiki.

Installation

Docker Images

To use FitLayout as is for web page rendering, segmentation and storage, the easiest way is to use the available Docker images.

Compilation from Source

FitLayout may be compiled using Maven. It requires a non-standard rdf4j dependency, which is provided in the lib folder. The following steps should be sufficient for compiling:

git clone https://github.com/FitLayout/FitLayout.git
cd FitLayout
cd lib
./install.sh
cd ..
mvn -DskipTests clean package install

After this, all the Maven artifacts should be installed and a runnable CLI tool should be available at fitlayout-tools/target/FitLayout.jar.

Note that for using the Chromium-based (puppeteer) renderer, the backend must be installed separately (see fitlayout-puppeteer).

Configuration

To use the RDF storage and the puppeteer renderer, some parameters need to be configured using Java properties. The CLI tool tries to find these properties in a config.properties file located in the current working directory. Alternatively, they may be configured via the java command line (the -D option). See the example configuration file for a list of the most important properties.
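
As an illustration of the two configuration routes, the following is a minimal sketch of a shell helper that turns the lines of a properties file into the equivalent -D options. The function name props_to_jvm_args and the property name in the usage comment are hypothetical and not part of FitLayout; see the example configuration file for the real property names. The helper assumes plain key=value lines without leading whitespace or spaces in values.

```shell
#!/bin/sh
# Print one -Dkey=value JVM option per property line in the given file.
# Blank lines and comment lines (starting with # or !) are skipped.
# Assumption: simple "key=value" lines only; no continuation lines,
# no leading whitespace and no spaces inside values.
props_to_jvm_args() {
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            ''|'#'*|'!'*) continue ;;
            *=*) printf '%s\n' "-D$line" ;;
        esac
    done < "$1"
}

# Hypothetical usage (the property name is illustrative only):
#   printf 'example.property=value\n' > config.properties
#   java $(props_to_jvm_args config.properties) -jar FitLayout.jar <commands>
```

Keeping the settings in config.properties is usually simpler. The -D form is mainly useful for overriding a single value for one run.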

Command Line Interface

The command line interface (CLI) is invoked by running FitLayout.jar (see Compilation from Source above) or by running the corresponding docker container. In both cases, it accepts a list of commands as explained below.

For the local installation (FitLayout.jar) run:

java -jar FitLayout.jar <commands>

Make sure that FitLayout.jar and, optionally, the config.properties configuration file are in the current working directory.

For the docker container, get the fitlayout.sh script according to the configuration instructions and then run:

./fitlayout.sh <commands>

Commands

The CLI tool understands a small set of commands such as RENDER for rendering the page, SEGMENT for performing page segmentation or EXPORT for exporting the result.

See the Command-line Interface wiki page for a complete reference on commands and their arguments.

Usage examples

Render a page using the puppeteer backend and export the model to an XML file:

./fitlayout.sh \
    RENDER -b puppeteer http://cssbox.sf.net \
    EXPORT -f xml

Render a page using the puppeteer backend, perform segmentation using VIPS and export to an XML file:

./fitlayout.sh \
    RENDER -b puppeteer http://cssbox.sf.net \
    SEGMENT -m vips -O pDoC=9 \
    EXPORT -f xml

Render a page using the cssbox backend, store a screenshot, perform segmentation using BCS, store a screenshot of the segmented page and export the areas in RDF/Turtle:

./fitlayout.sh \
    RENDER -b cssbox http://cssbox.sf.net \
    EXPORT -f png -o /tmp/page.png \
    SEGMENT -m bcs \
    EXPORT -f png -o /tmp/segments.png \
    EXPORT -f turtle
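
To process several pages with the same command chain, the invocation can be wrapped in a small script. The following is a minimal sketch that only prints the command for each URL (a dry run) so it can be checked before anything is executed. The function name fitlayout_cmd, the FITLAYOUT variable and the numbered output paths are assumptions made for this example; only the commands and options already shown above are used.

```shell
#!/bin/sh
# Print the FitLayout command line for one page; nothing is executed here.
# $1 = page URL, $2 = output PNG path for the segmented page.
# FITLAYOUT is a hypothetical variable naming the launcher script.
fitlayout_cmd() {
    printf '%s RENDER -b cssbox %s SEGMENT -m bcs EXPORT -f png -o %s\n' \
        "${FITLAYOUT:-./fitlayout.sh}" "$1" "$2"
}

# Number the outputs so that each page gets its own file.
# Add more URLs to the list as needed.
n=0
for url in http://cssbox.sf.net; do
    n=$((n + 1))
    fitlayout_cmd "$url" "/tmp/segments-$n.png"
done
```

Once the printed commands look right, they can be run by piping the output of the script to sh.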

See the Command-line Interface wiki page for more examples including the usage of the built-in RDF storage.

Publication

If you find FitLayout useful for your scientific work, please cite the following publication:

MILIČKA Martin and BURGET Radek. Information Extraction from Web Sources based on Multi-aspect Content Analysis. In: Semantic Web Evaluation Challenges, SemWebEval 2015 at ESWC 2015. Communications in Computer and Information Science, vol. 2015. Portorož: Springer International Publishing, 2015, pp. 81-92. ISBN 978-3-319-25517-0. ISSN 1865-0929.

License

All the source code of the FIT Layout Analysis Framework is licensed under the GNU Lesser General Public License (LGPL), version 3. A copy of the LGPL can be found in the LICENSE file.

The framework is under development and its API or functionality may change in future versions. See the CHANGELOG for the most important changes to the previous versions.
