semagrow / sevod-scraper

Scrapes RDF dumps to generate SEVOD metadata for semagrow

License: Apache License 2.0

Shell 0.76% Java 83.12% Scala 16.13%
semagrow rdf void sevod metadata

sevod-scraper's Introduction

Semagrow

Semagrow is a federated SPARQL query processor that allows combining, cross-indexing and, in general, making the best out of all public data, regardless of their size, update rate, and schema.

Semagrow offers a single SPARQL endpoint that serves data from remote data sources and that hides from client applications heterogeneity in both form (federating non-SPARQL endpoints) and meaning (transparently mapping queries and query results between vocabularies).

The main difference between Semagrow and most existing distributed querying solutions is that Semagrow targets the federation of heterogeneous and independently provided data sources.

In other words, Semagrow aims to offer the most efficient distributed querying solution that can be achieved without controlling the way data is distributed between sources and, in general, without having the responsibility to centrally manage the data sources of the federation.

Getting Started

Building

Building Semagrow from source requires a system with JDK 8 and Maven 3.1 or higher.
Optionally, PostgreSQL is needed for the query transformation functionality.
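
As a rough pre-flight check, the Maven version requirement above could be verified with a small shell helper. This is only a sketch: `version_at_least` is a hypothetical function name, it compares only the major and minor components, and it assumes version strings of the form `major.minor[.patch]`. The commented usage further assumes that the first line of `mvn -v` has the form `Apache Maven <version> ...`.

```shell
# version_at_least ACTUAL REQUIRED
# Succeeds if ACTUAL >= REQUIRED, comparing major and minor numbers only.
# Hypothetical helper, not part of Semagrow.
version_at_least() {
  a_major=${1%%.*}; a_rest=${1#*.}; a_minor=${a_rest%%.*}
  r_major=${2%%.*}; r_rest=${2#*.}; r_minor=${r_rest%%.*}
  [ "$a_major" -gt "$r_major" ] ||
    { [ "$a_major" -eq "$r_major" ] && [ "$a_minor" -ge "$r_minor" ]; }
}

# Usage (assumes Maven is on the PATH):
#   mvn_version=$(mvn -v | awk '/^Apache Maven/ {print $3; exit}')
#   version_at_least "$mvn_version" 3.1 || echo "Maven 3.1 or higher is required" >&2
```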

To build Semagrow you should type:

$ mvn clean install

in the top-level project directory. This produces a jar file in the target directory of each module and a war file in the target directory of the webgui module, which can be deployed to the servlet container of your choice.

Bundled with Apache Tomcat

Semagrow can also be built pre-bundled with the Apache Tomcat servlet server. To do so, run

$ mvn clean package -P tomcat-bundle

from the top-level directory of the project. This produces a compressed file in the target directory of the assembly module containing a fully equipped Apache Tomcat with Semagrow pre-installed. Note, however, that external dependencies such as the PostgreSQL database need to be installed and run separately.

Building a Docker image from sources

You can also test your build deployed in a Docker image (building requires Docker 18.09 or newer). To do so, run the following in the project root directory:

$ DOCKER_BUILDKIT=1 docker build -t semagrow .

The produced image will be tagged as semagrow:latest and will contain Tomcat with Semagrow deployed.

Configuration

By default, Semagrow looks for its configuration files in /etc/default/semagrow and expects to find at least a repository.ttl and a metadata.ttl file in order to establish a federation of endpoints. The repository.ttl file describes the configuration of the Semagrow endpoint, while the metadata.ttl file describes the endpoints to be federated. The repository.ttl file also defines the location of metadata.ttl, which can be changed to any desired path.

Samples of these configuration files can be found as resources of the http module.
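
Before starting Semagrow, it may help to confirm that both expected files are present. The following is a minimal sketch, not part of Semagrow itself: `check_semagrow_config` is a hypothetical helper that takes the configuration directory as an optional argument, defaults to /etc/default/semagrow, and only checks that the two files exist, not that their contents are valid.

```shell
# check_semagrow_config [DIR]
# Reports any of the two expected configuration files missing from DIR
# (default: /etc/default/semagrow) and returns non-zero if one is absent.
# Hypothetical helper for illustration only.
check_semagrow_config() {
  dir="${1:-/etc/default/semagrow}"
  status=0
  for f in repository.ttl metadata.ttl; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   check_semagrow_config                 # checks /etc/default/semagrow
#   check_semagrow_config /path/to/configuration
```

Note that if repository.ttl points to a metadata.ttl at a different path, this check would need to be adapted accordingly.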

Running Semagrow

Running Semagrow from the Apache Tomcat bundle

To run the Apache Tomcat bundle with Semagrow, you should

  1. uncompress the generated zip,
  2. copy the files from the resources folder to /etc/default/semagrow and
  3. run the startup.sh script located in the bin folder.

Semagrow can then be accessed at http://localhost:8080/SemaGrow/.

Running Semagrow using Docker

Official Semagrow Docker images are available on Docker Hub.

To run Semagrow using the latest official Docker image, execute

$ docker run -d semagrow/semagrow

However, you can also build your own Docker image using the steps described in the section "Building a Docker image from sources". The produced image will be tagged as semagrow and will contain Tomcat with Semagrow deployed.

To run the newly produced image you should execute

$ docker run -d semagrow

or if you want to test Semagrow with your configuration files (repository.ttl and metadata.ttl) issue

$ docker run -d -v /path/to/configuration:/etc/default/semagrow semagrow

In either case, you can access Semagrow at http://<CONTAINER_IP>:8080/SemaGrow/, where <CONTAINER_IP> is the address assigned to the Semagrow container, which can be retrieved using docker inspect.
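
One way to look up the container address and build the URL is sketched below. The helper names `container_ip` and `semagrow_url` are hypothetical, and the sketch assumes the container was started with an explicit name (for example by adding `--name my-semagrow` to the `docker run` commands above, which the original commands do not do). The `docker inspect` format string prints the IP address of each network the container is attached to.

```shell
# container_ip NAME_OR_ID
# Prints the IP address(es) Docker assigned to the given container.
container_ip() {
  docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' "$1"
}

# semagrow_url IP
# Prints the Semagrow URL for a given container IP address.
semagrow_url() {
  printf 'http://%s:8080/SemaGrow/\n' "$1"
}

# Usage (assumes a running container named "my-semagrow", a hypothetical name):
#   semagrow_url "$(container_ip my-semagrow)"
```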

Known issues

  • Semagrow uses UNION instead of VALUES to implement the BindJoin operator. This fails in 4store 1.1.5 and earlier versions in the presence of FILTER clauses due to an unsafe optimization by 4store.
  • When deploying in Glassfish 4 by copying the SemaGrow.war file into the autodeploy directory, Semagrow is accessible at http://DOMAIN/SemaGrow/index.jsp instead of http://DOMAIN/SemaGrow/.

sevod-scraper's People

Contributors

acharal, antru6, gmouchakis, stasinos

Forkers

mhoangvslev

sevod-scraper's Issues

Unit test failed when compiling

I compile using

mvn clean && mvn install dependency:copy-dependencies package #-Dmaven.test.skip=true

I receive this report:

Results :

Tests in error: 
  testUriPrunner(org.semagrow.sevod.Tests): Unable to create serializer "com.esotericsoftware.kryo.serializers.FieldSerializer" for class: java.nio.HeapByteBuffer

Tests run: 1, Failures: 0, Errors: 1, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for SEVOD scraper 3-SNAPSHOT:
[INFO] 
[INFO] SEVOD scraper ...................................... SUCCESS [  0.102 s]
[INFO] sevod-scraper-commons .............................. SUCCESS [  1.579 s]
[INFO] SEVOD scraper for RDF dump ......................... SUCCESS [  0.690 s]
[INFO] SEVOD scraper for RDF dump in Apache Spark ......... FAILURE [ 11.918 s]
[INFO] SEVOD scraper from Apache Cassandra ................ SKIPPED
[INFO] SEVOD scraper from SPARQL endpoint ................. SKIPPED
[INFO] SEVOD scraper for Geospatial RDF dump .............. SKIPPED
[INFO] SEVOD scraper Command-Line Interface ............... SKIPPED
[INFO] SEVOD scraper assembly ............................. SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  14.405 s
[INFO] Finished at: 2023-04-26T10:57:55+02:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test) on project sevod-scraper-rdf-spark: There are test failures.
[ERROR] 
[ERROR] Please refer to /GDD/RSFB/engines/semagrow/sevod-scraper/rdfdump-spark/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <args> -rf :sevod-scraper-rdf-spark
