semagrow / sevod-scraper

Scrapes RDF dumps to generate SEVOD metadata for semagrow

License: Apache License 2.0

Topics: semagrow, rdf, void, sevod, metadata

sevod-scraper

A tool to create dataset metadata for Semagrow.

Build

Build with Maven using the following command:

mvn clean package

To run sevod-scraper, extract the *.tar.gz archive found in the assembly/target directory:

cd assembly/target
tar xzvf sevod-scraper-*-dist.tar.gz
cd bin
./sevod-scraper.sh

Usage

usage: sevod_scraper.sh [OPTIONS]...
Option                     Description
     --rdfdump             input is an RDF file in N-Triples format
     --geordfdump          input is a geospatial RDF file in N-Triples format
     --cassandra           input is a Cassandra keyspace
     --sparql              input is a SPARQL endpoint
 -i,--input <arg>          input
 -o,--output <arg>         output metadata file in Turtle format
 -e,--endpoint <arg>       SPARQL endpoint URL (used for annotation only)
 -p,--prefixes <arg>       list of known URI prefixes (comma-separated)
 -g,--graph <arg>          graph (only for SPARQL endpoints)
 -t,--extentType <arg>     extent type (mbb, union, or qtN; for geospatial RDF)
 -P,--polygon <arg>        known bounding polygon (for geospatial RDF files)
 -n,--namespace <arg>      namespace for URI mappings (only for Cassandra)
 -h,--help                 print a help message

Examples

This section presents examples of sevod-scraper usage for several input scenarios.

RDF dump file

Suppose that your dataset is stored in input.nt and its SPARQL endpoint is http://localhost:8080/sparql. To extract its metadata for Semagrow, issue the following command:

./sevod-scraper.sh --rdfdump -i input.nt -e http://localhost:8080/sparql -o output.ttl

The dump file should be in N-Triples format, and the metadata are exported in Turtle format. The SPARQL endpoint is not consulted; it is used only for annotation purposes.

Semagrow uses the URI prefixes of each dataset to perform more refined source selection. Therefore, if you know that all subject and object URIs of the dataset have one (or more) common prefixes, you can annotate this fact using the following command (the prefixes are comma-separated):

./sevod-scraper.sh --rdfdump -i input.nt -p http://semagrow.eu/p/,http://semagrow.eu/q/,http://www.example.org/ -e http://localhost:8080/sparql -o output.ttl

Otherwise, the output will contain metadata about subject and object authorities.
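
To decide which prefixes to pass via -p, it can help to inspect the dataset first. The following minimal Python sketch (a hypothetical helper, not part of sevod-scraper) counts the authority part of every subject, predicate, and object URI in an N-Triples file, so the dominant prefixes stand out:

```python
import re
from collections import Counter

URI = re.compile(r'<([^>]+)>')

def uri_authorities(ntriples_lines):
    """Count the scheme://authority/ part of every URI in the given triples."""
    counts = Counter()
    for line in ntriples_lines:
        for uri in URI.findall(line):
            m = re.match(r'(https?://[^/]+/)', uri)
            if m:
                counts[m.group(1)] += 1
    return counts

triples = [
    '<http://semagrow.eu/p/s1> <http://www.w3.org/2000/01/rdf-schema#label> "x" .',
    '<http://semagrow.eu/p/s2> <http://semagrow.eu/q/knows> <http://www.example.org/o1> .',
]
print(uri_authorities(triples).most_common())
```

The most frequent authorities (or longer common prefixes under them) are natural candidates for the -p option.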

Geospatial RDF file

If your RDF dump file is a geospatial dataset (i.e., it contains the geo:asWKT predicate), you can issue the following command:

./sevod-scraper.sh --geordfdump -i input.nt -t EXTENT_TYPE -e http://localhost:8080/sparql -o output.ttl

The functionality is the same as for the --rdfdump option, except that the output file will contain an annotation of the bounding polygon that contains all WKT literals of the dataset.

Extent types can be one of the following:

  • mbb, which exports the Minimum Bounding Box of all WKT literals
  • union, which exports the spatial union of all WKT literals
  • qtN, where N is an integer, which calculates an approximation of the union of all WKT literals using a quadtree of height N.
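
As an illustration of the mbb extent type, the following Python sketch computes a minimum bounding box over WKT POINT literals (an assumption about the technique only; the scraper's actual implementation handles arbitrary WKT geometries):

```python
import re

def mbb(wkt_points):
    """Minimum bounding box (min_x, min_y, max_x, max_y) of WKT POINT literals."""
    coords = []
    for wkt in wkt_points:
        m = re.match(r'POINT\s*\(\s*(-?[\d.]+)\s+(-?[\d.]+)\s*\)', wkt)
        if m:
            coords.append((float(m.group(1)), float(m.group(2))))
    xs, ys = zip(*coords)
    return (min(xs), min(ys), max(xs), max(ys))

points = ["POINT (23.7 37.9)", "POINT (22.9 40.6)", "POINT (26.1 38.4)"]
print(mbb(points))  # → (22.9, 37.9, 26.1, 40.6)
```

The union and qtN types refine this idea: instead of a single box, they keep the exact or quadtree-approximated union of the geometries.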

If you want to provide a manual spatial extent annotation, you can issue the following command:

./sevod-scraper.sh --geordfdump -i input.nt -P POLYGON_IN_WKT -e http://localhost:8080/sparql -o output.ttl

The dump file should be in N-Triples format, and the metadata are exported in Turtle format. The SPARQL endpoint is not consulted; it is used only for annotation purposes.

Cassandra keyspace

In order to create Semagrow metadata from a Cassandra keyspace, issue the following command:

./sevod-scraper.sh --cassandra -i IP_ADDRESS:PORT/KEYSPACE -n NAMESPACE_URI -o output.ttl

The Cassandra keyspace is specified by the following parameters: IP_ADDRESS is the Cassandra server IP address, PORT is the Cassandra server port, and KEYSPACE is the relevant Cassandra keyspace. Moreover, NAMESPACE_URI is a base string used to generate predicate URIs. At the moment, it should contain the "cassandra" substring.
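
The following Python sketch illustrates how such a namespace could map Cassandra columns to predicate URIs. Both the helper and the URI layout are assumptions for illustration; the scraper's actual URI scheme may differ. Only the "cassandra" substring constraint is taken from the documentation above:

```python
def predicate_uri(namespace, keyspace, table, column):
    """Build a hypothetical predicate URI for a Cassandra column under a base namespace.

    The namespace is required to contain the "cassandra" substring,
    mirroring the constraint stated in the documentation.
    """
    if "cassandra" not in namespace:
        raise ValueError('namespace must contain the "cassandra" substring')
    return f"{namespace.rstrip('/')}/{keyspace}/{table}#{column}"

print(predicate_uri("http://example.org/cassandra", "ks", "users", "name"))
# → http://example.org/cassandra/ks/users#name
```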

Metadata are exported in TTL format.

SPARQL endpoint

In order to create Semagrow metadata from a SPARQL endpoint, issue the following command:

./sevod-scraper.sh --sparql -i http://localhost:8890/sparql -o output.ttl

Since many SPARQL endpoints (such as Virtuoso) contain additional system graphs, you can specify the graph in which the dataset is contained. Example:

./sevod-scraper.sh --sparql -i http://localhost:8890/sparql -g http://localhost:8890/DAV  -o output.ttl

Metadata are exported in TTL format.

Spark mode

To extract SEVOD metadata from a dump file using Apache Spark, first build with the spark profile:

mvn clean package -P spark

Use the spark-submit script (see https://spark.apache.org/docs/latest/submitting-applications.html) to submit the application to an existing Spark cluster.

  • the application jar can be found in assembly/target/sevod-scraper-*-spark-onejar-jar-with-dependencies.jar
  • the main class is org.semagrow.sevod.scraper.Scraper
  • the arguments are the same as in the rdfdump mode.
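
Putting these together, a submission could look like the following sketch. The master URL and the input/output file names are placeholders; the jar path and main class are those listed above, and the arguments assume the rdfdump example shown earlier:

```shell
spark-submit \
  --class org.semagrow.sevod.scraper.Scraper \
  --master spark://your-master:7077 \
  assembly/target/sevod-scraper-*-spark-onejar-jar-with-dependencies.jar \
  --rdfdump -i input.nt -e http://localhost:8080/sparql -o output.ttl
```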

For an example configuration using Docker containers, see rdfdump-spark/src/main/resources.

sevod-scraper's People

Contributors

acharal, antru6, gmouchakis, stasinos


Forkers

mhoangvslev

sevod-scraper's Issues

Unit test failed when compiling

I compile using

mvn clean && mvn install dependency:copy-dependencies package #-Dmaven.test.skip=true

I receive this report:

Results :

Tests in error: 
  testUriPrunner(org.semagrow.sevod.Tests): Unable to create serializer "com.esotericsoftware.kryo.serializers.FieldSerializer" for class: java.nio.HeapByteBuffer

Tests run: 1, Failures: 0, Errors: 1, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for SEVOD scraper 3-SNAPSHOT:
[INFO] 
[INFO] SEVOD scraper ...................................... SUCCESS [  0.102 s]
[INFO] sevod-scraper-commons .............................. SUCCESS [  1.579 s]
[INFO] SEVOD scraper for RDF dump ......................... SUCCESS [  0.690 s]
[INFO] SEVOD scraper for RDF dump in Apache Spark ......... FAILURE [ 11.918 s]
[INFO] SEVOD scraper from Apache Cassandra ................ SKIPPED
[INFO] SEVOD scraper from SPARQL endpoint ................. SKIPPED
[INFO] SEVOD scraper for Geospatial RDF dump .............. SKIPPED
[INFO] SEVOD scraper Command-Line Interface ............... SKIPPED
[INFO] SEVOD scraper assembly ............................. SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  14.405 s
[INFO] Finished at: 2023-04-26T10:57:55+02:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test) on project sevod-scraper-rdf-spark: There are test failures.
[ERROR] 
[ERROR] Please refer to /GDD/RSFB/engines/semagrow/sevod-scraper/rdfdump-spark/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <args> -rf :sevod-scraper-rdf-spark
