xiaoyuin / tntspa


Official repository of "Neural Machine Translating from Natural Language to SPARQL"

License: MIT License

Languages: Jupyter Notebook 66.20%, TeX 25.12%, Python 4.88%, Shell 3.47%, Perl 0.33%
Topics: sparql, natural-language-processing, neural-machine-translation


TNTSPA (Translating Natural language To SPARQL)

SPARQL is a highly powerful query language for an ever-growing number of Linked Data resources and Knowledge Graphs. Using it requires a certain familiarity with the entities in the domain to be queried as well as expertise in the language's syntax and semantics, none of which average human web users can be assumed to possess. To overcome this limitation, automatically translating natural language questions to SPARQL queries has been a vibrant field of research. However, to this date, the vast success of deep learning methods has not yet been fully propagated to this research problem.

This paper contributes to filling this gap by evaluating the utilization of eight different Neural Machine Translation (NMT) models for the task of translating from natural language to the structured query language SPARQL. While highlighting the importance of high-quantity and high-quality datasets, the results show a dominance of a CNN-based architecture with a BLEU score of up to 98 and accuracy of up to 94%.

Research Paper

Title: Neural Machine Translating from Natural Language to SPARQL

Authors: Dr. Dagmar Gromann, Prof. Sebastian Rudolph and Xiaoyu Yin

The PDF is available on arXiv: http://arxiv.org/abs/1906.09302

@article{DBLP:journals/corr/abs-1906-09302,
  author    = {Xiaoyu Yin and
               Dagmar Gromann and
               Sebastian Rudolph},
  title     = {Neural Machine Translating from Natural Language to {SPARQL}},
  journal   = {CoRR},
  volume    = {abs/1906.09302},
  year      = {2019},
  url       = {http://arxiv.org/abs/1906.09302},
  archivePrefix = {arXiv},
  eprint    = {1906.09302},
  timestamp = {Thu, 27 Jun 2019 18:54:51 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1906-09302.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Master Thesis

Title: Translating Natural language To SPARQL

Author: Xiaoyu Yin

Supervisors: Dr. Dagmar Gromann, Dr. Dmitrij Schlesinger

The thesis was completed on 8 January 2019 and has since been turned into the paper linked above.

Find the thesis in the thesis folder and the defense slides in the presentation folder; both are available in .tex and .pdf versions.

Datasets

Downloads (Google Drive)

Usage

Files ending in *.en (e.g. dev.en, train.en, test.en) contain English sentences; *.sparql files contain the corresponding SPARQL queries. Files sharing the same prefix have a one-to-one line mapping and were used in training as English–SPARQL pairs. vocab.* or dict. files are vocabulary files. fairseq has its own input-format requirements, so it did not use the aforementioned files directly; they were preprocessed into binary format, stored in the /fairseq-data-bin folder of each dataset.
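Given that one-to-one line alignment, a split can be read back into English–SPARQL pairs with a few lines of code. A minimal sketch (the load_parallel helper is illustrative, not part of the repository):

```python
from pathlib import Path

def load_parallel(prefix, data_dir="."):
    """Load an English-SPARQL split (e.g. prefix='train') as a list of
    (sentence, query) pairs, relying on the files' line alignment."""
    en_lines = Path(data_dir, f"{prefix}.en").read_text(encoding="utf-8").splitlines()
    sparql_lines = Path(data_dir, f"{prefix}.sparql").read_text(encoding="utf-8").splitlines()
    assert len(en_lines) == len(sparql_lines), "splits must be line-aligned"
    return list(zip(en_lines, sparql_lines))
```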

Sources

The datasets used in this paper were originally downloaded from the Internet; I split them in the way I needed to train the models. The sources are listed as follows:

Experimental Setup

Dataset splits and hyperparameters

See the paper.

Hardware configuration

(Hardware configuration figure; see the paper for details.)

Results

Raw data

We kept the inference translations of each model on each dataset, which were used to generate the BLEU scores, accuracies, and graphs in the sections below. The results are saved as dev_output.txt (validation set) and test_output.txt (test set) files and are available here (compact version).

A full version containing the raw output of the frameworks is also available.

Training

Plots of the training perplexity for each model and dataset are available in a separate PDF here.
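For reference, token-level perplexity is the exponential of the average negative log-likelihood that the model assigns to the target tokens. A minimal sketch of the formula (the helper name is illustrative):

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood, given the natural-log
    probability the model assigns to each target token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)
```

For example, a model that assigns uniform probability 1/4 to every target token has perplexity 4.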

Test results

Table of BLEU scores for all models on the validation and test sets.

Table of the accuracy (in %) of syntactically correct generated SPARQL queries, and F1 scores.
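The accuracy above is the percentage of generated queries that are syntactically valid SPARQL. The repository's actual validity check is not reproduced here; the sketch below uses a deliberately naive placeholder check (a known query form plus balanced braces) purely to illustrate how such a percentage is computed:

```python
def naive_syntax_ok(query):
    """Very rough placeholder, NOT a real SPARQL parser: the query must
    start with a known query form and have balanced curly braces."""
    forms = ("SELECT", "ASK", "CONSTRUCT", "DESCRIBE")
    return query.strip().upper().startswith(forms) and query.count("{") == query.count("}")

def accuracy(queries, is_valid=naive_syntax_ok):
    """Percentage of generated queries passing the validity check."""
    if not queries:
        return 0.0
    return 100.0 * sum(is_valid(q) for q in queries) / len(queries)
```

A real evaluation would substitute a proper SPARQL parser (e.g. from an RDF library) for naive_syntax_ok.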

Please find more results and detailed explanations in the research paper and the thesis.

Trained Models

Because some models (esp. GNMT4, GNMT8) were very space-consuming after training on some specific datasets (esp. DBNQA), I didn't download all of the models from the HPC server. This is an overview of the availability of the trained models on my drive:

Model        Monument  Monument80  Monument50  LC-QUAD  DBNQA
NSpM         yes       yes         yes         yes      yes
NSpM+Att1    yes       yes         yes         yes      yes
NSpM+Att2    yes       yes         yes         yes      yes
GNMT4        no        yes         no          no       no
GNMT8        no        no          no          no       no
LSTM_Luong   yes       yes         yes         yes      no
ConvS2S      yes       yes         yes         yes      no
Transformer  yes       yes         yes         yes      no

One More Thing

This paper and thesis could not have been completed without the help of my supervisors (Dr. Dagmar Gromann, Dr. Dmitrij Schlesinger and Prof. Sebastian Rudolph) and those great open source projects. My sincere appreciation goes to everyone who has been working on this subject, and I hope we will show the world its value in the near future.

By the way, I now work as an Android developer. Although I still have a passion for AI, may want to learn more, and might even build a career in it in the future, my current focus is on software engineering. I enjoy any kind of experience or knowledge sharing and would love to make new friends! Connect with me on LinkedIn.

tntspa's People

Contributors

xiaoyuin


tntspa's Issues

Installation

Do you have installation instructions? I would like to try your small model.

Perplexity and Accuracy scores

Hi,
How do I find the perplexity and accuracy scores for the datasets? Is there a readme I can follow to generate those scores from your codebase?

Where can I find the exact model architecture?

Hello,
I want to use your model in a small university project of mine (of course with proper citation).
If I understood everything correctly, the model was trained with the fairseq command-line tool.
I would like to use it in a Python script that I can call from my code.
Unfortunately, I haven't been able to get your model checkpoint to work with the ConvS2S model from the PyTorch Hub, because some keys are missing. If I'm not wrong, some kind of architecture is mentioned in your paper, but I can't figure out the exact details of how to rebuild it so I could use your checkpoint (this might be because I'm rather new to all this, so apologies for that). So I would like to ask if you could tell me what your model structure looks like, e.g. in PyTorch (or whatever you would prefer)?
