xiaoyuin / tntspa


Official repository of "Neural Machine Translating from Natural Language to SPARQL"

License: MIT License

Languages: Jupyter Notebook 66.20%, TeX 25.12%, Python 4.88%, Shell 3.47%, Perl 0.33%
Topics: sparql, natural-language-processing, neural-machine-translation


TNTSPA (Translating Natural language To SPARQL)

SPARQL is a highly powerful query language for an ever-growing number of Linked Data resources and Knowledge Graphs. Using it requires a certain familiarity with the entities in the domain to be queried as well as expertise in the language's syntax and semantics, none of which average human web users can be assumed to possess. To overcome this limitation, automatically translating natural language questions to SPARQL queries has been a vibrant field of research. However, to this date, the vast success of deep learning methods has not yet been fully propagated to this research problem.

This paper contributes to filling this gap by evaluating the utilization of eight different Neural Machine Translation (NMT) models for the task of translating from natural language to the structured query language SPARQL. While highlighting the importance of high-quantity and high-quality datasets, the results show a dominance of a CNN-based architecture with a BLEU score of up to 98 and accuracy of up to 94%.

Research Paper

Title: Neural Machine Translating from Natural Language to SPARQL

Authors: Dr. Dagmar Gromann, Prof. Sebastian Rudolph and Xiaoyu Yin

The PDF is available on arXiv: http://arxiv.org/abs/1906.09302

@article{DBLP:journals/corr/abs-1906-09302,
  author    = {Xiaoyu Yin and
               Dagmar Gromann and
               Sebastian Rudolph},
  title     = {Neural Machine Translating from Natural Language to {SPARQL}},
  journal   = {CoRR},
  volume    = {abs/1906.09302},
  year      = {2019},
  url       = {http://arxiv.org/abs/1906.09302},
  archivePrefix = {arXiv},
  eprint    = {1906.09302},
  timestamp = {Thu, 27 Jun 2019 18:54:51 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1906-09302.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Master Thesis

Title: Translating Natural language To SPARQL

Author: Xiaoyu Yin

Supervisors: Dr. Dagmar Gromann, Dr. Dmitrij Schlesinger

The thesis was completed on 8 January 2019 and has since been turned into the paper linked above.

Find the thesis in the thesis folder and the defense slides in the presentation folder; both are available in .tex and .pdf versions.

Datasets

Downloads (Google Drive)

Usage

Files ending in *.en (e.g. dev.en, train.en, test.en) contain English sentences; *.sparql files contain the corresponding SPARQL queries. Files sharing the same prefix have a one-to-one line mapping and were used in training as English–SPARQL pairs. vocab.* or dict. files are vocabulary files. fairseq has its own input-format requirements, so it did not use the aforementioned files directly; they were preprocessed into binary format, stored in the /fairseq-data-bin folder of each dataset.
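Given that one-to-one line alignment, a split can be read back into English–SPARQL pairs with a few lines of code. A minimal sketch (the load_parallel helper is illustrative, not part of the repository):

```python
from pathlib import Path

def load_parallel(prefix, data_dir="."):
    """Load an English-SPARQL split (e.g. prefix='train') as a list of
    (sentence, query) pairs, relying on the files' line alignment."""
    en_lines = Path(data_dir, f"{prefix}.en").read_text(encoding="utf-8").splitlines()
    sparql_lines = Path(data_dir, f"{prefix}.sparql").read_text(encoding="utf-8").splitlines()
    assert len(en_lines) == len(sparql_lines), "splits must be line-aligned"
    return list(zip(en_lines, sparql_lines))
```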

Sources

The datasets used in this paper were originally downloaded from the Internet; I split them in the way I needed to train the models. The sources are listed as follows:

Experimental Setup

Dataset splits and hyperparameters

See the paper.

Hardware configuration

(Hardware configuration figure; see the paper for details.)

Results

Raw data

We kept the inference translations of each model on each dataset, which were used to generate the BLEU scores, accuracies, and graphs in the sections below. The results are saved as dev_output.txt (validation set) and test_output.txt (test set) files and are available here (compact version).

A full version containing the raw output of the frameworks is also available.

Training

Plots of the training perplexity for each model and dataset are available in a separate PDF here.
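For reference, token-level perplexity is the exponential of the average negative log-likelihood that the model assigns to the target tokens. A minimal sketch of the formula (the helper name is illustrative):

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood, given the natural-log
    probability the model assigns to each target token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)
```

For example, a model that assigns uniform probability 1/4 to every target token has perplexity 4.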

Test results

Table of BLEU scores for all models on the validation and test sets.

Table of the accuracy (in %) of syntactically correct generated SPARQL queries, and F1 scores.
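The accuracy above is the percentage of generated queries that are syntactically valid SPARQL. The repository's actual validity check is not reproduced here; the sketch below uses a deliberately naive placeholder check (a known query form plus balanced braces) purely to illustrate how such a percentage is computed:

```python
def naive_syntax_ok(query):
    """Very rough placeholder, NOT a real SPARQL parser: the query must
    start with a known query form and have balanced curly braces."""
    forms = ("SELECT", "ASK", "CONSTRUCT", "DESCRIBE")
    return query.strip().upper().startswith(forms) and query.count("{") == query.count("}")

def accuracy(queries, is_valid=naive_syntax_ok):
    """Percentage of generated queries passing the validity check."""
    if not queries:
        return 0.0
    return 100.0 * sum(is_valid(q) for q in queries) / len(queries)
```

A real evaluation would substitute a proper SPARQL parser (e.g. from an RDF library) for naive_syntax_ok.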

Please find more results and detailed explanations in the research paper and the thesis.

Trained Models

Because some models (esp. GNMT4, GNMT8) were very space-consuming after training on some specific datasets (esp. DBNQA), I didn't download all of the models from the HPC server. This is an overview of the availability of the trained models on my drive:

Model        Monument  Monument80  Monument50  LC-QUAD  DBNQA
NSpM         yes       yes         yes         yes      yes
NSpM+Att1    yes       yes         yes         yes      yes
NSpM+Att2    yes       yes         yes         yes      yes
GNMT4        no        yes         no          no       no
GNMT8        no        no          no          no       no
LSTM_Luong   yes       yes         yes         yes      no
ConvS2S      yes       yes         yes         yes      no
Transformer  yes       yes         yes         yes      no

One More Thing

This paper and thesis could not have been completed without the help of my supervisors (Dr. Dagmar Gromann, Dr. Dmitrij Schlesinger and Prof. Sebastian Rudolph) and those great open source projects. My sincere appreciation goes to everyone who has been working on this subject, and I hope we will show the world its value in the near future.

By the way, I now work as an Android developer. Although I still have a passion for AI, may want to learn more, and might even build a career in it in the future, my current focus is on software engineering. I enjoy any kind of experience or knowledge sharing and would love to make new friends! Connect with me on LinkedIn.

tntspa's People

Contributors

xiaoyuin


tntspa's Issues

Installation

Do you have installation instructions? I would like to try your small model.

Perplexity and Accuracy scores

Hi,
How do I find the perplexity and accuracy scores for the datasets? Is there a readme I can follow to generate those scores from your codebase?

Where can I find the exact model architecture?

Hello,
I want to use your model in a small university project of mine (of course with proper citation).
If I understood everything correctly, the model was trained with the fairseq command-line tool.
I would like to use it in a Python script that I can call from my code.
Unfortunately, I haven't been able to get your model checkpoint to work with the ConvS2S model from the PyTorch Hub, because some keys are missing. If I'm not wrong, some kind of architecture is mentioned in your paper, but I can't figure out the exact details of how to rebuild it so I could use your checkpoint (this might be because I'm rather new to all this, so apologies for that). So I would like to ask if you could tell me what your model structure looks like, e.g. in PyTorch (or whatever you would prefer)?
