This project forked from universaldependencies/ud_dutch-alpino

Dutch data.

License: Creative Commons Attribution Share Alike 4.0 International

# Summary

This corpus consists of samples from various treebanks annotated at the University of Groningen using the Alpino annotation tools and guidelines.

# Introduction

The data consists of samples from various treebanks annotated at the University of Groningen using the Alpino annotation tools and guidelines:

 * train consists of material from the original Alpino CD-ROM (file id 'cdb', 7000+ sentences from the Eindhoven corpus), questions used in a QA project (file ids with qa and wpspel), material from suites used for grammar maintenance (ids: g_suite, h_suite, leuven_yellow_pages), example sentences from the Dutch reference grammar ANS (eans), and the WR-P-P-H section of the Lassy Small corpus
 * dev consists of material from the WR-P-P-H section of the Lassy Small corpus
 * test consists of material from the WR-P-P-H and WR-P-P-L sections of the Lassy Small corpus

The data was thoroughly revised by Gosse Bouma and Gertjan van Noord for UD 2.1 in November 2017.
The new version was created using the same conversion script as was used for Dutch LassySmall.
As sources, we used the (manually corrected) Alpino treebank annotation for this material as it is
available in Groningen. Links to original files have been added. Note that tokenization may differ
from the previous UD version.

The conversion script can be found here: https://github.com/gossebouma/lassy2ud
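
Like other UD treebanks, the data is distributed in the CoNLL-U format: ten tab-separated columns per token, `# key = value` comment lines per sentence, and a blank line between sentences. The following is a minimal sketch of reading that format in Python. The function name `read_conllu` and the sample sentence are illustrative assumptions; the sample is invented and not taken from the treebank.

```python
from typing import Dict, Iterator, List

# The ten CoNLL-U columns, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[Dict[str, object]]:
    """Yield one dict per sentence holding its comments and token rows."""
    comments: Dict[str, str] = {}
    tokens: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if tokens:
                yield {"comments": comments, "tokens": tokens}
            comments, tokens = {}, []
        elif line.startswith("#"):
            # Comment lines have the shape "# key = value".
            key, sep, value = line[1:].partition("=")
            if sep:
                comments[key.strip()] = value.strip()
        else:
            cols = line.split("\t")
            if len(cols) == len(CONLLU_FIELDS):
                tokens.append(dict(zip(CONLLU_FIELDS, cols)))
    # Emit a final sentence that is not followed by a blank line.
    if tokens:
        yield {"comments": comments, "tokens": tokens}


# Invented example sentence (hypothetical annotation, for illustration only).
SAMPLE = (
    "# sent_id = example-1\n"
    "# text = Ik lees boeken.\n"
    "1\tIk\tik\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tlees\tlezen\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tboeken\tboek\tNOUN\t_\t_\t2\tobj\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE))
forms = [tok["form"] for tok in sentences[0]["tokens"]]
```

To use it on the released data, pass the contents of one of the `.conllu` files in the repository, read as UTF-8. The sketch skips any line that does not have exactly ten columns instead of reporting it, so it is not a validator.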

# Acknowledgements

# Older

Description of the material as it was included in UD 1.0 and 2.0:

The data were used in the CoNLL-X Shared Task in dependency parsing (2006); the CoNLL version
was taken and converted to the Prague dependency style as a part of HamleDT (since 2011).
Later versions of HamleDT added a conversion to the Stanford dependencies (2014) and to
Universal Dependencies (HamleDT 3.0, 2015). The conversion path from the original Alpino still
goes through the CoNLL-X format and the Prague dependencies, which may occasionally lead to
loss of information. The first release of Universal Dependencies that included this treebank
was UD v1.2 in November 2015. It was essentially the HamleDT conversion but the data was not
identical to HamleDT 3.0 because the conversion procedure had been further improved.


# References

* http://odur.let.rug.nl/~vannoord/trees/
* http://ufal.mff.cuni.cz/hamledt ... HamleDT
* http://ufal.mff.cuni.cz/treex ... Treex is the software used for conversion
* http://ufal.mff.cuni.cz/interset ... Interset was used to convert POS tags and features

* Gosse Bouma and Gertjan van Noord (2017). Increasing Return on Annotation Investment: The Automatic Construction of a Universal Dependency Treebank for Dutch. In: Proceedings of the Universal Dependencies Workshop, Gothenburg, 22 May 2017. http://aclweb.org/anthology/W17-0403
* van Noord G. et al. (2013). Large Scale Syntactic Annotation of Written Dutch: Lassy. In: Spyns P., Odijk J. (eds) Essential Speech and Language Technology for Dutch. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30910-6_9
* L. van der Beek, G. Bouma, R. Malouf, and G. van Noord (2002). The Alpino Dependency Treebank. In: Computational Linguistics in the Netherlands (CLIN) 2001, Twente University. http://www.let.rug.nl/~gosse/papers/clin01c.pdf



# Changelog

* 2018-04-15 v2.2
  * Repository renamed from UD_Dutch to UD_Dutch-Alpino.
* 2017-11-15 v2.1
  * First version of thoroughly revised data.
    It was created using the same conversion script as is being used for Dutch
    LassySmall. As sources, we used the (manually corrected) Alpino treebank
    annotation for this material as it is available in Groningen. Links to
    original files have been added. Issues:
    * tokenization may differ from the previous version.
    * some sentences are missing in the Alpino treebanks. In those cases UD 2.0
      annotation has been preserved.
* 2017-03-01 v2.0
  * Converted to UD v2 guidelines.
  * Reconsidered PRON vs. DET.
  * Changed advmod vs. obl distinction. This is a result of a general rule for
    the Prague deprel 'Adv'. However, we should extend it by a Dutch-specific
    rule for adjectives, which often act like adverbs and definitely not like
    obliques.
* 2016-05-15 v1.3
  * Multi-word expressions that were collapsed into one node (with underscores)
    are split again. This needs to be revisited and POS tags and MWE-internal
    relations improved.
  * Fixed adverbs that were attached as nmod; correct: advmod.
  * Copulas with clausal complements are now heads.
  * Relative pronouns are no longer treated as subordinating conjunctions.
    More work is needed: now all are attached as 'dobj', which may not be correct;
    noun phrases with relative determiners (welke boeken) and relative prepositional
    phrases (bij wat) are not handled properly.
  * Reversed the relation between the finite auxiliary "heb" and the participle.
  * Infinitives under modal verbs are now 'xcomp', not 'aux'.



<pre>
=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v1.2
License: CC BY-SA 4.0
Includes text: yes
Genre: news
Lemmas: converted from manual
UPOS: converted from manual
XPOS: manual native
Features: converted from manual
Relations: converted from manual
Contributors: Zeman, Daniel; Žabokrtský, Zdeněk; Bouma, Gosse; van Noord, Gertjan
Contributing: elsewhere
Contact: [email protected]
Paragraphs to web: 5
===============================================================================
</pre>
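
The metadata block above is plain `Key: value` text between two delimiter lines that start with `===`. The following is a minimal sketch of reading it in Python. The function name `parse_ud_metadata` is a hypothetical helper, not part of any UD tool, and the excerpt is a shortened copy of the block above.

```python
from typing import Dict


def parse_ud_metadata(readme: str) -> Dict[str, str]:
    """Collect "Key: value" pairs between the two "===" delimiter lines."""
    metadata: Dict[str, str] = {}
    inside = False
    for line in readme.splitlines():
        if line.startswith("==="):
            if inside:
                break  # closing delimiter: stop reading
            inside = True  # opening delimiter: start reading
            continue
        if inside and ":" in line:
            # Split on the first colon only, so values may contain colons.
            key, _, value = line.partition(":")
            metadata[key.strip()] = value.strip()
    return metadata


# Shortened copy of the metadata block from this README.
README_EXCERPT = """<pre>
=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v1.2
License: CC BY-SA 4.0
Genre: news
Contributors: Zeman, Daniel; Žabokrtský, Zdeněk; Bouma, Gosse; van Noord, Gertjan
===============================================================================
</pre>"""

metadata = parse_ud_metadata(README_EXCERPT)
# Contributors are separated by semicolons; each entry is "Surname, Given name".
contributors = [name.strip() for name in metadata["Contributors"].split(";")]
```

Splitting on the first colon keeps values intact when they contain colons themselves. Contributor entries are split on semicolons because each name already contains a comma.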
