lr-por / cl-conllu

tool for working with conllu files in CL

License: Apache License 2.0

Common Lisp 99.81% Dockerfile 0.19%

nlp lisp library conll-u

cl-conllu's Introduction

Library for working with CoNLL-U files with CL

cl-conllu is a Common Lisp library for working with CoNLL-U files, licensed under the Apache License 2.0.

It is developed and tested with SBCL but should probably run on any other Common Lisp implementation.

Install

The cl-conllu library is available from the Quicklisp distribution. If you are not planning to change the code, just use:

(ql:quickload :cl-conllu)

If you don’t have quicklisp installed already, follow these steps.

If you plan on contributing, clone this project to your local-projects quicklisp directory (usually at ~/quicklisp/local-projects/) and use the same command as above to load the code.

Documentation

See the wiki: https://github.com/own-pt/cl-conllu/wiki

How to cite

http://arademaker.github.io/bibliography/tilic-stil-2017.html

@inproceedings{tilic-stil-2017,
  author = {Muniz, Henrique and Chalub, Fabricio and Rademaker, Alexandre},
  title = {CL-CONLLU: dependências universais em Common Lisp},
  booktitle = {V Workshop de Iniciação Científica em Tecnologia da
                    Informação e da Linguagem Humana (TILic)},
  year = {2017},
  address = {Uberlândia, MG, Brazil},
  note = {https://sites.google.com/view/tilic2017/}
}

cl-conllu's People

Forkers

odanoburu joaomamorim gppassos cristiananc wllsena

cl-conllu's Issues

emacs integration

better integration with https://github.com/odanoburu/conllu-mode by @odanoburu

One idea that might improve your workflow is to have the report in a format that Emacs Compilation Mode (https://www.gnu.org/software/emacs/manual/html_node/emacs/Compilation-Mode.html) understands, so that you can click on an error and open the sentence directly. (You would have to include the file name and line/column, I think.)
draw needs to call cl-conllu, passing the sentence on STDIN.
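
As a rough illustration of the report format idea: Compilation Mode recognizes GNU-style error lines of the form FILE:LINE:COLUMN: MESSAGE. The sketch below only shows how such a line could be produced; format-report-line is a hypothetical helper, not an existing cl-conllu function, and the file name and message in the usage example are made up.

```lisp
;; Hypothetical helper (not part of cl-conllu): build one report line in the
;; GNU "FILE:LINE:COLUMN: MESSAGE" style that Emacs Compilation Mode can parse.
(defun format-report-line (file line column message)
  (format nil "~a:~d:~d: ~a" file line column message))

;; Usage with made-up values:
;; (format-report-line "documents/CF0155.conllu" 12 7 "head out of range")
```

A validator that prints one such line per problem could then be run under M-x compile.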

handling metadata

In http://universaldependencies.org/format.html, little information is provided about comments and metadata:

Lines starting with the # character and preceding a sentence are considered as carrying comments or metadata relevant to the following sentence.

We need to handle both cases:

# value
# key = value
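
A minimal sketch of how both cases could be told apart, using only standard Common Lisp. The name parse-comment-line and the "raw" key used for lines without a key are assumptions made for illustration; they are not how cl-conllu stores metadata. The first = is taken as the separator, so a value may itself contain = or #.

```lisp
;; Hypothetical helper (not part of cl-conllu).
(defun parse-comment-line (line)
  "Return (KEY . VALUE) for a CoNLL-U comment line.
Lines of the form `# value` get the placeholder key \"raw\"."
  (let* ((blanks '(#\Space #\Tab))
         ;; Drop the leading # characters, then the surrounding blanks.
         (text (string-trim blanks (string-left-trim "#" line)))
         (pos (position #\= text)))
    (if pos
        (cons (string-trim blanks (subseq text 0 pos))
              (string-trim blanks (subseq text (1+ pos))))
        (cons "raw" text))))

;; (parse-comment-line "# sent_id = CF670-8") ; => ("sent_id" . "CF670-8")
;; (parse-comment-line "# just a note")       ; => ("raw" . "just a note")
```

One known limitation of this sketch: a free-text comment that happens to contain = is treated as key = value.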

draw doesn't finish

# text = O compromisso com a precisão (e, por extensão, com o leitor) vale menos do que a torcida da imprensa nessas horas.
# sent_id = CF670-8
# source = CETENFolha n=670 cad=Brasil sec=pol sem=94b &D
# id = 2810
1	O	o	DET	<artd>|ART|M|S|@>N	Definite=Def|Gender=Masc|Number=Sing|PronType=Art	2	det	_	_
2	compromisso	compromisso	NOUN	<np-def>|N|M|S|@SUBJ>	Gender=Masc|Number=Sing	16	nsubj	_	_
3	com	com	ADP	<first-cjt>|PRP|@N<	_	5	case	_	_
4	a	o	DET	<artd>|ART|F|S|@>N	Definite=Def|Gender=Fem|Number=Sing|PronType=Art	5	det	_	_
5	precisão	precisão	NOUN	<first-cjt>|<np-def>|N|F|S|@P<	Gender=Fem|Number=Sing	2	nmod	_	_
6	(	(	PUNCT	PU|@PU	_	10	punct	_	ChangedBy=Issue165|SpaceAfter=No
7	e	e	CCONJ	<co-postnom>|KC|@CO	_	14	cc	_	ChangedBy=Issue165|SpaceAfter=No|d2d:#106
8	,	,	PUNCT	PU|@PU	_	9	punct	_	d2d:#106
9	por	por	ADP	PRP|@<ADVL	_	10	case	_	_
10	extensão	extensão	NOUN	<np-idf>|N|F|S|@P<	Gender=Fem|Number=Sing	5	nmod	_	ChangedBy=Issue165|SpaceAfter=No
11	,	,	PUNCT	PU|@PU	_	9	punct	_	_
12	com	com	ADP	<cjt>|PRP|@N<	_	14	case	_	_
13	o	o	DET	<artd>|ART|M|S|@>N	Definite=Def|Gender=Masc|Number=Sing|PronType=Art	14	det	_	_
14	leitor	leitor	NOUN	<np-def>|N|M|S|@P<	Gender=Masc|Number=Sing	5	conj	_	ChangedBy=Issue165|SpaceAfter=No
15	)	)	PUNCT	PU|@PU	_	10	punct	_	_
16	vale	valer	VERB	<mv>|V|PR|3S|IND|@FS-STA	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	0	root	_	_
17	menos	pouco	ADV	<quant>|<KOMP>|<COMP>|ADV|@<ADVL	_	16	advmod	_	_
18-19	do	_	_	_	_	_	_	_	_
18	de	de	ADP	<sam->|PRP|@COM	_	22	case	_	MWE=do_que
19	o	o	PRON	<dem>|<-sam>|DET|M|S|@P<	Gender=Masc|Number=Sing|PronType=Dem	18	fixed	_	_
20	que	que	PRON	<rel>|INDP|M|S|@N<	Gender=Masc|Number=Sing|PronType=Rel	18	fixed	_	_
21	a	o	DET	<artd>|ART|F|S|@>N	Definite=Def|Gender=Fem|Number=Sing|PronType=Art	22	det	_	_
22	torcida	torcida	NOUN	<first-cjt>|<np-def>|N|F|S|@KOMP<	Gender=Fem|Number=Sing	17	obl	_	_
23-24	da	_	_	_	_	_	_	_	_
23	de	de	ADP	<sam->|PRP|@N<	_	25	case	_	_
24	a	o	DET	<-sam>|<artd>|ART|F|S|@>N	Definite=Def|Gender=Fem|Number=Sing|PronType=Art	25	det	_	_
25	imprensa	imprensa	NOUN	<np-def>|N|F|S|@P<	Gender=Fem|Number=Sing	22	nmod	_	_
26-27	nessas	_	_	_	_	_	_	_	_
26	em	em	ADP	<sam->|PRP|@<ADVL	_	28	case	_	_
27	essas	esse	DET	<-sam>|<dem>|DET|F|P|@>N	Gender=Fem|Number=Plur|PronType=Dem	28	det	_	_
28	horas	hora	NOUN	<np-def>|N|F|P|@P<	Gender=Fem|Number=Plur	16	obl	_	ChangedBy=Issue137|ChangedBy=Issue165|SpaceAfter=No
29	.	.	PUNCT	PU|@PU	_	16	punct	_	_

Implement a reader of CoNLL-U data in RDF format

We can now export CoNLL-U data as RDF. It'd be interesting to be able to read it in this format as well. We expect that wilbur offers RDF parsing tools.

Should not lowercase lemmas

Problem line:

https://github.com/own-pt/cl-conllu/blob/master/conllu-prolog.lisp#L84

Test rules

In 4dca2b I solved the merge conflict and tested the corte-e-costura function.

The test was done using this conllu file and these rules:

(=> ((?a (= upostag "ADV") (= lemma "além"))
     *
     (?b (= lemma "ter")))
    ((?b (+ lemma "============"))))

(=> (?a (= lemma "liberdade") (= upostag "NOUN"))
    (?a (! upostag "===========")))

(=> ((?a (= lemma "trabalho"))
     ?
     (?b (= lemma "passar"))
     ?
     (?c (= lemma "em")))
    ((?a (! lemma "============"))))

Documentation

http://www.didierverna.net/blog/index.php?post/2017/12/13/Announcing-Quickref%3A-a-global-documentation-project-for-Common-Lisp

Functions for evaluating parsers

It's useful to have functions for evaluating parsers (UAS, LAS, precision, recall, confusion matrix, etc.), testing the results for sentences against a golden dataset. I'll use them myself this week.

I'm starting the eval-functions branch in order to develop this and already did the first commit: 439d192.
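
For reference, a small sketch of the two attachment scores. It is not the code from the eval-functions branch: attachment-scores is a hypothetical name, and each word is assumed to be given as a (HEAD . DEPREL) pair, with the gold and predicted lists aligned word by word.

```lisp
;; Hypothetical helper (not part of cl-conllu).
(defun attachment-scores (gold predicted)
  "Return UAS and LAS as two values (exact rationals).
GOLD and PREDICTED are aligned lists of (HEAD . DEPREL) pairs."
  (let ((n (length gold))
        (uas 0)
        (las 0))
    (loop for (ghead . grel) in gold
          for (phead . prel) in predicted
          do (when (= ghead phead)
               (incf uas)                 ; head is right: counts for UAS
               (when (string= grel prel)
                 (incf las))))            ; head and label right: counts for LAS
    (if (zerop n)
        (values 0 0)
        (values (/ uas n) (/ las n)))))

;; (attachment-scores '((2 . "det") (0 . "root") (2 . "obj"))
;;                    '((2 . "nmod") (0 . "root") (2 . "obj")))
;; => 1, 2/3
```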

Wrong metadata format

According to http://universaldependencies.org/format.html, metadata needs a = separating key and value.

We need to:

support the new and old formats
write only the new

This way we can easily convert existing CONLL-U files.

Sentence with only "#" breaks READ-CONLLU

The following sentence breaks READ-CONLLU:

# text = #
# sent_id = 2016-06-22-F-zone-Geology-for-Rock-Mechanics-draft.pptx.potx34
1	#	#	NOUN	NN	Number=Sing	0	root	_	TokenRange=1479:1480

with the following backtrace:

  0: (SB-KERNEL:VECTOR-SUBSEQ* #<unavailable argument> #<unavailable argument> #<unavailable argument>)
  1: (CL-CONLLU::COLLECT-META ("# text = #" "# sent_id = 2016-06-22-F-zone-Geology-for-Rock-Mechanics-draft.pptx.potx34"))
  2: (MAKE-SENTENCE 302 ("# text = #" "# sent_id = 2016-06-22-F-zone-Geology-for-Rock-Mechanics-draft.pptx.potx34" "1	#	#	NOUN	NN	Number=Sing	0	root	_	TokenRange=1479:1480") #<FUNCTION CL-CONLLU::COLLECT-M..

read PALAVRAS output

In https://github.com/own-pt/cl-conllu/blob/master/read-write.lisp we need a function that reads PALAVRAS output such as https://github.com/cpdoc/dhbb/blob/master/pal/100.dep and converts it to CoNLL-U format.

more drawing outputs

udapi-python also has a module for generating trees in LaTeX, tikz.py. SVG (https://pt.wikipedia.org/wiki/SVG) or Graphviz (https://www.graphviz.org) output would also be interesting.

we allow invalid conllu to be written

If we set a token field to the empty string, it is serialized as an empty string, which is invalid CoNLL-U. It should be serialized as _.
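
A sketch of the intended behaviour, assuming field values are strings (or NIL). The names field->string and token-line are hypothetical and do not correspond to cl-conllu's writer.

```lisp
;; Hypothetical helpers (not part of cl-conllu).
(defun field->string (value)
  "Serialize a missing or empty field as the CoNLL-U placeholder _."
  (if (or (null value) (string= value ""))
      "_"
      value))

(defun token-line (fields)
  "Join the serialized FIELDS with tab characters."
  (with-output-to-string (out)
    (loop for (field . more) on fields
          do (write-string (field->string field) out)
          when more
            do (write-char #\Tab out))))
```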

Projective sentences

Add a function that verifies whether the dependency parse of a sentence is projective.

I've already made two commits about it: 79bcc35 and 7c6f700, but testing (and possibly improving the code) is still necessary.
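
For comparison while testing, here is an independent sketch of one common way to check projectivity; it is not the code from those commits. It assumes the heads form a valid tree and are given as a plain list in which element i-1 is the head id of word i (0 for the root). With an artificial root at position 0, the tree is projective when no two arcs cross. projective-p is a hypothetical name.

```lisp
;; Hypothetical helper (not part of cl-conllu).
(defun projective-p (heads)
  "True when no two dependency arcs cross.
HEADS lists the head id of words 1, 2, ... in order; 0 marks the root."
  (let ((arcs (loop for head in heads
                    for dep from 1
                    ;; Store each arc as (LEFT . RIGHT) positions.
                    collect (cons (min head dep) (max head dep)))))
    (loop for (a . b) in arcs
          never (loop for (c . d) in arcs
                      ;; Arcs cross when a < c < b < d.
                      thereis (and (< a c b) (< b d))))))

;; (projective-p '(2 0 2))   ; => T
;; (projective-p '(3 4 0 3)) ; => NIL, arcs 1-3 and 2-4 cross
```

The pairwise comparison is quadratic in sentence length, which is usually acceptable for single sentences.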

bug at sentence->deep

Fix bug at

https://github.com/own-pt/cl-conllu/blob/master/data.lisp#L63-L75

(defun sentence->deep (sentence &key fn-key)
  (labels ((ensure-list (key)
             (if (symbolp key) (list fn-key) key)))
    (if (functionp fn-key)
        (deep-aux (sentence-root sentence) sentence fn-key)
        (if (or (symbolp fn-key)
                (listp fn-key))
            (deep-aux (sentence-root sentence) sentence
                      (lambda (tk)
                        (let ((out (loop for k in (ensure-list fn-key)
                                         collect (slot-value tk k))))
                          (if (and (listp out) (= 1 (length out)))
                              (car out)
                              out))))))))

(insert bug description or example of error?)

After including Wilbur in cl-conllu.asd, it conflicted with rules.lisp, probably due to changes in the readtable. Replacing some symbols' names fixed the problem, but a better solution should be found.

Is it possible to isolate these changes to the readtable? (or perhaps even avoid them altogether, as we currently don't use them in rdf-wilbur.lisp)

Functions for "exporting semantic roles" as a RDF

We'll need functions for manipulating CoNLL-U files in order to extract semantic tags to some kind of RDF structure. These tags are returned, for instance, from Palavras or as a result of rule applications. These semantic tags will be at the MISC field.

Does not build today - argument mismatch

I get this:

; in: DEFUN DEEP-AUX
;     (CL-CONLLU::DEEP-AUX CL-CONLLU::CHILD CL-CONLLU:SENTENCE CL-CONLLU::FN-KEY)
; 
; caught WARNING:
;   The function was called with three arguments, but wants exactly two.

Edit distance: use library

I don't see a need to roll our own edit distance function. There are plenty of libraries with a lot of different variants already available:

add parameters in the visualization

In #15 we added the vertical tree visualisation. It'd be good to allow users to specify which token information they want in the tree.

bug on drawing

CL-CONLLU> (conllu.draw:tree-sentence (sentence-by-id "CF155-1" "documents/CF0155.conllu"))
─┮ 
 │   ╭─╼ Esses det 
 │ ╭─┾ núcleos nsubj:pass 
 │ │ ├─╼ coloniais amod 
 │ │ │ ╭─╼ , punct 
 │ │ │ │ ╭─╼ entre case 
 │ │ ├ │─┶ eles nmod 
 │ │ ╰─┾ o appos 
 │ │   │ ╭─╼ de case 
 │ │   │ ├─╼ o det 
 │ │   ├─┾ Vale nmod 
 │ │   │ │ ╭─╼ de case 
 │ │   │ │ ├─╼ o det 
 │ │   │ ╰─┶ Ribeira nmod 
 │ │   ╰─╼ , punct 
 │ ├─╼ foram aux:pass 
 ╰─┾ ocupados root 
   │ ╭─╼ a case 
   ├─┾ partir obl 
   │ │ ╭─╼ de case 
   │ ╰─┾ terras nmod 
   │   ╰─┮ vendidas acl 
   │     │ ╭─╼ por case 
   │     │ ├─╼ o det 
   │     ╰─┶ governo obl:agent 
   ╰─╼ . punct

data

# text = Esses núcleos coloniais, entre eles o do Vale do Ribeira, foram ocupados a partir de terras vendidas pelo governo.
# source = CETENFolha n=155 cad=Informática sec=com sem=94b
# sent_id = CF155-1
# id = 656
1	Esses	esse	DET	<dem>|DET|M|P|@>N	Gender=Masc|Number=Plur|PronType=Dem	2	det	_	_
2	núcleos	núcleo	NOUN	<np-def>|N|M|P|@SUBJ>	Gender=Masc|Number=Plur	16	nsubj:pass	_	_
3	coloniais	colonial	ADJ	ADJ|M|P|@N<	Gender=Masc|Number=Plur	2	amod	_	ChangedBy=Issue165|SpaceAfter=No
4	,	,	PUNCT	PU|@PU	_	7	punct	_	_
5	entre	entre	ADP	PRP|@<ADVL	_	6	case	_	_
6	eles	eles	PRON	PERS|M|3P|NOM/PIV|@P<	Gender=Masc|Number=Plur|Person=3|PronType=Prs	2	nmod	_	_
7	o	o	DET	<dem>|DET|M|S|@N<PRED	Gender=Masc|Number=Sing|PronType=Dem	2	appos	_	_
8-9	do	_	_	_	_	_	_	_	_
8	de	de	ADP	<sam->|PRP|@N<	_	10	case	_	_
9	o	o	DET	<-sam>|<artd>|ART|M|S|@>N	Definite=Def|Gender=Masc|Number=Sing|PronType=Art	10	det	_	_
10	Vale	Vale	PROPN	PROP|M|S|@P<	Gender=Masc|Number=Sing	7	nmod	_	MWE=Vale_do_Ribeira
11-12	do	_	_	_	_	_	_	_	_
11	de	de	ADP	<sam->|PRP|@N<	_	13	case	_	_
12	o	o	DET	<artd>|<-sam>|DET|M|S|@>N	Definite=Def|Gender=Masc|Number=Sing|PronType=Art	13	det	_	_
13	Ribeira	Ribeira	PROPN	PROP|@P<	Number=Sing	10	nmod	_	ChangedBy=Issue165|SpaceAfter=No
14	,	,	PUNCT	PU|@PU	_	7	punct	_	_
15	foram	ser	AUX	<aux>|V|PS/MQP|3P|IND|@FS-STA	Mood=Ind|Number=Plur|Person=3|VerbForm=Fin	16	aux:pass	_	_
16	ocupados	ocupar	VERB	<pass>|<mv>|V|PCP|M|P|@ICL-AUX<	Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass	0	root	_	_
17	a	a	ADP	PRP|@<ADVL	_	18	case	_	MWE=a_partir_de
18	partir	partir	NOUN	N|@P<	_	16	obl	_	_
19	de	de	ADP	PRP|@N<	_	20	case	_	_
20	terras	terra	NOUN	<np-idf>|N|F|P|@P<	Gender=Fem|Number=Plur	18	nmod	_	_
21	vendidas	vender	VERB	<mv>|V|PCP|F|P|@ICL-N<	Gender=Fem|Number=Plur|VerbForm=Part|Voice=Pass	20	acl	_	_
22-23	pelo	_	_	_	_	_	_	_	_
22	por	por	ADP	PRP|@PASS	_	24	case	_	_
23	o	o	DET	<-sam>|<artd>|ART|M|S|@>N	Definite=Def|Gender=Masc|Number=Sing|PronType=Art	24	det	_	_
24	governo	governo	NOUN	<np-def>|N|M|S|@P<	Gender=Masc|Number=Sing	21	obl:agent	_	ChangedBy=Issue165|SpaceAfter=No
25	.	.	PUNCT	PU|@PU	_	16	punct	_	_

draw breaks with

The file below does not use UD relations and POS tags, but it should still be a valid CoNLL-U file:

# text = In contrast to most other salt basins, internal layering makes visible the high degree of internal deformation found in the Santos Basin.
# sent_id = 0
2	In_contrast_to	in_contrast_to	prep	_	prep	10	vprep	_	TokenRange=0:7|compslots=psubj:10,objprep:7
4	most	most	qual	_	qual|superl	5	adjpre	_	TokenRange=3:4
5	other	other	adj	_	adj|superl|novadj|badcoordadj	7	nadj	_	TokenRange=3:5|compslots=asubj:7
6	salt	salt	noun	_	noun|cn|sg|massn|sbst|ent	7	nnoun	_	TokenRange=5:6|compslots=u
7	basins	basin	noun	_	noun|cn|pl|physobj|artf|inst|container|ent	2	objprep	_	TokenRange=3:7|compslots=u
8	internal	internal	adj	_	adj	9	nadj	_	TokenRange=7:8|compslots=asubj:9,u
9	layering	layering	noun	_	noun|cn|sg	10	subj	_	TokenRange=7:9|compslots=u,u
10	makes	make	verb	_	verb|vfin|vpres|sg|vsubj|badnen|vchng	0	top	_	TokenRange=0:22|compslots=subj:9,obj:14,comp:11
11	visible	visible	adj	_	adj	10	comp	_	TokenRange=10:11|compslots=asubj:14,u,u
12	the	the	det	_	det|sg|def|the|ingdet	14	ndet	_	TokenRange=11:12
13	high	high	adj	_	adj|erest|nqual|lmeasadj	14	nadj	_	TokenRange=12:13|compslots=asubj:14,u
14	degree	degree	noun	_	noun|cn|sg|locnoun|meas|lmeas|abst|property|massn|state	10	obj	_	TokenRange=11:22|compslots=u,nobj:15,u
15	of	of	prep	_	prep|pprefn|nonlocp	14	nobj	_	TokenRange=14:17|compslots=psubj:14,objprep:17
16	internal	internal	adj	_	adj	17	nadj	_	TokenRange=15:16|compslots=asubj:17,u
17	deformation	deformation	noun	_	noun|cn|sg|evnt|chng|hapning	15	objprep	_	TokenRange=15:17|compslots=u,u,u
18	found	find	verb	_	verb|ven|vpass|passen|sta|noptlo	14	nnfvp	_	TokenRange=17:22|compslots=u,obj:14,comp:19
19	in	in	prep	_	prep|staticp	18	comp	_	TokenRange=18:22|compslots=psubj:18,objprep:22
20	the	the	det	_	det|sg|pl|def|the|ingdet	22	ndet	_	TokenRange=19:20
22	Santos_Basin	Santos_Basin	noun	_	noun|propn|sg|pl|sgpl|glom|physobj|artf|inst|notfnd|container|ent	19	objprep	_	TokenRange=19:22

reader

Instead of lazy-stream-reader calling read-stream, it should be the other way around. The primitive function should consume one sentence from an open stream. All other functions should be built on top of that.
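
A sketch of what such a layering could look like, working on raw lines only. The names read-sentence-lines and read-all-sentences are hypothetical; the real primitive would presumably return a sentence object rather than a list of strings.

```lisp
;; Hypothetical primitive (not part of cl-conllu): consume one sentence.
(defun read-sentence-lines (stream)
  "Read lines from STREAM up to the next blank line or end of file.
Returns the lines of one sentence, or NIL when the input is exhausted."
  (let ((lines '()))
    (loop for line = (read-line stream nil nil)
          while line
          do (cond ((string/= (string-trim '(#\Space #\Tab #\Return) line) "")
                    (push line lines))
                   ;; A blank line after some content ends the sentence;
                   ;; leading blank lines are skipped.
                   (lines (return))))
    (nreverse lines)))

;; Everything else sits on top of the primitive.
(defun read-all-sentences (stream)
  (loop for sentence = (read-sentence-lines stream)
        while sentence
        collect sentence))
```

A lazy reader would then simply call read-sentence-lines once per requested sentence instead of collecting everything.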

query language

Given that we already have the classes sentence and token, can we make a query interface?

Last link, query example:

#x:[pos="VERB" & word & lemma="ficar"] . #y:[pos="VERB" & word & lemma]

Another alternative is to have a backend (ES, Solr, triple store) and a mapping from an expression language to the backend query language.

http://wesearch.delph-in.net/deepbank/search.jsp

Another option: convert to Prolog and use any of the CL implementations of Prolog:

http://www.cliki.net/prolog

A query would be something like the expression below, which could filter sentences from a list of sentences:

(lemma tk1 "fazer") and (pos tk1 VERB) and (pos tk2 VERB) and (aux tk1 tk2)

write-conllu does not work for this file

To reproduce:

(write-conllu (read-conllu #p"test") #p"test.out")

with the attached file. test.out will be empty.

conjunct.txt

optimize queries

The current implementation of queries does not compile them: each query is re-interpreted for EVERY sentence. Compiling a query into a lambda expression once would improve performance.
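
To make the idea concrete, here is a sketch that translates a query into a closure once and then reuses it. It uses closure composition rather than generating a lambda expression and calling compile, and the query syntax shown (=, and, or over keyword fields) is a simplified stand-in, not the library's query language. Tokens are assumed to be property lists; compile-query is a hypothetical name.

```lisp
;; Hypothetical helper (not part of cl-conllu).
(defun compile-query (query)
  "Translate QUERY into a one-argument predicate over a token plist.
Supported forms: (= FIELD VALUE), (and QUERY...), (or QUERY...)."
  (ecase (first query)
    (= (destructuring-bind (field value) (rest query)
         (lambda (token) (equal (getf token field) value))))
    (and (let ((tests (mapcar #'compile-query (rest query))))
           (lambda (token)
             (every (lambda (test) (funcall test token)) tests))))
    (or (let ((tests (mapcar #'compile-query (rest query))))
          (lambda (token)
            (some (lambda (test) (funcall test token)) tests))))))

;; The query is walked once; the resulting closure is applied to every token:
;; (let ((verb-ficar (compile-query '(and (= :lemma "ficar") (= :upostag "VERB")))))
;;   (remove-if-not verb-ficar tokens))
```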

multiword (contractions)

http://universaldependencies.org/format.html

The read and write functions still do not deal with the multiword lines.
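
As a starting point, the ID column alone is enough to tell a multiword line such as 18-19 from a regular word line. The sketch below is an assumption about how that check could look: parse-token-id is a hypothetical name, and empty nodes with decimal ids (for example 8.1) are deliberately not handled.

```lisp
;; Hypothetical helper (not part of cl-conllu).
(defun parse-token-id (id)
  "Classify the ID column of a CoNLL-U line.
Returns (:word N) for a plain id and (:multiword START END) for a range."
  (let ((dash (position #\- id)))
    (if dash
        (list :multiword
              (parse-integer id :end dash)
              (parse-integer id :start (1+ dash)))
        (list :word (parse-integer id)))))

;; (parse-token-id "18-19") ; => (:MULTIWORD 18 19)
;; (parse-token-id "25")    ; => (:WORD 25)
```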

words and tokens

We are using the term token when we should be using word. Changing this would require renaming the token class and some functions and variables.

http://universaldependencies.org/u/overview/tokenization.html

prolog representation

We also need to represent features / MISC and the complete information in the CoNLL-U file.

sentence validation needs to explain the errors

and not just return t or nil.

Also: ideally it should provide the data in a structured way so that visual editors can use it to highlight the errors in the original file, not just display the error message on the screen. This is intentionally vague -- we need to define this better, of course.
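
One possible shape for such structured results, sketched under simplifying assumptions: tokens are given as (ID HEAD) lists, only ids and heads are checked, and each problem is a property list with :token, :field and :message keys. Both validate-heads and that layout are hypothetical.

```lisp
;; Hypothetical helper (not part of cl-conllu).
(defun validate-heads (tokens)
  "Return a list of problems found in TOKENS, a list of (ID HEAD) entries.
An empty list means that no problem was found."
  (let ((n (length tokens)))
    (loop for (id head) in tokens
          for expected from 1
          unless (= id expected)
            collect (list :token id :field :id
                          :message (format nil "expected id ~a" expected))
          unless (<= 0 head n)
            collect (list :token id :field :head
                          :message (format nil "head ~a is outside 0..~a" head n))
          when (= head id)
            collect (list :token id :field :head
                          :message "token is its own head"))))
```

An editor could map :token and :field back to a line and column, while a command-line tool could just print the :message strings.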

merge lib conll-prolog

https://github.com/own-pt/conll-prolog

conllu-prolog atoms for words aren't unique by sentence

As it stands now, atoms for sentences are ctestset_scf790_2 (context testset, sentence cf790_2), while word atoms are like ctestset_i25 (context testset, word 25). For instance, we have:

?- nlp_sentence(S), nlp_dependency(S,Y,Z,W).
S = ctestset_scf790_2,
Y = ctestset_i25,
Z = ctestset_i9,
W = punct

However, for analysing multiple sentences this isn't so great, as the first words of different sentences in the same context are both called cCONTEXT_i1.

Perhaps we should add a sentence identifier on each word atom as well.
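
A sketch of what such atoms could look like. The exact naming scheme (context, then sentence id, then word id) and the function word-atom are assumptions for illustration; the sentence id is lowercased and its hyphen replaced, mirroring the ctestset_scf790_2 example above.

```lisp
;; Hypothetical helper (not part of cl-conllu).
(defun word-atom (context sentence-id word-id)
  "Build a Prolog atom name that is unique per sentence."
  (format nil "c~a_s~a_i~a"
          context
          (substitute #\_ #\- (string-downcase sentence-id))
          word-id))

;; (word-atom "testset" "CF790-2" 25) ; => "ctestset_scf790_2_i25"
```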

The projective function

In the file projective.lisp we have a Lisp "translation" of the Python function validate_projective_punctuation from the UD tools repository.

https://github.com/UniversalDependencies/tools/blob/master/validate.py

The idea is to use the function to check whether the non-projectivity of a sentence is caused by punctuation. The problem is that, even for sentences that the Python code reported as valid, the Lisp function returns that there is non-projectivity caused by punctuation.

The reports can be found here:

https://github.com/UniversalDependencies/UD_Portuguese-Bosque/tree/workbench/reports/validation

json-ld representation of sentences

read/write json-ld from the internal sentence structure.

tree visualization

https://github.com/udapi/udapi-python/blob/master/udapi/block/write/textmodetrees.py#L20

We need a similar feature for printing a sentence as a tree. These trees are very useful for visualizing the data:

─┮
 ╰─┮ Gosto VERB root
   │ ╭─╼ de ADP mark
   ├─┾ levar VERB xcomp
   │ │ ╭─╼ a ADP case
   │ ├─┶ sério NOUN xcomp
   │ │ ╭─╼ o DET det
   │ │ ├─╼ meu DET det
   │ ╰─┾ papel NOUN obj
   │   │ ╭─╼ de ADP case
   │   ╰─┾ consultor NOUN nmod
   │     ╰─╼ encartado VERB acl
   ╰─╼ . PUNCT punct

alternatives:

convert conllu to tex and compile it

 udapy write.Tikz attributes=form,lemma,upos < my.conllu > my.tex

If needed I can add more features to
https://github.com/udapi/udapi-python/blob/master/udapi/block/write/tikz.py
e.g. printing multiword tokens and some default colors.
Of course, for camera-ready pictures a bit of manual fine-tuning of the layout will be needed.

You can try also

udapy write.TextModeTrees color=1 < my.conllu | less -R

output above.

https://github.com/udapi/udapi-python/blob/master/udapi/block/write/textmodetrees.py
either use in verbatim without colors, or subclass so it generates TeX syntax for colors (and texttt/verb), similarly as in
https://github.com/udapi/udapi-python/blob/master/udapi/block/write/textmodetreeshtml.py
https://github.com/udapi/udapi-python/blob/master/udapi/block/write/html.py
for html+javascript output like
http://ufallab.ms.mff.cuni.cz/~popel/czeng1.6-sample.html

There is a button for SVG export and you can use
inkscape -D -z --file=image.svg --export-pdf=image.pdf --export-latex
to export it to pdf and tex:
\begin{figure}
\centering
\def\svgwidth{\columnwidth}
\input{image.pdf_tex}
\end{figure}

Can I output the second command to LaTeX:

udapy write.TextModeTrees color=1 < my.conllu | less -R

Yes, but without the colors:
echo '\begin{verbatim}' > my.tex
udapy write.TextModeTrees < my.conllu >> my.tex
echo '\end{verbatim}' >> my.tex
and then use
\input{my.tex}

It would not be difficult to write a subclass of write.TextModeTrees
which would use some LaTeX markup like \lemma{I}, \upos{PRON}
instead of the ANSI color codes. So then you could define the colors&style
\def\lemma#1{\textcolor{red}{#1}}
If you are interested, I can implement it.

What I'm really missing is a simple way to display a fragment of a sentence.

Now, I've added a Udapi block which allows to delete all nodes in a document
except for the subtrees matching a given condition, e.g.

udapy -s util.Filter subtree='node.upos == "NOUN"' < in.conllu > filtered.conllu

will print only noun phrases.
So you can use

udapy util.Filter subtree='node.form == "dog"' write.TextModeTrees < in.conllu

to get the subtree(s) headed by word "dog", or

udapy util.Filter subtree='node.ord == 2 and node.root.address() == "3"' write.TextModeTrees < in.conllu

to get the subtree headed by the second word in tree with sent_id = 3.

Yet another alternative to Tikz, Html and TextModeTrees would be to
paste the CoNLL-U into the online Brat renderer
(e.g. click "edit" here http://universaldependencies.org/sandbox.html#pirate-example).
But then you would need to zoom, take a screenshot and include it as bitmap (png) into LaTeX,
which is not optimal.

If needed I can implement write.Sdparse which would print something like

Dogs run
nsubj(run-2, Dogs-1)

which would allow easier manual editing than the CoNLL-U format.

RDF generation is broken

The likely cause is the presence of SUBJ_INDEF in the MISC field: the code expects every value in that field to have the form Key=Value, which is not the case.

debugger invoked on a SB-KERNEL::ARG-COUNT-ERROR in thread
#<THREAD "main thread" RUNNING {1001BB6A83}>:
  error while parsing arguments to DESTRUCTURING-BIND:
    too few elements in
      ("SUBJ_INDEF")
    to satisfy lambda list
      (NAME VALUE):
    exactly 2 expected, but got 1
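
A tolerant way to split the MISC field, sketched with standard Common Lisp only. The helpers split-on and parse-misc are hypothetical, not the functions used by the RDF exporter; entries without = are kept with a NIL value instead of being destructured as a pair.

```lisp
;; Hypothetical helpers (not part of cl-conllu).
(defun split-on (char string)
  "Split STRING at every occurrence of CHAR."
  (loop with start = 0
        for pos = (position char string :start start)
        collect (subseq string start pos)
        while pos
        do (setf start (1+ pos))))

(defun parse-misc (misc)
  "Return an alist for a MISC field; bare entries get the value NIL."
  (if (string= misc "_")
      '()
      (mapcar (lambda (entry)
                (let ((pos (position #\= entry)))
                  (if pos
                      (cons (subseq entry 0 pos) (subseq entry (1+ pos)))
                      (cons entry nil))))
              (split-on #\| misc))))

;; (parse-misc "SUBJ_INDEF|SpaceAfter=No")
;; => (("SUBJ_INDEF") ("SpaceAfter" . "No"))
```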

test cases

In 0de6034 I added some files for test. Also in test.lisp we have some initial tests. We need more tests.

We need a good library for tests.

empty tokens

has support for empty tokens been added yet? I'm mirroring this issue from hs-conllu so we can fix this in both libraries. a CoNLL-U file that can be used for testing is this one.

(I'm asking and not testing myself because I'm getting this error when loading the library:

COMPILE-FILE-ERROR while
compiling #<CL-SOURCE-FILE "iterate" "iterate">
   [Condition of type UIOP/LISP-BUILD:COMPILE-FILE-ERROR]

)

local webinterface?

The idea is to be able to edit the sentences via a local web
interface. Work in progress.

Lisp Backend:

Javascript and CSS:

Table editor ideas:

https://editor.datatables.net/examples/inline-editing/simple
https://codepen.io/ashblue/pen/mCtuA
http://markcell.github.io/jquery-tabledit/#home
http://www.jtable.org

Create insert-token and remove-token

Functions for inserting and removing tokens from sentences are needed. It's necessary not only to insert (or remove) the corresponding rows, but also to renumber the ids (and the references to them in the head field) in the other rows.
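
A sketch of the removal case, with tokens simplified to (ID FORM HEAD) lists. The function below is an illustration under stated assumptions, not a proposed signature: dependents of the removed token are reattached to its head, which is one possible policy and may leave several roots when the root itself is removed.

```lisp
;; Hypothetical sketch of removal with renumbering (not part of cl-conllu).
(defun remove-token (tokens id)
  "Remove the token with ID from TOKENS, a list of (ID FORM HEAD) entries.
Later ids are shifted down by one and head references are updated."
  (let ((removed-head (third (find id tokens :key #'first))))
    (flet ((renumber (n) (if (> n id) (1- n) n)))
      (loop for (tid form head) in tokens
            unless (= tid id)
              collect (list (renumber tid)
                            form
                            ;; Dependents of the removed token move to its head.
                            (renumber (if (= head id) removed-head head)))))))

;; (remove-token '((1 "a" 2) (2 "b" 0) (3 "c" 2)) 1)
;; => ((1 "b" 0) (2 "c" 1))
```

Insertion would be the mirror image: shift ids and heads at or above the insertion point up by one before adding the new row.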

rules for batch changes

https://corpling.uis.georgetown.edu/depedit/

good ideas

make read-conllu more robust

We need better error handling and messages.

unify save API

currently we save in conllu, rdf, and prolog, through different and inconsistent methods. we should unify them under a single and uniform API.

Wilbur is not a ql package

cl-conllu fails to install on a pristine system since wilbur is not a quicklisp package.

* (ql:quickload :wilbur)

debugger invoked on a QUICKLISP-CLIENT:SYSTEM-NOT-FOUND in thread
#<THREAD "main thread" RUNNING {1001950083}>:
  System "wilbur" not found

move some functions from the scripts in Bosque-UD to this library

A number of useful methods written for the scripts to fix the issues in Bosque-UD can be imported to this library. Things like isolated?, add-feature, remove-feature, search-features, etc. They should be adapted for the general case and added here.

Optionally (and maybe this is a larger issue) we also started a very simple query language, implemented by the method t? in the scripts that can be finalized and imported here as well. This one, however, needs a better implementation as it is extremely simple, just enough functionality to handle the requirements of those scripts.

Recommend looking at the latest scripts (by date) to see the more fully featured versions of those methods.
