dracor-org / fredracor
French Drama Corpus
This should be achieved by extending the tc2dracor.xq script and/or adding corrections to the dracor branch of the theatre-classique repo.
The current state:
$ ./validate
Validating files in ./tei...
Total number of documents: 1500
Number of invalid documents: 78
Number of unigue errors: 47
Here is an example taken from: Edme Boursault: "La Comédie sans Titre" (fre000173, show on staging):
Should be:
For each play we currently provide two URLs for the digital source: one is an HTML page and the other is the actual TEI from Théâtre Classique. See for instance:
fredracor/tei/abeille-argelie.xml
Lines 28 to 31 in a9fa55c
While issue #22, which was caused by this, has been fixed on the API level, we should still stick to only one source URL. I would suggest dropping the reference to the HTML page and keeping just the TEI URL, since these are the documents we actually use for the import to FreDraCor. What do you think @lehkost?
See original post by @cmil in #22 (comment)
FreDraCor networks seem to be plagued with glued-together characters like "Ugande et Alcif" or "Corisande et Florestan". In some 17th-century plays such characters easily make up 30-50% of the network nodes, which basically renders the whole network false. See the example network for Amadis by Philippe Quinault attached.
@lucagiovannini7 is going to check some of the plays relevant to his PhD research BUT we need a more systematic solution for the whole corpus. Especially since this looks like an automatable thing (split by ' et ').
Of course, there are also many harder cases such as 'first african', 'second african', and 'both africans' making 3 nodes instead of 2, which also affects network metrics. But even resolving all "A et B" cases would be a giant leap for FreDraCor.
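The "split by ' et '" idea could be sketched as follows. This is only an illustration in Python, not the project's XQuery; the function name `split_glued_speaker` is hypothetical, and it assumes glued labels join names with a whitespace-delimited "et". The harder cases ("both africans") are not handled.

```python
import re

# Assumption: glued labels join individual names with a whitespace-delimited "et".
_ET = re.compile(r"\s+et\s+", flags=re.IGNORECASE)


def split_glued_speaker(label: str) -> list[str]:
    """Split a label such as 'Ugande et Alcif' into its individual names."""
    parts = [part.strip() for part in _ET.split(label.strip())]
    # Drop empty fragments that a stray leading or trailing "et" could produce.
    return [part for part in parts if part]


names = split_glued_speaker("Ugande et Alcif")
```

Because the pattern requires whitespace on both sides, names that merely contain the letters "et" are left intact.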
As of now, it is not possible to download the CSV metadata table at https://dracor.org/api/corpora/fre/metadata/csv. The JSON table is working, though. Error message:
HTTP ERROR 500 javax.servlet.ServletException: javax.servlet.ServletException: An error occurred while processing request to /exist/restxq/v0/corpora/fre/metadata/csv: err:XPTY0004 checking function parameter 1 in call string-join(untyped-value-check[xs:anyAtomicType, for <697> $c in $api:metadata-columns return <698> if ( count($m($c)) = 0 ) then "" else dutil:csv-escape(untyped-value-check[xs:string, $m($c)]) ], "",""): XPTY0004: The actual cardinality for parameter 1 does not match the cardinality declared in the function's signature: dutil:csv-escape($string as xs:string) xs:string. Expected cardinality: exactly one, got 2. [at line 698, column 74, source: /db/apps/dracor-v0/modules/api.xqm] In function: dutil:csv-escape(xs:string) [698:51:/db/apps/dracor-v0/modules/util.xqm] api:get-corpus-meta-data-csv(item()) [732:3:/db/apps/dracor-v0/modules/api.xqm] api:corpus-meta-data-csv-endpoint(item()) [-1: -1:/db/apps/dracor-v0/modules/api.xqm]
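The error says that a function expecting exactly one string received two, which suggests that some metadata column holds two values for at least one play. A minimal Python sketch of one way to guard against this is to collapse multi-valued fields into a single cell before escaping. This is not the API's XQuery code; the helper names and the "|" separator are assumptions for illustration.

```python
import csv
import io


def to_cell(value) -> str:
    """Collapse a missing, single, or multi-valued field into one string."""
    if value is None:
        return ""
    if isinstance(value, (list, tuple)):
        # Assumption: "|" is an acceptable separator for multiple values.
        return "|".join(str(item) for item in value)
    return str(value)


def metadata_to_csv(columns, rows) -> str:
    """Serialise metadata rows (dicts) to CSV, one cell per column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        writer.writerow([to_cell(row.get(column)) for column in columns])
    return buffer.getvalue()
```

Whether joining the values is the right fix, or whether the duplicate value in the source file should be removed instead, depends on which column is affected.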
Print dates are lacking the when attribute even if there is a year given as the text content of the respective date element. See for instance:
fredracor/tei/racine-bajazet.xml
Line 38 in af46052
This should be fixed in tc2dracor.xq.
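The intended repair can be sketched in Python as follows. The actual fix belongs in tc2dracor.xq; this sketch assumes that print dates are encoded as TEI date elements with type="print" and that the text content is a bare four-digit year. The function name `add_missing_when` and the sample fragment are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
YEAR = re.compile(r"^\s*(\d{4})\s*$")


def add_missing_when(root: ET.Element) -> int:
    """Set when="YYYY" on print dates whose text is a bare four-digit year."""
    changed = 0
    for date in root.iter(f"{{{TEI_NS}}}date"):
        # Assumption: print dates carry type="print"; skip dates that already have @when.
        if date.get("type") != "print" or date.get("when"):
            continue
        match = YEAR.match(date.text or "")
        if match:
            date.set("when", match.group(1))
            changed += 1
    return changed


# Hypothetical sample fragment, not copied from the corpus.
sample = ET.fromstring(
    f'<bibl xmlns="{TEI_NS}"><date type="print">1672</date></bibl>'
)
fixed = add_missing_when(sample)
```

Dates whose text is not a plain year are left untouched so that they can be reviewed by hand.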
In order to be able to add missing dates (or correct erroneous ones), we should introduce a corresponding option in "ids.xml".
The same goes for a more readable slug that divides the words in a meaningful way.
Both demonstrated with this example of "Sermon Joyeux de Bien Boire" (DraCor ID: fre000037):
<play id="fre000037" file="ANONYME_SERMONJOYEUX.xml"/>
will be:
<play id="fre000037" file="ANONYME_SERMONJOYEUX.xml" slug="anonyme-sermon-joyeux-de-bien-boire" print="1545" premiere="" written=""/>
If available, this additional data would be written into the DraCor files when transforming the original files. These dates would also override dates from the original files, since there are sometimes discrepancies between the first print edition (whose date we collect, i.e., the date of the first printing) and the edition used by TC.
Also, ranges à la notBefore and notAfter are possible; in these cases the two year numbers are separated by an en dash (–).
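How such attribute values might be interpreted can be sketched in Python. The function name `date_attributes` is hypothetical, and the sketch assumes that values are either empty, a single four-digit year, or two four-digit years joined by an en dash.

```python
import re

_SINGLE = re.compile(r"^\d{4}$")
_RANGE = re.compile(r"^(\d{4})\u2013(\d{4})$")  # two years joined by an en dash


def date_attributes(value: str) -> dict[str, str]:
    """Map an ids.xml date value to TEI date attributes."""
    value = value.strip()
    if not value:
        return {}  # nothing recorded, so nothing to override
    if _SINGLE.match(value):
        return {"when": value}
    match = _RANGE.match(value)
    if match:
        return {"notBefore": match.group(1), "notAfter": match.group(2)}
    raise ValueError(f"unsupported date value: {value!r}")
```

For the example above, `print="1545"` would yield a when attribute, while the empty premiere and written attributes would yield nothing.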
The ID assignment workflow requires IDs for new documents to be added to ids.xml. This should be documented in the README.
Taken from:
Line 63 in fd508a0
The transformation at
Lines 777 to 782 in 55e2b63
should add classCode elements where genre information found in the originals matches the recognised text classes defined in dracor-org/dracor-api#122.
See also original discussion in dracor-org/dracor-api#120.
The codes with matching genre attributions (incomplete suggestions) would be:
Q40831 for Comedy
Q80930 for Tragedy
Q192881 for Tragicomedy
Q131084 for Libretto
I would also suggest adding a scheme attribute to the keywords element and omitting the term/@type in order to make clear where this classification comes from and avoid confusing it with the keywords we recently added to GerDraCor and RusDraCor.
The textClass markup could then look like this:
<textClass>
<keywords scheme="http://theatre-classique.fr">
<term>Tragédie</term>
<term>vers</term>
</keywords>
<classCode scheme="http://www.wikidata.org/entity/">Q80930</classCode>
</textClass>
or for a libretto (e.g. moliere-bourgeoisgentilhomme.xml)
<textClass>
<keywords scheme="http://theatre-classique.fr">
<term>Comédie-ballet</term>
<term>mixte</term>
</keywords>
<classCode scheme="http://www.wikidata.org/entity/">Q131084</classCode>
<classCode scheme="http://www.wikidata.org/entity/">Q40831</classCode>
</textClass>
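The mapping behind the two examples above could be sketched in Python like this. The Wikidata codes are the ones suggested in this issue; the `TERM_TO_CLASSES` table (which Théâtre Classique genre term maps to which text class) is a hypothetical, incomplete assumption, as is the function name `build_text_class`. The real transformation lives in tc2dracor.xq.

```python
import xml.etree.ElementTree as ET

WIKIDATA_SCHEME = "http://www.wikidata.org/entity/"
GENRE_CODES = {
    "Comedy": "Q40831",
    "Tragedy": "Q80930",
    "Tragicomedy": "Q192881",
    "Libretto": "Q131084",
}

# Hypothetical, incomplete mapping from source genre terms to text classes.
TERM_TO_CLASSES = {
    "Tragédie": ["Tragedy"],
    "Comédie": ["Comedy"],
    "Tragi-comédie": ["Tragicomedy"],
    "Comédie-ballet": ["Libretto", "Comedy"],
}


def build_text_class(terms: list[str]) -> ET.Element:
    """Build a textClass element with keywords and matching classCode elements."""
    text_class = ET.Element("textClass")
    keywords = ET.SubElement(
        text_class, "keywords", scheme="http://theatre-classique.fr"
    )
    for term in terms:
        ET.SubElement(keywords, "term").text = term
    # Terms without a known class (e.g. "vers") produce no classCode.
    for term in terms:
        for genre in TERM_TO_CLASSES.get(term, []):
            code = ET.SubElement(text_class, "classCode", scheme=WIKIDATA_SCHEME)
            code.text = GENRE_CODES[genre]
    return text_class


libretto = build_text_class(["Comédie-ballet", "mixte"])
```

Calling `ET.tostring(libretto, encoding="unicode")` serialises the element for inspection.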
The wikipediaLinkCount column always equals "0" in the FreDraCor metadata file, even in cases where it shouldn't (i.e., if there is a Wikidata ID for a play AND sitelinks >= 1). I also tried GerDraCor and RusDraCor; it works for these corpora.