jean-baptiste-camps / stemmatology


Stemmatological Analysis of Textual Traditions

License: GNU General Public License v3.0

R 100.00%

r stemma stemmatology philology

stemmatology's People

Forkers

floriancafiero ciranahtadrahu jabnik

stemmatology's Issues

Class system

The PCC.Stemma output does not seem to have a class associated with it. I need to create one, and check globally that the class system is consistent.

Submitting to CRAN

Soon, we'll want to submit to CRAN. Here is some doc to read before:

Package vignette

Regenerate package vignette, for instance with

devtools::use_vignette()

and finish writing it. This step is necessary for packaging in good conditions.

Unit testing

We should implement some unit testing in the code, possibly using testthat or RUnit.
Some doc:

testthat

RUnit

https://cran.r-project.org/web/packages/RUnit/vignettes/RUnit.pdf

Both

http://www.johnmyleswhite.com/notebook/2010/08/17/unit-testing-in-r-the-bare-minimum/

With Travis and Coveralls as well

https://www.r-bloggers.com/testing-testing-testing/

Error in PCC.disagreement: Input is not a numeric matrix.

Thank you for this R package, it looks like an interesting project.
I have started playing around with it (without much insight into the stemma creation method yet and with no experience in R at all).

I tried to use data with multiple readings, with the parameter alternateReadings=TRUE.

In the interactive mode the first few steps work fine but then the error

Error in PCC.disagreement(tableVariantes, omissionsAsReadings = omissionsAsReadings) :
Input is not a numeric matrix.

is thrown.

I have tried with a real data set first but then used the example matrix from the documentation as test data.

I had to duplicate the matrix a few times, otherwise I got the error

Error in cluster::pam(ordConflTot[, 1], numberOfClasses) :
Number of clusters 'k' must be in {1,2, .., n-1}; hence n >= 2

So my minimal data example for the error would be:

 A D F T P
1 "1" "2" "2" "2" "1,2"
2 "1" "2" "1,2" "2" "1"
3 "1" "1" "1" "1" "2"
4 "1,3" "1,2" "1" "2" "3"
5 "1" "2" "2" "2" "1,2"
6 "1" "2" "1,2" "2" "1"
7 "1" "1" "1" "1" "2"
8 "1,3" "1,2" "1" "2" "3"
9 "1" "2" "2" "2" "1,2"
10 "1" "2" "1,2" "2" "1"
11 "1" "1" "1" "1" "2"
12 "1,3" "1,2" "1" "2" "3"
13 "1" "2" "2" "2" "1,2"
14 "1" "2" "1,2" "2" "1"
15 "1" "1" "1" "1" "2"
16 "1,3" "1,2" "1" "2" "3"

I have loaded it from a txt file with mydata = read.table("filename.txt") and mydata = as.matrix(mydata) and then used PCC(mydata,alternateReadings=TRUE).
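A possible explanation, offered as a guess rather than a confirmed diagnosis: a table containing entries such as "1,2" is read as text, so as.matrix() returns a character matrix, which would fail any numeric check. The sketch below uses a shortened copy of the table above (the variable txt is only for illustration) to show what R actually stores.

```r
# Shortened copy of the table above (first four rows), read from a string
# instead of a file so that the example is self-contained.
txt <- 'A D F T P
1 "1" "2" "2" "2" "1,2"
2 "1" "2" "1,2" "2" "1"
3 "1" "1" "1" "1" "2"
4 "1,3" "1,2" "1" "2" "3"'

mydata <- as.matrix(read.table(text = txt, header = TRUE))

mode(mydata)        # storage mode of the resulting matrix
is.numeric(mydata)  # the property the error message complains about
```

If this is indeed the cause, the multiple readings would have to be split into numeric form somewhere before PCC.disagreement is called; whether the package is meant to do this itself is not clear from the report.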

PCC.reconstructModel : comparisons with witnesses outside the group

When comparing reconstructed models to extant witnesses, do we need to look outside the cluster? It is included in the paper and maybe in Poole too, but it does not seem algorithmically consistent (except in cases without severe disagreements), because clusters are formed on the basis of severe disagreements with other witnesses, and, as such, no outsider could be their model.

Comments in code:

        # NB & TODO(GLOBAL): this step (that we included in
        # the paper) is PROBABLY not necessary, nor algorithmically consistent.
        # How can the model be outside the group and
        # have no disagreement with the model, knowing that the virtual model is
        # reconstructed based on common readings to the mss of the group, and
        # that these are, at least once, unique to this group? Yet the complexity
        # of this principle is very high, and intuition hard, so we need to
        # check it.

Centrality index

For now, we use the index offered in the paper

deg(u) / (e - deg(u))

I'm asking myself questions on two aspects:

Is there a more classic calculation of centrality that could make sense? (This is more of a long-term question.)
More prosaically, how can we avoid an infinite result when e = deg(u)? For now, the code on this point is a bit of a hack: if the result is infinite, I normalise it to 2. We could instead always compute deg(u) / e (perhaps better than deg(u) / (e - deg(u) + 1)), which would normalise the result to the range 0 … 1.

The current code, that can be really enhanced:

        # Computing the centrality index as described in CC 2013
        centrality = conflictsTotal
        # We first have to test that there actually are conflicts in the database
        if (sum(conflictsTotal) > 0) {
            sumConflicts = sum(conflictsTotal)/2
            for (z in seq_len(nrow(centrality))) {
                # Another test, to avoid division by zero (perhaps the computation of the
                # centrality index should be adapted. Discuss this with Florian. Or, we
                # could accept to have infinite numbers... does it make sense? They
                # sure are superior to any centrality threshold we could choose...)
                centrality[z, ] = centrality[z, ]/(sumConflicts - centrality[z, ])
                # added an option to remove infinity and to replace it with 2
                if (is.infinite(centrality[z, ])) {
                    centrality[z, ] = 2
                }
            }
        } else {
            for (z in seq_len(nrow(centrality))) {
                centrality[z, ] = 0
            }
        }
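To make the alternatives discussed above concrete, here is a small sketch comparing the three formulas on a toy conflict graph. The degrees are invented for illustration (a star in which witness A conflicts with every other witness), and the names deg, e, current, shifted and bounded are not taken from the package.

```r
# Hypothetical conflict degrees: A conflicts with B, C and D, and nobody else
# conflicts with anyone. Each edge is counted twice in sum(deg).
deg <- c(A = 3, B = 1, C = 1, D = 1)
e <- sum(deg) / 2                 # total number of edges, here 3

current <- deg / (e - deg)        # index from the paper: Inf when e == deg(u)
shifted <- deg / (e - deg + 1)    # avoids the division by zero, unbounded
bounded <- deg / e                # always falls between 0 and 1

rbind(current, shifted, bounded)
```

In this example only the first formula needs special handling, since A takes part in every conflict and e - deg(u) is zero. Whether a bounded index still separates overconflicting locations as well as the original one would need to be checked against real data.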

Non ASCII characters

* checking R files for non-ASCII characters ... WARNING
Found the following files with non-ASCII characters:
  PCC.Stemma.R
  PCC.overconflicting.R
Portable packages must use only ASCII characters in their R code,
except perhaps in comments.

It does indeed seem to be in the comments, but this needs checking.
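One way to check where the offending characters are is to try converting each line to ASCII and see which lines fail. The following is a minimal sketch on a temporary file with invented content; for the real check, path would point to PCC.Stemma.R or PCC.overconflicting.R.

```r
# Build a small stand-in R file with one non-ASCII character in a comment.
path <- tempfile(fileext = ".R")
writeLines(c("x <- 1", "# d\u00e9saccord", "y <- 2"), path, useBytes = TRUE)

lines <- readLines(path)
# iconv() returns NA for any string that cannot be represented in ASCII,
# so the NA positions are the line numbers to inspect.
bad <- which(is.na(iconv(lines, from = "UTF-8", to = "ASCII")))
bad
```

This assumes the source files are encoded in UTF-8. The tools package also provides showNonASCIIfile(), which prints the offending lines directly.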

Renaming functions

Two functions have been renamed,

PCC.elimination > PCC.overconflicting
PCC.doElimination > PCC.elimination

The names now make more sense.
We need to:

rename them everywhere in the code;
rename the appropriate classes;
verify everything still works;
check that every name is satisfactory, and stop renaming functions…

As for the classes, they should change too, as follows:

pccElimination > pccOverconflicting

Documentation and packaging

We have to finish documentation and package the code.

Manual

When R CMD check is run, it builds the rnw, tex and pdf of the manual (stemmatology-manual.pdf). Should we use this as the basis for the vignette? Should we include it in the package tarball?

VL.pValues

Decide what to do with this function, and if we keep it, document, debug and test it.

Write tests for each function

Write tests for each function, in tests/testthat

To create the skeleton for a test:

library(devtools)
use_test(name ="maFonction")

then use the syntax from the testthat package. See the documentation of the function test_that().

NA treatment, omissions, modeling

Some thought is needed about the handling of NA, their recovery, and the way texts are encoded, to avoid confusion between NA and omission, and to avoid unwarranted recovery of isolated readings, which would alter the genealogy.

Better tests for PCC.reconstructModel

The tests for this function should be more comprehensive. Some verbose options or in-function bug tests should perhaps be performed as part of the testing instead.

Namespaces

Look at and fix namespace issues.

NB:

Namespaces in Imports field not imported from:
  ‘cluster’ ‘network’ ‘sna’
  All declared Imports should be used.

Import.TEIApparatus

Should we keep this function?
If yes, does it truly work?
If yes, document it!

Write a more extensive vignette

We should write a more extensive vignette, taking elements from the documentation of the various functions, in a form that could be an alpha version of the paper presenting the package.

Better tests for PCC, PCC.Exploratory, PCC.equipollent

For now, some tests use 'weak' testing, with expect_equal_to_reference. We should add more robust tests using expect_equal.

export functions

Need to export functions from the package that we want to be usable outside of it:

http://r-pkgs.had.co.nz/namespace.html

Stemma plotting

Should we move stemma plotting out of PCC.Stemma into a plot.stemma function, to avoid redundancy?
This could mean creating a stemma class and a plot.stemma method.
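As a rough illustration of what such a class could look like, the sketch below defines an S3 class and a plot method. The constructor name new_stemma, the edgelist field and the placeholder drawing code are all assumptions made for the example, not the package's actual design.

```r
# Hypothetical constructor: wraps an edge list (one row per model-copy link)
# in an object of class "stemma".
new_stemma <- function(edgelist) {
  stopifnot(is.matrix(edgelist), ncol(edgelist) == 2)
  structure(list(edgelist = edgelist), class = "stemma")
}

# S3 method found by the generic plot(). The drawing is only a placeholder
# that writes the node labels on one line; a real method would call the
# plotting code currently living in PCC.Stemma.
plot.stemma <- function(x, ...) {
  nodes <- unique(as.vector(x$edgelist))
  graphics::plot(seq_along(nodes), rep(0, length(nodes)), type = "n",
                 axes = FALSE, xlab = "", ylab = "", ...)
  graphics::text(seq_along(nodes), 0, labels = nodes)
  invisible(x)
}

st <- new_stemma(rbind(c("x", "A"), c("x", "B")))
class(st)
```

With this in place, plot(st) dispatches to plot.stemma, so PCC.Stemma would only need to return the object and leave the drawing to the method.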

PCC.equipollent

This function needs some simple checks to verify that the values entered by the user exist in the database, and to avoid unexpected stops caused by user typos.
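A minimal sketch of such a check, assuming the user supplies witness sigla (column names) and variant location labels (row names). The helper name checkLabels and its arguments are hypothetical and do not exist in the package.

```r
# Hypothetical helper: stop with a readable message if a label supplied by
# the user is absent from the variant table.
checkLabels <- function(x, witnesses = NULL, locations = NULL) {
  unknownWits <- setdiff(witnesses, colnames(x))
  unknownLocs <- setdiff(locations, rownames(x))
  if (length(unknownWits) > 0) {
    stop("Unknown witness(es): ", paste(unknownWits, collapse = ", "))
  }
  if (length(unknownLocs) > 0) {
    stop("Unknown variant location(s): ", paste(unknownLocs, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy variant table: two variant locations, two witnesses
x <- matrix(c(1, 2, 1, 1), nrow = 2,
            dimnames = list(c("1", "2"), c("A", "D")))

checkLabels(x, witnesses = "A", locations = "2")  # passes silently
res <- tryCatch(checkLabels(x, witnesses = "Z"), error = conditionMessage)
res
```

In an interactive function, the same test could be used to ask the question again instead of stopping.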

Examples

Write examples in the documentation, whenever possible (no interactive functions).

Options to implement

Some options are declared but not used, or not used at every level where they are needed. We need to finish implementing:

alternateReadings at every level in the PCC.Exploratory group;
limit > 0 in the PCC.Stemma group;
interactive = TRUE everywhere.

Conflicts graph

We talked about improving the visualization of our graphs of conflicts. The current network is plotted with gplot (lines 284-285 in PCC.conflicts, as of today).

gplot(myNetwork, displaylabels, label = network.vertex.names(myNetwork), gmode = "graph", boxed.labels = TRUE)

Other packages could give a more interesting output:

The networkD3 package allows for interactive handling of networks, which could be useful to users handling a large database with many conflicting variant locations.
The ggplot2 package has become a reference and offers more possibilities than gplot. It could also prove more durable than networkD3, and produce graphs that are easier to publish.

Licence

Our current license is deprecated and not so standard for R. Here is what R CMD check has to say:

Non-standard license specification:
  CC BY-NC-SA 2.0
Standardizable: FALSE
Deprecated license: CC BY-NC-SA 2.0

Should we switch to GPL or something like that?

Create a TEI export

Create a TEI export for PCC.Stemma object, with the edgelist, the variant table, etc.

PCC.distribute

I am not sure the use of this function is implemented. It is supposed to serve for the option alternateReadings=TRUE but I do not see it used in the code.

Modifications to layout_as_stemma

This function produces a stemma where the heights of nodes are determined by the number of disagreements and omissions/additions relative to the model.

For now, it is expressed as an absolute value.

Should we modify it to express it as a ratio to the comparable lines (omitting lines with NA)? Because, for now, witnesses with lots of NA end up close to their models as a side effect.
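The effect described above can be seen on a toy example. The two vectors below are invented readings for a model and one witness with many NA; the variable names are only for illustration and the sketch ignores the separate treatment of omissions and additions.

```r
# Hypothetical readings at six variant locations
model   <- c(1, 1, 2, 1, 1, 2)
witness <- c(1, 2, NA, NA, NA, NA)

# Only locations where both model and witness have a reading are comparable
comparable <- !is.na(model) & !is.na(witness)

absolute <- sum(model[comparable] != witness[comparable])
ratio <- absolute / sum(comparable)

c(absolute = absolute, ratio = ratio)
```

Here the witness disagrees with its model at one location only, which looks close in absolute terms, but that is half of the locations where a comparison is possible at all.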

PCC.reconstructModel: 'limit'

Would it make sense to add a "limit" argument to this function ?

interactive=TRUE

In my opinion, we have two alternatives if we want to implement the interactive=TRUE option:

ask the user to set all necessary input at the command level;
Offer some default values or choices, in the hope they will have an average efficiency (risky).

Network plotting: switching from `sna`+`network` to `igraph` ?

I wonder whether to switch from sna (+network) to igraph for network plotting

Some reading:

Problems and solutions

For the moment, the problem with sna::gplot is that it does not scale graphics properly in RStudio, and graphs have readability issues.

A side benefit seems to be that igraph has built-in tree objects and methods (such as layout=layout_as_tree, etc.).

NB: the default placement algorithm (modifiable) for sna is 'fruchtermanreingold', and for igraph, "layout_nicely: a smart function that chooses a layouter based on the graph."

Also, Graphviz might be an interesting solution. It seems to be widely used in the stemmatology community and has R implementations.

Performance

In terms of performance, I have run profiling on the following lines:

## network + sna
# Important remark: not specifying matrix.type = "edgelist" occasionally gave
# weird errors, mainly 'Erreur dans abs(x) : argument non numérique pour une
# fonction mathématique' (in English: 'non-numeric argument to mathematical
# function'), so I am making this option explicit everywhere.
myNetwork = as.network(edgelist, directed = FALSE, matrix.type = "edgelist")
# default layout: mode = "fruchtermanreingold"
gplot(myNetwork, displaylabels, label = network.vertex.names(myNetwork),
      gmode = "graph", boxed.labels = TRUE)

## igraph
myNetwork = igraph::graph_from_edgelist(edgelist, directed = FALSE)
# The default layout is layout_nicely, a smart function that chooses a
# layouter based on the graph; layout = layout_as_tree is an alternative.
plot(myNetwork)

Result:

sna::gplot takes 25.9 MB of memory and 210 ms
igraph::plot takes 0.8 MB of memory and 40 ms

Graphics

In terms of graphics (in RStudio, without reconfiguring the graphics interface):

SNA: [screenshot not preserved]

igraph: [screenshot not preserved]

What do you think, @floriancafiero ? I am tempted to switch to igraph, which means some important changes in code and visualisation (as compared to what we've been doing until now).

Check for colnames / rownames

A lot of functions depend on them. Each of these functions should check that they are present.
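A shared validation helper could look like the sketch below. The name checkDimnames and the error messages are hypothetical; the checks themselves (a matrix, with row and column names, without duplicated witness labels) are assumptions about what the functions need.

```r
# Hypothetical helper to call at the top of each function that relies on
# row names (variant locations) and column names (witnesses).
checkDimnames <- function(x) {
  if (!is.matrix(x)) {
    stop("Input is not a matrix.")
  }
  if (is.null(rownames(x)) || is.null(colnames(x))) {
    stop("Input must have both row names and column names.")
  }
  if (anyDuplicated(colnames(x)) > 0) {
    stop("Column names (witnesses) must be unique.")
  }
  invisible(TRUE)
}

named <- matrix(1:4, nrow = 2, dimnames = list(c("1", "2"), c("A", "D")))
unnamed <- matrix(1:4, nrow = 2)

checkDimnames(named)  # passes silently
msg <- tryCatch(checkDimnames(unnamed), error = conditionMessage)
msg
```

Keeping the check in one place would give users the same message whichever function they call first.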

PCC.conflicts bug with layout

Trying to run the example as in:

data(fournival)
myConflicts = PCC.conflicts(fournival)

I have encountered the following bug:

 Error in gplot(myNetwork, displaylabels, label = network.vertex.names(myNetwork),  :   
 Error in gplot: no layout function for mode fruchtermanreingold