jean-baptiste-camps / stemmatology

Stemmatological Analysis of Textual Traditions

License: GNU General Public License v3.0

R 100.00%
r stemma stemmatology philology

stemmatology's Issues

Class system

The PCC.Stemma output does not seem to have a class associated with it. I need to create one, and check that the class system is consistent across the whole package.
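
A minimal sketch of what attaching a class could look like, using base R S3. The constructor name `new_pccStemma` and the field names `edgelist` and `database` are assumptions made for illustration, not the actual structure of the PCC.Stemma output.

```r
# Hypothetical constructor: wraps the result in a list with an S3 class.
# The field names `edgelist` and `database` are assumptions for illustration.
new_pccStemma <- function(edgelist, database) {
  stopifnot(is.matrix(edgelist), ncol(edgelist) == 2, is.matrix(database))
  structure(list(edgelist = edgelist, database = database),
            class = "pccStemma")
}

# A print method makes the class visible to users and easy to check.
print.pccStemma <- function(x, ...) {
  cat("<pccStemma> with", nrow(x$edgelist), "edges\n")
  invisible(x)
}

el <- matrix(c("A", "B", "A", "C"), ncol = 2, byrow = TRUE)
db <- matrix(c(1, 1, 2, 1, 2, 2), nrow = 2,
             dimnames = list(c("1", "2"), c("A", "B", "C")))
stemma <- new_pccStemma(el, db)
inherits(stemma, "pccStemma")
```

With a class in place, consistency across the package could be checked with `inherits()` at the start of each function that consumes such an object.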

Package vignette

Regenerate package vignette, for instance with

devtools::use_vignette()

and finish writing it. This step is necessary for packaging in good conditions.

Error in PCC.disagreement: Input is not a numeric matrix.

Thank you for this R package, it looks like an interesting project.
I have started playing around with it (without much insight into the stemma creation method yet and with no experience in R at all)

I have tried to use data with multiple readings, with the parameter alternateReadings=TRUE.

In the interactive mode the first few steps work fine but then the error

Error in PCC.disagreement(tableVariantes, omissionsAsReadings = omissionsAsReadings) :
Input is not a numeric matrix.

is thrown.

I have tried with a real data set first but then used the example matrix from the documentation as test data.

I had to duplicate the matrix a few times, otherwise I got the error

Error in cluster::pam(ordConflTot[, 1], numberOfClasses) :
Number of clusters 'k' must be in {1,2, .., n-1}; hence n >= 2

So my minimal data example for the error would be:

 A D F T P
1 "1" "2" "2" "2" "1,2"
2 "1" "2" "1,2" "2" "1"
3 "1" "1" "1" "1" "2"
4 "1,3" "1,2" "1" "2" "3"
5 "1" "2" "2" "2" "1,2"
6 "1" "2" "1,2" "2" "1"
7 "1" "1" "1" "1" "2"
8 "1,3" "1,2" "1" "2" "3"
9 "1" "2" "2" "2" "1,2"
10 "1" "2" "1,2" "2" "1"
11 "1" "1" "1" "1" "2"
12 "1,3" "1,2" "1" "2" "3"
13 "1" "2" "2" "2" "1,2"
14 "1" "2" "1,2" "2" "1"
15 "1" "1" "1" "1" "2"
16 "1,3" "1,2" "1" "2" "3"

I have loaded it from a txt file with mydata = read.table("filename.txt") and mydata = as.matrix(mydata) and then used PCC(mydata,alternateReadings=TRUE).
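
The error message is consistent with the matrix being of character mode: cells such as "1,2" cannot be numeric, so as.matrix() on such data yields a character matrix. The sketch below only illustrates that diagnosis with base R on the first four rows of the example; it makes no claim about how PCC.disagreement is supposed to be reached when alternateReadings=TRUE.

```r
# Rebuild the first four rows of the example above as a character matrix.
mydata <- matrix(
  c("1",   "2",   "2",   "2", "1,2",
    "1",   "2",   "1,2", "2", "1",
    "1",   "1",   "1",   "1", "2",
    "1,3", "1,2", "1",   "2", "3"),
  nrow = 4, byrow = TRUE,
  dimnames = list(1:4, c("A", "D", "F", "T", "P"))
)

is.numeric(mydata)   # FALSE: a numeric-only check will reject this matrix
mode(mydata)         # "character"

# Locate the cells that hold more than one (comma-separated) reading.
multi <- grepl(",", mydata, fixed = TRUE)
dim(multi) <- dim(mydata)   # grepl() drops the matrix dimensions
which(multi, arr.ind = TRUE)
```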

PCC.reconstructModel : comparisons with witnesses outside the group

When comparing reconstructed models to extant witnesses, do we need to look outside the cluster? This step is included in the paper, and maybe in Poole too, but it does not seem algorithmically consistent (except in cases without severe disagreements): clusters are built on severe disagreements with the other witnesses, so no outsider could be their model.

Comments in code:

        # NB & TODO(GLOBAL): this step (that we included in
        # the paper) is PROBABLY not necessary, nor algorithmically consistent.
        # How can the model be outside the group and
        # have no disagreement with the model, knowing that the virtual model is
        # reconstructed based on common readings to the mss of the group, and
        # that these are, at least once, unique to this group? Yet the complexity
        # of this principle is very high, and intuition hard, so we need to
        # check it.

Centrality index

For now, we use the index proposed in the paper:

deg(u) / (e - deg(u))

I have questions on two aspects:

  1. Is there a more classic measure of centrality that could make sense (this is more of a long-term question)?
  2. More prosaically, how do we avoid an infinite result when e = deg(u)? For now, the code on this point is a bit of a hack: if the result is infinite, I normalise it to 2… We could instead always compute deg(u) / e (perhaps better than deg(u) / (e - deg(u) + 1)), which would normalise the result to the range 0 … 1.

The current code, which could be much improved:

        centrality = conflictsTotal  ## Computing the centrality index as described in CC 2013
        ## We first have to test that there actually are conflicts in the database
        if (sum(conflictsTotal) > 0) {
            sumConflicts = sum(conflictsTotal)/2
            for (z in 1:nrow(centrality)) {
                # Another test, to avoid division by zero (perhaps the computation of the
                # centrality index should be adapted. Discuss this with Florian. Or, we
                # could accept infinite numbers... does it make sense? They sure
                # are superior to any centrality threshold we could choose...)
                centrality[z, ] = centrality[z, ]/(sumConflicts - centrality[z, ])
                # added an option to remove infinity and to replace it with 2
                if (is.infinite(centrality[z, ])) {
                    centrality[z, ] = 2
                }
            }
        } else {
            for (z in 1:nrow(centrality)) {
                centrality[z, ] = 0
            }
        }
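
To compare the two variants discussed in this issue, here is a small vectorised sketch. It assumes that `conflictsTotal` is a one-column matrix of per-location conflict counts, as the loop suggests; the values and the names `paperIndex` and `boundedIndex` are made up for illustration.

```r
# Made-up degrees: location "c" is involved in every conflict edge.
conflictsTotal <- matrix(c(1, 1, 2, 0), ncol = 1,
                         dimnames = list(c("a", "b", "c", "d"), NULL))

e <- sum(conflictsTotal) / 2          # number of edges in the conflict graph

# Index from the paper, vectorised: deg(u) / (e - deg(u)).
paperIndex <- conflictsTotal / (e - conflictsTotal)
paperIndex[is.infinite(paperIndex)] <- 2   # the current workaround

# Alternative raised in this issue: deg(u) / e, bounded by 0 and 1.
boundedIndex <- if (e > 0) conflictsTotal / e else conflictsTotal * 0

cbind(paper = paperIndex[, 1], bounded = boundedIndex[, 1])
```

The bounded variant never divides by zero when there is at least one conflict, but it changes the scale of the index, so any centrality threshold chosen for the paper's index would presumably need to be recalibrated.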

Non ASCII characters

* checking R files for non-ASCII characters ... WARNING
Found the following files with non-ASCII characters:
  PCC.Stemma.R
  PCC.overconflicting.R
Portable packages must use only ASCII characters in their R code,
except perhaps in comments.

These do indeed seem to be in the comments, but this needs to be checked.
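
A rough way to locate the offending lines, sketched with base R only. The helper name `nonAsciiLines` is hypothetical; the `tools` package shipped with R also provides `tools::showNonASCIIfile()`, which prints such lines directly.

```r
# Hypothetical helper: returns the numbers of the lines in a file that
# contain at least one character outside the ASCII range.
nonAsciiLines <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  hasNonAscii <- vapply(lines, function(l) any(utf8ToInt(l) > 127L),
                        logical(1), USE.NAMES = FALSE)
  which(hasNonAscii)
}

# Demo file: the accented character is written as an escape so that this
# snippet itself stays ASCII; useBytes = TRUE writes the UTF-8 bytes as-is.
path <- tempfile(fileext = ".R")
writeLines(c("x <- 1", "# v\u00e9rifier", "y <- 2"), path, useBytes = TRUE)

nonAsciiLines(path)
```

Running this over PCC.Stemma.R and PCC.overconflicting.R would show whether every hit falls on a comment line.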

Renaming functions

Two functions have been renamed,

PCC.elimination > PCC.overconflicting
PCC.doElimination > PCC.elimination

The names now make more sense.
We need to:

  • rename them everywhere in the code;
  • rename the appropriate classes;
  • verify that everything still works;
  • check that all names are satisfactory, and stop renaming functions…

As for the classes, they should change accordingly:

pccElimination > pccOverconflicting

Manual

When R CMD check is run, it builds the rnw, tex and pdf of the manual (stemmatology-manual.pdf). Should we use this as a base for the vignette? Should we include it in the package tarball?

VL.pValues

Decide what to do with this function, and if we keep it, document, debug and test it.

Write tests for each function

  • Write tests for each function, in tests/testthat

To create the skeleton for a test:

library(devtools)
use_test(name ="maFonction")

then use the syntax from the testthat package. See the documentation of the function test_that().

NA treatment, omissions, modeling

Some thought needs to go into the handling of NA and their recovery, and into the way texts are encoded, in order to avoid confusion between NA and omission, and to avoid the abusive recovery of isolated readings, which would alter the genealogy.

Better tests for PCC.reconstructModel

The tests for this function should be more comprehensive. Some verbose options or in-function bug tests should perhaps be performed as part of the testing instead.

Namespaces

Look at and fix namespace issues.

NB:

Namespaces in Imports field not imported from:
  ‘cluster’ ‘network’ ‘sna’
  All declared Imports should be used.

Import.TEIApparatus

  • Should we keep this function?
  • If yes, does it truly work?
  • If yes, document it!

Write a more extensive vignette

We should write a more extensive vignette, taking elements from the documentation of the various functions, in a form that could be an alpha version of the paper presenting the package.

Stemma plotting

Should we move stemma plotting out of PCC.Stemma and into a plot.stemma function, to avoid redundancy?
This could mean creating a stemma class and a plot.stemma method.

PCC.equipollent

This function needs some simple tests to check that the values entered by the user exist in the database, and to avoid unexpected stops caused by user typos.
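
A sketch of the kind of check meant here, assuming that the user names variant locations and that these are the row names of the database. The helper name `checkLocations` is hypothetical and is not part of the package.

```r
# Hypothetical helper: stops with a clear message when some of the variant
# locations named by the user are not row names of the database.
checkLocations <- function(db, locations) {
  unknown <- setdiff(as.character(locations), rownames(db))
  if (length(unknown) > 0) {
    stop("Unknown variant location(s): ", paste(unknown, collapse = ", "),
         call. = FALSE)
  }
  invisible(TRUE)
}

db <- matrix(c(1, 2, 1, 1, 2, 2), nrow = 3,
             dimnames = list(c("VL1", "VL2", "VL3"), c("A", "B")))

checkLocations(db, c("VL1", "VL3"))                      # passes silently
msg <- tryCatch(checkLocations(db, c("VL1", "VL9")),
                error = function(e) conditionMessage(e))
msg
```

Failing early with the list of unknown names lets the user correct a typo before any computation starts.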

Examples

  • Write examples in the documentation, whenever possible (no interactive functions).

Options to implement

Some options are declared but not used, or not used at every level where they are needed. We need to finish implementing:

  • alternateReadings at every level in the PCC.Exploratory group;
  • limit > 0 in the PCC.Stemma group;
  • interactive = TRUE everywhere.

Conflicts graph

We talked about improving the visualization of our graphs of conflicts. The current network is plotted with gplot (lines 284-285 in PCC.conflicts, as of today).

gplot(myNetwork, displaylabels, label = network.vertex.names(myNetwork), gmode = "graph", boxed.labels = TRUE)

Other packages could give a more interesting output:

  • The networkD3 package allows for interactive handling of networks, which could be useful to users handling a large database with many conflicting variant locations.
  • The ggplot2 package has become a reference and allows for more possibilities than gplot. It could also be longer-lived than networkD3, and produce graphs that are easier to publish.

Licence

Our current license is deprecated and not so standard for R. Here is what R CMD check has to say:

Non-standard license specification:
  CC BY-NC-SA 2.0
Standardizable: FALSE
Deprecated license: CC BY-NC-SA 2.0

Should we switch to GPL or something like that?

Create a TEI export

Create a TEI export for the PCC.Stemma object, with the edgelist, the variant table, etc.
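
As a starting point for discussion, the edge list part could be serialised with the TEI `<graph>`, `<node>` and `<arc>` elements. The sketch below assumes a two-column character matrix (from, to) whose labels are valid XML names; the helper name `edgelistToTEI` is hypothetical, and the variant table is not covered.

```r
# Hypothetical helper: serialises a two-column edge list (from, to) as a TEI
# <graph> with <node> and <arc> elements. Labels are assumed to be valid
# XML names; no escaping is performed.
edgelistToTEI <- function(edgelist) {
  stopifnot(is.matrix(edgelist), ncol(edgelist) == 2)
  labels <- unique(as.vector(t(edgelist)))
  nodes <- sprintf('  <node xml:id="%s"/>', labels)
  arcs  <- sprintf('  <arc from="#%s" to="#%s"/>',
                   edgelist[, 1], edgelist[, 2])
  c('<graph type="directed">', nodes, arcs, '</graph>')
}

el <- matrix(c("x", "A", "x", "B"), ncol = 2, byrow = TRUE)
tei <- edgelistToTEI(el)
writeLines(tei)
```

A real export would more likely build the document with an XML library and validate it against a TEI schema; string pasting is used here only to keep the sketch free of dependencies.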

PCC.distribute

I am not sure the use of this function is implemented. It is supposed to serve for the option alternateReadings=TRUE but I do not see it used in the code.

Modifications to layout_as_stemma

This function produces a stemma in which the heights of nodes are determined by the number of disagreements and omissions/additions relative to the model.

For now, this is an absolute value.

Should we modify it to express it as a ratio of the comparable lines (omitting lines with NA)? For now, witnesses with many NA end up, as a side effect, close to their models.
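
A sketch of the ratio under discussion, assuming that a witness and its model are given as vectors of readings of equal length, with NA for missing text. The helper name `disagreementRatio` and the data are made up for illustration.

```r
# Hypothetical helper: share of comparable variant locations (both readings
# known) at which a witness disagrees with its model.
disagreementRatio <- function(witness, model) {
  comparable <- !is.na(witness) & !is.na(model)
  if (!any(comparable)) return(NA_real_)
  sum(witness[comparable] != model[comparable]) / sum(comparable)
}

model   <- c(1, 1, 2, 1, 2, 1)
full    <- c(1, 2, 2, 1, 1, 1)      # 2 disagreements over 6 locations
lacunar <- c(NA, 2, NA, NA, 1, NA)  # 2 disagreements over 2 locations

disagreementRatio(full, model)
disagreementRatio(lacunar, model)
```

Both witnesses have the same absolute number of disagreements, so the current layout would place them at the same height, whereas the ratio separates them.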

interactive=TRUE

In my opinion, we have two alternatives if we want to implement the interactive=TRUE option:

  1. ask the user to set all necessary input at the command level;
  2. Offer some default values or choices, in the hope they will have an average efficiency (risky).

Network plotting: switching from `sna`+`network` to `igraph`?

I wonder whether to switch from sna (+network) to igraph for network plotting.

Some reading:

Problems and solutions

For the moment, the problem with sna::gplot is that it does not scale graphics properly in RStudio, and graphs have readability issues.

A side benefit seems to be that igraph has built-in tree objects and methods (such as layout=layout_as_tree, etc.).

NB: the default placement algorithm (modifiable) for sna is 'fruchtermanreingold', and for igraph it is layout_nicely, "a smart function that chooses a layouter based on the graph".

Also, Graphviz could be an interesting solution. It seems to be widely used in the stemmatology community and has R implementations.

Performance

In terms of performance, I have run profiling on the following lines:

## network + sna
myNetwork = as.network(edgelist, directed = FALSE, matrix.type = "edgelist")
# Important remark: not specifying matrix.type = "edgelist" occasionally gave weird errors,
# mainly 'Erreur dans abs(x) : argument non numérique pour une fonction mathématique'
# (non-numeric argument to mathematical function), so I am making this option explicit everywhere.
gplot(myNetwork, displaylabels, label = network.vertex.names(myNetwork),
      gmode = "graph", boxed.labels = TRUE)  # default layout: mode = "fruchtermanreingold"

## igraph
myNetwork = igraph::graph_from_edgelist(edgelist, directed = FALSE)
plot(myNetwork)  # , layout = layout_as_tree)
# The default layout is layout_nicely, a smart function that chooses a layouter based on the graph.

Result:

  • sna::gplot takes 25.9 MB of memory and 210 ms
  • igraph::plot takes 0.8 MB of memory and 40 ms

Graphics

In terms of graphics (in RStudio, without reconfiguring the graphics interface):

[Figure: sna::gplot output]

[Figure: igraph plot output]

What do you think, @floriancafiero? I am tempted to switch to igraph, which means some important changes in code and visualisation (as compared to what we've been doing until now).

PCC.conflicts bug with layout

Trying to run the example as in:

data(fournival)
myConflicts = PCC.conflicts(fournival)

I have encountered the following bug:

 Error in gplot(myNetwork, displaylabels, label = network.vertex.names(myNetwork),  :   
 Error in gplot: no layout function for mode fruchtermanreingold 

import.TEIApparatus

For now, this function expects both a proper listWit and app entries. Should I modify it to allow for input with no listWit, like the CollateX output?
Another question: should we add a way to import only a selection of @type values, and not(@type)?

Rewrite PCC.conflicts to make it more efficient

PCC.conflicts is the most intensive function in the package and has already received much of the profiling effort, but we should see if we can make it more efficient, especially for the alternateReadings case.

Perhaps, for that case, we could process the full matrix once and turn it into a list containing one list per row, itself containing one list per cell, instead of splitting cells on the fly.
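
A sketch of that pre-splitting step, assuming that alternate readings are comma-separated within a cell, as in the example data reported in another issue. The helper name `presplit` is hypothetical, and the agreement rule shown at the end (two witnesses agree if they share at least one reading) is an assumption, not necessarily what the package implements.

```r
# Split every cell once, up front. The result is a list with one element per
# row (variant location), each a named list with one character vector of
# readings per witness.
presplit <- function(db) {
  lapply(seq_len(nrow(db)), function(i) {
    cells <- strsplit(db[i, ], ",", fixed = TRUE)
    lapply(cells, trimws)
  })
}

db <- matrix(c("1",   "2", "1,2",
               "1,3", "1", "3"),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("1", "2"), c("A", "D", "P")))
readings <- presplit(db)
readings[[1]]$P          # "1" "2"
readings[[2]]$A          # "1" "3"

# Assumed rule: two witnesses agree at a location if they share a reading.
length(intersect(readings[[1]]$A, readings[[1]]$P)) > 0   # TRUE
```

Whether this is actually faster than splitting on the fly would need to be confirmed by profiling on a realistic database.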
