
ropensci / taxa

taxonomic classes for R

Home Page: https://docs.ropensci.org/taxa

License: Other

R 100.00%
taxonomy taxon data-cleaning rstats r r-package

taxa's Introduction

taxa

Project Status: WIP - Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.

This is an R package that provides classes to store and manipulate taxonomic data. Most of the classes can be used like base R vectors. This project is a partial rewrite of the previous version of taxa and is currently under development.

A note about recent changes:

This is the beginning of a complete rewrite of the previous taxa package to make the more basic component classes more like base R vectors. The taxmap class is not yet reimplemented, but will be similar to the class in the previous versions of taxa. The old version of taxa has been incorporated into the metacoder package until this version of taxa is mature, at which time metacoder will also use this version.

Comments and contributions

We welcome comments, criticisms, and especially contributions! GitHub issues are the preferred way to report bugs, ask questions, or request new features. You can submit issues here:

https://github.com/ropensci/taxa/issues

Meta

  • Please report any issues or bugs.
  • License: MIT
  • Get citation information for taxa in R by running citation(package = 'taxa')
  • Please note that this project is released with a Contributor Code of Conduct (see CONDUCT.md). By participating in this project you agree to abide by its terms.

taxa's People

Contributors

grabear, lionel-, marsnone, rekyt, sckott, zachary-foster

taxa's Issues

New utility functions

There are a few "utility" functions I am thinking about adding in the same family as subtaxa, supertaxa, and roots.

tips or leaves

This would return all the taxa with no subtaxa. It would be useful when transforming a taxonomy into hierarchies.

Which name do you like better? leaves is consistent with roots and stems, but tips is shorter...

stems

This would return only taxa between a root (NA) and the first taxon with more than one subtaxon. I find I often want to quickly remove these taxa since their information is not needed in many contexts (e.g. you have a dataset with only animals, but you still have "cellular organisms; Eukaryota; Opisthokonta" in front of everything).
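To make the two definitions concrete, here is a minimal base R sketch on a toy edge list. The helper names find_leaves and find_stems and the from/to column names are made up for illustration, not the package API, and the sketch assumes the root itself counts as part of the stem.

```r
# Toy edge list: `to` is a taxon, `from` is its supertaxon (NA marks a root).
edges <- data.frame(
  from = c(NA, "cellular organisms", "Eukaryota", "Opisthokonta",
           "Metazoa", "Metazoa"),
  to   = c("cellular organisms", "Eukaryota", "Opisthokonta", "Metazoa",
           "Chordata", "Arthropoda"),
  stringsAsFactors = FALSE
)

# Leaves: taxa that never appear as a supertaxon.
find_leaves <- function(edges) {
  setdiff(edges$to, edges$from)
}

# Stems: starting at each root, follow the chain for as long as the
# current taxon has exactly one subtaxon (assumption: root included).
find_stems <- function(edges) {
  stems <- character(0)
  for (taxon in edges$to[is.na(edges$from)]) {
    repeat {
      children <- edges$to[!is.na(edges$from) & edges$from == taxon]
      if (length(children) != 1) break
      stems <- c(stems, taxon)
      taxon <- children
    }
  }
  stems
}

find_leaves(edges)
find_stems(edges)
```

On this toy tree the stems are the three taxa that sit in front of everything, and the leaves are the two taxa with no subtaxa.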

Make `subset` option accept NSE like `filter_taxa`

All the functions that have the subset option return values per-taxon.
The same code that filter_taxa uses to parse NSE should work here too.
Let you do stuff like:

supertaxa(ex_taxmap, taxon_ranks == "species")

instead of

supertaxa(ex_taxmap, ex_taxmap$taxon_ranks() == "species")

  • obs
  • subtaxa
  • leaves
  • roots
  • stems
  • supertaxa
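For reference, a rough base R sketch of how a subset argument could accept either an expression or ordinary indexes. This is not the code filter_taxa uses; parse_subset and the taxon_data list are hypothetical stand-ins for the per-taxon variables.

```r
# Hypothetical per-taxon variables, standing in for the values that
# would be available to NSE.
taxon_data <- list(
  taxon_names = c("Mammalia", "Felidae", "Panthera", "tigris"),
  taxon_ranks = c("class", "family", "genus", "species")
)

# Capture `subset` unevaluated, then evaluate it with the per-taxon
# variables in scope; other names are looked up in the caller's frame.
parse_subset <- function(data, subset) {
  expr <- substitute(subset)
  result <- eval(expr, envir = data, enclos = parent.frame())
  # Logical conditions become indexes; indexes pass through unchanged.
  if (is.logical(result)) which(result) else result
}

parse_subset(taxon_data, taxon_ranks == "species")
parse_subset(taxon_data, 1:2)
```

Because plain indexes evaluate to themselves, existing calls that pass a vector would keep working alongside the NSE form.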

Make sure taxmap functions handle lists and vectors correctly

Data frames in taxmap$data are pretty well tested, but not lists and vectors. Lists and vectors should be even easier than tables and the functions have been designed with this in mind, but they have not been tested much with lists and vectors.

  • filter_taxa
  • filter_obs
  • select_obs
  • obs
  • mutate_obs
  • transmute_obs
  • arrange_obs
  • sample_n_obs
  • sample_frac_obs
  • sample_n_taxa
  • sample_frac_taxa
  • tests
  • man examples
  • mention in vignette

Add `branches` function

In the group of functions roots, stems, and leaves I think there is a place for everything else, which I was thinking about calling branches.

There is already an is_branch function, so I figure there could be a branches function to make things consistent. It would return info on everything that is not a root, stem, or leaf. The four together would be the whole tree.
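A small base R sketch of the four-way split, under the assumption that stems exclude the root and stop before the first taxon with more than one subtaxon. classify_taxa and the edge list layout are hypothetical, not the package implementation.

```r
# Toy edge list: `to` is a taxon, `from` is its supertaxon (NA marks a root).
edges <- data.frame(
  from = c(NA, "A", "S", "B", "B", "C", "C", "D"),
  to   = c("A", "S", "B", "C", "D", "E", "F", "G"),
  stringsAsFactors = FALSE
)

classify_taxa <- function(edges) {
  children_of <- function(x) edges$to[!is.na(edges$from) & edges$from == x]
  roots <- edges$to[is.na(edges$from)]
  # Walk down from each root while there is exactly one subtaxon,
  # recording the single-child taxa below the root as stems.
  stems <- character(0)
  for (taxon in roots) {
    repeat {
      children <- children_of(taxon)
      if (length(children) != 1) break
      taxon <- children
      if (length(children_of(taxon)) == 1) stems <- c(stems, taxon)
    }
  }
  leaves <- setdiff(edges$to, edges$from)
  # Everything not claimed below stays a branch.
  type <- rep("branch", nrow(edges))
  type[edges$to %in% leaves] <- "leaf"
  type[edges$to %in% stems] <- "stem"
  type[edges$to %in% roots] <- "root"
  setNames(type, edges$to)
}

classify_taxa(edges)
```

Each taxon gets exactly one label, so the four groups together cover the whole tree.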

Databases stored as character cause error for print methods

This affects taxon_id, taxon_rank, and taxon_name. @sckott, at one point I think we discussed allowing the taxon database field of the above functions to be a character vector matching a name in database_list. Doing that currently causes an error.

I think allowing characters would save a lot of RAM for large datasets, since each taxon object could have 3 database objects, each with the tables mentioned in issue #40. Does that work for you?

<TaxonName> Poa
 Error in self$database$name : $ operator is invalid for atomic vectors
> taxon_name("Poa", database = ncbi)
<TaxonName> Poa
  database: ncbi
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] taxa_0.0.4.9105

loaded via a namespace (and not attached):
 [1] magrittr_1.5     assertthat_0.2.0 R6_2.2.0         DBI_0.6-1        tools_3.3.1      dplyr_0.5.0      tibble_1.3.0    
 [8] Rcpp_0.12.10     knitr_1.15.1     jsonlite_1.4    

Finish overview vignette

I think we should have a solid introduction vignette that briefly demonstrates the main functionality of the package before CRAN. Eventually, a few more vignettes that deal with things in more detail would be good, but those can be done later?

Add an option similar to `value` that controls how results are named

Currently, results of functions like supertaxa return taxon information named by taxon id:

> ex_taxmap$supertaxa(subset = taxon_ranks == "species", value = "taxon_names")
$`12`
         7          3          1 
"Panthera"  "Felidae" "Mammalia" 

$`13`
         8          3          1 
   "Felis"  "Felidae" "Mammalia"

...

It would be nice to reuse the code for value to add an option like name_value that controls how each result is named.

It would work like this:

> ex_taxmap$supertaxa(subset = taxon_ranks == "species", value = "taxon_names", name_value = "taxon_ranks")
$`12`
         genus          family         class 
"Panthera"  "Felidae" "Mammalia" 

$`13`
         genus          family         class 
   "Felis"  "Felidae" "Mammalia"

...

This would allow for making mappings between any two variables in all_names()

  • obs
  • subtaxa
  • leaves
  • roots
  • stems
  • supertaxa
  • classifications
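The mapping itself is simple to sketch in base R. map_values below is a hypothetical helper, not existing package code; the ids, names, and ranks are taken from the example output above.

```r
# Per-taxon variables named by taxon id.
taxon_names <- c(`1` = "Mammalia", `3` = "Felidae", `7` = "Panthera", `8` = "Felis")
taxon_ranks <- c(`1` = "class", `3` = "family", `7` = "genus", `8` = "genus")

# Supertaxon ids for two species, as a function like supertaxa might return.
supertaxa_ids <- list(`12` = c("7", "3", "1"), `13` = c("8", "3", "1"))

# Look up `value` for each id and name the result using `name_value`.
map_values <- function(ids, value, name_value) {
  lapply(ids, function(x) setNames(unname(value[x]), unname(name_value[x])))
}

map_values(supertaxa_ids, value = taxon_names, name_value = taxon_ranks)
```

Swapping the two arguments gives the reverse mapping, which is the sense in which any two per-taxon variables could be mapped to each other.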

Perhaps add a `return_type`: "name"

This would apply to functions that use taxonomy's private$get_return_type to return a result, like subtaxa and supertaxa. Instead of returning an ID or whatevs, it would return taxon names.

Seems like it would be an easy add.

Thoughts?

How are we handling rank validation?

Hi @sckott,

I am working on the vignette and I started thinking about ranks. I remember that the ranks used to have to match something in /data/ranks_ref.rda, but I suggested removing that validation since there is too much diversity in rank names to encode easily. Now I am thinking that we can do something in between.

What if the valid ranks were associated with the database class, like the id_regex? We could add a rank_regex option that takes one or more regexes that ranks have to match if a database is defined. Alternatively, if we want to encode rank order as well, then maybe an ordered factor (not of regex) of possible ranks called valid_ranks? In both cases, if a database is defined, then the rank names must be valid (rank constructor) and in a logical order (hierarchy and taxonomy constructors) or an error is thrown; if the database is not defined, then anything goes.

In this design /data/ranks_ref.rda would be removed and perhaps replaced with a list of database objects included with the package.
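A rough base R sketch of the valid_ranks variant. The example_db list and check_ranks function are hypothetical; positions from match() stand in for the ordered factor.

```r
# Hypothetical database description; `valid_ranks` runs from broad to specific.
example_db <- list(
  name = "example",
  valid_ranks = c("kingdom", "phylum", "class", "order",
                  "family", "genus", "species")
)

check_ranks <- function(ranks, database = NULL) {
  # No database defined: anything goes.
  if (is.null(database)) return(invisible(TRUE))
  invalid <- setdiff(ranks, database$valid_ranks)
  if (length(invalid) > 0) {
    stop("Invalid ranks for database '", database$name, "': ",
         paste(invalid, collapse = ", "))
  }
  # Positions in `valid_ranks` must not decrease along the hierarchy.
  positions <- match(ranks, database$valid_ranks)
  if (is.unsorted(positions)) {
    stop("Ranks are not in a logical order")
  }
  invisible(TRUE)
}

check_ranks(c("class", "family", "genus"), example_db)  # passes
check_ranks(c("clade", "whatever"))                      # passes: no database
```

The name check would belong in the rank constructor and the order check in the hierarchy and taxonomy constructors; they are combined here only to keep the sketch short.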

Thoughts?

Also, what is going on with the replication in ranks_ref?

   rankid                                                             ranks
1      05                                                      superkingdom
2      10                   kingdom,kingdom,kingdom,kingdom,kingdom,kingdom
3      20 subkingdom,subkingdom,subkingdom,subkingdom,subkingdom,subkingdom
4      25                                                      infrakingdom
5      30                   phylum,phylum,division,division,phylum,division
6      40 subphylum,subphylum,subdivision,subdivision,subphylum,subdivision
7      45                                                     infradivision
8      50            superclass,superclass,superclass,superclass,superclass
9      60                               class,class,class,class,class,class
10     70             subclass,subclass,subclass,subclass,subclass,subclass ...

Thanks

fxn name conflicts

there are two:

  • for taxa classes we talked about using hierarchies, but there is already a hierarchies fxn for taxmap classes - currently i'm using hierarchies_ for the version for taxa classes
  • for the function taxa for taxa classes, the pkg is named taxa, so i'm using taxa_ right now

thoughts on what to do? probably best to rename these things.

One option is to simply merge binomen and taxa packages under the package name binomen - that means the classes in binomen are swept away, and we use the manipulation methods there for manipulating taxa classes, and there's no collision i don't think with taxmap classes

Class relationships

I am trying to think of what classes are needed and how they should be related to allow for a flexible way of manipulating taxa and items that makes the fewest assumptions about the data and the user's intentions. Below is the best I can come up with so far.

[Figure: class_concept_graph]

The ovals represent classes.

  • Database_ID: Simple class to store known databases (e.g. ncbi). It might have a url, a description, and some rules on ID syntax.
  • Taxon_ID: Similar to what you proposed here. Arbitrary IDs should be allowed when no database is used.
  • Item_ID: Similar to Taxon_ID, but for sequences (and perhaps other things). Arbitrary IDs should be allowed when no database is used.
  • Taxon: A Taxon_ID with extra information, such as the name, hierarchy, and user data. Similar to taxa::taxon.
  • Item: An Item_ID with extra information, such as the name, sequence, and user data. There could be a function like metacoder::extract_taxonomy that uses information in user data to assign item and taxon ids from databases.
  • Taxonomy: A unique set of taxa. When combining taxa with different Database_ID, a separate arbitrary ID might be required.
  • Classified: A set of items classified by a unique set of taxa. Not all taxa need have an item; this distinguishes it from a set of items with taxonomy information. Having this information associated will allow for modifications of the taxonomy to affect the items and vice versa. Similar to taxa/metacoder::classified

Add tests or more tests for `taxon` classes

  • hierarchies
  • hierarchy - some coverage already - needed: print method
  • taxa
  • taxon_database
  • taxon - some coverage already - needed: print method - current tests very thin - do more
  • taxon_id - some coverage already - needed: print method
  • taxon_name - some coverage already - needed: print method

across all - do more failure testing - to make sure functions are failing well

Add `exp` option that acts like `value` but with NSE

obs currently does this kind of thing:

> # Return values from a dataset instead of indexes
> ex_taxmap$obs("info", value = "name")
$`1`
[1] tiger cat   mole  human
Levels: cat human mole potato tiger tomato

$`2`
[1] tomato potato
Levels: cat human mole potato tiger tomato

$`3`
[1] tiger cat  
Levels: cat human mole potato tiger tomato

$`4`
[1] mole
Levels: cat human mole potato tiger tomato

...

It might be useful to do something like this instead:

ex_taxmap$obs("info", value = name)

Then, it would allow expressions:

ex_taxmap$obs("info", value = sample_1 + sample_2)

Rebuild `taxmap` on top of `taxonomy`

This will be an upgrade of taxmap, but with taxonomy already doing a lot of the heavy lifting, I think it will actually reduce the complexity of taxmap some. The basic idea is to make taxmap inherit taxonomy, but add a list of user-defined tables. When these tables have a taxon_id column, that column will be used to map rows to the taxa and the edgelist, so modifications of the edgelist/taxa can affect the content of the tables. For example, removing a taxon might remove all of rows corresponding to that taxon as well as (optionally) all of its subtaxa. These kinds of operations will be done with a set of functions modeled after the dplyr functions, similar to filter_taxa and filter_obs now (see ?filter_taxa for details). filter_taxa will work pretty much the same, but filter_obs will need to be reworked to use multiple user-defined tables instead of just one "observation" table.

The metacoder taxmap also had a list of user-definable functions that could be added that simulate columns of data, but were calculated every time they were referenced. I find these very useful, but I'm not sure yet how to adapt them to multiple tables.

  • Make new taxmap class
  • Make sure subtaxa, roots, etc work with taxmap
  • Make taxmap print method
  • Add function to replace obs (see ?obs) that can be used with multiple tables.
  • Rework filter_taxa. The taxonless and reassign_obs options will need to be adapted to multiple tables. I am thinking of allowing for a named logical vector as well as just TRUE/FALSE. For example reassign_obs could be just TRUE to reassign observations in all user-defined tables or something like c(abundance = TRUE, stats = FALSE) to reassign observations in the table called "abundance", but not in the table called "stats". Another thing to consider is that the old taxmap had a table dedicated to taxon statistics (one row per taxon), so those columns were most commonly used as filtering conditions. Without a dedicated table that is guaranteed to line up with the edgelist 1 to 1, only user-defined tables with one row per taxon will work as filtering conditions. The reworked obs could be used to consolidate data from tables with any number of rows per taxon to be used for filtering.
  • Rework filter_obs, select_obs, arrange_obs, mutate_obs, and transmute_obs and possibly rename. This will require adding a new argument specifying which table to manipulate, but not too much work.
  • Remove mutate_taxa, select_taxa, and transmute_taxa since there will no longer be a dedicated taxon statistics table. This functionality will be replaced by the *_obs functions above.
  • Rework arrange_taxa to only affect the order of the edgelist.
  • Rework sample_n_obs, sample_n_taxa. These changes will be similar to others above.
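To illustrate the core idea (tables with a taxon_id column follow changes to the edgelist), here is a minimal base R sketch. remove_taxon, all_subtaxa, and the toy tables are hypothetical and much simpler than what filter_taxa would need to do.

```r
# Toy edge list plus two user-defined tables; only one has a taxon_id column.
edges <- data.frame(from = c(NA, "1", "1", "2"), to = c("1", "2", "3", "4"),
                    stringsAsFactors = FALSE)
tables <- list(
  abundance = data.frame(taxon_id = c("2", "3", "4", "4"),
                         count = c(5, 2, 7, 1), stringsAsFactors = FALSE),
  notes = data.frame(comment = "no taxon_id column here",
                     stringsAsFactors = FALSE)
)

# All subtaxa of a taxon, at any depth.
all_subtaxa <- function(edges, id) {
  found <- character(0)
  queue <- id
  while (length(queue) > 0) {
    children <- edges$to[!is.na(edges$from) & edges$from %in% queue]
    found <- c(found, children)
    queue <- children
  }
  found
}

remove_taxon <- function(edges, tables, id, drop_subtaxa = TRUE) {
  if (drop_subtaxa) {
    to_drop <- c(id, all_subtaxa(edges, id))
  } else {
    # Keep the subtaxa by attaching them to the removed taxon's supertaxon.
    to_drop <- id
    parent <- edges$from[edges$to == id]
    edges$from[!is.na(edges$from) & edges$from == id] <- parent
  }
  edges <- edges[!edges$to %in% to_drop, , drop = FALSE]
  # Only tables with a taxon_id column are mapped to taxa; others pass through.
  tables <- lapply(tables, function(tab) {
    if ("taxon_id" %in% names(tab)) {
      tab[!tab$taxon_id %in% to_drop, , drop = FALSE]
    } else {
      tab
    }
  })
  list(edges = edges, tables = tables)
}

remove_taxon(edges, tables, "2")
```

A per-table option such as c(abundance = TRUE, stats = FALSE) could be handled by indexing that vector with each table's name inside the lapply call.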

Verify and add tests for character as input for class constructors

Hi @sckott, I was thinking that it would be useful if all of the classes allowed for character input when being initialized as well as objects as they currently do. I started doing this for taxonomy and therefore most of the others as well since the taxonomy constructor now calls their constructors to do the character to object conversion, but I bet there are edge cases still unhandled.

For example, the following code is used in the tests of taxonomy:

> taxonomy(c("a", "b", "c"), c("a", "d"))
<Taxonomy>
  4 taxa: 1. a, 2. b, 3. d, 4. c
  4 edges: NA->1, 1->2, 1->3, 2->4

I did have to reduce the stringency of some of the constructors to do this, particularly taxon which required the rank and hierarchy, which used the rank to sort taxa. I modified hierarchy to only sort taxa when all had ranks and otherwise retain input order.
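As an illustration of what converting character input into a taxonomy involves, here is a base R sketch that builds a unique set of taxa and their supertaxon links from plain character vectors. build_taxonomy is hypothetical and not the package constructor; it numbers taxa in order of first appearance, which need not match the numbering printed above, and it assumes taxon names contain no ";".

```r
build_taxonomy <- function(...) {
  hierarchies <- list(...)
  # Identify each taxon by its full path so identical names under
  # different supertaxa stay separate.
  paths <- unique(unlist(lapply(hierarchies, function(h) {
    vapply(seq_along(h), function(i) paste(h[1:i], collapse = ";"),
           character(1))
  })))
  # Dropping the last path element gives the supertaxon's path.
  parent_paths <- sub(";?[^;]*$", "", paths)
  data.frame(
    id = seq_along(paths),
    name = sub("^.*;", "", paths),
    supertaxon_id = match(parent_paths, paths),  # NA for roots
    stringsAsFactors = FALSE
  )
}

build_taxonomy(c("a", "b", "c"), c("a", "d"))
```

The shared taxon "a" appears once, giving four taxa and four edges (including the NA edge for the root), as in the test output above.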

Is all this OK with you?

Issue errors if a user tries to use a `taxmap` option with `taxonomy` object

There are 4 functions that have been moved from taxmap to taxonomy:

  • filter_taxa
  • arrange_taxa
  • sample_n_taxa
  • sample_frac_taxa

These have a few options specific to taxmap. For the purposes of DRY, I will try to have the full functions in taxonomy and check the class used before allowing the options specific to taxmap to be used.

Make sure `taxmap` functions work for datasets with multiple obs per taxon

All of the example data in ex_taxmap currently has a 1:1 relationship with taxa (info does not have values for all taxa). A lot of data in the real world will have multiple observations per taxon. The functions are designed for this, but it has not been tested much.

  • add dataset to ex_taxmap with multiple observations per taxon.
  • add tests that use the new dataset for major functions
  • check that validate_taxmap_data() can understand groups of rows for each taxon. The column with the grouping factor should have consecutive values for each taxon, but the "ids" at this point do not need to match the ids ultimately assigned by the function. This might warrant its own issue.

Make sure `value` option in `obs` works with lists and vectors

Currently, the value option of obs takes the name of a column in a table to extract values from for each taxon.
This does not make sense for lists/vectors.
I am thinking of modifying so it accepts TRUE/FALSE.
Alternatively, it might work already if you use the same value as data.

Use `rlang` instead of `lazyeval` for NSE

After running into some mysterious bugs, I made a reproducible example and asked people on the ropensci slack to look at it. Jim Hester figured out the cause:

@zachary-foster It is because devtools::test() runs devtools::load_all() which puts all function objects in the evaluation environment. You get the same behavior if you run library(bugtest);func_a <- bugtest::func_a; func_b(func_a). But the real solution is to switch to using the newer iteration of lazy evaluation found in rlang instead, e.g. use rlang::eval_tidy(rlang::enquo(x), data = my_data) instead of lazyeval::lazy_eval(lazyeval::lazy(x), data = my_data). The rlang version works in all cases for this example and is what everything in the tidyverse will be migrated to.

Adapt `taxmap` utility functions to `taxonomy`

The following functions used to get information from taxmap objects should be easily adaptable to taxonomy objects:

One question is, what should these return by default? Currently, they return either indexes or IDs, depending on the index option. Indexes are useful because they are the fastest way to access information and make a big difference for large datasets, but they don’t respond to changes to the taxonomy. IDs are useful because they are not affected by changes to the taxonomy and can be mapped to other objects/tables. Now that there are taxon objects, we have the option of returning taxon objects or a taxa object. So, I’m thinking of replacing the index option with something like return_type that takes the following values: "index", "id", "taxa", or "hierarchies". I have found being able to choose ID or indexes very useful when using these functions.

By the way @sckott, do you mind adding me as a collaborator for this repository so I can add issue tags and assign myself to things? I wanted to assign myself this issue for example and I don't think I can as it is.

Behavior of taxmap$func() vs func(taxmap)

@sckott, currently, I have set up the two ways of calling functions to behave differently.
Calling a function in the classical R way imitates traditional no-side-effects copy-on-change behavior by cloning the object before returning the changed clone version. For example, filter_taxa(ex_taxmap, 1:3) will not change ex_taxmap, but ex_taxmap$filter_taxa(1:3) will. Both return the modified ex_taxmap.

Do you like this convention, or do you think it will confuse users? I suspect that each user will pick the style of calling they like and stick to it for the most part.
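The two calling styles can be mimicked in base R with an environment standing in for a reference-semantics object. new_taxa_obj and this filter_taxa are toy stand-ins for illustration only, not the package code.

```r
# A toy object with reference semantics, emulated with an environment.
new_taxa_obj <- function(taxon_ids) {
  self <- new.env()
  self$taxon_ids <- taxon_ids
  # Method form: modifies the object in place and returns it.
  self$filter_taxa <- function(index) {
    self$taxon_ids <- self$taxon_ids[index]
    invisible(self)
  }
  self
}

# Function form: work on a copy so the input is left untouched.
filter_taxa <- function(obj, index) {
  copy <- new_taxa_obj(obj$taxon_ids)
  copy$filter_taxa(index)
}

x <- new_taxa_obj(c("1", "2", "3", "4"))
y <- filter_taxa(x, 1:3)   # x still has 4 taxa, y has 3
x$filter_taxa(1:2)         # x itself now has 2 taxa
```

Both forms return the modified object, so either can be chained; the difference is only whether the original is changed.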

Do something about how `value` can now return nonsensical output

Since the changes in #51, value is a much more flexible version of what return_type was.
However, it is now really easy for users to do irrational things without a warning.
Its current implementation just looks up the data given and subsets it by the result of the function it is in. In some cases the result of the function is observation indexes (obs) and other times it is taxon indexes (subtaxa, roots, etc). It is now easy to subset taxa info with obs indexes and vice versa, which does not make sense:

For example:

> ex_taxmap$obs("info", value = "taxon_names")
$`1`
             1              2              3              4 
    "Mammalia"      "Plantae"      "Felidae" "Notoryctidae" 

$`2`
           5            6 
 "Hominidae" "Solanaceae" 

....

and

> ex_taxmap$supertaxa(value = "name")
$`1`
factor(0)
Levels: cat human mole potato tiger tomato

$`2`
factor(0)
Levels: cat human mole potato tiger tomato

$`3`
[1] tiger
Levels: cat human mole potato tiger tomato

$`4`
[1] tiger
Levels: cat human mole potato tiger tomato

$`5`
[1] tiger
Levels: cat human mole potato tiger tomato

...

I think either a warning should be issued or the function should do its best to convert between taxon indexes and obs indexes when possible and error when not.

For example, ex_taxmap$obs("info", value = "taxon_names") could look up the taxon_names associated with the observations indexes, assuming a taxon_id column exists in ex_taxmap$data$info.

Also, ex_taxmap$supertaxa(value = "n_legs") could look for rows in ex_taxmap$data$info assigned to taxa in the output of supertaxa and use the n_legs values there. This would return NA for taxa not in ex_taxmap$data$info and error if there is more than one entry per taxon, which would be common in many situations.
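The taxon-to-observation conversion described here could look roughly like this in base R. lookup_by_taxon is a hypothetical helper and the table is a cut-down copy of the info data set.

```r
info <- data.frame(taxon_id = c("12", "13", "14"), n_legs = c(4, 4, 4),
                   stringsAsFactors = FALSE)

# Map taxon ids to values of a table column via the taxon_id column.
lookup_by_taxon <- function(taxon_ids, table, column) {
  if (!"taxon_id" %in% names(table)) {
    stop("The table has no taxon_id column, so it cannot be mapped to taxa")
  }
  if (anyDuplicated(table$taxon_id) > 0) {
    stop("More than one row per taxon; cannot return one value per taxon")
  }
  # Taxa without a row in the table get NA.
  setNames(table[[column]][match(taxon_ids, table$taxon_id)], taxon_ids)
}

lookup_by_taxon(c("12", "7", "3"), info, "n_legs")
```

The duplicate check errors for the whole table even when the requested taxa are unique, which is stricter than necessary; checking only the matched rows would be a reasonable refinement.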

Add tests for `taxmap`

This could take a while since there are a lot of functions with flexible input/output associated with taxmap.
The list below includes the relatively complex functions that are not just wrappers for other functions:

  • all_names
  • names_used
  • get_data
  • obs
  • filter_taxa
  • filter_obs
  • select_obs
  • mutate_obs
  • transmute_obs
  • arrange_obs
  • arrange_taxa
  • sample_n_obs
  • sample_n_taxa

Make `value` in `obs` and `return_type` in other funcs work the same

These do very similar things, so they might as well have the same name and implementation.
Maybe rename return_type to value and make it work like value currently does, which is more consistent with how other functions work. This will however remove return_type's ability to return taxa and hierarchies objects. I think the added flexibility of being able to use everything in all_names makes it worth it.

  • obs
  • subtaxa
  • leaves
  • roots
  • stems
  • supertaxa

Add function for tabular output of `taxonomy` and `taxmap`

This would try to pack all the information in taxmap or taxonomy into a table, repeating values when necessary.
This could be done using the output of get_data().

export_data = function(obj, cols) {
   ...
}

cols would be any set of values in all_names(). I hesitate to have cols output everything by default, because there is a lot in all_names() that most people would not want exported (e.g. is_stem). I also don't want to choose for the user a default set of columns because, in the case of taxmap, most of the interesting stuff will be user-defined. So I am thinking of having no default and making the user decide what to export.
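A possible shape for this, sketched in base R. It takes a named list of per-taxon vectors, as get_data() might return, rather than the object itself, and it assumes every requested vector is named by taxon id; both are simplifications.

```r
export_data <- function(values, cols) {
  # No default for `cols`: the user decides what to export.
  if (missing(cols)) stop("`cols` has no default; choose the columns to export")
  unknown <- setdiff(cols, names(values))
  if (length(unknown) > 0) {
    stop("Unknown columns: ", paste(unknown, collapse = ", "))
  }
  # One row per taxon id found in any requested column; NA where missing.
  ids <- unique(unlist(lapply(values[cols], names)))
  columns <- lapply(values[cols], function(x) unname(x[ids]))
  cbind(data.frame(taxon_id = ids, stringsAsFactors = FALSE),
        as.data.frame(columns, stringsAsFactors = FALSE))
}

values <- list(
  taxon_names = c(`1` = "Mammalia", `3` = "Felidae", `12` = "tigris"),
  n_supertaxa = c(`1` = 0, `3` = 1, `12` = 3),
  is_stem     = c(`1` = FALSE, `3` = FALSE, `12` = FALSE)
)
export_data(values, cols = c("taxon_names", "n_supertaxa"))
```

Handling tables with several rows per taxon would need a merge on taxon id instead of simple indexing, which is where the repeated values would come from.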

Make `obs` have the option to return data instead of indexes

Currently, obs just returns indexes of observations for each taxon. Most of the time that is used to look up some set of values for each taxon, so it would be nice if obs could return that directly.

What had to be done this way:

vapply(obs(data, "my_table"), # For each taxon...
         function(index) sum(data$obs_data[index, id]), numeric(1)) # sum the proportions

Could be done this way:

vapply(obs(data, "my_table", data_col = id), sum, numeric(1)) 
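A self-contained base R comparison of the two styles. The table, the obs_values helper, and the use of split() are illustrative assumptions; unlike obs, this sketch ignores the hierarchy and only groups rows by their own taxon id.

```r
obs_data <- data.frame(taxon_id = c("12", "12", "13"), count = c(2, 3, 5),
                       stringsAsFactors = FALSE)

# Index style: get row indexes per taxon, then look the values up.
obs_index <- split(seq_len(nrow(obs_data)), obs_data$taxon_id)
sums_from_index <- vapply(obs_index,
                          function(index) sum(obs_data$count[index]),
                          numeric(1))

# Value style: return the column values per taxon directly.
obs_values <- function(table, column) split(table[[column]], table$taxon_id)
sums_from_values <- vapply(obs_values(obs_data, "count"), sum, numeric(1))
```

Both give the same per-taxon sums; the second simply removes the extra lookup from the calling code.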

Add taxon ids for names to table column output in `get_data`

get_data returns this:

> ex_taxmap$get_data()
$taxon_names
             1              2              3              4              5              6              7              8              9 
    "Mammalia"      "Plantae"      "Felidae" "Notoryctidae"    "Hominidae"   "Solanaceae"     "Panthera"        "Felis"   "Notoryctes" 
            10             11             12             13             14             15             16             17 
        "homo"      "Solanum"       "tigris"        "catus"     "typhlops"      "sapiens" "lycopersicum"    "tuberosum" 

$taxon_ids
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15" "16" "17"

$taxon_indexes
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

$n_supertaxa
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 
 0  0  1  1  1  1  2  2  2  2  2  3  3  3  3  3  3 

$n_subtaxa
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 
11  4  4  2  2  3  1  1  1  1  2  0  0  0  0  0  0 

$n_subtaxa_1
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 
 3  1  2  1  1  1  1  1  1  1  2  0  0  0  0  0  0 

$name
[1] tiger  cat    mole   human  tomato potato
Levels: cat human mole potato tiger tomato

$n_legs
[1] 4 4 4 2 0 0

$dangerous
[1]  TRUE FALSE FALSE  TRUE FALSE FALSE

$phylopic_ids
                                    12                                     13                                     14 
"e148eabb-f138-43c6-b1e4-5cda2180485a" "12899ba0-9923-4feb-a7f9-758c3c7d5e13" "11b783d5-af1c-4f4e-8ab5-a51470652b47" 
                                    15                                     16                                     17 
"9fae30cd-fb59-4a81-a39c-e1826a35f612" "b6400f39-345a-4711-ab4f-92fd4e22cb1a" "63604565-0406-460b-8cb8-1abe954b3f3a" 

$foods
$foods$`12`
[1] "mammals" "birds"  

$foods$`13`
[1] "cat food" "mice"    

$foods$`14`
[1] "insects"

$foods$`15`
[1] "Most things, but especially anything rare or expensive"

$foods$`16`
[1] "light" "dirt" 

$foods$`17`
[1] "light" "dirt" 


$reaction
[1] "Watch out! That tiger might attack!" "No worries; its just a cat."         "No worries; its just a mole."       
[4] "Watch out! That human might attack!" "No worries; its just a tomato."      "No worries; its just a potato." 

Note how entries like "dangerous" that came from tables in ex_taxmap$data are not named by taxon id.
Not all tables have a "taxon_id" column, but if it is there, then they should be named.

  • make private function to get the taxon ids for things in ex_taxmap$data by name
  • Add taxon ids to output when possible

"TaxonAuthority" class?

Hi @sckott, I notice that the taxon man pages mention a TaxonAuthority object and the print method seems to expect one, but I do not see the code for one. Should there be such a class? It would be consistent with name, rank, and id, but I am not sure what we would gain by making it a class vs just a character. Do we want to associate the authority with a database for example?

`taxonomy` questions

Hi @sckott, nice work with the taxonomy class! I am looking through the code and trying to figure a few things out:

  1. What's going on here? I don't see a taxa variable. It results in no. hierarchies: 0 when the print method is used.
  2. Do we need to store the edgelists and graph variables? They seem redundant to me and would require recalculation every time a change is made. make_graph could be used in the print method directly instead of graph. The contents of edgelists could be inferred from edgelist, except for multiple instances of the same edge list, but if users are interested in that they should use hierarchies or taxmap (with a column for taxon counts).
  3. How about calling uniqtaxa taxa or changing the private function unique_taxa to something like get_unique_taxa, so that the public variable can be unique_taxa?

I am going to play with it on a new branch and submit a PR if I come up with anything good.

Add NSE variables for dataset taxon ids

I found a situation where it would be useful for the taxon ids of observations in a table to be available to NSE in all_names(). For observation taxon ids of lists and vectors, it is easy to just use names to get taxon ids. For example, in ...

> ex_taxmap
<Taxmap>
  17 taxa: 1. Mammalia, 2. Plantae, 3. Felidae, 4. Notoryctidae ... 14. typhlops, 15. sapiens, 16. lycopersicum, 17. tuberosum
  17 edges: NA->1, NA->2, 1->3, 1->4, 1->5, 2->6, 3->7, 3->8, 4->9, 5->10, 6->11, 7->12, 8->13, 9->14, 10->15, 11->16, 11->17
  3 data sets:
    info:
      # A tibble: 6 x 4
          name n_legs dangerous taxon_id
        <fctr>  <dbl>     <lgl>    <chr>
      1  tiger      4      TRUE       12
      2    cat      4     FALSE       13
      3   mole      4     FALSE       14
      # ... with 3 more rows
    phylopic_ids:  e148eabb-f138-43c6-b1e4-5cda2180485a ... b6400f39-345a-4711-ab4f-92fd4e22cb1a, 63604565-0406-460b-8cb8-1abe954b3f3a
    foods: a list with 6 items
  1 functions:
 reaction

The taxon ids of ex_taxmap$data$foods could be found by NSE with names(foods), but there is no easy way to get the taxon ids of rows in the ex_taxmap$data$info dataset. names(n_legs) would work, but picking an arbitrary column like that is a bit hackish.

So, I am thinking about modifying all_names() to include something like info_taxon_ids for each table in data. This will help in implementing a function for making mappings between any two variables with associated taxon ids. That function might help DRY out the code for the value option and make #55 easier to implement.

Make `classifications` function

This would be an abstraction of id_classifications and name_classifications that allows constructing classifications from anything in all_names()

 ex_taxmap$name_classifications()
                                          1                                           2                                           3 
                                 "Mammalia"                                   "Plantae"                          "Mammalia;Felidae" 
                                          4                                           5                                           6 
                    "Mammalia;Notoryctidae"                        "Mammalia;Hominidae"                        "Plantae;Solanaceae" 
                                          7                                           8                                           9 
                "Mammalia;Felidae;Panthera"                    "Mammalia;Felidae;Felis"          "Mammalia;Notoryctidae;Notoryctes" 
                                         10                                          11                                          12 
                  "Mammalia;Hominidae;homo"                "Plantae;Solanaceae;Solanum"          "Mammalia;Felidae;Panthera;tigris" 
                                         13                                          14                                          15 
             "Mammalia;Felidae;Felis;catus" "Mammalia;Notoryctidae;Notoryctes;typhlops"           "Mammalia;Hominidae;homo;sapiens" 
                                         16                                          17 
  "Plantae;Solanaceae;Solanum;lycopersicum"      "Plantae;Solanaceae;Solanum;tuberosum" 

would be the same as:

ex_taxmap$classifications(value = "taxon_names")
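A base R sketch of the generalization: walk up the edge list from each taxon and paste together whatever per-taxon value is supplied. This standalone classifications function and its arguments are hypothetical, and it assumes the edge list is a tree (no cycles).

```r
# A small slice of the example taxonomy: Mammalia > Felidae > Panthera.
edges <- data.frame(from = c(NA, "1", "3"), to = c("1", "3", "7"),
                    stringsAsFactors = FALSE)
taxon_names <- c(`1` = "Mammalia", `3` = "Felidae", `7` = "Panthera")
taxon_ranks <- c(`1` = "class", `3` = "family", `7` = "genus")

classifications <- function(edges, value, sep = ";") {
  parent <- setNames(edges$from, edges$to)
  vapply(edges$to, function(id) {
    # Prepend supertaxa until a root (NA supertaxon) is reached.
    path <- id
    while (!is.na(parent[[path[1]]])) path <- c(parent[[path[1]]], path)
    paste(value[path], collapse = sep)
  }, character(1))
}

classifications(edges, value = taxon_names)
classifications(edges, value = taxon_ranks, sep = " > ")
```

With taxon names as the value this reproduces the name_classifications strings for these three taxa; any other variable named by taxon id works the same way.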

Finalize function and option names

If we want to rename any functions or arguments, we should do that before CRAN. If there are options that do similar things between functions we should make sure they are the same.

Standardizing the vocabulary we use in the man pages and vignettes is part of this too. For example, I sometimes describe taxonomic ranks as "levels", and I use "parent taxa" and "supertaxa" interchangeably. I think this might confuse people who are new to these concepts, so we should pick one term for each and stick to it. We could even add a glossary to the man pages and link words like "supertaxa" in other man pages to it.

Go through docs right before CRAN

I am noticing some outdated and inconsistent documentation.
We should probably go through it before the CRAN submission, although I fixed some of it.

`filter_taxa` throws error when `funcs` is empty

> my_taxmap <- taxmap(tiger, cat, mole, human, tomato, potato,
+                     data = list(info = info,
+                                 phylopic_ids = phylopic_ids,
+                                 foods = foods))
> my_taxmap
<Taxmap>
  17 taxa: 1. Mammalia, 2. Plantae, 3. Felidae ... 14. typhlops, 15. sapiens, 16. lycopersicum, 17. tuberosum
  17 edges: NA->1, NA->2, 1->3, 1->4, 1->5, 2->6, 3->7 ... 5->10, 6->11, 7->12, 8->13, 9->14, 10->15, 11->16, 11->17
  3 data sets:
    info:
      
    phylopic_ids:  e148eabb-f138-43c6-b1e4-5cda2180485a ... 63604565-0406-460b-8cb8-1abe954b3f3a
    foods: a list with 6 items
  0 functions:
> filter_taxa(my_taxmap, startsWith(name, "t"))
 Error in names(func_names) <- rep("funcs", length(func_names)) : 
  attempt to set an attribute on NULL 
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] taxa_0.0.4.9105

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10     digest_0.6.12    dplyr_0.5.0      rprojroot_1.2    assertthat_0.2.0 R6_2.2.0         jsonlite_1.4    
 [8] DBI_0.6-1        backports_1.0.5  magrittr_1.5     evaluate_0.10    stringi_1.1.5    lazyeval_0.2.0   rmarkdown_1.4   
[15] tools_3.3.1      stringr_1.2.0    yaml_2.1.14      htmltools_0.3.5  knitr_1.15.1     tibble_1.3.0    
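
The error message points at a names assignment on a NULL object, which would happen if `func_names` is NULL when the taxmap has no functions. A minimal base R sketch of the failing pattern and one possible guard follows; `label_names` is a hypothetical helper, not the fix that was applied in taxa.

```r
# The pattern from the error message, applied to a plain NULL
func_names <- NULL
outcome <- tryCatch({
  names(func_names) <- rep("funcs", length(func_names))
  "no error"
}, error = function(e) conditionMessage(e))
print(outcome)

# Hypothetical guard: return an empty character vector when there is nothing
# to label, otherwise name every element with the same label
label_names <- function(x, label) {
  if (length(x) == 0) {
    return(character(0))
  }
  stats::setNames(x, rep(label, length(x)))
}

print(label_names(NULL, "funcs"))
print(label_names(c("f1", "f2"), "funcs"))
```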
