
chop-dbhi / avocado


Metadata APIs for Django

Home Page: http://avocado.harvest.io

License: Other

Makefile 0.01% Python 97.97% HTML 1.70% R 0.03% SAS 0.07% Shell 0.22%

avocado's People

Contributors

apendleton, bruth, hassanns, jeffmax, murphyke, naegelyd, sgithens


avocado's Issues

Document and better integrate composite DataContexts

A hidden feature that was implemented a while back (1dacc30) is the notion of composite DataContexts. Normally, when two sets of query conditions are combined, the result is a new object that copies the contents of the source objects. A composite DataContext instead references the objects that make it up. This has the benefit (and side effect) of being a dynamic set of conditions composed of one or more other sets of conditions.

This is primarily useful for creating building blocks that are used in combination to create something more complex:

A ----\
       C
B ----/

The problem with simply copying a set of conditions into a more complicated query is that you need to go back and update the copy if your conditions change: if you change A or B, you must also change C. By referencing A and B in C instead, those changes are picked up automatically since the conditions are compiled at runtime.
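A minimal plain-Python sketch of the difference (the class and attribute names here are hypothetical, not the actual Avocado API): a copied context snapshots its sources, while a composite context resolves its referenced parts every time its conditions are compiled.

```python
class Context:
    """Toy stand-in for a DataContext: holds its own conditions plus
    optional references to other contexts (composite behavior)."""

    def __init__(self, conditions=None, parts=None):
        self.own = list(conditions or [])   # conditions stored directly
        self.parts = list(parts or [])      # referenced child contexts

    def conditions(self):
        # Compiled at "runtime": referenced parts are resolved on every
        # call, so edits to A or B are automatically seen by C.
        out = list(self.own)
        for part in self.parts:
            out.extend(part.conditions())
        return out

a = Context(conditions=["age < 5"])
b = Context(conditions=["sex = female"])

copied = Context(conditions=a.conditions() + b.conditions())  # snapshot
composite = Context(parts=[a, b])                             # references

a.own.append("has diagnosis")
print(copied.conditions())     # stale: does not include the new condition
print(composite.conditions())  # picks up "has diagnosis" automatically
```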

Feature request: default sub-category

It might be nice to not only order categories but to mark sub-categories as "default", which would cause those to be initially displayed instead of "All". This would allow a category to be broken up into basic/advanced or common/uncommon subcategories, with the most common (less overwhelming) subset being displayed to the user first.

SELECT DISTINCT ... ORDER BY may result in redundant rows

This isn't a bug, but rather a constraint of how DISTINCT .. ORDER BY queries are performed in (most?) database implementations. The problematic side effect is the redundancy of output caused by ordering across a one-to-many relationship without actually wanting the related column in the SELECT clause.

A statement like

SELECT DISTINCT foo
...
ORDER BY bar, baz

is not valid because, with SELECT DISTINCT, all ORDER BY columns must appear in the SELECT list. One approach is for Avocado to handle this in the exporter API: the ORDER BY columns are added to the SELECT, then transparently pruned from the output (which the exporter actually does anyway), and redundant rows are excluded by checking each row against the previously iterated one.

Another consideration is using an ordered subquery, but there is no guarantee the ordering remains on the outer query (http://stackoverflow.com/a/2101925 and http://stackoverflow.com/a/5119308).
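The exporter-side approach described above can be sketched as follows (a plain-Python illustration, not the actual exporter code): the trailing columns that were appended only to satisfy ORDER BY are pruned, and rows that become duplicates after pruning are dropped by comparing against the previously yielded row.

```python
def prune_and_dedupe(rows, n_extra):
    """rows: iterable of tuples whose last n_extra columns were added
    only for ORDER BY. Yields the pruned rows with adjacent duplicates
    removed (sufficient because the rows arrive in sorted order)."""
    prev = object()  # sentinel that never equals a real row
    for row in rows:
        pruned = row[:len(row) - n_extra]
        if pruned != prev:
            yield pruned
            prev = pruned

# Simulated result of SELECT foo, bar, baz ... ORDER BY bar, baz
raw = [
    ('foo1', 'a', 1),
    ('foo1', 'a', 2),   # redundant once the ordering columns are pruned
    ('foo2', 'b', 1),
]
print(list(prune_and_dedupe(raw, 2)))  # [('foo1',), ('foo2',)]
```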

Query history

Log history of query generation. This includes:

  • adding, removing and changing query conditions (DataContext)
  • adding and removing output data fields (DataView)
  • naming and archiving queries (DataContextView)

Add mechanism for Field data cache invalidation

Specifically, the properties values, coded_values, and distribution can be cached for as long as the underlying data does not change. If an instance is saved or deleted, these properties may no longer be valid.

A potential (naïve) solution would be to add post_delete and post_save signal handlers, but in the case of a batch update these handlers could be executed thousands of times, which would be redundant.

Another thing to consider is the means by which the underlying data will change. For large amounts of data, multiple rows may be altered using the QuerySet methods bulk_create, update, and delete, but rarely would the equivalent Model methods be used. A more likely scenario is an external ETL process altering the data.

At a bare minimum, an instance method should be defined on the Field model for invalidating the cache.
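The bare-minimum invalidation method could look like this (a plain-Python sketch with hypothetical names, not the actual Field model): the cached properties live under known keys, and a single invalidate() call drops them, whether triggered from a signal handler or called once after an ETL run or bulk update.

```python
CACHED_PROPERTIES = ('values', 'coded_values', 'distribution')

class Field:
    """Toy stand-in for the Field model's caching behavior."""

    def __init__(self):
        self._cache = {}

    def values(self):
        # Compute once and cache until explicitly invalidated.
        if 'values' not in self._cache:
            self._cache['values'] = self._compute_values()
        return self._cache['values']

    def _compute_values(self):
        return ['a', 'b']  # stand-in for an expensive database aggregation

    def invalidate(self, *properties):
        # Invalidate everything by default, or just the named properties.
        for prop in properties or CACHED_PROPERTIES:
            self._cache.pop(prop, None)

f = Field()
f.values()      # populates the cache
f.invalidate()  # e.g. called once after a bulk update, not per-row
assert 'values' not in f._cache
```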

Add mechanism for performing re-counts on DataContext objects

The use case is primarily that when the data changes or translators are modified, the resulting counts may change as well.

One side effect of this is the user noticing the count changes, but this should be communicated by some other means (e.g. message, changelog, etc.).

Most likely a management command will suffice, which can optionally take specific app, model, or field names to target specific DataContexts. This has the performance benefit of not having to perform a recount for every query.

Implement coded values

For consistent exports over time, various software (e.g. SAS, R) requires a particular coding scheme for strings. Rather than having the downstream process keep track of the mapping, every value for all string-based data can be coded ahead of time.

This would entail a single model:

from django.db import models

class CodedValue(models.Model):
    field = models.ForeignKey(Field)
    value = models.CharField(max_length=100)
    coded = models.IntegerField()

    class Meta(object):
        unique_together = ('field', 'value')
        verbose_name = 'coded value'
        verbose_name_plural = 'coded values'

field can be nullable in the case that a coded value is shared across multiple fields. Thus a lookup would first try to find a coded value for the particular field, then fall back to the non-field-specific coding for that value.
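The fallback lookup described above can be sketched in plain Python (the helper name and dict-based store are illustrative; the real implementation would query the CodedValue model):

```python
# (field_id, value) -> coded integer; a None field_id means the coding
# is shared across all fields (the nullable-field case).
codes = {
    (1, 'male'): 1,
    (None, 'male'): 10,
    (None, 'female'): 20,
}

def lookup_code(field_id, value):
    # Field-specific coding wins; otherwise fall back to the shared one.
    code = codes.get((field_id, value))
    if code is None:
        code = codes.get((None, value))
    return code

print(lookup_code(1, 'male'))    # 1  (field-specific)
print(lookup_code(2, 'male'))    # 10 (fallback to shared coding)
print(lookup_code(2, 'female'))  # 20
```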

Define "Concept" model

A Concept represents one or more Definition instances providing various query and output customizations.

Flag broken/deprecated nodes in DataContext

Each time the DataContext is validated, the respective node can be flagged to inform downstream clients that the node will not be used when the query is executed.

This will ensure errors are not thrown on the server as a result of a missing datafield.

Implement search engine for indexed metadata and data values

http://haystacksearch.org/

  • Implement the index documents for DataConcept and DataField (see https://github.com/cbmi/avocado/blob/develop/avocado/templates/search/indexes/avocado/dataconcept_text.txt)
  • Implement a SearchManager of sorts using the haystack APIs for a clean API, for example DataConcept.objects.search('male') should find the sex DataConcept object and any other concepts associated with that text
  • See what APIs are available for getting a confidence score (for doing best guess, e.g. I'm Feeling Lucky)
    • This could provide a rudimentary avenue for doing high-level Google-style querying of data

Abstract ObjectSet model and components

It is common to save off sets of objects (e.g. patients) for later reference. This enables performing actions on the set without the constraints of the original query conditions, as well as performing basic set operations such as union and intersection between sets of the same type.

The main benefits include:

  • API for creating persistent, on-the-fly sets of objects (effectively materialized views)
  • Performance benefits of joining on a fixed set of objects without the encumbrance of query conditions
  • Using sets as building blocks for creating more complicated queries that may be virtually impossible for end users to construct manually otherwise

For models that need to be set-enabled, the model should subclass ObjectSet which will make it easier elsewhere to make use of such models. The subclass must implement a ManyToManyField pointing to the model of interest.

The pseudo-implementation is as follows:

  • ObjectSet must implement the __and__ and __or__ hooks for set-like behavior between instances of the same type
  • ObjectSet must have name, description, and user fields as well as auto-updating created and modified datetime fields
  • ObjectSet should store the current size and is auto-updated on every change
  • ObjectSet should have a foreign key to avocado.models.DataContext to optionally keep track of the original set of conditions used to construct the set
  • Implement an ObjectSetThrough class (for the M2M relationship) that keeps track of added and removed objects from the set that were not originally in the set (upon initial creation)

The usage should look like this:

from django.db import models

from avocado.sets import ObjectSet, ObjectSetThrough

class PatientSet(ObjectSet):
    patients = models.ManyToManyField(Patient, through='PatientSetThrough')


class PatientSetThrough(ObjectSetThrough):
    # shadowing the built-in is ok here..
    set = models.ForeignKey(PatientSet)
    item = models.ForeignKey(Patient)

Things to consider:

  • A convention for using certain terms in subclasses (i.e. set and item) may be useful for generically making use of ObjectSet subclasses across the API
  • It should be easy to make copies of sets, should this persist the added and removed flags as well?
  • It may be desirable to have a hard limit on the allowed size of a set. How should this be handled?

The main integration point is exposing sets as a means of filtering the object of interest. An ObjectSetTranslator must be implemented to apply the filter. For example, if I have a query that says:

  • female patients
  • less than 5 years of age
  • who are in my "african americans with conductive hearing loss" set

This translates to a DataContext that looks like this:

{
    "type": "and",
    "children": [{
        "id": 1, // sex datafield
        "operator": "exact",
        "value": "female"
    }, {
        "id": 2, // age datafield
        "operator": "lt"
        "value": 5
    }, {
        "id": 3, // patientset datafield
        "operator": "exact",
        "value": 30 // patientset id
    }]
}

Although it may be queried and exposed as querying on the set itself, the actual query being executed must join through PatientSetThrough where set_id = 30. The translator must map from an ObjectSet subclass to its respective M2M through model. The output SQL would look something like:

SELECT ... some columns ...
FROM "patient" INNER JOIN "patient_set_through" ON ("patient"."id" = "patient_set_through"."item_id")
WHERE "patient"."sex" = 'female' AND "patient"."age" < 5 AND "patient_set_through"."set_id" = 30

DataField "scorecard"

This will help people brand new to the data to get a sense of what the data looks like.

  • Numerical Data - max, min, mean, median, std
  • Categorical Data - distinct list of choices
  • Aggregate Counts (2-D)
    • max, min, mean, median, std
    • outliers (relative to the median)
  • Document-based Data
    • character and word counts
    • max, min, mean, median, std of word counts (2-D)

Rename `Field` to `DataElement`

"Field" not only has meaning in the context of model fields and form fields, but does not encapsulate the raw nature of the data for Avocado. Although the data model is tied to a relational database, which is tied to Django model classes, DataElement feels more descriptive and appropriate for the intended use of the API.

Define "Domain" model for high-level organization

Every definable component in Avocado revolves around the notion of a domain. A domain is merely a high level of organization and up to this point has been implicit to the project and implementation. For a project focused on nutrition, the implicit domain is "nutrition", but there may be a desire for a multi-domain project which could include co-operative domains or domain supersets.

The role of the Domain has changed since its initial inception. It is indeed a high level (the highest at the current time) of organization, but it is now required. Concepts are associated with a particular domain, which provides additional context to the intended audience.

DataContext text renderer

A text renderer takes a DataField instance, operator and value and composes a human-readable representation. The simplest task is rendering a single condition. Representations can become a bit more complicated with deeply nested conditions. One representation could be a bulleted list where the top-level list header defines the logical operator between items in the immediate list.
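The bulleted-list representation could be sketched like this (a hypothetical recursive renderer, reusing the node shapes from the DataContext JSON shown in the ObjectSet issue): branch nodes render their logical operator as the list header and leaf conditions as indented items.

```python
def render(node, depth=0):
    """Render a condition tree as an indented, bulleted text outline."""
    pad = '  ' * depth
    if 'children' in node:
        # Branch: the logical operator heads the list of its children.
        lines = [pad + node['type'].upper() + ':']
        for child in node['children']:
            lines.extend(render(child, depth + 1))
        return lines
    # Leaf: a single human-readable condition.
    return ['{}- {} {} {}'.format(pad, node['field'], node['operator'], node['value'])]

tree = {
    'type': 'and',
    'children': [
        {'field': 'sex', 'operator': 'is', 'value': 'female'},
        {'type': 'or', 'children': [
            {'field': 'age', 'operator': '<', 'value': 5},
            {'field': 'age', 'operator': '>', 'value': 65},
        ]},
    ],
}
print('\n'.join(render(tree)))
```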

Add Formatter templatetag and templatefilter

At a minimum, a valid formatter name must be defined and the data to be formatted is required. A template filter could accomplish this:

{{ data|formatter:"Fancy Mapping" }}

where data can be a single value, an array, or a dict. This will simply echo back the value (without any keys associated with it) and is most appropriate for simple formatters that map raw values or process lists.

For more robust uses, the templatetag can be used:

{% avocado format "Fancy Mapping" data as mapped %}

{{ mapped.key1 }} / {{ mapped.key2 }}

This will return the full object with keys set to the mapped context variable.

Improvement to human-readable scope conditions in case of value equal to None

Currently the conditions property on Scope (logic: tree.transform(self._get_obj()).text) returns 'is equal to has no value' for equality criteria matching 'No Data'. IMHO, this should either be 'is equal to no value' or, even better, be rewritten as 'has no value'. Similarly for the negation. In fields/operators.py, the Exact and iExact classes could override the text method.

Also, the stringify method of Operator is buggy when the value is None and the operator is NotExact or NotiExact; such a condition reads 'is not equal to has any value', which is the opposite of the correct meaning. It should read something like 'is not equal to no value', or better, 'has any value'.

(I notice the existence of Null and NotNull operator classes; I'm not sure how they play into all this.)
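The suggested rendering could look like this (a standalone sketch, not the actual Exact/NotExact implementations): None is special-cased so the text reads "has no value" / "has any value" instead of the buggy composed forms.

```python
def stringify(operator, value):
    """Human-readable text for an (operator, value) pair, special-casing
    None so 'is equal to has no value' never appears."""
    negated = operator.startswith('not')
    if value is None:
        return 'has any value' if negated else 'has no value'
    verb = 'is not equal to' if negated else 'is equal to'
    return '{} {}'.format(verb, value)

print(stringify('exact', None))      # has no value
print(stringify('notexact', None))   # has any value
print(stringify('exact', 'female'))  # is equal to female
```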

I can fork and make changes if you want.

DataFieldInterface classes for specialization

What is apparent with the Lexicon and ObjectSet classes is their slightly different behavior from the default DataField behavior. For example, DataFields representing a Lexicon should be ordered relative to their order field: https://github.com/cbmi/avocado/blob/2.x/avocado/models.py#L201-202 (which is defined on their Meta class).

The idea is to be able to associate an interface with a class/type of DataFields, e.g. Lexicon types or ObjectSets. The implementation should consider granularity. For example, there are instance-level classes such as ViewSets, Translators, and Formatters; these are instance-level due to the specialization of each instance. An interface seems more appropriate at the class level (although it may only impact a few DataFields in the end).

An interface is also constrained in that most other features depend on its behavior, so it should not be easily changeable, e.g. as an editable field in the admin interface.

The methods and properties exposed on the DataField class could act as the starting point for what can be overridden.

Hooks for dynamic DataConcept model

To support the true notion of a concept, the DataConcept should support integrating various mixins with fields, methods and properties supporting different implementations of the DataConcept.

Improve condition parser and generator

Currently all the work is being done here (as described) https://github.com/cbmi/avocado/blob/develop/avocado/query/nodes.py#L2-39

There are a few scenarios where an incoming condition must be parsed for validation and subsequently for condition generation (when/if implemented, an authorization step would precede validation, though in practice authorization has rarely been necessary). These two steps are technically independent: in many cases the condition generation step is not necessary until the query itself actually gets executed, while the validation step is more common, since the API is used to gradually construct query conditions.

To prevent redundant parsing and generation, when a node in the tree changes, no other nodes need to be re-parsed. The only exception is when nodes are deleted and the parent container node needs to be collapsed (that is, it now contains only a single node).

Using something like https://github.com/stefankoegl/python-json-patch to make proper edits to parts of a large JSON-like structure may be a good strategy.

Data model snapshots for change detection

Similar to South, in that serializing the data model allows for detecting changes. This would enable updating various metadata-dependent references with the new content or removing references to deprecated metadata.
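One way to sketch the snapshot idea (hypothetical, loosely inspired by South's frozen models): serialize each model's fields deterministically, hash the result, and compare against the previously stored hash to detect that metadata-dependent references need updating.

```python
import hashlib
import json

def snapshot(model_fields):
    """model_fields: {model_name: {field_name: field_type}}.
    Returns a stable digest of the serialized data model."""
    payload = json.dumps(model_fields, sort_keys=True)
    return hashlib.sha1(payload.encode('utf-8')).hexdigest()

old = snapshot({'Patient': {'sex': 'CharField', 'age': 'IntegerField'}})
new = snapshot({'Patient': {'sex': 'CharField'}})  # 'age' was removed

# A changed digest signals that metadata referencing the removed field
# (e.g. a DataField for 'age') is now deprecated.
print(old != new)  # True
```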

Better strategy for defining keys during formatting

Formatter data is passed in as an OrderedDict, which means that if two fields with the same name (from different models) are part of the same concept, the keys will clash and the first value will be overwritten.
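One possible strategy (a sketch, not the actual formatter code) is to qualify only the clashing keys with their model name, so two same-named fields from different models no longer overwrite each other:

```python
def build_keys(fields):
    """fields: list of (model_name, field_name) pairs. Returns unique key
    names, qualifying with the model name only when the bare name clashes."""
    names = [field for _, field in fields]
    keys = []
    for model, field in fields:
        if names.count(field) > 1:
            keys.append('{}_{}'.format(model, field))
        else:
            keys.append(field)
    return keys

fields = [('patient', 'name'), ('physician', 'name'), ('patient', 'age')]
print(build_keys(fields))  # ['patient_name', 'physician_name', 'age']
```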

Support for sending a raw input value alongside the "clean" value in DataContext trees

There are quite a few cases where the client will select from a set of labels corresponding to some raw value rather than entering the value directly. The problem occurs when attempting to echo back the original input value. For key-based fields, the integer is used to query the database directly, not the label representation, which means that in order to display the query conditions to the client, a database hit would have to occur to retrieve the label.

A cleaner approach is to allow storing the corresponding raw input or label alongside the value that will be queried.
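As a sketch, a condition node could carry both the queryable value and the original label (the key names here are illustrative, mirroring the DataContext JSON shape used elsewhere in these issues), so conditions can be echoed back without a database hit:

```python
condition = {
    'id': 2,                  # datafield
    'operator': 'exact',
    'value': 7,               # raw key queried against the database
    'label': 'Hearing Loss',  # display value echoed back to the client
}

def display_value(node):
    # Prefer the stored label; fall back to the raw value when absent.
    return node.get('label', node['value'])

print(display_value(condition))  # Hearing Loss
print(display_value({'id': 1, 'operator': 'lt', 'value': 5}))  # 5
```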

Permissions via DataContext-like objects

Existing Solutions

There are two typical levels of permission granting for data:

  • table-wide
  • row-level

Table-wide permissions are typically too coarse and sometimes promote bad practices like partitioning a table by some key. This is generally only appropriate for multi-tenant solutions, where separate database schemas or databases are generally a cleaner solution.

Row-level permissions are generally granular enough, but can be difficult to manage and potentially very expensive to enforce depending on the size of the table the lookup is performed on.

Conditional Permissions

A DataContext can be constructed consisting of one or more conditions that act as a pre-filter for any query generated by Avocado. This can be applied globally and/or at a group/user level.

The benefits include:

  • a single object, applied to any applicable query, that can be tweaked and created independently of the data
  • the generic nature of the condition tree means it is not limited to a particular table, but applies to the whole query
  • no costly lookup on the data itself to remove rows a user cannot see; the permissions are inherent in the conditions
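How a pre-filter might be applied can be sketched as follows (hypothetical function; the tree structure mirrors the DataContext JSON shown in the ObjectSet issue): the user's condition tree is AND-ed with the pre-filter tree before query generation, so excluded rows can never appear in any result.

```python
def apply_prefilter(user_tree, prefilter_tree):
    """Wrap the user's DataContext tree so the pre-filter conditions are
    always enforced, regardless of what the user queried."""
    return {'type': 'and', 'children': [prefilter_tree, user_tree]}

# e.g. a global pre-filter: only consented patients are visible
prefilter = {'id': 4, 'operator': 'exact', 'value': True}
user_query = {'id': 1, 'operator': 'exact', 'value': 'female'}

combined = apply_prefilter(user_query, prefilter)
print(combined['type'])  # and
```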

Design Decisions

The design decisions that need to be made include:

  • subclass DataContext as is, or turn DataContext into an abstract class and subclass it for separate models
  • what is the interface for defining one of these pre-filter objects
  • what is an appropriate name for these objects
  • should pre-filters not associated with users or groups act as global ones by default?
    • the published flag can be used to toggle the filters on and off
  • at what level should these pre-filters be applied?
    • if these are treated as Just Another API™ this will enable downstream usage to be flexible and opt-in
      • for example, Serrano would have to apply these explicitly within its resources
    • up to this point Avocado has been treated as a programmer's library, not so much a definitive solution for the problem domain
