chop-dbhi / avocado

Metadata APIs for Django
Home Page: http://avocado.harvest.io
License: Other
#44 may be rolled up into this.
Code of interest:
A hidden feature that was implemented a while back (1dacc30) is the notion of composite DataContexts. Normally, when two sets of query conditions are combined, the result is a new object that has copied the contents of the source objects. A composite DataContext simply references the objects that make it up. This has the benefit (and side effect) of being a dynamic set of conditions made up of one or more other sets of conditions.
This is primarily useful for creating building blocks that are used in combination to create something more complex:
A ----\
C
B ----/
The problem with simply copying a set of conditions into a more complicated query is that you need to go back and update the copies when your conditions change. If you change A or B, you must also change C. By simply referencing A and B in C, those changes are picked up automatically since the conditions are compiled at runtime.
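A rough sketch of the copy-vs-reference distinction (class and method names here are hypothetical, not Avocado's actual API): a composite context keeps references to its children and only resolves their conditions when compiled.

```python
# Hypothetical sketch: a composite context references its children instead
# of copying their conditions, so edits to a child are seen at compile time.
class Context:
    def __init__(self, conditions=None, composites=None):
        self.conditions = list(conditions or [])
        self.composites = list(composites or [])  # referenced child contexts

    def compile(self):
        # Conditions are resolved here, at "runtime", not at creation time.
        resolved = list(self.conditions)
        for child in self.composites:
            resolved.extend(child.compile())
        return resolved

a = Context(conditions=["sex = 'female'"])
b = Context(conditions=["age < 5"])
c = Context(composites=[a, b])   # C references A and B

a.conditions[0] = "sex = 'male'"  # change A...
print(c.compile())                # ...and C sees it automatically
```

Had C copied A's conditions at creation time, the change to A would have required re-building C by hand.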
It might be nice to not only order categories but to mark sub-categories as "default", which would cause those to be initially displayed instead of "All". This would allow a category to be broken up into basic/advanced or common/uncommon subcategories, with the most common (less overwhelming) subset being displayed to the user first.
This isn't a bug, but rather a constraint of how DISTINCT .. ORDER BY queries are performed in (most?) database implementations. The problematic side effect is redundant output caused by ordering across a one-to-many relationship without actually wanting the related column in the SELECT clause.
A statement like

SELECT DISTINCT foo
...
ORDER BY bar, baz

is not valid because all ORDER BY columns must exist in the DISTINCT clause. One approach is for Avocado to handle this in the exporter API. This method would need to prune the added columns transparently (which it actually does anyway..) and exclude redundant rows by checking against the previously iterated row.
Another consideration is using an ordered subquery, but there is no guarantee the ordering remains on the outer query (http://stackoverflow.com/a/2101925 and http://stackoverflow.com/a/5119308).
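The exporter-side approach described above can be sketched as a generator (names and row shapes are illustrative): trailing columns that were added only for ORDER BY are stripped, and rows that become duplicates of the previously yielded row are skipped.

```python
def prune_and_dedupe(rows, n_extra):
    """Strip the trailing ORDER BY-only columns from each row, then skip
    rows that are duplicates of the previously yielded row. Assumes the
    input is already ordered, so duplicates are adjacent."""
    prev = object()  # sentinel that never equals a real row
    for row in rows:
        pruned = row[:-n_extra] if n_extra else row
        if pruned != prev:
            yield pruned
            prev = pruned

# Rows as they come back from the DB, with one extra ordering column:
rows = [(1, 'a', 10), (1, 'a', 20), (2, 'b', 5)]
print(list(prune_and_dedupe(rows, 1)))  # [(1, 'a'), (2, 'b')]
```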
Log history of query generation. This includes:

- DataContext
- DataView
- DataContextView

Code of interest:
Specifically, the properties values, coded_values, and distribution can be cached for as long as the underlying data does not change. If an instance is saved or deleted, these properties may no longer be valid.
A potential (naïve) solution would be to add post_delete and post_save signal handlers, but in the case of a batch update these handlers would get executed potentially thousands of times, which would be redundant.
Another thing to consider is the means by which the underlying data will change. For large amounts of data, multiple rows may be altered using the QuerySet methods bulk_create, update, and delete, but rarely would the equivalent Model methods be used. A more likely scenario is using external ETL processes for altering the data.
At a bare minimum, an instance method should be defined on the Field model for invalidating the cache.
The use case is primarily that when the data changes or translators are modified, the resulting counts may change as well.
One side effect of this is the user noticing the count changes, but this should be communicated by some other means (e.g. message, changelog, etc.).
Most likely a command will suffice, one which can optionally take specific app, model, or field names targeting specific datacontexts. This has the performance benefit of not having to perform a recount for every query.
For consistent exports over time, various software (e.g. SAS, R) requires a particular coding scheme for strings. Rather than having the downstream process keep track of the mapping, each value for all string-based data can be coded ahead of time.
This would entail a single model:
class CodedValue(models.Model):
    field = models.ForeignKey(Field)
    value = models.CharField(max_length=100)
    coded = models.IntegerField()

    class Meta(object):
        unique_together = ('field', 'value')
        verbose_name = 'coded value'
        verbose_name_plural = 'coded values'
field can be nullable in the case that a coded value is shared across multiple fields. Thus a lookup would first try to find a coded value for the particular field, then fall back to the non-field-specific coding for that value.
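The two-step lookup can be sketched like this (a plain dict stands in for the CodedValue table; the function name is illustrative). A key of None plays the role of a NULL field, i.e. a code shared across fields.

```python
def get_code(coded, field, value):
    """Look up the code for (field, value), falling back to the shared,
    field-independent coding keyed by field=None. Returns None when no
    code exists at all. Sketch only; the real lookup would query the
    CodedValue model."""
    if (field, value) in coded:
        return coded[(field, value)]
    return coded.get((None, value))

coded = {(None, 'Yes'): 1, ('status', 'Yes'): 7}
print(get_code(coded, 'status', 'Yes'))  # 7 (field-specific wins)
print(get_code(coded, 'other', 'Yes'))   # 1 (shared fallback)
```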
A Concept represents one or more Definition instances providing various query and output customizations.
Each time the DataContext is validated, the respective node can be marked with a flag that informs downstream clients the node will not be used when the query is executed.
This will ensure errors are not thrown on the server as a result of a missing datafield.
- DataConcept and DataField (see https://github.com/cbmi/avocado/blob/develop/avocado/templates/search/indexes/avocado/dataconcept_text.txt)
- A SearchManager of sorts using the haystack APIs for a clean API; for example, DataConcept.objects.search('male') should find the sex DataConcept object and any other concepts associated with that text

As the title states..
It is common to save off sets of objects (e.g. patients) for later reference. This enables performing actions on the set without the constraints of the query conditions, as well as performing basic set operations such as union and intersection between sets of the same type.
The main benefits include:
For models that need to be set-enabled, the model should subclass ObjectSet, which will make it easier elsewhere to make use of such models. The subclass must implement a ManyToManyField pointing to the model of interest.
The pseudo-implementation is as follows:
- ObjectSet must implement the __and__ and __or__ hooks for set-like behavior between instances of the same type
- ObjectSet must have name, description, and user fields, as well as auto-updating created and modified datetime fields
- ObjectSet should store the current size, auto-updated on every change
- ObjectSet should have a foreign key to avocado.models.DataContext to optionally keep track of the original set of conditions used to construct the set
- ObjectSetThrough class (for the M2M relationship) that keeps track of added and removed objects from the set that were not originally in the set (upon initial creation)

The usage should look like this:
from avocado.sets import ObjectSet, ObjectSetThrough

class PatientSet(ObjectSet):
    patients = models.ManyToManyField(Patient, through='PatientSetThrough')

class PatientSetThrough(ObjectSetThrough):
    # shadowing the built-in is ok here..
    set = models.ForeignKey(PatientSet)
    item = models.ForeignKey(Patient)
Things to consider:

- set and item may be useful for generically making use of ObjectSet subclasses across the API
- added and removed flags as well?

The main integration point is exposing sets as a means of filtering the object of interest. An ObjectSetTranslator must be implemented to apply the filter. For example, if I have a query that says:
This translates to a DataContext that looks like this:
{
"type": "and",
"children": [{
"id": 1, // sex datafield
"operator": "exact",
"value": "female"
}, {
"id": 2, // age datafield
"operator": "lt",
"value": 5
}, {
"id": 3, // patientset datafield
"operator": "exact",
"value": 30 // patientset id
}]
}
Although it may be queried and exposed as querying on the set itself, the actual query being executed must join from Patient → PatientSetThrough where set_id = 30. The translator must map from an ObjectSet subclass to its respective M2M through model. The output SQL would look something like:
SELECT ... some columns ...
FROM "patient" INNER JOIN "patient_set_through" ON ("patient"."id" = "patient_set_through"."item_id")
WHERE "patient"."sex" = 'female' AND "patient"."age" < 5 AND "patient_set_through"."set_id" = 30
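The translator's core mapping step can be sketched as deriving an ORM lookup from the through relationship (the helper name is hypothetical; a real ObjectSetTranslator would resolve the through model from the ObjectSet subclass rather than take its name as a string):

```python
# Sketch: derive the filter kwargs that make the ORM emit the join above,
# e.g. Patient.objects.filter(**set_filter('patientsetthrough', 30)).
def set_filter(through_accessor, set_id):
    """Map a set condition to an ORM lookup that joins through the M2M
    table and constrains on the set's primary key."""
    return {'%s__set_id' % through_accessor: set_id}

print(set_filter('patientsetthrough', 30))
# {'patientsetthrough__set_id': 30}
```

Combined with the sex and age conditions, this produces the same INNER JOIN / WHERE clause shown in the SQL above.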
This will help people brand new to the data to get a sense of what the data looks like.
"Field" not only has meaning in the context of model fields and form fields, but also does not encapsulate the raw nature of the data for Avocado. Although the data model is tied to a relational database, which is tied to Django model classes, DataElement feels more descriptive and appropriate for the intended use of the API.
This applies to DataContext and DataView objects to limit the number of items in history.
This is partly to support #35, but also enables writing formatters dealing with variably-sized arrays of values.
Every definable component in Avocado revolves around the notion of a domain. A domain is merely a high level of organization and up to this point has been implicit to the project and implementation. For a project focused on nutrition, the implicit domain is "nutrition", but there may be a desire for a multi-domain project which could include co-operative domains or domain supersets.
The role of the Domain has changed since the initial inception. It is indeed a high level (the highest at the current time) of organization, but it is now required. Concepts are associated with a particular domain, which provides additional context to the intended audience.
A text renderer takes a DataField instance, an operator, and a value and composes a human-readable representation. The simplest task is rendering a single condition. Representations become a bit more complicated with deeply nested conditions. One representation could be a bulleted list where the top-level list header defines the logical operator between items in the immediate list.
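The bulleted-list representation could be sketched as a small recursive renderer (node shapes and field names here are illustrative, not Avocado's condition format):

```python
def render(node, depth=0):
    """Render a condition tree as a bulleted list: branch nodes become a
    header naming the logical operator; leaves become '* field op value'."""
    pad = '  ' * depth
    if 'type' in node:  # branch node: 'and' / 'or'
        lines = [pad + node['type'].upper() + ':']
        for child in node['children']:
            lines.extend(render(child, depth + 1))
        return lines
    return ['%s* %s %s %s' % (pad, node['field'], node['operator'], node['value'])]

tree = {'type': 'and', 'children': [
    {'field': 'sex', 'operator': 'is equal to', 'value': 'female'},
    {'field': 'age', 'operator': 'is less than', 'value': 5},
]}
print('\n'.join(render(tree)))
# AND:
#   * sex is equal to female
#   * age is less than 5
```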
These attributes are mutually exclusive but are represented as separate model fields.. not cool.
Code of interest:
The Definition model describes a single part of the data model integrated with Avocado.
Integrated in the old Avocado codebase: https://github.com/cbmi/django-avocado/blob/master/avocado/cache.py
Clean up unnecessary complexity and give them better names.
At a minimum, a valid formatter name must be defined and the data to be formatted is required. A template filter could accomplish this:
{{ data|formatter:"Fancy Mapping" }}
where data can be a single value, an array, or a dict. This will simply echo back the value (without any keys associated with it) and is most appropriate for simple formatters that map raw values or process lists.
For more robust uses, the templatetag can be used:
{% avocado format "Fancy Mapping" data as mapped %}
{{ mapped.key1 }} / {{ mapped.key2 }}
This will return the full object with keys, assigned to the mapped context variable.
Currently the conditions property on Scope (logic tree.transform(self._get_obj()).text) returns 'is equal to has no value' for equality criteria matching 'No Data'. IMHO, this should either be 'is equal to no value' or, even better, be re-written as 'has no value'. Similarly for the negation. In fields/operators.py, the Exact and iExact classes could override the text method.
Also, the stringify method of Operator is buggy when the value is None and the operator is NotExact or NotiExact; such a condition will read 'is not equal to has any value', which is the opposite of the correct meaning. It should read something like 'is not equal to no value', or better, 'has any value'.
(I notice the existence of Null and NotNull operator classes; I'm not sure how they play into all this.)
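The suggested override might look like this (a sketch against stand-in classes, not Avocado's actual operator base class or signatures):

```python
# Sketch of special-casing None in the human-readable text, so the reader
# sees 'has no value' / 'has any value' instead of the garbled phrases.
class Exact:
    text = 'is equal to'

    def stringify(self, value):
        if value is None:
            return 'has no value'
        return '%s %s' % (self.text, value)

class NotExact(Exact):
    text = 'is not equal to'

    def stringify(self, value):
        if value is None:
            # not 'is not equal to has any value'
            return 'has any value'
        return '%s %s' % (self.text, value)

print(Exact().stringify(None))       # has no value
print(NotExact().stringify(None))    # has any value
print(Exact().stringify('female'))   # is equal to female
```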
I can fork and make changes if you want.
A standalone app e.g. avocado-docs
See prefetch_related
What is apparent with the Lexicon and ObjectSet classes is their slightly different behavior from the default DataField behavior. For example, datafields representing a Lexicon should be ordered relative to their order field: https://github.com/cbmi/avocado/blob/2.x/avocado/models.py#L201-202 (which is defined on their Meta class).
The idea is to be able to associate an interface with a class/type of datafields, e.g. Lexicon types or ObjectSets. The implementation should consider the granularity. For example, there are instance-level classes such as ViewSets, Translators, and Formatters. These are as such due to the specialization of each instance. An interface seems to be more appropriate at a class level (although it may only impact a few datafields in the end).
An interface is also constrained in that most other features depend on its behavior, so it should not be changed easily, e.g. as an editable field in the admin interface.
The methods and properties exposed on the DataField class could act as the starting point for what can be overridden.
To support the true notion of a concept, the DataConcept should support integrating various mixins with fields, methods, and properties supporting different implementations of the DataConcept.
This should happen right before the final release
When using the sync command, the enable_choices flag could be set given some threshold number of distinct values.
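The heuristic is simple enough to sketch (the threshold value here is illustrative, not a recommendation):

```python
def should_enable_choices(distinct_count, threshold=50):
    """Sketch of the sync heuristic: treat a field as enumerable (and set
    enable_choices) when its distinct-value count is at or below some
    threshold."""
    return distinct_count <= threshold

print(should_enable_choices(12))    # True  -> set enable_choices
print(should_enable_choices(5000))  # False -> leave it unset
```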
Currently all the work is being done here (as described) https://github.com/cbmi/avocado/blob/develop/avocado/query/nodes.py#L2-39
There are a few scenarios where an incoming condition must be parsed for validation and, subsequently, condition generation (when/if implemented, an authorization step would precede validation, though in practice authorization has rarely been necessary). These two steps are technically independent, though. In many cases the condition generation step is not necessary until the query itself actually gets executed. The validation step is more common as the API is being used to gradually construct query conditions.
To prevent redundant parsing and generation, when a node in the tree changes, no other nodes need to be re-parsed. The only exception is when nodes are deleted and the parent container node needs to be collapsed (that is, it now only contains a single node).
Using something like https://github.com/stefankoegl/python-json-patch to make proper edits to parts of a large JSON-like structure may be a good strategy.
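The idea can be shown with a stdlib-only sketch of a single JSON Patch 'replace' operation (RFC 6902); the python-json-patch library linked above implements the full operation set. Only the targeted node changes, so no other nodes need to be re-parsed:

```python
def apply_replace(doc, pointer, value):
    """Minimal sketch of a JSON Patch 'replace' op: walk a JSON Pointer
    like '/children/1/value' into a nested dict/list structure and set
    the target in place."""
    parts = [p for p in pointer.split('/') if p]
    target = doc
    for part in parts[:-1]:
        target = target[int(part)] if isinstance(target, list) else target[part]
    last = parts[-1]
    if isinstance(target, list):
        target[int(last)] = value
    else:
        target[last] = value
    return doc

context = {'type': 'and', 'children': [
    {'id': 1, 'operator': 'exact', 'value': 'female'},
    {'id': 2, 'operator': 'lt', 'value': 5},
]}
apply_replace(context, '/children/1/value', 10)
print(context['children'][1]['value'])  # 10; the sibling node is untouched
```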
Code of interest:
Similar to South in that serializing the data model allows for detecting changes. This would enable updating various metadata dependent references with the content or removing references for deprecated metadata.
Formatter data is passed in as an OrderedDict, which means that if two fields with the same name (from different models) are part of the same concept, the keys will clash and the first value will be overwritten.
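The clash, and one possible fix (qualifying later duplicates with their model name; the naming scheme here is illustrative), can be demonstrated directly:

```python
from collections import OrderedDict

def build_row(fields):
    """Build the formatter input dict from (model, field_name, value)
    triples, prefixing a key with its model name when the bare field
    name has already been used."""
    row = OrderedDict()
    for model, name, value in fields:
        key = name if name not in row else '%s_%s' % (model, name)
        row[key] = value
    return row

fields = [('patient', 'name', 'Ada'), ('physician', 'name', 'Grace')]
print(build_row(fields))
# OrderedDict([('name', 'Ada'), ('physician_name', 'Grace')])
```

Without the disambiguation, the second assignment would silently overwrite the first 'name' entry.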
There are quite a few cases where the client will select from a set of labels corresponding to some raw value, rather than using the value itself or entering a value directly. The problem occurs when attempting to echo back the original input value. For key-based fields, the integer is used to query the database directly, not the label representation, which means that in order to display the query conditions to the client, a database hit would have to occur to retrieve the label.
A cleaner approach is to allow storing the corresponding raw input or label with the value that will be queried with.
There are two typical levels of permission granting for data:
Table-wide permissions are typically too coarse and sometimes promote bad practices like partitioning a table by some key. This is generally only appropriate for multi-tenant solutions, and separate database schemas or databases are generally a cleaner solution for that.
Row-level permissions are generally granular enough, but can be difficult to manage and potentially very expensive to enforce, depending on the size of the table the lookup is performed on.
A DataContext can be constructed consisting of one or more conditions that act as a pre-filter for any query generated by Avocado. This can be applied globally and/or at a group/user level.
The benefits include:
The design decisions that need to be made include:
- use DataContext as is, or turn DataContext into an abstract class and subclass it for separate models
- the published flag can be used to toggle the filters on and off

There are too many dependencies on the DataField.simple_type property to allow overriding existing types.
Code of interest: