quangis / transforge


Describe processes as type transformations, with inference that supports subtypes and parametric polymorphism. Create and query corresponding transformation graphs.

License: GNU General Public License v3.0

Languages: Python 99.29%, Shell 0.71%
Topics: type inference subtype subsumption constraint transformation workflow abstract tool polymorphism


transforge's Issues

Move querying to Flow

After #39, we can now properly organize the SPARQL stuff where it belongs. Perhaps also rename Flow to WorkflowQuery for clarity.

Optionally leave out step predicates

The ta:step predicate connects transformations to each data step along the way. This should be an option, since it clutters the visualization and is not vital.

Inferred types are not aware of schematic variables

Consider the following function:

compose = Operation(
    lambda α, β, γ: (β ** γ) ** (α ** β) ** (α ** γ),
    derived=lambda f, g, x: f(g(x))
)

Clearly, the type inferred from the derivation should correspond to the
declared type for this operation. However, swapping the arguments in the
declared type does not immediately lead to an error:

compose = Operation(
    lambda α, β, γ: (α ** β) ** (β ** γ) ** (α ** γ),
    derived=lambda f, g, x: f(g(x))
)

An error will still be raised, but only once we use it:

A = Type.declare("A")
B = Type.declare("B")
C = Type.declare("C")
op1 = Operation(A ** B)
op2 = Operation(B ** C)
compose(op2, op1)

It would be helpful if a wrong declared type were caught immediately, at the time it is defined.

Canonical type nodes in RDF

Compound types are currently represented with blank nodes pointing to parameter type nodes: a new blank node for each instance! Since type nodes are used so often, we should introduce reusable canonical type nodes for efficient storage and retrieval.
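
One way to achieve this is to mint a deterministic IRI from the type's structure, so that equal types always map to the same node. A minimal sketch with rdflib; the namespace and the helper function are illustrative, not the library's actual API:

from hashlib import sha256

from rdflib import Graph, Namespace, URIRef

TA = Namespace("https://example.org/transforge#")  # illustrative namespace

def canonical_type_node(graph: Graph, operator: str, params: tuple) -> URIRef:
    """Mint one reusable node per compound type by hashing its
    structure, instead of creating a fresh blank node per instance."""
    key = sha256(repr((operator, params)).encode()).hexdigest()[:16]
    node = TA[f"type-{key}"]
    if (node, None, None) not in graph:
        # describe the type only on first use; later uses share the node
        for i, param in enumerate(params):
            graph.add((node, TA[f"param{i + 1}"], param))
    return node

Canonical nodes also help retrieval: a query for a known compound type becomes a single node lookup rather than a blank-node pattern match.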

Confusing behaviour of equality operator.

As it is, the equality operator can hold that v == o is false but o == v is true, if o is a type operation and v is a variable that is bound to o. This is confusing.
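
A self-contained illustration of the asymmetry; Variable and Op are generic stand-ins, not the library's actual classes:

class Op:
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # follows the other side's binding, so Op == Variable can be True...
        if isinstance(other, Variable) and other.bound is not None:
            return self == other.bound
        return isinstance(other, Op) and self.name == other.name

class Variable:
    def __init__(self):
        self.bound = None

    # ...but Variable.__eq__ does not follow its own binding
    def __eq__(self, other):
        return self is other

o = Op("F")
v = Variable()
v.bound = o
print(v == o, o == v)  # False True

Following bindings on both sides before comparing, or resolving variables first, would restore symmetry.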

Infinite loop in constraints.

Fixing issue #11 revealed that the following gets into an infinite loop:

    A = Type.declare('A')
    F = Type.declare('F', params=2)
    B = Type.declare('B', supertype=A)
    TypeSchema(lambda x: x | F(_, A) @ [F(A, x), F(B, x)])

Reuse blank type nodes in a single transformation graph

In a single transformation graph, blank type nodes representing the same type should be reused rather than duplicated. This is related to, but not the same as, issue #25. To achieve this, we will make TransformationGraph a subclass of rdflib.Graph that keeps track of the type nodes that have already been created. The hashing functionality implemented in db95ab8 will be useful here.

As a side effect, the options of the rdf_* methods of TransformationAlgebraRDF can be made properties of this graph, simplifying the code significantly.
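
A sketch of the proposed subclass; the type_key argument stands in for the hashing functionality from db95ab8:

from rdflib import BNode, Graph

class TransformationGraph(Graph):
    """An rdflib.Graph that remembers which type nodes it has already
    created, so equal types share one blank node per graph."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._type_nodes = {}

    def type_node(self, type_key) -> BNode:
        # reuse the existing node for this type, if any
        if type_key not in self._type_nodes:
            self._type_nodes[type_key] = BNode()
        return self._type_nodes[type_key]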

Wildcards cause overeager minimization of constraints.

While F(A,_) is a subtype of F(_,A), x @ [F(A,_), F(_,A)] should not reduce to x @ F(A,_) --- they really are two distinct options.

Therefore, wildcards should not unconditionally count as a sub- and supertype of everything --- only when the next step is to unify them. Add an always_accept_wildcards option to .subtype().

I suspect a relation to issue #16.

Test coverage

Test coverage should be automatically maintained via coverage.py.

Necessary tests, separated by theme, to get a grip on how the test suite should be organized (a sketch follows the list):

  • type.py

    • Does type inference work?
    • Do subtype relations hold?
    • Are constraints minimized properly?
    • Do constraints raise errors at the appropriate times?
  • expr.py

    • Can we compose expressions? Do the types check?
    • Are the types fixed at the appropriate times?
    • Are DeclarationErrors raised when they should be?
    • Does printing expressions as strings work in the way we expect?
  • lang.py

    • Can we parse everything we want? That is, inline typing, curried and uncurried functions, anonymous sources, labelled sources...
    • Are expressions printed as strings and then parsed equal to the original expressions?
  • graph.py

    • Are expressions properly translated into graphs? What happens when a function is taken as parameter? What happens when that function contains variables? What happens when an expression can be expanded into primitives?
  • query.py

    • Do queries work in all situations?

The above list is to be expanded.
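
For instance, a minimal test for the type.py theme might look like the following. The import path and the boolean return value of .subtype() are assumptions; Type.declare and supertype= follow the examples elsewhere in these issues:

from transforge.type import Type  # exact import path is an assumption

A = Type.declare("A")
B = Type.declare("B", supertype=A)

def test_declared_subtype_relation_holds():
    # B was declared a subtype of A, so B.subtype(A) should hold
    assert B.subtype(A)

def test_subtyping_is_not_symmetric():
    assert not A.subtype(B)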

Automatically query for aspects of Flow

For testing what parts of a query are important, we would like to query for specific aspects of a flow. That is, we might want to query for workflows that match only the output node of a flow; query for workflows that have all the operations of a flow but disregard the chronological ordering; etcetera.

Rather than manually constructing these queries for the paper, they should be constructed programmatically from the 'full' query. Add an argument like aspects: Literal['output', 'types', 'operations', 'order', 'full'] = 'full' to the .query() method.

Constraints are not unified.

Sets of constraints such as x | x @ F(y) | y @ A or x | x @ A | x @ B (where A <= B) are not reduced to, respectively, x | x @ F(A) and x | x @ A.

Pointless constraints remain.

Far too many pointless constraints remain without being fulfilled.

>>> cct.parse("(join_key (select (eq) (lTopo (deify (merge (pi2 (test x)))) (merge (pi2 (test x)))) (in)) (groupby (get) (test x)))")
R3(Loc, _1110, Reg) | R2(Reg, _1110) @ [R2(Reg, _1110)] | R2(_1110, Reg) @ [R2(_1110, Reg)] | R2(_1110, Reg) @ [R2(_1110, Reg)]
 ├─R2(Reg, _1110) ** R3(Loc, _1110, Reg) | R2(Reg, _1110) @ [R2(Reg, _1110)] | R2(_1110, Reg) @ [R2(_1110, Reg)] | R2(_1110, Reg) @ [R2(_1110, Reg)]
 │  ├─╼ join_key : R3(Loc, _1107, Reg) ** R2(Reg, _1110) ** R3(Loc, _1110, Reg) | _1107 >= Nom | R2(Reg, _1110) @ [R2(Reg, _1110)] | R2(_1110, Reg) @ [R2(_1110, Reg)] | R2(_1110, Reg) @ [R2(_1110, Reg)]
 │  └─R3(Loc, Nom, Reg)
 │     ├─Val ** R3(Loc, Nom, Reg)
 │     │  ├─R3(Loc, Nom, Reg) ** Val ** R3(Loc, Nom, Reg)
 │     │  │  ├─╼ select : (_1111 ** Val ** Bool) ** R3(Loc, Nom, Reg) ** Val ** R3(Loc, Nom, Reg) | _1111 <= Val | R3(Loc, Nom, Reg) @ [R3(_1111, _, _), R3(_, _1111, _), R3(_, _, _1111)]
 │     │  │  └─╼ eq : Val ** Val ** Bool
 │     │  └─R3(Loc, Nom, Reg)
 │     │     ├─Reg ** R3(Loc, Nom, Reg)
 │     │     │  ├─╼ lTopo : R1(Loc) ** Reg ** R3(Loc, Nom, Reg)
 │     │     │  └─R1(Loc)
 │     │     │     ├─╼ deify : Reg ** R1(Loc)
 │     │     │     └─Reg
 │     │     │        ├─╼ merge : R1(Reg) ** Reg
 │     │     │        └─R1(Reg)
 │     │     │           ├─╼ pi2 : R2(_1110, Reg) ** R1(Reg) | R2(Reg, _1110) @ [R2(Reg, _1110)] | R2(_1110, Reg) @ [R2(_1110, Reg)] | R2(_1110, Reg) @ [R2(_1110, Reg)]
 │     │     │           └─╼ test x : R2(_1110, Reg) | R2(Reg, _1110) @ [R2(Reg, _1110)] | R2(_1110, Reg) @ [R2(_1110, Reg)] | R2(_1110, Reg) @ [R2(_1110, Reg)]
 │     │     └─Reg
 │     │        ├─╼ merge : R1(Reg) ** Reg
 │     │        └─R1(Reg)
 │     │           ├─╼ pi2 : R2(_1110, Reg) ** R1(Reg) | R2(Reg, _1110) @ [R2(Reg, _1110)] | R2(_1110, Reg) @ [R2(_1110, Reg)] | R2(_1110, Reg) @ [R2(_1110, Reg)]
 │     │           └─╼ test x : R2(_1110, Reg) | R2(Reg, _1110) @ [R2(Reg, _1110)] | R2(_1110, Reg) @ [R2(_1110, Reg)] | R2(_1110, Reg) @ [R2(_1110, Reg)]
 │     └─╼ in : Nom
 └─R2(Reg, _1110) | R2(Reg, _1110) @ [R2(Reg, _1110)] | R2(_1110, Reg) @ [R2(_1110, Reg)] | R2(_1110, Reg) @ [R2(_1110, Reg)]
    ├─R2(_1110, Reg) ** R2(Reg, _1110) | R2(Reg, _1110) @ [R2(Reg, _1110)] | R2(_1110, Reg) @ [R2(_1110, Reg)] | R2(_1110, Reg) @ [R2(_1110, Reg)]
    │  ├─╼ groupby : (R1(_1110) ** _1110) ** R2(_1110, Reg) ** R2(Reg, _1110) | R2(Reg, _1110) @ [R2(Reg, _1110)] | R2(_1110, Reg) @ [R2(_1110, Reg)] | R2(_1110, Reg) @ [R2(_1110, Reg)]
    │  └─╼ get : R1(_1110) ** _1110 | R2(Reg, _1110) @ [R2(Reg, _1110)] | R2(_1110, Reg) @ [R2(_1110, Reg)] | R2(_1110, Reg) @ [R2(_1110, Reg)]
    └─╼ test x : R2(_1110, Reg) | R2(Reg, _1110) @ [R2(Reg, _1110)] | R2(_1110, Reg) @ [R2(_1110, Reg)] | R2(_1110, Reg) @ [R2(_1110, Reg)]

This is probably related to #8 in that not all variables that should know about constraints in fact do know about them.

Reuse nodes in primitives

Primitives are now generated at the expression level: an operation g that is defined as λx y. f(…) and used like g(a, b) will first be converted to f(…[a, b]) and only then transformed to RDF. However, this will cause unnecessary duplication of the subexpressions a, b. The RDF could also point to the same node.

Tests for RDF output

See the methods for checking whether two RDF graphs are isomorphic at https://rdflib.readthedocs.io/en/stable/_modules/rdflib/compare.html and https://rdflib.readthedocs.io/en/stable/_modules/rdflib/tools/graphisomorphism.html.
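
For example, a test helper might compare the generated graph against a hand-written expected graph up to blank node renaming; isomorphic, to_isomorphic and graph_diff are rdflib's own:

from rdflib import Graph
from rdflib.compare import graph_diff, isomorphic, to_isomorphic

def assert_same_graph(actual: Graph, expected: Graph) -> None:
    """Fail with a readable diff when two RDF graphs differ,
    treating blank nodes up to isomorphism."""
    if not isomorphic(actual, expected):
        both, only_actual, only_expected = graph_diff(
            to_isomorphic(actual), to_isomorphic(expected))
        raise AssertionError(
            f"unexpected triples:\n{only_actual.serialize(format='nt')}\n"
            f"missing triples:\n{only_expected.serialize(format='nt')}")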

Check, for example:

  • What happens for data outputs within parameter functions?
  • What happens to data inputs within parameter functions?
  • What happens to nested parameter functions?
  • What happens when params are supplied by outer functions?
  • What happens when f takes multiple function parameters?

Passing an abstraction to a non-primitive function leads to error.

My assumption was that this should not happen, but it does:

http://geographicknowledge.de/vocab/GISTools.rdf#ZonalStatisticsMeanRatio

join_attr (get_attrL (objectregionnominals x2)) (apply1 (fcont (avg) (field x1)) (get_attrL (objectregionnominals x2)))

http://geographicknowledge.de/vocab/GISTools.rdf#ZonalStatisticsMeanInterval

join_attr (get_attrL (objectregionnominals x2)) (apply1 (fcont (avg) (itvfield x1)) (get_attrL (objectregionnominals x2)))

http://geographicknowledge.de/vocab/GISTools.rdf#SpatialJoinCountTess

join_attr (get_attrL (objectregionnominals x2)) (apply1 (ocont (get_attrL (objectregionnominals x1))) (get_attrL (objectregionnominals x2)))

http://geographicknowledge.de/vocab/GISTools.rdf#MergeObjects

join_attr (groupby (compose merge (compose pi2 (join_subset (get_attrL (objectregionnominals x1))))) (apply1 objectify (get_attrR (objectregionnominals x1)))) (getobjectnames (pi2 (get_attrR (objectregionnominals x1))))

Resolve, normalize and follow perform similar tasks

In type.py, resolve() takes the upper or lower bound on variables, follow() follows type variables to their bindings and normalize() applies follow to all variables in a type. These perform confusingly similar tasks, so it would be helpful to combine them.

Simplify internal operations

The attached notes may help in understanding the following rambling.

Right now, we create an internal operation for every operation-as-parameter, and connect them to all parameter values that may be available to them.

This is relatively straightforward for one level of nesting, but it gets unwieldy for operations that are nested inside parameters. Consider that the internal part of an outer operation has access to parameter values that may be passed along to the beginning of the pipeline of the inner operations. So the inner internals must be fed with data produced by outer internals. Therefore, adding ta:feed connections from outer internals to inner internals is the proposed method for dealing with issue #37; see the red lines in the notes.

However, it may be possible instead to use only a single internal operation as a 'black box' for everything that happens 'inside' an operation. Additional operations-as-parameters, as well as nested operations, would simply reuse that one node.

This solution would require cycles (due to the interaction between the internal operation and the unknown order in which it uses other values available to it), but cycles were already possible (see https://github.com/quangis/transformation-algebra/blob/c3474c6c42ff992a4d760c62662a178d29dd577a/tests/test_rdf.py#L269), albeit only in the case of one operation with multiple operations-as-parameters.

The solution would reduce the number of triples. However, it needs deeper consideration before being implemented. What information would we lose by doing it this way? Is the increased occurrence of cycles a performance concern?

Inline type annotation and a notation for source data inputs

Data input functions are functions that don't do anything except introduce data. Ideally, these should be removed; a dedicated notation such as data LABEL : Type would remove the need for making up and defining functions that only serve to mirror the types they introduce.
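
For example, instead of declaring a dummy operation for every source, inputs could be introduced directly. This is hypothetical concrete syntax for the proposed notation, with types borrowed from examples elsewhere in these issues:

data x1 : R2(Loc, Reg)
data x2 : R1(Reg)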

Handling of variables & constraints is slow

Enumerating variables and passing constraints between them is very inefficient, particularly after f6fef34. Ideally, we would have a set-of-variables data structure to keep track of variables, along with a clever way to keep track of their constraints.
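
A disjoint-set (union-find) forest is the textbook structure for this; a minimal sketch, independent of the actual variable class:

class VariableSets:
    """Union-find with path compression: near-constant-time lookup of a
    variable's representative, on which constraints are stored once
    instead of being passed between individual variables."""

    def __init__(self):
        self.parent = {}
        self.constraints = {}  # representative -> set of constraints

    def find(self, v):
        root = self.parent.setdefault(v, v)
        if root != v:
            root = self.find(root)
            self.parent[v] = root  # path compression
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb
            # merge constraint sets when variables are unified
            self.constraints.setdefault(rb, set()).update(
                self.constraints.pop(ra, set()))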

Normalize abstractions

Abstractions that immediately apply their variable, as in λx.f x, should be the same as simply f (η-reduction). Therefore, we should unify the RDF translation processes for primitives and abstractions.
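
A generic sketch of the rewrite on a toy expression tree; the classes are stand-ins for the library's own:

from dataclasses import dataclass

@dataclass
class Var:
    name: str

@dataclass
class App:
    f: object
    x: object

@dataclass
class Abs:
    var: Var
    body: object

def occurs(v, e) -> bool:
    """Does variable v occur free in expression e?"""
    if isinstance(e, Var):
        return e is v
    if isinstance(e, App):
        return occurs(v, e.f) or occurs(v, e.x)
    if isinstance(e, Abs):
        return e.var is not v and occurs(v, e.body)
    return False

def eta_reduce(e):
    """Rewrite λx. f x to f, provided x does not occur free in f."""
    if (isinstance(e, Abs) and isinstance(e.body, App)
            and e.body.x is e.var and not occurs(e.var, e.body.f)):
        return e.body.f
    return e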

Wrap expression trees

Expression trees can get quite large. Optionally wrapping them might help readability immensely.

Use the library within LaTeX

It should be possible to call cct.parse from within LaTeX. Parse trees could then be typeset very easily and changed on a whim.

Allow `__setattr__` access on transformation algebras

The way we define transformation algebras now is something like:

f = Operation(A ** A, name="f")
alg = TransformationAlgebra()
alg.add(f)

Or:

f = Operation(A ** A)
alg = TransformationAlgebra()
alg.add(**globals())

In the first case, we explicitly name an operation and add it to the algebra. In the second case, we don't have to explicitly name it because the name is known from the keyword arguments. Either way, though, all operations are defined before being added. This is because operations that are defined in terms of other operations need to be validated (their declared type needs to correspond to the type inferred from the definition). If not all operations have been defined yet at the point that validation happens, this will fail. (See issue #3).

However, it would cut down on verbosity in some cases to be able to do:

alg = TransformationAlgebra()
alg.f = Operation(A ** A)

To do this, validation needs to happen at a different point.
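
A sketch of how that could look; the deferred validate() method and the name attribute are assumptions about the internals, with Operation as in the snippets above:

from transforge import Operation  # exact import path is an assumption

class TransformationAlgebra:
    def __init__(self):
        # bypass our own __setattr__ for internal state
        object.__setattr__(self, "operations", {})

    def __setattr__(self, name, value):
        if isinstance(value, Operation):
            value.name = name
            self.operations[name] = value  # record, but defer validation
        else:
            object.__setattr__(self, name, value)

    def validate(self):
        # run once, after all operations have been defined (cf. issue #3),
        # so definitions may refer to operations that come later
        for operation in self.operations.values():
            operation.validate()  # hypothetical per-operation check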

Suggestion: RDF integration

Since the aim of transformation algebras is semantic (not implementation), it might be a good idea to add a module that extends transformation algebras to generate RDF vocabularies, and parse expressions as RDF graphs using those vocabularies.

A better DSL for flow queries

Right now, querying workflows is done using Flows, which are constructed by abusing Python's >> operator and ... ellipsis type. This is not future-proof, because it leaves very little room for adding functionality, and there are some other issues (such as: how do you express at what node a flow starts?)

A LINQ-style domain-specific language within Python would probably fit better.
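
Purely as an illustration of the direction, not a design; every name here is hypothetical:

class FlowQuery:
    """Fluent builder: each call records a step and returns self, so
    queries compose left to right without abusing >> and ...."""

    def __init__(self):
        self.steps = []

    def output(self, type_):
        self.steps.append(("output", type_))
        return self

    def via(self, operation):
        self.steps.append(("via", operation))
        return self

    def source(self, label):
        self.steps.append(("source", label))
        return self

q = FlowQuery().output("R3(Loc, Nom, Reg)").via("join_key").source("x")

Such a builder leaves room to grow: starting nodes, ordering constraints and aspect selection (cf. the aspects issue above) can each become a method.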

Infinite loop with schematic variables

Once we define max not as:

Operation(R2(Val, Ord) ** Ord)

but as:

Operation(lambda x: R2(Val, x) ** x | x @ Ord)

Then the following gets into an infinite loop:

groupbyL max (relunion (pi2 (apply2 (lgDist (gridgraph (locationfield x1) (ratiofield x2))) (apply nest (pi1 (locationfield x1))) (accumulate (locationfield x1)))))

String printing

To print variable names, we currently store a name for every variable and rename them at appropriate times before calling __str__. This is error-prone. Creating a Name object and a .text() method on types would be more explicit: it allows one consistent naming scheme per session and leaves room for extra information. Should we include type annotations? Use Unicode? Etcetera.
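
A sketch of such a session-level scheme; the class and method names are only suggestions:

import itertools

class NameScheme:
    """Hand out a stable, pretty name per variable, consistently for
    one session, instead of renaming before every __str__ call."""

    def __init__(self, unicode: bool = True):
        letters = "αβγδεζηθ" if unicode else "abcdefgh"
        self._fresh = itertools.chain(
            iter(letters), (f"t{i}" for i in itertools.count(1)))
        self._names = {}

    def text(self, var) -> str:
        if var not in self._names:
            self._names[var] = next(self._fresh)
        return self._names[var]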

Reduce input-output triples

In the RDF representation, every Operation node is supposed to be coupled to exactly one output Data node. It should be easy to see that simply combining them into a single OperationResult node (or another, more aesthetically pleasing name) would dramatically slash the number of triples in the store, and it would also make for simpler queries.
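
Schematically, in Turtle; class and predicate names other than the ta: namespace itself are hypothetical:

# currently: an operation node plus a separate output data node
:app1 a ta:Operation ; ta:output :data1 .
:data1 a ta:Data ; ta:type :SomeType .

# proposed: one combined node carrying both roles
:app1 a ta:OperationResult ; ta:type :SomeType .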

Fix issue where types in RDF won't be fully unified

Unification happens during concatenation and translation to RDF. This is not a problem when there are no variables in output types, but there might be cases where variables are already encoded in the RDF and then unified later.

Connecting graphs to SPARQL endpoints

The in-memory graphs we use now should be made context-aware and connected to SPARQL endpoints.

Follow-up: adding and removing transformation graphs to the endpoint.
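
rdflib already supports this through its SPARQL store plugins; a minimal sketch, with placeholder endpoint URLs:

from rdflib import Graph, URIRef
from rdflib.plugins.stores.sparqlstore import SPARQLUpdateStore

store = SPARQLUpdateStore(
    query_endpoint="http://localhost:3030/ds/query",
    update_endpoint="http://localhost:3030/ds/update")

# one named graph per transformation graph, so the follow-up of adding
# and removing whole transformation graphs becomes a graph-level operation
g = Graph(store, identifier=URIRef("https://example.org/graphs/workflow1"))
g.add((URIRef("https://example.org/a"),
       URIRef("https://example.org/b"),
       URIRef("https://example.org/c")))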

Variable naming

Variable names in the output are currently printed as _<some number>. This can be confusing and overwhelming. Pretty-printing the remaining variables in expression trees would be user-friendly.

Moreover, schematic variables should be visually differentiated from instance variables, perhaps by giving them a different naming scheme.

Nested functions should know about internals

Consider f (g h) x, with f, g, h functions, (g h) a function, and x data. f should be connected to the same internal data that g is connected to.

,--internal-,
↓       |   |
λ → h → g → f
↑           ↑
`---- x ----'
