quangis / transforge


Describe processes as type transformations, with inference that supports subtypes and parametric polymorphism. Create and query corresponding transformation graphs.

License: GNU General Public License v3.0

Languages: Python 99.29%, Shell 0.71%
Topics: type inference subtype subsumption constraint transformation workflow abstract tool polymorphism


transforge's Issues

Move querying to Flow

After #39, we can now properly organize the SPARQL stuff where it belongs. Perhaps also rename Flow to WorkflowQuery for clarity.

Optionally leave out step predicates

The ta:step predicate connects transformations to each data step along the way. This should be an option, since it clutters the visualization and is not vital.

Inferred types are not aware of schematic variables

Consider the following function:

compose = Operation(
    lambda α, β, γ: (β ** γ) ** (α ** β) ** (α ** γ),
    derived=lambda f, g, x: f(g(x))
)

Clearly, the type inferred from the derivation should correspond to the
declared type for this operation. However, swapping the arguments in the
declared type does not immediately lead to an error:

compose = Operation(
    lambda α, β, γ: (α ** β) ** (β ** γ) ** (α ** γ),
    derived=lambda f, g, x: f(g(x))
)

An error will still be raised, but only once we use it:

A = Type.declare("A")
B = Type.declare("B")
C = Type.declare("C")
op1 = Operation(A ** B)
op2 = Operation(B ** C)
compose(op2, op1)

It would be helpful if a wrong declared type were caught immediately, at the time it is defined.

Canonical type nodes in RDF

Compound types are currently represented with blank nodes pointing to parameter type nodes: a new blank node for each instance! Since type nodes are used so often, we should introduce reusable canonical type nodes for efficient storage and retrieval.
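
One way to achieve this is to mint a deterministic IRI from the type's structure, so that equal types always map to the same node. A minimal sketch with rdflib; the namespace and the helper function are illustrative, not the library's actual API:

from hashlib import sha256

from rdflib import Graph, Namespace, URIRef

TA = Namespace("https://example.org/transforge#")  # illustrative namespace

def canonical_type_node(graph: Graph, operator: str, params: tuple) -> URIRef:
    """Mint one reusable node per compound type by hashing its
    structure, instead of creating a fresh blank node per instance."""
    key = sha256(repr((operator, params)).encode()).hexdigest()[:16]
    node = TA[f"type-{key}"]
    if (node, None, None) not in graph:
        # describe the type only on first use; later uses share the node
        for i, param in enumerate(params):
            graph.add((node, TA[f"param{i + 1}"], param))
    return node

Canonical nodes also help retrieval: a query for a known compound type becomes a single node lookup rather than a blank-node pattern match.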

Confusing behaviour of equality operator.

As it is, the equality operator can hold that v == o is false but o == v is true, if o is a type operation and v is a variable that is bound to o. This is confusing.
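
A self-contained illustration of the asymmetry; Variable and Op are generic stand-ins, not the library's actual classes:

class Op:
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # follows the other side's binding, so Op == Variable can be True...
        if isinstance(other, Variable) and other.bound is not None:
            return self == other.bound
        return isinstance(other, Op) and self.name == other.name

class Variable:
    def __init__(self):
        self.bound = None

    # ...but Variable.__eq__ does not follow its own binding
    def __eq__(self, other):
        return self is other

o = Op("F")
v = Variable()
v.bound = o
print(v == o, o == v)  # False True

Following bindings on both sides before comparing, or resolving variables first, would restore symmetry.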

Infinite loop in constraints.

Fixing issue #11 revealed that the following gets into an infinite loop:

    A = Type.declare('A')
    F = Type.declare('F', params=2)
    B = Type.declare('B', supertype=A)
    TypeSchema(lambda x: x | F(_, A) @ [F(A, x), F(B, x)])

Reuse blank type nodes in a single transformation graph

In a single transformation graph, blank type nodes representing the same type should be reused rather than duplicated. This is related to, but not the same as, issue #25. To achieve this, we will make TransformationGraph a subclass of rdflib.Graph that keeps track of the type nodes that have already been created. The hashing functionality implemented in db95ab8 will be useful here.

As a side effect, the options of the rdf_* methods of TransformationAlgebraRDF can be made properties of this graph, simplifying the code significantly.
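
A sketch of the proposed subclass; the type_key argument stands in for the hashing functionality from db95ab8:

from rdflib import BNode, Graph

class TransformationGraph(Graph):
    """An rdflib.Graph that remembers which type nodes it has already
    created, so equal types share one blank node per graph."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._type_nodes = {}

    def type_node(self, type_key) -> BNode:
        # reuse the existing node for this type, if any
        if type_key not in self._type_nodes:
            self._type_nodes[type_key] = BNode()
        return self._type_nodes[type_key]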

Wildcards cause overeager minimization of constraints.

While F(A,_) is a subtype of F(_,A), x @ [F(A,_), F(_,A)] should not reduce to x @ F(A,_) --- they really are two distinct options.

Therefore, wildcards should not unconditionally count as a sub- and supertype of everything --- only when the next step is to unify them. Add an always_accept_wildcards option to .subtype().

I suspect a relation to issue #16.

Test coverage

Test coverage should be automatically maintained via coverage.py.

Necessary tests, separated by theme, to get a grip on how the test suite should be organized (a sketch follows the list):

  • type.py

    • Does type inference work?
    • Do subtype relations hold?
    • Are constraints minimized properly?
    • Do constraints raise errors at the appropriate times?
  • expr.py

    • Can we compose expressions? Do the types check?
    • Are the types fixed at the appropriate times?
    • Are DeclarationErrors raised when they should be?
    • Does printing expressions as strings work in the way we expect?
  • lang.py

    • Can we parse everything we want? That is, inline typing, curried and uncurried functions, anonymous sources, labelled sources...
    • Are expressions printed as strings and then parsed equal to the original expressions?
  • graph.py

    • Are expressions properly translated into graphs? What happens when a function is taken as parameter? What happens when that function contains variables? What happens when an expression can be expanded into primitives?
  • query.py

    • Do queries work in all situations?

The above list is to be expanded.
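
For instance, a minimal test for the type.py theme might look like the following. The import path and the boolean return value of .subtype() are assumptions; Type.declare and supertype= follow the examples elsewhere in these issues:

from transforge.type import Type  # exact import path is an assumption

A = Type.declare("A")
B = Type.declare("B", supertype=A)

def test_declared_subtype_relation_holds():
    # B was declared a subtype of A, so B.subtype(A) should hold
    assert B.subtype(A)

def test_subtyping_is_not_symmetric():
    assert not A.subtype(B)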

Automatically query for aspects of Flow

For testing what parts of a query are important, we would like to query for specific aspects of a flow. That is, we might want to query for workflows that match only the output node of a flow; query for workflows that have all the operations of a flow but disregard the chronological ordering; etcetera.

Rather than manually constructing these queries for the paper, they should be constructed programmatically from the 'full' query. Add an argument like aspects: Literal['output', 'types', 'operations', 'order', 'full'] = 'full' to the .query() method.

Constraints are not unified.

Sets of constraints such as x | x @ F(y) | y @ A or x | x @ A | x @ B (where A <= B) are not reduced to, respectively, x | x @ F(A) and x | x @ A.

Pointless constraints remain.

Far too many pointless constraints remain without being fulfilled.

>>> cct.parse("(join_key (select (eq) (lTopo (deify (merge (pi2 (test x)))) (merge (pi2 (test x)))) (in)) (groupby (get) (test x)))")
R3(Loc, _1110, Reg) | R2(Reg, _1110) @ [R2(Reg, _1110)] | R2(_1110, Reg) @ [R2(_1110, Reg)] | R2(_1110, Reg) @ [R2(_1110, Reg)]
 ├─R2(Reg, _1110) ** R3(Loc, _1110, Reg) | R2(Reg, _1110) @ [R2(Reg, _1110)] | R2(_1110, Reg) @ [R2(_1110, Reg)] | R2(_1110, Reg) @ [R2(_1110, Reg)]
 │  ├─╼ join_key : R3(Loc, _1107, Reg) ** R2(Reg, _1110) ** R3(Loc, _1110, Reg) | _1107 >= Nom | R2(Reg, _1110) @ [R2(Reg, _1110)] | R2(_1110, Reg) @ [R2(_1110, Reg)] | R2(_1110, Reg) @ [R2(_1110, Reg)]
 │  └─R3(Loc, Nom, Reg)
 │     ├─Val ** R3(Loc, Nom, Reg)
 │     │  ├─R3(Loc, Nom, Reg) ** Val ** R3(Loc, Nom, Reg)
 │     │  │  ├─╼ select : (_1111 ** Val ** Bool) ** R3(Loc, Nom, Reg) ** Val ** R3(Loc, Nom, Reg) | _1111 <= Val | R3(Loc, Nom, Reg) @ [R3(_1111, _, _), R3(_, _1111, _), R3(_, _, _1111)]
 │     │  │  └─╼ eq : Val ** Val ** Bool
 │     │  └─R3(Loc, Nom, Reg)
 │     │     ├─Reg ** R3(Loc, Nom, Reg)
 │     │     │  ├─╼ lTopo : R1(Loc) ** Reg ** R3(Loc, Nom, Reg)
 │     │     │  └─R1(Loc)
 │     │     │     ├─╼ deify : Reg ** R1(Loc)
 │     │     │     └─Reg
 │     │     │        ├─╼ merge : R1(Reg) ** Reg
 │     │     │        └─R1(Reg)
 │     │     │           ├─╼ pi2 : R2(_1110, Reg) ** R1(Reg) | R2(Reg, _1110) @ [R2(Reg, _1110)] | R2(_1110, Reg) @ [R2(_1110, Reg)] | R2(_1110, Reg) @ [R2(_1110, Reg)]
 │     │     │           └─╼ test x : R2(_1110, Reg) | R2(Reg, _1110) @ [R2(Reg, _1110)] | R2(_1110, Reg) @ [R2(_1110, Reg)] | R2(_1110, Reg) @ [R2(_1110, Reg)]
 │     │     └─Reg
 │     │        ├─╼ merge : R1(Reg) ** Reg
 │     │        └─R1(Reg)
 │     │           ├─╼ pi2 : R2(_1110, Reg) ** R1(Reg) | R2(Reg, _1110) @ [R2(Reg, _1110)] | R2(_1110, Reg) @ [R2(_1110, Reg)] | R2(_1110, Reg) @ [R2(_1110, Reg)]
 │     │           └─╼ test x : R2(_1110, Reg) | R2(Reg, _1110) @ [R2(Reg, _1110)] | R2(_1110, Reg) @ [R2(_1110, Reg)] | R2(_1110, Reg) @ [R2(_1110, Reg)]
 │     └─╼ in : Nom
 └─R2(Reg, _1110) | R2(Reg, _1110) @ [R2(Reg, _1110)] | R2(_1110, Reg) @ [R2(_1110, Reg)] | R2(_1110, Reg) @ [R2(_1110, Reg)]
    ├─R2(_1110, Reg) ** R2(Reg, _1110) | R2(Reg, _1110) @ [R2(Reg, _1110)] | R2(_1110, Reg) @ [R2(_1110, Reg)] | R2(_1110, Reg) @ [R2(_1110, Reg)]
    │  ├─╼ groupby : (R1(_1110) ** _1110) ** R2(_1110, Reg) ** R2(Reg, _1110) | R2(Reg, _1110) @ [R2(Reg, _1110)] | R2(_1110, Reg) @ [R2(_1110, Reg)] | R2(_1110, Reg) @ [R2(_1110, Reg)]
    │  └─╼ get : R1(_1110) ** _1110 | R2(Reg, _1110) @ [R2(Reg, _1110)] | R2(_1110, Reg) @ [R2(_1110, Reg)] | R2(_1110, Reg) @ [R2(_1110, Reg)]
    └─╼ test x : R2(_1110, Reg) | R2(Reg, _1110) @ [R2(Reg, _1110)] | R2(_1110, Reg) @ [R2(_1110, Reg)] | R2(_1110, Reg) @ [R2(_1110, Reg)]

This is probably related to #8 in that not all variables that should know about constraints in fact do know about them.

Reuse nodes in primitives

Primitives are now generated at the expression level: an operation g that is defined as λx y. f(…) and used like g(a, b) will first be converted to f(…[a, b]) and only then transformed to RDF. However, this will cause unnecessary duplication of the subexpressions a, b. The RDF could also point to the same node.

Tests for RDF output

See the methods for checking whether two RDF graphs are isomorphic at https://rdflib.readthedocs.io/en/stable/_modules/rdflib/compare.html and https://rdflib.readthedocs.io/en/stable/_modules/rdflib/tools/graphisomorphism.html.
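
For example, a test helper might compare the generated graph against a hand-written expected graph up to blank node renaming; isomorphic, to_isomorphic and graph_diff are rdflib's own:

from rdflib import Graph
from rdflib.compare import graph_diff, isomorphic, to_isomorphic

def assert_same_graph(actual: Graph, expected: Graph) -> None:
    """Fail with a readable diff when two RDF graphs differ,
    treating blank nodes up to isomorphism."""
    if not isomorphic(actual, expected):
        both, only_actual, only_expected = graph_diff(
            to_isomorphic(actual), to_isomorphic(expected))
        raise AssertionError(
            f"unexpected triples:\n{only_actual.serialize(format='nt')}\n"
            f"missing triples:\n{only_expected.serialize(format='nt')}")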

Check, for example:

  • What happens for data outputs within parameter functions?
  • What happens to data inputs within parameter functions?
  • What happens to nested parameter functions?
  • What happens when params are supplied by outer functions?
  • What happens when f takes multiple function parameters?

Passing an abstraction to a non-primitive function leads to error.

My assumption was that this should not happen, but it does:

http://geographicknowledge.de/vocab/GISTools.rdf#ZonalStatisticsMeanRatio

join_attr (get_attrL (objectregionnominals x2)) (apply1 (fcont (avg) (field x1)) (get_attrL (objectregionnominals x2)))

http://geographicknowledge.de/vocab/GISTools.rdf#ZonalStatisticsMeanInterval

join_attr (get_attrL (objectregionnominals x2)) (apply1 (fcont (avg) (itvfield x1)) (get_attrL (objectregionnominals x2)))

http://geographicknowledge.de/vocab/GISTools.rdf#SpatialJoinCountTess

join_attr (get_attrL (objectregionnominals x2)) (apply1 (ocont (get_attrL (objectregionnominals x1))) (get_attrL (objectregionnominals x2)))

http://geographicknowledge.de/vocab/GISTools.rdf#MergeObjects

join_attr (groupby (compose merge (compose pi2 (join_subset (get_attrL (objectregionnominals x1))))) (apply1 objectify (get_attrR (objectregionnominals x1)))) (getobjectnames (pi2 (get_attrR (objectregionnominals x1))))

Resolve, normalize and follow perform similar tasks

In type.py, resolve() takes the upper or lower bound on variables, follow() follows type variables to their bindings and normalize() applies follow to all variables in a type. These perform confusingly similar tasks, so it would be helpful to combine them.

Simplify internal operations

The attached notes may help in understanding the following rambling.

Right now, we create an internal operation for every operation-as-parameter, and connect them to all parameter values that may be available to them.

This is relatively straightforward for one level of nesting, but it gets unwieldy for operations that are nested inside parameters. Consider that the internal part of an outer operation has access to parameter values that may be passed along to the beginning of the pipeline of the inner operations. So the inner internals must be fed with data produced by outer internals. Therefore, adding ta:feed connections from outer internals to inner internals is the proposed method for dealing with issue #37; see the red lines in the notes.

However, it may be possible instead to use only a single internal operation as a 'black box' for everything that happens 'inside' an operation. Additional operations-as-parameters, as well as nested operations, would simply reuse that one node.

This solution would require cycles (due to the interaction between the internal operation and the unknown order in which it uses other values available to it), but cycles were already possible (see https://github.com/quangis/transformation-algebra/blob/c3474c6c42ff992a4d760c62662a178d29dd577a/tests/test_rdf.py#L269), albeit only in the case of one operation with multiple operations-as-parameters.

The solution would reduce the number of triples. However, it needs deeper consideration before being implemented. What information would we lose by doing it this way? Is the increased occurrence of cycles a performance concern?

Inline type annotation and a notation for source data inputs

Data input functions are functions that don't do anything except introduce data. Ideally, these should be removed; a dedicated notation such as data LABEL : Type would remove the need for making up and defining functions that only serve to mirror the types they introduce.
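
For example, instead of declaring a dummy operation for every source, inputs could be introduced directly. This is hypothetical concrete syntax for the proposed notation, with types borrowed from examples elsewhere in these issues:

data x1 : R2(Loc, Reg)
data x2 : R1(Reg)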

Handling of variables & constraints is slow

Enumerating variables and passing constraints between them is very inefficient, particularly after f6fef34. Ideally, we would have a set-of-variables data structure to keep track of variables, along with a clever way to keep track of their constraints.
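
A disjoint-set (union-find) forest is the textbook structure for this; a minimal sketch, independent of the actual variable class:

class VariableSets:
    """Union-find with path compression: near-constant-time lookup of a
    variable's representative, on which constraints are stored once
    instead of being passed between individual variables."""

    def __init__(self):
        self.parent = {}
        self.constraints = {}  # representative -> set of constraints

    def find(self, v):
        root = self.parent.setdefault(v, v)
        if root != v:
            root = self.find(root)
            self.parent[v] = root  # path compression
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb
            # merge constraint sets when variables are unified
            self.constraints.setdefault(rb, set()).update(
                self.constraints.pop(ra, set()))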

Normalize abstractions

Abstractions that immediately apply their variable, as in λx.f x, should be the same as simply f (η-reduction). Therefore, we should unify the RDF translation processes for primitives and abstractions.
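
A generic sketch of the rewrite on a toy expression tree; the classes are stand-ins for the library's own:

from dataclasses import dataclass

@dataclass
class Var:
    name: str

@dataclass
class App:
    f: object
    x: object

@dataclass
class Abs:
    var: Var
    body: object

def occurs(v, e) -> bool:
    """Does variable v occur free in expression e?"""
    if isinstance(e, Var):
        return e is v
    if isinstance(e, App):
        return occurs(v, e.f) or occurs(v, e.x)
    if isinstance(e, Abs):
        return e.var is not v and occurs(v, e.body)
    return False

def eta_reduce(e):
    """Rewrite λx. f x to f, provided x does not occur free in f."""
    if (isinstance(e, Abs) and isinstance(e.body, App)
            and e.body.x is e.var and not occurs(e.var, e.body.f)):
        return e.body.f
    return e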

Wrap expression trees

Expression trees can get quite large. Optionally wrapping them might help readability immensely.

Use the library within LaTeX

It should be possible to call cct.parse from within LaTeX. Parse trees could then be typeset very easily and changed on a whim.

Allow `__setattr__` access on transformation algebras

The way we define transformation algebras now is something like:

f = Operation(A ** A, name="f")
alg = TransformationAlgebra()
alg.add(f)

Or:

f = Operation(A ** A)
alg = TransformationAlgebra()
alg.add(**globals())

In the first case, we explicitly name an operation and add it to the algebra. In the second case, we don't have to explicitly name it because the name is known from the keyword arguments. Either way, though, all operations are defined before being added. This is because operations that are defined in terms of other operations need to be validated (their declared type needs to correspond to the type inferred from the definition). If not all operations have been defined yet at the point that validation happens, this will fail. (See issue #3).

However, it would cut down on verbosity in some cases to be able to do:

alg = TransformationAlgebra()
alg.f = Operation(A ** A)

To do this, validation needs to happen at a different point.
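
A sketch of how that could look; the deferred validate() method and the name attribute are assumptions about the internals, with Operation as in the snippets above:

from transforge import Operation  # exact import path is an assumption

class TransformationAlgebra:
    def __init__(self):
        # bypass our own __setattr__ for internal state
        object.__setattr__(self, "operations", {})

    def __setattr__(self, name, value):
        if isinstance(value, Operation):
            value.name = name
            self.operations[name] = value  # record, but defer validation
        else:
            object.__setattr__(self, name, value)

    def validate(self):
        # run once, after all operations have been defined (cf. issue #3),
        # so definitions may refer to operations that come later
        for operation in self.operations.values():
            operation.validate()  # hypothetical per-operation check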

Suggestion: RDF integration

Since the aim of transformation algebras is semantic (not implementation), it might be a good idea to add a module that extends transformation algebras to generate RDF vocabularies, and parse expressions as RDF graphs using those vocabularies.

A better DSL for flow queries

Right now, querying workflows is done using Flows, which are constructed by abusing Python's >> operator and ... ellipsis type. This is not future-proof, because it leaves very little room for adding functionality, and there are some other issues (such as: how do you express at what node a flow starts?)

A LINQ-style domain-specific language within Python would probably fit better.
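
Purely as an illustration of the direction, not a design; every name here is hypothetical:

class FlowQuery:
    """Fluent builder: each call records a step and returns self, so
    queries compose left to right without abusing >> and ...."""

    def __init__(self):
        self.steps = []

    def output(self, type_):
        self.steps.append(("output", type_))
        return self

    def via(self, operation):
        self.steps.append(("via", operation))
        return self

    def source(self, label):
        self.steps.append(("source", label))
        return self

q = FlowQuery().output("R3(Loc, Nom, Reg)").via("join_key").source("x")

Such a builder leaves room to grow: starting nodes, ordering constraints and aspect selection (cf. the aspects issue above) can each become a method.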

Infinite loop with schematic variables

Once we define max not as:

Operation(R2(Val, Ord) ** Ord)

but as:

Operation(lambda x: R2(Val, x) ** x | x @ Ord)

Then the following gets into an infinite loop:

groupbyL max (relunion (pi2 (apply2 (lgDist (gridgraph (locationfield x1) (ratiofield x2))) (apply nest (pi1 (locationfield x1))) (accumulate (locationfield x1)))))

String printing

To print variable names, we currently store a name for every variable and rename them at appropriate times before calling __str__. This is error-prone. Creating a Name object and a .text() method on types would be more explicit: it allows one consistent naming scheme per session and leaves room for extra information. Should we include type annotations? Use Unicode? Etcetera.
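
A sketch of such a session-level scheme; the class and method names are only suggestions:

import itertools

class NameScheme:
    """Hand out a stable, pretty name per variable, consistently for
    one session, instead of renaming before every __str__ call."""

    def __init__(self, unicode: bool = True):
        letters = "αβγδεζηθ" if unicode else "abcdefgh"
        self._fresh = itertools.chain(
            iter(letters), (f"t{i}" for i in itertools.count(1)))
        self._names = {}

    def text(self, var) -> str:
        if var not in self._names:
            self._names[var] = next(self._fresh)
        return self._names[var]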

Reduce input-output triples

In the RDF representation, every Operation node is supposed to be coupled to exactly one output Data node. It should be easy to see that simply combining them into a single OperationResult node (or another, more aesthetically pleasing name) would dramatically slash the number of triples in the store, and it would also make for simpler queries.
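
Schematically, in Turtle; class and predicate names other than the ta: namespace itself are hypothetical:

# currently: an operation node plus a separate output data node
:app1 a ta:Operation ; ta:output :data1 .
:data1 a ta:Data ; ta:type :SomeType .

# proposed: one combined node carrying both roles
:app1 a ta:OperationResult ; ta:type :SomeType .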

Fix issue where types in RDF won't be fully unified

Unification happens during concatenation and translation to RDF. This is not a problem when there are no variables in output types, but there might be cases where variables are already encoded in the RDF and then unified later.

Connecting graphs to SPARQL endpoints

The in-memory graphs we use now should be made context-aware and connected to SPARQL endpoints.

Follow-up: adding and removing transformation graphs to the endpoint.
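
rdflib already supports this through its SPARQL store plugins; a minimal sketch, with placeholder endpoint URLs:

from rdflib import Graph, URIRef
from rdflib.plugins.stores.sparqlstore import SPARQLUpdateStore

store = SPARQLUpdateStore(
    query_endpoint="http://localhost:3030/ds/query",
    update_endpoint="http://localhost:3030/ds/update")

# one named graph per transformation graph, so the follow-up of adding
# and removing whole transformation graphs becomes a graph-level operation
g = Graph(store, identifier=URIRef("https://example.org/graphs/workflow1"))
g.add((URIRef("https://example.org/a"),
       URIRef("https://example.org/b"),
       URIRef("https://example.org/c")))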

Variable naming

Variable names in the output are currently printed as _<some number>. This can be confusing and overwhelming. Pretty-printing the remaining variables in expression trees would be user-friendly.

Moreover, schematic variables should be visually differentiated from instance variables, perhaps by giving them a different naming scheme.

Nested functions should know about internals

Consider f (g h) x, with f, g, h functions, (g h) a function, and x data. f should be connected to the same internal data that g is connected to.

,--internal-,
↓       |   |
λ → h → g → f
↑           ↑
`---- x ----'
