cloudprivacylabs / lsa

Layered Schema Architecture

License: Apache License 2.0

Go 93.99% Makefile 0.18% HTML 4.93% CSS 0.31% Python 0.57% Dockerfile 0.01%
graphs interoperability linked-data schema

lsa's People

Contributors: bserdar, jcskywalker

Forkers: jcskywalker

lsa's Issues

conditional valueset lookup

The lookup may change based on the contents. For example, an "Observation" lookup may use one lookup table for "Smoking" and another for "Alcohol", etc.

Add a labeledAs term

This is mainly for JSON overlays. They don't have a way to add new labels. labeledAs will add the label(s) to the node.

This can be

labeledAs: label

or

labeledAs: [label1, label2,...]

In a schema, you can use labeledAs to add additional labels to a node. So:

{
  "@id": "<id>",
  "@type": "Object",
  "labeledAs": "x"
}

This is equivalent to:

{
  "@id": "<id>",
  "@type": ["Object", "x"]
}

There are cases where you can't do that. One is a JSON schema.

A JSON overlay looks like this:

{
  "properties": {
    "x": {
      "x-ls": {
        ...overlay terms
      }
    }
  }
}

So you have no control over the labels of the fields. The labeledAs term will be used in JSON schemas:

{
  "properties": {
    "x": {
      "x-ls": {
        "labeledAs": "Y",
        ...overlay terms
      }
    }
  }
}

This needs post-processing of the schema graph. So:

  • Write a func to process labeledAs nodes: func ProcessLabeledAs(g graph) (a sketch follows this list)
  • This should go through all nodes that have the labeledAs property and add those labels to the nodes
  • Remove the labeledAs property
  • Call ProcessLabeledAs after compile
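
A minimal sketch of what ProcessLabeledAs could do. The Graph and Node interfaces below are simplified stand-ins, and the labeledAs term URI follows the ls: mapping convention; none of this is the actual lsa API:

type Node interface {
    GetProperty(key string) (interface{}, bool)
    RemoveProperty(key string)
    AddLabel(label string)
}

type Graph interface {
    Nodes() []Node
}

const labeledAsTerm = "https://lschema.org/labeledAs" // assumed term URI

// ProcessLabeledAs adds the label(s) carried in the labeledAs property
// to each node, then removes the property. Call this after compile.
func ProcessLabeledAs(g Graph) {
    for _, node := range g.Nodes() {
        val, ok := node.GetProperty(labeledAsTerm)
        if !ok {
            continue
        }
        switch v := val.(type) {
        case string:
            node.AddLabel(v)
        case []string:
            for _, label := range v {
                node.AddLabel(label)
            }
        }
        node.RemoveProperty(labeledAsTerm)
    }
}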

Move Go section to end or separate page

It is not clear why there are Go details in the middle of the spec. Either move them to the end or create a new document. This section is an introduction to the Go implementation packages you have created, and a tutorial on how to use them could be useful later on. In the future there might be other implementations, such as Python.

CSV ingest join

We may get CSV files that are results of SQL joins. We need a way to ingest these.

The CSV ingestion should be able to use multiple schemas, each schema addressing a set of columns in the input. For each row, each schema specifies an entry to be ingested. Subsequent identical entries are assumed to be repetitions due to the join.

Let's have something different from csvingester.

Pipeline: ingest/csv/join
cobra command: csvjoin

This will need parameters, an array of:

  • schema, column range
  • bundle/type, column range

Ex:

A,B,C,D,E,F,G,H,I
a,b,c,d,e,f,g,h,i
a,b,c,x,y,z,j,k,l
a,b,c,x,y,z,q,w,e
--- output graph
e,f,g....

Let's require a bundle for this type of CSV file, so there is a bundle containing all the entities in the schema.

Parameters (ordering matters):

  • SchemaA (variant id from the bundle), data is in cols 1-3, row identity is col 1 (a is the primary key)
  • SchemaB (variant id from the bundle), cols 4-6 (cols 1-4 are the ids)
  • SchemaC, cols 7-9 (cols 1-4-7 are the ids)

When you ingest, create one graph for each instance of entity A, containing multiple instances of B and C.

col headers: A, B, C
row 1: a_1, b_1, c_1
row 2: a_1, b_2, c_2

row 3: a_2, b_3, c_3

This is a join of:

A:
a_1
a_2

B:
b_1, a_1
b_2, a_1
b_3, a_2

C:
c_1,b_1
c_2,b_2
c_3,b_3

When the entity for the first schema changes, output the graph and create a new one (see the sketch after the translation list below).

For parameters:

  • You should be able to specify a schema, or a bundle/type
  • You should be able to specify the range of columns (col headers or numbers), or use the col names in the schema
  • You should be able to specify a list of columns (col headers or numbers) that form the row id (data that uniquely identifies that row; it may include columns of other entities)
(a,b,c) -- (d,e,f) -- (g,h,i)
           -- (x,y,z) -- (j,k,l) 
                         -- (q,w,e)

Need to translate this into:

  1. ingest csv (a,b,c) using Schema A
  2. ingest csv (d,e,f) using Schema B, link to 1
  3. ingest csv (g,h,i) using Schema C, link to 2
  4. ingest csv (x,y,z) using Schema B, link to 1
  5. ingest csv (j,k,l) using Schema C, link to 4
  6. ingest csv (q,w,e) using Schema C, link to 4
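
A rough sketch of the grouping and dedup behavior described above, in Go. ColumnRange, the key helper, and the emit callback are hypothetical names, not the actual csvjoin design:

import "strings"

// ColumnRange describes which columns belong to one schema variant and
// which columns form the row identity for that entity (hypothetical).
type ColumnRange struct {
    SchemaVariant string // variant id from the bundle
    Start, End    int    // 0-based, inclusive column range
    IDCols        []int  // columns that uniquely identify this entity
}

// key builds a dedup key from the identity columns of a row.
func key(row []string, cols []int) string {
    parts := make([]string, 0, len(cols))
    for _, c := range cols {
        parts = append(parts, row[c])
    }
    return strings.Join(parts, "\x00")
}

// ingestJoinedRows walks the rows in order; when the identity of the
// first (root) schema changes, the current graph is complete and a new
// one starts. Entities already seen in the current graph are skipped
// as join repetition.
func ingestJoinedRows(rows [][]string, ranges []ColumnRange, emit func(rootKey string, r ColumnRange, cols []string)) {
    seen := map[string]bool{}
    currentRoot := ""
    for _, row := range rows {
        rootKey := key(row, ranges[0].IDCols)
        if rootKey != currentRoot {
            currentRoot = rootKey // output previous graph, start a new one
            seen = map[string]bool{}
        }
        for _, r := range ranges {
            k := r.SchemaVariant + "|" + key(row, r.IDCols)
            if seen[k] {
                continue // repetition due to join
            }
            seen[k] = true
            emit(rootKey, r, row[r.Start:r.End+1])
        }
    }
}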

Values insert new nodes under vsContext

Instead, it should locate where the new node will be inserted based on context.

The test TestStructuredDeepVS fails because the valueset results are created directly under root, not under root -> obj

Fix this by creating a path using EnsurePath.

This is done in ValuesetInfo.createResultNodes

We need to add hints to polymorphic attributes

Add a schema field hint so that polymorphic objects can be validated quickly.

This will be done by using a node label: https://lschema.org/typeDiscriminator

Steps:

  • Update schemas/ls.json, add a new mapping: "typeDiscriminator": "ls:typeDiscriminator"
  • Write a unit test with polymorphic input (use the FHIR schema with an overlay)
  • In json/parse and xml/parse, use the discriminator
    • In json/parse there is a ParsePolymorphic func that tests all options. When testing options, it should first ingest the discriminator fields and validate; if those fail, the option should fail.
    • To do that, add a discriminator [][]string to the parse context. If there is a hint, use that hint in ParseObject.
    • ctx.discriminator will be a stack of []string. In ParsePolymorphic, push the hints on entry and pop them when done (see the sketch below).
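
A sketch of the discriminator stack bookkeeping, assuming a simplified parse context; the field and method names are illustrative only:

// parseContext is a simplified stand-in for the actual json/parse
// context; only the discriminator bookkeeping is shown.
type parseContext struct {
    // discriminator is a stack of hints; each element lists the
    // discriminator fields to ingest and validate first.
    discriminator [][]string
}

func (ctx *parseContext) pushDiscriminator(hints []string) {
    ctx.discriminator = append(ctx.discriminator, hints)
}

func (ctx *parseContext) popDiscriminator() {
    ctx.discriminator = ctx.discriminator[:len(ctx.discriminator)-1]
}

// currentDiscriminator returns the hints for the innermost polymorphic
// option being tested, if any.
func (ctx *parseContext) currentDiscriminator() ([]string, bool) {
    if len(ctx.discriminator) == 0 {
        return nil, false
    }
    return ctx.discriminator[len(ctx.discriminator)-1], true
}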

We need a way to deal with categorized data ingestion

Categorized data is data represented in multiple columns. This is especially relevant for survey data. For instance:

User ID, Q1Ans1, Q1Ans2, Q2Ans1, Q2Ans2, Q2Ans3, ..

where each column is the response to a checkbox.

Change ValueAccessor to get/set values

Currently, ValueAccessor uses GetNodeValue and SetNodeValue. However, the value may belong to an edge or a property, so we need additional semantic support for the following (a sketch follows the list):

  • Parsing a native value given a string and the properties
  • Formatting a native value given the old value and the properties
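
One possible shape for the extended accessor, sketched with an assumed Properties type standing in for whatever property map nodes and edges carry:

// Properties is a placeholder for the property map attached to nodes,
// edges, or standalone properties.
type Properties map[string]interface{}

// ValueAccessor is a sketch of the extended interface.
type ValueAccessor interface {
    // ParseValue parses a native value from its string representation,
    // using the annotations in properties.
    ParseValue(value string, properties Properties) (interface{}, error)
    // FormatValue formats a native value back into a string, given the
    // old value and the annotations in properties.
    FormatValue(oldValue interface{}, properties Properties) (string, error)
}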

Include feature for schemas

Nice to have: include another schema. For instance:

{
  "@id": "someObjectId",
  "@type": "Object",
  "include": "referenceToSchema",
  "namespace": "https://new-namespace"
}

This would include the schema "referenceToSchema" in this schema, replacing its namespace with the given namespace. This would help define common structures, like "code".

Refactor pipelines

Pipelines are not a core component of LSA; they are defined at the layers binary level. Still, they are reusable, so we would like to have pipeline support in a separate package under layers. So:

  • Create layers/cmd/pipeline package
  • Move PipelineContext (and methods) into pipeline package
  • Replace PipelineContext.InputFiles with func Next() io.ReadCloser
  • Move pipeline/step declarations into pipeline package
  • Export the marshaling support
  • Move fork, readgraph, writegraph into pipeline package
  • Everything else remains under cmd, including the pipeline command

Fix the remaining pipeline-related cmds accordingly. One thing to note: if len(InputFiles)==0, we read from stdin. You need to deal with this using a custom func that returns ioutil.NopCloser(os.Stdin) from Next() when called the first time, and nil afterwards (see the sketch below).

Make sure to close the ReadCloser in pipeline when done.
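
The stdin case could be handled with a closure like the following sketch; only the Next() behavior is shown, the PipelineContext wiring is assumed:

import (
    "io"
    "io/ioutil"
    "os"
)

// stdinOnce returns a Next() func that yields stdin (wrapped in a
// NopCloser so closing it is harmless) exactly once, then nil.
func stdinOnce() func() io.ReadCloser {
    done := false
    return func() io.ReadCloser {
        if done {
            return nil
        }
        done = true
        return ioutil.NopCloser(os.Stdin)
    }
}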

Add a merge graphs cmd to layers

This should input multiple graphs and output a combination of those. There should be a mechanism to describe which node in one graph maps to which node in another graph.

Timezone mismatch when converting dates

This is the value in the expected graph:
2021-10-13T00:00:00Z
Posting to the DB saves it as:
2021-10-13T00:00:00-07:00
Loading from the DB comes out as:
2021-10-13T00:00:00-07:00
Calling ls.SetNodeValue on 2021-10-13T00:00:00-07:00 converts it to:
2021-10-13T00:00:00Z

Add term info to PropertyValue

Looking up term metadata is taking a lot of time. We should change PropertyValue to include term metadata. To do this:

type PropertyValue struct {
    sem   *TermSemantics
    value interface{}
}

Add

func (pv *PropertyValue) GetSem() *TermSemantics

Change IntPropertyValue, StringPropertyValue, etc. to:

XXXPropertyValue(term string, value int)

It should be impossible to construct a PropertyValue without a term.

During construction, look up the term semantics (use GetTermInfo()) and store a pointer to the result in pv.sem (see the sketch below).

Once this is done, change GetNodesWithValidators() to use pv.GetSem() instead of GetTermMetadata()

Remove UnmarshalJSON and UnmarshalYAML from PropertyValue. Fix the property unmarshaler in ls/graphjson.go
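
A sketch of the constructors after the change; GetTermInfo is the lookup named above, but its exact signature is an assumption here:

// StringPropertyValue resolves term semantics at construction time.
// GetTermInfo is assumed to return a *TermSemantics.
func StringPropertyValue(term, value string) *PropertyValue {
    return &PropertyValue{sem: GetTermInfo(term), value: value}
}

func IntPropertyValue(term string, value int) *PropertyValue {
    return &PropertyValue{sem: GetTermInfo(term), value: value}
}

// GetSem returns the term semantics resolved during construction.
func (pv *PropertyValue) GetSem() *TermSemantics { return pv.sem }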

valueset lookups should be based on expressions

Valueset definitions use schema node ids to refer to other nodes. Instead, the base code should use opencypher expressions. If expressions aren't given, the current behaviour should be replicated with predefined expressions.

Validate transform script

We need to make sure transform script matches target schema. For this:

  • Add a new method to pk/transform/TransformScript: Validate(targetSchema *ls.Layer) error
  • The new method should validate that every key of TransformScript.TargetSchemaNodes exists as an attribute in targetSchema
  • Search for where TransformScript.Compile is called, and call Validate there as well (a sketch follows this list)
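
A sketch of the new method; GetAttributeByID is an assumed lookup on *ls.Layer, not a confirmed API:

import "fmt"

// Validate checks that every key of TargetSchemaNodes exists as an
// attribute in the target schema.
func (t *TransformScript) Validate(targetSchema *ls.Layer) error {
    for attrID := range t.TargetSchemaNodes {
        if targetSchema.GetAttributeByID(attrID) == nil { // assumed lookup
            return fmt.Errorf("transform script refers to %s, which is not an attribute of the target schema", attrID)
        }
    }
    return nil
}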

GetValue for properties

If a schema attribute is defined to be ingested as a property, there is no way to get its value. We need a function that gets the value of an attribute given the schema attribute id, even if the attribute is a property.

PatternDateTimeParser is using wrong format

If the node has a GoTimeFormat, it should format using the Go formatter. If the node has a goment format, it should format using the goment formatter. Currently it is using the Go formatter in both cases.
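
The intended dispatch is roughly the following sketch; the property lookups are simplified to plain string parameters, and the goment call assumes the github.com/nleeper/goment package is the implementation in use:

import (
    "time"

    "github.com/nleeper/goment"
)

// formatDateTime picks the formatter based on which format annotation
// the node carries: GoTimeFormat uses Go's layout strings, otherwise
// the goment (moment.js-style) format is used.
func formatDateTime(t time.Time, goTimeFormat, gomentFormat string) (string, error) {
    if goTimeFormat != "" {
        return t.Format(goTimeFormat), nil
    }
    g, err := goment.New(t)
    if err != nil {
        return "", err
    }
    return g.Format(gomentFormat), nil
}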

Recover from panics in pipeline

For ingest_csv, ingest_json, ingest_xml, there is a loop in the pipeline to process inputs. The loop is of the form:

for {
    nextInput()
    processInput()
}

Convert this to:

for {
    nextInput()
    func() {
        defer func() {
            // recover from panic
            // Send the error down PipelineContext.ErrorLogger
        }()
        processInput()
    }()
}

Add a PipelineContext.ErrorLogger func:

type PipelineContext struct {
    ErrorLogger func(ctx *PipelineContext, err error) bool
}

By default, set ErrorLogger to a func that logs the error.

If ErrorLogger returns false, stop the pipeline.

Use something like this:

for {
    nextInput()
    var doneErr error
    func() {
        defer func() {
            if r := recover(); r != nil {
                err := fmt.Errorf("recovered: %v", r)
                if !ctx.ErrorLogger(ctx, err) {
                    doneErr = err
                }
            }
        }()
        processInput()
    }()
    if doneErr != nil {
        return doneErr
    }
}

Valuesets where each node has an associated predicate

We need a valueset where each node has an associated predicate. This is necessary to represent ranges of values as separate unique nodes. For example: confidence level value in a node can be linked to a node that represents a range of confidence levels.

Add a valueType term

Currently we are overloading node labels to denote node types as well as value types. I believe value types should be in a property. So, instead of having a node (:Value:`xsd:date`), we should have (:Value {valueType: 'xsd:date'}).

JSON schemas in bundles

If a JSON schema defines multiple entities, as the FHIR schema does, we refer to the same schema in every entity definition. We should still support that, but also add a JSON schema definition at the top level of a bundle:

jsonSchemas:
  - id: https://hl7.org/fhir
    ref: fhir.schema.json
    overlays:
      - ovl1.json
      - ovl2.json

Then, bundle references using this ID should refer to schema + overlays.

develop branch is not building

The develop branch is failing because there are calls to ls.Ingest without a context in several packages. Make sure cmd/layers builds.

Post processing in graph builder is taking too long

The GraphBuilder has a PostIngest function that goes through all schema nodes, finds properties that implement the PostIngest interface, and calls them. This is done separately for each ingested doc. The list of nodes that have PostIngest properties can be computed once, and the post-ingest functions can be called on those schema nodes. But the graph builder does not know about the schema; the ingester func does.

So, the Ingest func should be converted to a method:

type Ingester struct {
  Schema *Layer
  ...
}

func (ing *Ingester) Ingest(...)

The Ingester can cache things. For the PostIngest work:

  • Write an Ingester.GetPostIngestSchemaNodes() function. This should go through all schema nodes, build a slice of the schema nodes that have post-ingest properties, and cache it (see the sketch below).
  • The ingester should use GetPostIngestSchemaNodes() to get the (cached) schema nodes, then call graphBuilder.PostIngestSchemaNode on them.
  • Make the cache functions of Ingester thread-safe (use an RWMutex).
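
A sketch of the cached lookup guarded by an RWMutex; the SchemaNode and Layer types, the AllNodes iterator, and the post-ingest check are simplified placeholders:

import "sync"

// SchemaNode and Layer are simplified stand-ins for the real types.
type SchemaNode struct{ hasPostIngest bool }
type Layer struct{ nodes []*SchemaNode }

func (l *Layer) AllNodes() []*SchemaNode { return l.nodes }

type Ingester struct {
    Schema *Layer

    mu              sync.RWMutex
    postIngestNodes []*SchemaNode
    postIngestInit  bool
}

// GetPostIngestSchemaNodes computes the list once and caches it; the
// RWMutex makes the cache safe for concurrent ingesters.
func (ing *Ingester) GetPostIngestSchemaNodes() []*SchemaNode {
    ing.mu.RLock()
    if ing.postIngestInit {
        defer ing.mu.RUnlock()
        return ing.postIngestNodes
    }
    ing.mu.RUnlock()

    ing.mu.Lock()
    defer ing.mu.Unlock()
    if !ing.postIngestInit { // re-check after acquiring the write lock
        for _, node := range ing.Schema.AllNodes() {
            if node.hasPostIngest {
                ing.postIngestNodes = append(ing.postIngestNodes, node)
            }
        }
        ing.postIngestInit = true
    }
    return ing.postIngestNodes
}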

A method to split input into multiple nodes

Example input data is as follows:

"03/21/2019 Cefazolin <=4"

This is given in a single cell of a table. When we ingest it, there should be a way to split this cell into three nodes.

One option is to use a regex-based scheme:

Ingestion schema:

{
  "@id": "rawDataID",
  "@type": "Value",
  "ls:split.regex": "(date regex) (text regex) (value regex)",
  "ls:split.captureGroups": ["schema id of date field", "schema id of text field", "schema id of value field"]
}

This would process the raw input value using the regex, and assign the captured values to the given attributes in the schema.
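
A sketch of the splitting step using Go's regexp package; splitRawValue and its parameters mirror the two terms above but are otherwise hypothetical:

import (
    "fmt"
    "regexp"
)

// splitRawValue applies the ls:split.regex pattern to a raw cell value
// and maps each capture group to the corresponding schema attribute id
// from ls:split.captureGroups.
func splitRawValue(pattern string, captureGroups []string, raw string) (map[string]string, error) {
    re, err := regexp.Compile(pattern)
    if err != nil {
        return nil, err
    }
    m := re.FindStringSubmatch(raw)
    if m == nil {
        return nil, fmt.Errorf("value %q does not match split regex", raw)
    }
    out := map[string]string{}
    // m[0] is the whole match; capture groups start at m[1].
    for i, attrID := range captureGroups {
        if i+1 < len(m) {
            out[attrID] = m[i+1]
        }
    }
    return out, nil
}

For the example cell above, a pattern like "(\d{2}/\d{2}/\d{4}) (\S+) (.+)" would capture the date, the text, and the value.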

Clarification of architecture and use cases

The architecture diagram shows a full architecture covering all use cases, but it would help to have a table indicating whether input only, output only, or both input and output apply to each use case.

The wiring between input and output might not be clear. I guess there is a "mapping" layer to perform the wiring. It would be good to mention if that is the mechanism.
