cloudprivacylabs / lsa
Layered Schema Architecture
License: Apache License 2.0
The lookup may change based on the contents. For example, an "Observation" lookup may use one lookup table for "Smoking" and another for "Alcohol", etc.
This is mainly for JSON overlays, which have no way to add new labels. labeledAs adds the given label(s) to the node.
This can be
labeledAs: label
or
labeledAs: [label1, label2,...]
In a schema, you can use labeledAs
to add additional labels to a node. So:
{
"@id": <id>,
"@type": "Object",
"labeledAs": "x"
}
This is equivalent to:
{
"@id": <id>,
"@type": ["Object", "x"]
}
There are cases where you can't do that. One is a JSON schema.
A JSON overlay looks like this:
{
"properties": {
"x": {
"x-ls": {
...overlay terms
}
}
}
}
So you have no control over the labels of the fields. labeledAs will be used in JSON schemas:
{
"properties": {
"x": {
"x-ls": {
"labeledAs": "Y",
...overlay terms
}
}
}
}
This needs post-processing of the schema graph.
It is not clear why there are Go details in the middle of the spec; either move them to the end or create a new document. This section is an introduction to the Go implementation packages, and a tutorial on how to use them could be useful later on. In the future there might be other implementations, such as Python.
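The post-processing step could look like the following sketch. The graph model here is a hypothetical stand-in for the real lsa graph types, and applyLabeledAs is an illustrative name, not an existing function:

```go
package main

import "fmt"

// Node is a minimal stand-in for a schema graph node; the real code
// operates on the lsa graph types.
type Node struct {
	Labels     []string
	Properties map[string]interface{}
}

// applyLabeledAs moves the "labeledAs" property into the node's label
// set, accepting either a single string or a list of strings.
func applyLabeledAs(n *Node) {
	v, ok := n.Properties["labeledAs"]
	if !ok {
		return
	}
	switch t := v.(type) {
	case string:
		n.Labels = append(n.Labels, t)
	case []string:
		n.Labels = append(n.Labels, t...)
	}
	delete(n.Properties, "labeledAs")
}

func main() {
	n := &Node{
		Labels:     []string{"Object"},
		Properties: map[string]interface{}{"labeledAs": "x"},
	}
	applyLabeledAs(n)
	fmt.Println(n.Labels) // [Object x]
}
```

A real implementation would run this over every node of the compiled schema graph after overlay composition.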
We may get CSV files that are results of SQL joins. We need a way to ingest these.
The CSV ingestion should be able to use multiple schemas, each schema addressing a set of columns in the input. For each row, each schema specifies an entry to be ingested. Subsequent identical entries are to be assumed repetition due to join.
Let's have something different from csvingester.
Pipeline: ingest/csv/join
cobra command: csvjoin
This will need parameters:
Array of:
schema, range columns
bundle/type, range columns
Ex:
A,B,C,D,E,F,G,H,I
a,b,c,d,e,f,g,h,i
a,b,c,x,y,z,j,k,l
a,b,c,x,y,z,q,w,e
--- output graph
e,f,g....
Let's force the use of a bundle for this type of CSV file, so there is a bundle containing all the entities in the schema.
Parameters (ordering matters):
SchemaA (variant id from the bundle), data is in cols 1-3, row identity is 1 (a is the primary key)
SchemaB (variant id from the bundle), 4-6, (1-4 are the ids)
SchemaC, 7-9 (1-4-7 are the ids)
When you ingest, create one graph for each instance of entity A, containing multiple instances of B and C.
row 3: a_2, b_3, c_3
This is a join of:
A:
a_1
a_2
B:
b_1, a_1
b_2, a_1
b_3, a_2
C:
c_1,b_1
c_2,b_2
c_3,b_3
When the entity for the first schema changes, output the graph and create a new one.
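The splitting, de-duplication, and graph-flush logic described above can be sketched as follows. The column ranges and rows come from the example above; rangeSpec and entries are hypothetical names, not part of the existing csvingester:

```go
package main

import (
	"fmt"
	"strings"
)

// rangeSpec gives the 1-based, inclusive column range a schema covers.
// The real csvjoin command would take these as parameters alongside
// the bundle variant ids.
type rangeSpec struct{ start, end int }

func slice(row []string, r rangeSpec) []string {
	return row[r.start-1 : r.end]
}

// entries splits each row by the schema column ranges, drops entries
// already seen (repetition due to the SQL join), and starts a new
// graph whenever the entity for the first schema changes.
func entries(rows [][]string, specs []rangeSpec) []string {
	var out []string
	seen := map[string]bool{}
	prevRoot := ""
	for _, row := range rows {
		root := strings.Join(slice(row, specs[0]), ",")
		if root != prevRoot {
			out = append(out, "graph:"+root)
			seen = map[string]bool{}
			prevRoot = root
		}
		for i, s := range specs {
			e := fmt.Sprintf("%d:%s", i, strings.Join(slice(row, s), ","))
			if !seen[e] {
				seen[e] = true
				out = append(out, e)
			}
		}
	}
	return out
}

func main() {
	rows := [][]string{
		{"a", "b", "c", "d", "e", "f", "g", "h", "i"},
		{"a", "b", "c", "x", "y", "z", "j", "k", "l"},
		{"a", "b", "c", "x", "y", "z", "q", "w", "e"},
	}
	for _, e := range entries(rows, []rangeSpec{{1, 3}, {4, 6}, {7, 9}}) {
		fmt.Println(e)
	}
}
```

For the example input, the repeated (a,b,c) and (x,y,z) entries are emitted only once per graph.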
For parameters:
(a,b,c) -- (d,e,f) -- (g,h,i)
-- (x,y,z) -- (j,k,l)
-- (q,w,e)
Need to translate this into:
Instead, it should locate where the new node will be inserted based on context.
The test TestStructuredDeepVS fails because the valueset results are created directly under root, not under root -> obj
Fix this by creating a path using EnsurePath.
This is done in ValuesetInfo.createResultNodes
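EnsurePath and ValuesetInfo.createResultNodes are the real names from the codebase; the sketch below only illustrates the intended behaviour with a hypothetical minimal node type:

```go
package main

import "fmt"

// Node is a minimal stand-in for a schema graph node.
type Node struct {
	Label    string
	Children map[string]*Node
}

// ensurePath walks root -> path[0] -> path[1] -> ..., creating any
// missing intermediate nodes, and returns the final node. This is the
// behaviour needed so valueset results land under root -> obj rather
// than directly under root.
func ensurePath(root *Node, path ...string) *Node {
	cur := root
	for _, label := range path {
		next, ok := cur.Children[label]
		if !ok {
			next = &Node{Label: label, Children: map[string]*Node{}}
			cur.Children[label] = next
		}
		cur = next
	}
	return cur
}

func main() {
	root := &Node{Label: "root", Children: map[string]*Node{}}
	obj := ensurePath(root, "obj")
	fmt.Println(obj == ensurePath(root, "obj")) // true: existing path is reused
}
```

The second call returns the same node rather than creating a duplicate, which is what the failing test expects.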
Add a schema field hint so that polymorphic objects can be validated quickly.
This will be done by using a node label: https://lschema.org/typeDiscriminator
Steps:
Categorized data is data represented in multiple columns. This is especially relevant for survey data. For instance:
User ID, Q1Ans1, Q1Ans2, Q2Ans1, Q2Ans2, Q2Ans3, ..
where each column is the response to a checkbox.
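A sketch of pivoting such wide checkbox columns into (question, answer) pairs, assuming a hypothetical column naming convention like "Q1Ans2" and treating non-empty cells as checked:

```go
package main

import (
	"fmt"
	"regexp"
)

// colRe matches the assumed survey column names like "Q1Ans2".
var colRe = regexp.MustCompile(`^(Q\d+)(Ans\d+)$`)

// pivotRow turns the wide checkbox columns into (question, answer)
// pairs for the cells that are checked (non-empty).
func pivotRow(header, row []string) [][2]string {
	var out [][2]string
	for i, h := range header {
		m := colRe.FindStringSubmatch(h)
		if m == nil || row[i] == "" {
			continue
		}
		out = append(out, [2]string{m[1], m[2]})
	}
	return out
}

func main() {
	header := []string{"User ID", "Q1Ans1", "Q1Ans2", "Q2Ans1"}
	row := []string{"u1", "1", "", "1"}
	fmt.Println(pivotRow(header, row)) // [[Q1 Ans1] [Q2 Ans1]]
}
```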
valueType uses namespaces. JSON-LD context should define it as @id
When we get a JSON syntax error, it is impossible to find out in which JSON file parsing failed.
Currently, valueAccessor uses GetNodeValue and SetNodeValue. However, value may belong to an edge, or a property. So we need additional semantic support for:
If a schema includes multiple instances of the same base schema as different variants, only one is included in the compiled schema.
Nice to have: include another schema. For instance:
{
"@id": "someObjectId",
"@type": "Object",
"include": "referenceToSchema",
"namespace": "https://new-namespace"
}
This would include the schema "referenceToSchema" in this schema, replacing its namespace with the given namespace. This would help define common structures, like "code".
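The namespace replacement could be a simple prefix rewrite over the included schema's attribute ids. This is a sketch under that assumption; reNamespace and the ids are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// reNamespace rewrites attribute ids from the included schema's
// namespace to the including schema's namespace, as the proposed
// "include" + "namespace" terms would require.
func reNamespace(ids []string, from, to string) []string {
	out := make([]string, len(ids))
	for i, id := range ids {
		if strings.HasPrefix(id, from) {
			out[i] = to + strings.TrimPrefix(id, from)
		} else {
			out[i] = id
		}
	}
	return out
}

func main() {
	ids := []string{"https://old-namespace/code", "https://old-namespace/code/value"}
	fmt.Println(reNamespace(ids, "https://old-namespace", "https://new-namespace"))
}
```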
We need something like this:
"someAttr": {
"@type": "Object",
"edgeLabel": "label", // Connect all attributes via this edge label
"attributes": {
}
}
and
"someArr": {
"@type": "Array",
"edgeLabel": "label",
...
}
Something like
vsTableField: fieldName
Pipelines are not a core component of LSA; they are defined at the layers binary level. Still, they are reusable, so we would like to have pipeline support in a separate package under layers. So:
layers/cmd/pipeline package
func Next() io.ReadCloser
Fix the remaining pipeline-related cmds accordingly. One thing to note: if len(InputFiles)==0, we read from stdin. You need to deal with this using a custom func that returns ioutil.NopCloser(os.Stdin) for Next() when called the first time, and nil afterwards.
Make sure to close the ReadCloser in pipeline when done.
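The stdin case can be handled with a closure like the one below (stdinOnce is an illustrative name; io.NopCloser is the current home of the old ioutil.NopCloser):

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// stdinOnce returns a Next func that yields stdin exactly once and
// nil afterwards, covering the len(InputFiles)==0 case.
func stdinOnce() func() io.ReadCloser {
	done := false
	return func() io.ReadCloser {
		if done {
			return nil
		}
		done = true
		return io.NopCloser(os.Stdin)
	}
}

func main() {
	next := stdinOnce()
	fmt.Println(next() != nil, next() == nil) // true true
}
```

The NopCloser wrapper makes closing the returned ReadCloser in the pipeline safe even though os.Stdin itself should not be closed.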
If there are multiple entityId root nodes with the same id, one of them is left orphaned after linking.
This should input multiple graphs, and output a combination of those. There should be a mechanism to describe which node in one graph maps to another node in another graph.
this is the value in expected graph:
2021-10-13T00:00:00Z
posting to DB saves it as this:
2021-10-13T00:00:00-07:00
loading from DB comes out as this:
2021-10-13T00:00:00-07:00
calling ls.SetNodeValue on 2021-10-13T00:00:00-07:00 converts it to:
2021-10-13T00:00:00Z
Looking up term metadata is taking a lot of time. We should change PropertyValue to include term metadata. To do this:
type PropertyValue struct {
sem *TermSemantics
value interface{}
}
Add
func (pv *PropertyValue) GetSem() *TermSemantics
Change IntPropertyValue, StringPropertyValue, etc. to:
XXXPropertyValue(term string, value int)
It should be impossible to construct a PropertyValue without term.
During construction, lookup term semantics and assign pv.sem to that (use GetTermInfo())
Store a pointer to that.
Once this is done, change GetNodesWithValidators() to use pv.GetSem() instead of GetTermMetadata()
Remove UnmarshalJSON and UnmarshalYAML from PropertyValue. Fix the property unmarshaler in ls/graphjson.go
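The steps above can be sketched as follows. TermSemantics, the registry, and GetTermInfo are simplified in-memory stand-ins for the real ls types, not the actual implementation:

```go
package main

import "fmt"

// TermSemantics is a simplified stand-in for the real ls type.
type TermSemantics struct{ Term string }

var termRegistry = map[string]*TermSemantics{}

// GetTermInfo resolves term semantics once and caches the pointer.
func GetTermInfo(term string) *TermSemantics {
	if s, ok := termRegistry[term]; ok {
		return s
	}
	s := &TermSemantics{Term: term}
	termRegistry[term] = s
	return s
}

// PropertyValue carries its term semantics; the unexported fields
// make it impossible to construct one without a term.
type PropertyValue struct {
	sem   *TermSemantics
	value interface{}
}

func (pv *PropertyValue) GetSem() *TermSemantics { return pv.sem }

// Constructors take the term first and look up semantics at
// construction time, so validators can use pv.GetSem() directly.
func StringPropertyValue(term, value string) *PropertyValue {
	return &PropertyValue{sem: GetTermInfo(term), value: value}
}

func IntPropertyValue(term string, value int) *PropertyValue {
	return &PropertyValue{sem: GetTermInfo(term), value: value}
}

func main() {
	a := IntPropertyValue("https://lschema.org/x", 5)
	b := StringPropertyValue("https://lschema.org/x", "y")
	fmt.Println(a.GetSem() == b.GetSem()) // true: semantics resolved once, pointer shared
}
```

Because both values share one cached *TermSemantics, GetNodesWithValidators can read pv.GetSem() with no per-call metadata lookup.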
Valueset definitions use schema node ids to refer to other nodes. Instead, the base code should use opencypher expressions. If expressions aren't given, the current behaviour should be replicated with predefined expressions.
We need to make sure transform script matches target schema. For this:
If a schema attribute is defined to be ingested as a property, there is no way to get its value. We need a function that gets the value of an attribute given its schema attribute id, even if the attribute is a property.
JSON schema spec <2019 ignores all attributes if there is a $ref. That means, any overlays added to an object with $ref are ignored.
If the node has GoTimeFormat, it should format using the Go formatter. If the node has Goment format, it should format using the goment formatter. Currently it is using the Go formatter for both cases.
For ingest_csv, ingest_json, ingest_xml, there is a loop in the pipeline to process inputs. The loop is of the form:
for {
nextInput()
// processInput
}
Convert this to:
for {
nextInput()
func() {
defer func() {
// recover from panic
// Send the error down pipelineContext.errorLogger
}()
processInput
}()
}
Add a pipelineContext.errorLogger func:
type PipelineContext struct {
ErrorLogger func(ctx *PipelineContext, err error) bool
}
By default, set ErrorLogger to a func that logs the error
If ErrorLogger returns false, stop pipeline
Use something like this:
for {
nextInput()
var doneErr error
func() {
// recover
// if !ErrorLogger(...) {
//   doneErr = err
// }
}()
if doneErr != nil {
return doneErr
}
}
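Putting the pieces together, the recover-and-log loop can be sketched as below. PipelineContext.ErrorLogger follows the shape proposed above; runPipeline and the []func() inputs are illustrative stand-ins for the real nextInput/processInput pair:

```go
package main

import (
	"errors"
	"fmt"
)

type PipelineContext struct {
	// ErrorLogger reports err; returning false stops the pipeline.
	ErrorLogger func(ctx *PipelineContext, err error) bool
}

// runPipeline processes each input inside a closure so a panic in one
// input can be recovered, sent to ErrorLogger, and optionally stop
// the whole pipeline.
func runPipeline(ctx *PipelineContext, inputs []func()) error {
	for _, processInput := range inputs {
		var doneErr error
		func() {
			defer func() {
				if r := recover(); r != nil {
					err := fmt.Errorf("recovered: %v", r)
					if !ctx.ErrorLogger(ctx, err) {
						doneErr = err
					}
				}
			}()
			processInput()
		}()
		if doneErr != nil {
			return doneErr
		}
	}
	return nil
}

func main() {
	ctx := &PipelineContext{ErrorLogger: func(_ *PipelineContext, err error) bool {
		fmt.Println("error:", err)
		return false // default could instead log and return true to continue
	}}
	err := runPipeline(ctx, []func(){func() { panic(errors.New("bad input")) }})
	fmt.Println(err != nil) // true
}
```

Running the recovery in an inner closure keeps the deferred recover scoped to one input, so a panic in one document does not unwind the whole loop.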
We need a valueset where each node has an associated predicate. This is necessary to represent ranges of values as separate unique nodes. For example: confidence level value in a node can be linked to a node that represents a range of confidence levels.
Currently we are overloading node labels to denote node types as well as value types. I believe value types should be in a property. So, instead of having a node (:Value:`xsd:date`), we should have (:Value {valueType: 'xsd:date'}).
If a JSON schema defines multiple entities, like the FHIR schema, we refer to the same schema in every entity definition. We should still support that, but also add a JSON schema definition at the top level of a bundle:
jsonSchemas:
- id: https://hl7.org/fhir
ref: fhir.schema.json
overlays:
- ovl1.json
- ovl2.json
Then, bundle references using this ID should refer to schema + overlays.
Develop branch is failing because there are calls to ls.Ingest without context under several packages. Make sure cmd/layers builds.
The GraphBuilder has a PostIngest function that goes through all schema nodes and finds properties that implement PostIngest interface, and calls them. This is done separately for each ingested doc. The list of nodes that have PostIngest properties can be computed once, and the post ingest functions can be called on those schema nodes. But the graph builder does not know about the schema. The ingester func does.
So, the Ingest func should be converted to a method:
type Ingester struct {
Schema *Layer
...
}
func (ing *Ingester) Ingest(...)
The Ingester can cache things. For the PostIngest work:
We should find a better way of storing non-persistent properties in the graph.
Currently, we have an "ls:asProperty" term that specifies the data element should be ingested as a property. Instead, we should use "ls:ingestAs = property"
Example input data is as follows:
"03/21/2019 Cefazolin <=4"
This is given in a single cell of a table. When we ingest this, there should be a way to split this cell into three nodes.
One option is to use a regex-based scheme:
Ingestion schema:
{
"@id": "rawDataID",
"@type": "Value",
"ls:split.regex": "(date regex) (text regex) (value regex)",
"ls:split.captureGroups": [ "schema id of date field", "schema id of text field", "schema id of value field"]
}
This would process the raw input value using the regex, and assign the captured values to the given attributes in the schema.
The architecture diagram shows a full architecture covering all use cases, but it would help to have a table indicating whether input only, output only, or both input and output are used for each use case.
The wiring between input and output might not be clear. I guess there is a "mapping" layer to perform the wiring; it would be good to mention if that is the mechanism.