gchq / gafferpy
Python API for Gaffer
Home Page: https://gchq.github.io/gafferpy/
License: Apache License 2.0
There is inconsistency in the use of single and double quotes across gafferpy; this should be standardised.
The current gafferpy API has an implementation for a SeedPair that makes use of the class uk.gov.gchq.gaffer.commonutil.pair.Pair; however, this only accepts a single EntitySeed for both the first and second argument of the pair. A generic Pair class is needed in gafferpy so that operations such as GetElementsBetweenSetsPairs can be used.
This was found while testing the GetElementsBetweenSetsPairs operation: currently you cannot use this operation via gafferpy because there is no suitable class for the operation's input. A class needs adding that can accept either a pair of objects or a pair of two lists, and serialise them correctly into JSON.
For reference, valid JSON for the GetElementsBetweenSetsPairs operation looks like:
{
"class" : "GetElementsBetweenSetsPairs",
"input" : {
"class" : "Pair",
"first" : {
"Iterable": [
{
"class" : "EntitySeed",
"vertex" : 1
}
]
},
"second" : {
"Iterable": [
{
"class" : "EntitySeed",
"vertex" : 2
},
{
"class" : "EntitySeed",
"vertex" : 4
}
]
}
}
}
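A generic Pair could be sketched in Python along these lines. This is a minimal sketch with hypothetical class and method names (`Pair`, `to_json`); the real gafferpy base classes and serialisation hooks may differ:

```python
import json


class Pair:
    """Hypothetical generic pair: accepts single seed objects or lists of them."""

    CLASS = "uk.gov.gchq.gaffer.commonutil.pair.Pair"

    def __init__(self, first, second):
        self.first = first
        self.second = second

    @staticmethod
    def _wrap(value):
        # Both single objects and lists serialise to the Iterable wrapper
        items = value if isinstance(value, list) else [value]
        return {"Iterable": [
            i.to_json() if hasattr(i, "to_json") else i for i in items
        ]}

    def to_json(self):
        return {
            "class": self.CLASS,
            "first": self._wrap(self.first),
            "second": self._wrap(self.second),
        }


# Reproduces the shape of the reference JSON above
pair = Pair(
    first={"class": "EntitySeed", "vertex": 1},
    second=[{"class": "EntitySeed", "vertex": 2},
            {"class": "EntitySeed", "vertex": 4}],
)
print(json.dumps(pair.to_json(), indent=2))
```

The key design point is that `_wrap` accepts either a bare object or a list, so the same class serves both `GetElementsBetweenSetsPairs` input styles.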
There are a few places where the Gaffer version is specified but not changed by the "Update Gaffer Version" workflow. The workflow should be changed so that all mentions of the Gaffer version are updated together.
For example:
gafferpy is great for directly sending JSON to the Gaffer REST API, and to do this it creates Python objects that map one to one to that JSON. However, these queries can become very long and require knowledge of Gaffer's verbose query language/JSON.
For a better user experience, an additional library should be made to sit on top of gafferpy, allowing users to specify more Pythonic, user-friendly queries that would be translated into gafferpy queries and sent.
For example, here is a JSON query for a GetElements operation on the road traffic API. It gets Edges connected to the Entity with vertex M32:1. These are then filtered on the count property being more than 1, and the group-by removed. It also uses the gaffer.federatedstore.operation.graphIds option to assert that it only executes on the sub-graph graph1:
{
"class": "uk.gov.gchq.gaffer.operation.impl.get.GetElements",
"input": [
{
"class": "uk.gov.gchq.gaffer.operation.data.EntitySeed",
"vertex": "M32:1"
}
],
"view": {
"edges": {
"RoadUse": {
"preAggregationFilterFunctions": [
{
"selection": [
"count"
],
"predicate": {
"class": "uk.gov.gchq.koryphe.impl.predicate.IsMoreThan",
"value": {
"java.lang.Long": 1
}
}
}
],
"groupBy": []
}
}
},
"directedType": "EITHER",
"options": {
"gaffer.federatedstore.operation.graphIds": "graph1"
}
}
This is a very long and verbose mapping to the Java API. The gafferpy code to perform this query is an equally verbose map to this JSON:
from gafferpy import gaffer as g
from gafferpy import gaffer_connector
gc = gaffer_connector.GafferConnector("http://localhost:8080/rest/latest")
op = g.GetElements(
input=['M32:1'],
view=g.View(
edges=[
g.ElementDefinition(
group='RoadUse',
group_by=[],
pre_aggregation_filter_functions=[
g.PredicateContext(
selection=['count'],
predicate=g.IsMoreThan(
value=g.long(1)
)
)
]
)
]
),
directed_type=g.DirectedType.EITHER,
    options={"gaffer.federatedstore.operation.graphIds": "graph1"}
)
results = gc.execute_operation(op)
A more usable query library based in Python could look something like this:
from gafferpy import gaffer_query as gq
from gafferpy import gaffer_connector
gc = gaffer_connector.GafferConnector("http://localhost:8080/rest/latest")
results = gq.GetElements(using=gc, graphs="graph1") \
.input("M32:1") \
.view(edge="RoadUse", group_by=[], pre_agg_filter="count > 1") \
.directed("either")
Most of this simplification could be achieved by restructuring operations so that objects like ElementDefinitions don't have to be created in such a verbose way.
For the simplification of the predicate, however, a parser would have to be written to map the string to the relevant Predicate.
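Such a parser could start out very small. The sketch below handles only `field op number` expressions and maps the operator to a Koryphe predicate class name; the function name and the set of supported operators are illustrative assumptions, not an existing gafferpy API:

```python
import re

# Hypothetical mapping from operator symbols to Koryphe predicate classes
OPERATORS = {
    ">": "uk.gov.gchq.koryphe.impl.predicate.IsMoreThan",
    "<": "uk.gov.gchq.koryphe.impl.predicate.IsLessThan",
    "=": "uk.gov.gchq.koryphe.impl.predicate.IsEqual",
}


def parse_filter(expression):
    """Parse a filter like "count > 1" into a selection plus predicate dict."""
    match = re.fullmatch(r"\s*(\w+)\s*([<>=])\s*(\d+)\s*", expression)
    if not match:
        raise ValueError(f"Unsupported filter: {expression!r}")
    field, op, value = match.groups()
    return {
        "selection": [field],
        "predicate": {
            "class": OPERATORS[op],
            "value": {"java.lang.Long": int(value)},
        },
    }


print(parse_filter("count > 1"))
```

A fuller implementation would need to handle non-Long value types and compound expressions, but this shows the shape of the string-to-Predicate translation.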
Caused by some README confusion: sphinx-notes/pages#33
Tests should be added that assert helper functions still work for backwards compatibility.
A lot of Operations in gafferpy have inner helper functions that make them easier to use.
For example, the following code wraps your inputs into a list of ElementSeeds:
https://github.com/gchq/gaffer-tools/blob/45f5fd1920bf5b93459f16df097224c1c2d0ed50/python-shell/src/gafferpy/gaffer_operations.py#L1123-L1136
This means you could provide an input of 1, but it will be wrapped as:
[{"class": "uk.gov.gchq.gaffer.operation.data.EntitySeed", "vertex": 1}]
These core Operations will now be generated, and by default these helper functions will not be generated.
Therefore, more tests are needed to assert that the helper functions still work for backwards compatibility, so that where they are missing they can be added to the generator code.
As described in #8, a lot of gafferpy classes have "helper functions" that effectively wrap some inputs for the user to make gafferpy easier to use.
With the addition of fishbowl into gafferpy, a lot of these were lost and needed to be added back manually.
However, fishbowl could use the type of a parameter to generate helper functions.
For example, where the operation details endpoint states the input parameter has:
"className": "uk.gov.gchq.gaffer.data.element.Element[]"
This could automatically wrap a single Element into a list, and even wrap single values in EntitySeeds.
As well as this, an Element's properties could be wrapped in types depending on the schema.
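The type-driven wrapping could look something like the sketch below. The function name `wrap_input` and the exact wrapping rules are assumptions for illustration; the real generator would need to consult the operation details endpoint for each parameter:

```python
def wrap_input(value, class_name):
    """Hypothetical wrapping logic driven by a parameter's declared Java type.

    An array type (className ending in "[]") means a single value should be
    wrapped in a list, and bare vertex values wrapped as EntitySeeds.
    """
    if not class_name.endswith("[]"):
        return value
    values = value if isinstance(value, list) else [value]
    wrapped = []
    for v in values:
        if isinstance(v, dict) and "class" in v:
            wrapped.append(v)  # already a Gaffer object, pass through
        else:
            wrapped.append({
                "class": "uk.gov.gchq.gaffer.operation.data.EntitySeed",
                "vertex": v,
            })
    return wrapped


# A bare vertex becomes a one-element list of EntitySeeds
print(wrap_input(1, "uk.gov.gchq.gaffer.data.element.Element[]"))
```

This mirrors the behaviour of the hand-written helper linked in #8, but derived from the declared `className` rather than maintained by hand.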
The "Update Gaffer Version" workflow does not work properly right now. As well as this, the release should be reworked and simplified.
Currently, when trying to serialise a SeedPair in the Python shell, a TypeError is thrown:
TypeError: Object of type SeedPair is not JSON serializable
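This error usually means the standard library serialiser does not know how to convert the object to JSON. One generic pattern is to pass a `default` hook to `json.dumps` that delegates to the object's own serialisation method. The `SeedPair` stand-in and `to_json` name below are illustrative; the real gafferpy class and hook would need to match its existing conventions:

```python
import json


class SeedPair:
    """Minimal stand-in for gafferpy's SeedPair (the real class differs)."""

    def __init__(self, first, second):
        self.first = first
        self.second = second

    def to_json(self):
        return {"class": "Pair", "first": self.first, "second": self.second}


def gaffer_default(obj):
    # Called by json.dumps for any object it cannot serialise itself
    if hasattr(obj, "to_json"):
        return obj.to_json()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


print(json.dumps(SeedPair(1, 2), default=gaffer_default))
```

The same `default` hook then covers any future gafferpy class that exposes a `to_json`-style method, rather than fixing SeedPair alone.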
Describe the new feature you'd like
gafferpy should be able to optionally return results in the form of a dataframe
Why do you want this feature?
It would enable users to interact with the data more easily, rather than getting a basic list of elements
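One route would be to flatten each element's JSON into one row per element, which a pandas `DataFrame` constructor can consume directly. The `flatten_element` helper below is a sketch (the name and the Java-type-unwrapping rule are assumptions); the flattening is done in pure Python so pandas stays an optional dependency:

```python
def flatten_element(element):
    """Flatten one Gaffer element dict into a flat row, unwrapping Java type
    labels such as {"java.lang.Long": 841303} (illustrative sketch only)."""
    row = {k: v for k, v in element.items() if k != "properties"}
    for name, value in element.get("properties", {}).items():
        if isinstance(value, dict) and len(value) == 1:
            value = next(iter(value.values()))  # strip the Java type wrapper
        row[name] = value
    return row


elements = [{
    "class": "uk.gov.gchq.gaffer.data.element.Edge",
    "source": "M32:1",
    "destination": "M32:M4 (19)",
    "group": "RoadUse",
    "properties": {"count": {"java.lang.Long": 841303}},
}]
rows = [flatten_element(e) for e in elements]
# pandas.DataFrame(rows) would then give one column per property
print(rows[0]["count"])
```

Each row dict maps column name to value, so `pandas.DataFrame(rows)` yields one column per vertex field and property.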
Rather than having to manually create an issue and rename the PR, this could all be automated
Currently, there exists a single generate.py script which uses fishbowl to generate the core api code for gafferpy and put it into a directory where gafferpy expects it. However, there are bugs and usability issues with this.
Firstly, to generate the code, fishbowl uses a GafferConnector to connect to the REST API. However, this connector imports some gafferpy modules such as gafferpy.gaffer_operations, so it breaks if a generated library does not already exist.
Another issue is that currently it is not very easy to use fishbowl to extend gafferpy with custom operations. Users would have to download the gafferpy source code, use the generate.py script, and then import that library from source instead.
It would be nice if perhaps a fishbowl command line interface could be used instead so that users could specify things like: location of the rest api, where to put generated files, which files to generate, and whether to just generate the additional classes or a whole gafferpy installation.
Rough example usage:
fishbowl --api "http://localhost:8080/myRest" --output ./fishbowl_classes/ --generate operations,predicates
As well as this, perhaps a special import feature can be made where users can at runtime generate specific classes from a rest-api and these will be used to overwrite the default gafferpy ones.
Rough example:
from gafferpy import gaffer as g
from fishbowl.fishbowl import Fishbowl
Fishbowl("http://localhost:8080/rest", type="in-memory", classes="operations")
g.CustomOp()
This repository has been created from gaffer-tools but will only host gafferpy. The non-gafferpy-related content should be removed, and the CI should be updated.
Gaffer 2 removed seed matching, but gafferpy could retain backwards compatibility with existing scripts by adding this back and translating the json to use Views instead.
Describe the new feature you'd like
gafferpy should be able to stream results back from the rest api, probably in bulk chunks where the user can set the size
Why do you want this feature?
If very large results are returned, this would allow gafferpy users to process the results as they come, effectively utilising the lazy iterable from Accumulo. It would mean that large results that would otherwise not fit into memory can be processed in a stream.
Additional context
The /graph/operations/execute/chunked endpoint should be used to stream results back from the REST API.
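The consuming side could be a generator that parses results as they arrive. The sketch below assumes, hypothetically, that the chunked endpoint emits one JSON document per line; the real framing would need to be checked against the Gaffer REST API, and in practice the stream would come from an HTTP response rather than an in-memory buffer:

```python
import io
import json


def iter_chunked_results(stream):
    """Yield results one at a time from a line-delimited JSON stream.

    `stream` is any iterable of lines, e.g. the response object returned by
    urllib.request.urlopen for a POST to /graph/operations/execute/chunked.
    """
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Simulated response body standing in for a live HTTP stream
body = io.StringIO('{"vertex": 1}\n{"vertex": 2}\n')
for result in iter_chunked_results(body):
    print(result["vertex"])
```

Because the generator never materialises the full result set, memory use stays bounded by the chunk size regardless of how many elements the query returns.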
Currently, results are returned in gafferpy either as the direct JSON result from the Gaffer API or as the equivalent gafferpy objects. This is okay for some use cases, but if a user wants to perform a simple, fast query, it can become bogged down in a lot of Java-related boilerplate to do with types.
This is an example output from the road-traffic example:
{'class': 'uk.gov.gchq.gaffer.data.element.Edge',
'destination': 'M32:M4 (19)',
'directed': True,
'group': 'RoadUse',
'matchedVertex': 'SOURCE',
'properties': {'count': {'java.lang.Long': 841303},
'countByVehicleType': {'uk.gov.gchq.gaffer.types.FreqMap': {'AMV': 407034,
'BUS': 1375,
'CAR': 320028,
'HGV': 27234,
'HGVA3': 1277,
'HGVA5': 5964,
'HGVA6': 4817,
'HGVR2': 11369,
'HGVR3': 2004,
'HGVR4': 1803,
'LGV': 55312,
'PC': 1,
'WMV2': 3085}},
'endDate': {'java.util.Date': 1431543599999},
'startDate': {'java.util.Date': 1034319600000}},
'source': 'M32:1'}
It would be great if this could optionally return an object that you could get results from directly, without the nested types involved:
>>> print(result.source)
'M32:1'
>>> print(result.properties.count)
841303
>>> print(result.countByVehicleType.CAR)
320028
This could be implemented as a generator that takes json input to create these results objects lazily. Dictionaries can be mapped to objects easily in Python (see munch).
When creating this generator, users should be able to easily add transform functions to the result, like removing, renaming and applying functions to fields. A lot of this functionality (renaming fields, ignoring fields and transforming them) already comes with Gaffer though, so perhaps this could be added to the OperationChain rather than executed in Python.
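A minimal sketch of such a results object is below. The `Result` class name and the rule for spotting Java type wrappers (a single-key dict whose key contains a dot) are assumptions for illustration; a library like munch would provide similar attribute access over plain dicts:

```python
class Result:
    """Wrap nested element JSON for attribute access, stripping Java type
    labels like {"java.lang.Long": 841303} along the way (sketch only)."""

    def __init__(self, data):
        self._data = data

    def __getattr__(self, name):
        value = self._data[name]
        # Unwrap single-key Java type dicts, e.g. {"java.lang.Long": 841303}
        while (isinstance(value, dict) and len(value) == 1
               and "." in next(iter(value))):
            value = next(iter(value.values()))
        if isinstance(value, dict):
            return Result(value)  # allow further chained attribute access
        return value


edge = {
    "source": "M32:1",
    "properties": {
        "count": {"java.lang.Long": 841303},
        "countByVehicleType": {
            "uk.gov.gchq.gaffer.types.FreqMap": {"CAR": 320028}
        },
    },
}
result = Result(edge)
print(result.source)                             # M32:1
print(result.properties.count)                   # 841303
print(result.properties.countByVehicleType.CAR)  # 320028
```

Building these lazily inside the generator means the wrapping cost is only paid for results the user actually touches.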
Describe the new feature you'd like
gafferpy should be able to take an iterable and call an operation repeatedly using chunks from that iterable
Why do you want this feature?
This would allow a large AddElements to be easily chunked into user defined sizes
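The chunking itself is straightforward; the gafferpy-facing helper below is hypothetical (`connector.execute_operation` and `g.AddElements` are assumed to match the existing API), while the `chunked` generator is the reusable part:

```python
import itertools


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk


def add_elements_in_chunks(connector, elements, size=10000):
    """Hypothetical helper: execute one AddElements operation per chunk."""
    from gafferpy import gaffer as g  # assumed existing gafferpy import
    for chunk in chunked(elements, size):
        connector.execute_operation(g.AddElements(input=chunk))


print(list(chunked(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```

Because `chunked` consumes its input lazily, the elements iterable can itself be a generator, so the full element set never needs to be held in memory.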
Should include more details like:
Currently the copyright year is set to 2022 in the templates. It should instead use the actual date to create correct copyright dates.
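The fix amounts to substituting the current year at generation time rather than hardcoding it. A sketch, assuming the templates can use simple string formatting (the real template mechanism and copyright wording may differ):

```python
from datetime import date

# Substitute the year the code is generated, not a hardcoded 2022
header_template = "Copyright {year} Crown Copyright"
header = header_template.format(year=date.today().year)
print(header)
```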
gaffer-tools' random element generation could be used as a reference. It would be a useful thing to be able to do directly in gafferpy.
mvn verify -Proad-traffic-demo runs all the Gaffer tests. It should be mvn clean install -pl :road-traffic-demo -Proad-traffic-demo,quick instead.
The tests could do with tidying and using pytest
Gaffer has a Spark library with Scala and Java APIs for accessing data using Spark, generating RDDs and Spark DataFrames from Gaffer graphs.
Gaffer also has a Python shell with implementations of standard Gaffer operations that can be executed on the graph using Gaffer's REST service.
Extending the Python API to support Spark operations, producing RDDs and DataFrames, would open Gaffer up to a lot of useful Python and Spark data science and machine learning libraries.
There should be a workflow that rebuilds the docs and pushes them to gh-pages after every merge to main