gchq / gafferpy
Python API for Gaffer
Home Page: https://gchq.github.io/gafferpy/
License: Apache License 2.0
There is inconsistency in the use of single and double quotes across gafferpy; this should be standardised.
The current gafferpy API has an implementation for a SeedPair that makes use of the class uk.gov.gchq.gaffer.commonutil.pair.Pair; however, this only accepts a single EntitySeed for both the first and second argument of the pair. A generic Pair class is needed in gafferpy so that operations such as GetElementsBetweenSetsPairs can be used.
This was found while testing the GetElementsBetweenSetsPairs operation: currently you cannot use this operation via gafferpy because there is no suitable class for the operation's input. A class needs adding that can accept either a pair of objects or a pair of two lists, and serialise them correctly into JSON.
For reference, valid JSON for the GetElementsBetweenSetsPairs operation looks like:
{
"class" : "GetElementsBetweenSetsPairs",
"input" : {
"class" : "Pair",
"first" : {
"Iterable": [
{
"class" : "EntitySeed",
"vertex" : 1
}
]
},
"second" : {
"Iterable": [
{
"class" : "EntitySeed",
"vertex" : 2
},
{
"class" : "EntitySeed",
"vertex" : 4
}
]
}
}
}
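A generic Pair could be sketched in Python along these lines. This is a minimal sketch with hypothetical class and method names (`Pair`, `to_json`); the real gafferpy base classes and serialisation hooks may differ:

```python
import json


class Pair:
    """Hypothetical generic pair: accepts single seed objects or lists of them."""

    CLASS = "uk.gov.gchq.gaffer.commonutil.pair.Pair"

    def __init__(self, first, second):
        self.first = first
        self.second = second

    @staticmethod
    def _wrap(value):
        # Both single objects and lists serialise to the Iterable wrapper
        items = value if isinstance(value, list) else [value]
        return {"Iterable": [
            i.to_json() if hasattr(i, "to_json") else i for i in items
        ]}

    def to_json(self):
        return {
            "class": self.CLASS,
            "first": self._wrap(self.first),
            "second": self._wrap(self.second),
        }


# Reproduces the shape of the reference JSON above
pair = Pair(
    first={"class": "EntitySeed", "vertex": 1},
    second=[{"class": "EntitySeed", "vertex": 2},
            {"class": "EntitySeed", "vertex": 4}],
)
print(json.dumps(pair.to_json(), indent=2))
```

The key design point is that `_wrap` accepts either a bare object or a list, so the same class serves both `GetElementsBetweenSetsPairs` input styles.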
There are a few places where the Gaffer version is specified but not changed by the "Update Gaffer Version" workflow. The workflow should be changed so that all mentions of the Gaffer version are updated together.
For example:
gafferpy is great for directly sending JSON to the Gaffer REST API, and to do this it creates Python objects that map one to one to that JSON. However, these queries can become very long and require knowledge of Gaffer's verbose query language/JSON.
For a better user experience, an additional library should be made to sit on top of gafferpy, allowing users to specify more Pythonic, user-friendly queries that would be translated into gafferpy queries and sent.
For example, here is a JSON query for a GetElements operation on the road traffic API. It gets Edges connected to the Entity with vertex M32:1. These are then filtered on the count property being more than 1, and the group-by removed. It also uses the gaffer.federatedstore.operation.graphIds option to assert that it only executes on the sub-graph graph1:
{
"class": "uk.gov.gchq.gaffer.operation.impl.get.GetElements",
"input": [
{
"class": "uk.gov.gchq.gaffer.operation.data.EntitySeed",
"vertex": "M32:1"
}
],
"view": {
"edges": {
"RoadUse": {
"preAggregationFilterFunctions": [
{
"selection": [
"count"
],
"predicate": {
"class": "uk.gov.gchq.koryphe.impl.predicate.IsMoreThan",
"value": {
"java.lang.Long": 1
}
}
}
],
"groupBy": []
}
}
},
"directedType": "EITHER",
"options": {
"gaffer.federatedstore.operation.graphIds": "graph1"
}
}
This is a very long and verbose mapping to the Java API. The gafferpy code to perform this query is an equally verbose map to this JSON:
from gafferpy import gaffer as g
from gafferpy import gaffer_connector
gc = gaffer_connector.GafferConnector("http://localhost:8080/rest/latest")
op = g.GetElements(
input=['M32:1'],
view=g.View(
edges=[
g.ElementDefinition(
group='RoadUse',
group_by=[],
pre_aggregation_filter_functions=[
g.PredicateContext(
selection=['count'],
predicate=g.IsMoreThan(
value=g.long(1)
)
)
]
)
]
),
directed_type=g.DirectedType.EITHER,
    options={"gaffer.federatedstore.operation.graphIds": "graph1"}
)
results = gc.execute_operation(op)
A more usable query library based in Python could look something like this:
from gafferpy import gaffer_query as gq
from gafferpy import gaffer_connector
gc = gaffer_connector.GafferConnector("http://localhost:8080/rest/latest")
results = gq.GetElements(using=gc, graphs="graph1") \
.input("M32:1") \
.view(edge="RoadUse", group_by=[], pre_agg_filter="count > 1") \
.directed("either")
Most of this simplification could be achieved by restructuring operations so that objects like ElementDefinitions don't have to be created in such a verbose way.
For the simplification of the predicate, however, a parser would have to be written to map the string to the relevant Predicate.
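Such a parser could start out very small. The sketch below handles only `field op number` expressions and maps the operator to a Koryphe predicate class name; the function name and the set of supported operators are illustrative assumptions, not an existing gafferpy API:

```python
import re

# Hypothetical mapping from operator symbols to Koryphe predicate classes
OPERATORS = {
    ">": "uk.gov.gchq.koryphe.impl.predicate.IsMoreThan",
    "<": "uk.gov.gchq.koryphe.impl.predicate.IsLessThan",
    "=": "uk.gov.gchq.koryphe.impl.predicate.IsEqual",
}


def parse_filter(expression):
    """Parse a filter like "count > 1" into a selection plus predicate dict."""
    match = re.fullmatch(r"\s*(\w+)\s*([<>=])\s*(\d+)\s*", expression)
    if not match:
        raise ValueError(f"Unsupported filter: {expression!r}")
    field, op, value = match.groups()
    return {
        "selection": [field],
        "predicate": {
            "class": OPERATORS[op],
            "value": {"java.lang.Long": int(value)},
        },
    }


print(parse_filter("count > 1"))
```

A fuller implementation would need to handle non-Long value types and compound expressions, but this shows the shape of the string-to-Predicate translation.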
Caused by some README confusion: sphinx-notes/pages#33
Tests should be added that assert helper functions still work for backwards compatibility.
A lot of Operations in gafferpy have inner helper functions that make them easier to use.
For example, the following code wraps your inputs into a list of ElementSeeds:
https://github.com/gchq/gaffer-tools/blob/45f5fd1920bf5b93459f16df097224c1c2d0ed50/python-shell/src/gafferpy/gaffer_operations.py#L1123-L1136
This means you could provide an input of 1, but it will be wrapped as:
[{"class": "uk.gov.gchq.gaffer.operation.data.EntitySeed", "vertex": 1}]
These core Operations will now be generated, and by default these helper functions will not be generated.
Therefore, more tests are needed to assert that the helper functions still work for backwards compatibility, so that where they are missing they can be added to the generator code.
As described in #8, a lot of gafferpy classes have "helper functions" that effectively wrap some inputs for the user to make gafferpy easier to use.
With the addition of fishbowl into gafferpy, a lot of these were lost and needed to be added back manually.
However, fishbowl could use the type of a parameter to generate helper functions.
For example, where the operation details endpoint states the input parameter has:
"className": "uk.gov.gchq.gaffer.data.element.Element[]"
This could automatically wrap a single Element into a list, and even wrap single values in EntitySeeds.
As well as this, an Element's properties could be wrapped in types depending on the schema.
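The type-driven wrapping could look something like the sketch below. The function name `wrap_input` and the exact wrapping rules are assumptions for illustration; the real generator would need to consult the operation details endpoint for each parameter:

```python
def wrap_input(value, class_name):
    """Hypothetical wrapping logic driven by a parameter's declared Java type.

    An array type (className ending in "[]") means a single value should be
    wrapped in a list, and bare vertex values wrapped as EntitySeeds.
    """
    if not class_name.endswith("[]"):
        return value
    values = value if isinstance(value, list) else [value]
    wrapped = []
    for v in values:
        if isinstance(v, dict) and "class" in v:
            wrapped.append(v)  # already a Gaffer object, pass through
        else:
            wrapped.append({
                "class": "uk.gov.gchq.gaffer.operation.data.EntitySeed",
                "vertex": v,
            })
    return wrapped


# A bare vertex becomes a one-element list of EntitySeeds
print(wrap_input(1, "uk.gov.gchq.gaffer.data.element.Element[]"))
```

This mirrors the behaviour of the hand-written helper linked in #8, but derived from the declared `className` rather than maintained by hand.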
The "Update Gaffer Version" workflow does not work properly right now. As well as this, the release should be reworked and simplified.
Currently, when trying to serialise a SeedPair in the Python shell, a TypeError is thrown:
TypeError: Object of type SeedPair is not JSON serializable
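This error usually means the standard library serialiser does not know how to convert the object to JSON. One generic pattern is to pass a `default` hook to `json.dumps` that delegates to the object's own serialisation method. The `SeedPair` stand-in and `to_json` name below are illustrative; the real gafferpy class and hook would need to match its existing conventions:

```python
import json


class SeedPair:
    """Minimal stand-in for gafferpy's SeedPair (the real class differs)."""

    def __init__(self, first, second):
        self.first = first
        self.second = second

    def to_json(self):
        return {"class": "Pair", "first": self.first, "second": self.second}


def gaffer_default(obj):
    # Called by json.dumps for any object it cannot serialise itself
    if hasattr(obj, "to_json"):
        return obj.to_json()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


print(json.dumps(SeedPair(1, 2), default=gaffer_default))
```

The same `default` hook then covers any future gafferpy class that exposes a `to_json`-style method, rather than fixing SeedPair alone.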
Describe the new feature you'd like
gafferpy should be able to optionally return results in the form of a dataframe
Why do you want this feature?
It would enable users to interact with the data more easily, rather than getting a basic list of elements
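One route would be to flatten each element's JSON into one row per element, which a pandas `DataFrame` constructor can consume directly. The `flatten_element` helper below is a sketch (the name and the Java-type-unwrapping rule are assumptions); the flattening is done in pure Python so pandas stays an optional dependency:

```python
def flatten_element(element):
    """Flatten one Gaffer element dict into a flat row, unwrapping Java type
    labels such as {"java.lang.Long": 841303} (illustrative sketch only)."""
    row = {k: v for k, v in element.items() if k != "properties"}
    for name, value in element.get("properties", {}).items():
        if isinstance(value, dict) and len(value) == 1:
            value = next(iter(value.values()))  # strip the Java type wrapper
        row[name] = value
    return row


elements = [{
    "class": "uk.gov.gchq.gaffer.data.element.Edge",
    "source": "M32:1",
    "destination": "M32:M4 (19)",
    "group": "RoadUse",
    "properties": {"count": {"java.lang.Long": 841303}},
}]
rows = [flatten_element(e) for e in elements]
# pandas.DataFrame(rows) would then give one column per property
print(rows[0]["count"])
```

Each row dict maps column name to value, so `pandas.DataFrame(rows)` yields one column per vertex field and property.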
Rather than having to manually create an issue and rename the PR, this could all be automated
Currently, there exists a single generate.py script which uses fishbowl to generate the core api code for gafferpy and put it into a directory where gafferpy expects it. However, there are bugs and usability issues with this.
Firstly, to generate the code, fishbowl uses a GafferConnector to connect to the REST API. However, this connector imports some gafferpy modules such as gafferpy.gaffer_operations, so it breaks if a generated library does not already exist.
Another issue is that currently it is not very easy to use fishbowl to extend gafferpy with custom operations. Users would have to download the gafferpy source code, use the generate.py script, and then import that library from source instead.
It would be nice if perhaps a fishbowl command line interface could be used instead so that users could specify things like: location of the rest api, where to put generated files, which files to generate, and whether to just generate the additional classes or a whole gafferpy installation.
Rough example usage:
fishbowl --api "http://localhost:8080/myRest" --output ./fishbowl_classes/ --generate operations,predicates
As well as this, perhaps a special import feature can be made where users can at runtime generate specific classes from a rest-api and these will be used to overwrite the default gafferpy ones.
Rough example:
from gafferpy import gaffer as g
from fishbowl.fishbowl import Fishbowl
Fishbowl("http://localhost:8080/rest", type="in-memory", classes="operations")
g.CustomOp()
This repository has been created from gaffer-tools but will only host gafferpy. The non-gafferpy-related content should be removed, and the CI should be updated.
Gaffer 2 removed seed matching, but gafferpy could retain backwards compatibility with existing scripts by adding this back and translating the json to use Views instead.
Describe the new feature you'd like
gafferpy should be able to stream results back from the rest api, probably in bulk chunks where the user can set the size
Why do you want this feature?
If very large results are returned, this would allow gafferpy users to process the results as they come, effectively utilising the lazy iterable from Accumulo. It would mean that large results that would otherwise not fit into memory can be processed in a stream.
Additional context
The /graph/operations/execute/chunked endpoint should be used to stream results back from the REST API.
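The consuming side could be a generator that parses results as they arrive. The sketch below assumes, hypothetically, that the chunked endpoint emits one JSON document per line; the real framing would need to be checked against the Gaffer REST API, and in practice the stream would come from an HTTP response rather than an in-memory buffer:

```python
import io
import json


def iter_chunked_results(stream):
    """Yield results one at a time from a line-delimited JSON stream.

    `stream` is any iterable of lines, e.g. the response object returned by
    urllib.request.urlopen for a POST to /graph/operations/execute/chunked.
    """
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Simulated response body standing in for a live HTTP stream
body = io.StringIO('{"vertex": 1}\n{"vertex": 2}\n')
for result in iter_chunked_results(body):
    print(result["vertex"])
```

Because the generator never materialises the full result set, memory use stays bounded by the chunk size regardless of how many elements the query returns.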
Currently, results are returned in gafferpy either as the direct JSON result from the Gaffer API or as the equivalent gafferpy objects. This is okay for some use cases, but if a user wants to perform a simple, fast query, it can become bogged down in a lot of Java-related boilerplate to do with types.
This is an example output from the road-traffic example:
{'class': 'uk.gov.gchq.gaffer.data.element.Edge',
'destination': 'M32:M4 (19)',
'directed': True,
'group': 'RoadUse',
'matchedVertex': 'SOURCE',
'properties': {'count': {'java.lang.Long': 841303},
'countByVehicleType': {'uk.gov.gchq.gaffer.types.FreqMap': {'AMV': 407034,
'BUS': 1375,
'CAR': 320028,
'HGV': 27234,
'HGVA3': 1277,
'HGVA5': 5964,
'HGVA6': 4817,
'HGVR2': 11369,
'HGVR3': 2004,
'HGVR4': 1803,
'LGV': 55312,
'PC': 1,
'WMV2': 3085}},
'endDate': {'java.util.Date': 1431543599999},
'startDate': {'java.util.Date': 1034319600000}},
'source': 'M32:1'}
It would be great if this could optionally return an object that you could get results from directly, without the nested types involved:
>>> print(result.source)
'M32:1'
>>> print(result.properties.count)
841303
>>> print(result.countByVehicleType.CAR)
320028
This could be implemented as a generator that takes json input to create these results objects lazily. Dictionaries can be mapped to objects easily in Python (see munch).
When creating this generator, users should be able to easily add transform functions to the result, like removing, renaming and applying functions to fields. A lot of this functionality (renaming fields, ignoring fields and transforming them) already comes with Gaffer though, so perhaps this could be added to the OperationChain rather than executed in Python.
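A minimal sketch of such a results object is below. The `Result` class name and the rule for spotting Java type wrappers (a single-key dict whose key contains a dot) are assumptions for illustration; a library like munch would provide similar attribute access over plain dicts:

```python
class Result:
    """Wrap nested element JSON for attribute access, stripping Java type
    labels like {"java.lang.Long": 841303} along the way (sketch only)."""

    def __init__(self, data):
        self._data = data

    def __getattr__(self, name):
        value = self._data[name]
        # Unwrap single-key Java type dicts, e.g. {"java.lang.Long": 841303}
        while (isinstance(value, dict) and len(value) == 1
               and "." in next(iter(value))):
            value = next(iter(value.values()))
        if isinstance(value, dict):
            return Result(value)  # allow further chained attribute access
        return value


edge = {
    "source": "M32:1",
    "properties": {
        "count": {"java.lang.Long": 841303},
        "countByVehicleType": {
            "uk.gov.gchq.gaffer.types.FreqMap": {"CAR": 320028}
        },
    },
}
result = Result(edge)
print(result.source)                             # M32:1
print(result.properties.count)                   # 841303
print(result.properties.countByVehicleType.CAR)  # 320028
```

Building these lazily inside the generator means the wrapping cost is only paid for results the user actually touches.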
Describe the new feature you'd like
gafferpy should be able to take an iterable and call an operation repeatedly using chunks from that iterable
Why do you want this feature?
This would allow a large AddElements to be easily chunked into user defined sizes
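The chunking itself is straightforward; the gafferpy-facing helper below is hypothetical (`connector.execute_operation` and `g.AddElements` are assumed to match the existing API), while the `chunked` generator is the reusable part:

```python
import itertools


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk


def add_elements_in_chunks(connector, elements, size=10000):
    """Hypothetical helper: execute one AddElements operation per chunk."""
    from gafferpy import gaffer as g  # assumed existing gafferpy import
    for chunk in chunked(elements, size):
        connector.execute_operation(g.AddElements(input=chunk))


print(list(chunked(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```

Because `chunked` consumes its input lazily, the elements iterable can itself be a generator, so the full element set never needs to be held in memory.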
Should include more details like:
Currently the copyright year is set to 2022 in the templates. It should instead use the actual date to create correct copyright dates.
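The fix amounts to substituting the current year at generation time rather than hardcoding it. A sketch, assuming the templates can use simple string formatting (the real template mechanism and copyright wording may differ):

```python
from datetime import date

# Substitute the year the code is generated, not a hardcoded 2022
header_template = "Copyright {year} Crown Copyright"
header = header_template.format(year=date.today().year)
print(header)
```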
gaffer-tools' random element generation could be used as a reference. It would be a useful thing to be able to do directly in gafferpy.
mvn verify -Proad-traffic-demo runs all the Gaffer tests. It should be mvn clean install -pl :road-traffic-demo -Proad-traffic-demo,quick instead.
The tests could do with tidying and using pytest
Gaffer has a Spark library with Scala and Java APIs for accessing data using Spark, generating RDDs and Spark DataFrames from Gaffer graphs.
Gaffer also has a Python shell with implementations of standard Gaffer operations that can be executed on the graph using Gaffer's REST service.
Extending the Python API to support Spark operations, producing RDDs and DataFrames, would open Gaffer up to a lot of useful Python and Spark data science and machine learning libraries.
There should be a workflow that rebuilds the docs and pushes them to gh-pages after every merge to main