gchq / gafferpy

Python API for Gaffer

Home Page: https://gchq.github.io/gafferpy/

License: Apache License 2.0

Languages: Python 97.13%, Jinja 0.73%, Jupyter Notebook 2.14%
Topics: accumulo, gaffer, graph, python

gafferpy's People

Contributors

c015dariu, ctas582, d47853, dependabot[bot], g609bmsma, gaffer01, gchqdeveloper314, github-actions[bot], hchho, j8934893, javadev001001, jpelbertrios, l50741, lb324567, m29827, m316257, m55624, m607123, macenturalxl1, n288tjyrx, p013570, p3430233, r32575, sameshl, t511203, t616178, t92549

gafferpy's Issues

gafferpy API missing a basic Pair implementation

The current gafferpy API has an implementation of SeedPair that uses the class uk.gov.gchq.gaffer.commonutil.pair.Pair. However, it only accepts a single EntitySeed for each of the first and second arguments of the pair. A generic Pair class is needed in gafferpy so that operations such as GetElementsBetweenSetsPairs can be used.

This was found while testing the GetElementsBetweenSetsPairs operation: it currently cannot be used via gafferpy because there is no suitable class for the operation's input. A class needs to be added that can accept either a pair of objects or a pair of lists and serialise them correctly into JSON.

For reference, valid JSON for the GetElementsBetweenSetsPairs operation looks like:

{
    "class" : "GetElementsBetweenSetsPairs",
    "input" : {
        "class" : "Pair",
        "first" : {
            "Iterable": [
                {
                    "class" : "EntitySeed",
                    "vertex" : 1
                }
            ]
        },
        "second" : {
            "Iterable": [
                {
                    "class" : "EntitySeed",
                    "vertex" : 2
                },
                {
                    "class" : "EntitySeed",
                    "vertex" : 4
                }
            ]
        }
    }
}
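
As a rough illustration of what such a class might look like, here is a minimal sketch that reproduces the JSON above. The `Pair` class, `_item_to_json` and `entity_seed` are hypothetical names, not existing gafferpy API, and the sketch only covers the pair-of-lists form shown above; how a pair of single objects should be serialised is not shown in this issue, so it is left out.

```python
import json


class Pair:
    """Hypothetical generic pair of iterables; not part of the current gafferpy API."""

    def __init__(self, first, second):
        self.first = list(first)
        self.second = list(second)

    def to_json(self):
        # Each side is wrapped in an "Iterable" key, as in the JSON above.
        return {
            "class": "Pair",
            "first": {"Iterable": [_item_to_json(i) for i in self.first]},
            "second": {"Iterable": [_item_to_json(i) for i in self.second]},
        }


def _item_to_json(item):
    # Assumption: objects that can serialise themselves expose to_json();
    # plain dicts are passed through unchanged.
    return item.to_json() if hasattr(item, "to_json") else item


def entity_seed(vertex):
    # Plain-dict stand-in for an EntitySeed, using the short class name from the issue.
    return {"class": "EntitySeed", "vertex": vertex}


operation = {
    "class": "GetElementsBetweenSetsPairs",
    "input": Pair([entity_seed(1)], [entity_seed(2), entity_seed(4)]).to_json(),
}
print(json.dumps(operation, indent=4))
```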

Improve update Gaffer workflow

There are a few places where the Gaffer version is specified but not changed by the "Update Gaffer Version" workflow. The workflow should be changed so that all mentions of the Gaffer version are updated together.
For example:

curl -o road-traffic-rest-2.0.0.war https://repo1.maven.org/maven2/uk/gov/gchq/gaffer/road-traffic-rest/2.0.0/road-traffic-rest-2.0.0.war
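
One possible way to update every mention in a piece of text at once is a guarded regular-expression replace. This is only a sketch: `replace_version` is a hypothetical helper, the target version 2.1.0 is made up for the example, and the actual workflow may work quite differently.

```python
import re


def replace_version(text, old_version, new_version):
    """Replace every standalone occurrence of old_version in text."""
    # The lookarounds stop "2.0.0" from matching inside "12.0.0" or "2.0.0.1",
    # while still matching in "road-traffic-rest-2.0.0.war".
    pattern = re.compile(
        r"(?<!\d)(?<!\d\.)" + re.escape(old_version) + r"(?!\.?\d)"
    )
    # A function is used as the replacement so new_version is inserted literally.
    return pattern.sub(lambda _match: new_version, text)


line = (
    "curl -o road-traffic-rest-2.0.0.war "
    "https://repo1.maven.org/maven2/uk/gov/gchq/gaffer/road-traffic-rest/2.0.0/"
    "road-traffic-rest-2.0.0.war"
)
updated = replace_version(line, "2.0.0", "2.1.0")
print(updated)
```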

Add a high level Python query library on top of gafferpy

gafferpy is great for directly sending JSON to the Gaffer REST API, and to do this it creates Python objects that map one-to-one to that JSON. However, these queries can become very long and require knowledge of Gaffer's verbose JSON query language.
For a better user experience, an additional library should be made to sit on top of gafferpy, allowing users to write more Pythonic, user-friendly queries that are translated to gafferpy queries and sent.

For example, here is a JSON query for a GetElements operation on the road traffic API. It gets Edges connected to the Entity with vertex M32:1. These are then filtered to those whose count property is more than 1, and the group-by is removed. It also uses the gaffer.federatedstore.operation.graphIds option to assert that this is only executed on the sub-graph graph1:

{
    "class": "uk.gov.gchq.gaffer.operation.impl.get.GetElements",
    "input": [
        {
            "class": "uk.gov.gchq.gaffer.operation.data.EntitySeed",
            "vertex": "M32:1"
        }
    ],
    "view": {
        "edges": {
            "RoadUse": {
                "preAggregationFilterFunctions": [
                    {
                        "selection": [
                            "count"
                        ],
                        "predicate": {
                            "class": "uk.gov.gchq.koryphe.impl.predicate.IsMoreThan",
                            "value": {
                                "java.lang.Long": 1
                            }
                        }
                    }
                ],
                "groupBy": []
            }
        }
    },
    "directedType": "EITHER",
    "options": {
        "gaffer.federatedstore.operation.graphIds": "graph1"
    }
}

This is a very long and verbose mapping to the Java API. The gafferpy code to perform this query maps just as verbosely to this JSON:

from gafferpy import gaffer as g
from gafferpy import gaffer_connector

gc = gaffer_connector.GafferConnector("http://localhost:8080/rest/latest")
op = g.GetElements(
    input=['M32:1'],
    view=g.View(
        edges=[
            g.ElementDefinition(
                group='RoadUse',
                group_by=[],
                pre_aggregation_filter_functions=[
                    g.PredicateContext(
                        selection=['count'],
                        predicate=g.IsMoreThan(
                            value=g.long(1)
                        )
                    )
                ]
            )
        ]
    ),
    directed_type=g.DirectedType.EITHER,
    options={"gaffer.federatedstore.operation.graphIds": "graph1"}
)
results = gc.execute_operation(op)

A more usable query library based in Python could look something like this:

from gafferpy import gaffer_query as gq
from gafferpy import gaffer_connector

gc = gaffer_connector.GafferConnector("http://localhost:8080/rest/latest")
results = gq.GetElements(using=gc, graphs="graph1") \
            .input("M32:1") \
            .view(edge="RoadUse", group_by=[], pre_agg_filter="count > 1") \
            .directed("either")

Most of this simplification could be achieved by restructuring operations so that objects like ElementDefinitions don't have to be created in such a verbose way.
To simplify the predicate, however, a parser would have to be written to map the string to the relevant Predicate.
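
To give a feel for the parser idea, here is a minimal sketch that turns a string such as "count > 1" into the filter JSON used in the query above. `parse_filter` is a hypothetical name, only `>` and `<` with integer values are handled, and wrapping every integer as java.lang.Long is an assumption copied from the example rather than a general rule.

```python
import re

# Operator-to-predicate mapping; a real parser would need many more entries.
_PREDICATES = {
    ">": "uk.gov.gchq.koryphe.impl.predicate.IsMoreThan",
    "<": "uk.gov.gchq.koryphe.impl.predicate.IsLessThan",
}

# property name, operator, integer literal
_PATTERN = re.compile(r"^\s*(\w+)\s*(>|<)\s*(-?\d+)\s*$")


def parse_filter(expression):
    """Map a string like 'count > 1' to a filter function as a JSON-style dict."""
    match = _PATTERN.match(expression)
    if match is None:
        raise ValueError(f"Unsupported filter expression: {expression!r}")
    prop, operator, value = match.groups()
    return {
        "selection": [prop],
        "predicate": {
            "class": _PREDICATES[operator],
            # Assumption: integers are sent as java.lang.Long, as in the example.
            "value": {"java.lang.Long": int(value)},
        },
    }


parsed = parse_filter("count > 1")
print(parsed)
```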

Add tests for gafferpy helper functions

Tests should be added that assert helper functions still work for backwards compatibility.

Background

A lot of Operations in gafferpy have inner helper functions that make them easier to use.

For example, the following code wraps your inputs into a list of ElementSeeds:
https://github.com/gchq/gaffer-tools/blob/45f5fd1920bf5b93459f16df097224c1c2d0ed50/python-shell/src/gafferpy/gaffer_operations.py#L1123-L1136

This means you could provide an input of 1, but it will be wrapped as:
[{"class": "uk.gov.gchq.gaffer.operation.data.EntitySeed", "vertex": 1}]

These core Operations will now be generated, and by default the helper functions will not be generated with them.

Therefore, more tests are needed to assert that the helper functions still work for backwards compatibility, so that where they are missing they can be added to the generator code.
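
A backwards-compatibility test for this kind of helper could look roughly like the sketch below. `wrap_seeds` is a hypothetical stand-in for the wrapping behaviour described above, not the actual gafferpy code; real tests would call the gafferpy operation itself and compare its JSON output.

```python
ENTITY_SEED = "uk.gov.gchq.gaffer.operation.data.EntitySeed"


def wrap_seeds(value):
    """Hypothetical stand-in for a gafferpy input helper."""
    # A single value is treated as a one-item list.
    items = value if isinstance(value, (list, tuple)) else [value]
    # Values that are already JSON dicts are kept; bare values become EntitySeeds.
    return [
        item if isinstance(item, dict) else {"class": ENTITY_SEED, "vertex": item}
        for item in items
    ]


def test_single_value_is_wrapped():
    assert wrap_seeds(1) == [{"class": ENTITY_SEED, "vertex": 1}]


def test_list_of_values_is_wrapped():
    assert wrap_seeds(["a", "b"]) == [
        {"class": ENTITY_SEED, "vertex": "a"},
        {"class": ENTITY_SEED, "vertex": "b"},
    ]


def test_existing_seed_is_left_alone():
    seed = {"class": ENTITY_SEED, "vertex": 1}
    assert wrap_seeds([seed]) == [seed]


# The tests are written in pytest style but can also be called directly.
test_single_value_is_wrapped()
test_list_of_values_is_wrapped()
test_existing_seed_is_left_alone()
```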

Improve fishbowl's generation of helper functions

As described in #8, a lot of gafferpy classes have "helper functions" that effectively wrap some inputs for the user to make gafferpy easier to use.
With fishbowl's addition into gafferpy, a lot of these were lost and needed to be added back manually.
However, fishbowl could use the type of a parameter to generate helper functions.

For example, where the operation details endpoint states that the input parameter has:
"className": "uk.gov.gchq.gaffer.data.element.Element[]"
fishbowl could generate a helper that automatically wraps a single Element into a list, and even wraps single values in EntitySeeds.

As well as this, an Element's properties could be wrapped in types depending on the schema.
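
The type-driven idea can be sketched as a small factory that picks a helper from the reported className. `make_input_helper` is a hypothetical name, and treating a trailing "[]" as "accepts a list" is an assumption based only on the example above; this is not how fishbowl currently works.

```python
def make_input_helper(class_name):
    """Return a function that normalises an input for the given Java type name."""
    if class_name.endswith("[]"):
        # Array types: wrap a single value in a list, keep lists as lists.
        def helper(value):
            return list(value) if isinstance(value, (list, tuple)) else [value]
    else:
        # Non-array types: pass the value through unchanged.
        def helper(value):
            return value
    return helper


wrap_elements = make_input_helper("uk.gov.gchq.gaffer.data.element.Element[]")
print(wrap_elements("M32:1"))
print(wrap_elements(["M32:1", "M32:2"]))
```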

Fix release and update workflows

The "Update Gaffer Version" workflow does not work properly right now. As well as this, the release should be reworked and simplified.

Add automatic dataframe for gafferpy results

Describe the new feature you'd like
gafferpy should be able to optionally return results in the form of a dataframe

Why do you want this feature?
It would enable users to interact with the data more easily, rather than getting a basic list of elements
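
As a rough sketch of the first step, each element could be flattened into a plain row before being handed to a dataframe library. `unwrap` and `element_to_row` are hypothetical helpers, and the rule that a single dotted key is a Java type wrapper is an assumption; the resulting rows are a list of dicts, which `pandas.DataFrame(rows)` accepts if pandas is installed.

```python
def unwrap(value):
    """Drop a Java type wrapper such as {'java.lang.Long': 1}."""
    if isinstance(value, dict) and len(value) == 1:
        key, inner = next(iter(value.items()))
        # Assumption: a single key containing a dot is a Java class name.
        if "." in key:
            return inner
    return value


def element_to_row(element):
    """Flatten one element (as a JSON dict) into a flat row of columns."""
    row = {key: value for key, value in element.items() if key != "properties"}
    # Properties are promoted to top-level columns with their wrappers removed.
    for name, value in element.get("properties", {}).items():
        row[name] = unwrap(value)
    return row


elements = [
    {
        "class": "uk.gov.gchq.gaffer.data.element.Edge",
        "group": "RoadUse",
        "source": "M32:1",
        "destination": "M32:M4 (19)",
        "properties": {"count": {"java.lang.Long": 841303}},
    }
]
rows = [element_to_row(element) for element in elements]
print(rows)
```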

Improve usability of fishbowl

Currently, there exists a single generate.py script which uses fishbowl to generate the core api code for gafferpy and put it into a directory where gafferpy expects it. However, there are bugs and usability issues with this.

Firstly, to generate the library, a GafferConnector is used to connect to the REST API. However, this connector imports some gafferpy modules such as gafferpy.gaffer_operations, so it breaks if there is no previously generated library.

Another issue is that currently it is not very easy to use fishbowl to extend gafferpy with custom operations. Users would have to download the gafferpy source code, use the generate.py script, and then import that library from source instead.

It would be nice if a fishbowl command line interface could be used instead, so that users could specify things like the location of the REST API, where to put generated files, which files to generate, and whether to generate just the additional classes or a whole gafferpy installation.
Rough example usage:

fishbowl --api "http://localhost:8080/myRest" --output ./fishbowl_classes/ --generate operations,predicates
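
The flags in the rough usage above could be parsed with the standard library's argparse, for example as in this sketch. The defaults and help texts are assumptions, and no such command line interface exists yet.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="fishbowl",
        description="Generate gafferpy classes from a Gaffer REST API (sketch).",
    )
    parser.add_argument("--api", required=True, help="URL of the REST API")
    parser.add_argument(
        "--output",
        default="./fishbowl_classes/",  # assumed default
        help="directory to write generated files to",
    )
    parser.add_argument(
        "--generate",
        type=lambda text: text.split(","),  # "a,b" -> ["a", "b"]
        default=["operations"],  # assumed default
        help="comma-separated list of class types to generate",
    )
    return parser


args = build_parser().parse_args(
    [
        "--api", "http://localhost:8080/myRest",
        "--output", "./fishbowl_classes/",
        "--generate", "operations,predicates",
    ]
)
print(args.api, args.output, args.generate)
```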

As well as this, perhaps a special import feature could be made where users can generate specific classes from a REST API at runtime, and these would overwrite the default gafferpy ones.
Rough example:

from gafferpy import gaffer as g
from fishbowl.fishbowl import Fishbowl

Fishbowl("http://localhost:8080/rest", type="in-memory", classes="operations")
g.CustomOp()

Add results streaming to gafferpy

Describe the new feature you'd like
gafferpy should be able to stream results back from the rest api, probably in bulk chunks where the user can set the size

Why do you want this feature?
If very large results are returned, this would allow gafferpy users to process the results as they come, effectively utilising the lazy iterable from Accumulo. It would mean that large results that would otherwise not fit into memory can be processed in a stream.

Additional context
The /graph/operations/execute/chunked endpoint should be used to stream results back from the rest api
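
On the Python side, lazy processing could be as simple as a generator over the response lines. This sketch assumes the chunked endpoint emits one JSON value per line, which should be checked against the REST API; `parse_chunked_lines` is a hypothetical name. The lines could come from any iterable, for instance `requests.Response.iter_lines()` on a request made with `stream=True`.

```python
import json


def parse_chunked_lines(lines):
    """Lazily yield one parsed result per non-empty line."""
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        line = line.strip()
        if line:  # skip keep-alive or blank separator lines
            yield json.loads(line)


# Stand-in for a streamed response body.
fake_stream = [b'{"vertex": 1}', b"", b'{"vertex": 2}']
streamed = list(parse_chunked_lines(fake_stream))
print(streamed)
```

Because the function is a generator, each result is parsed only when the caller asks for it, so the full result set never has to be held in memory.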

Add an optional data transformer to gafferpy results

Currently, results are returned in gafferpy as either the direct JSON result from the Gaffer API or the equivalent gafferpy objects. This is okay for some use cases, but if a user wants to perform a simple, fast query, they can become bogged down in a lot of Java-related boilerplate to do with types.
This is an example output from the road-traffic example:

{'class': 'uk.gov.gchq.gaffer.data.element.Edge',
  'destination': 'M32:M4 (19)',
  'directed': True,
  'group': 'RoadUse',
  'matchedVertex': 'SOURCE',
  'properties': {'count': {'java.lang.Long': 841303},
                 'countByVehicleType': {'uk.gov.gchq.gaffer.types.FreqMap': {'AMV': 407034,
                                                                             'BUS': 1375,
                                                                             'CAR': 320028,
                                                                             'HGV': 27234,
                                                                             'HGVA3': 1277,
                                                                             'HGVA5': 5964,
                                                                             'HGVA6': 4817,
                                                                             'HGVR2': 11369,
                                                                             'HGVR3': 2004,
                                                                             'HGVR4': 1803,
                                                                             'LGV': 55312,
                                                                             'PC': 1,
                                                                             'WMV2': 3085}},
                 'endDate': {'java.util.Date': 1431543599999},
                 'startDate': {'java.util.Date': 1034319600000}},
  'source': 'M32:1'}

It would be great if this could optionally return an object that you could get results from directly, without nested types involved:

>>> print(result.source)
M32:1
>>> print(result.properties.count)
841303
>>> print(result.properties.countByVehicleType.CAR)
320028

This could be implemented as a generator that takes json input to create these results objects lazily. Dictionaries can be mapped to objects easily in Python (see munch).
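
A minimal sketch of such a generator is shown below, using the standard library's `types.SimpleNamespace` in place of munch. `to_object` and `iter_results` are hypothetical names, and the rule that a single dotted key is a Java type wrapper is an assumption that would need checking against real schemas.

```python
from types import SimpleNamespace


def to_object(value):
    """Recursively convert result JSON into attribute-style objects."""
    if isinstance(value, dict):
        if len(value) == 1:
            key, inner = next(iter(value.items()))
            # Assumption: a single dotted key such as "java.lang.Long"
            # is a Java type wrapper and can be dropped.
            if "." in key:
                return to_object(inner)
        return SimpleNamespace(**{k: to_object(v) for k, v in value.items()})
    if isinstance(value, list):
        return [to_object(item) for item in value]
    return value


def iter_results(json_results):
    # Results are converted one at a time, only when requested.
    for item in json_results:
        yield to_object(item)


edge = {
    "class": "uk.gov.gchq.gaffer.data.element.Edge",
    "group": "RoadUse",
    "source": "M32:1",
    "destination": "M32:M4 (19)",
    "properties": {
        "count": {"java.lang.Long": 841303},
        "countByVehicleType": {
            "uk.gov.gchq.gaffer.types.FreqMap": {"CAR": 320028, "BUS": 1375}
        },
    },
}
result = next(iter_results([edge]))
print(result.source, result.properties.count, result.properties.countByVehicleType.CAR)
```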

When creating this generator, users should be able to easily add transform functions to the result, like removing, renaming and applying functions to fields. A lot of this functionality (renaming fields, ignoring fields and transforming them) already comes with Gaffer though, so perhaps this could be added to the OperationChain rather than executed in Python.

Add bulk helpers to gafferpy

Describe the new feature you'd like
gafferpy should be able to take an iterable and call an operation repeatedly using chunks from that iterable

Why do you want this feature?
This would allow a large AddElements to be easily chunked into user defined sizes
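
A bulk helper along these lines might be sketched as follows. `chunked` and `execute_in_chunks` are hypothetical names; in real use `make_operation` might build an AddElements operation from each chunk and `execute` might be the connector's execute_operation method, but here they are stand-ins so the example runs on its own.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def execute_in_chunks(items, size, make_operation, execute):
    """Build and execute one operation per chunk, returning each result."""
    return [execute(make_operation(chunk)) for chunk in chunked(items, size)]


# Stand-ins: the "operation" is just the chunk, and "executing" records it.
sent = []
execute_in_chunks(range(5), 2, make_operation=list, execute=sent.append)
print(sent)
```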

Improve docs landing page

Should include more details like:

  • what is gafferpy
  • overview for each section (generated, fishbowl, connector)
  • link to main docs

A PySpark API for Gaffer

Gaffer has a Spark library with Scala and Java APIs for accessing data using Spark, generating RDDs and Spark DataFrames from Gaffer graphs.

Gaffer also has a Python shell with implementations of standard Gaffer operations that can be executed on the graph using Gaffer's REST service.

Extending the Python API to support Spark operations, producing RDDs and DataFrames, would open Gaffer up to a lot of useful Python and Spark data science and machine learning libraries.

Release docs to gh-pages

There should be a workflow that rebuilds the docs and pushes them to gh-pages after every merge to main
