From 15e1c0eb0a33c4cf4e226ec50c44b09132107868 Mon Sep 17 00:00:00 2001 From: tb06904 <141412860+tb06904@users.noreply.github.com> Date: Tue, 16 Apr 2024 10:24:50 +0000 Subject: [PATCH] Gh-465: Update gaffer gremlin docs for 2.2 (#466) * update deployment guide for gaffer-gremlin * update the gremlin querying pages --- .../gaffer-deployment/gremlin.md | 140 +++++---- .../query/gremlin/gremlin-limits.md | 4 +- docs/user-guide/query/gremlin/gremlin.md | 270 ++++++++++++++---- 3 files changed, 288 insertions(+), 126 deletions(-) diff --git a/docs/administration-guide/gaffer-deployment/gremlin.md b/docs/administration-guide/gaffer-deployment/gremlin.md index ebbdbfb479..e5e22cd355 100644 --- a/docs/administration-guide/gaffer-deployment/gremlin.md +++ b/docs/administration-guide/gaffer-deployment/gremlin.md @@ -28,10 +28,6 @@ traversals are spawned. To do this we recommend utilising the provided which can be configured to use the Gaffer Tinkerpop implementation so that a endpoint is available for Gremlin queries. -!!! note - For release of Gaffer v2.2.0 a ready made container image will be provided - to run a preconfigured Gremlin server for Gaffer. - ## Connecting to Any Existing Gaffer Graph The simplest way to connect Gremlin to an existing Gaffer instance where you may @@ -55,41 +51,60 @@ flowchart LR --> D(Existing Gaffer Instance) ``` -To establish this connection three configuration files are required: +To establish this connection you can make use of the existing `gaffer-gremlin` +OCI image which is an extension of the existing `gremlin-server` image. This +provides the Tinkerpop library which allows users to connect Gaffer graphs as +well as some pre installed configuration to get up and running quickly. + +```bash +docker pull gchq/gaffer-gremlin:latest +``` + +!!! note + You will likely need to configure the default `gaffer-gremlin` image to your + environment, please continue reading to learn more. 
+ +### Configuring the `gaffer-gremlin` Image + +To use the image you will need to provide two configuration files that are specific +to your environment: -- `store.properties` - The Gaffer store configuration for the Proxy Store. -- `gafferpop.properties` - The configuration for the Gaffer Tinkerpop library. -- `gremlin-server.yaml` - Configures the Gremlin server. +- `store.properties` - Gaffer store configuration. +- `gafferpop.properties` - Configuration for the Gaffer Tinkerpop library (Gafferpop). -Starting with the Proxy Store, this is identical to running a Proxy Store -normally and involves simply creating a Gaffer `store.properties` file to use. -An example `store.properties` file is given below that will connect to a graph's -REST API running at `https://localhost:8080/rest/latest`: +Once these files are configured you can use bind mounts to make them available when running the image: + +```bash +docker run \ + --name gaffer-gremlin \ + --publish 8182:8182 \ + --volume store.properties:conf/gaffer/store.properties \ + --volume gafferpop.properties:conf/gafferpop/gafferpop.properties \ + gchq/gaffer-gremlin:latest gremlin-server.yaml +``` + +#### Configuring the Proxy Store + +Starting with the Proxy Store, this is identical to running a normal [Proxy Store](../gaffer-stores/proxy-store.md) +and involves simply creating a Gaffer `store.properties` file to use. An example +`store.properties` file is given below that will connect to a graph's REST API +running at `https://localhost:8080/rest`: ```properties gaffer.store.class=uk.gov.gchq.gaffer.proxystore.ProxyStore # These should be configured to an existing graph deployment gaffer.host=localhost gaffer.port=8080 -gaffer.context-root=/rest/latest +gaffer.context-root=/rest ``` -### Configuring the Gremlin Server - -Next we need to configure the Gremlin server, for this the easiest way is to use -the provided container image from Tinkerpop. 
To start with simply pull the image -then we will bind mount in everything we need to make it run with the configured -Proxy Store and Gaffer Tinkerpop implementation. +#### Configuring the Gafferpop Library -```bash -docker pull tinkerpop/gremlin-server:latest -``` - -The first file to create is the `gafferpop.properties`, this is the configuration -for the Gaffer implementation of Tinkerpop (a.k.a Gafferpop). Most of the set up -here is for the construction of the Gafferpop Graph instance which we want to -make run with the `store.properties` we've already configured. An example -`gaffer.properties` would look like the following: +The `gafferpop.properties` file is the configuration for the Gaffer +implementation of Tinkerpop (a.k.a Gafferpop). Most of the set up here is for +the construction of the Gafferpop Graph instance which we want to make run with +the `store.properties` we've already configured. An example `gafferpop.properties` +would look like the following: ```properties # The Tinkerpop graph class we should use @@ -99,13 +114,29 @@ gaffer.storeproperties=conf/gaffer/store.properties gaffer.userId=user01 ``` -The second file needed is the configuration for the Gremlin server, this is -what ties everything together and makes sure the server runs using the Gaffer -implementation we have configured. 
A default file is provided in the +Many of these properties in the example above should be self-explanatory; a full breakdown +of the available properties is as follows: + +| Property Key | Description | +| --- | --- | +| `gremlin.graph` | The Tinkerpop graph class we should use | +| `gaffer.graphId` | The graph ID of the Tinkerpop graph | +| `gaffer.storeproperties` | The path to the store properties file | +| `gaffer.schemas` | The path to the directory containing the graph schema files | +| `gaffer.userId` | The user ID for the Tinkerpop graph | +| `gaffer.dataAuths` | The data auths for the user to specify what operations can be performed | +| `gaffer.operation.options` | Additional operation options that will be passed to the Tinkerpop graph variables in the form `key:value` | + +#### Configuring the Gremlin Server + +The underlying Gremlin server can also be configured if required. The `gaffer-gremlin` +image comes with an existing YAML configuration based on the example from the [Tinkerpop repository](https://github.com/apache/tinkerpop/blob/master/gremlin-server/conf/gremlin-server.yaml). +This file should be suitable for most use cases but a custom one can be provided +via a bind mount. 
If supplying a custom file please ensure you still include the +following sections: -From this file two places need modifying, the first is to change it to use -our graph configuration file by modifying the `graphs` section like so: +Ensure the `gafferpop.properties` file is set by modifying the `graphs` section like so: ```yaml graphs: { @@ -113,9 +144,8 @@ graphs: { } ``` -The second place is to ensure the Gaffer plugin is loaded for Gremlin which is -achieved by adding the following to the list of plugins in the `plugins` -section: +Ensure the Gaffer plugin is loaded for Gremlin which is achieved by adding the +following to the list of plugins in the `plugins` section: ```yaml uk.gov.gchq.gaffer.tinkerpop.gremlinplugin.GafferPopGremlinPlugin: {} @@ -124,41 +154,3 @@ uk.gov.gchq.gaffer.tinkerpop.gremlinplugin.GafferPopGremlinPlugin: {} !!! tip See the [Tinkerpop docs](https://tinkerpop.apache.org/docs/current/reference/#gremlin-server) for more information on Gremlin server configuration. - -### Running the Gremlin Server - -After following the previous steps you should now have three custom files -created which we will bind mount into a `gremlin-server` container. One final -step is to obtain required Gaffer JARs and add them to the container as -well. There are many different ways to do this the easiest being through maven -which can use following goal to download all dependencies from a POM: - -```bash -mvn clean dependency:copy-dependencies -``` - -Once all JARs are available they can be bind mounted to a path, such as -`ext/gafferpop/plugin/`, in the container to be added to the classpath. 
- -The bind mount location of the custom configuration files are as follows: - -- `store.properties` -> `conf/gaffer/store.properties` -- `gafferpop.properties` -> `conf/gafferpop/gafferpop.properties` -- `gremlin-server.yaml` -> `conf/gremlin-server.yaml` - -The container can then be run as normal with the above bind mounts and -specifying the `conf/gremlin-server.yaml` file as the run argument for the -container, for example: - -```bash -docker run \ - --detach \ - --name gaffer-gremlin \ - --hostname gaffer-gremlin \ - --publish 8182:8182 \ - --net host \ - --volume store.properties:conf/gaffer/store.properties \ - --volume gafferpop.properties:conf/gafferpop/gafferpop.properties \ - --volume gremlin-server.yaml:conf/gremlin-server.yaml \ - tinkerpop/gremlin-server:latest gremlin-server.yaml -``` diff --git a/docs/user-guide/query/gremlin/gremlin-limits.md b/docs/user-guide/query/gremlin/gremlin-limits.md index 3c07d28414..f16fd86046 100644 --- a/docs/user-guide/query/gremlin/gremlin-limits.md +++ b/docs/user-guide/query/gremlin/gremlin-limits.md @@ -29,7 +29,9 @@ Current known limitations or bugs: - Performance compared to standard Gaffer OperationChains is hampered due to a custom `TraversalStratergy` not being implemented. - The ID of an Edge follows a specific format that is made up of its source and - destination IDs like `[dest, source]`. + destination IDs like `[source, dest]`. To use this in a seeded query you must + format it like `g.E("[source, dest]")` or a list like + `g.E(["[source1, dest1]","[source2, dest2]"])` - Issues seen using `hasKey()` and `hasValue()` in same query. - May experience issues using the `range()` query function. - May experience issues using the `where()` query function. 
diff --git a/docs/user-guide/query/gremlin/gremlin.md b/docs/user-guide/query/gremlin/gremlin.md index 5515ad335b..60e71bf591 100644 --- a/docs/user-guide/query/gremlin/gremlin.md +++ b/docs/user-guide/query/gremlin/gremlin.md @@ -1,74 +1,243 @@ # Gremlin in Gaffer -[Gremlin](https://tinkerpop.apache.org/gremlin.html) is a query language for traversing graphs. -It is a core component of the Apache Tinkerpop library and allows users to easily express more complex graph queries. - -GafferPop is a lightweight Gaffer implementation of the [TinkerPop framework](https://tinkerpop.apache.org/), where TinkerPop methods are delegated to Gaffer graph operations. - !!! warning - GafferPop is still in development and has some [limitations](gremlin-limits.md). - The implementation is basic and its performance is unknown in comparison to using standard Gaffer `OperationChains`. - -The addition of Gremlin as query language in Gaffer allows users to represent complex graph queries in a simpler language akin to other querying languages used in traditional and NoSQL databases. - -## Gremlin Features - -### Interfacing with Gremlin - -One of the great features of Gremlin is its versatility of use. -There are a large number of supported language libraries that allow you to write queries in whichever coding language you prefer. -For example, there is a [Python Gremlin](https://pypi.org/project/gremlinpython/) interface. -This means your tooling won't have to change to write these queries which is a nice bonus for Gremlin. + GafferPop is still under development and has some [limitations](gremlin-limits.md). + The implementation may not allow some advanced features of Gremlin and its + performance is unknown in comparison to standard Gaffer `OperationChains`. -### Imperative & Declarative Queries +[Gremlin](https://tinkerpop.apache.org/gremlin.html) is a query language for +traversing graphs. 
It is a core component of the Apache Tinkerpop library and +allows users to easily express more complex graph queries. -Gremlin supports 3 main methods of querying methods that gives us an element of flexibility using the library. -Imperative queries are procedural and describe what's happening at each sequential step whereas declarative queries describe what should happen but lets the query compiler decide how and which order steps should be ran in. -Choosing the right method here allows for a lot of control over how our queries get ran or allows the controller to optimise. +GafferPop is a lightweight Gaffer implementation of the [TinkerPop framework](https://tinkerpop.apache.org/), +where TinkerPop methods are delegated to Gaffer graph operations. -### OTLP and OLAP - -Gremlin queries are flexible in that they can be evaluated in a realtime (OLTP) or batch (OLAP) format, this allows us a lot of flexibility in use, especially when querying over a multi machine or federated graph. +The addition of Gremlin as a query language in Gaffer allows users to represent +complex graph queries in a simpler language akin to other querying languages +used in traditional and NoSQL databases. It also has wide support for various +languages; for example, you can write queries in Python via the [`gremlinpython` library](https://pypi.org/project/gremlinpython/). !!! tip - Information on Gremlin as a query language, its associated libraries and more in-depth tutorials can be found in the [Apache Tinkerpop Gremlin docs](https://tinkerpop.apache.org/gremlin.html). + In-depth tutorials on Gremlin as a query language and its associated libraries + can be found in the [Apache Tinkerpop Gremlin docs](https://tinkerpop.apache.org/gremlin.html). -## Gremlin in Gaffer +## Using Gremlin Queries in Gaffer -Gremlin was added to Gaffer as a new graph query language in version 2.1. 
-There is a small demo on the [gaffer-docker repo](https://github.com/gchq/gaffer-docker/tree/develop/docker/gremlin-gaffer) using the "TinkerPop Modern" [demo graph](https://tinkerpop.apache.org/docs/current/images/tinkerpop-modern.png). +Gremlin was added to Gaffer in version 2.1 as a new graph query language and since +version 2.2 a container image is provided allowing a Gremlin layer to be added to +existing 2.x graphs. A full tutorial on setting up this image is provided in the +[administration guide](../../../administration-guide/gaffer-deployment/gremlin.md). -## Basic Queries +This guide will use the [Python API for Gremlin](https://pypi.org/project/gremlinpython/) +to demonstrate some basic capabilities and how they compare to standard Gaffer syntax. -We recommend connecting to Gremlin using a [Gremlin server](https://tinkerpop.apache.org/docs/current/reference/#connecting-gremlin-server). -For example to connect a Gremlin server using the Python API: +To start querying in Gremlin we first need a reference to what is known as the +Graph Traversal. To obtain this we need to connect to a running Gremlin server, +similar to how a connection to the Gaffer REST API is needed if using +[`gafferpy`](../../apis/python-api.md). We can do this by first importing the required +libraries like so (many of these will be needed later for queries): ```python - from gremlin_python.process.anonymous_traversal_source import traversal - - g = traversal().withRemote( - DriverRemoteConnection('ws://localhost:8182/gremlin')) +from gremlin_python.process.anonymous_traversal import traversal +from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection +from gremlin_python.process.graph_traversal import __ ``` -!!! note - The [Gremlin administration guide](../../../administration-guide/gaffer-deployment/gremlin.md) contains further details on how you can add Gremlin querying to your own Graph instance. - -Some basic queries can be carried out on the data. 
-The following example is a seeded query from ID 1 with a filter/view for only the `person` group: +We can then establish a connection to the Gremlin server and save a reference to +this (typically called `g`): -```groovy - g.V('1').hasLabel('person') +```python +# Set up a connection with the Gremlin server running on localhost +g = traversal().with_remote(DriverRemoteConnection('ws://localhost:8182/gremlin', 'g')) ``` -This example calculates the paths from ID 1 to ID 3 (with a maximum of 6 loops): - -```groovy - start = '1'; - end = '3'; - g.V(start).repeat(bothE().otherV().simplePath()).until(hasId(end).or().loops().is(6)).path() +Now that we have the traversal reference this can be used to spawn graph traversals +and get results back. + +### Basic Gremlin Queries + +Gremlin queries (similar to Gaffer queries) usually require a starting set of +entities to query from. Commonly Gremlin queries will be left without any IDs in +the starting seed which would be analogous to asking for all vertexes with +`g.V()`, or all edges with `g.E()`. However, if this type of query is used with +the Gafferpop library this will in effect call a `GetAllElements` operation which +is less than ideal. It is therefore highly recommended to always seed the query with +IDs. + +We will use the following graph to demonstrate the basic usage of Gremlin compared +to Gaffer. + +```mermaid +graph LR + A(["Person + + ID: John"]) + -- + "Created + weight: 0.2" + --> + B(["Software + + ID: 1"]) + A + -- + "Created + weight: 0.6" + --> + C(["Software + + ID: 2"]) ``` -There are more example queries using this graph to be found in the [Gremlin Getting Started](https://tinkerpop.apache.org/docs/current/tutorials/getting-started/) docs. +!!! 
example "" + Now say we wanted to get all the vertexes connected from the `Person` + vertex `"John"` via a `Created` edge (essentially all the things `"John"` + has created): + + === "Gremlin" + + Gremlin is 'lazy' so will only execute your query if you request it + to by using a `to_list()` or calling `next()` on the iterator. + + ```python + # We seed with "John" and traverse over any "Created" out edges + g.V("John").out("Created").element_map().to_list() + ``` + + Result: + + ```text + [{<T.id: 1>: '1', <T.label: 4>: 'Software'}] + [{<T.id: 1>: '2', <T.label: 4>: 'Software'}] + ``` + + === "Gaffer JSON" + + Note that in standard Gaffer you must know the group of the target vertexes you + want returned, otherwise edges will also be present in the result. + + ```JSON + { + "class": "OperationChain", + "operations": [ + { + "class": "GetAdjacentIds", + "input": [ + { + "class": "EntitySeed", + "vertex": "John" + } + ], + "view": { + "edges": { + "Created": {} + } + } + }, + { + "class": "GetElements", + "view": { + "entities": { + "Software": {} + } + } + } + ] + } + ``` + + Result: + + ```JSON + [ + { + "class": "uk.gov.gchq.gaffer.data.element.Entity", + "group": "Software", + "vertex": 1 + }, + { + "class": "uk.gov.gchq.gaffer.data.element.Entity", + "group": "Software", + "vertex": 2 + } + ] + ``` + +As you can see the Gremlin query is quite a bit easier to write and +provides the results in a handy output to be reused. Now say you wanted +to apply some filtering on the same graph, the following is an example +of how Gremlin handles this: + +!!! 
example "" + Get all the `Created` edges from vertex `"John"` that have a `"weight"` + greater than 0.4: + + === "Gremlin" + + ```python + # 'P' is imported from gremlin_python.process.traversal; add an 'element_map()' call to get more info on the edge + g.V("John").outE("Created").has("weight", P.gt(0.4)).to_list() + ``` + + Result: + + ```text + [e[['John', 2]][John-Created->2]] + ``` + + === "Gaffer JSON" + + ```JSON + { + "class": "GetElements", + "input": [ + { + "class": "EntitySeed", + "vertex": "John" + } + ], + "view": { + "edges": { + "Created": { + "preAggregationFilterFunctions": [ + { + "selection": [ + "weight" + ], + "predicate": { + "class": "IsMoreThan", + "orEqualTo": false, + "value": { + "Float": 0.4 + } + } + } + ] + } + } + } + } + ``` + + Result: + + ```JSON + [ + { + "class": "uk.gov.gchq.gaffer.data.element.Edge", + "group": "Created", + "source": "John", + "destination": "2", + "directed": true, + "matchedVertex": "SOURCE", + "properties": { + "weight": 0.6 + } + } + ] + ``` + +There are more example queries to be found in the [Gremlin Getting Started](https://tinkerpop.apache.org/docs/current/tutorials/getting-started/) docs. ## Mapping Gaffer to TinkerPop @@ -82,4 +251,3 @@ a table of how different parts are mapped is as follows: | Entity | Vertex | | Edge | Edge | | Edge ID | A list with the source and destination of the Edge e.g. `[dest, source]` | -
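The edge ID format touched by this patch can be easy to get wrong when building seeded edge queries by hand. The following is a minimal sketch, not part of the patch itself, of how the seed strings might be built in Python. It assumes the `[source, dest]` ordering given in the updated limitations page; the helper names `edge_id` and `edge_seeds` are hypothetical and do not exist in Gaffer or `gremlinpython`.

```python
# Hypothetical helpers (not part of Gaffer or gremlinpython) that build
# edge ID strings in the "[source, dest]" form described in the limitations page.

def edge_id(source, dest):
    """Return the string form of a single edge ID, e.g. "[John, 2]"."""
    return f"[{source}, {dest}]"


def edge_seeds(pairs):
    """Return a list of edge ID strings for an iterable of (source, dest) pairs."""
    return [edge_id(source, dest) for source, dest in pairs]


if __name__ == "__main__":
    print(edge_id("John", 2))                       # [John, 2]
    print(edge_seeds([("John", 1), ("John", 2)]))   # ['[John, 1]', '[John, 2]']
```

Assuming a traversal reference `g` is already connected to a Gremlin server, the resulting strings could then be passed as seeds, for example `g.E(edge_id("John", 2))` or `g.E(edge_seeds([("John", 1), ("John", 2)]))`.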