rjrudin edited this page Feb 27, 2020 · 32 revisions

Version 3.0.0 features several new tasks for leveraging the new Data Movement SDK in version 4 of the MarkLogic Java Client API.

New in version 3.3.0 - see all the tasks for Exporting data.

Important - in 3.4.1 and prior versions, tasks that perform updates, such as deleting a collection, do not default to using a consistent snapshot. Be sure to use -PconsistentSnapshot=true when using these tasks until ml-gradle defaults to using a consistent snapshot for such tasks. You may just want to add that to gradle.properties so that every job defaults to using a consistent snapshot:

// in gradle.properties
consistentSnapshot=true

The problem

The goal of these tasks is to solve a common problem - you need to perform some kind of update operation on thousands of documents or more (often tens of thousands or millions), and the operation times out in Query Console or in a simple main module.

So you either break the operation up and run it multiple times in Query Console, or you create a new CoRB job for an ad hoc operation. CoRB is an important tool on MarkLogic projects for running transforms, particularly as part of a deployment (such as migrating your data as part of a new release). But ideally you can get the job done with just the Java Client API - which you already have access to via ml-gradle in your build.gradle file - while leveraging the new features and benefits of DMSDK.

The solution

ml-gradle now provides a solution to this common problem by using DMSDK to perform all the updates, thus scaling to any number of documents, and with a simple command line interface. The tasks in 3.0.0 are focused on common update operations on document collections and permissions, along with using collections and URI patterns to select the documents to update. But there's also support for easily creating your own tasks that use DMSDK to perform any kind of update based on any set of documents.

So while you'll almost certainly keep using CoRB and Gradle together for transforms that need to be repeated often or that benefit from custom code going beyond simple queries and update operations, you can use these new DMSDK-based Gradle tasks for simple operations that don't need custom code and can be knocked out quickly via the command line and a few parameters.

Trying it out

First, make sure your build.gradle file is using the latest ml-gradle 3.x version:

plugins {
  id "com.marklogic.ml-gradle" version "3.2.1"
}

To see all the new tasks, just run the following:

gradle tasks

And look for the new "Data Movement Tasks" group.

Here are a few examples to give you an idea of how the tasks work.

Let's say we have 1 million documents in a collection named "red". We can easily add those to another collection - note how "whereCollections" defines the comma-separated set of collections of documents we want to modify (a document only needs to belong to one of the collections), and "collections" defines the comma-separated collections we want to add to each selected document:

gradle mlAddCollections -Pcollections=blue -PwhereCollections=red

Generally, properties that let you select documents to modify will be prefixed with "where".

We can also explicitly set all the collections too:

gradle mlSetCollections -Pcollections=red,blue,green -PwhereCollections=red

And then remove collections to get back to our original state:

gradle mlRemoveCollections -Pcollections=blue,green -PwhereCollections=red

We can also select documents via a URI pattern (which is processed under the hood by cts:uri-match - and 3.1.0 will have support for specifying a full query on the command line as well, though you can achieve that in 3.0.0 by writing your own task as described below):

gradle mlAddCollections -Pcollections=xmlDocuments -PwhereUriPattern=**.xml

And just like collections, we can set permissions too, using the common "role,capability,role,capability" syntax for specifying permissions:

gradle mlAddPermissions -Ppermissions=rest-reader,read,rest-writer,update -PwhereUriPattern=**.json

And as you probably expect now, you can use mlRemovePermissions and mlSetPermissions to remove and set document permissions too.
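
For example (the roles, URI pattern, and collection selected here are illustrative, following the same syntax as the commands above), removing a permission or replacing a document's permissions entirely looks like this:

gradle mlRemovePermissions -Ppermissions=rest-writer,update -PwhereUriPattern=**.json
gradle mlSetPermissions -Ppermissions=rest-reader,read,rest-writer,update -PwhereCollections=red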

And of course, sometimes you just need to delete entire collections that contain tens of millions of documents - no problem now:

gradle mlDeleteCollections -Pcollections=red,blue

And new in 3.2.0 - for deleting collections, you can make a custom task and set the collections as a task property:

task deleteStuff (type: com.marklogic.gradle.task.datamovement.DeleteCollectionsTask) {
  collections = ["red", "blue"]
}

Also, starting in 3.14.0, you can include many instances of DeleteCollectionsTask in the same build.gradle file and invoke them at the same time. Prior to 3.14.0, an error would be thrown because of how the "collections" property was being set.
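
For example, in 3.14.0 and later, two such tasks (the task names and collection names here are illustrative) can be declared side by side and invoked together:

task deleteRedAndBlue (type: com.marklogic.gradle.task.datamovement.DeleteCollectionsTask) {
  collections = ["red", "blue"]
}

task deleteGreen (type: com.marklogic.gradle.task.datamovement.DeleteCollectionsTask) {
  collections = ["green"]
}

gradle deleteRedAndBlue deleteGreen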

Configurable properties for any DMSDK task

Each of these tasks has several properties that affect how DMSDK operates.

You can configure the thread count (defaults to 8):

gradle -PthreadCount=32 ...

Or the batch size (defaults to 100):

gradle -PbatchSize=50 ...

Or the job name:

gradle -PjobName=my-job ...

Or whether a consistent snapshot is used (defaults to false):

gradle -PconsistentSnapshot=true ...

In addition, if you'd like some basic logging generated for each batch of URIs that's processed, just do the following:

gradle -PlogBatches=true ...
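
These properties can all be combined on a single task invocation; for example (the collection names are illustrative):

gradle mlAddCollections -Pcollections=blue -PwhereCollections=red -PthreadCount=32 -PbatchSize=50 -PconsistentSnapshot=true -PlogBatches=true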

Specifying a database

By default, each Data Movement task will connect to MarkLogic using the value of mlRestPort (on a Data Hub Framework project, this will instead be the value of mlFinalPort). This means that the database the task reads from and writes to will be the content database associated with the app server listening on that port.

To specify a different database, use the "database" property:

gradle -Pdatabase=some-other-database ...

ml-gradle will then connect to this database via the mlAppServicesPort, which defaults to 8000.

Warning - don't use this approach if you need a consistent snapshot. See this issue for more details. If you need a consistent snapshot, you either need to use a port for an app server that is pointed at the database containing data you want to manipulate, or you need to use the custom DatabaseClient approach below.

Using a custom DatabaseClient

Starting in version 3.11.0, you can provide a custom instance of DatabaseClient, though only when declaring a task:

task myTask(type: com.marklogic.gradle.task.datamovement.AddCollectionsTask) {
  client = myClient // This could be instantiated in a Gradle "ext" block
}

For example, on a Data Hub Framework project, if you want to run a task against the staging database, you could do:

client = hubConfig.newStagingClient()
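
Putting that together, a complete task against the staging database might look like this (the task name is illustrative; hubConfig is provided by the Data Hub Framework Gradle plugin):

task addStagingCollections (type: com.marklogic.gradle.task.datamovement.AddCollectionsTask) {
  client = hubConfig.newStagingClient()
}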

Writing your own tasks that use DMSDK

Starting in version 3.3.0, the easiest way to write your own task is to extend DataMovementTask and reuse one of the Job classes in marklogic-data-movement-components (note that this project was created based on code that used to exist in ml-javaclient-util - the code is the same, and package and class names did not change).

Let's say you want to add your own QueryFailureListener:

task myTask(type: com.marklogic.gradle.task.datamovement.DataMovementTask) {
  doLast {
    def client = newClient()
    def job = new com.marklogic.client.ext.datamovement.job.SimpleQueryBatcherJob()
    job.addQueryFailureListener(...) // add whatever implementation you want here
    job.addUrisReadyListener(...) // add whatever implementation you want here
    // configure anything else you want on the job
    try {
      job.run(client)
    } finally {
      client.release()
    }
  }
}

Starting in version 3.0.0, you can use some of the "apply" methods in DataMovementTask, which reuse QueryBatcherTemplate, though these are deprecated now in favor of the Job API as shown above.

task myTask(type: com.marklogic.gradle.task.datamovement.DataMovementTask) {
  doLast {
    applyOnCollections(new com.example.MyListener(), "some-collection")
  }
}

With a little more code, you can access any method on the QueryBatcherTemplate class that's used under the hood - such as using a method that lets you run any query that returns URIs:

task myTask(type: com.marklogic.gradle.task.datamovement.DataMovementTask) {
  doLast {
    def client = newClient()
    try {
      newQueryBatcherTemplate(client).applyOnUrisQuery(new com.example.MyListener(), "cts:uris((), (), some query...)")
    } finally {
      client.release()
    }
  }
}

Writing custom tasks that reuse DMSDK tasks

You can also write a custom task that reuses a DMSDK task and sets the relevant properties before the task executes. For example, a custom task for exporting documents from a collection to a file would look like this:

task exportMyData(type: com.marklogic.gradle.task.datamovement.ExportToFileTask) {
  doFirst {
    project.ext.whereCollections = "my-data"
    project.ext.exportPath = "/path/to/export/to"
  }
}

Debugging DMSDK with Gradle

If you're having an issue with a DMSDK task, a good way to debug it is to create a Gradle task that uses DMSDK directly without any of the plumbing provided by the marklogic-data-movement-components project.

The task below is an example of deleting data in a collection by directly using DMSDK APIs. You can use this as a starting point for debugging the issue you have - just swap out DeleteListener for a different listener that processes each batch of URIs. Also, note that running this with Gradle's info-level logging enabled - via "-i" or "--info" - should result in some useful logging from the MarkLogic Java client.

task deleteData {
	doLast {
		def client = com.marklogic.client.DatabaseClientFactory.newClient("localhost", 8010,
						new com.marklogic.client.DatabaseClientFactory.DigestAuthContext("admin", "admin"))
		def dataMovementManager = client.newDataMovementManager()
		def query = client.newQueryManager().newStructuredQueryBuilder().collection("changeme")
		def queryBatcher = dataMovementManager.newQueryBatcher(query)
			.onUrisReady(new com.marklogic.client.datamovement.DeleteListener())
			.withConsistentSnapshot()

		dataMovementManager.startJob(queryBatcher)
		queryBatcher.awaitCompletion()
		dataMovementManager.stopJob(queryBatcher)

		client.release()
	}
}