Draft: #756 - implement python workflow submissions #762

Open
wants to merge 24 commits into base: 1.9.latest

Conversation

@kdazzle commented Aug 8, 2024

WIP - Stubs out implementation for #756

This pretty much implements what a workflow job submission type would look like, though I'm sure I'm missing something. Tests haven't been added yet.

Sample

Outside of the new submission type, models are the same. Here is what one could look like:

# my_model.py
import pyspark.sql.types as T
import pyspark.sql.functions as F


def model(dbt, session):
    dbt.config(
        materialized='incremental',
        submission_method='workflow_job'
    )

    output_schema = T.StructType([
        T.StructField("id", T.StringType(), True),
        T.StructField("odometer_meters", T.DoubleType(), True),
        T.StructField("timestamp", T.TimestampType(), True),
    ])
    # Placeholder: return an empty DataFrame with the model's output schema
    return session.createDataFrame(data=session.sparkContext.emptyRDD(), schema=output_schema)

The config for a model could look like (forgive my jsonification...yaml data structures still freak me out):

models:
  - name: my_model
    workflow_job_config:
      email_notifications: {
        on_failure: ["reynoldxin@databricks.com"]
      }
      max_retries: 2
      timeout_seconds: 18000
      existing_job_id: 12341234  # added: optional
      additional_task_settings: {  # added: optional
        "task_key": "my_dbt_task"
      }
      post_hook_tasks: [{  # added: optional
        "depends_on": [{ "task_key": "my_dbt_task" }],
        "task_key": "OPTIMIZE_AND_VACUUM",
        "notebook_task": {
          "notebook_path": "/my_notebook_path",
          "source": "WORKSPACE",
        },
      }]
      grants:  # added: optional
        view: [
          {"group_name": "marketing-team"},
        ]
        run: [
          {"user_name": "alighodsi@databricks.com"}
        ]
        manage: []
    job_cluster_config:
      spark_version: "12.2.x-scala2.12"
      node_type_id: "rd-fleet.8xlarge"
      runtime_engine: "{{ var('job_cluster_defaults.runtime_engine') }}"
      data_security_mode: "{{ var('job_cluster_defaults.data_security_mode') }}"
      autoscale: {
        "min_workers": 1,
        "max_workers": 4
      }

Explanation

For the dbt configs I added (on top of the Databricks API attributes), I tried to strike a balance between the dbt convention of requiring minimal configuration and exposing the full flexibility of the Databricks API. Attribute names split the difference between the Databricks API and dbt conventions. Happy to change the approach on any of this. A rough sketch of how these settings could be merged into a job spec follows the list below.

  • Added existing_job_id in case users want to reuse an existing workflow. If no name is provided in the config, the workflow gets renamed to the default job name (currently f"{self.database}-{self.schema}-{self.identifier}__dbt")
  • Job names must be unique unless existing_job_id is also provided
  • The task key for the model-run task is hardcoded as task_a, but is configurable via additional_task_settings
  • Allows for "post_hook" tasks
    • A different cluster can be specified using Databricks' new_cluster or existing_cluster_id; leaving these blank reuses the model's cluster config
    • post_hook might be a misnomer, because you could technically set the dbt model task to depend on one of these tasks, making it effectively a pre-hook
  • grants - allows permissions to be set on the workflow so that additional users/teams can run the job ad hoc if needed (for initial runs, backfills, etc.). The owner is carried forward (partly because I wasn't sure there was a good way to determine whether the current user is a user or a service principal), and the format follows the Databricks API, where each grantee is specified as a user, group, or service principal.
  • additional_task_settings adds to or overrides the default dbt model task
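
To make the merging behavior concrete, here is a rough sketch (not the exact PR code) of how workflow_job_config could be folded into a Databricks Jobs API 2.1 job spec; the helper name build_job_spec and the notebook path are illustrative:

# Illustrative sketch only, not the PR's implementation.
DEFAULT_TASK_KEY = "task_a"  # overridable via additional_task_settings


def build_job_spec(default_job_name: str, cluster_spec: dict, config: dict) -> dict:
    """Merge the model's workflow_job_config with the default dbt model task."""
    workflow_config = dict(config.get("workflow_job_config", {}))

    # Settings that are handled outside of the job spec itself
    additional_task_settings = workflow_config.pop("additional_task_settings", {})
    post_hook_tasks = workflow_config.pop("post_hook_tasks", [])
    workflow_config.pop("grants", None)           # applied separately via the permissions API
    workflow_config.pop("existing_job_id", None)  # used to find the job, not part of its settings

    # Default dbt model task; additional_task_settings adds to or overrides it
    model_task = {
        "task_key": DEFAULT_TASK_KEY,
        "new_cluster": cluster_spec,
        "notebook_task": {"notebook_path": "/path/to/uploaded/notebook", "source": "WORKSPACE"},
    }
    model_task.update(additional_task_settings)

    job_spec = {
        "name": workflow_config.pop("name", default_job_name),
        "tasks": [model_task] + post_hook_tasks,
    }
    # Everything else (email_notifications, max_retries, timeout_seconds, ...) passes through
    job_spec.update(workflow_config)
    return job_spec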

Todo:

  • Reuse all_purpose_cluster attribute, similar to job_cluster_config?
  • Can I use a serverless job cluster? (by not defining any cluster)
  • Fix the run tracker
  • What happens if the workflow is already running?
    • I'd like the new dbt job run to start tracking the current Databricks workflow run, rather than failing
  • Log when workflow permissions are being changed? (Kind of mimicking Terraform apply logs, which have been helpful in the past when table permissions were unexpectedly broadened)

Checklist

  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.

@kdazzle kdazzle changed the title #756 - stub out implementation for python workflow submissions Draft: #756 - stub out implementation for python workflow submissions Aug 8, 2024
Kyle Valade added 2 commits August 12, 2024 14:21
Signed-off-by: Kyle Valade <kylevalade@rivian.com>
@kdazzle kdazzle changed the title Draft: #756 - stub out implementation for python workflow submissions Draft: #756 - implement for python workflow submissions Aug 14, 2024
@kdazzle kdazzle changed the title Draft: #756 - implement for python workflow submissions Draft: #756 - implement python workflow submissions Aug 14, 2024
@benc-db (Collaborator) commented Sep 27, 2024

@kdazzle can you rebase/target your PR against 1.9.latest? I have a couple of things that I need to wrap up, but I'm planning to take some version of this into the 1.9 release.

@kdazzle kdazzle changed the base branch from main to 1.9.latest September 27, 2024 21:12
@benc-db (Collaborator) commented Oct 1, 2024

Looks like some of the syntax you're using doesn't work with Python 3.8, based on the unit test failures.

def __init__(self, session: Session, host: str):
    super().__init__(session, host, "/api/2.0/permissions/jobs")

def put(self, job_id, access_control_list):
benc-db (Collaborator):

Should this be part of the operations that dbt handles? If so, should there be an equivalent for the job runs approach?

kdazzle (Author):

I'd argue it should be: since dbt creates these workflows, it should at least allow for the possibility of managing their permissions. Otherwise these objects are kind of orphaned and some other process has to own them. dbt abdicates that responsibility for schemas, which puts everyone in an awkward position.

If so, should there be an equivalent for the job runs approach?

I'm not sure it's as necessary for job runs, since those are usually just a one-time thing. But I'm happy to add it in there if you think that makes the most sense.

benc-db (Collaborator):

Good point about schema. That's in my backlog of pain points :P

kdazzle (Author):

I'll take a look at pulling that into job runs too
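
For reference, an update through the Databricks permissions API could look roughly like this; this is an illustrative sketch rather than the PR's client code, and the view/run/manage mapping to permission levels is my reading of the intended config:

import requests

# Assumed mapping from the grants config keys to Databricks job permission levels
PERMISSION_LEVELS = {"view": "CAN_VIEW", "run": "CAN_MANAGE_RUN", "manage": "CAN_MANAGE"}


def put_job_permissions(host: str, token: str, job_id: int, grants: dict) -> None:
    """Replace the job's access control list based on the model's grants config."""
    access_control_list = []
    for grant_name, entries in grants.items():
        for entry in entries:
            # Each entry already names a user_name, group_name, or service_principal_name
            access_control_list.append({**entry, "permission_level": PERMISSION_LEVELS[grant_name]})

    # Note: the PR carries the existing owner forward; that step is omitted here
    response = requests.put(
        f"{host}/api/2.0/permissions/jobs/{job_id}",
        headers={"Authorization": f"Bearer {token}"},
        json={"access_control_list": access_control_list},
    )
    response.raise_for_status()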

    logger.info(f"Workflow creation response={response.json()}")
    return response.json()["job_id"]

def update_by_reset(self, job_id, job_spec):
benc-db (Collaborator):

What is this intended for? It's obscure enough that a doc string might be useful.

kdazzle (Author):

Yeah, good call. It's basically a PUT, but it's called a reset in the Databricks API: it makes the job reflect the config, including any changes.

kdazzle (Author):

Changed method name to update_job_settings per your other comment
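
For context, the underlying call is the Jobs API reset endpoint, which overwrites all of a job's settings; a minimal sketch (the PR routes this through its own API client, so the names here are illustrative):

import requests


def update_job_settings(host: str, token: str, job_id: int, job_spec: dict) -> None:
    """Overwrite an existing job's settings with job_spec ("reset" in the Jobs API)."""
    response = requests.post(
        f"{host}/api/2.1/jobs/reset",
        headers={"Authorization": f"Bearer {token}"},
        json={"job_id": job_id, "new_settings": job_spec},
    )
    response.raise_for_status()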


@property
def default_job_name(self) -> str:
    return f"{self.database}-{self.schema}-{self.identifier}__dbt"
benc-db (Collaborator):

Think it might be helpful to put the dbt up front. That way if you sort in the UI, all of the dbt jobs are together.

kdazzle (Author):

good point, thanks Ben

kdazzle (Author):

changed


@property
def notebook_dir(self) -> str:
    return f"/Shared/dbt_python_model/{self.database}/{self.schema}"
benc-db (Collaborator):

We should probably use the folder api logic. I've been told that using Shared is an anti-pattern and that we might get rid of it as we continue with improved governance.

kdazzle (Author):

Ah ok cool - I hadn't dug into that part of your refactorings. I'll make that change.

kdazzle (Author):

updated to use the new API classes
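
For reference, creating the notebook directory somewhere other than /Shared boils down to the workspace mkdirs endpoint; a minimal sketch, assuming a plain REST call rather than the PR's API classes:

import requests


def ensure_notebook_dir(host: str, token: str, path: str) -> None:
    """Create the target workspace directory (and any parents) if it does not exist."""
    response = requests.post(
        f"{host}/api/2.0/workspace/mkdirs",
        headers={"Authorization": f"Bearer {token}"},
        json={"path": path},
    )
    response.raise_for_status()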

job_id, is_new = self._get_or_create_job(workflow_spec)

if not is_new:
    self.api_client.workflows.update_by_reset(job_id, workflow_spec)
benc-db (Collaborator):

Even though the API is 'reset', I think update_job_settings is probably more apt.

kdazzle (Author):

Yeah, I like that

kdazzle (Author):

changed

:return: the run id
"""
active_runs = self.api_client.job_runs.list_active_runs_for_job(job_id)
if len(active_runs) > 0:
benc-db (Collaborator):

If the workflow is already running, I think we still need to schedule a run after it completes...this is one of the edge-cases that Amy is worried about. Consider that in the dbt run, the tables that this model depends on may have been updated after that job run started; in order to ensure that downstream tables get those updates, we need to run the job again.

kdazzle (Author):

Gotcha - that's a good point. That shouldn't be too hard to work in

kdazzle (Author):

Didn't realize there was an option for that in the Databricks API already - that made things easy. Done.
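
For anyone following along, one Databricks-side option that matches this behavior is job queueing, which holds a run-now request until the active run finishes; a sketch of the idea (not necessarily the exact mechanism this PR uses):

import requests


def trigger_run(host: str, token: str, job_id: int) -> int:
    """Start a new run; with queueing enabled on the job, it waits its turn if a run is active."""
    response = requests.post(
        f"{host}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {token}"},
        json={"job_id": job_id},
    )
    response.raise_for_status()
    return response.json()["run_id"]


# In the job spec, queueing can be enabled with:
# job_spec["queue"] = {"enabled": True}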

@benc-db (Collaborator) left a review:

Mostly minor. Biggest thing is that I think if there is already an active run, we need to wait and then execute another one.
