
Upstream contributions from Union.ai #5769

Open · wants to merge 25 commits into master
Conversation

@andrewwdye (Contributor) commented Sep 23, 2024

Why are the changes needed?

This change upstreams a series of contributions from Union.ai.

What changes were proposed in this pull request?

  • Fix cluster pool assignment validation (@iaroslav-ciupin)
  • Don't send inputURI for start-node (@iaroslav-ciupin)
  • Add metrics for unexpectedly deleted pods (@iaroslav-ciupin)
  • CreateDownloadLink: Head before signing (@iaroslav-ciupin)
  • Fix metrics scale division in timer (@iaroslav-ciupin)
  • Add histogram stopwatch to stow storage (@andrewwdye)
  • Override ArrayNode log links with map plugin (@hamersaw)
  • Fix k3d local setup prefix (@andrewwdye)
  • Adjust Dask LogName to "Dask Runner Logs" (@fiedlerNr9)
  • Give the Dask dashboard a separate log config (@EngHabu)
  • Log and monitor failures to validate access tokens (@katrogan)
  • Fix type assertion when an event is missed while the connection to the apiserver was severed (@EngHabu)
  • Add read replica host config and connection (@squiishyy)
  • Add a lock to memstore to make it thread-safe (@hamersaw)
  • Move storage cache settings to the correct location (@mbarrien)
  • Add config for grpc MaxMessageSizeBytes (@andrewwdye)
  • Add org to CreateUploadLocation (@katrogan)
  • Add histogram bucket options (@squiishyy)
  • Add client-go metrics (@andrewwdye)
  • Enqueue owner on launchplan terminal state (@hamersaw)
  • Add configuration for launchplan cache resync duration (@hamersaw)
  • Overlap fetching input and output data (@andrewwdye)
  • Fix async notifications tests (@andrewwdye)
  • Overlap FutureFileReader blob store writes/reads (@andrewwdye)
  • Overlap create execution blob store reads/writes (@andrewwdye)

How was this patch tested?

TBD

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

N/A

Docs link

N/A

codecov bot commented Sep 23, 2024

Codecov Report

Attention: Patch coverage is 52.08333% with 414 lines in your changes missing coverage. Please review.

Project coverage is 36.41%. Comparing base (66ff152) to head (895344d).

Files with missing lines Patch % Lines
flyteadmin/pkg/manager/mocks/resource_interface.go 0.00% 242 Missing ⚠️
...eller/pkg/controller/nodes/array/event_recorder.go 68.11% 15 Missing and 7 partials ⚠️
flytestdlib/promutils/scope.go 61.40% 20 Missing and 2 partials ⚠️
...er/pkg/controller/nodes/task/future_file_reader.go 0.00% 20 Missing ⚠️
...eplugins/go/tasks/plugins/k8s/dask/config_flags.go 42.42% 19 Missing ⚠️
flytestdlib/database/db.go 0.00% 14 Missing ⚠️
flytepropeller/pkg/controller/controller.go 0.00% 12 Missing ⚠️
flyteadmin/auth/authzserver/provider.go 47.05% 8 Missing and 1 partial ⚠️
...ytestdlib/promutils/labeled/histogram_stopwatch.go 80.00% 4 Missing and 4 partials ⚠️
flyteadmin/auth/handlers.go 0.00% 7 Missing ⚠️
... and 11 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5769      +/-   ##
==========================================
+ Coverage   36.31%   36.41%   +0.10%     
==========================================
  Files        1304     1309       +5     
  Lines      110048   110761     +713     
==========================================
+ Hits        39964    40335     +371     
- Misses      65928    66258     +330     
- Partials     4156     4168      +12     
Flag Coverage Δ
unittests-datacatalog 51.37% <ø> (ø)
unittests-flyteadmin 55.18% <35.86%> (-0.41%) ⬇️
unittests-flytecopilot 12.17% <ø> (ø)
unittests-flytectl 62.21% <ø> (ø)
unittests-flyteidl 7.16% <90.47%> (+0.03%) ⬆️
unittests-flyteplugins 53.32% <51.16%> (-0.03%) ⬇️
unittests-flytepropeller 42.07% <59.76%> (+0.14%) ⬆️
unittests-flytestdlib 55.99% <74.76%> (+0.62%) ⬆️

Flags with carried forward coverage won't be shown.

@andrewwdye andrewwdye force-pushed the union/upstream branch 19 times, most recently from 69991e1 to 0d1583e Compare September 25, 2024 20:23
@andrewwdye andrewwdye marked this pull request as ready for review September 25, 2024 21:46
andrewwdye and others added 7 commits September 30, 2024 16:25
This change modifies launch paths stemming from `launchExecutionAndPrepareModel` to overlap blob store write and read calls, which dominate end-to-end latency (as seen in the traces below).

Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
This change updates `FutureFileReader.Cache` and `FutureFileReader.RetrieveCache` to use overlapped write and reads, respectively, to reduce end-to-end latency. The read path is a common operation on each iteration of the propeller `Handle` loop for dynamic nodes.

Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
I didn't chase down why the assumptions changed and why these tests broke, but this fixes them with more explicit checks.

Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
This change updates `GetExecutionData`, `GetNodeExecutionData`, and `GetTaskExecutionData` to use overlapped reads when fetching input and output data.

Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
Currently, the launchplan cache resync duration uses the DownstreamEval duration configuration, which is also used for the sync period on the k8s client. This means that if we want to configure a more aggressive launchplan cache resync, we would also incur the overhead of syncing all k8s resources (e.g. Pods from `PodPlugin`). Adding a separate configuration value lets us tune these independently.

Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
This PR enqueues the owner workflow for evaluation when the launchplan auto refresh cache detects a launchplan in a terminal state.

Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
Register a few metric callbacks with the client-go metrics interface so that we can monitor request latencies and rate limiting of the kube client.

```
❯ curl http://localhost:10254/metrics | rg k8s_client
k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.005"} 84
k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.01"} 87
k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.025"} 89
k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.05"} 99
k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.1"} 114
k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.25"} 117
k8s_client_rate_limiter_latency_bucket{verb="GET",le="0.5"} 117
k8s_client_rate_limiter_latency_bucket{verb="GET",le="1"} 117
k8s_client_rate_limiter_latency_bucket{verb="GET",le="2.5"} 117
k8s_client_rate_limiter_latency_bucket{verb="GET",le="5"} 117
k8s_client_rate_limiter_latency_bucket{verb="GET",le="10"} 117
k8s_client_rate_limiter_latency_bucket{verb="GET",le="+Inf"} 117
k8s_client_rate_limiter_latency_sum{verb="GET"} 1.9358371670000003
k8s_client_rate_limiter_latency_count{verb="GET"} 117
k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.005"} 6
k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.01"} 6
k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.025"} 6
k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.05"} 6
k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.1"} 6
k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.25"} 6
k8s_client_rate_limiter_latency_bucket{verb="POST",le="0.5"} 6
k8s_client_rate_limiter_latency_bucket{verb="POST",le="1"} 6
k8s_client_rate_limiter_latency_bucket{verb="POST",le="2.5"} 6
k8s_client_rate_limiter_latency_bucket{verb="POST",le="5"} 6
k8s_client_rate_limiter_latency_bucket{verb="POST",le="10"} 6
k8s_client_rate_limiter_latency_bucket{verb="POST",le="+Inf"} 6
k8s_client_rate_limiter_latency_sum{verb="POST"} 1.0542e-05
k8s_client_rate_limiter_latency_count{verb="POST"} 6
k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.005"} 1
k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.01"} 1
k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.025"} 1
k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.05"} 1
k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.1"} 1
k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.25"} 1
k8s_client_rate_limiter_latency_bucket{verb="PUT",le="0.5"} 1
k8s_client_rate_limiter_latency_bucket{verb="PUT",le="1"} 1
k8s_client_rate_limiter_latency_bucket{verb="PUT",le="2.5"} 1
k8s_client_rate_limiter_latency_bucket{verb="PUT",le="5"} 1
k8s_client_rate_limiter_latency_bucket{verb="PUT",le="10"} 1
k8s_client_rate_limiter_latency_bucket{verb="PUT",le="+Inf"} 1
k8s_client_rate_limiter_latency_sum{verb="PUT"} 5e-07
k8s_client_rate_limiter_latency_count{verb="PUT"} 1
k8s_client_request_latency_bucket{verb="GET",le="0.005"} 84
k8s_client_request_latency_bucket{verb="GET",le="0.01"} 86
k8s_client_request_latency_bucket{verb="GET",le="0.025"} 89
k8s_client_request_latency_bucket{verb="GET",le="0.05"} 99
k8s_client_request_latency_bucket{verb="GET",le="0.1"} 112
k8s_client_request_latency_bucket{verb="GET",le="0.25"} 117
k8s_client_request_latency_bucket{verb="GET",le="0.5"} 117
k8s_client_request_latency_bucket{verb="GET",le="1"} 117
k8s_client_request_latency_bucket{verb="GET",le="2.5"} 117
k8s_client_request_latency_bucket{verb="GET",le="5"} 117
k8s_client_request_latency_bucket{verb="GET",le="10"} 117
k8s_client_request_latency_bucket{verb="GET",le="+Inf"} 117
k8s_client_request_latency_sum{verb="GET"} 2.1254330859999997
k8s_client_request_latency_count{verb="GET"} 117
k8s_client_request_latency_bucket{verb="POST",le="0.005"} 5
k8s_client_request_latency_bucket{verb="POST",le="0.01"} 5
k8s_client_request_latency_bucket{verb="POST",le="0.025"} 5
k8s_client_request_latency_bucket{verb="POST",le="0.05"} 6
k8s_client_request_latency_bucket{verb="POST",le="0.1"} 6
k8s_client_request_latency_bucket{verb="POST",le="0.25"} 6
k8s_client_request_latency_bucket{verb="POST",le="0.5"} 6
k8s_client_request_latency_bucket{verb="POST",le="1"} 6
k8s_client_request_latency_bucket{verb="POST",le="2.5"} 6
k8s_client_request_latency_bucket{verb="POST",le="5"} 6
k8s_client_request_latency_bucket{verb="POST",le="10"} 6
k8s_client_request_latency_bucket{verb="POST",le="+Inf"} 6
k8s_client_request_latency_sum{verb="POST"} 0.048558582
k8s_client_request_latency_count{verb="POST"} 6
k8s_client_request_latency_bucket{verb="PUT",le="0.005"} 1
k8s_client_request_latency_bucket{verb="PUT",le="0.01"} 1
k8s_client_request_latency_bucket{verb="PUT",le="0.025"} 1
k8s_client_request_latency_bucket{verb="PUT",le="0.05"} 1
k8s_client_request_latency_bucket{verb="PUT",le="0.1"} 1
k8s_client_request_latency_bucket{verb="PUT",le="0.25"} 1
k8s_client_request_latency_bucket{verb="PUT",le="0.5"} 1
k8s_client_request_latency_bucket{verb="PUT",le="1"} 1
k8s_client_request_latency_bucket{verb="PUT",le="2.5"} 1
k8s_client_request_latency_bucket{verb="PUT",le="5"} 1
k8s_client_request_latency_bucket{verb="PUT",le="10"} 1
k8s_client_request_latency_bucket{verb="PUT",le="+Inf"} 1
k8s_client_request_latency_sum{verb="PUT"} 0.002381375
k8s_client_request_latency_count{verb="PUT"} 1
k8s_client_request_total{code="200",method="GET"} 120
k8s_client_request_total{code="200",method="PUT"} 1
k8s_client_request_total{code="409",method="POST"} 6
```

Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
squiishyy and others added 18 commits September 30, 2024 16:25
Add an abstraction for passing custom-defined buckets to histogram vectors.

Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
We need to make the grpc max recv message size in propeller's admin client configurable to match the server-side configuration we support in admin.

Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
- Add a new field to the postgres db config struct, `readReplicaHost`.
- Add a new function in the `database` package that establishes a connection to a db without creating it if it doesn't exist.

Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
Fix type assertion when an event is missed while the connection to the apiserver was severed.

Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
I was trying to use `setup_local_dev.sh`, and it wasn't working out of the box. It looks like it expects a `k3d-` prefix for the kubecontext.

Ran `setup_local_dev.sh`

Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
This PR adds a configuration option to override ArrayNode log links with those defined in the map plugin. The map plugin contains its own configuration for log links, which may differ from those defined on the PodPlugin. Because ArrayNode executes subNodes as regular tasks (i.e. using the PodPlugin), it otherwise uses the default PodPlugin log templates.

Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
This change
* Adds a new `HistogramStopWatch` to promutils. This [allows for aggregating latencies](https://prometheus.io/docs/practices/histograms/#quantiles) across pods and computing quantiles at query time
* Adds `HistogramStopWatch` latency metrics for stow so that we can reason about storage latencies in aggregate. Existing latency metrics remain.

- [x] Added unittests

Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
* Fix metrics scale division in timer

Signed-off-by: Iaroslav Ciupin <iaroslav@union.ai>

Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
* Count when we see unexpectedly terminated pods

Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
* Send an empty `inputUri` for `start-node` in the node execution event to flyteadmin, so that GetNodeExecutionData will not attempt to download a non-existent `inputUri` as it did before this change.
* Add a DB migration to clear `input_uri` for start nodes in the existing `node_executions` table.

Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
Signed-off-by: Andrew Dye <andrewwdye@gmail.com>
8 participants