[New hubs] Dandi #3879

Merged
merged 10 commits into from
Apr 2, 2024


GeorgianaElena

Closes #3824

Also updates the eksctl jsonnet file to remove the dask nodes (I had assumed there would be daskhubs at setup) and the terraform config to use the requested bucket names (the template assumed staging and prod).


github-actions bot commented Mar 28, 2024

Merging this PR will trigger the following deployment actions.

Support and Staging deployments

| Cloud Provider | Cluster Name | Upgrade Support? | Reason for Support Redeploy | Upgrade Staging? | Reason for Staging Redeploy |
| --- | --- | --- | --- | --- | --- |
| aws | dandi | Yes | Following helm chart values files were modified: support.values.yaml | Yes | Following helm chart values files were modified: staging.values.yaml, enc-staging.secret.values.yaml, common.values.yaml |
| aws | earthscope | No | | Yes | Following helm chart values files were modified: common.values.yaml |
| aws | bican | No | | Yes | Following helm chart values files were modified: common.values.yaml, staging.values.yaml, enc-staging.secret.values.yaml |

Production deployments

| Cloud Provider | Cluster Name | Hub Name | Reason for Redeploy |
| --- | --- | --- | --- |
| aws | dandi | prod | Following helm chart values files were modified: prod.values.yaml, enc-prod.secret.values.yaml, common.values.yaml |
| aws | earthscope | prod | Following helm chart values files were modified: common.values.yaml |
| aws | bican | prod | Following helm chart values files were modified: common.values.yaml, prod.values.yaml, enc-prod.secret.values.yaml |

@GeorgianaElena

GeorgianaElena commented Mar 29, 2024

I don't know what's wrong, but I can't get the autoscaler to trigger when starting a new server. It's just stuck in:

I0329 11:55:18.773905       1 scheduling_queue.go:1070] "About to try and schedule pod" pod="staging/jupyter-georgianaelena"
I0329 11:55:18.773926       1 schedule_one.go:81] "Attempting to schedule pod" pod="staging/jupyter-georgianaelena"
I0329 11:55:18.774113       1 preemption.go:216] "Preemption will not help schedule pod on any node" pod="staging/jupyter-georgianaelena"
I0329 11:55:18.774149       1 schedule_one.go:860] "Unable to schedule pod; no fit; waiting" pod="staging/jupyter-georgianaelena" err="0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.."
I0329 11:55:18.774231       1 schedule_one.go:936] "Updating pod condition" pod="staging/jupyter-georgianaelena" conditionType=PodScheduled conditionStatus=False conditionReason="Unschedulable"

I even recreated the cluster, thinking I missed something during the first phases :|
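As a reading aid for the scheduler output above, here is a minimal Python sketch that pulls the pod name, node counts, and first rejection reason out of an "Unable to schedule pod" line. The function name and regular expressions are illustrative assumptions written against the log line quoted in this thread; they are not part of any Kubernetes tooling, and other scheduler versions may format the message differently.

```python
import re

# Assumed shape of the klog line quoted above: ... pod="<ns>/<name>" err="<message>"
ERR_FIELD = re.compile(r'pod="(?P<pod>[^"]+)"\s+err="(?P<err>[^"]*)"')
# Assumed shape of the message: "<available>/<total> nodes are available: <reason>. preemption: ..."
NODE_COUNT = re.compile(
    r"(?P<available>\d+)/(?P<total>\d+) nodes are available: (?P<reason>.*?)\. preemption:"
)


def summarize_unschedulable(line):
    """Return pod, node counts and reason from a scheduler log line, or None if it does not match."""
    field = ERR_FIELD.search(line)
    if field is None:
        return None
    err = field.group("err")
    counts = NODE_COUNT.search(err)
    if counts is None:
        # Unknown message layout: keep the raw error text as the reason.
        return {"pod": field.group("pod"), "available": None, "total": None, "reason": err}
    return {
        "pod": field.group("pod"),
        "available": int(counts.group("available")),
        "total": int(counts.group("total")),
        "reason": counts.group("reason"),
    }


# The "Unable to schedule pod" line from the log above, copied verbatim.
SAMPLE = (
    'I0329 11:55:18.774149       1 schedule_one.go:860] '
    '"Unable to schedule pod; no fit; waiting" pod="staging/jupyter-georgianaelena" '
    'err="0/1 nodes are available: 1 node(s) didn\'t match Pod\'s node affinity/selector. '
    'preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.."'
)

print(summarize_unschedulable(SAMPLE))
```

Read this way, the line says that the only existing node does not match the pod's node selector, so the pod can only start once a new matching node is added.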

@yuvipanda

@GeorgianaElena I can take a look at this today.

@yuvipanda

On AWS, we unfortunately have to manually install and manage the cluster autoscaler. I looked to see if it was crashing, with k -n support get pod, and noticed it wasn't there. We need to explicitly enable it in https://github.com/2i2c-org/infrastructure/blob/main/config/clusters/dandi/support.values.yaml, like in

.

I have opened #3881 to automate this. Otherwise, I think if you enable the cluster autoscaler in the support chart and redeploy it, that should work!

And sorry for not catching it when I was reviewing #3866
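The manual step described above (enabling the cluster autoscaler in the support chart values) could be guarded with a small check like the following Python sketch. It assumes the support values have already been parsed into a dictionary and that the autoscaler lives under a `cluster-autoscaler` key with an `enabled` flag; both the key names and the example values are assumptions for illustration, not copied from the repository, and this is not a description of what #3881 implements.

```python
def autoscaler_enabled(support_values):
    """Return True if the parsed support values enable the cluster autoscaler.

    Assumes a `cluster-autoscaler: {enabled: ...}` layout (hypothetical key names).
    """
    section = support_values.get("cluster-autoscaler") or {}
    return bool(section.get("enabled", False))


# Hypothetical parsed contents of a cluster's support.values.yaml.
with_autoscaler = {"cluster-autoscaler": {"enabled": True}}
without_autoscaler = {}  # section missing entirely, as in the situation described above

for name, values in [("with", with_autoscaler), ("without", without_autoscaler)]:
    status = "enabled" if autoscaler_enabled(values) else "NOT enabled"
    print(f"{name}: cluster autoscaler {status}")
```

A check along these lines could run before deploying a new AWS cluster so that a missing autoscaler is reported up front instead of surfacing later as unschedulable user pods.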

@GeorgianaElena

> On AWS, we unfortunately have to manually install and manage the cluster autoscaler.

Thank you @yuvipanda. I think this bit was recently removed from the docs, probably under the assumption that it was already automated. Will work on fixing #3881. Thank you 🙏🏼

@GeorgianaElena

I believe this PR is now ready to be reviewed/deployed 🚀


@yuvipanda yuvipanda left a comment


Two changes but otherwise looks good to me!

terraform/aws/projects/dandi.tfvars (review thread: outdated, resolved)
GeorgianaElena and others added 2 commits April 2, 2024 21:11
Co-authored-by: Yuvi Panda <yuvipanda@gmail.com>

@yuvipanda yuvipanda left a comment


Good to go!

@GeorgianaElena GeorgianaElena merged commit 30eaae5 into 2i2c-org:main Apr 2, 2024
11 checks passed
@GeorgianaElena GeorgianaElena deleted the dandi-hubs branch April 2, 2024 18:34

github-actions bot commented Apr 2, 2024

🎉🎉🎉🎉

Monitor the deployment of the hubs here 👉 https://github.com/2i2c-org/infrastructure/actions/runs/8527749373

@yuvipanda

Looks like I missed spotting the equivalent of #3895 here too!

Successfully merging this pull request may close these issues.

[New Hub] DANDI (MIT Brain)
2 participants