prod deployment hangs on resque-pool hotswap step #1783

Closed
jmartin-sul opened this issue Jan 13, 2022 · 11 comments · Fixed by #1801

@jmartin-sul
Member

Describe the bug
Starting with @aaron-collier running dependency updates last week, we noticed that when deploying the app to production, the deployment process hangs on the resque-pool hotswap step (where the workers are restarted). If the resque-pool hotswap cap task is run alone against prod, it hangs too.

other observations:

  • This doesn't seem to happen on stage or QA.
  • This wasn't happening until last week.
  • The app does seem to be deployed successfully, and the worker pools restarted successfully (e.g. you can see a new resque-pool master process with a start timestamp from right after deployment was started), but something must hang before the hotswap invocation returns.

User Impact

Nothing direct; this just makes deployment more of a pain for devs. The deployment must be cancelled with ^C once it's apparent the process is hung, and then the dev should probably go check that things were deployed and restarted correctly. It also makes it hard to include pres cat in the bulk deployments for weekly dependency updates, because it will hang subsequent deployments that happen to fall after it (i.e. any project with a name lexically greater than preservation_catalog).

Object(s) In This Error State

N/A

Steps to Reproduce, if Known:

Run cap deploy against prod, or run the resque:pool:hotswap task alone against prod, as shown below.
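
Concretely, the invocations look something like this (the cap prod prefix and the resque:pool:hotswap task name are taken from elsewhere in this thread):

# full deployment, which includes the hotswap step
cap prod deploy

# or just the hotswap step on its own
cap prod resque:pool:hotswap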

Expected behavior

The resque-pool hotswap doesn't hang, and returns in a timely manner on success or error.

Screenshots

n/a

Additional context

n/a

@jmartin-sul self-assigned this Jan 13, 2022
@peetucket
Member

If you Ctrl-C out of the hung resque-pool swap process and want to manually complete the deployment steps that were missed by Capistrano (after confirming that the worker counts look OK and that there are no stale workers to deal with):

cap prod deploy:cleanup
cap prod honeybadger:deploy
cap prod deploy:log_revision

@peetucket
Member

Note: deploy:log_revision seems to hang when run manually, too.

@jmartin-sul
Member Author

jmartin-sul commented Jan 19, 2022

I did some debugging on this last week, which stalled out since I was out sick part of Friday. What I've found so far:

  • bundle exec resque-pool --daemon --hot-swap --environment production runs fine from a login shell on the box.
  • But it hangs when run via cap (or even when run slightly more directly by invoking it on the VM using ssh pres@ ... from a laptop).
  • The hotswap and, in @peetucket's testing, the deploy:log_revision commands seem to be the only victims, and only on prod: none of this seems to happen on QA or stage, which both allow cap deployment of pres cat with no trouble.
  • It doesn't seem to be an issue with bundle exec commands in general: I was able to run e.g. ssh pres@preservation-catalog-prod-04.stanford.edu 'cd preservation_catalog/current/ && bundle exec echo "will this ssh invocation hang?"' from my laptop, and that command echoed the string and returned without error.
  • Congruent with the behavior we see where the pool does in fact get restarted successfully, the issue seems to occur after the resque-pool hotswap command returns: from my laptop I can run ssh pres@preservation-catalog-prod-04.stanford.edu 'cd preservation_catalog/current/ && bundle exec resque-pool --daemon --hot-swap --environment production && echo $?', an exit code of 0 is printed, and yet the invocation still hangs overall (see the sketch after the debugging output below).
  • One other thing of note: the QA/stage/prod pres cat environments all use different operating systems, and prod is the only one on Ubuntu at the moment.

Could probably use some pairing on further troubleshooting.

Fuller debugging output from my testing last week:

preservation_catalog % ssh pres@preservation-catalog-prod-04.stanford.edu 'cd preservation_catalog/current/ && bundle exec resque-pool --daemon --hot-swap --environment production && echo $?'
0
^C%                                                                                                                          
preservation_catalog %                                                                                 
preservation_catalog % ssh pres@preservation-catalog-prod-04.stanford.edu 'echo "will this ssh invocation hang?"'
will this ssh invocation hang?
preservation_catalog %
preservation_catalog % ssh pres@preservation-catalog-prod-04.stanford.edu 'cd preservation_catalog/current/ && bundle exec echo "will this ssh invocation hang?"'
will this ssh invocation hang?
preservation_catalog %                                                                                 
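
One hypothesis worth testing, not confirmed by the output above: ssh keeps the session open as long as any remote process still holds the session's stdout/stderr, so a daemonized pool master that inherits those descriptors would explain a hang that starts only after the command itself exits 0. A minimal check (a sketch, reusing the host and paths from the transcript) is to redirect the descriptors and see whether the hang goes away:

# If the hang is caused by the daemonized resque-pool master holding the ssh
# session's stdout/stderr open, redirecting them should let ssh return promptly.
ssh pres@preservation-catalog-prod-04.stanford.edu \
  'cd preservation_catalog/current/ && bundle exec resque-pool --daemon --hot-swap --environment production > /dev/null 2>&1; echo "exit: $?"'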

@jmartin-sul
Member Author

jmartin-sul commented Jan 19, 2022

Note: the pre-assembly DevOpsDocs have info on how to check whether there are stale workers in a resque-pool instance: https://github.com/sul-dlss/DevOpsDocs/blob/master/projects/pre-assembly/operations-concerns.md#stop-unresponsive-workers-workers-with-jobs-that-are-taking-too-long-and-zombie-workers

Though also note: after running into this problem a number of times, we've found that the resque-pool instance generally restarts without a problem (despite the cap command hanging) and doesn't leave stale workers that need to be stopped manually. But as a quick check that all is well, it is worth confirming that the worker pools running on the worker VMs were brought up by the deployment that was just done, and that no strays are left over from an old deployment.
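
For a quick manual version of that check, something along these lines (a sketch; the grep pattern and per-VM layout are assumptions) lists each resque-pool process with its start time so it can be compared against the time of the deploy:

# Show resque-pool master/worker processes with their start times; anything
# started before the most recent deploy is a candidate stray.
ssh pres@preservation-catalog-prod-04.stanford.edu \
  "ps -eo pid,lstart,cmd | grep '[r]esque-pool'"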

@jmartin-sul
Member Author

@mjgiarlo says: "I tried tweaking the cap task to print out more diagnostic information and it did so just fine. it's just something about that hot swap operation executed via cap? Like, it can invoke bundle exec resque-pool --help just fine."

@peetucket
Member

Has anyone tried to see what happens if we invoke the command manually on the server that cap is executing against? I would assume the issue is not necessarily Capistrano, but rather the hot swap operation command (whatever that is).

@jmartin-sul
Member Author

Has anyone tried to see what happens if we invoke the command manually on the server that cap is executing against? I would assume the issue is not necessarily Capistrano, but rather the hot swap operation command (whatever that is).

Surprisingly, that seems to work fine. I tested that and ran into no problems; noted above, I think.

@jmartin-sul
Member Author

The extra-weird part to me was that running the command over ssh indicated that it returned, and then something else caused the cap command to hang mysteriously after that; see the end of this comment: #1783 (comment)

@jmartin-sul
Member Author

@edsu ran across this resque-pool issue, which may be relevant: resque/resque-pool#107

@edsu
Contributor

edsu commented Jan 31, 2022

It's a long shot, but perhaps changing the QUIT to an INT over in dlss-capistrano might help?
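
For reference, a manual way to try the alternate signal before changing anything in dlss-capistrano might look like the following (a sketch only: the pidfile path is a guess, and the signal semantics follow resque-pool's documented handling, where QUIT is a graceful shutdown and INT/TERM a more immediate one):

# Send INT instead of QUIT to the old pool master; the pidfile location below
# is hypothetical and may differ on the prescat VMs.
OLD_MASTER_PID=$(cat preservation_catalog/shared/tmp/pids/resque-pool.pid)
kill -INT "$OLD_MASTER_PID"   # instead of: kill -QUIT "$OLD_MASTER_PID"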

@ndushay
Contributor

ndushay commented Feb 1, 2022

Hi ops - we are investigating a weird difference between the prescat-prod worker boxes and the prescat-stage worker box. When we run a Capistrano command to spin up our resque workers, the command executes just fine, but the ssh connection to the prod box hangs, while the analogous ssh connection to the stage box does NOT hang. The command itself can be run just fine on the box, too. It's just that the ssh session for prod never gets the "I'm done" signal. Might there be some firewall difference or puppet difference that could explain this?
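
One way to narrow down where the prod session gets stuck, using nothing prescat-specific (a sketch with standard OpenSSH client verbosity), is to run the same command against prod and stage with -vvv and compare where the debug output stops; on prod it will presumably stall during channel teardown rather than during authentication or command execution:

# Compare verbose client-side logs between prod and stage to see which step
# of the session never completes on prod.
ssh -vvv pres@preservation-catalog-prod-04.stanford.edu \
  'cd preservation_catalog/current/ && bundle exec resque-pool --daemon --hot-swap --environment production'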
