prod deployment hangs on resque-pool hotswap step #1783

Closed
jmartin-sul opened this issue Jan 13, 2022 · 11 comments · Fixed by #1801

@jmartin-sul
Member

Describe the bug
Starting with @aaron-collier running dependency updates last week, we noticed that when deploying the app to production, the deployment process hangs on the resque-pool hotswap step (where the workers are restarted). If the resque-pool hotswap cap task is run alone against prod, it hangs too.

other observations:

  • This doesn't seem to happen on stage or QA.
  • This wasn't happening until last week.
  • The app does seem to be deployed successfully, and the worker pools restarted successfully (e.g. you can see a new resque-pool master process with a start timestamp from right after deployment was started), but something must hang before the hotswap invocation returns.

User Impact

Nothing direct; this just makes deployment more of a pain for devs. The deployment must be cancelled with ^C once it's apparent the process is hung, and then the dev should probably go check that things were deployed and restarted correctly. It also makes it hard to include pres cat in the bulk deployments for weekly dependency updates, because it will hang subsequent deployments that happen to fall after it (i.e. any project with a name lexically greater than preservation_catalog).

Object(s) In This Error State

N/A

Steps to Reproduce, if Known:

Run cap deploy against prod, or run the resque:pool:hotswap task alone against prod, as shown below.
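
Concretely, the invocations look something like this (the cap prod prefix and the resque:pool:hotswap task name are taken from elsewhere in this thread):

# full deployment, which includes the hotswap step
cap prod deploy

# or just the hotswap step on its own
cap prod resque:pool:hotswap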

Expected behavior

The resque-pool hotswap doesn't hang, and returns in a timely manner on success or error.

Screenshots

n/a

Additional context

n/a

@jmartin-sul self-assigned this Jan 13, 2022
@peetucket
Member

If you Ctrl-C out of the hung resque-pool swap process and want to manually complete the deployment steps that were missed by Capistrano (after confirming that the worker counts look OK and that there are no stale workers to deal with):

cap prod deploy:cleanup
cap prod honeybadger:deploy
cap prod deploy:log_revision

@peetucket
Member

Note: deploy:log_revision seems to hang when run manually, too.

@jmartin-sul
Member Author

jmartin-sul commented Jan 19, 2022

I did some debugging on this last week, which stalled out since I was out sick part of Friday. What I've found so far:

  • bundle exec resque-pool --daemon --hot-swap --environment production runs fine from a login shell on the box.
  • But it hangs when run via cap (or even when run slightly more directly by invoking it on the VM using ssh pres@ ... from a laptop).
  • The hotswap and, in @peetucket's testing, the deploy:log_revision commands seem to be the only victims, and only on prod: none of this seems to happen on QA or stage, which both allow cap deployment of pres cat with no trouble.
  • It doesn't seem to be an issue with bundle exec commands in general: I was able to run e.g. ssh pres@preservation-catalog-prod-04.stanford.edu 'cd preservation_catalog/current/ && bundle exec echo "will this ssh invocation hang?"' from my laptop, and that command echoed the string and returned without error.
  • Congruent with the behavior we see where the pool does in fact get restarted successfully, the issue seems to occur after the resque-pool hotswap command returns: from my laptop I can run ssh pres@preservation-catalog-prod-04.stanford.edu 'cd preservation_catalog/current/ && bundle exec resque-pool --daemon --hot-swap --environment production && echo $?', an exit code of 0 is printed, and yet the invocation still hangs overall (see the sketch after the debugging output below).
  • One other thing of note: the QA/stage/prod pres cat environments all use different operating systems, and prod is the only one on Ubuntu at the moment.

Could probably use some pairing on further troubleshooting.

Fuller debugging output from my testing last week:

preservation_catalog % ssh pres@preservation-catalog-prod-04.stanford.edu 'cd preservation_catalog/current/ && bundle exec resque-pool --daemon --hot-swap --environment production && echo $?'
0
^C%                                                                                                                          
preservation_catalog %                                                                                 
preservation_catalog % ssh pres@preservation-catalog-prod-04.stanford.edu 'echo "will this ssh invocation hang?"'
will this ssh invocation hang?
preservation_catalog %
preservation_catalog % ssh pres@preservation-catalog-prod-04.stanford.edu 'cd preservation_catalog/current/ && bundle exec echo "will this ssh invocation hang?"'
will this ssh invocation hang?
preservation_catalog %                                                                                 
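
One hypothesis worth testing, not confirmed by the output above: ssh keeps the session open as long as any remote process still holds the session's stdout/stderr, so a daemonized pool master that inherits those descriptors would explain a hang that starts only after the command itself exits 0. A minimal check (a sketch, reusing the host and paths from the transcript) is to redirect the descriptors and see whether the hang goes away:

# If the hang is caused by the daemonized resque-pool master holding the ssh
# session's stdout/stderr open, redirecting them should let ssh return promptly.
ssh pres@preservation-catalog-prod-04.stanford.edu \
  'cd preservation_catalog/current/ && bundle exec resque-pool --daemon --hot-swap --environment production > /dev/null 2>&1; echo "exit: $?"'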

@jmartin-sul
Member Author

jmartin-sul commented Jan 19, 2022

Note: the pre-assembly DevOpsDocs have info on how to check whether there are stale workers in a resque-pool instance: https://github.com/sul-dlss/DevOpsDocs/blob/master/projects/pre-assembly/operations-concerns.md#stop-unresponsive-workers-workers-with-jobs-that-are-taking-too-long-and-zombie-workers

Though also note: after running into this problem a number of times, we've found that the resque-pool instance generally restarts without a problem (despite the cap command hanging) and doesn't leave stale workers that need to be stopped manually. But as a quick check that all is well, it is worth confirming that the worker pools running on the worker VMs were brought up by the deployment that was just done, and that no strays are left over from an old deployment.
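
For a quick manual version of that check, something along these lines (a sketch; the grep pattern and per-VM layout are assumptions) lists each resque-pool process with its start time so it can be compared against the time of the deploy:

# Show resque-pool master/worker processes with their start times; anything
# started before the most recent deploy is a candidate stray.
ssh pres@preservation-catalog-prod-04.stanford.edu \
  "ps -eo pid,lstart,cmd | grep '[r]esque-pool'"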

@jmartin-sul
Member Author

@mjgiarlo says: "I tried tweaking the cap task to print out more diagnostic information and it did so just fine. it's just something about that hot swap operation executed via cap? Like, it can invoke bundle exec resque-pool --help just fine."

@peetucket
Member

Has anyone tried to see what happens if we invoke the command manually on the server that cap is executing against? I would assume the issue is not necessarily Capistrano, but rather the hot swap operation command (whatever that is).

@jmartin-sul
Member Author

Has anyone tried to see what happens if we invoke the command manually on the server that cap is executing against? I would assume the issue is not necessarily Capistrano, but rather the hot swap operation command (whatever that is).

Surprisingly, that seems to work fine. I tested that and ran into no problems; noted above, I think.

@jmartin-sul
Member Author

The extra-weird part to me was that running the command over ssh indicated that it returned, and then something else caused the cap command to hang mysteriously after that; see the end of this comment: #1783 (comment)

@jmartin-sul
Member Author

@edsu ran across this resque-pool issue, which may be relevant: resque/resque-pool#107

@edsu
Contributor

edsu commented Jan 31, 2022

It's a long shot, but perhaps changing the QUIT to an INT over in dlss-capistrano might help?
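
For reference, a manual way to try the alternate signal before changing anything in dlss-capistrano might look like the following (a sketch only: the pidfile path is a guess, and the signal semantics follow resque-pool's documented handling, where QUIT is a graceful shutdown and INT/TERM a more immediate one):

# Send INT instead of QUIT to the old pool master; the pidfile location below
# is hypothetical and may differ on the prescat VMs.
OLD_MASTER_PID=$(cat preservation_catalog/shared/tmp/pids/resque-pool.pid)
kill -INT "$OLD_MASTER_PID"   # instead of: kill -QUIT "$OLD_MASTER_PID"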

@ndushay
Contributor

ndushay commented Feb 1, 2022

Hi ops - we are investigating a weird difference between the prescat-prod worker boxes and the prescat-stage worker box. When we run a Capistrano command to spin up our resque workers, the command executes just fine, but the ssh connection to the prod box hangs, while the analogous ssh connection to the stage box does NOT hang. The command itself can be run just fine on the box, too. It's just that the ssh session for prod never gets the "I'm done" signal. Might there be some firewall difference or puppet difference that could explain this?
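
One way to narrow down where the prod session gets stuck, using nothing prescat-specific (a sketch with standard OpenSSH client verbosity), is to run the same command against prod and stage with -vvv and compare where the debug output stops; on prod it will presumably stall during channel teardown rather than during authentication or command execution:

# Compare verbose client-side logs between prod and stage to see which step
# of the session never completes on prod.
ssh -vvv pres@preservation-catalog-prod-04.stanford.edu \
  'cd preservation_catalog/current/ && bundle exec resque-pool --daemon --hot-swap --environment production'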
