Resume at the end of the last trained epoch #547

Merged: 1 commit, Sep 18, 2024


Collaborator

@SamuelLarkin SamuelLarkin commented Sep 12, 2024

PR Goal?

Fix resuming of text-to-spec training.
The state at the end of the last epoch wasn't saved, so resuming started from the most recent saved checkpoint, which was the one written at the last validation step. This produced staggered runs, as seen in TensorBoard.
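
To make the failure mode concrete, here is a minimal sketch in plain Python (not EveryVoice or Lightning code). It assumes checkpoints were written only when validation ran, every val_check_interval training batches within an epoch, and that nothing was saved at the end of the epoch. The helper name last_validation_batch is hypothetical; the numbers are the ones from the test run described under "How to test?".

```python
def last_validation_batch(batches_per_epoch: int, val_check_interval: int) -> int:
    """Last batch index within an epoch at which validation (and a save) ran.

    Assumption: validation runs every `val_check_interval` training batches
    and no checkpoint is written at the end of the epoch.
    """
    return (batches_per_epoch // val_check_interval) * val_check_interval


batches_per_epoch = 11790 // 16  # 736 full batches of 16 examples
val_check_interval = 500         # default mentioned in this PR

resume_at = last_validation_batch(batches_per_epoch, val_check_interval)
overlap = batches_per_epoch - resume_at
print(f"resume point: batch {resume_at}; overlapping batches: {overlap}")
```

Under these assumptions the resumed run would restart at batch 500 while the first run had already logged up to batch 736, which is the kind of overlap that shows up as staggered curves.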

Fixes?

#534

Feedback sought?

merge approval

Priority?

low

Tests added?

None

How to test?

   srun everyvoice train text-to-spec \
      config/everyvoice-text-to-spec.yaml \
      --config-args training.max_epochs=1

Check the state of the loops

python -c 'import torch; import json; m = torch.load("logs_and_checkpoints/FeaturePredictionExperiment/save_on_train_epoch_end/checkpoints/last.ckpt", map_location=torch.device("cpu")); print(json.dumps(m["loops"]["fit_loop"]["epoch_loop.batch_progress"], indent=2))'

This yields something like the following. Look at the values under current. This run used 11790 training examples in batches of 16, so one epoch is 11790/16 ≈ 736 batches. If we instead see 500, the default val_check_interval, the checkpoint was not saved at the end of the epoch.

{
  "total": {
    "ready": 4421,
    "completed": 4421,
    "started": 4421,
    "processed": 4421
  },
  "current": {
    "ready": 736,
    "completed": 736,
    "started": 736,
    "processed": 736
  },
  "is_last_batch": true
}
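
The same check can be scripted. The sketch below is plain Python operating on the dictionary printed above; ended_on_epoch_boundary is a hypothetical helper, and the expected batch count assumes the final partial batch is dropped (736 rather than 737), which matches the output shown but is not confirmed by the configuration.

```python
import json


def ended_on_epoch_boundary(batch_progress: dict, batches_per_epoch: int) -> bool:
    """Return True if the checkpoint was written after the last batch of an epoch."""
    current = batch_progress["current"]
    return (
        bool(batch_progress["is_last_batch"])
        and current["completed"] == batches_per_epoch
    )


# The batch progress shown above, trimmed to the fields the check uses.
progress = json.loads("""
{
  "current": {"ready": 736, "completed": 736, "started": 736, "processed": 736},
  "is_last_batch": true
}
""")

# Assumption: 11790 examples, batch size 16, partial last batch dropped.
print(ended_on_epoch_boundary(progress, 11790 // 16))
```

In practice, the dictionary would come from m["loops"]["fit_loop"]["epoch_loop.batch_progress"], as in the one-liner above, rather than from a pasted JSON string.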

Try resuming for a second epoch.

   srun everyvoice train text-to-spec \
      config/everyvoice-text-to-spec.yaml \
      --config-args training.finetune_checkpoint="logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/last.ckpt" \
      --config-args training.max_epochs=2

Use TensorBoard to check that the second run's training curve is NOT staggered relative to the first run, i.e. that it continues where the first run ended instead of overlapping it.

tensorboard --port=2024 --logdir=logs_and_checkpoints  --bind_all

Confidence?

Good

Version change?

No

Related PRs?

None


semanticdiff-com bot commented Sep 12, 2024

Review changes with SemanticDiff.

Analyzed 1 of 1 files.

Filename Status
✔️ everyvoice/base_cli/helpers.py Analyzed

Contributor

github-actions bot commented Sep 17, 2024

CLI load time: 0:00.23
Pull Request HEAD: 7cce58cb74a59ca919153ce22f72e49f4ee64024
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package


codecov bot commented Sep 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.63%. Comparing base (3a36240) to head (7cce58c).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #547   +/-   ##
=======================================
  Coverage   74.63%   74.63%           
=======================================
  Files          46       46           
  Lines        3130     3130           
  Branches      510      510           
=======================================
  Hits         2336     2336           
  Misses        693      693           
  Partials      101      101           


@SamuelLarkin SamuelLarkin changed the title from "[WIP] dev.sl/534 resume" to "Resume at the end of the last trained epoch" Sep 17, 2024
@marctessier
Collaborator

Yes, confirming that the fine-tune checkpoint is resuming from the end of the previous run (50 steps ahead), versus how it was definitely overlapping before.

I will open a new ticket for the 50 steps ahead but will close this since it is now resolved. :-)

@marctessier marctessier reopened this Sep 18, 2024
Collaborator

@marctessier marctessier left a comment


Looks good, Samuel.

@SamuelLarkin SamuelLarkin merged commit eb460a2 into main Sep 18, 2024
8 checks passed
@SamuelLarkin SamuelLarkin deleted the dev.sl/534_resume branch September 18, 2024 23:23