
Checkpoint file is not always cleaned up on VM Action #6729

Open

OpenNebulaSupport opened this issue Sep 17, 2024 · 1 comment
Comments

OpenNebulaSupport (Collaborator) commented Sep 17, 2024

Description
When a virtual machine is suspended or stopped and later resumed, or after certain migrations, the checkpoint file may not be cleaned up properly, leading to excess disk usage on the system datastore. The extra checkpoint files are only removed when the VM is terminated.

To Reproduce

  1. Suspend a VM, then resume it.
  2. Observe /var/lib/one/datastores/SYSTEM_DS/VM_ID/checkpoint* on the hypervisor host.
  3. Suspend and resume the VM again to create another checkpoint file (see the command sketch below).

The issue may also happen after migrations between hypervisors that use the checkpoint file.
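A minimal sketch of these steps from the OpenNebula front-end (SYSTEM_DS and VM_ID are placeholders, as in the path above):

    # Suspend and resume the VM from the front-end
    onevm suspend <VM_ID>
    onevm resume <VM_ID>

    # On the hypervisor host, leftover checkpoint files accumulate under the VM directory
    ls -lh /var/lib/one/datastores/<SYSTEM_DS>/<VM_ID>/checkpoint*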

Expected behavior
The checkpoint file should be cleaned up once it is no longer required.

Details

  • Affected Component: VMM, Datastore
  • Hypervisor: KVM
  • Version: 6.10

Progress Status

  • Code committed
  • Testing - QA
  • Documentation (Release notes - resolved issues, compatibility, known issues)
OpenNebulaSupport changed the title from "Checkpoint file is not always cleaned up on VM Resume" to "Checkpoint file is not always cleaned up on VM Action" Sep 17, 2024
1gramos self-assigned this Sep 18, 2024
tinova added this to the Release 6.10.1 milestone Sep 18, 2024
gsperry2011 commented Sep 20, 2024

Hello.

I added a lot of logs to a support ticket, which I believe is what led to this issue being created (support linked me here). I did a bit more investigating in my environment and found something interesting that I thought might be of value to you.

I have plenty of space in my system datastore, /var/lib/one/datastores/0, so it seems the "No space left on device" errors I'm getting refer to the checkpoint file itself, which I thought might be a good clue. I had expected the checkpoint files to be able to grow until my mountpoint ran out of free space, but the size seems limited somehow.
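A quick way to compare the datastore's free space against the size of the retained checkpoints (VM ID 222 and datastore 0 are taken from the log below; adjust for your setup) might be:

    # Free space on the system datastore mount point
    df -h /var/lib/one/datastores/0

    # Size of any checkpoint files kept for VM 222
    ls -lh /var/lib/one/datastores/0/222/checkpoint*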

Mon Sep 9 12:40:05 2024 [Z0][VMM][I]: Successfully execute network driver operation: post.
Mon Sep 9 12:40:05 2024 [Z0][VM][I]: New LCM state is RUNNING
Wed Sep 18 12:12:29 2024 [Z0][VM][I]: New LCM state is SAVE_MIGRATE
Wed Sep 18 12:17:52 2024 [Z0][VMM][I]: Command execution fail (exit code: 1): cat << 'EOT' | /var/tmp/one/vmm/kvm/save '725e613d-5db8-42ff-b5f7-4b9f69c2601a' '/var/lib/one//datastores/0/222/checkpoint' 'opennebulahost01.domain.tld' 222 opennebulahost01.domain.tld
Wed Sep 18 12:17:52 2024 [Z0][VMM][E]: save: Command "virsh --connect qemu:///system save 725e613d-5db8-42ff-b5f7-4b9f69c2601a /var/lib/one//datastores/0/222/checkpoint" failed: error: Failed to save domain '725e613d-5db8-42ff-b5f7-4b9f69c2601a' to /var/lib/one//datastores/0/222/checkpoint error: operation failed: /usr/libexec/libvirt_iohelper: failure with /var/lib/one/datastores/0/222/checkpoint: unable to fsync /var/lib/one/datastores/0/222/checkpoint: No space left on device Could not save 725e613d-5db8-42ff-b5f7-4b9f69c2601a to /var/lib/one//datastores/0/222/checkpoint
Wed Sep 18 12:17:52 2024 [Z0][VMM][I]: ExitCode: 1
Wed Sep 18 12:17:52 2024 [Z0][VMM][I]: Failed to execute virtualization driver operation: save.
Wed Sep 18 12:17:52 2024 [Z0][VMM][E]: SAVE: ERROR: save: Command "virsh --connect qemu:///system save 725e613d-5db8-42ff-b5f7-4b9f69c2601a /var/lib/one//datastores/0/222/checkpoint" failed: error: Failed to save domain '725e613d-5db8-42ff-b5f7-4b9f69c2601a' to /var/lib/one//datastores/0/222/checkpoint error: operation failed: /usr/libexec/libvirt_iohelper: failure with /var/lib/one/datastores/0/222/checkpoint: unable to fsync /var/lib/one/datastores/0/222/checkpoint: No space left on device Could not save 725e613d-5db8-42ff-b5f7-4b9f69c2601a to /var/lib/one//datastores/0/222/checkpoint ExitCode: 1
Wed Sep 18 12:17:52 2024 [Z0][VM][I]: New LCM state is RUNNING
Wed Sep 18 12:17:52 2024 [Z0][LCM][I]: Fail to save VM state while migrating. Assuming that the VM is still RUNNING.

I originally theorized that the checkpoint would grow as large as the memory assigned to the VM, but this checkpoint file stopped roughly 3 GB short of the VM's RAM, so that theory seems incorrect.
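For what it's worth, a rough check of checkpoint size versus assigned memory (domain UUID and path taken from the log above) could look like:

    # Memory assigned to the domain, reported in KiB
    virsh --connect qemu:///system dominfo 725e613d-5db8-42ff-b5f7-4b9f69c2601a | grep -i memory

    # Size of the checkpoint file that virsh tried to write
    ls -lh /var/lib/one/datastores/0/222/checkpoint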

For the time being, if our system datastore runs out of space due to retained checkpoint files, we simply terminate and re-instantiate the VM, which removes the checkpoints.
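A sketch of that workaround, using VM 222 from the log above and a hypothetical template ID; terminating the VM removes its directory on the system datastore, including any leftover checkpoints:

    # Destroy the VM and its system datastore directory (checkpoints included)
    onevm terminate --hard 222

    # Re-create it from its template (template ID and name are placeholders)
    onetemplate instantiate <TEMPLATE_ID> --name <VM_NAME>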
