
Checkpoint file is not always cleaned up on VM Action #6729

Open

OpenNebulaSupport opened this issue Sep 17, 2024 · 1 comment
Comments

OpenNebulaSupport (Collaborator) commented Sep 17, 2024

Description
When a virtual machine is suspended or stopped and later resumed, or after certain migrations, the checkpoint file may not be cleaned up properly, leading to excess disk usage on the system datastore. The extra checkpoint files are only removed when the VM is terminated.

To Reproduce

  1. Suspend a VM, then resume it.
  2. Observe /var/lib/one/datastores/SYSTEM_DS/VM_ID/checkpoint* on the hypervisor host.
  3. Suspend and resume the VM again to create another checkpoint file (see the command sketch below).

The issue may also happen after migrations between hypervisors that use the checkpoint file.
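A minimal sketch of these steps from the OpenNebula front-end (SYSTEM_DS and VM_ID are placeholders, as in the path above):

    # Suspend and resume the VM from the front-end
    onevm suspend <VM_ID>
    onevm resume <VM_ID>

    # On the hypervisor host, leftover checkpoint files accumulate under the VM directory
    ls -lh /var/lib/one/datastores/<SYSTEM_DS>/<VM_ID>/checkpoint*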

Expected behavior
The checkpoint file should be cleaned up once it is no longer required.

Details

  • Affected Component: VMM, Datastore
  • Hypervisor: KVM
  • Version: 6.10

Progress Status

  • Code committed
  • Testing - QA
  • Documentation (Release notes - resolved issues, compatibility, known issues)
OpenNebulaSupport changed the title from "Checkpoint file is not always cleaned up on VM Resume" to "Checkpoint file is not always cleaned up on VM Action" Sep 17, 2024
1gramos self-assigned this Sep 18, 2024
tinova added this to the Release 6.10.1 milestone Sep 18, 2024
gsperry2011 commented Sep 20, 2024

Hello.

I added a lot of logs to a support ticket, which I believe is what led to this issue being created (support linked me here). I did a bit more investigating in my environment and found something interesting that I thought might be of value to you.

I have plenty of space in my system datastore, /var/lib/one/datastores/0, so it seems the "No space left on device" errors I'm getting refer to the checkpoint file itself, which I thought might be a good clue. I had expected the checkpoint files to be able to grow until my mountpoint ran out of free space, but the size seems limited somehow.
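A quick way to compare the datastore's free space against the size of the retained checkpoints (VM ID 222 and datastore 0 are taken from the log below; adjust for your setup) might be:

    # Free space on the system datastore mount point
    df -h /var/lib/one/datastores/0

    # Size of any checkpoint files kept for VM 222
    ls -lh /var/lib/one/datastores/0/222/checkpoint*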

Mon Sep 9 12:40:05 2024 [Z0][VMM][I]: Successfully execute network driver operation: post.
Mon Sep 9 12:40:05 2024 [Z0][VM][I]: New LCM state is RUNNING
Wed Sep 18 12:12:29 2024 [Z0][VM][I]: New LCM state is SAVE_MIGRATE
Wed Sep 18 12:17:52 2024 [Z0][VMM][I]: Command execution fail (exit code: 1): cat << 'EOT' | /var/tmp/one/vmm/kvm/save '725e613d-5db8-42ff-b5f7-4b9f69c2601a' '/var/lib/one//datastores/0/222/checkpoint' 'opennebulahost01.domain.tld' 222 opennebulahost01.domain.tld
Wed Sep 18 12:17:52 2024 [Z0][VMM][E]: save: Command "virsh --connect qemu:///system save 725e613d-5db8-42ff-b5f7-4b9f69c2601a /var/lib/one//datastores/0/222/checkpoint" failed: error: Failed to save domain '725e613d-5db8-42ff-b5f7-4b9f69c2601a' to /var/lib/one//datastores/0/222/checkpoint error: operation failed: /usr/libexec/libvirt_iohelper: failure with /var/lib/one/datastores/0/222/checkpoint: unable to fsync /var/lib/one/datastores/0/222/checkpoint: No space left on device Could not save 725e613d-5db8-42ff-b5f7-4b9f69c2601a to /var/lib/one//datastores/0/222/checkpoint
Wed Sep 18 12:17:52 2024 [Z0][VMM][I]: ExitCode: 1
Wed Sep 18 12:17:52 2024 [Z0][VMM][I]: Failed to execute virtualization driver operation: save.
Wed Sep 18 12:17:52 2024 [Z0][VMM][E]: SAVE: ERROR: save: Command "virsh --connect qemu:///system save 725e613d-5db8-42ff-b5f7-4b9f69c2601a /var/lib/one//datastores/0/222/checkpoint" failed: error: Failed to save domain '725e613d-5db8-42ff-b5f7-4b9f69c2601a' to /var/lib/one//datastores/0/222/checkpoint error: operation failed: /usr/libexec/libvirt_iohelper: failure with /var/lib/one/datastores/0/222/checkpoint: unable to fsync /var/lib/one/datastores/0/222/checkpoint: No space left on device Could not save 725e613d-5db8-42ff-b5f7-4b9f69c2601a to /var/lib/one//datastores/0/222/checkpoint ExitCode: 1
Wed Sep 18 12:17:52 2024 [Z0][VM][I]: New LCM state is RUNNING
Wed Sep 18 12:17:52 2024 [Z0][LCM][I]: Fail to save VM state while migrating. Assuming that the VM is still RUNNING.

I originally theorized that the checkpoint would grow as large as the memory assigned to the VM, but this checkpoint file stopped roughly 3 GB short of the VM's RAM, so that theory seems incorrect.
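For what it's worth, a rough check of checkpoint size versus assigned memory (domain UUID and path taken from the log above) could look like:

    # Memory assigned to the domain, reported in KiB
    virsh --connect qemu:///system dominfo 725e613d-5db8-42ff-b5f7-4b9f69c2601a | grep -i memory

    # Size of the checkpoint file that virsh tried to write
    ls -lh /var/lib/one/datastores/0/222/checkpoint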

For the time being, if our system datastore runs out of space due to retained checkpoint files, we simply terminate and re-instantiate the VM, which removes the checkpoints.
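A sketch of that workaround, using VM 222 from the log above and a hypothetical template ID; terminating the VM removes its directory on the system datastore, including any leftover checkpoints:

    # Destroy the VM and its system datastore directory (checkpoints included)
    onevm terminate --hard 222

    # Re-create it from its template (template ID and name are placeholders)
    onetemplate instantiate <TEMPLATE_ID> --name <VM_NAME>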
