
Re: [Condor-users] How can I check whether my VMware job is really checkpointed?



Rob,

If you look at the timestamps on the two messages, you'll see that
they're just 8 seconds apart: the checkpoint event (003) is at 20:26:51
and the eviction event (004) is at 20:26:59.  It looks like a checkpoint
was taken and then the job was evicted 8 seconds later.  The "Job was not
checkpointed" line under the eviction appears to mean only that no
additional checkpoint was made at eviction time, because it was too soon
after the last one.  Job eviction normally triggers an automatic
checkpoint, but there seems to be some minimum amount of time since the
last checkpoint before Condor will do another one.
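
If you want to check the gap without reading the log by eye, here is a
minimal Python sketch.  It assumes the user log format shown in your
message (three-digit event code, job id in parentheses, MM/DD HH:MM:SS
timestamp) and that 003 and 004 are the checkpoint and eviction events, as
they appear there.  The function names and the default year are my own
placeholders, not anything provided by Condor.

```python
import re
from datetime import datetime

# Matches event header lines such as:
# 003 (007.000.000) 05/13 20:26:51 Job was checkpointed.
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$"
)

def parse_events(log_text, year=2010):
    """Return (code, job_id, timestamp, text) for each event header line.

    The log carries no year, so one is supplied by the caller (assumption).
    """
    events = []
    for line in log_text.splitlines():
        m = EVENT_RE.match(line)
        if m:
            code, cluster, proc, stamp, text = m.groups()
            when = datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")
            events.append((code, f"{int(cluster)}.{int(proc)}", when, text))
    return events

def checkpoint_gaps(events):
    """For each eviction (004), report seconds since that job's last checkpoint (003)."""
    last_ckpt = {}
    gaps = []
    for code, job, when, _text in events:
        if code == "003":
            last_ckpt[job] = when
        elif code == "004" and job in last_ckpt:
            gaps.append((job, (when - last_ckpt[job]).total_seconds()))
    return gaps

# Event lines copied from the log in this thread.
SAMPLE = """\
001 (007.000.000) 05/13 17:57:05 Job executing on host: <115.145.228.96:1034>
...
003 (007.000.000) 05/13 20:26:51 Job was checkpointed.
...
004 (007.000.000) 05/13 20:26:59 Job was evicted.
    (0) Job was not checkpointed.
"""

print(checkpoint_gaps(parse_events(SAMPLE)))  # [('7.0', 8.0)]
```

A small gap like the 8 seconds here would be consistent with the eviction
simply reusing the checkpoint taken just before it.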

- dave

On Thu, 2010-05-13 at 07:13 -0700, Rob wrote:
> Hi,
> 
> I have successfully submitted VMware jobs without checkpointing.
> Now I want to check the checkpoint feature, as it is described in the
> manual (no checkpoint server is needed).
> 
> The master is a Linux (Fedora) machine running Condor 7.4.2.
> All pool PCs run Windows XP, with Condor 7.2 and VMware 1.0.
> 
> I have changed the submission file such that it also allows
> checkpointing, like this:
> 
> Universe = vm
> Executable = any_name_you_like
> Log = vm.log
> vm_type = vmware
> vm_networking = false
> vm_checkpoint = true
> vm_memory = 64
> vmware_dir = /home/condor/VM
> vm_cdrom_files = input.dat
> vm_should_transfer_cdrom_files = YES
> vmware_should_transfer_files = YES
> Requirements = (target.Arch == "INTEL")
> Queue
> 
> 
> When I run the job, the vm.log Log file has lines like this:
> 
> 001 (007.000.000) 05/13 17:57:05 Job executing on host: <115.145.228.96:1034>
> ...
> 003 (007.000.000) 05/13 20:26:51 Job was checkpointed.
>     Usr 0 00:00:01, Sys 0 02:27:38  -  Run Remote Usage
>     Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>     68599016  -  Run Bytes Sent By Job For Checkpoint
> ...
> 004 (007.000.000) 05/13 20:26:59 Job was evicted.
>     (0) Job was not checkpointed.
>         Usr 0 00:00:01, Sys 0 02:27:38  -  Run Remote Usage
>         Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>     68599016  -  Run Bytes Sent By Job
>     79956464  -  Run Bytes Received By Job
> 
> 
> Notice that it says
>   "Job was checkpointed."
> *and*
>   "Job was not checkpointed."
> 
> Meanwhile I do find the checkpoint files in the spool:
> 
> 15MB-000001.vmdk
> isohrDAAH.iso
> nvram
> vmBvHAAB_condor-Snapshot1.vmsn
> vmbvhaab_condor.vmem
> vmBvHAAB_condor.vmsd
> vmbvhaab_condor.vmss
> vmBvHAAB_condor.vmx
> vmware-0.log
> vmware-1.log
> vmware.log
> 
> 
> I'm quite confused by all this.
> Is the VMware condor job checkpointed or not?
> 
> Also, I don't know where and how I can verify this.
> 
> And if it's not checkpointed, why is it not?
> If it is checkpointed, why can't I see more evidence of it?
> 
> Thanks for your help!
> 
> Rob.