[Condor-users] VMware job "trapped" in a deadlock! What to do?



Hi,

The problem I encounter is:

1. The job's log file tells me that a VM job has been evicted.
2. However, Condor keeps telling me that this VM job is still running.
3. This condition persists for many, many hours, probably forever!

How can I get the job out of this apparent deadlock and
tell Condor to reschedule it from the last checkpoint?

============================

Here are a few more details of a specific example, in which the job
has been trapped for over 10 hours:


The pool PCs are Windows XP with VMware 1.0.10.
Master and pool PCs all have Condor 7.4.2 installed.

I have this Condor VM submission file:

#---start
Universe = vm
Executable = vm_job_on_skku
Log = vm.log
vm_type = vmware
vm_networking = false
vm_checkpoint = true
vm_memory = 64
vmware_dir = /home/condor/VM

vm_cdrom_files = myjob.sh
vm_should_transfer_cdrom_files = YES
vmware_should_transfer_files = YES

Queue
#---end

When I submit this to the Condor pool, it starts nicely on a pool PC.
It also gets checkpointed, and the checkpoint is restarted on another
pool PC. All seems to go fine, UNTIL at a certain moment the job
gets evicted again, as follows:

$ tail -n 23 vm.log
011 (166.000.000) 08/06 11:35:22 Job was unsuspended.
...
003 (166.000.000) 08/06 12:01:51 Job was checkpointed.
    Usr 0 00:00:05, Sys 0 00:00:32  -  Run Remote Usage
    Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
    206125632  -  Run Bytes Sent By Job For Checkpoint
...
004 (166.000.000) 08/06 12:02:11 Job was evicted.
    (0) Job was not checkpointed.
        Usr 0 00:00:05, Sys 0 00:00:32  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
    1030781952  -  Run Bytes Sent By Job
    217662736  -  Run Bytes Received By Job
...
001 (166.000.000) 08/06 12:12:06 Job executing on host: <115.145.140.130:2964>
...
004 (166.000.000) 08/06 12:22:01 Job was evicted.
    (0) Job was not checkpointed.
        Usr 0 00:00:00, Sys 0 00:00:02  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
    0  -  Run Bytes Sent By Job
    217624304  -  Run Bytes Received By Job
...
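
As a cross-check, the last event recorded per job can be pulled out of the
user log with a small script. This is only a minimal sketch, assuming the
event header layout seen in the excerpt above (event code, job ID in
parentheses, date, time, message). The function name last_events and the
regular expression are my own illustration, not part of Condor.

```python
import re

# Matches event header lines such as:
#   004 (166.000.000) 08/06 12:22:01 Job was evicted.
# Layout is assumed from the log excerpt; it is not taken from Condor docs.
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$"
)


def last_events(log_text):
    """Return {job_id: (event_code, date, time, message)} for each job's last event."""
    latest = {}
    for line in log_text.splitlines():
        m = EVENT_RE.match(line)
        if m is None:
            continue  # skip indented detail lines and "..." separators
        code, cluster, proc, _subproc, date, time, msg = m.groups()
        job_id = f"{int(cluster)}.{int(proc)}"  # "166.000" becomes "166.0"
        latest[job_id] = (code, date, time, msg)  # later events overwrite earlier ones
    return latest


# Two events copied from the tail of vm.log shown above.
sample = """\
001 (166.000.000) 08/06 12:12:06 Job executing on host: <115.145.140.130:2964>
...
004 (166.000.000) 08/06 12:22:01 Job was evicted.
    (0) Job was not checkpointed.
...
"""

if __name__ == "__main__":
    for job, (code, date, time, msg) in last_events(sample).items():
        print(job, code, date, time, msg)
```

If the last event for a job is 004 (evicted) while condor_q still lists the
job in state R, the log and the queue disagree, which is the situation I
describe here.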


I would then think that the job is currently not running, but:

$  condor_q
-- Submitter: condor1.dyndns.org : <115.145.140.71:45778> : condor1.dyndns.org
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
 166.0   myname          8/4  14:08   2+06:08:47 R  0   97.7 vm_job_on_skku    

1 jobs; 0 idle, 1 running, 0 held


I am writing this at around midnight, which means this condition has
persisted for over 10 hours!

During those 10 hours several pool PCs were idle and could have
continued this job, but Condor didn't use them. Why? Because Condor
still thinks the job is running?

I don't know what to do with this job, except remove it from the pool
and restart from scratch.
Is there another way to smoothly enforce a restart from the last checkpoint?

Help is very much appreciated!

Thank you.
Rob.