
Re: [Condor-users] VMware job "trapped" in a deadlock! What to do?



On Aug 6, 2010, at 9:55 AM, Rob wrote:

> The problem I encounter is:
> 
> 1. The job's log file tells me that a VM job has been evicted.
> 2. However, Condor keeps telling me that this VM job is still running.
> 3. And this condition persists for many, many hours, probably forever!
> 
> How can I get the job out of this apparent deadlock and
> tell Condor to reschedule it from the last checkpoint?
> 
> ============================
> 
> Here are a few more details of a specific example, in which the job
> has been stuck for over 10 hours:
> 
> 
> The pool PCs are Windows XP with VMware 1.0.10.
> Master and pool PCs all have Condor 7.4.2 installed.
> 
> I have this Condor VM submission file:
> 
> #---start
> Universe = vm
> Executable = vm_job_on_skku
> Log = vm.log
> vm_type = vmware
> vm_networking = false
> vm_checkpoint = true
> vm_memory = 64
> vmware_dir = /home/condor/VM
> 
> vm_cdrom_files = myjob.sh
> vm_should_transfer_cdrom_files = YES
> vmware_should_transfer_files = YES
> 
> Queue
> #---end
> 
> When I submit this to the Condor pool, it starts nicely on a pool PC.
> It also gets checkpointed, and the checkpoint is resumed on another
> pool PC. All seems to go fine, UNTIL at a certain moment the job
> gets evicted again as follows:
> 
> $ tail -n 23 vm.log
> 011 (166.000.000) 08/06 11:35:22 Job was unsuspended.
> ...
> 003 (166.000.000) 08/06 12:01:51 Job was checkpointed.
>    Usr 0 00:00:05, Sys 0 00:00:32  -  Run Remote Usage
>    Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>    206125632  -  Run Bytes Sent By Job For Checkpoint
> ...
> 004 (166.000.000) 08/06 12:02:11 Job was evicted.
>    (0) Job was not checkpointed.
>        Usr 0 00:00:05, Sys 0 00:00:32  -  Run Remote Usage
>        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>    1030781952  -  Run Bytes Sent By Job
>    217662736  -  Run Bytes Received By Job
> ...
> 001 (166.000.000) 08/06 12:12:06 Job executing on host: <115.145.140.130:2964>
> ...
> 004 (166.000.000) 08/06 12:22:01 Job was evicted.
>    (0) Job was not checkpointed.
>        Usr 0 00:00:00, Sys 0 00:00:02  -  Run Remote Usage
>        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>    0  -  Run Bytes Sent By Job
>    217624304  -  Run Bytes Received By Job
> ...
> 
> 
> I would therefore think that the job is currently not running, but:
> 
> $  condor_q
> -- Submitter: condor1.dyndns.org : <115.145.140.71:45778> : condor1.dyndns.org
> ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
> 166.0   myname          8/4  14:08   2+06:08:47 R  0   97.7 vm_job_on_skku    
> 
> 1 jobs; 0 idle, 1 running, 0 held
> 
> 
> I write this at around midnight, which means this condition has persisted
> for over 10 hours!
> 
> During those 10 hours several pool PCs were idle and could have continued
> this job, but Condor didn't use them... why? Because Condor still thinks
> it's running?!
> 
> I don't know what to do with this job, except remove it from the pool and
> restart from scratch.
> Is there another way to smoothly enforce a restart from the last checkpoint?


The first step here is to figure out what state the job is actually in. Can you use the VMware GUI to determine whether the VM is still running on the execute machine (115.145.140.130 in this case)? The following Condor logs will have more information about what's going on:
execution machine:
    StartLog
    StarterLog
    VMGahpLog
submit machine:
    ShadowLog
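
If you're not sure where those files live, condor_config_val will print the
log directory on each machine (the Windows path shown below is just a common
default, not necessarily yours):

$ condor_config_val LOG
C:\condor\log

On a multi-slot execute machine the starter log may be split per slot
(e.g. StarterLog.slot1), so grab whichever file covers the eviction times
shown in your job log.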

If you send me these logs off-list, I can see what's going wrong.
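
In the meantime, if you just want to shake the job loose without removing it,
one workaround (not a diagnosis!) is to vacate the job and, failing that, hold
and release it. 166.0 is your job ID from the condor_q output above:

$ condor_vacate_job 166.0   # ask the claimed machine to release the job
$ condor_hold 166.0         # if it still shows as running, force it off
$ condor_release 166.0      # put it back in the queue to match again

Since you submitted with vm_checkpoint = true, a restart forced this way
should resume from the last successful checkpoint rather than from scratch.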

Thanks and regards,
Jaime Frey
UW-Madison Condor Team