[Condor-users] VMware job "trapped" in a deadlock! What to do?
- Date: Fri, 6 Aug 2010 07:55:18 -0700 (PDT)
- From: Rob <spamrefuse@xxxxxxxxx>
- Subject: [Condor-users] VMware job "trapped" in a deadlock! What to do?
Hi,
The problem I encounter is:
1. The job's log file tells me that a VM job has been evicted.
2. However, Condor keeps telling me that this VM job is still running.
3. This condition persists for many hours, probably forever!
How can I get out of this apparent deadlock and
tell Condor to reschedule the job from the last checkpoint?
============================
Here are a few more details of a specific example, in which the job
has been trapped for over 10 hours:
The pool PCs are Windows XP with VMware 1.0.10.
Master and pool PCs all have Condor 7.4.2 installed.
I have this Condor VM submission file:
#---start
Universe = vm
Executable = vm_job_on_skku
Log = vm.log
vm_type = vmware
vm_networking = false
vm_checkpoint = true
vm_memory = 64
vmware_dir = /home/condor/VM
vm_cdrom_files = myjob.sh
vm_should_transfer_cdrom_files = YES
vmware_should_transfer_files = YES
Queue
#---end
When I submit this to the Condor pool, it starts nicely on a pool PC.
It also gets checkpointed, and the checkpoint is restarted on another
pool PC. All seems to go fine, UNTIL at a certain moment the job
is evicted again, as follows:
$ tail -n 23 vm.log
011 (166.000.000) 08/06 11:35:22 Job was unsuspended.
...
003 (166.000.000) 08/06 12:01:51 Job was checkpointed.
Usr 0 00:00:05, Sys 0 00:00:32 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
206125632 - Run Bytes Sent By Job For Checkpoint
...
004 (166.000.000) 08/06 12:02:11 Job was evicted.
(0) Job was not checkpointed.
Usr 0 00:00:05, Sys 0 00:00:32 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
1030781952 - Run Bytes Sent By Job
217662736 - Run Bytes Received By Job
...
001 (166.000.000) 08/06 12:12:06 Job executing on host: <115.145.140.130:2964>
...
004 (166.000.000) 08/06 12:22:01 Job was evicted.
(0) Job was not checkpointed.
Usr 0 00:00:00, Sys 0 00:00:02 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
0 - Run Bytes Sent By Job
217624304 - Run Bytes Received By Job
...
I would then think that the job is currently not running, but:
$ condor_q
-- Submitter: condor1.dyndns.org : <115.145.140.71:45778> : condor1.dyndns.org
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
166.0 myname 8/4 14:08 2+06:08:47 R 0 97.7 vm_job_on_skku
1 jobs; 0 idle, 1 running, 0 held
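To make the mismatch between the log and condor_q easy to check, here is a minimal Python sketch that reports the last event recorded for each job in the log. It assumes every event header has the form shown in the excerpt above (a three-digit event code, the job ID in parentheses, a date, a time, and a message); the names last_events and report are my own and not part of Condor.

```python
import re

# Assumed header format, taken from the vm.log excerpt above, e.g.:
#   004 (166.000.000) 08/06 12:22:01 Job was evicted.
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$"
)


def last_events(lines):
    """Map (cluster, proc) to the (code, timestamp, message) of its last logged event."""
    last = {}
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if match:
            code, cluster, proc, _subproc, stamp, message = match.groups()
            # Later events overwrite earlier ones, so the final value is the newest.
            last[(int(cluster), int(proc))] = (code, stamp, message)
    return last


def report(path):
    """Print one line per job with the last event found in the log file at 'path'."""
    with open(path) as handle:
        events = last_events(handle)
    for (cluster, proc), (code, stamp, message) in sorted(events.items()):
        print(f"{cluster}.{proc}: event {code} at {stamp}: {message}")
```

Calling report("vm.log") on the log above should list the 12:22:01 eviction as the last event for job 166.0, even though condor_q still shows the job in state R.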
I am writing this at around midnight, which means this condition has
existed for over 10 hours!
During those 10 hours several pool PCs were idle and could have continued
this job, but Condor didn't use them. Why? Because Condor still thinks the
job is running?
I don't know what to do with this job, except remove it from the pool and
restart from scratch.
Is there another way to smoothly force a restart from the last checkpoint?
Help is very much appreciated!
Thank you.
Rob.