[Condor-users] eviction problems with 7.4.2
- Date: Thu, 27 May 2010 12:58:48 +0100
- From: "Smith, Ian" <I.C.Smith@xxxxxxxxxxxxxxx>
- Subject: [Condor-users] eviction problems with 7.4.2
I've recently been taking a look at checkpointing under the vanilla
universe*. I had everything working fine using Condor 7.0.2 on
the execute hosts (running Win XP SP 3) but when I moved
to 7.4.2 there are problems when jobs get evicted.
When this happens because of mouse/keyboard activity I see
the machine go through the usual Claimed/Busy ->
Preempting/Vacating -> Preempting/Killing -> Owner
states but the job carries on running according to condor_q
(and the log file).
If I look on the execute host, then the
execute directory has been wiped but condor_q insists that
the job is still running. Eventually when the job starts again
I see a "job disconnected" error in the job's log file. As
well as this, none of the output files get returned to the $(SPOOL) directory.
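
For anyone wanting to check their own job logs for the same pattern, here is a minimal sketch that scans a user log for the relevant events. It assumes the usual user-log layout, where each event starts with a three-digit event number followed by the job id in parentheses and a timestamp. The function names are made up for illustration, and only the handful of event numbers that matter here are listed.

```python
import re

# Event numbers found at the start of each entry in a Condor user log.
# Only the ones relevant to this problem are included.
EVENTS = {
    "001": "executing",
    "004": "evicted",
    "005": "terminated",
    "022": "disconnected",
    "023": "reconnected",
    "024": "reconnect failed",
}

# Assumed header layout: "NNN (cluster.proc.subproc) MM/DD HH:MM:SS text"
HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) (\S+ \S+)")

def scan_events(lines):
    """Return (job_id, timestamp, event_name) for each recognised event header."""
    found = []
    for line in lines:
        m = HEADER.match(line)
        if m and m.group(1) in EVENTS:
            job_id = f"{int(m.group(2))}.{int(m.group(3))}"
            found.append((job_id, m.group(4), EVENTS[m.group(1)]))
    return found

def missing_eviction(events):
    """True if the log shows a disconnect but never an eviction."""
    names = {name for _, _, name in events}
    return "disconnected" in names and "evicted" not in names

# Usage (the file name is hypothetical):
#   with open("job.log") as f:
#       events = scan_events(f)
#   print(events, missing_eviction(events))
```

A log where `missing_eviction` returns True matches the symptom described above: the schedd-side view never records the eviction and only later notices the disconnect.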
The execute hosts have this config:
WANT_SUSPEND = FALSE
WANT_VACATE = TRUE
START = ( $(UWCS_START) && $(OfficeHours) \
|| ( $(OfficeHours) == FALSE ) && ( $(ShutdownHours) == FALSE ) )
SUSPEND = FALSE
PREEMPT = $(UWCS_SUSPEND) && $(OfficeHours)
which worked fine with 7.0.2.
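
To make the intent of the START and PREEMPT expressions easier to follow, here is a rough sketch of their boolean logic as a truth table. It assumes each macro (UWCS_START, UWCS_SUSPEND, OfficeHours, ShutdownHours) expands to a self-contained, parenthesised expression that evaluates to plain TRUE or FALSE, and it ignores the UNDEFINED case that real ClassAd evaluation can produce. OfficeHours and ShutdownHours are site-defined macros whose definitions are not shown here.

```python
from itertools import product

def start(uwcs_start, office_hours, shutdown_hours):
    # Mirrors: $(UWCS_START) && $(OfficeHours)
    #          || ( $(OfficeHours) == FALSE ) && ( $(ShutdownHours) == FALSE )
    # Python's "and" binds tighter than "or", just as && binds tighter than ||.
    return (uwcs_start and office_hours) or (
        (office_hours == False) and (shutdown_hours == False)
    )

def preempt(uwcs_suspend, office_hours):
    # Mirrors: $(UWCS_SUSPEND) && $(OfficeHours)
    return uwcs_suspend and office_hours

# Print START for every combination of the three inputs.
for u, o, s in product([True, False], repeat=3):
    print(f"UWCS_START={u!s:5} OfficeHours={o!s:5} ShutdownHours={s!s:5} "
          f"-> START={start(u, o, s)}")
```

Under these assumptions, PREEMPT can only become true during office hours, so the keyboard/mouse eviction path described above is only exercised then. Outside office hours, jobs start whenever ShutdownHours is false, regardless of UWCS_START.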
Any ideas what may be wrong? Could it be something to do with one
of the daemons not receiving a signal from condor_kbdd?
* I've written up some detailed instructions on this for the benefit of
our users. If anyone is interested I'll post the link here.
Dr Ian C. Smith,
The University of Liverpool,
Computing Services Department