
Re: [Condor-users] killed jobs hang around in idle state



Many thanks for all the replies on this. It may be
better if I explain what I'm trying to achieve, which
is an _extremely_ simple policy for job execution.

We have a pool of Condor execution hosts, all Win2K, and
a handful of submit hosts, again all Win2K. The master is
a Solaris 8 box. Users should only be allowed to start jobs
after working hours (jobs submitted earlier should wait in the
queue). If a job is still running at the start of the working day,
it should be killed and _disappear_completely_ from Condor
(users will need to restart it themselves if desired)
so as not to interfere with "owners" of the exec hosts.
Any output up to that point should be returned to the user so
that their execution time isn't wasted.
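
For illustration only, the start/evict half of this policy might be
sketched in the startd configuration roughly as below. The WorkHours
macro name and the 08:00-18:00, Monday-to-Friday window are assumptions
made up for the example, not values from this pool. ClockMin (minutes
since midnight) and ClockDay (0 = Sunday to 6 = Saturday) are the
standard startd attributes. This sketch only controls when jobs may
start and when they are evicted; it does not by itself remove an
evicted job from the queue or return its partial output.

```
# Hypothetical working-hours window (assumed for illustration):
# 08:00-18:00, Monday to Friday.
WorkHours = ( (ClockDay > 0) && (ClockDay < 6) && (ClockMin >= 480) && (ClockMin < 1080) )

# Jobs may only start outside working hours; until then they wait in the queue.
START = ( $(WorkHours) == FALSE )

# Never suspend a running job.
WANT_SUSPEND = FALSE
SUSPEND = FALSE
CONTINUE = TRUE

# Evict any job still running once working hours begin.
PREEMPT = $(WorkHours)

# Same vacate/kill settings as in the configuration quoted further down.
WANT_VACATE = TRUE
KILL = TRUE
```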

The first bit is working OK. With the second bit, jobs get killed OK
but go back to the idle/waiting state and no output is
returned to the user. Every time I read the admin guide on
the state machine I get more confused about this. Does
anyone know of a simple configuration that will allow me
to achieve this policy?

-ian.


--On 22 June 2004 13:15 +0100 "Dr Ian C. Smith" <i.c.smith@xxxxxxxxxxxxxxx> wrote:


Hi

I'm having problems trying to kill jobs at a certain
time when using Condor 6.6.5 on Win2K. When the job
is killed it continues to hang around in the idle
state indefinitely:

C:\Condor\ics>condor_q -analyze
-- Submitter: 102153-71130c.liv.ac.uk : <138.253.102.153:1042> : 102153-71130c.liv.ac.uk
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
---
187.000:  Run analysis summary.  Of 2 machines,
      1 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match, but are serving users with a better priority in the pool
      1 match, but prefer another specific job despite its worse user-priority
      0 match, but will not currently preempt their existing job
      0 are available to run your job
        Last successful match: Tue Jun 22 13:05:31 2004

1 jobs; 1 idle, 0 running, 0 held

The config file looks like:

WANT_SUSPEND = FALSE
WANT_VACATE = TRUE
START = TRUE
SUSPEND = ClockMin > 660
CONTINUE = FALSE
PREEMPT = TRUE
KILL = TRUE

Something seems to be wrong judging by SchedLog:

6/22 13:05:57 DaemonCore: Command received via TCP from host <138.253.102.153:1365>
6/22 13:05:57 DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
6/22 13:05:57 Got VACATE_SERVICE from <138.253.102.153:1365>
6/22 13:05:57 Sent RELEASE_CLAIM to startd on <138.253.102.153:1041>
6/22 13:05:57 Match record (<138.253.102.153:1041>, 187, 0) deleted
6/22 13:05:57 DaemonCore: Command received via UDP from host <138.253.102.153:1367>
6/22 13:05:57 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
6/22 13:05:57 Scheduler::Relinquish - mrec is NULL, can't relinquish
6/22 13:05:57 Null parameter --- match not deleted
6/22 13:06:04 DaemonCore: Command received via UDP from host <138.253.102.153:1371>

Any ideas?

thanks in advance

-ian.
_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
http://lists.cs.wisc.edu/mailman/listinfo/condor-users