
Re: [Condor-users] Weird problem (with condor-6.8.0)



I’ve seen this before with single (not cluster) jobs. If I submit just one, it can stay idle for ages, but if
I repeatedly submit the same job, the negotiator seems to get the message and runs them (this is on an empty pool, by the way).
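
(If each submit is just sending another RESCHEDULE to the schedd, then running condor_reschedule by hand ought to poke the negotiator the same way, without queuing an extra copy of the job. That’s a guess on my part, but it matches the RESCHEDULE lines in the SchedLog quoted below:)

condor_reschedule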

Could anyone from U-W let us know what the “out of servers” message means? I’ve seen it several times.

 

regards,

 

-ian,

 

PS: I know people have asked for this before (and I realise there are good reasons why it’s difficult), but could the -analyze results provide more information about the scheduling than “for unknown reasons”?

 


From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Rick Lan
Sent: 09 September 2006 21:08
To: Condor-Users Mail List
Subject: Re: [Condor-users] Weird problem (with condor-6.8.0)

 

We have seen similar log messages. Our setup has preemption disabled (the configuration recommended by section 3.6.10.5). However, turning on more debug output showed, I believe, that the negotiator was not dividing up the leftover "resource pie". So the Condor guys told us to use

 

NEGOTIATOR_CONSIDER_PREEMPTION = True

 

It did help in our pool.
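
(In case the mechanics help anyone: we put the knob in the local configuration file on the central manager, where the negotiator runs, and then had the daemons re-read their configuration. The file path below is only an example; yours may differ:)

# e.g. in /opt/condor/etc/condor_config.local on the central manager
NEGOTIATOR_CONSIDER_PREEMPTION = True

# then, on the central manager:
condor_reconfig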

 

Hope this helps,

Rick

 

 


From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Devaraj Das
Sent: Saturday, September 09, 2006 12:25 PM
To: condor-users@xxxxxxxxxxx
Subject: [Condor-users] Weird problem (with condor-6.8.0)

I am trying to submit 79 jobs through a single submit file with a “Queue 79”. The jobs remain idle for a very long time (approximately 30 minutes the last time I saw this) before getting scheduled, but once one of them starts executing, the others quickly follow. Although there are more than 79 idle nodes available, these 79 jobs don’t start executing for a long time. Any idea why? For example, if I run condor_q -better-analyze for one of the jobs, I see:

 

258.078:  Run analysis summary.  Of 84 machines,

      0 are rejected by your job's requirements

      0 reject your job because of their own requirements

      1 match but are serving users with a better priority in the pool

     83 match but reject the job for unknown reasons

      0 match but will not currently preempt their existing job

      0 are available to run your job

 

Here is a snippet of the SchedLog from the submit node:

 

9/9 18:51:26 (pid:9138) Started shadow for job 257.0 on "<66.196.90.7:32774>", (shadow pid = 26550)

9/9 18:51:31 (pid:9138) Sent ad to central manager for ddas@xxxxxxxx

9/9 18:51:31 (pid:9138) Sent ad to 1 collectors for ddas@xxxxxxxx

9/9 18:51:36 (pid:9138) DaemonCore: Command received via UDP from host <66.196.90.120:55382>

9/9 18:51:36 (pid:9138) DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)

9/9 18:51:36 (pid:9138) Sent ad to central manager for ddas@xxxxxxxx

9/9 18:51:36 (pid:9138) Sent ad to 1 collectors for ddas@xxxxxxxx

9/9 18:51:36 (pid:9138) Called reschedule_negotiator()

9/9 18:51:42 (pid:9138) Activity on stashed negotiator socket

9/9 18:51:42 (pid:9138) Negotiating for owner: ddas@xxxxxxxx

9/9 18:51:42 (pid:9138) Checking consistency running and runnable jobs

9/9 18:51:42 (pid:9138) Tables are consistent

9/9 18:51:42 (pid:9138) Out of servers - 0 jobs matched, 79 jobs idle, 1 jobs rejected

9/9 18:51:42 (pid:9138) Activity on stashed negotiator socket

9/9 18:51:42 (pid:9138) Negotiating for owner: ddas@xxxxxxxx

9/9 18:51:42 (pid:9138) Checking consistency running and runnable jobs

9/9 18:51:42 (pid:9138) Tables are consistent

9/9 18:51:42 (pid:9138) Out of servers - 0 jobs matched, 79 jobs idle, 1 jobs rejected

 

By the way, these jobs belong to the Java universe (and all nodes have Java). I was able to run this many jobs successfully before (quite quickly, without this long startup pause); I have only started seeing this problem recently. I haven’t restarted the cluster yet. I would really appreciate any help in this regard…
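
(For reference, the submit file follows the standard Java universe pattern; the names below are placeholders rather than my actual files:)

universe   = java
executable = MyJob.class
arguments  = MyJob
output     = myjob.$(Process).out
error      = myjob.$(Process).err
log        = myjob.log
queue 79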

 

Thanks,

Devaraj.