
[Condor-users] Condor overload???



Hi

I submitted 744,400 jobs to our Condor cluster. Was I a little overambitious? Is there a recommended limit?

146 jobs ran, then everything stopped, with lots of errors in the logs. Specifically, at the time the jobs stopped running there was a timeout error in the Shadow log (see below).

Anyway, what I need is a way to clear the job queue so I can get back to work on smaller batches. However, condor_q seems to hang, so I can't actually see what's in the queue, and 'condor_rm job#' also seems to hang. I've tried restarting Condor, but of course the queue persists. Is there a backdoor method of clearing it?

Many thanks
John


[root@galaxy ~]# grep '7/13 19' /home/condor/condor/local.galaxy/log/ShadowLog.old
7/13 19:37:43 Initializing a VANILLA shadow
7/13 19:37:43 (85272.107) (5781): Request to run on <192.168.0.11:45689> was ACCEPTED
7/13 19:37:43 (85272.94) (5655): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
7/13 19:38:03 (85272.106) (5773): condor_read(): timeout reading buffer.
7/13 19:38:03 (85272.107) (5781): condor_read(): timeout reading buffer.
7/13 19:42:43 (85272.98) (5679): condor_read(): timeout reading buffer.
7/13 19:42:43 (85272.102) (5705): condor_read(): timeout reading buffer.
7/13 19:42:43 (85272.101) (5696): condor_read(): timeout reading buffer.
7/13 19:42:44 (85272.104) (5723): condor_read(): timeout reading buffer.
7/13 19:42:44 (85272.103) (5714): condor_read(): timeout reading buffer.
7/13 19:42:44 (85272.105) (5765): condor_read(): timeout reading buffer.
7/13 19:43:04 (85272.106) (5773): condor_read(): timeout reading buffer.
7/13 19:43:04 (85272.107) (5781): condor_read(): timeout reading buffer.
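To see at a glance which jobs hit the condor_read() timeout and how often, the grep above can be extended into a per-job tally. This is a minimal sketch: count_shadow_timeouts is a hypothetical helper, not part of Condor, and it assumes every ShadowLog line follows the "date time (cluster.proc) (pid): message" layout seen in the excerpt, so the job ID is the third whitespace-separated field.

```shell
# Hypothetical helper (not part of Condor): tally condor_read() timeouts per job.
# Assumes ShadowLog lines look like: 7/13 19:38:03 (85272.106) (5773): message
count_shadow_timeouts() {
  grep 'condor_read(): timeout reading buffer' "$1" \
    | awk '{print $3}' \
    | sort | uniq -c | sort -rn   # field 3 is "(cluster.proc)"; most timeouts first
}
```

Usage: count_shadow_timeouts /home/condor/condor/local.galaxy/log/ShadowLog.old prints one line per job ID, each prefixed with its timeout count.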
