
[HTCondor-users] condor_rm problems



Hi,

I had ~600 000 jobs in a single schedd queue and tried to delete them
all with condor_rm -all.
All the jobs were marked for removal, but after a while I received this:

This is an automated email from the Condor system
on machine "btbeater001.xxx.xx".  Do not reply.

"/usr/sbin/condor_schedd" on "btbeater001.xxx.xx" was killed because
it was no longer responding.
Condor will automatically restart this process in 10 seconds.

*** Last 20 line(s) of file /var/log/condor/SchedLog:
01/31/14 08:10:10 (pid:3082203) Sent ad to 1 collectors for xxx@xxxxxx
01/31/14 08:12:31 (pid:3082203) Can't find address for startd btbeater001.xxx.xx
01/31/14 08:12:31 (pid:3082203) Can't find address for negotiator
01/31/14 08:12:31 (pid:3082203) Failed to send RESCHEDULE to unknown daemon:
01/31/14 08:18:41 (pid:3082203) TransferQueueManager stats: active
up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
01/31/14 08:18:41 (pid:3082203) TransferQueueManager upload 1m I/O
load: 0 bytes/s  0.000 disk load  0.000 net load
01/31/14 08:18:41 (pid:3082203) TransferQueueManager download 1m I/O
load: 0 bytes/s  0.000 disk load  0.000 net load
01/31/14 08:18:41 (pid:3082203) Sent ad to central manager for xxx@xxxxxx
01/31/14 08:18:41 (pid:3082203) Sent ad to 1 collectors for xxx@xxxxxx
01/31/14 08:20:57 (pid:3082203) Can't find address for startd btbeater001.xxx.xx
01/31/14 08:20:57 (pid:3082203) Can't find address for negotiator
01/31/14 08:20:57 (pid:3082203) Failed to send RESCHEDULE to unknown daemon:
01/31/14 08:36:12 (pid:3082203) TransferQueueManager stats: active
up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
01/31/14 08:36:12 (pid:3082203) TransferQueueManager upload 1m I/O
load: 0 bytes/s  0.000 disk load  0.000 net load
01/31/14 08:36:12 (pid:3082203) TransferQueueManager download 1m I/O
load: 0 bytes/s  0.000 disk load  0.000 net load
01/31/14 08:36:12 (pid:3082203) Sent ad to central manager for xxx@xxxxxx
01/31/14 08:36:12 (pid:3082203) Sent ad to 1 collectors for xxx@xxxxxx
01/31/14 08:43:01 (pid:3082203) Can't find address for startd btbeater001.xxx.xx
01/31/14 08:43:01 (pid:3082203) Can't find address for negotiator
01/31/14 08:43:01 (pid:3082203) Failed to send RESCHEDULE to unknown daemon:
*** End of file SchedLog

By the way, is there a rule of thumb for figuring out the number of
jobs a single schedd can safely handle? For example, if I take the
peak number of queued jobs in the system as an input, how can I
calculate the number of schedds necessary (given the hardware
available) to reliably serve that many jobs?
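
To make the question concrete, the calculation I have in mind is roughly
the sketch below. The function name and the max_jobs_per_schedd
parameter are only illustrative; that per-schedd limit is exactly the
hardware-dependent number I do not know, and the value used in the
example is made up, not an HTCondor figure.

```python
def schedds_needed(peak_queued_jobs: int, max_jobs_per_schedd: int) -> int:
    """Return how many schedds are needed to hold peak_queued_jobs.

    max_jobs_per_schedd is a hypothetical, hardware-dependent limit
    (the unknown being asked about), not a documented HTCondor value.
    """
    if peak_queued_jobs < 0:
        raise ValueError("peak_queued_jobs must not be negative")
    if max_jobs_per_schedd <= 0:
        raise ValueError("max_jobs_per_schedd must be positive")
    # Integer ceiling division: round up so no job is left without a schedd.
    return (peak_queued_jobs + max_jobs_per_schedd - 1) // max_jobs_per_schedd


if __name__ == "__main__":
    # 600 000 queued jobs with a made-up limit of 100 000 jobs per schedd.
    print(schedds_needed(600_000, 100_000))
```

So what I am really after is a sensible way to estimate
max_jobs_per_schedd for a given machine.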

Thanks,
Daniel