
Re: [HTCondor-users] condor_rm problems



After condor_schedd restarted, it resumed deleting the jobs. It has
been at it for more than an hour now, and its memory usage keeps
growing (I don't know why).

2014-01-31 Pek Daniel <pekdaniel@xxxxxxxxx>:
> Hi,
>
> I had ~600 000 jobs in a single schedd queue, and I tried to delete
> the jobs with condor_rm -all.
> It marked all the jobs for removal, but after a while, this happened:
>
> This is an automated email from the Condor system
> on machine "btbeater001.xxx.xx".  Do not reply.
>
> "/usr/sbin/condor_schedd" on "btbeater001.xxx.xx" was killed because
> it was no longer responding.
> Condor will automatically restart this process in 10 seconds.
>
> *** Last 20 line(s) of file /var/log/condor/SchedLog:
> 01/31/14 08:10:10 (pid:3082203) Sent ad to 1 collectors for xxx@xxxxxx
> 01/31/14 08:12:31 (pid:3082203) Can't find address for startd btbeater001.xxx.xx
> 01/31/14 08:12:31 (pid:3082203) Can't find address for negotiator
> 01/31/14 08:12:31 (pid:3082203) Failed to send RESCHEDULE to unknown daemon:
> 01/31/14 08:18:41 (pid:3082203) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
> 01/31/14 08:18:41 (pid:3082203) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
> 01/31/14 08:18:41 (pid:3082203) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
> 01/31/14 08:18:41 (pid:3082203) Sent ad to central manager for xxx@xxxxxx
> 01/31/14 08:18:41 (pid:3082203) Sent ad to 1 collectors for xxx@xxxxxx
> 01/31/14 08:20:57 (pid:3082203) Can't find address for startd btbeater001.xxx.xx
> 01/31/14 08:20:57 (pid:3082203) Can't find address for negotiator
> 01/31/14 08:20:57 (pid:3082203) Failed to send RESCHEDULE to unknown daemon:
> 01/31/14 08:36:12 (pid:3082203) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
> 01/31/14 08:36:12 (pid:3082203) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
> 01/31/14 08:36:12 (pid:3082203) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
> 01/31/14 08:36:12 (pid:3082203) Sent ad to central manager for xxx@xxxxxx
> 01/31/14 08:36:12 (pid:3082203) Sent ad to 1 collectors for xxx@xxxxxx
> 01/31/14 08:43:01 (pid:3082203) Can't find address for startd btbeater001.xxx.xx
> 01/31/14 08:43:01 (pid:3082203) Can't find address for negotiator
> 01/31/14 08:43:01 (pid:3082203) Failed to send RESCHEDULE to unknown daemon:
> *** End of file SchedLog
>
> By the way, is there a rule of thumb for figuring out how many jobs
> a single schedd can safely handle? For example, given the peak
> number of queued jobs in the system as input, how can I calculate
> the number of schedds necessary (knowing the available hardware) to
> reliably serve that many jobs?
>
> Thanks,
> Daniel