
Re: [Condor-users] Lazy jobs that never really start running



>I was interpreting your original message to say that the jobs that
>dagman was submitting were themselves dagman executables (in some
>horrific recursive multi machine trashing setup if the execute nodes
>had schedd's running on them as well). I take it this is not the case.

Sorry, I'm not really well-versed in condorspeak.

 
>> >Have you tried 6.6.10 instead (assuming you don't absolutely require
>> >the features in the 6.7 series)
>> 
>> I just switched from 6.7.8 to 6.6.10. I'm very curious what will happen.
>> Although I'll miss the STARTD_EXPRS per vm very much...
>
>If it works pray to $(DEITY_OF_CHOICE) for 6.8 :)


I tried all that you suggested:
- Switched from 6.7.8 to 6.6.10 on the whole pool
- Moved the job submission to another computer
- Prayed a lot in advance, just in case it helps... ;)

Result:

90% of the queued tasks completed without any problems, but after that the
same problem appeared (computers sitting idle, tasks that think they are running when they are not).
I cleaned the whole pool (sounds funny, doesn't it?), removed all jobs, and started from scratch.

Result:

All of the machines were used for about an hour, but after that computers started dropping out of the matched state
and only rarely became claimed and busy. The match log only says that no match was found.

And this is from the schedd log:

7/7 13:15:53 Sent RELEASE_CLAIM to startd on <192.168.0.117:1036>
7/7 13:15:53 Match record (<192.168.0.117:1036>, 225, 0) deleted
7/7 13:15:53 Sent RELEASE_CLAIM to startd on <192.168.0.108:1044>
7/7 13:15:53 Match record (<192.168.0.108:1044>, 216, 0) deleted
7/7 13:15:53 Sent RELEASE_CLAIM to startd on <192.168.0.108:1044>
7/7 13:15:53 Match record (<192.168.0.108:1044>, 215, 0) deleted
7/7 13:15:53 Sent RELEASE_CLAIM to startd on <192.168.0.101:1057>
7/7 13:15:53 Match record (<192.168.0.101:1057>, 196, 0) deleted
7/7 13:15:53 DaemonCore: Command received via UDP from host <192.168.0.50:3082>
7/7 13:15:53 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
7/7 13:15:53 Shadow pid 5324 for job 665.0 exited with status 100
7/7 13:15:53 Started shadow for job 471.0 on "<192.168.0.124:1037>", (shadow pid = 4128)
7/7 13:15:53 Sent ad to central manager for szabolcs@xxxxxxxxxxxxxxxxxxx
7/7 13:15:53 Zombie process has not been cleaned up by reaper - pid 4848
7/7 13:15:53 DaemonCore: Command received via UDP from host <192.168.0.50:3086>
7/7 13:15:53 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
7/7 13:15:53 Shadow pid 3344 for job 643.0 exited with status 100
7/7 13:15:53 DaemonCore: Command received via UDP from host <192.168.0.50:3087>
7/7 13:15:53 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
7/7 13:15:53 Shadow pid 4060 for job 670.0 exited with status 100
7/7 13:15:53 DaemonCore: Command received via UDP from host <192.168.0.50:3088>
7/7 13:15:53 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
7/7 13:15:53 Shadow pid 4264 for job 655.0 exited with status 100
7/7 13:15:53 DaemonCore: Command received via UDP from host <192.168.0.50:3089>
7/7 13:15:53 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
7/7 13:15:53 Shadow pid 1860 for job 885.0 exited with status 100
7/7 13:16:01 DaemonCore: Command received via UDP from host <192.168.0.50:3099>
7/7 13:16:01 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
7/7 13:16:01 Shadow pid 5120 for job 666.0 exited with status 100
7/7 13:16:02 Started shadow for job 667.0 on "<192.168.0.118:1039>", (shadow pid = 3788)
7/7 13:16:02 Sent ad to central manager for szabolcs@xxxxxxxxxxxxxxxxxxx
7/7 13:16:02 Zombie process has not been cleaned up by reaper - pid 4848
7/7 13:16:03 DaemonCore: Command received via TCP from host <192.168.0.118:4674>
7/7 13:16:03 DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
7/7 13:16:03 Got VACATE_SERVICE from <192.168.0.118:4674>
7/7 13:16:03 Sent RELEASE_CLAIM to startd on <192.168.0.118:1039>
7/7 13:16:03 Match record (<192.168.0.118:1039>, 667, 0) deleted
7/7 13:16:03 DaemonCore: Command received via UDP from host <192.168.0.50:3111>
7/7 13:16:03 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
7/7 13:16:03 Shadow pid 3788 for job 667.0 exited with status 108
7/7 13:16:03 Scheduler::Relinquish - mrec is NULL, can't relinquish
7/7 13:16:03 Null parameter --- match not deleted
7/7 13:16:13 Activity on stashed negotiator socket
7/7 13:16:13 Negotiating for owner: szabolcs@xxxxxxxxxxxxxxxxxxx
7/7 13:16:13 Checking consistency running and runnable jobs
7/7 13:16:13 Tables are consistent
...
7/7 13:18:34 DaemonCore: Command received via UDP from host <192.168.0.50:3241>
7/7 13:18:34 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
7/7 13:18:34 Shadow pid 5696 for job 806.0 exited with status 100
7/7 13:18:35 Started shadow for job 646.0 on "<192.168.0.113:1044>", (shadow pid = 4948)
7/7 13:18:35 DaemonCore: Command received via TCP from host <192.168.0.128:3709>
7/7 13:18:35 DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
7/7 13:18:35 Got VACATE_SERVICE from <192.168.0.128:3709>
7/7 13:18:35 Sent RELEASE_CLAIM to startd on <192.168.0.128:1047>
7/7 13:18:35 Match record (<192.168.0.128:1047>, 808, 0) deleted
7/7 13:18:37 Started shadow for job 790.0 on "<192.168.0.118:1039>", (shadow pid = 3560)
7/7 13:18:39 match or classad for job 808.0 was deleted - not forking a shadow
7/7 13:18:46 Activity on stashed negotiator socket
7/7 13:18:46 Negotiating for owner: szabolcs@xxxxxxxxxxxxxxxxxxx
7/7 13:18:46 Checking consistency running and runnable jobs
7/7 13:18:46 Tables are consistent
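
To make an excerpt like this easier to scan, the shadow exit lines can be tallied by status code. This is only a minimal sketch in Python and not part of Condor: the function name `tally_shadow_exits` and the regular expression are assumptions based solely on the line format shown above, and the script does not interpret what the status codes mean.

```python
import re
from collections import Counter

# Matches lines such as:
#   7/7 13:15:53 Shadow pid 5324 for job 665.0 exited with status 100
# The pattern is an assumption derived from the excerpt, not from Condor docs.
SHADOW_EXIT = re.compile(
    r"Shadow pid (\d+) for job (\d+\.\d+) exited with status (\d+)"
)


def tally_shadow_exits(lines):
    """Count shadow exits per status code in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = SHADOW_EXIT.search(line)
        if match:
            counts[int(match.group(3))] += 1
    return counts


# A few lines copied from the excerpt above, used as sample input.
sample = [
    "7/7 13:15:53 Shadow pid 5324 for job 665.0 exited with status 100",
    "7/7 13:15:53 Zombie process has not been cleaned up by reaper - pid 4848",
    "7/7 13:16:01 Shadow pid 5120 for job 666.0 exited with status 100",
    "7/7 13:16:03 Shadow pid 3788 for job 667.0 exited with status 108",
]

if __name__ == "__main__":
    for status, count in sorted(tally_shadow_exits(sample).items()):
        print(f"status {status}: {count} shadow(s)")
```

Passing an open log file instead of `sample` would give the same tally for the whole log, which shows how often each exit status occurs over time.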



What is a zombie process? And what is going on here?

Cheers,
Szabolcs