Re: [Condor-users] Lazy jobs that never really start running

Hi Matt,

>status 108 is (from  http://www.cs.wisc.edu/~adesmet/status.html)
>108  JOB_NOT_STARTED  Can't connect to startd or request refused  

This page is very useful; I had never seen it before.
If I encounter the problem again, I'll try running the startd in fulldebug mode to
find out more about this.

>If you look at the startd log on the machine which job 667.0 was
>matched (should say in its user log or further up the schedd log) it
>might say why this was the case.

Sorry, but in the meantime Condor 6.7.8 has been replaced with 6.8.10 and all log files were
removed to start from scratch.
Sadly, I found that the problem was not related to the development version
but to the fact that I forced Condor to drop claims so that new users would get a similar chance
to claim a machine. When more than 1000 jobs were queued, the claim process somehow died
and I constantly received "DEACTIVATE_CLAIM_FORCIBLY" commands that shut down my
already running processes.
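
In case it helps anyone seeing the same symptom, here is a minimal sketch for counting how
often such a command shows up in a log file. It only does a substring match per line and
makes no assumption about the log line format; the function name count_commands and the
log path given on the command line are hypothetical, not part of Condor.

```python
import sys
from collections import Counter
from typing import Iterable, Sequence


def count_commands(
    lines: Iterable[str],
    commands: Sequence[str] = ("DEACTIVATE_CLAIM_FORCIBLY",),
) -> Counter:
    """Count the lines that mention each command name (plain substring match)."""
    counts: Counter = Counter()
    for line in lines:
        for cmd in commands:
            if cmd in line:
                counts[cmd] += 1
    return counts


if __name__ == "__main__":
    # Usage (the path is an assumption; pass whatever log file you want to scan):
    #   python count_commands.py /path/to/startd/log
    with open(sys.argv[1], errors="replace") as log:
        for name, n in count_commands(log).items():
            print(f"{name}: {n}")
```

A steadily growing count while jobs are still running would match what I described above.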

>It is worth noting that you have 5 jobs all finishing at the same
>time - how rapidly do you churn through jobs?

The computation time varies between 1 and 100 minutes for a single job. Probably
the jobs you saw in the log file were doing similar computations and started at roughly
the same time. (I guess DAGMan submitted the jobs and at the first negotiation cycle
a few computers were matched.)