
[Condor-users] Jobs stuck in condor_starter



I've found a job cluster that won't run. Jobs are matched against a slot,
output and error files are created, but condor_starter never transfers
control to the real Executable (which is a Perl script).

In the slot's StarterLog, the following messages appear every hour:

8/19 13:40:48 ERROR "Assertion ERROR on (result)" at line 384 in file NTsenders.C
8/19 13:40:48 condor_write(): Socket closed when trying to write 168 bytes to <10.100.200.93:60802>, fd is 5
8/19 13:40:48 Buf::write(): condor_write() failed
8/19 13:40:48 ERROR "Assertion ERROR on (result)" at line 875 in file NTsenders.C
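
To check whether other slots show the same pattern, a small script along these lines could tally the assertion failures per source location. This is only a sketch: the regular expression is inferred from the four log lines above, not from any documented StarterLog format, and the names ASSERT_RE, find_assertions and SAMPLE_LOG are my own.

```python
import re
from collections import Counter

# Pattern inferred from the log excerpt above (an assumption, not a
# documented format). Example of a matching line:
# 8/19 13:40:48 ERROR "Assertion ERROR on (result)" at line 384 in file NTsenders.C
ASSERT_RE = re.compile(
    r'^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) '
    r'ERROR "Assertion ERROR on \((.*?)\)" '
    r'at line (\d+) in file (\S+)'
)

# The four lines quoted above, used here as sample input.
SAMPLE_LOG = [
    '8/19 13:40:48 ERROR "Assertion ERROR on (result)" at line 384 in file NTsenders.C',
    '8/19 13:40:48 condor_write(): Socket closed when trying to write 168 bytes to <10.100.200.93:60802>, fd is 5',
    '8/19 13:40:48 Buf::write(): condor_write() failed',
    '8/19 13:40:48 ERROR "Assertion ERROR on (result)" at line 875 in file NTsenders.C',
]


def find_assertions(lines):
    """Return (timestamp, source_file, source_line) for each assertion failure."""
    hits = []
    for text in lines:
        match = ASSERT_RE.match(text)
        if match:
            timestamp, _expr, line_no, source_file = match.groups()
            hits.append((timestamp, source_file, int(line_no)))
    return hits


def count_by_location(hits):
    """Count assertion failures per (source_file, source_line)."""
    return Counter((source_file, line_no) for _ts, source_file, line_no in hits)
```

Feeding it the lines of a real StarterLog (for example via `open(path)`) instead of SAMPLE_LOG would show which of the two NTsenders.C locations fires and at what timestamps.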

A side effect appears to be that more jobs are in the R state than there
are slots available (809 free slots, 814 R jobs).

How should I interpret the assert() error?

Condor version 7.0.4

Regards, 
 Steffen

-- 
Steffen Grunewald * MPI Grav.Phys.(AEI) * Am Mühlenberg 1, D-14476 Potsdam
Cluster Admin * http://pandora.aei.mpg.de/merlin/ * http://www.aei.mpg.de/
* e-mail: steffen.grunewald(*)aei.mpg.de * +49-331-567-{fon:7233,fax:7298}
No Word/PPT mails - http://www.gnu.org/philosophy/no-word-attachments.html