
[Condor-users] Condor-related signal 11



Dear all,

I have a puzzling problem that I suspect points to either a misconfiguration of our Condor installation or a bug in Condor.

* I submit a DAG and several jobs come back failing with signal 11 (SIGSEGV). The jobs' .err and .out files are empty.
* I run a job locally and it succeeds.
* I run a job with condor_run and it succeeds.
* I rsh to a node that gave a signal 11, run a job there, and it succeeds.
* Attaching strace to the process shows that it dies mid-computation, not during I/O.
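
One way to narrow down this kind of discrepancy is to wrap the job executable in a small script that records the resource limits it inherited and how the child process exited, so that a run under the DAG can be compared with an interactive run. The following is only a minimal sketch in Python: the function names are hypothetical, and the idea that resource limits differ between the two environments is an assumption for illustration, not something established above.

```python
import resource
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # subprocess reports death-by-signal as the negated signal number.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown"
        return "killed by signal %d (%s)" % (signum, name)
    return "exited with status %d" % returncode


def snapshot_limits():
    """Collect the (soft, hard) limits that most often matter for a segfault."""
    names = ["RLIMIT_STACK", "RLIMIT_AS", "RLIMIT_DATA", "RLIMIT_CORE"]
    return {n: resource.getrlimit(getattr(resource, n)) for n in names}


def run_and_report(argv, log=sys.stderr):
    """Log the inherited limits, run argv, then log how it exited."""
    for name, (soft, hard) in sorted(snapshot_limits().items()):
        log.write("%s soft=%s hard=%s\n" % (name, soft, hard))
    result = subprocess.run(argv)
    log.write(describe_exit(result.returncode) + "\n")
    return result.returncode


if __name__ == "__main__":
    # Usage (hypothetical script name): python wrap_job.py ./my_job arg1 arg2
    code = run_and_report(sys.argv[1:])
    sys.exit(0 if code == 0 else 1)
```

Comparing the logged limits from a failing DAG run with those from a successful interactive run on the same node would show whether the two environments actually differ in this respect.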

So the only way to get a signal 11 is to run the job through DAGMan. I believe we're running Condor 6.9.4 with the DAGMan 7.0 binaries pre-released to LIGO (this is the LIGO Nemo cluster at UWM). Any and all help would be appreciated.

Thanks,
Nick

===================================
Nickolas Fotopoulos
nvf@xxxxxxxxxxxxxxxxxxxx

Office: (414) 229-6438
Fax: (414) 229-5589
University of Wisconsin - Milwaukee
Physics Bldg, Rm 471
===================================