[Condor-users] Condor-related signal 11
- Date: Tue, 26 Feb 2008 18:03:56 -0600
- From: Nickolas Fotopoulos <nvf@xxxxxxxxxxxxxxxxxxxx>
- Subject: [Condor-users] Condor-related signal 11
I have a very mysterious problem that I suspect points to a problem in
our Condor configuration or a bug in Condor.
* I submit a DAG and several jobs come back failing with signal
11. Job .err and .out files are empty.
* I run locally and a job succeeds
* I run with condor_run and a job succeeds
* I rsh to a node that gave a signal 11 and a job succeeds
* Attaching strace to the process shows that it dies
mid-computation, not during any I/O or anything.
So the only way to get signal 11 is to run the job through
dagman. I believe we're running Condor 6.9.4 with the dagman 7.0
binaries pre-released to LIGO (this is the LIGO Nemo cluster at UWM).
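
Since the same job succeeds under every launch path except dagman, one way to narrow this down is to record what the process actually inherits in each case and diff the results. The following is only a minimal sketch, not anything Condor provides: the wrapper, its file name (job_wrapper.py) and the helper names are hypothetical, and it assumes the difference lies in resource limits or environment variables, which may not be the case here.

```python
import os
import resource
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a readable string."""
    if returncode < 0:
        # subprocess reports death by signal as the negated signal number.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {returncode}"


def snapshot():
    """Collect limits and environment that may differ between launch paths."""
    lines = []
    for label in ("RLIMIT_STACK", "RLIMIT_CORE", "RLIMIT_AS", "RLIMIT_DATA"):
        soft, hard = resource.getrlimit(getattr(resource, label))
        lines.append(f"{label} soft={soft} hard={hard}")
    lines.append(f"cwd={os.getcwd()}")
    for key in sorted(os.environ):
        lines.append(f"env {key}={os.environ[key]}")
    return lines


def main(argv):
    # Hypothetical usage: python job_wrapper.py /path/to/real_job arg1 arg2
    # Everything is written to stderr so it lands in the job's .err file.
    for line in snapshot():
        print(line, file=sys.stderr)
    result = subprocess.run(argv)
    print(describe_exit(result.returncode), file=sys.stderr)
    if result.returncode < 0:
        # Follow the shell convention of 128 + signal number.
        return 128 - result.returncode
    return result.returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

Running the job through such a wrapper once from the DAG and once by hand on the same node, then diffing the two .err files, would show whether the stack limit, core limit, working directory, or environment differs between the two cases.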
Any and all help would be appreciated.
Office: (414) 229-6438
Fax: (414) 229-5589
University of Wisconsin - Milwaukee
Physics Bldg, Rm 471