Re: [Condor-users] Condor-related signal 11

On Feb 26, 2008, at 6:03 PM, Nickolas Fotopoulos wrote:

I have a very mysterious problem that I suspect points to a problem in
our Condor configuration or a bug in Condor.

* I submit a DAG and several of the jobs come back failing with signal
11.  The jobs' .err and .out files are empty.
* I run locally and a job succeeds
* I run with condor_run and a job succeeds
* I rsh to a node that gave a signal 11 and a job succeeds
* Attaching strace to the process shows that it dies mid-computation,
not during I/O.

So the only way to get the signal 11 is to run the job through
dagman.  I believe we're running Condor 6.9.4 with the dagman 7.0
binaries pre-released to LIGO (this is the LIGO Nemo cluster at UWM).
Any and all help would be appreciated.

Have you tried submitting the job directly to Condor with condor_submit, using the same submit description file that the DAG uses? That should more closely resemble how the job is run under DAGMan.

Also try wrapping the job in a script that prints out the environment. Then duplicate that exact environment on one of the execution machines and run the job by hand.
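
A minimal sketch of such a wrapper, in Python, might look like the following. The file name job_env.json and the function names are only illustrative assumptions, not anything Condor provides. dump_environment records the environment the job actually sees on the execute node, and run_with_environment re-runs a command by hand with exactly that environment.

```python
import json
import os
import subprocess
import sys


def dump_environment(path):
    """Save the current environment variables to a JSON file and return them."""
    environ = dict(os.environ)
    with open(path, "w") as handle:
        json.dump(environ, handle, indent=2, sort_keys=True)
    return environ


def run_with_environment(path, command):
    """Run command with exactly the environment saved in path.

    Passing env= replaces the inherited environment instead of extending it.
    A negative return value -N means the command was killed by signal N,
    so a signal 11 shows up as -11.
    """
    with open(path) as handle:
        environ = json.load(handle)
    return subprocess.run(command, env=environ).returncode


def main(argv):
    """Wrapper entry point: argv[1:] is the real job and its arguments."""
    # "job_env.json" is a hypothetical file name, written to the job's
    # working directory; have Condor transfer it back to inspect it.
    dump_environment("job_env.json")
    return subprocess.call(argv[1:])


# To use this file as the wrapper executable, end it with:
#     sys.exit(main(sys.argv))
```

In the submit description file, the wrapper would take the place of the executable and the real job would become its first argument. Afterwards, copy job_env.json to an execute machine and call run_with_environment("job_env.json", [...]) with the job's command line to see whether the signal 11 reproduces outside Condor.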

Thanks and regards,
Jaime Frey
UW-Madison Condor Team