We're seeing some strange failures when migrating checkpointed jobs in our
flocked environment, and I'd be grateful for any pointers to fixing this
behaviour. Firstly, our grid is made up of a number of flocked pools (all
Linux, of various distributions, but all running Condor 7.6.4). One of our
users has reported a problem where checkpointed jobs that had been running
satisfactorily in his pool fail to start when evicted and matched with a
resource in another pool (not any specific one), with the following error
message in the job's log file:
005 (239620.001.000) 12/05 13:09:56 Job terminated. (0) Abnormal termination (signal 4)
And here's the corresponding snippet from the StarterLog:
12/05/11 13:08:29 Started user job - PID = 16369
12/05/11 13:08:29 cmd_fp = 0x163f220
12/05/11 13:08:29 restart
12/05/11 13:08:29 end
12/05/11 13:08:29 *FSM* Transitioning to state "SUPERVISE"
12/05/11 13:08:29 *FSM* Got asynchronous event "CHILD_EXIT"
12/05/11 13:08:29 *FSM* Executing transition function "reaper"
12/05/11 13:08:29 *FSM* Aborting transition function "reaper"
12/05/11 13:08:29 *FSM* Executing state func "supervise_all()" [ GET_NEW_PROC SUSPEND VACATE ALARM DIE CHILD_EXIT PERIODIC_CKPT ]
12/05/11 13:09:56 *FSM* Got asynchronous event "CHILD_EXIT"
12/05/11 13:09:56 *FSM* Executing transition function "reaper"
12/05/11 13:09:56 Process 16369 killed by signal 4
12/05/11 13:09:56 Process exited abnormally
12/05/11 13:09:56 *FSM* Transitioning to state "PROC_EXIT"
12/05/11 13:09:56 *FSM* Executing state func "proc_exit()" [ DIE ]
12/05/11 13:09:56 *FSM* Transitioning to state "SEND_CORE"
12/05/11 13:09:56 *FSM* Executing state func "send_core()" [ SUSPEND VACATE DIE ]
12/05/11 13:09:56 No core file to send - probably ran out of disk
12/05/11 13:09:56 *FSM* Executing transition function "dispose_one"
12/05/11 13:09:56 Sending final status for process 239620.1
12/05/11 13:09:56 STATUS encoded as ABNORMAL, NO CORE
The strange thing is that if his jobs originally start running in any pool, they'll happily run to completion. His jobs also migrate happily as long as it's between machines in his own pool (details below). The problem only seems to arise when a checkpointed image that was created by a machine in his pool tries to restart in another pool.
The user's pool has execute nodes running x86_64 Debian 6.0.3, whereas his submit host is a 32-bit Ubuntu 10.04.3 machine (using the Debian 5 Condor binary). Many of the pools where his jobs fail to restart are also running Debian 6.0.3, and his executable was compiled and linked on a Debian 6.0.3 machine.
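Since signal 4 is SIGILL (illegal instruction), one thing we've started checking is whether the CheckpointPlatform strings advertised by the two pools actually agree; if the restart host lacks a CPU feature the checkpointing host had, an image can die exactly this way. Below is a sketch of the comparison, with hypothetical platform strings standing in for what `condor_status -pool <central-manager> -format '%s\n' CheckpointPlatform | sort -u` would return for each pool:

```shell
#!/bin/sh
# Hypothetical CheckpointPlatform strings for the user's pool and a
# remote pool, as would be gathered with condor_status (see above).
home_platform="LINUX X86_64 2.6.x normal N/A ssse3 sse4_1 sse4_2"
remote_platform="LINUX X86_64 2.6.x normal N/A ssse3"

# Condor only considers a restart safe when these strings match;
# here the remote pool is missing the sse4 feature flags.
if [ "$home_platform" = "$remote_platform" ]; then
    result="match"
else
    result="differ"
fi
echo "Checkpoint platforms $result"
```

Is that the right attribute to be comparing, or is there a better way to see why the matchmaker is pairing the checkpointed image with an incompatible machine?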
Thanks for any helpful suggestions.