
[Condor-users] Processes run, but quit immediately



Hi there,
We've installed Condor on three of our 40 machines here: one acts as central manager and submit machine, the other two as execute nodes. I had some problems initially with file permissions, but as far as I can see these are all ironed out. I'm not using a shared file system.
However, whenever the test jobs that came with Condor are run, they are sent to the two execute nodes and then pre-empted immediately (as far as I can tell from the status changes). The shadow log on the central manager looks like this:
 
2/10 15:12:32 (59.1) (26137):Shadow: RSC_SOCK connected, fd = 17
2/10 15:12:32 (59.1) (26137):Shadow: CLIENT_LOG connected, fd = 18
2/10 15:12:32 (59.1) (26137):My_Filesystem_Domain = "beo"
2/10 15:12:32 (59.1) (26137):My_UID_Domain = "beo"
2/10 15:12:32 (59.1) (26137):   Entering pseudo_get_file_stream
2/10 15:12:32 (59.1) (26137):   file = "/home/condor/spool/cluster59.ickpt.subproc0"
2/10 15:12:32 (59.1) (26137):    Weird 0xc0a801fe
2/10 15:12:32 (59.1) (26137):    Weird 0xc0a801fe
2/10 15:12:32 (59.1) (26137):Reaped child status - pid 26138 exited with status 0
2/10 15:12:33 (59.1) (26137):Shadow: Job 59.1 exited, termsig = 9, coredump = 0, retcode = 129
2/10 15:12:33 (59.1) (26137):Shadow: Job was kicked off without a checkpoint
2/10 15:12:33 (59.1) (26137):Shadow: DoCleanup: unlinking TmpCkpt '/home/condor/spool/cluster59.proc1.subproc0.tmp'
2/10 15:12:33 (59.1) (26137):Trying to unlink /home/condor/spool/cluster59.proc1.subproc0.tmp
2/10 15:12:33 (59.1) (26137):user_time = 1 ticks
2/10 15:12:33 (59.1) (26137):sys_time = 1 ticks
2/10 15:12:33 (59.1) (26137):********** Shadow Exiting(107) **********
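
One note on the "Weird 0xc0a801fe" lines: the value looks like it could be an IPv4 address written as a 32-bit hex number. That is only an assumption on my part, as the log itself doesn't say what the value is. A minimal Python sketch of the decoding, where hex_to_ipv4 is just an illustrative helper name and not anything from Condor:

```python
import ipaddress


def hex_to_ipv4(value: str) -> str:
    """Interpret a 32-bit hex string as a big-endian (network order) IPv4 address.

    Assumption: the logged value is an address at all; the log does not confirm this.
    """
    return str(ipaddress.IPv4Address(int(value, 16)))


if __name__ == "__main__":
    print(hex_to_ipv4("0xc0a801fe"))  # prints 192.168.1.254
```

If that reading is right, the value corresponds to 192.168.1.254, which is in a private address range.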
 
On the machine the job was allocated to, the StarterLog shows the files being received fine, but then gives this:
 
2/10 15:07:56 Started user job - PID = 3910
2/10 15:07:56 cmd_fp = 0x828be78
2/10 15:07:56 end
2/10 15:07:56   *FSM* Transitioning to state "SUPERVISE"
2/10 15:07:56   *FSM* Got asynchronous event "CHILD_EXIT"
2/10 15:07:56   *FSM* Executing transition function "reaper"
2/10 15:07:56 Process 3910 exited with status 129
2/10 15:07:56 EXEC of user process failed, probably insufficient swap
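
In case it helps anyone line up the two logs, here is a rough Python sketch that pulls the exit details out of lines like the ones above. The regular expressions are written against these excerpts only, so they are an assumption about the log format rather than anything official, and parse_exit is a made-up helper name:

```python
import re

# Patterns written against the log excerpts in this message only (an assumption,
# not a documented format).
SHADOW_EXIT = re.compile(
    r"Job (?P<job>\d+\.\d+) exited, termsig = (?P<termsig>\d+), "
    r"coredump = (?P<coredump>\d+), retcode = (?P<retcode>\d+)"
)
STARTER_EXIT = re.compile(
    r"Process (?P<pid>\d+) exited with status (?P<status>\d+)"
)


def parse_exit(line):
    """Return (source, fields) for a recognised exit line, or None otherwise."""
    for source, pattern in (("shadow", SHADOW_EXIT), ("starter", STARTER_EXIT)):
        match = pattern.search(line)
        if match:
            # Keep the job id as a string ("59.1"); convert everything else to int.
            fields = {
                key: (value if key == "job" else int(value))
                for key, value in match.groupdict().items()
            }
            return source, fields
    return None


if __name__ == "__main__":
    print(parse_exit("2/10 15:07:56 Process 3910 exited with status 129"))
```

Run over both logs, this makes it easy to see that the shadow's retcode and the starter's exit status are both 129 for this job.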
 
Does anyone have any ideas? I'll be happy to send any other details.
 
Many thanks.