
[Condor-users] Job Failure



Hi,

I have a problem with jobs that end prematurely. The jobs stop without finishing and without reporting any error; they simply stop. The program I am running works normally when run without Condor. We have a local cluster that runs Condor across 24 dual-Opteron nodes, each running SUSE Linux 9.2 and Condor 6.6.9 with 4 virtual machines per node. We have turned off preemption and checkpointing.

Below are examples from the Condor StarterLog on the nodes where the jobs stop. There is no indication of the jobs stopping in the MasterLog. Although the times listed below are at night, the same thing has happened at various times throughout the day. Is there a reason jobs would end prematurely? Do the StarterLog excerpts below indicate a problem?


Example One
******************************************************************************************
From Condor StarterLog

1/17 01:06:36 entering FileTransfer::Upload
1/17 01:06:36 entering FileTransfer::DoUpload
1/17 01:06:36 DoUpload: send file mono_basis-1_job-100a.out
1/17 01:06:36 ReliSock: put_file: sent 71147 bytes
1/17 01:06:36 DoUpload: send file mono_basis-1_job-100a.rwf
1/17 01:06:36 ReliSock: put_file: sent 5391 bytes
1/17 01:06:36 DoUpload: send file mono_basis-1_job-100a.gbs
1/17 01:06:36 ReliSock: put_file: sent 802 bytes
1/17 01:06:36 DoUpload: send file mono_basis-1_job-100a.pot
1/17 01:06:36 ReliSock: put_file: sent 607 bytes
1/17 01:06:36 DoUpload: exiting at 1413
1/17 01:06:36 Inside OsProc::JobExit()
1/17 01:06:36 In VanillaProc::PublishUpdateAd()
1/17 01:06:36 ProcAPI::buildFamily failed: parent 16052 not found on system.
1/17 01:06:36 Inside OsProc::PublishUpdateAd()
1/17 01:06:36 DaemonCore: Can't receive command request (perhaps a timeout?)
1/17 01:06:36 IO: Incoming packet is too big
1/17 01:06:36 DaemonCore: Can't receive command request (perhaps a timeout?)
1/17 01:06:36 IO: Incoming packet is too big
1/17 01:06:36 DaemonCore: Can't receive command request (perhaps a timeout?)
1/17 01:06:36 Got SIGQUIT.  Performing fast shutdown.
1/17 01:06:36 ShutdownFast all jobs.
1/17 01:06:36 Got ShutdownFast when no jobs running.
******************************************************************************************

Example Two
******************************************************************************************
From Condor StarterLog

1/17 01:02:42 FileTransfer::UploadFiles: sent TransKey=1#45ad294b7b3797a06413011b
1/17 01:02:42 entering FileTransfer::Upload
1/17 01:02:42 entering FileTransfer::DoUpload
1/17 01:02:42 DoUpload: send file mono_basis-1_job-100.out
1/17 01:02:42 ReliSock: put_file: sent 61511 bytes
1/17 01:02:42 DoUpload: send file mono_basis-1_job-100.rwf
1/17 01:02:42 ReliSock: put_file: sent 5391 bytes
1/17 01:02:42 DoUpload: send file mono_basis-1_job-100.gbs
1/17 01:02:42 ReliSock: put_file: sent 802 bytes
1/17 01:02:42 DoUpload: send file mono_basis-1_job-100.pot
1/17 01:02:42 ReliSock: put_file: sent 607 bytes
1/17 01:02:42 DoUpload: exiting at 1413
1/17 01:02:42 Inside OsProc::JobExit()
1/17 01:02:42 In VanillaProc::PublishUpdateAd()
1/17 01:02:42 ProcAPI::buildFamily failed: parent 6062 not found on system.
1/17 01:02:42 Inside OsProc::PublishUpdateAd()
1/17 01:02:42 DaemonCore: Can't receive command request (perhaps a timeout?)
1/17 01:02:42 IO: Incoming packet is too big
1/17 01:02:42 DaemonCore: Can't receive command request (perhaps a timeout?)
1/17 01:02:42 IO: Incoming packet is too big
1/17 01:02:42 DaemonCore: Can't receive command request (perhaps a timeout?)
1/17 01:02:42 Got SIGQUIT.  Performing fast shutdown.
1/17 01:02:42 ShutdownFast all jobs.
1/17 01:02:42 Got ShutdownFast when no jobs running.
******************************************************************************************
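
In case it helps anyone look for the same pattern in their own logs, here is a rough sketch of how the suspicious lines could be pulled out of a StarterLog. It is only an illustration: the function names (scan_starter_log, count_by_pattern) are made up for this example, the message fragments are copied from the excerpts above, and the line format is assumed to be "M/D HH:MM:SS message" as shown there.

```python
import re
from collections import Counter

# Message fragments copied from the StarterLog excerpts above.
SUSPECT_PATTERNS = [
    "ProcAPI::buildFamily failed",
    "Can't receive command request",
    "Incoming packet is too big",
    "Got SIGQUIT",
]

# Assumed line format: "1/17 01:06:36 <message>", as in the excerpts.
LINE_RE = re.compile(r"^(\d{1,2}/\d{1,2}) (\d{2}:\d{2}:\d{2}) (.*)$")


def scan_starter_log(lines):
    """Return (timestamp, message) pairs for lines containing a suspect fragment."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip separators and anything that is not a log line
        date, time, message = match.groups()
        if any(pattern in message for pattern in SUSPECT_PATTERNS):
            hits.append((f"{date} {time}", message))
    return hits


def count_by_pattern(hits):
    """Count how many flagged lines contain each suspect fragment."""
    counts = Counter()
    for _, message in hits:
        for pattern in SUSPECT_PATTERNS:
            if pattern in message:
                counts[pattern] += 1
    return counts
```

To use it, open the StarterLog for the virtual machine in question (the location depends on the local Condor configuration), pass the file object to scan_starter_log, and print the returned pairs or the counts from count_by_pattern.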

Take Care,
Glen