
Re: [Condor-users] Successfully reproducing the job suspension problem(return value 143)



Dear all,

 

Further to the email below, we can’t reproduce this problem on Condor with jobs other than compiled Matlab executables.  The other jobs we run suspend and resume fine.

 

It seems that the problem occurs only when a compiled Matlab executable is running on Condor and a user logs into the compute node and then back out within a 10-minute window.

 

Has anybody else had any problems like this?

 

Shaun

 

 


From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Shaun J. O'Callaghan
Sent: 30 November 2006 16:26
To: Condor-Users Mail List
Cc: condor-admin@xxxxxxxxxxx
Subject: [Condor-users] Successfully reproducing the job suspension problem(return value 143)

 

Dear all,

 

I can successfully reproduce the strange job suspension and termination problem by logging into a machine that is currently executing a job and then logging out within the 15-minute suspension interval.  When the job is resumed, it terminates with the return value 143.

 

Can somebody please confirm whether this is a bug in Condor 6.8.0 or not?  Again, I’m not running Java universe jobs as other people have done in the past; this is a vanilla universe job running a compiled Matlab executable under Condor.

 

 

Regards,

 

Shaun

 

 

 

 


From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Shaun J. O'Callaghan
Sent: 29 November 2006 09:44
To: Condor-Users Mail List
Subject: [Condor-users] Strange termination - Return value 143 ?

 

Dear All,

 

We’re experiencing some very strange intermittent problems when dealing with large batches of jobs (14,000+) on Condor v 6.8.0 (Linux Central Manager/XP Compute nodes).

 

Some jobs are terminating with a return value of 143.  These are vanilla jobs and are actually compiled Matlab executables that rely on the Matlab Component Runtime (MCR), which is present and on the path of every compute node.  This is not a missing-library issue, as only 120 or so jobs failed in a batch of 14,000+.

 

I’ve pulled out the log entries covering the lifetime of one of the jobs that failed with this return value and listed them below.  I’ve read on the Condor list that this problem can occur when a job is suspended and a user logs out of the machine.

 

As mentioned we’re running 6.8.0 across the pool.  Has this problem been rectified in 6.8.2 or can anyone provide any further information on this?

 

Kind Regards,

 

Shaun

 

Job details below

 

 

000 (014.1804.000) 11/08 08:34:57 Job submitted from host: <xxx.xxx.xxx.xxx:1058>
001 (014.1804.000) 11/09 15:30:12 Job executing on host: < xxx.xxx.xxx.xxx:2956>
006 (014.1804.000) 11/09 15:50:21 Image size of job updated: 97528
010 (014.1804.000) 11/09 16:08:57 Job was suspended.
            Number of processes actually suspended: 2
011 (014.1804.000) 11/09 16:18:47 Job was unsuspended.
010 (014.1804.000) 11/09 17:40:31 Job was suspended.
            Number of processes actually suspended: 2
011 (014.1804.000) 11/09 17:49:33 Job was unsuspended.
006 (014.1804.000) 11/09 17:49:41 Image size of job updated: 97536
005 (014.1804.000) 11/09 17:49:42 Job terminated.
            (1) Normal termination (return value 143)
                        Usr 0 01:56:36, Sys 0 00:01:26  -  Run Remote Usage
                        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                        Usr 0 01:56:36, Sys 0 00:01:26  -  Total Remote Usage
                        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
            151  -  Run Bytes Sent By Job
            3897398  -  Run Bytes Received By Job
            151  -  Total Bytes Sent By Job
            3897398  -  Total Bytes Received By Job
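
For a batch of this size, checking each failed job's log by hand is slow, so a small script may help confirm whether every job that returned 143 had been suspended first. The following is only a rough sketch: it assumes the user log has the same line layout as the excerpt above (event code, job id in parentheses, date, time, message), and the function name `find_suspended_failures` and the `SAMPLE` text are made up for illustration and are not part of Condor.

```python
import re
from collections import defaultdict

# Header line of a user-log event, e.g.
# "005 (014.1804.000) 11/09 17:49:42 Job terminated."
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+\.\d+\.\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)
RETVAL_RE = re.compile(r"return value (\d+)")


def find_suspended_failures(log_text, bad_value=143):
    """Map each job that exited with bad_value to its suspension count."""
    suspensions = defaultdict(int)  # job id -> number of 010 events seen
    failures = {}                   # job id -> suspensions before the failure
    current_code = None
    current_job = None
    for raw in log_text.splitlines():
        line = raw.strip()
        m = EVENT_RE.match(line)
        if m:
            current_code, current_job = m.group(1), m.group(2)
            if current_code == "010":  # "Job was suspended."
                suspensions[current_job] += 1
            continue
        # Detail lines belong to the most recent event header.
        if current_code == "005" and current_job is not None:
            r = RETVAL_RE.search(line)
            if r and int(r.group(1)) == bad_value:
                failures[current_job] = suspensions[current_job]
    return failures


# Hypothetical sample, shortened from the excerpt above.
SAMPLE = """\
001 (014.1804.000) 11/09 15:30:12 Job executing on host: <xxx.xxx.xxx.xxx:2956>
010 (014.1804.000) 11/09 16:08:57 Job was suspended.
            Number of processes actually suspended: 2
011 (014.1804.000) 11/09 16:18:47 Job was unsuspended.
010 (014.1804.000) 11/09 17:40:31 Job was suspended.
            Number of processes actually suspended: 2
011 (014.1804.000) 11/09 17:49:33 Job was unsuspended.
005 (014.1804.000) 11/09 17:49:42 Job terminated.
            (1) Normal termination (return value 143)
"""

if __name__ == "__main__":
    for job, count in find_suspended_failures(SAMPLE).items():
        print(job, "returned 143 after", count, "suspension(s)")
```

A job reported with a suspension count of zero would suggest that suspension is not the only trigger; if every failed job shows at least one suspension, that is consistent with the behaviour described above.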