
[Condor-users] pseudo_ops.cpp on starter



Hello all,

I am having a problem where jobs are restarting on their own. The job
runcount is more than 1 for many jobs.

We keep seeing ... in our start log: "line 649 in file pseudo_ops.cpp"

Is this a known issue?



Schedd version: CondorVersion = "$CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $"
Startd version: CondorVersion = "$CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $"
Operating system version: RHEL 5.2

I also have MAX_JOBS_RUNNING = 10000


ShadowLog:
The job runs fine, then this error suddenly appears in the ShadowLog:

10/19 17:33:56 (28.3) (29653): DaemonCore: Leaving SendAliveToParent() - success
10/19 17:33:58 (28.3) (29653): Updating Job Queue: SetAttribute(RemoteSysCpu = 682.000000)
10/19 17:33:58 (28.3) (29653): Updating Job Queue: SetAttribute(RemoteUserCpu = 62981.000000)
10/19 17:33:58 (28.3) (29653): Updating Job Queue: SetAttribute(LastJobLeaseRenewal = 1287523744)
10/19 17:34:04 (28.3) (29653): Inside RemoteResource::updateFromStarter()
10/19 17:38:35 (28.3) (29653): FileLock::obtain(1) - @1287524315.981362 lock on /home/mech1//job.out.28 now WRITE
10/19 17:38:35 (28.3) (29653): FileLock::obtain(2) - @1287524315.985954 lock on /home/mech1//job.out.28 now UNLOCKED
10/19 17:38:35 (28.3) (29653): ERROR "Error from starter on slot1@xxxxxxxxxxxxxxxxxxxx: ProcD has failed" at line 649 in file pseudo_ops.cpp
10/19 17:43:27 Initializing a VANILLA shadow for job 28.3
10/19 17:43:27 (28.3) (10012): FileLock object is updating timestamp on: /home/mech1//job.out.28
10/19 17:43:27 (28.3) (10012): UserLog = /home/mech1//job.out.28
10/19 17:43:27 (28.3) (10012): *** Reserved Swap = 5120
10/19 17:43:27 (28.3) (10012): *** Free Swap = 8388440
10/19 17:43:27 (28.3) (10012): in RemoteResource::initStartdInfo()
10/19 17:43:27 (28.3) (10012): Entering DCStartd::activateClaim()
10/19 17:43:27 (28.3) (10012): Initialized the following authorization table:
10/19 17:43:27 (28.3) (10012): Authorizations yet to be resolved:
10/19 17:43:27 (28.3) (10012): allow READ:  */*.mech.mich.edu
10/19 17:43:27 (28.3) (10012): allow WRITE:  */*.mech.mich.edu


StarterLog.slot1.7:10/18 22:03:56 Job 28.3 set to execute immediately
StarterLog.slot1.7:10/18 22:03:56 Starting a VANILLA universe job with ID: 28.3
StarterLog.slot1.7:10/18 22:03:56 Output file: /home/mech1//stdout.28.3
StarterLog.slot1.7:10/18 22:03:56 Error file: /home/mech1//stderr.28.3

The job's run count now stands at 3.
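
To see how widespread the restarts are, a rough sketch like the one below could tally, per job, how many times a shadow was initialized and how many shadow ERROR lines were logged. It only assumes the two line shapes visible in the ShadowLog excerpt above ("Initializing a VANILLA shadow for job 28.3" and "(28.3) (29653): ERROR ..."). The function name summarize_shadow_log and the log path given on the command line are made up for illustration; this is not a Condor tool.

```python
import re
import sys
from collections import Counter

# Matches e.g. "10/19 17:43:27 Initializing a VANILLA shadow for job 28.3"
INIT_RE = re.compile(r"Initializing a \w+ shadow for job (\d+\.\d+)")
# Matches e.g. '10/19 17:38:35 (28.3) (29653): ERROR "Error from starter on'
ERROR_RE = re.compile(r"\((\d+\.\d+)\) \(\d+\): ERROR ")


def summarize_shadow_log(lines):
    """Count shadow initializations and ERROR lines per job id (hypothetical helper)."""
    starts = Counter()
    errors = Counter()
    for line in lines:
        match = INIT_RE.search(line)
        if match:
            starts[match.group(1)] += 1
            continue
        match = ERROR_RE.search(line)
        if match:
            errors[match.group(1)] += 1
    return starts, errors


if __name__ == "__main__" and len(sys.argv) > 1:
    # The ShadowLog path is passed by the caller; its location is site-specific.
    with open(sys.argv[1]) as log_file:
        starts, errors = summarize_shadow_log(log_file)
    for job_id in sorted(starts):
        print(job_id, "shadow starts:", starts[job_id], "errors:", errors[job_id])
```

A job with more than one shadow start here should correspond to a run count above 1, and the error count shows whether each restart was preceded by a logged shadow ERROR.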