
Re: [Condor-users] pseudo_ops.cpp on starter



I'm seeing something similar with 7.4.4 on Linux:

10/20 14:53:23 (7.0) (20283): DoDownload: exiting at 1810
10/20 14:53:23 (7.0) (20283): Return from HandleReq <FileTransfer::HandleCommands()> (handler: 0.011s, sec: 0.000s)
10/20 14:53:23 (7.0) (20283): Calling Handler <HandleSyscalls> (1)
10/20 14:53:23 (7.0) (20283): FileLock::obtain(1) - @1287600803.728554 lock on /mnt/tools/software/CycleComputing/test/condor_submit/runlog.out now WRITE
10/20 14:53:24 (7.0) (20283): FileLock::obtain(2) - @1287600804.009739 lock on /mnt/tools/software/CycleComputing/test/condor_submit/runlog.out now UNLOCKED
10/20 14:53:24 (7.0) (20283): ERROR "Error from slot1@192-168-0-98: Assertion ERROR on (m_ft_info.hold_code != 0)" at line 687 in file pseudo_ops.cpp

It seems to happen at job completion, while the output files are being copied back.

I am not setting an on_exit_hold expression in the submit file or via the configs.
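
For reference, this is the kind of thing I mean (just a sketch; the executable name is made up):

  # hypothetical submit file -- we are NOT doing this
  executable   = my_job.sh
  on_exit_hold = (ExitCode =!= 0)
  queue

One way to double-check that nothing like it is coming in from the config side:

  % condor_config_val SYSTEM_PERIODIC_HOLD

(condor_config_val should print "Not defined: SYSTEM_PERIODIC_HOLD" if it is unset.)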

Regards,
- Ian

On Wed, Oct 20, 2010 at 7:22 AM, Mag Gam <magawake@xxxxxxxxx> wrote:
Hello all,

I am having a problem where jobs are restarting on their own. The job
runcount is more than 1 for many jobs.

We keep seeing "... at line 649 in file pseudo_ops.cpp" in our StartLog.

Is this a known issue?
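
If anyone wants to see how widespread this is on their pool, something like this should list the affected jobs (a sketch; JobRunCount is the job attribute behind the runcount I mentioned):

  % condor_q -constraint 'JobRunCount > 1' \
        -format "%d." ClusterId -format "%d " ProcId \
        -format "runcount=%d\n" JobRunCount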



Schedd version: CondorVersion = "$CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $"
Startd version: CondorVersion = "$CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $"
Operating system: RHEL 5.2

I also have MAX_JOBS_RUNNING = 10000


ShadowLog:
The job runs fine, then suddenly this error appears:

10/19 17:33:56 (28.3) (29653): DaemonCore: Leaving SendAliveToParent() - success
10/19 17:33:58 (28.3) (29653): Updating Job Queue: SetAttribute(RemoteSysCpu = 682.000000)
10/19 17:33:58 (28.3) (29653): Updating Job Queue: SetAttribute(RemoteUserCpu = 62981.000000)
10/19 17:33:58 (28.3) (29653): Updating Job Queue: SetAttribute(LastJobLeaseRenewal = 1287523744)
10/19 17:34:04 (28.3) (29653): Inside RemoteResource::updateFromStarter()
10/19 17:38:35 (28.3) (29653): FileLock::obtain(1) - @1287524315.981362 lock on /home/mech1//job.out.28 now WRITE
10/19 17:38:35 (28.3) (29653): FileLock::obtain(2) - @1287524315.985954 lock on /home/mech1//job.out.28 now UNLOCKED
10/19 17:38:35 (28.3) (29653): ERROR "Error from starter on slot1@xxxxxxxxxxxxxxxxxxxx: ProcD has failed" at line 649 in file pseudo_ops.cpp
10/19 17:43:27 Initializing a VANILLA shadow for job 28.3
10/19 17:43:27 (28.3) (10012): FileLock object is updating timestamp
on: /home/mech1//job.out.28
10/19 17:43:27 (28.3) (10012): UserLog = /home/mech1//job.out.28
10/19 17:43:27 (28.3) (10012): *** Reserved Swap = 5120
10/19 17:43:27 (28.3) (10012): *** Free Swap = 8388440
10/19 17:43:27 (28.3) (10012): in RemoteResource::initStartdInfo()
10/19 17:43:27 (28.3) (10012): Entering DCStartd::activateClaim()
10/19 17:43:27 (28.3) (10012): Initialized the following authorization table:
10/19 17:43:27 (28.3) (10012): Authorizations yet to be resolved:
10/19 17:43:27 (28.3) (10012): allow READ:  */*.mech.mich.edu
10/19 17:43:27 (28.3) (10012): allow WRITE:  */*.mech.mich.edu
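
The "ProcD has failed" message above appears to be the starter telling the shadow that the procd on the execute machine died under the job. If it helps, the ProcD can be told to keep its own log so the next failure leaves a trace (the knob names should be standard; the path is only an example):

  # in the execute node's local config
  PROCD_LOG     = $(LOG)/ProcLog
  MAX_PROCD_LOG = 10000000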


StarterLog.slot1.7:10/18 22:03:56 Job 28.3 set to execute immediately
StarterLog.slot1.7:10/18 22:03:56 Starting a VANILLA universe job with ID: 28.3
StarterLog.slot1.7:10/18 22:03:56 Output file: /home/mech1//stdout.28.3
StarterLog.slot1.7:10/18 22:03:56 Error file: /home/mech1//stderr.28.3

The job runcount now stands at 3.
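
The restarts should also show up in the user log named above; each execute/evict cycle is an 001/004 event pair in the standard user-log event numbering, so something like this (a grep sketch) should show the history:

  % grep -E '^00[14]' /home/mech1//job.out.28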