
Re: [Condor-users] pseudo_ops.cpp on starter



I'm seeing something similar with 7.4.4 on Linux:

10/20 14:53:23 (7.0) (20283): DoDownload: exiting at 1810
10/20 14:53:23 (7.0) (20283): Return from HandleReq <FileTransfer::HandleCommands()> (handler: 0.011s, sec: 0.000s)
10/20 14:53:23 (7.0) (20283): Calling Handler <HandleSyscalls> (1)
10/20 14:53:23 (7.0) (20283): FileLock::obtain(1) - @1287600803.728554 lock on /mnt/tools/software/CycleComputing/test/condor_submit/runlog.out now WRITE
10/20 14:53:24 (7.0) (20283): FileLock::obtain(2) - @1287600804.009739 lock on /mnt/tools/software/CycleComputing/test/condor_submit/runlog.out now UNLOCKED
10/20 14:53:24 (7.0) (20283): ERROR "Error from slot1@192-168-0-98: Assertion ERROR on (m_ft_info.hold_code != 0)" at line 687 in file pseudo_ops.cpp

It seems to happen at job completion, while the output files are being copied back.

I am not setting an on_exit_hold expression in the submit file or via the configs.
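
For reference, this is the kind of thing I mean (just a sketch; the executable name is made up):

  # hypothetical submit file -- we are NOT doing this
  executable   = my_job.sh
  on_exit_hold = (ExitCode =!= 0)
  queue

One way to double-check that nothing like it is coming in from the config side:

  % condor_config_val SYSTEM_PERIODIC_HOLD

(condor_config_val should print "Not defined: SYSTEM_PERIODIC_HOLD" if it is unset.)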

Regards,
- Ian

On Wed, Oct 20, 2010 at 7:22 AM, Mag Gam <magawake@xxxxxxxxx> wrote:
Hello all,

I am having a problem where jobs are restarting on their own. The job
runcount is more than 1 for many jobs.

We keep seeing "... at line 649 in file pseudo_ops.cpp" in our StartLog.

Is this a known issue?
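
If anyone wants to see how widespread this is on their pool, something like this should list the affected jobs (a sketch; JobRunCount is the job attribute behind the runcount I mentioned):

  % condor_q -constraint 'JobRunCount > 1' \
        -format "%d." ClusterId -format "%d " ProcId \
        -format "runcount=%d\n" JobRunCount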



Schedd version: CondorVersion = "$CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $"
Startd version: CondorVersion = "$CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $"
Operating system: RHEL 5.2

I also have MAX_JOBS_RUNNING = 10000


ShadowLog:
The job runs fine, then suddenly this error appears:

10/19 17:33:56 (28.3) (29653): DaemonCore: Leaving SendAliveToParent() - success
10/19 17:33:58 (28.3) (29653): Updating Job Queue: SetAttribute(RemoteSysCpu = 682.000000)
10/19 17:33:58 (28.3) (29653): Updating Job Queue: SetAttribute(RemoteUserCpu = 62981.000000)
10/19 17:33:58 (28.3) (29653): Updating Job Queue: SetAttribute(LastJobLeaseRenewal = 1287523744)
10/19 17:34:04 (28.3) (29653): Inside RemoteResource::updateFromStarter()
10/19 17:38:35 (28.3) (29653): FileLock::obtain(1) - @1287524315.981362 lock on /home/mech1//job.out.28 now WRITE
10/19 17:38:35 (28.3) (29653): FileLock::obtain(2) - @1287524315.985954 lock on /home/mech1//job.out.28 now UNLOCKED
10/19 17:38:35 (28.3) (29653): ERROR "Error from starter on slot1@xxxxxxxxxxxxxxxxxxxx: ProcD has failed" at line 649 in file pseudo_ops.cpp
10/19 17:43:27 Initializing a VANILLA shadow for job 28.3
10/19 17:43:27 (28.3) (10012): FileLock object is updating timestamp
on: /home/mech1//job.out.28
10/19 17:43:27 (28.3) (10012): UserLog = /home/mech1//job.out.28
10/19 17:43:27 (28.3) (10012): *** Reserved Swap = 5120
10/19 17:43:27 (28.3) (10012): *** Free Swap = 8388440
10/19 17:43:27 (28.3) (10012): in RemoteResource::initStartdInfo()
10/19 17:43:27 (28.3) (10012): Entering DCStartd::activateClaim()
10/19 17:43:27 (28.3) (10012): Initialized the following authorization table:
10/19 17:43:27 (28.3) (10012): Authorizations yet to be resolved:
10/19 17:43:27 (28.3) (10012): allow READ:  */*.mech.mich.edu
10/19 17:43:27 (28.3) (10012): allow WRITE:  */*.mech.mich.edu
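
The "ProcD has failed" message above appears to be the starter telling the shadow that the procd on the execute machine died under the job. If it helps, the ProcD can be told to keep its own log so the next failure leaves a trace (the knob names should be standard; the path is only an example):

  # in the execute node's local config
  PROCD_LOG     = $(LOG)/ProcLog
  MAX_PROCD_LOG = 10000000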


StarterLog.slot1.7:10/18 22:03:56 Job 28.3 set to execute immediately
StarterLog.slot1.7:10/18 22:03:56 Starting a VANILLA universe job with ID: 28.3
StarterLog.slot1.7:10/18 22:03:56 Output file: /home/mech1//stdout.28.3
StarterLog.slot1.7:10/18 22:03:56 Error file: /home/mech1//stderr.28.3

The job runcount now stands at 3.
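
The restarts should also show up in the user log named above; each execute/evict cycle is an 001/004 event pair in the standard user-log event numbering, so something like this (a grep sketch) should show the history:

  % grep -E '^00[14]' /home/mech1//job.out.28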