
Re: [Condor-users] pseudo_ops.cpp on starter



Strange.

What do you see in your StarterLog and StartLog?
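
If it helps to narrow things down, a small script along these lines could pull the ERROR lines out of a ShadowLog or StarterLog so they can be compared across jobs. This is only a sketch: the regular expression is inferred from the log excerpts quoted below, and the name `find_errors` is hypothetical, not part of Condor.

```python
import re

# Pattern inferred from the quoted excerpts (an assumption, not a Condor spec):
#   MM/DD HH:MM:SS (cluster.proc) (pid): ERROR "message" at line N in file F
ERROR_RE = re.compile(
    r'^(?P<stamp>\d+/\d+ \d+:\d+:\d+) '
    r'\((?P<job>\d+\.\d+)\) \((?P<pid>\d+)\): '
    r'ERROR "(?P<msg>.*)" at line (?P<line>\d+) in file (?P<file>\S+)'
)


def find_errors(lines):
    """Return (timestamp, job id, message, source line, source file) per ERROR line."""
    hits = []
    for raw in lines:
        match = ERROR_RE.match(raw.strip())
        if match:
            hits.append((
                match.group("stamp"),
                match.group("job"),
                match.group("msg"),
                int(match.group("line")),
                match.group("file"),
            ))
    return hits
```

To use it, open the log file and pass the file object to `find_errors`; the log location depends on your LOG setting. Note that the excerpts in this thread were wrapped by the mail client, so pasted copies would need to be rejoined into single lines first.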



On Wed, Oct 20, 2010 at 2:58 PM, Ian Chesal <ichesal@xxxxxxxxxxxxxxxxxx> wrote:
> I'm seeing something similar with 7.4.4 on Linux:
> 10/20 14:53:23 (7.0) (20283): DoDownload: exiting at 1810
> 10/20 14:53:23 (7.0) (20283): Return from HandleReq <FileTransfer::HandleCommands()> (handler: 0.011s, sec: 0.000s)
> 10/20 14:53:23 (7.0) (20283): Calling Handler <HandleSyscalls> (1)
> 10/20 14:53:23 (7.0) (20283): FileLock::obtain(1) - @1287600803.728554 lock on /mnt/tools/software/CycleComputing/test/condor_submit/runlog.out now WRITE
> 10/20 14:53:24 (7.0) (20283): FileLock::obtain(2) - @1287600804.009739 lock on /mnt/tools/software/CycleComputing/test/condor_submit/runlog.out now UNLOCKED
> 10/20 14:53:24 (7.0) (20283): ERROR "Error from slot1@192-168-0-98: Assertion ERROR on (m_ft_info.hold_code != 0)" at line 687 in file pseudo_ops.cpp
> It seems to happen at job completion, when the output files are being copied back.
> I am not setting an on_exit_hold expression on submission or via the
> configs.
> Regards,
> - Ian
> On Wed, Oct 20, 2010 at 7:22 AM, Mag Gam <magawake@xxxxxxxxx> wrote:
>>
>> Hello all,
>>
>> I am having a problem where jobs are restarting on their own. The job
>> runcount is more than 1 for many jobs.
>>
>> We keep seeing ... in our start log: "line 649 in file pseudo_ops.cpp"
>>
>> Is this a known issue?
>>
>>
>>
>> Schedd version: CondorVersion = "$CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $"
>> Startd: CondorVersion = "$CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $"
>> Operating System Version=RHEL 5.2
>>
>> I also have MAX_JOBS_RUNNING = 10000
>>
>>
>> ShadowLog:
>> The job runs fine, then suddenly this error appears in the ShadowLog:
>>
>> 10/19 17:33:56 (28.3) (29653): DaemonCore: Leaving SendAliveToParent() - success
>> 10/19 17:33:58 (28.3) (29653): Updating Job Queue: SetAttribute(RemoteSysCpu = 682.000000)
>> 10/19 17:33:58 (28.3) (29653): Updating Job Queue: SetAttribute(RemoteUserCpu = 62981.000000)
>> 10/19 17:33:58 (28.3) (29653): Updating Job Queue: SetAttribute(LastJobLeaseRenewal = 1287523744)
>> 10/19 17:34:04 (28.3) (29653): Inside RemoteResource::updateFromStarter()
>> 10/19 17:38:35 (28.3) (29653): FileLock::obtain(1) - @1287524315.981362 lock on /home/mech1//job.out.28 now WRITE
>> 10/19 17:38:35 (28.3) (29653): FileLock::obtain(2) - @1287524315.985954 lock on /home/mech1//job.out.28 now UNLOCKED
>> 10/19 17:38:35 (28.3) (29653): ERROR "Error from starter on slot1@xxxxxxxxxxxxxxxxxxxx: ProcD has failed" at line 649 in file pseudo_ops.cpp
>> 10/19 17:43:27 Initializing a VANILLA shadow for job 28.3
>> 10/19 17:43:27 (28.3) (10012): FileLock object is updating timestamp on: /home/mech1//job.out.28
>> 10/19 17:43:27 (28.3) (10012): UserLog = /home/mech1//job.out.28
>> 10/19 17:43:27 (28.3) (10012): *** Reserved Swap = 5120
>> 10/19 17:43:27 (28.3) (10012): *** Free Swap = 8388440
>> 10/19 17:43:27 (28.3) (10012): in RemoteResource::initStartdInfo()
>> 10/19 17:43:27 (28.3) (10012): Entering DCStartd::activateClaim()
>> 10/19 17:43:27 (28.3) (10012): Initialized the following authorization table:
>> 10/19 17:43:27 (28.3) (10012): Authorizations yet to be resolved:
>> 10/19 17:43:27 (28.3) (10012): allow READ:  */*.mech.mich.edu
>> 10/19 17:43:27 (28.3) (10012): allow WRITE:  */*.mech.mich.edu
>>
>>
>> StarterLog.slot1.7:10/18 22:03:56 Job 28.3 set to execute immediately
>> StarterLog.slot1.7:10/18 22:03:56 Starting a VANILLA universe job with ID: 28.3
>> StarterLog.slot1.7:10/18 22:03:56 Output file: /home/mech1//stdout.28.3
>> StarterLog.slot1.7:10/18 22:03:56 Error file: /home/mech1//stderr.28.3
>>
>> The job runcount now stands at 3.
>> _______________________________________________
>> Condor-users mailing list
>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/condor-users/