[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] job restarts



Hi Dan,

I've set these log settings and will get back to you once I see the
error appear in the shadow logs again.

Regards,

--Mike

Dan Bradley wrote:
> Mike,
> 
> 
> This line from your ShadowLog would certainly cause the sort of behavior 
> you mention:
> 
> 8/25 11:38:07 (873157.0) (4571): DaemonCore: PERMISSION DENIED to
> unknown user from host <10.255.255.146:59311> for command 61001
> (FILETRANS_DOWNLOAD), access level WRITE
> 
> 
> It sounds like the starter is connecting to the shadow to do file 
> transfers, but no authentication is taking place and the shadow is 
> rejecting the command.
> 
> Please add D_SECURITY to SHADOW_DEBUG in your submit node condor 
> configuration.  Since this may get a little involved, let's take this 
> off list.  Once a new shadow starts and hits this problem, there should 
> be a bunch of extra information in the log.  Could you please post that 
> to condor-admin@xxxxxxxxxxx?
> 
> Thanks,
> --Dan
> 
> 
> Michael Thomas wrote:
>> Hi Dan,
>>
>> Not for this particular job:
>>
>> # grep 875666 ShadowLog
>> 8/25 12:12:13 (875666.0) (8844): Asked to write event of number 6.
>> 8/25 12:17:05 (875666.0) (8844): Asked to write event of number 6.
>> 8/25 12:17:05 (875666.0) (8844): ZKM: setting default map to (null)
>> 8/25 12:27:05 (875666.0) (8844): Asked to write event of number 6.
>>
>> # grep 875666 ShadowLog.old
>> 8/25 11:45:40 (875666.0) (28489): Got SIGTERM. Performing graceful shutdown.
>> 8/25 11:45:40 (875666.0) (28489): Job 875666.0 is being evicted
>> 8/25 11:45:40 (875666.0) (28489): Asked to write event of number 4.
>> 8/25 11:45:44 (875666.0) (28489): ZKM: setting default map to (null)
>> 8/25 11:45:44 (875666.0) (28489): **** condor_shadow (condor_SHADOW)
>> EXITING WITH STATUS 107
>> 8/25 11:46:59 Initializing a VANILLA shadow for job 875666.0
>> 8/25 11:46:59 (875666.0) (8844): ZKM: setting default map to (null)
>> 8/25 11:46:59 (875666.0) (8844): Request to run on
>> <10.255.255.207:52493> was ACCEPTED
>> 8/25 11:47:05 (875666.0) (8844): Asked to write event of number 1.
>> 8/25 11:52:13 (875666.0) (8844): Asked to write event of number 6.
>> 8/25 11:57:13 (875666.0) (8844): Asked to write event of number 6.
>> 8/25 12:02:05 (875666.0) (8844): ZKM: setting default map to (null)
>> 8/25 12:02:13 (875666.0) (8844): Asked to write event of number 6.
>> 8/25 12:07:13 (875666.0) (8844): Asked to write event of number 6.
>>
>> The entry "Got SIGTERM. Performing graceful shutdown." corresponds to
>> the time that I ran 'condor_vacate_job' manually to restart the job.
>> Any older shadow log entries for this particular job
>>
>> However, I do see the following for another job:
>>
>> 8/25 11:38:07 (873157.0) (4571): JobLeaseDuration: 1200 seconds
>> 8/25 11:38:07 (873157.0) (4571): JobLeaseDuration remaining: 1166
>> 8/25 11:38:07 (873157.0) (4571): Attempting to locate disconnected starter
>> 8/25 11:38:07 (873157.0) (4571): Found starter: <10.255.255.146:36787>
>> 8/25 11:38:07 (873157.0) (4571): Attempting to reconnect to starter
>> <10.255.255.146:36787>
>> 8/25 11:38:07 (873157.0) (4571): Reconnect SUCCESS: connection
>> re-established
>> 8/25 11:38:07 (873157.0) (4571): Asked to write event of number 23.
>> 8/25 11:38:07 (873157.0) (4571): DaemonCore: PERMISSION DENIED to
>> unknown user from host <10.255.255.146:59311> for command 61001
>> (FILETRANS_DOWNLOAD), access level WRITE
>> 8/25 11:38:07 (873157.0) (4571): Can no longer talk to condor_starter
>> <10.255.255.146:36787>
>> 8/25 11:38:07 (873157.0) (4571): Asked to write event of number 22.
>> 8/25 11:38:07 (873157.0) (4571): JobLeaseDuration remaining: 1200
>> 8/25 11:38:07 (873157.0) (4571): Attempting to locate disconnected starter
>> 8/25 11:38:07 (873157.0) (4571): Found starter: <10.255.255.146:36787>
>> 8/25 11:38:07 (873157.0) (4571): Attempting to reconnect to starter
>> <10.255.255.146:36787>
>> 8/25 11:38:07 (873157.0) (4571): Reconnect SUCCESS: connection
>> re-established
>> 8/25 11:38:07 (873157.0) (4571): Asked to write event of number 23.
>> 8/25 11:38:07 (873157.0) (4571): DaemonCore: PERMISSION DENIED to
>> unknown user from host <10.255.255.146:33141> for command 61001
>> (FILETRANS_DOWNLOAD), access level WRITE
>> 8/25 11:38:07 (873157.0) (4571): Can no longer talk to condor_starter
>> <10.255.255.146:36787>
>>
>> How can I determine who this "unknown user" is?
>>
>> --Mike
>>
>> Dan Bradley wrote:
>>   
>>> Hi Mike,
>>>
>>> Are there any clues in the corresponding ShadowLog (on the submit side)?
>>>
>>> --Dan
>>>
>>> Michael Thomas wrote:
>>>     
>>>> I recently started seeing jobs fail with the errors below.  These jobs
>>>> come into our cluster from the globus job manager, which explicitly
>>>> disables streaming output and transfers the output files when the jobs
>>>> finish (via the NFSLite package from the VDT).  The file transfer is now
>>>> failing, which ultimately results in jobs being requeued and run again
>>>> and again.
>>>>
>>>> These errors seem to have started at about the same time that I changed
>>>> this particular grid user's shell from /bin/bash to /bin/true.  But
>>>> other users with a shell of /bin/true don't have problems with this
>>>> output file transfer.
>>>>
>>>> Where else should I look for more information on what's going wrong?
>>>>
>>>> --Mike
>>>>
>>>> 8/24 20:49:03 ******************************************************
>>>> 8/24 20:49:03 ** condor_starter (CONDOR_STARTER) STARTING UP
>>>> 8/24 20:49:03 ** /opt/condor/sbin/condor_starter
>>>> 8/24 20:49:03 ** $CondorVersion: 7.0.4 Jul 16 2008 BuildID: 95033 $
>>>> 8/24 20:49:03 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
>>>> 8/24 20:49:03 ** PID = 25226
>>>> 8/24 20:49:03 ** Log last touched 8/24 20:49:01
>>>> 8/24 20:49:03 ******************************************************
>>>> 8/24 20:49:03 Using config source: /home/condor/condor_config
>>>> 8/24 20:49:03 Using local config sources:
>>>> 8/24 20:49:03    /share/apps/condor/hosts/cithep230/condor_config.local
>>>> 8/24 20:49:03 DaemonCore: Command Socket at <10.255.255.156:45962>
>>>> 8/24 20:49:03 Done setting resource limits
>>>> 8/24 20:49:03 Communicating with shadow <10.255.255.216:48267>
>>>> 8/24 20:49:03 Submitting machine is "gatekeeper-0-2.local"
>>>> 8/24 20:49:03 setting the orig job name in starter
>>>> 8/24 20:49:03 setting the orig job iwd in starter
>>>> 8/24 20:49:03 File transfer completed successfully.
>>>> 8/24 20:49:04 Job 875666.0 set to execute immediately
>>>> 8/24 20:49:04 Starting a VANILLA universe job with ID: 875666.0
>>>> 8/24 20:49:04 IWD: /state/partition1/tmp/cithep230/execute/dir_25226
>>>> 8/24 20:49:04 Output file:
>>>> /state/partition1/tmp/cithep230/execute/dir_25226/_condor_stdout
>>>> 8/24 20:49:04 Error file:
>>>> /state/partition1/tmp/cithep230/execute/dir_25226/_condor_stderr
>>>> 8/24 20:49:10 Using wrapper
>>>> /opt/condor/bin/condor_nfslite_job_wrapper.sh to exec
>>>> Summer08-QCD_EMenriched_Pt30to80-IDEAL_V6_v1-32774-JobSpec.xml
>>>> 8/24 20:49:10 Create_Process succeeded, pid=25229
>>>> 8/25 08:19:58 Process exited, pid=25229, status=0
>>>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming
>>>> failure reading 5 bytes from unknown source.
>>>> 8/25 08:19:58 IO: Failed to read packet header
>>>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
>>>> 8/25 08:19:58 File transfer failed, forcing disconnect.
>>>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire
>>>> or for a reconnect attempt
>>>> 8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0>
>>>> 8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267>
>>>> 8/25 08:19:58 Communicating with shadow <10.255.255.216:48267>
>>>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming
>>>> failure reading 5 bytes from unknown source.
>>>> 8/25 08:19:58 IO: Failed to read packet header
>>>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
>>>> 8/25 08:19:58 File transfer failed, forcing disconnect.
>>>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire
>>>> or for a reconnect attempt
>>>> 8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0>
>>>> 8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267>
>>>> 8/25 08:19:58 Communicating with shadow <10.255.255.216:48267>
>>>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming
>>>> failure reading 5 bytes from unknown source.
>>>> 8/25 08:19:58 IO: Failed to read packet header
>>>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
>>>> 8/25 08:19:58 File transfer failed, forcing disconnect.
>>>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire
>>>> or for a reconnect attempt
>>>> 8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0>
>>>> 8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267>
>>>> 8/25 08:19:58 Communicating with shadow <10.255.255.216:48267>
>>>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming
>>>> failure reading 5 bytes from unknown source.
>>>> 8/25 08:19:58 IO: Failed to read packet header
>>>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
>>>> 8/25 08:19:58 File transfer failed, forcing disconnect.
>>>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire
>>>> or for a reconnect attempt
>>>> 8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0>
>>>> 8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267>
>>>> 8/25 08:19:58 Communicating with shadow <10.255.255.216:48267>
>>>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming
>>>> failure reading 5 bytes from unknown source.
>>>> 8/25 08:19:58 IO: Failed to read packet header
>>>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
>>>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire
>>>> or for a reconnect attempt
>>>> 8/25 08:19:58 Got SIGQUIT.  Performing fast shutdown.
>>>> 8/25 08:19:58 ShutdownFast all jobs.
>>>> 8/25 08:19:58 Result of "get_usage" operation from ProcD: ERROR: No
>>>> family with the given PID is registered
>>>> 8/25 08:19:58 error getting family usage in VanillaProc::PublishUpdateAd()
>>>> 8/25 08:19:58 condor_write(): Socket closed when trying to write 67
>>>> bytes to <10.255.255.216:43187>, fd is 5
>>>> 8/25 08:19:58 Buf::write(): condor_write() failed
>>>> 8/25 08:19:58 Failed to send job exit status to shadow
>>>> 8/25 08:19:58 JobExit() failed, waiting for job lease to expire or for a
>>>> reconnect attempt
>>>> 8/25 08:19:58 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
>>>>   
>>>> ------------------------------------------------------------------------
>>>>
>>>> _______________________________________________
>>>> Condor-users mailing list
>>>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>>>> subject: Unsubscribe
>>>> You can also unsubscribe by visiting
>>>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>>>
>>>> The archives can be found at: 
>>>> https://lists.cs.wisc.edu/archive/condor-users/
>>>>   
>>>>       
>>> _______________________________________________
>>> Condor-users mailing list
>>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>>> subject: Unsubscribe
>>> You can also unsubscribe by visiting
>>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>>
>>> The archives can be found at: 
>>> https://lists.cs.wisc.edu/archive/condor-users/
>>>     
>>   
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Condor-users mailing list
>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>
>> The archives can be found at: 
>> https://lists.cs.wisc.edu/archive/condor-users/
>>   
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> 
> The archives can be found at: 
> https://lists.cs.wisc.edu/archive/condor-users/

Attachment: smime.p7s
Description: S/MIME Cryptographic Signature