
Re: [Condor-users] job restarts



Not sure if it matters, but this did not happen with condor-7.0.3; it
only started after a couple of weeks of running condor 7.0.4.

We are also using GSI security with our condor setup, but didn't change
the GSI config settings when we upgraded from 7.0.3 to 7.0.4.
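
For anyone who wants to tally the "PERMISSION DENIED" entries from the
ShadowLog quoted below, here is a minimal Python sketch. It only assumes
the line layout visible in that excerpt (job id, shadow pid, then the
DaemonCore message). The names find_denials and summarize are made up
for this example and are not part of Condor. It will not reveal who the
"unknown user" is; it just groups the denials by job, host, command and
access level so that patterns are easier to spot.

```python
import re
from collections import Counter

# Layout of a denial entry, taken from the quoted ShadowLog excerpt.
# This is an assumption based on that excerpt, not a documented format.
DENIED = re.compile(
    r"\((?P<job>\d+\.\d+)\) \((?P<pid>\d+)\): DaemonCore: PERMISSION DENIED to "
    r"(?P<user>.+?) from host <(?P<host>[\d.]+):(?P<port>\d+)> "
    r"for command (?P<cmd>\d+) \((?P<name>\w+)\), access level (?P<level>\w+)"
)


def find_denials(log_text):
    """Return one dict per PERMISSION DENIED entry found in log_text."""
    # Collapse all whitespace so entries that were wrapped across lines
    # (as in a mail message) still match. Strip any "> " quote prefixes
    # before calling this on text copied out of an email.
    flat = " ".join(log_text.split())
    return [m.groupdict() for m in DENIED.finditer(flat)]


def summarize(denials):
    """Count denials per (job, host, command name, access level)."""
    return Counter((d["job"], d["host"], d["name"], d["level"]) for d in denials)


if __name__ == "__main__":
    sample = (
        "8/25 11:38:07 (873157.0) (4571): DaemonCore: PERMISSION DENIED to\n"
        "unknown user from host <10.255.255.146:59311> for command 61001\n"
        "(FILETRANS_DOWNLOAD), access level WRITE\n"
    )
    for key, count in summarize(find_denials(sample)).items():
        print(key, count)
```

Against a real log file, something like
find_denials(open("ShadowLog").read()) should work, with the path
adjusted to wherever the shadow log lives on the submit host.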

--Mike

Michael Thomas wrote:
> Hi Dan,
> 
> Not for this particular job:
> 
> # grep 875666 ShadowLog
> 8/25 12:12:13 (875666.0) (8844): Asked to write event of number 6.
> 8/25 12:17:05 (875666.0) (8844): Asked to write event of number 6.
> 8/25 12:17:05 (875666.0) (8844): ZKM: setting default map to (null)
> 8/25 12:27:05 (875666.0) (8844): Asked to write event of number 6.
> 
> # grep 875666 ShadowLog.old
> 8/25 11:45:40 (875666.0) (28489): Got SIGTERM. Performing graceful shutdown.
> 8/25 11:45:40 (875666.0) (28489): Job 875666.0 is being evicted
> 8/25 11:45:40 (875666.0) (28489): Asked to write event of number 4.
> 8/25 11:45:44 (875666.0) (28489): ZKM: setting default map to (null)
> 8/25 11:45:44 (875666.0) (28489): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107
> 8/25 11:46:59 Initializing a VANILLA shadow for job 875666.0
> 8/25 11:46:59 (875666.0) (8844): ZKM: setting default map to (null)
> 8/25 11:46:59 (875666.0) (8844): Request to run on <10.255.255.207:52493> was ACCEPTED
> 8/25 11:47:05 (875666.0) (8844): Asked to write event of number 1.
> 8/25 11:52:13 (875666.0) (8844): Asked to write event of number 6.
> 8/25 11:57:13 (875666.0) (8844): Asked to write event of number 6.
> 8/25 12:02:05 (875666.0) (8844): ZKM: setting default map to (null)
> 8/25 12:02:13 (875666.0) (8844): Asked to write event of number 6.
> 8/25 12:07:13 (875666.0) (8844): Asked to write event of number 6.
> 
> The entry "Got SIGTERM. Performing graceful shutdown." corresponds to
> the time that I ran 'condor_vacate_job' manually to restart the job.
> Any older shadow log entries for this particular job
> 
> However, I do see the following for another job:
> 
> 8/25 11:38:07 (873157.0) (4571): JobLeaseDuration: 1200 seconds
> 8/25 11:38:07 (873157.0) (4571): JobLeaseDuration remaining: 1166
> 8/25 11:38:07 (873157.0) (4571): Attempting to locate disconnected starter
> 8/25 11:38:07 (873157.0) (4571): Found starter: <10.255.255.146:36787>
> 8/25 11:38:07 (873157.0) (4571): Attempting to reconnect to starter <10.255.255.146:36787>
> 8/25 11:38:07 (873157.0) (4571): Reconnect SUCCESS: connection re-established
> 8/25 11:38:07 (873157.0) (4571): Asked to write event of number 23.
> 8/25 11:38:07 (873157.0) (4571): DaemonCore: PERMISSION DENIED to unknown user from host <10.255.255.146:59311> for command 61001 (FILETRANS_DOWNLOAD), access level WRITE
> 8/25 11:38:07 (873157.0) (4571): Can no longer talk to condor_starter <10.255.255.146:36787>
> 8/25 11:38:07 (873157.0) (4571): Asked to write event of number 22.
> 8/25 11:38:07 (873157.0) (4571): JobLeaseDuration remaining: 1200
> 8/25 11:38:07 (873157.0) (4571): Attempting to locate disconnected starter
> 8/25 11:38:07 (873157.0) (4571): Found starter: <10.255.255.146:36787>
> 8/25 11:38:07 (873157.0) (4571): Attempting to reconnect to starter <10.255.255.146:36787>
> 8/25 11:38:07 (873157.0) (4571): Reconnect SUCCESS: connection re-established
> 8/25 11:38:07 (873157.0) (4571): Asked to write event of number 23.
> 8/25 11:38:07 (873157.0) (4571): DaemonCore: PERMISSION DENIED to unknown user from host <10.255.255.146:33141> for command 61001 (FILETRANS_DOWNLOAD), access level WRITE
> 8/25 11:38:07 (873157.0) (4571): Can no longer talk to condor_starter <10.255.255.146:36787>
> 
> How can I determine who this "unknown user" is?
> 
> --Mike
> 
> Dan Bradley wrote:
>> Hi Mike,
>>
>> Are there any clues in the corresponding ShadowLog (on the submit side)?
>>
>> --Dan
>>
>> Michael Thomas wrote:
>>> I recently started seeing jobs fail with the errors below.  These jobs
>>> come into our cluster from the globus job manager, which explicitly
>>> disables streaming output and transfers the output files when the jobs
>>> finish (via the NFSLite package from the VDT).  The file transfer is now
>>> failing, which ultimately results in jobs being requeued and run again
>>> and again.
>>>
>>> These errors seem to have started at about the same time that I changed
>>> this particular grid user's shell from /bin/bash to /bin/true.  But
>>> other users with a shell of /bin/true don't have problems with this
>>> output file transfer.
>>>
>>> Where else should I look for more information on what's going wrong?
>>>
>>> --Mike
>>>
>>> 8/24 20:49:03 ******************************************************
>>> 8/24 20:49:03 ** condor_starter (CONDOR_STARTER) STARTING UP
>>> 8/24 20:49:03 ** /opt/condor/sbin/condor_starter
>>> 8/24 20:49:03 ** $CondorVersion: 7.0.4 Jul 16 2008 BuildID: 95033 $
>>> 8/24 20:49:03 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
>>> 8/24 20:49:03 ** PID = 25226
>>> 8/24 20:49:03 ** Log last touched 8/24 20:49:01
>>> 8/24 20:49:03 ******************************************************
>>> 8/24 20:49:03 Using config source: /home/condor/condor_config
>>> 8/24 20:49:03 Using local config sources:
>>> 8/24 20:49:03    /share/apps/condor/hosts/cithep230/condor_config.local
>>> 8/24 20:49:03 DaemonCore: Command Socket at <10.255.255.156:45962>
>>> 8/24 20:49:03 Done setting resource limits
>>> 8/24 20:49:03 Communicating with shadow <10.255.255.216:48267>
>>> 8/24 20:49:03 Submitting machine is "gatekeeper-0-2.local"
>>> 8/24 20:49:03 setting the orig job name in starter
>>> 8/24 20:49:03 setting the orig job iwd in starter
>>> 8/24 20:49:03 File transfer completed successfully.
>>> 8/24 20:49:04 Job 875666.0 set to execute immediately
>>> 8/24 20:49:04 Starting a VANILLA universe job with ID: 875666.0
>>> 8/24 20:49:04 IWD: /state/partition1/tmp/cithep230/execute/dir_25226
>>> 8/24 20:49:04 Output file: /state/partition1/tmp/cithep230/execute/dir_25226/_condor_stdout
>>> 8/24 20:49:04 Error file: /state/partition1/tmp/cithep230/execute/dir_25226/_condor_stderr
>>> 8/24 20:49:10 Using wrapper /opt/condor/bin/condor_nfslite_job_wrapper.sh to exec Summer08-QCD_EMenriched_Pt30to80-IDEAL_V6_v1-32774-JobSpec.xml
>>> 8/24 20:49:10 Create_Process succeeded, pid=25229
>>> 8/25 08:19:58 Process exited, pid=25229, status=0
>>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming failure reading 5 bytes from unknown source.
>>> 8/25 08:19:58 IO: Failed to read packet header
>>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
>>> 8/25 08:19:58 File transfer failed, forcing disconnect.
>>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire or for a reconnect attempt
>>> 8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0>
>>> 8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267>
>>> 8/25 08:19:58 Communicating with shadow <10.255.255.216:48267>
>>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming failure reading 5 bytes from unknown source.
>>> 8/25 08:19:58 IO: Failed to read packet header
>>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
>>> 8/25 08:19:58 File transfer failed, forcing disconnect.
>>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire or for a reconnect attempt
>>> 8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0>
>>> 8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267>
>>> 8/25 08:19:58 Communicating with shadow <10.255.255.216:48267>
>>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming failure reading 5 bytes from unknown source.
>>> 8/25 08:19:58 IO: Failed to read packet header
>>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
>>> 8/25 08:19:58 File transfer failed, forcing disconnect.
>>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire or for a reconnect attempt
>>> 8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0>
>>> 8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267>
>>> 8/25 08:19:58 Communicating with shadow <10.255.255.216:48267>
>>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming failure reading 5 bytes from unknown source.
>>> 8/25 08:19:58 IO: Failed to read packet header
>>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
>>> 8/25 08:19:58 File transfer failed, forcing disconnect.
>>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire or for a reconnect attempt
>>> 8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0>
>>> 8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267>
>>> 8/25 08:19:58 Communicating with shadow <10.255.255.216:48267>
>>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming failure reading 5 bytes from unknown source.
>>> 8/25 08:19:58 IO: Failed to read packet header
>>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
>>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire or for a reconnect attempt
>>> 8/25 08:19:58 Got SIGQUIT.  Performing fast shutdown.
>>> 8/25 08:19:58 ShutdownFast all jobs.
>>> 8/25 08:19:58 Result of "get_usage" operation from ProcD: ERROR: No family with the given PID is registered
>>> 8/25 08:19:58 error getting family usage in VanillaProc::PublishUpdateAd()
>>> 8/25 08:19:58 condor_write(): Socket closed when trying to write 67 bytes to <10.255.255.216:43187>, fd is 5
>>> 8/25 08:19:58 Buf::write(): condor_write() failed
>>> 8/25 08:19:58 Failed to send job exit status to shadow
>>> 8/25 08:19:58 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
>>> 8/25 08:19:58 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
>>>   
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Condor-users mailing list
>>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>>> subject: Unsubscribe
>>> You can also unsubscribe by visiting
>>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>>
>>> The archives can be found at: 
>>> https://lists.cs.wisc.edu/archive/condor-users/
>>>   
