[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] job restarts



Hi Dan,

Not for this particular job:

# grep 875666 ShadowLog
8/25 12:12:13 (875666.0) (8844): Asked to write event of number 6.
8/25 12:17:05 (875666.0) (8844): Asked to write event of number 6.
8/25 12:17:05 (875666.0) (8844): ZKM: setting default map to (null)
8/25 12:27:05 (875666.0) (8844): Asked to write event of number 6.

# grep 875666 ShadowLog.old
8/25 11:45:40 (875666.0) (28489): Got SIGTERM. Performing graceful shutdown.
8/25 11:45:40 (875666.0) (28489): Job 875666.0 is being evicted
8/25 11:45:40 (875666.0) (28489): Asked to write event of number 4.
8/25 11:45:44 (875666.0) (28489): ZKM: setting default map to (null)
8/25 11:45:44 (875666.0) (28489): **** condor_shadow (condor_SHADOW)
EXITING WITH STATUS 107
8/25 11:46:59 Initializing a VANILLA shadow for job 875666.0
8/25 11:46:59 (875666.0) (8844): ZKM: setting default map to (null)
8/25 11:46:59 (875666.0) (8844): Request to run on
<10.255.255.207:52493> was ACCEPTED
8/25 11:47:05 (875666.0) (8844): Asked to write event of number 1.
8/25 11:52:13 (875666.0) (8844): Asked to write event of number 6.
8/25 11:57:13 (875666.0) (8844): Asked to write event of number 6.
8/25 12:02:05 (875666.0) (8844): ZKM: setting default map to (null)
8/25 12:02:13 (875666.0) (8844): Asked to write event of number 6.
8/25 12:07:13 (875666.0) (8844): Asked to write event of number 6.

The entry "Got SIGTERM. Performing graceful shutdown." corresponds to
the time that I ran 'condor_vacate_job' manually to restart the job.
Any older shadow log entries for this particular job

However, I do see the following for another job:

8/25 11:38:07 (873157.0) (4571): JobLeaseDuration: 1200 seconds
8/25 11:38:07 (873157.0) (4571): JobLeaseDuration remaining: 1166
8/25 11:38:07 (873157.0) (4571): Attempting to locate disconnected starter
8/25 11:38:07 (873157.0) (4571): Found starter: <10.255.255.146:36787>
8/25 11:38:07 (873157.0) (4571): Attempting to reconnect to starter
<10.255.255.146:36787>
8/25 11:38:07 (873157.0) (4571): Reconnect SUCCESS: connection
re-established
8/25 11:38:07 (873157.0) (4571): Asked to write event of number 23.
8/25 11:38:07 (873157.0) (4571): DaemonCore: PERMISSION DENIED to
unknown user from host <10.255.255.146:59311> for command 61001
(FILETRANS_DOWNLOAD), access level WRITE
8/25 11:38:07 (873157.0) (4571): Can no longer talk to condor_starter
<10.255.255.146:36787>
8/25 11:38:07 (873157.0) (4571): Asked to write event of number 22.
8/25 11:38:07 (873157.0) (4571): JobLeaseDuration remaining: 1200
8/25 11:38:07 (873157.0) (4571): Attempting to locate disconnected starter
8/25 11:38:07 (873157.0) (4571): Found starter: <10.255.255.146:36787>
8/25 11:38:07 (873157.0) (4571): Attempting to reconnect to starter
<10.255.255.146:36787>
8/25 11:38:07 (873157.0) (4571): Reconnect SUCCESS: connection
re-established
8/25 11:38:07 (873157.0) (4571): Asked to write event of number 23.
8/25 11:38:07 (873157.0) (4571): DaemonCore: PERMISSION DENIED to
unknown user from host <10.255.255.146:33141> for command 61001
(FILETRANS_DOWNLOAD), access level WRITE
8/25 11:38:07 (873157.0) (4571): Can no longer talk to condor_starter
<10.255.255.146:36787>

How can I determine who this "unknown user" is?

--Mike

Dan Bradley wrote:
> Hi Mike,
> 
> Are there any clues in the corresponding ShadowLog (on the submit side)?
> 
> --Dan
> 
> Michael Thomas wrote:
>> I recently started seeing jobs fail with the errors below.  These jobs
>> come into our cluster from the globus job manager, which explicitly
>> disables streaming output and transfers the output files when the jobs
>> finish (via the NFSLite package from the VDT).  The file transfer is now
>> failing, which ultimately results in jobs being requeued and run again
>> and again.
>>
>> These errors seem to have started at about the same time that I changed
>> this particular grid user's shell from /bin/bash to /bin/true.  But
>> other users with a shell of /bin/true don't have problems with this
>> output file transfer.
>>
>> Where else should I look for more information on what's going wrong?
>>
>> --Mike
>>
>> 8/24 20:49:03 ******************************************************
>> 8/24 20:49:03 ** condor_starter (CONDOR_STARTER) STARTING UP
>> 8/24 20:49:03 ** /opt/condor/sbin/condor_starter
>> 8/24 20:49:03 ** $CondorVersion: 7.0.4 Jul 16 2008 BuildID: 95033 $
>> 8/24 20:49:03 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
>> 8/24 20:49:03 ** PID = 25226
>> 8/24 20:49:03 ** Log last touched 8/24 20:49:01
>> 8/24 20:49:03 ******************************************************
>> 8/24 20:49:03 Using config source: /home/condor/condor_config
>> 8/24 20:49:03 Using local config sources:
>> 8/24 20:49:03    /share/apps/condor/hosts/cithep230/condor_config.local
>> 8/24 20:49:03 DaemonCore: Command Socket at <10.255.255.156:45962>
>> 8/24 20:49:03 Done setting resource limits
>> 8/24 20:49:03 Communicating with shadow <10.255.255.216:48267>
>> 8/24 20:49:03 Submitting machine is "gatekeeper-0-2.local"
>> 8/24 20:49:03 setting the orig job name in starter
>> 8/24 20:49:03 setting the orig job iwd in starter
>> 8/24 20:49:03 File transfer completed successfully.
>> 8/24 20:49:04 Job 875666.0 set to execute immediately
>> 8/24 20:49:04 Starting a VANILLA universe job with ID: 875666.0
>> 8/24 20:49:04 IWD: /state/partition1/tmp/cithep230/execute/dir_25226
>> 8/24 20:49:04 Output file:
>> /state/partition1/tmp/cithep230/execute/dir_25226/_condor_stdout
>> 8/24 20:49:04 Error file:
>> /state/partition1/tmp/cithep230/execute/dir_25226/_condor_stderr
>> 8/24 20:49:10 Using wrapper
>> /opt/condor/bin/condor_nfslite_job_wrapper.sh to exec
>> Summer08-QCD_EMenriched_Pt30to80-IDEAL_V6_v1-32774-JobSpec.xml
>> 8/24 20:49:10 Create_Process succeeded, pid=25229
>> 8/25 08:19:58 Process exited, pid=25229, status=0
>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming
>> failure reading 5 bytes from unknown source.
>> 8/25 08:19:58 IO: Failed to read packet header
>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
>> 8/25 08:19:58 File transfer failed, forcing disconnect.
>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire
>> or for a reconnect attempt
>> 8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0>
>> 8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267>
>> 8/25 08:19:58 Communicating with shadow <10.255.255.216:48267>
>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming
>> failure reading 5 bytes from unknown source.
>> 8/25 08:19:58 IO: Failed to read packet header
>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
>> 8/25 08:19:58 File transfer failed, forcing disconnect.
>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire
>> or for a reconnect attempt
>> 8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0>
>> 8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267>
>> 8/25 08:19:58 Communicating with shadow <10.255.255.216:48267>
>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming
>> failure reading 5 bytes from unknown source.
>> 8/25 08:19:58 IO: Failed to read packet header
>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
>> 8/25 08:19:58 File transfer failed, forcing disconnect.
>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire
>> or for a reconnect attempt
>> 8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0>
>> 8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267>
>> 8/25 08:19:58 Communicating with shadow <10.255.255.216:48267>
>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming
>> failure reading 5 bytes from unknown source.
>> 8/25 08:19:58 IO: Failed to read packet header
>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
>> 8/25 08:19:58 File transfer failed, forcing disconnect.
>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire
>> or for a reconnect attempt
>> 8/25 08:19:58 Accepted request to reconnect from <0.0.0.0:0>
>> 8/25 08:19:58 Ignoring old shadow <10.255.255.216:48267>
>> 8/25 08:19:58 Communicating with shadow <10.255.255.216:48267>
>> 8/25 08:19:58 condor_read(): recv() returned -1, errno = 104, assuming
>> failure reading 5 bytes from unknown source.
>> 8/25 08:19:58 IO: Failed to read packet header
>> 8/25 08:19:58 Failed to receive GoAhead message from 10.255.255.156.
>> 8/25 08:19:58 JIC::allJobsDone() failed, waiting for job lease to expire
>> or for a reconnect attempt
>> 8/25 08:19:58 Got SIGQUIT.  Performing fast shutdown.
>> 8/25 08:19:58 ShutdownFast all jobs.
>> 8/25 08:19:58 Result of "get_usage" operation from ProcD: ERROR: No
>> family with the given PID is registered
>> 8/25 08:19:58 error getting family usage in VanillaProc::PublishUpdateAd()
>> 8/25 08:19:58 condor_write(): Socket closed when trying to write 67
>> bytes to <10.255.255.216:43187>, fd is 5
>> 8/25 08:19:58 Buf::write(): condor_write() failed
>> 8/25 08:19:58 Failed to send job exit status to shadow
>> 8/25 08:19:58 JobExit() failed, waiting for job lease to expire or for a
>> reconnect attempt
>> 8/25 08:19:58 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
>>   
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Condor-users mailing list
>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>
>> The archives can be found at: 
>> https://lists.cs.wisc.edu/archive/condor-users/
>>   
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> 
> The archives can be found at: 
> https://lists.cs.wisc.edu/archive/condor-users/

Attachment: smime.p7s
Description: S/MIME Cryptographic Signature