
Re: [Condor-users] Antwort: Re: Fault Behaviour of Condor



Hi Matt,

Thank you for the quick reply.

> Incidentally, did you snip a bunch of stuff from the logs after the second line?
No, I didn't snip anything.

> This is starting to smell like it might be a bug but the condor guys
> would probably be much better at debugging it from here...
OK, I will contact the Condor team.

Thanks,
Kohei


----- Original Message ----- From: "Matt Hope" <matthew.hope@xxxxxxxxx>
To: "Condor-Users Mail List" <condor-users@xxxxxxxxxxx>
Sent: Thursday, August 10, 2006 5:42 PM
Subject: Re: [Condor-users] Antwort: Re: Fault Behaviour of Condor


On 8/10/06, Nomura Kohei <kh-nomura@xxxxxxxxx> wrote:
> Have you set these to numbers which would give you 2 hours delay?
No. I checked my config file; these parameters are commented out:
 #POLLING_INTERVAL=5
 #ALIVE_INTERVAL = 300
 #MAX_SHADOW_EXCEPTIONS = 5
Is this wrong?

No, this is fine. It just means Condor uses the defaults (indicated in the file).

> What is the schedd/shadow log indicating during this time?
I attached the schedd and shadow logs. (ClusterID of the job is 290.0.)
Please check these files.

OK, these definitely look bad:

8/3 13:41:29 Initializing a VANILLA shadow for job 290.0
8/3 13:41:30 (290.0) (372): Request to run on <192.168.0.2:3817> was ACCEPTED
8/3 15:34:57 ******************************************************
8/3 15:34:57 ** condor_shadow (CONDOR_SHADOW) STARTING UP
8/3 15:34:57 ** C:\condor\bin\condor_shadow.exe
8/3 15:34:57 ** $CondorVersion: 6.8.0 Jul 19 2006 $
8/3 15:34:57 ** $CondorPlatform: INTEL-WINNT50 $
8/3 15:34:57 ** PID = 3472
8/3 15:34:57 ** Log last touched 8/3 15:34:30
8/3 15:34:57 ******************************************************
... snip
8/3 15:41:36 (290.0) (372): condor_read(): recv() returned -1, errno = 10054, assuming failure.
8/3 15:41:36 (290.0) (372): Can no longer talk to condor_starter <192.168.0.2:3817>
8/3 15:41:37 (290.0) (372): Trying to reconnect to disconnected job
8/3 15:41:37 (290.0) (372): LastJobLeaseRenewal: 1154580099 Thu Aug 03 13:41:39 2006
8/3 15:41:37 (290.0) (372): JobLeaseDuration: 60 seconds
8/3 15:41:37 (290.0) (372): JobLeaseDuration remaining: EXPIRED!
8/3 15:41:37 (290.0) (372): Reconnect FAILED: Job disconnected too long: JobLeaseDuration (60 seconds) expired
8/3 15:41:37 (290.0) (372): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107
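
To put a number on how stale the lease was when the shadow tried to reconnect, the arithmetic can be sketched in Python. This is only an illustration built from the values printed in the log above. The UTC+9 offset is an assumption, inferred from the log showing epoch 1154580099 as 13:41:39 local time, and `lease_overdue` is a made-up helper name, not anything from Condor.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the submit machine's clock is UTC+9, which is what makes
# epoch 1154580099 print as "Thu Aug 03 13:41:39 2006" in the log.
LOCAL_TZ = timezone(timedelta(hours=9))


def lease_overdue(last_renewal_epoch, lease_seconds, attempt):
    """Return how long past lease expiry the reconnect attempt happened."""
    renewal = datetime.fromtimestamp(last_renewal_epoch, tz=LOCAL_TZ)
    expiry = renewal + timedelta(seconds=lease_seconds)
    return attempt - expiry


# Values copied from the shadow log.
attempt = datetime(2006, 8, 3, 15, 41, 37, tzinfo=LOCAL_TZ)  # reconnect attempt
overdue = lease_overdue(1154580099, 60, attempt)

print(overdue)  # 1:58:58
```

So the lease had already been expired for nearly two hours when the reconnect was attempted, which matches the delay being discussed.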

Incidentally, did you snip a bunch of stuff from the logs after the second line?
If you did, are you sure there are no messages related to that pid
(372) in the log?
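
One way to check this without reading the whole file is to pull out every line carrying the `(372)` tag and look for long silences. The following is a rough sketch, not a Condor tool: the helper names are made up, the year is assumed to be 2006 because the log timestamps omit it, and the sample lines are copied from the excerpt above.

```python
import re
from datetime import datetime

# Shadow log lines start with "M/D HH:MM:SS "; the year is not recorded.
STAMP = re.compile(r"^(\d{1,2})/(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) ")


def lines_for_tag(lines, tag, year=2006):
    """Return (timestamp, line) pairs for timestamped lines containing tag."""
    hits = []
    for line in lines:
        m = STAMP.match(line)
        if m and tag in line:
            month, day, hh, mm, ss = map(int, m.groups())
            hits.append((datetime(year, month, day, hh, mm, ss), line.rstrip()))
    return hits


def largest_gap(hits):
    """Return the longest silence between consecutive matching lines."""
    gaps = [later[0] - earlier[0] for earlier, later in zip(hits, hits[1:])]
    return max(gaps) if gaps else None


sample = [
    "8/3 13:41:30 (290.0) (372): Request to run on <192.168.0.2:3817> was ACCEPTED",
    "8/3 15:34:57 ** PID = 3472",
    "8/3 15:41:36 (290.0) (372): Can no longer talk to condor_starter <192.168.0.2:3817>",
    "8/3 15:41:37 (290.0) (372): Trying to reconnect to disconnected job",
]

hits = lines_for_tag(sample, "(372)")
gap = largest_gap(hits)
print(len(hits), gap)
```

To run it against the real file, pass an open file handle for the shadow log instead of `sample`.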

This is starting to smell like it might be a bug but the condor guys
would probably be much better at debugging it from here...

Since I am planning on using JobLeases, I'll probably take a look at the
logic at some point, but I don't have the time right now.

Matt