
RE: [condor-users] RE: [condor-admin #8957] Condor dying overnight

Try increasing your shadow exception maximum (I forget the exact name of the setting).
It defaults to 5, which may be exactly the number of exceptions that occur over 15 minutes.

I suggest you look at the starter logs on the calc machine and the shadow logs on your machine.

Sanity check: have you tried running exactly the same jobs outside of Condor and checking that they don't die after 15 minutes?

One similar nasty behaviour that could do with fixing: when a user changes their password, Condor bounces their job around the pool at high speed, trying it on different machines, all of which fail immediately.
It would be nice if Condor could spot that it was Condor failing rather than the job itself, and give up with an appropriate notification and state.

-----Original Message-----
From: owner-condor-users@xxxxxxxxxxx
[mailto:owner-condor-users@xxxxxxxxxxx]On Behalf Of Simon Hoyle
Sent: 26 April 2004 19:59
To: condor-users@xxxxxxxxxxx
Subject: [condor-users] RE: [condor-admin #8957] Condor dying overnight


I hope someone on the list can shed some light on this problem. 

Jobs (Windows XP Pro) are failing after exactly 15 minutes. I have looked
through all the condor-user messages, and no-one else seems to have reported
it. Setting (DEFAULT_SESSION_DURATION = 864000) or the alternative posted by
Zach (SEC_DEFAULT_SESSION_DURATION = 864000) didn't solve this problem. 
It seems to be related to a security access violation. I checked all the
'store_cred' passwords  and they're correct. Condor was installed on each
computer on an administrator login, and users have read access to the condor
folder. Jobs that run less than 15 minutes finish correctly, but anything
over 15 is aborted, then restarts after a waiting period. 


- Job starts (SchedLog)

4/23 16:08:14 Started shadow for job 69.1 on "<>", (shadow pid = 1944)

- First evidence of failure appears in the SchedLog. The message that 0 jobs
are matched, 3 idle and 1 rejected seems wrong: there should be 2 matched and
4 idle (I scheduled 6 instances of the job). Then it all turns ugly.

4/23 16:23:12 Activity on stashed negotiator socket
4/23 16:23:12 Negotiating for owner: shoyle@xxxxxxxxxxxxxxxxx
4/23 16:23:12 Checking consistency running and runnable jobs
4/23 16:23:12 Tables are consistent
4/23 16:23:12 Out of servers - 0 jobs matched, 3 jobs idle, 1 jobs rejected
4/23 16:23:14 Sent ad to central manager for shoyle@xxxxxxxxxxxxxxxxx
4/23 16:23:17 condor_read(): recv() returned -1, errno = 10054, assuming
4/23 16:23:17 DaemonCore: Command received via UDP from host
4/23 16:23:17 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
4/23 16:23:17 Shadow pid 1944 died with exception ACCESS_VIOLATION
4/23 16:23:18 Started shadow for job 69.1 on "<>", (shadow pid = 1356)
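
To check the "exactly 15 minutes" observation against the log itself, the shadow start and death lines can be paired by pid. The following is only a rough sketch: the sample lines are copied from the excerpt above with the wrapped lines rejoined, the function names are made up for illustration, and the year 2004 is assumed from the message dates because SchedLog timestamps do not carry one.

```python
import re
from datetime import datetime

# Sample lines taken from the SchedLog excerpt (wrapped lines rejoined).
SCHEDLOG = '''\
4/23 16:08:14 Started shadow for job 69.1 on "<>", (shadow pid = 1944)
4/23 16:23:17 Shadow pid 1944 died with exception ACCESS_VIOLATION
4/23 16:23:18 Started shadow for job 69.1 on "<>", (shadow pid = 1356)
'''

TS = r'(\d+/\d+ \d+:\d+:\d+)'
STARTED = re.compile(r'^' + TS + r' Started shadow for job (\S+) .*shadow\s+pid = (\d+)')
DIED = re.compile(r'^' + TS + r' Shadow pid (\d+) died with exception (\w+)')


def parse_ts(text, year=2004):
    # The log has no year field; 2004 is an assumption based on the thread.
    return datetime.strptime(f"{year}/{text}", "%Y/%m/%d %H:%M:%S")


def shadow_lifetimes(log_text):
    """Return (job, pid, exception, seconds_alive) for each shadow that died."""
    started = {}  # pid -> (job id, start time)
    results = []
    for line in log_text.splitlines():
        m = STARTED.match(line)
        if m:
            started[m.group(3)] = (m.group(2), parse_ts(m.group(1)))
            continue
        m = DIED.match(line)
        if m and m.group(2) in started:
            job, t0 = started.pop(m.group(2))
            alive = (parse_ts(m.group(1)) - t0).total_seconds()
            results.append((job, m.group(2), m.group(3), alive))
    return results


if __name__ == "__main__":
    for job, pid, exc, secs in shadow_lifetimes(SCHEDLOG):
        print(f"job {job} shadow {pid}: {exc} after {secs / 60:.1f} min")
```

For the lines above this reports shadow 1944 dying 903 seconds after it started, which is about 15 minutes. Running the same pairing over a full SchedLog would show whether every failure lands on that interval.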

Simon Hoyle, 
Associate Scientist, 
Inter-American Tropical Tuna Commission, 
8604 La Jolla Shores Drive, La Jolla, CA 92037, USA
Tel: (858) 546-7027   Fax: (858) 546-7133 

-----Original Message-----
From: Simon Hoyle [mailto:shoyle@xxxxxxxxx] 
Sent: Wednesday, February 25, 2004 11:29 AM
To: 'condor-admin@xxxxxxxxxxx'
Subject: RE: [condor-admin #8957] Condor dying overnight

Thanks, I will try that. It might have caused the problem on jsuter1. I'm
not sure about model1 issue though. Will have to see what happens. What
usually happens when communication breaks down between the daemons? Are
there any side effects from using the extended session duration setting?

It seems unlikely to be anything to do with the 15 minute rejection problem
we were having with jobs submitted by the sharley username, but I have not
rechecked that problem lately (he is away at the moment). 

I am going away myself for 2 weeks tomorrow, so there will be a delay until
I get back to you. 


Simon Hoyle, 
Inter-American Tropical Tuna Commission
Scripps Institute of Oceanography
8604 La Jolla Shores Drive, La Jolla, CA 92037, USA
Tel: (858) 546-7027   Fax: (858) 546-7133 

-----Original Message-----
From: condor-admin response tracking system
Sent: Wednesday, February 25, 2004 11:16 AM
To: Simon Hoyle
Subject: Re: [condor-admin #8957] Condor dying overnight


I am looking into your analysis. By the way, after looking at your starter
log, I was reminded that there is an issue in the 6.6.x series that causes
security sessions to expire after 1 hour, and unfortunately, this causes
communication to break down between the daemons. The somewhat crummy
solution for now is to extend your default session duration to longer
than the duration of your job. A popular setting is 10 days, although
for you that may be overkill. It's set in seconds, in your config file:

DEFAULT_SESSION_DURATION = 864000 # (10 days)

This might help you a lot, especially if your jobs tend to run for longer
than 1 hour.
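
Since the value is given in seconds, a small helper can do the conversion for a duration other than 10 days. This is only an illustrative sketch: the function name is hypothetical, and the only setting names used are the two that appear in this thread.

```python
from datetime import timedelta


def session_duration_line(duration, name="DEFAULT_SESSION_DURATION"):
    """Format a config line with the duration converted to whole seconds."""
    seconds = int(duration.total_seconds())
    return f"{name} = {seconds}"


if __name__ == "__main__":
    # 10 days, the "popular setting" mentioned above.
    print(session_duration_line(timedelta(days=10)))
    # A shorter value for jobs that run at most a day or so.
    print(session_duration_line(timedelta(days=2)))
```

The first call reproduces the 864000 figure quoted above (10 x 24 x 3600 seconds).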

UW Madison Condor Team

* From: Colin Stolley <stolley@xxxxxxxxxxx>
* Ticket Email List: shoyle@xxxxxxxxx, 

This mail was sent from the RUST Mail System
Please direct all replies to condor-admin@xxxxxxxxxxx
Please include the current subject line in your reply.

Condor Support Information:
To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
unsubscribe condor-users <your_email_address>

