
RE: [condor-users] RE: [condor-admin #8957] Condor dying overnight

Thanks for the suggestions. 

>try increasing your shadow exception maximum (forget the exact name of the
>It defaults to 5, this may be exactly the number of exceptions that happen
>over 15 mins.

I changed the exception maximum to 3 but jobs were still halted after 15
minutes instead of 9.  


>Suggest you look at the starter logs on the calc machine and shadow logs on
>your machine.

Did that, and found a problem when the job started - one that was discussed
earlier under ' Re: [condor-users] problem with Java execution'. Mine says
"perm::init: Lookup Account Name condorrun failed (err=1332), using

Why does the lookup fail? The account 'condorrun' is there. The previous
posting talked about using Everyone instead, but my system manager (who set
up condorrun) wants only condorrun to be used. Is there anything particular
about the condorrun account setup? And what is Everyone trying, and failing,
to do? 

I was mistakenly using "when_to_transfer_output = ON_EXIT_OR_EVICT". After
changing to "ON_EXIT" the first job to run on my PC actually ran for 18
minutes and completed! However, the next one only ran for 15 minutes before
being kicked off. 

I am also getting some bad behavior - similar to the message 'Bad Condor
jobs killing GUI apps'. Several times I've returned to my PC and found some
(not all) windows shut down, and the system tray empty. Ten minutes ago
Outlook was killed as I was using it. 

Here's the starterlog from about that time (I'm not certain this is exactly
when it happened - didn't think to look at the time, but it's about right). 

4/27 13:09:02 Got SIGQUIT.  Performing fast shutdown.
4/27 13:09:02 ShutdownFast all jobs.
4/27 13:09:18 Got SIGTERM. Performing graceful shutdown.
4/27 13:09:18 ShutdownGraceful all jobs.
4/27 13:09:22 Process exited, pid=2124, status=0
4/27 13:09:22 condor_write(): send() returned -1, timeout=300, errno=10054.
Assuming failure.
4/27 13:09:22 Buf::write(): condor_write() failed
4/27 13:09:22 ERROR "Assertion ERROR on (result)" at line 266 in file
4/27 13:09:22 ShutdownFast all jobs.
4/27 13:10:50 ******************************************************


>Sanity check - have you tried running exactly the same jobs outside of
>condor and checking they don't die after 15 mins...

My standard job runs to completion on my PC. Every type of job I've tried
fails after 15 minutes under condor. 



>One similar nasty behaviour that could do with fixing is when the user
>changes their password and then condor bounces at high speed their job
>around the pool trying it on different machines when they all fail
>Would be nice if condor could spot that it was failing rather than the job
>itself failing and give up with an appropriate notification and state.

-----Original Message-----
From: owner-condor-users@xxxxxxxxxxx
[mailto:owner-condor-users@xxxxxxxxxxx]On Behalf Of Simon Hoyle
Sent: 26 April 2004 19:59
To: condor-users@xxxxxxxxxxx
Subject: [condor-users] RE: [condor-admin #8957] Condor dying overnight


I hope someone on the list can shed some light on this problem. 

Jobs (Windows XP Pro) are failing after exactly 15 minutes. I have looked
through all the condor-user messages, and no-one else seems to have reported
it. Setting (DEFAULT_SESSION_DURATION = 864000) or the alternative posted by
Zach (SEC_DEFAULT_SESSION_DURATION = 864000) didn't solve this problem. 
It seems to be related to a security access violation. I checked all the
'store_cred' passwords  and they're correct. Condor was installed on each
computer on an administrator login, and users have read access to the condor
folder. Jobs that run less than 15 minutes finish correctly, but anything
over 15 is aborted, then restarts after a waiting period. 


- Job starts (Schedlog)

4/23 16:08:14 Started shadow for job 69.1 on "<>", (shadow
pid = 1944)

- First evidence of failure appears in the schedlog. The message that 0 jobs
are matched, 3 idle and 1 rejected seems wrong - there should be 2 matched,
4 idle (I scheduled 6 instances of the job). Then it all turns ugly. 

4/23 16:23:12 Activity on stashed negotiator socket
4/23 16:23:12 Negotiating for owner: shoyle@xxxxxxxxxxxxxxxxx
4/23 16:23:12 Checking consistency running and runnable jobs
4/23 16:23:12 Tables are consistent
4/23 16:23:12 Out of servers - 0 jobs matched, 3 jobs idle, 1 jobs rejected
4/23 16:23:14 Sent ad to central manager for shoyle@xxxxxxxxxxxxxxxxx
4/23 16:23:17 condor_read(): recv() returned -1, errno = 10054, assuming
4/23 16:23:17 DaemonCore: Command received via UDP from host
4/23 16:23:17 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling
handler (HandleProcessExitCommand())
4/23 16:23:17 Shadow pid 1944 died with exception ACCESS_VIOLATION
4/23 16:23:18 Started shadow for job 69.1 on "<>", (shadow
pid = 1356)
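
The "exactly 15 minutes" figure can be checked against the two shadow lines in the excerpt above. A minimal Python sketch, assuming the log year is 2004 (the log lines carry no year) and using hypothetical helper names (`parse_stamp`, `shadow_lifetime_seconds`) that are not part of Condor:

```python
from datetime import datetime

# Two lines copied from the SchedLog excerpt above (wrapped continuation
# lines are dropped; only the leading timestamp is used).
LOG_LINES = [
    '4/23 16:08:14 Started shadow for job 69.1 on "<>", (shadow',
    "4/23 16:23:17 Shadow pid 1944 died with exception ACCESS_VIOLATION",
]

# Assumption: the log omits the year, so one is supplied here.
ASSUMED_YEAR = 2004


def parse_stamp(line, year=ASSUMED_YEAR):
    """Parse the leading 'M/D HH:MM:SS' stamp of a log line."""
    month_day, clock = line.split()[:2]
    return datetime.strptime(f"{year} {month_day} {clock}", "%Y %m/%d %H:%M:%S")


def shadow_lifetime_seconds(start_line, death_line):
    """Seconds between the shadow starting and the shadow dying."""
    return (parse_stamp(death_line) - parse_stamp(start_line)).total_seconds()


elapsed = shadow_lifetime_seconds(*LOG_LINES)
print(f"shadow lived {elapsed:.0f} seconds")
```

Running the same calculation over later start/death pairs in the log would show whether the interval really is constant from one restart to the next.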

Simon Hoyle, 
Associate Scientist, 
Inter-American Tropical Tuna Commission, 
8604 La Jolla Shores Drive, La Jolla, CA 92037, USA
Tel: (858) 546-7027   Fax: (858) 546-7133 

-----Original Message-----
From: Simon Hoyle [mailto:shoyle@xxxxxxxxx] 
Sent: Wednesday, February 25, 2004 11:29 AM
To: 'condor-admin@xxxxxxxxxxx'
Subject: RE: [condor-admin #8957] Condor dying overnight

Thanks, I will try that. It might have caused the problem on jsuter1. I'm
not sure about the model1 issue though. Will have to see what happens. What
usually happens when communication breaks down between the daemons? Are
there any side effects from using the extended session duration setting?

It seems unlikely to be anything to do with the 15 minute rejection problem
we were having with jobs submitted by the sharley username, but I have not
rechecked that problem lately (he is away at the moment). 

I am going away myself for 2 weeks tomorrow, so there will be a delay until
I get back to you. 


Simon Hoyle, 
Inter-American Tropical Tuna Commission
Scripps Institute of Oceanography
8604 La Jolla Shores Drive, La Jolla, CA 92037, USA
Tel: (858) 546-7027   Fax: (858) 546-7133 

-----Original Message-----
From: condor-admin response tracking system
Sent: Wednesday, February 25, 2004 11:16 AM
To: Simon Hoyle
Subject: Re: [condor-admin #8957] Condor dying overnight


I am looking into your analysis. By the way, after looking at your starter
log, I was reminded that there is an issue in 6.6.x series that causes
security sessions to expire after 1 hour, and unfortunately, this causes
communication to break down between the daemons. The somewhat crummy
solution for now is to extend your default session duration to longer
than the duration of your job. A popular setting is 10 days, although
for you that may be overkill. It's set in seconds, in your config file:

DEFAULT_SESSION_DURATION = 864000 # (10 days)

This might help you a lot, especially if your jobs tend to run for longer
than 1 hour.
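
To size the value for a different job length, the seconds can be worked out rather than typed by hand. A minimal Python sketch, assuming a hypothetical helper `session_duration_line` (not a Condor tool) that only formats the config line; the setting names are the two mentioned in this thread:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def session_duration_line(days, name="DEFAULT_SESSION_DURATION"):
    """Format a config line whose value is `days` expressed in seconds.

    `name` defaults to the setting quoted above; the thread also mentions
    SEC_DEFAULT_SESSION_DURATION as an alternative spelling.
    """
    return f"{name} = {days * SECONDS_PER_DAY}"


# The 10-day setting suggested above.
print(session_duration_line(10))
# The alternative name posted on the list, same duration.
print(session_duration_line(10, name="SEC_DEFAULT_SESSION_DURATION"))
```

The idea is simply to pick a number of days comfortably longer than the longest job and paste the resulting line into the config file.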

UW Madison Condor Team

* From: Colin Stolley <stolley@xxxxxxxxxxx>
* Ticket Email List: shoyle@xxxxxxxxx, 

This mail was sent from the RUST Mail System
Please direct all replies to condor-admin@xxxxxxxxxxx
Please include the current subject line in your reply.

Condor Support Information:
To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
unsubscribe condor-users <your_email_address>

