Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Intermittent Condor startd crashes

Date: Mon, 31 Aug 2009 17:51:42 -0500
From: Craig Struble <craig.struble@xxxxxxxxxxxxx>
Subject: Re: [Condor-users] Intermittent Condor startd crashes

Ok, I started down the path of using launchd on OS X to launch condor. This dusted off the cobwebs from when I tried using launchd before and realized launchd and condor_master don't get along because condor_master creates a daemon of itself. (See the launchd.plist manual page for more details.)

I've worked around the problem by creating a local condor user and group instead of using the LDAP directory I typically use and then starting condor again from the command line. Now condor behaves as it should, stopping and restarting starters as it needs to.

The moral of the story is that the condor user on OS X should be a locally created user, not one stored in a network directory. I've seen other strange situations like this on OS X, such as sudo ignoring network users, so it's not unique to condor.
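
For anyone wanting to check which case a machine is in, here is a minimal sketch, not something from the original setup. It assumes the stock OS X dscl tool, where the "." datasource queries only the local directory node, so a user that exists only in LDAP should not be found there. The function name is_local_user and the injectable run parameter are made up for illustration.

```python
import subprocess


def is_local_user(name, run=subprocess.run):
    """Return True if `name` has a record in the local directory node.

    Assumption: this runs on OS X, where `dscl . -read /Users/<name>`
    exits with status 0 only when the record exists in the local node.
    `run` is injectable (hypothetical convenience) so the logic can be
    exercised without dscl being installed.
    """
    result = run(
        ["dscl", ".", "-read", f"/Users/{name}"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

On an affected machine, is_local_user("condor") returning False would suggest the condor account comes from a network directory rather than the local one.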


    Craig

On Aug 31, 2009, at 1:26 PM, Craig Struble wrote:

Hi Nick,

Sorry about the delay. I had to try to reproduce the situation, which
is both easy and tricky. In the StartLog, I start to receive messages
like

08/28 16:36:41 slot3: Called deactivate_claim_forcibly()
08/28 16:36:41 Starter pid 85829 exited with status 0
08/28 16:36:41 slot3: State change: starter exited
08/28 16:36:41 slot3: Changing activity: Busy -> Idle
08/28 16:36:42 slot3: Got activate_claim request from shadow (<192.168.10.16:64136>)
08/28 16:36:42 slot3: Remote job ID is 36563.0
08/28 16:36:42 slot3: Got universe "VANILLA" (5) from request classad
08/28 16:36:42 slot3: State change: claim-activation protocol successful
08/28 16:36:42 slot3: Changing activity: Idle -> Busy
08/28 16:36:42 Starter pid 89438 exited with status 1
08/28 16:36:42 slot3: State change: starter exited
08/28 16:36:42 slot3: Changing activity: Busy -> Idle
08/28 16:36:42 slot3: Got activate_claim request from shadow (<192.168.10.16:64139>)
08/28 16:36:42 slot3: Remote job ID is 36563.0
08/28 16:36:42 slot3: Got universe "VANILLA" (5) from request classad
08/28 16:36:42 slot3: State change: claim-activation protocol successful
08/28 16:36:42 slot3: Changing activity: Idle -> Busy
08/28 16:36:42 Starter pid 89439 exited with status 1

You'll notice initially the pid exits with status 0, but then
everything starts to exit with status 1, which indicates that the
starter for slot3 can no longer be launched. (The jobs being submitted
are identical and run when executed on the other machines in the
pool.) The other slots eventually do the same thing and are never able
to launch jobs.
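
A quick way to spot this pattern in a long StartLog is to pull out the starter exit lines and keep only the nonzero ones. This is a rough sketch under the assumption that the exit lines always look like the ones quoted above; the function name failed_starters is made up, and the sample lines are copied from the excerpt.

```python
import re

# Matches StartLog lines such as:
#   08/28 16:36:42 Starter pid 89438 exited with status 1
STARTER_EXIT = re.compile(r"Starter pid (\d+) exited with status (\d+)")


def failed_starters(log_lines):
    """Return (pid, status) for every starter that exited with nonzero status."""
    failures = []
    for line in log_lines:
        match = STARTER_EXIT.search(line)
        if match and int(match.group(2)) != 0:
            failures.append((int(match.group(1)), int(match.group(2))))
    return failures


# Lines taken from the StartLog excerpt above.
sample = [
    "08/28 16:36:41 Starter pid 85829 exited with status 0",
    "08/28 16:36:42 slot3: Changing activity: Idle -> Busy",
    "08/28 16:36:42 Starter pid 89438 exited with status 1",
    "08/28 16:36:42 Starter pid 89439 exited with status 1",
]
print(failed_starters(sample))  # [(89438, 1), (89439, 1)]
```

To run it against a real log, pass an open file object instead of the sample list, since the function only iterates over lines.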

My visit to the Condor team on Friday was informative though, and I
think I found a potential source of the issue. I had to start
condor_master by hand on this one machine and the condor user is
stored in an LDAP directory. I think that what I'm seeing is related
to a ticket (#294). The other machines had condor started at boot time
using an OS X StartupItem. I'm going to see if I can fix the problem
by starting Condor using launchd instead. If that works, I'll post the
launchd scripts.

    Craig

On Aug 26, 2009, at 11:24 AM, Nick LeRoy wrote:

On Wednesday 26 August 2009, Craig Struble wrote:
Craig,

Well, I had hoped that <8 slots would fix things, but after running
Condor longer, even 4 slots fail on this one OS X machine (while the
other 22 with 2 slots each run fine, running the same operating
system and condor binaries).

I'm not sure my problem is directly related, being on OS X. In the
StarterLog.slot1 on my machine, the end looks like:

08/22 10:22:19 Job 26912.0 set to execute immediately
08/22 10:22:19 Starting a VANILLA universe job with ID: 26912.0
08/22 10:22:19 IWD: /var/condor/execute/dir_94482
08/22 10:22:19 Output file: /var/condor/execute/dir_94482/job_cluster-2.stdout
08/22 10:22:20 About to exec /var/condor/execute/dir_94482/condor_exec.exe cluster_wrapper job_cluster-2.data job- 9 16
08/22 10:22:20 Create_Process succeeded, pid=94490
08/22 11:14:59 Process exited, pid=94490, status=0
08/22 11:14:59 Got SIGQUIT.  Performing fast shutdown.
08/22 11:14:59 ShutdownFast all jobs.
08/22 11:14:59 **** condor_starter (condor_STARTER) pid 94482 EXITING WITH STATUS 0

After that, no jobs will run on that slot and running condor_restart
fails to relaunch condor (all daemons except condor_master are killed
but execing new ones fails for some unknown reason).


Just to be clear...  The startd has crashed before the starter gets
the QUIT?  And, after that, the master can't even exec daemons?  Is
that right?  Is there anything interesting in the MasterLog or StartLog?

-Nick

--
         <<< The Matrix is everywhere. >>>
/`-_    Nicholas R. LeRoy               The Condor Project
{ }/    http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
\    /  nleroy@xxxxxxxxxxx              The University of Wisconsin
|_*_|   608-265-5761                    Department of Computer Sciences


--
Craig A. Struble, Ph.D. | 369 Cudahy Hall  | Marquette University
Associate Professor of Computer Science    | (414)288-3783
Director, Master of Bioinformatics Program | (414)288-5472 (fax)
http://www.mscs.mu.edu/~cstruble | craig.struble@xxxxxxxxxxxxx





--
Craig A. Struble, Ph.D. | 369 Cudahy Hall  | Marquette University
Associate Professor of Computer Science    | (414)288-3783
Director, Master of Bioinformatics Program | (414)288-5472 (fax)
http://www.mscs.mu.edu/~cstruble | craig.struble@xxxxxxxxxxxxx
