
Re: [Condor-users] Intermittent Condor startd crashes



Ok, I started down the path of using launchd on OS X to launch condor. This dusted off the cobwebs from when I tried launchd before, and I remembered that launchd and condor_master don't get along because condor_master daemonizes itself. (See the launchd.plist manual page for more details.)
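
For anyone who wants to pursue the launchd route anyway: launchd expects the job it starts to stay in the foreground. The following is a minimal, untested sketch that generates a launchd job definition, assuming condor_master's -f (run in the foreground) option is enough to keep launchd happy. The install path and the job label are hypothetical placeholders, not values from this thread, and this is not the script I ended up using.

```python
import plistlib

# Hypothetical install path; adjust to wherever condor_master actually lives.
CONDOR_MASTER = "/usr/local/condor/sbin/condor_master"


def build_launchd_job(master_path=CONDOR_MASTER):
    """Return a launchd job description as a plain dict."""
    return {
        # Hypothetical label; launchd only requires that it be unique.
        "Label": "edu.example.condor",
        # -f asks condor_master to stay in the foreground instead of
        # daemonizing, which is what launchd expects of its jobs.
        "ProgramArguments": [master_path, "-f"],
        # Start the job when the plist is loaded (for example at boot).
        "RunAtLoad": True,
        # Ask launchd to restart condor_master if it exits.
        "KeepAlive": True,
    }


def render_plist(job):
    """Serialize the job dict to XML property-list text."""
    return plistlib.dumps(job).decode("utf-8")


if __name__ == "__main__":
    print(render_plist(build_launchd_job()))
```

The output would be saved as a .plist file under /Library/LaunchDaemons and loaded with launchctl. Whether this interacts correctly with a network-directory condor user is a separate question.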

I've worked around the problem by creating a local condor user and group instead of using the LDAP directory I typically use and then starting condor again from the command line. Now condor behaves as it should, stopping and restarting starters as it needs to.

The moral of the story is that the condor user on OS X should be a locally created user, not one stored in a network directory. I've seen other strange situations like this on OS X, such as sudo ignoring network users, so it's not unique to condor.
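
A quick way to check which kind of account a machine has is to ask the local directory node directly. This is a small sketch, assuming the OS X dscl tool is on the PATH and that "dscl . -read /Users/<name>" exits non-zero when the user is not in the local node. The helper name is_local_user is made up for illustration.

```python
import subprocess


def is_local_user(name, run=subprocess.run):
    """Return True if `name` exists in the local OS X directory node.

    "." tells dscl to query only the local node, so a user that exists
    solely in LDAP or another network directory is reported as missing.
    The `run` parameter exists only so the check can be tested elsewhere.
    """
    result = run(
        ["dscl", ".", "-read", f"/Users/{name}"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


# Example usage on an OS X machine:
#   is_local_user("condor")
```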

    Craig

On Aug 31, 2009, at 1:26 PM, Craig Struble wrote:

Hi Nick,

Sorry about the delay. I had to try to reproduce the situation, which
is both easy and tricky. In the StartLog, I start to receive messages
like

08/28 16:36:41 slot3: Called deactivate_claim_forcibly()
08/28 16:36:41 Starter pid 85829 exited with status 0
08/28 16:36:41 slot3: State change: starter exited
08/28 16:36:41 slot3: Changing activity: Busy -> Idle
08/28 16:36:42 slot3: Got activate_claim request from shadow (<192.168.10.16:64136>)
08/28 16:36:42 slot3: Remote job ID is 36563.0
08/28 16:36:42 slot3: Got universe "VANILLA" (5) from request classad
08/28 16:36:42 slot3: State change: claim-activation protocol successful
08/28 16:36:42 slot3: Changing activity: Idle -> Busy
08/28 16:36:42 Starter pid 89438 exited with status 1
08/28 16:36:42 slot3: State change: starter exited
08/28 16:36:42 slot3: Changing activity: Busy -> Idle
08/28 16:36:42 slot3: Got activate_claim request from shadow (<192.168.10.16:64139>)
08/28 16:36:42 slot3: Remote job ID is 36563.0
08/28 16:36:42 slot3: Got universe "VANILLA" (5) from request classad
08/28 16:36:42 slot3: State change: claim-activation protocol successful
08/28 16:36:42 slot3: Changing activity: Idle -> Busy
08/28 16:36:42 Starter pid 89439 exited with status 1

You'll notice initially the pid exits with status 0, but then
everything starts to exit with status 1, which indicates that the
starter for slot3 can no longer be launched. (The jobs being submitted
are identical and run when executed on the other machines in the
pool.) The other slots eventually do the same thing and are never able
to launch jobs.
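
To spot this pattern without reading the whole StartLog, the starter exit lines can be pulled out and the run of failures at the end counted. This is a rough sketch that relies only on the "Starter pid N exited with status S" wording shown in the excerpt above. The function names are my own, and the sample lines are copied from that excerpt.

```python
import re

# Matches lines such as "08/28 16:36:42 Starter pid 89438 exited with status 1".
EXIT_RE = re.compile(r"Starter pid (\d+) exited with status (\d+)")


def starter_exits(lines):
    """Return (pid, status) for every starter exit line, in log order."""
    exits = []
    for line in lines:
        match = EXIT_RE.search(line)
        if match:
            exits.append((int(match.group(1)), int(match.group(2))))
    return exits


def trailing_failures(exits):
    """Count how many of the most recent starter exits were non-zero."""
    count = 0
    for _pid, status in reversed(exits):
        if status == 0:
            break
        count += 1
    return count


SAMPLE = [
    "08/28 16:36:41 Starter pid 85829 exited with status 0",
    "08/28 16:36:41 slot3: State change: starter exited",
    "08/28 16:36:42 Starter pid 89438 exited with status 1",
    "08/28 16:36:42 slot3: State change: starter exited",
    "08/28 16:36:42 Starter pid 89439 exited with status 1",
]

if __name__ == "__main__":
    exits = starter_exits(SAMPLE)
    print(exits)
    print("consecutive failed starters at end of log:", trailing_failures(exits))
```

A steadily growing count at the end of the log means the slot has stopped being able to launch starters, as opposed to an individual job failing.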

My visit to the Condor team on Friday was informative though, and I
think I found a potential source of the issue. I had to start
condor_master by hand on this one machine and the condor user is
stored in an LDAP directory. I think that what I'm seeing is related
to a ticket (#294). The other machines had condor started at boot time
using an OS X StartupItem. I'm going to see if I can fix the problem
by starting Condor using launchd instead. If that works, I'll post the
launchd scripts.

    Craig

On Aug 26, 2009, at 11:24 AM, Nick LeRoy wrote:

On Wednesday 26 August 2009, Craig Struble wrote:
Craig,

Well, I had hoped that <8 slots would fix things, but after running
Condor longer, even 4 slots fails on this one OS X machine (while the
other 22 with 2 slots each run fine, running the same operating system
and condor binaries).

I'm not sure my problem is directly related, being on OS X. In the
StarterLog.slot1 on my machine, the end looks like:

08/22 10:22:19 Job 26912.0 set to execute immediately
08/22 10:22:19 Starting a VANILLA universe job with ID: 26912.0
08/22 10:22:19 IWD: /var/condor/execute/dir_94482
08/22 10:22:19 Output file: /var/condor/execute/dir_94482/job_cluster-2.stdout
08/22 10:22:20 About to exec /var/condor/execute/dir_94482/condor_exec.exe cluster_wrapper job_cluster-2.data job- 9 16
08/22 10:22:20 Create_Process succeeded, pid=94490
08/22 11:14:59 Process exited, pid=94490, status=0
08/22 11:14:59 Got SIGQUIT.  Performing fast shutdown.
08/22 11:14:59 ShutdownFast all jobs.
08/22 11:14:59 **** condor_starter (condor_STARTER) pid 94482 EXITING WITH STATUS 0

After that, no jobs will run on that slot and running condor_restart
fails to relaunch condor (all daemons except condor_master are killed
but execing new ones fails for some unknown reason).

Just to be clear... The startd has crashed before the starter gets
the QUIT? And, after that, the master can't even exec daemons? Is that
right? Is there anything interesting in the MasterLog or StartLog?

-Nick

--
         <<< The Matrix is everywhere. >>>
/`-_    Nicholas R. LeRoy               The Condor Project
{ }/ http://www.cs.wisc.edu/~nleroy http://www.cs.wisc.edu/ condor
\    /  nleroy@xxxxxxxxxxx              The University of Wisconsin
|_*_|   608-265-5761                    Department of Computer
Sciences

--
Craig A. Struble, Ph.D. | 369 Cudahy Hall  | Marquette University
Associate Professor of Computer Science    | (414)288-3783
Director, Master of Bioinformatics Program | (414)288-5472 (fax)
http://www.mscs.mu.edu/~cstruble | craig.struble@xxxxxxxxxxxxx



