Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Schedd keeps dying in 7.0.3 under Solaris!

Date: Thu, 10 Jul 2008 15:07:40 -0500
From: Dan Bradley <dan@xxxxxxxxxxxx>
Subject: Re: [Condor-users] Schedd keeps dying in 7.0.3 under Solaris!

Mark,

Here's the patch:

diff --git a/src/condor_schedd.V6/schedd.C b/src/condor_schedd.V6/schedd.C
index 85bfbd7..cc5c03a 100644
--- a/src/condor_schedd.V6/schedd.C
+++ b/src/condor_schedd.V6/schedd.C
@@ -223,6 +223,15 @@ match_rec::~match_rec()
       if( pool ) {
               free(pool);
       }
+       if( request_claim_sock ) {
+               // NOTE: the value passed to Register_DataPtr() for this
+               // registered socket is just a pointer to this match_rec,
+               // so there is no need to worry about deallocating that.
+               daemonCore->Cancel_Socket( request_claim_sock );
+               delete request_claim_sock;
+               request_claim_sock = NULL;
+               scheduler.rescheduleContactQueue();
+       }
}


@@ -11049,16 +11058,6 @@ Scheduler::DelMrec(char const* id)
               return -1;
       }

-       if( rec->request_claim_sock ) {
-               // NOTE: the value passed to Register_DataPtr() for this
-               // registered socket is just a pointer to this match_rec,
-               // so there is no need to worry about deallocating that.
-               daemonCore->Cancel_Socket( rec->request_claim_sock );
-               delete rec->request_claim_sock;
-               rec->request_claim_sock = NULL;
-               rescheduleContactQueue();
-       }
-
       // release the claim on the startd
       if( rec->needs_release_claim) {
               send_vacate(rec, RELEASE_CLAIM);

--Dan
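
The idea behind the patch above can be sketched in isolation: the cleanup of the pending claim socket moves from one particular deletion routine into the destructor, so that every path that destroys the record also cancels the socket registration. The sketch below is not Condor code. `FakeSock`, `FakeRegistry`, `MatchRec`, and `staleRegistrationsAfterDelete` are hypothetical stand-ins for the socket class, the daemonCore socket registry, and `match_rec`. It also assumes that the crash came from a record being destroyed on a path that bypassed `Scheduler::DelMrec()`, which the thread does not state explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <set>

// Hypothetical stand-in for a registered socket (not Condor's socket class).
struct FakeSock { int fd; };

// Hypothetical stand-in for the daemonCore socket registry.
class FakeRegistry {
public:
    void registerSocket(FakeSock* s) { registered_.insert(s); }
    void cancelSocket(FakeSock* s) { registered_.erase(s); }
    std::size_t size() const { return registered_.size(); }
private:
    std::set<FakeSock*> registered_;
};

// Simplified analogue of match_rec: it owns a pending claim socket.
class MatchRec {
public:
    explicit MatchRec(FakeRegistry& reg) : reg_(reg) {}

    // Cleanup lives in the destructor, so it runs no matter which
    // code path ends up deleting the record.
    ~MatchRec() {
        if (request_claim_sock) {
            reg_.cancelSocket(request_claim_sock);
            delete request_claim_sock;
            request_claim_sock = nullptr;
        }
    }

    MatchRec(const MatchRec&) = delete;
    MatchRec& operator=(const MatchRec&) = delete;

    // Open a claim request and register its socket for callbacks.
    void startClaim(int fd) {
        request_claim_sock = new FakeSock{fd};
        reg_.registerSocket(request_claim_sock);
    }

    FakeSock* request_claim_sock = nullptr;

private:
    FakeRegistry& reg_;
};

// Destroys a record on a path that does no cleanup of its own and
// reports how many stale registrations are left behind.
std::size_t staleRegistrationsAfterDelete() {
    FakeRegistry registry;
    {
        auto rec = std::make_unique<MatchRec>(registry);
        rec->startClaim(7);
    }  // rec is destroyed here; the destructor cancels the socket
    return registry.size();
}
```

With the cleanup in the destructor, `staleRegistrationsAfterDelete()` leaves no registration pointing at a freed record. If the same cleanup lived only in a separate delete routine, any path that skipped that routine would leave a dangling registration behind.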

Mark Calleja wrote:

Dan,
That's great news. While we wait for 7.0.4 to appear, is there a chance of a patch being released so we can at least build our own fixed schedds from source, please?
Cheers,
Mark

Dan Bradley wrote:
Mark,
Thanks for the report. I have found the source of trouble. The problem was introduced in 7.0.1. It affects parallel and MPI universe jobs on all platforms. When there is a problem claiming a startd, the schedd crashes.
This will be fixed in 7.0.4, which we hope to release in the near future (on the order of weeks, not months).
--Dan

Mark Calleja wrote:
Hi chaps,
We've hit a problem and we'd urgently like to hear of any solution. We have a Solaris box that acts as a submit host and its schedd dies every few minutes; the following snippet from the SchedLog is typical of the symptom:
7/10 10:50:28 (pid:16042) Calling Handler <to startd <172.24.89.88:9108>>
7/10 10:50:28 (pid:16042) ERROR "Assertion ERROR on (mrec->request_claim_sock == sock)" at line 1361 in file dedicated_scheduler.C
7/10 10:50:43 (pid:16188)******************************************************
7/10 10:50:43 (pid:16188) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
7/10 10:50:43 (pid:16188) ** /prg/condor/sbin/condor_schedd
7/10 10:50:43 (pid:16188) ** $CondorVersion: 7.0.3 Jun 20 2008 BuildID:91405 $
7/10 10:50:43 (pid:16188) ** $CondorPlatform: SUN4X-SOLARIS29 $
7/10 10:50:43 (pid:16188) ** PID = 16188
7/10 10:50:43 (pid:16188) ** Log last touched 7/10 10:50:28
7/10 10:50:43 (pid:16188)******************************************************
The OS details are:

% uname -a
SunOS <hostname> 5.9 Generic_112233-10 sun4u sparc SUNW,Sun-Fire-880

This seems related to the problem mentioned here:

http://www.cs.wisc.edu/condor/ligo-tickets/2237.html
Was that problem resolved? It's not apparent from the link. For now we're downgrading that box to 6.8.8, but that can only be a short-term solution.
Any clues/fixes out there?

Best regards,
Mark


_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users
The archives can be found at: https://lists.cs.wisc.edu/archive/condor-users/
