Mark,

Here's the patch:

diff --git a/src/condor_schedd.V6/schedd.C b/src/condor_schedd.V6/schedd.C
index 85bfbd7..cc5c03a 100644
--- a/src/condor_schedd.V6/schedd.C
+++ b/src/condor_schedd.V6/schedd.C
@@ -223,6 +223,15 @@ match_rec::~match_rec()
 	if( pool ) {
 		free(pool);
 	}
+	if( request_claim_sock ) {
+		// NOTE: the value passed to Register_DataPtr() for this
+		// registered socket is just a pointer to this match_rec,
+		// so there is no need to worry about deallocating that.
+		daemonCore->Cancel_Socket( request_claim_sock );
+		delete request_claim_sock;
+		request_claim_sock = NULL;
+		scheduler.rescheduleContactQueue();
+	}
 }
 
@@ -11049,16 +11058,6 @@ Scheduler::DelMrec(char const* id)
 		return -1;
 	}
 
-	if( rec->request_claim_sock ) {
-		// NOTE: the value passed to Register_DataPtr() for this
-		// registered socket is just a pointer to this match_rec,
-		// so there is no need to worry about deallocating that.
-		daemonCore->Cancel_Socket( rec->request_claim_sock );
-		delete rec->request_claim_sock;
-		rec->request_claim_sock = NULL;
-		rescheduleContactQueue();
-	}
-
 	// release the claim on the startd
 	if( rec->needs_release_claim) {
 		send_vacate(rec, RELEASE_CLAIM);

--Dan

Mark Calleja wrote:
Dan,

That's great news. While we wait for 7.0.4 to appear, is there a chance of a patch being released so that we can at least build our own fixed schedds from source, please?

Cheers,
Mark

Dan Bradley wrote:

Mark,

Thanks for the report. I have found the source of the trouble. The problem was introduced in 7.0.1. It affects parallel and MPI universe jobs on all platforms. When there is a problem claiming a startd, the schedd crashes.

This will be fixed in 7.0.4, which we hope to release in the near future (on the order of weeks, not months).

--Dan

Mark Calleja wrote:

Hi chaps,

We've hit a problem and we'd urgently like to hear of any solution. We have a Solaris box that acts as a submit host, and its schedd dies every few minutes; the following snippet from the SchedLog is typical of the symptom:

7/10 10:50:28 (pid:16042) Calling Handler <to startd <172.24.89.88:9108>>
7/10 10:50:28 (pid:16042) ERROR "Assertion ERROR on (mrec->request_claim_sock == sock)" at line 1361 in file dedicated_scheduler.C
7/10 10:50:43 (pid:16188) ******************************************************
7/10 10:50:43 (pid:16188) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
7/10 10:50:43 (pid:16188) ** /prg/condor/sbin/condor_schedd
7/10 10:50:43 (pid:16188) ** $CondorVersion: 7.0.3 Jun 20 2008 BuildID: 91405 $
7/10 10:50:43 (pid:16188) ** $CondorPlatform: SUN4X-SOLARIS29 $
7/10 10:50:43 (pid:16188) ** PID = 16188
7/10 10:50:43 (pid:16188) ** Log last touched 7/10 10:50:28
7/10 10:50:43 (pid:16188) ******************************************************

The OS details are:

% uname -a
SunOS <hostname> 5.9 Generic_112233-10 sun4u sparc SUNW,Sun-Fire-880

This seems related to the problem mentioned here:

http://www.cs.wisc.edu/condor/ligo-tickets/2237.html

Was that problem resolved? It's not apparent from the link. For now we're downgrading that box to 6.8.8, but that can only be a short-term solution.

Any clues/fixes out there?

Best regards,
Mark

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/