
Re: [Condor-users] FW: [Condor] Problem condor_startd died (11)



FYI: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3260

The fix should be in 7.8.5.

Cheers,
Tim

----- Original Message -----
> From: "Ian Cottam" <Ian.Cottam@xxxxxxxxxxxxxxxx>
> To: "Condor-Users Mail List" <condor-users@xxxxxxxxxxx>
> Sent: Wednesday, October 10, 2012 3:12:29 AM
> Subject: Re: [Condor-users] FW: [Condor] Problem condor_startd died (11)
> 
> Greg has kindly provided this workaround for the segfault problem
> (startd 7.8.4):
> 
>   MUST_MODIFY_REQUEST_EXPRS = True
> 
> in case others need it.
> regards
> -Ian
> 
> 
> 
> 
> 
> On 08/10/2012 10:26, "Ian Cottam" <Ian.Cottam@xxxxxxxxxxxxxxxx> wrote:
> 
> >All of our jobs are Vanilla.
> >All slots are partitionable.
> >It is happening frequently, rather than occasionally, for us.
> >Users tend to get the amount of Request_Memory wrong (asking for too
> >little), so that may be a factor.
> >We could do with a fix for this fairly quickly, as it didn't show up in
> >our testing and we are rolling out 7.8.4 now.
> >-Ian
> >
> >
> >On 08/10/2012 10:06, "Mark Calleja" <mc321@xxxxxxxxx> wrote:
> >
> >>Ian,
> >>
> >>I'm seeing similar behaviour while evaluating 7.8.4 for deployment on
> >>our campus grid. Indeed, occasionally the Startd dies and leaves a core
> >>(we're using Debian 6.0.6 x86_64), with this sort of message in the
> >>StartLog:
> >>
> >>10/08/12 09:21:16 slot1: Partitionable slot can't be split to allocate a dynamic slot large enough for the claim
> >>Stack dump for process 7352 at timestamp 1349684476 (4 frames)
> >>/Condor/x86_64/condor-7.8.4-x86_64_deb_6.0-stripped/sbin/../lib/libcondor_utils_7_8_4.so(dprintf_dump_stack+0x131)[0x7f4c0f428051]
> >>/Condor/x86_64/condor-7.8.4-x86_64_deb_6.0-stripped/sbin/../lib/libcondor_utils_7_8_4.so(_Z18linux_sig_coredumpi+0x40)[0x7f4c0f596a00]
> >>/lib64/libpthread.so.0(+0xeff0)[0x7f4c0b208ff0]
> >>
> >>This particular failure happened while trying to use partitionable
> >>slots with ParallelSchedulingGroups under the parallel universe. I know
> >>that one can run this particular type of job from within the vanilla
> >>universe (which works), but I need this test for backward compatibility
> >>in case users stick to using old scripts that they may have.
> >>
> >>
> >>Mark
> >>
> >>On 06/10/12 07:51, Ian Cottam wrote:
> >>
> >>
> >>We are getting a ton of these messages from our Pool after updating
> >>from 7.4 to 7.8.4.
> >>Does it mean we are obliged to run the new daemon that clears out
> >>partitioned slots?
> >>Or is it showing up a bug, which seems likely as startd should not
> >>segfault?
> >>-Ian
> >>--
> >>Ian Cottam
> >>IT Services - supporting research
> >>Faculty of EPS
> >>The University of Manchester
> >>
> >>
> >>
> >>
> >>
> >>On 06/10/2012 04:18, "Owner of Condor Daemons"
> >><condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> >>
> >>This is an automated email from the Condor system
> >>on machine "xxx".  Do not reply.
> >>
> >>"/usr/sbin/condor_startd" on "e-c07atg105057.it.manchester.ac.uk"
> >>died due to signal 11 (Segmentation fault).
> >>Condor will automatically restart this process in 10 seconds.
> >>
> >>*** Last 20 line(s) of file /var/log/condor/StartLog:
> >>10/05/12 20:51:13 slot1_3: State change: claim-activation protocol successful
> >>10/05/12 20:51:13 slot1_3: Changing activity: Idle -> Busy
> >>10/05/12 20:51:13 slot1_1: match_info called
> >>10/05/12 20:51:13 slot1_4: Got activate_claim request from shadow (130.88.203.22)
> >>10/05/12 20:51:13 slot1_4: Remote job ID is 329729.2744
> >>10/05/12 20:51:13 slot1_4: Got universe "VANILLA" (5) from request classad
> >>10/05/12 20:51:13 slot1_4: State change: claim-activation protocol successful
> >>10/05/12 20:51:13 slot1_4: Changing activity: Idle -> Busy
> >>10/06/12 04:18:21 slot1_1: Called deactivate_claim_forcibly()
> >>10/06/12 04:18:21 slot1_1: Changing state and activity: Claimed/Busy -> Preempting/Vacating
> >>10/06/12 04:18:21 Starter pid 2555 exited with status 0
> >>10/06/12 04:18:21 slot1_1: State change: starter exited
> >>10/06/12 04:18:21 slot1_1: State change: No preempting claim, returning to owner
> >>10/06/12 04:18:21 slot1_1: Changing state and activity: Preempting/Vacating -> Owner/Idle
> >>10/06/12 04:18:21 slot1_1: State change: IS_OWNER is false
> >>10/06/12 04:18:21 slot1_1: Changing state: Owner -> Unclaimed
> >>10/06/12 04:18:21 slot1_1: Changing state: Unclaimed -> Delete
> >>10/06/12 04:18:21 slot1_1: Resource no longer needed, deleting
> >>10/06/12 04:18:27 Job no longer matches partitionable slot after MODIFY_REQUEST_EXPR_ edits, retrying w/o edits
> >>10/06/12 04:18:27 slot1: Partitionable slot can't be split to allocate a dynamic slot large enough for the claim
> >>*** End of file StartLog
> >>
> >>
> >>
> >>-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> >>Questions about this message or Condor in general?
> >>Email address of the local Condor administrator:
> >>ian.cottam@xxxxxxxxxxxxxxxx
> >>The Official Condor Homepage is http://www.cs.wisc.edu/condor
> >>
> >>_______________________________________________
> >>Condor-users mailing list
> >>To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
> >>with a subject: Unsubscribe
> >>You can also unsubscribe by visiting
> >>https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >>
> >>The archives can be found at:
> >>https://lists.cs.wisc.edu/archive/condor-users/
> >>
> >>
> >>
> >
> >
> >--
> >Ian Cottam
> >
> >IT Services -- supporting research
> >Faculty of Engineering and Physical Sciences
> >The University of Manchester
> >"The only strategy that is guaranteed to fail is not taking risks."
> >Mark Zuckerberg
> >
> >
> >
> >
> >
> 
> 
> 
>