
Re: [Condor-users] FW: [Condor] Problem condor_startd died (11)



Greg has kindly provided this workaround for the segfault problem (startd
7.8.4):

  MUST_MODIFY_REQUEST_EXPRS = True

in case others need it.
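
For anyone else applying it, a minimal sketch of how it can be set on an
execute node (assuming a local config file such as
/etc/condor/condor_config.local; adjust the path to your own install):

  # Workaround for the 7.8.4 startd segfault with partitionable slots.
  # As I understand it, this makes the startd always apply the
  # MODIFY_REQUEST_EXPR_* edits rather than retrying without them when
  # the edited request no longer matches (the retry path appears to be
  # where the crash happens).
  MUST_MODIFY_REQUEST_EXPRS = True

followed by a condor_reconfig on the node; condor_config_val
MUST_MODIFY_REQUEST_EXPRS should then report True.
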
regards
-Ian
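
P.S. On the request_memory point in the thread below: a minimal
submit-description sketch (executable name and memory value are illustrative
only, not taken from the thread) that asks for memory explicitly, so the
partitionable slot is carved with enough room:

  # illustrative submit file; replace the executable and size with your own
  universe       = vanilla
  executable     = my_job.sh
  request_memory = 2048   # megabytes; asking for too little may be a factor here
  queue
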





On 08/10/2012 10:26, "Ian Cottam" <Ian.Cottam@xxxxxxxxxxxxxxxx> wrote:

>We are all Vanilla jobs.
>All slots are partitionable.
>It is happening frequently, rather than occasionally, for us.
>Users tend to get the amount of Request_Memory wrong (asking for too
>little), so that may be a factor.
>We could do with a fix for this fairly quickly, as it didn't show up in our
>testing and we are rolling out 7.8.4 now.
>-Ian
>
>
>On 08/10/2012 10:06, "Mark Calleja" <mc321@xxxxxxxxx> wrote:
>
>>Ian,
>>
>>I'm seeing similar behaviour while evaluating 7.8.4 for deployment on our
>>campus grid. Indeed, occasionally the Startd dies and leaves a core
>>(we're using Debian 6.0.6 x86_64), with this sort of message in the
>>StartLog:
>>
>>10/08/12 09:21:16 slot1: Partitionable slot can't be split to allocate a dynamic slot large enough for the claim
>>Stack dump for process 7352 at timestamp 1349684476 (4 frames)
>>/Condor/x86_64/condor-7.8.4-x86_64_deb_6.0-stripped/sbin/../lib/libcondor_utils_7_8_4.so(dprintf_dump_stack+0x131)[0x7f4c0f428051]
>>/Condor/x86_64/condor-7.8.4-x86_64_deb_6.0-stripped/sbin/../lib/libcondor_utils_7_8_4.so(_Z18linux_sig_coredumpi+0x40)[0x7f4c0f596a00]
>>/lib64/libpthread.so.0(+0xeff0)[0x7f4c0b208ff0]
>>
>>This particular failure happened while trying to use partitionable slots
>>with ParallelSchedulingGroups under the parallel universe.  I know that
>>one can run this particular type of job from within the vanilla universe
>>(which works), but I need this test for backward compatibility, in case
>>users stick with old scripts that they may have.
>>
>>
>>Mark
>>
>>On 06/10/12 07:51, Ian Cottam wrote:
>>
>>
>>We are getting a ton of these messages from our Pool after updating from
>>7.4 to 7.8.4.
>>Does it mean we are obliged to run the new daemon that clears out
>>partitionable slots?
>>Or is it exposing a bug, which seems likely, as the startd should not
>>segfault?
>>-Ian
>>--
>>Ian Cottam
>>IT Services - supporting research
>>Faculty of EPS
>>The University of Manchester
>>
>>
>>
>>
>>
>>On 06/10/2012 04:18, "Owner of Condor Daemons"
>><condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>
>>This is an automated email from the Condor system
>>on machine "xxx".  Do not reply.
>>
>>"/usr/sbin/condor_startd" on "e-c07atg105057.it.manchester.ac.uk" died
>>due to signal 11 (Segmentation fault).
>>Condor will automatically restart this process in 10 seconds.
>>
>>*** Last 20 line(s) of file /var/log/condor/StartLog:
>>10/05/12 20:51:13 slot1_3: State change: claim-activation protocol successful
>>10/05/12 20:51:13 slot1_3: Changing activity: Idle -> Busy
>>10/05/12 20:51:13 slot1_1: match_info called
>>10/05/12 20:51:13 slot1_4: Got activate_claim request from shadow (130.88.203.22)
>>10/05/12 20:51:13 slot1_4: Remote job ID is 329729.2744
>>10/05/12 20:51:13 slot1_4: Got universe "VANILLA" (5) from request classad
>>10/05/12 20:51:13 slot1_4: State change: claim-activation protocol successful
>>10/05/12 20:51:13 slot1_4: Changing activity: Idle -> Busy
>>10/06/12 04:18:21 slot1_1: Called deactivate_claim_forcibly()
>>10/06/12 04:18:21 slot1_1: Changing state and activity: Claimed/Busy -> Preempting/Vacating
>>10/06/12 04:18:21 Starter pid 2555 exited with status 0
>>10/06/12 04:18:21 slot1_1: State change: starter exited
>>10/06/12 04:18:21 slot1_1: State change: No preempting claim, returning to owner
>>10/06/12 04:18:21 slot1_1: Changing state and activity: Preempting/Vacating -> Owner/Idle
>>10/06/12 04:18:21 slot1_1: State change: IS_OWNER is false
>>10/06/12 04:18:21 slot1_1: Changing state: Owner -> Unclaimed
>>10/06/12 04:18:21 slot1_1: Changing state: Unclaimed -> Delete
>>10/06/12 04:18:21 slot1_1: Resource no longer needed, deleting
>>10/06/12 04:18:27 Job no longer matches partitionable slot after MODIFY_REQUEST_EXPR_ edits, retrying w/o edits
>>10/06/12 04:18:27 slot1: Partitionable slot can't be split to allocate a dynamic slot large enough for the claim
>>*** End of file StartLog
>>
>>
>>
>>-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>>Questions about this message or Condor in general?
>>Email address of the local Condor administrator:
>>ian.cottam@xxxxxxxxxxxxxxxx
>>The Official Condor Homepage is http://www.cs.wisc.edu/condor
>>
>>
>>
>>
>
>
>-- 
>Ian Cottam
>
>IT Services -- supporting research
>Faculty of Engineering and Physical Sciences
>The University of Manchester
>"The only strategy that is guaranteed to fail is not taking risks." Mark
>Zuckerberg
>
>
>
>
>_______________________________________________
>Condor-users mailing list
>To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>subject: Unsubscribe
>You can also unsubscribe by visiting
>https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
>The archives can be found at:
>https://lists.cs.wisc.edu/archive/condor-users/
>