
Re: [Condor-users] FW: [Condor] Problem condor_startd died (11)



All of our jobs are Vanilla universe jobs.
All slots are partitionable.
For us it is happening frequently, not just occasionally.
Users tend to get the amount of Request_Memory wrong (asking for too
little), so that may be a factor.
We could do with a fix fairly quickly, as this didn't show up in our
testing and we are rolling out 7.8.4 now.
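
For anyone wanting to put a number on "frequently", here is a minimal Python sketch that counts the "can't be split" messages per day in a StartLog. It assumes each log entry is on a single line beginning with the date, as in the excerpts quoted below; the function name count_split_failures is made up for illustration and is not part of Condor.

```python
from collections import Counter

# Message text copied from the StartLog excerpts in this thread.
SPLIT_FAILURE = "Partitionable slot can't be split"


def count_split_failures(lines):
    """Count split-failure messages per day.

    Assumes each entry starts with a date field such as "10/06/12",
    as in the StartLog excerpts quoted in this thread.
    """
    per_day = Counter()
    for line in lines:
        if SPLIT_FAILURE in line:
            fields = line.split(None, 1)
            if fields:
                per_day[fields[0]] += 1
    return per_day
```

It takes any iterable of lines, so an open file handle on /var/log/condor/StartLog (or wherever your StartLog lives) can be passed in directly. A non-zero count only shows how often the message is logged; it does not by itself show that each occurrence coincided with a startd crash.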
-Ian


On 08/10/2012 10:06, "Mark Calleja" <mc321@xxxxxxxxx> wrote:

>Ian,
>
>I'm seeing similar behaviour while evaluating 7.8.4 for deployment on our
>campus grid. Indeed, occasionally the Startd dies and leaves a core
>(we're using Debian 6.0.6 x86_64), with this sort of message in the
>StartLog:
>
>10/08/12 09:21:16 slot1: Partitionable slot can't be split to allocate a dynamic slot large enough for the claim
>Stack dump for process 7352 at timestamp 1349684476 (4 frames)
>/Condor/x86_64/condor-7.8.4-x86_64_deb_6.0-stripped/sbin/../lib/libcondor_utils_7_8_4.so(dprintf_dump_stack+0x131)[0x7f4c0f428051]
>/Condor/x86_64/condor-7.8.4-x86_64_deb_6.0-stripped/sbin/../lib/libcondor_utils_7_8_4.so(_Z18linux_sig_coredumpi+0x40)[0x7f4c0f596a00]
>/lib64/libpthread.so.0(+0xeff0)[0x7f4c0b208ff0]
>
>This particular failure happened while trying to use partitionable slots
>with ParallelSchedulingGroups under the parallel universe.  I know that
>one can run this particular type of job from within the vanilla universe
>(which works), but I need this test for backward compatibility in case
>users stick to using old scripts that they may have.
>
>
>Mark
>
>On 06/10/12 07:51, Ian Cottam wrote:
>
>
>We are getting a ton of these messages from our Pool after updating from
>7.4 to 7.8.4.
>Does it mean we are obliged to run the new daemon that clears out
>partitioned slots?
>Or is it showing up a bug, which seems likely as startd should not seg
>fault?
>-Ian
>--
>Ian Cottam
>IT Services - supporting research
>Faculty of EPS
>The University of Manchester
>
>
>
>
>
>On 06/10/2012 04:18, "Owner of Condor Daemons"
><condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>
>This is an automated email from the Condor system
>on machine "xxx".  Do not reply.
>
>"/usr/sbin/condor_startd" on "e-c07atg105057.it.manchester.ac.uk" died
>due to signal 11 (Segmentation fault).
>Condor will automatically restart this process in 10 seconds.
>
>*** Last 20 line(s) of file /var/log/condor/StartLog:
>10/05/12 20:51:13 slot1_3: State change: claim-activation protocol successful
>10/05/12 20:51:13 slot1_3: Changing activity: Idle -> Busy
>10/05/12 20:51:13 slot1_1: match_info called
>10/05/12 20:51:13 slot1_4: Got activate_claim request from shadow (130.88.203.22)
>10/05/12 20:51:13 slot1_4: Remote job ID is 329729.2744
>10/05/12 20:51:13 slot1_4: Got universe "VANILLA" (5) from request classad
>10/05/12 20:51:13 slot1_4: State change: claim-activation protocol successful
>10/05/12 20:51:13 slot1_4: Changing activity: Idle -> Busy
>10/06/12 04:18:21 slot1_1: Called deactivate_claim_forcibly()
>10/06/12 04:18:21 slot1_1: Changing state and activity: Claimed/Busy -> Preempting/Vacating
>10/06/12 04:18:21 Starter pid 2555 exited with status 0
>10/06/12 04:18:21 slot1_1: State change: starter exited
>10/06/12 04:18:21 slot1_1: State change: No preempting claim, returning to owner
>10/06/12 04:18:21 slot1_1: Changing state and activity: Preempting/Vacating -> Owner/Idle
>10/06/12 04:18:21 slot1_1: State change: IS_OWNER is false
>10/06/12 04:18:21 slot1_1: Changing state: Owner -> Unclaimed
>10/06/12 04:18:21 slot1_1: Changing state: Unclaimed -> Delete
>10/06/12 04:18:21 slot1_1: Resource no longer needed, deleting
>10/06/12 04:18:27 Job no longer matches partitionable slot after MODIFY_REQUEST_EXPR_ edits, retrying w/o edits
>10/06/12 04:18:27 slot1: Partitionable slot can't be split to allocate a dynamic slot large enough for the claim
>*** End of file StartLog
>
>
>
>-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>Questions about this message or Condor in general?
>Email address of the local Condor administrator:
>ian.cottam@xxxxxxxxxxxxxxxx
>The Official Condor Homepage is http://www.cs.wisc.edu/condor
>


-- 
Ian Cottam

IT Services -- supporting research
Faculty of Engineering and Physical Sciences
The University of Manchester
"The only strategy that is guaranteed to fail is not taking risks." Mark
Zuckerberg