
Re: [Condor-users] FW: [Condor] Problem condor_startd died (11)



Ian,

I'm seeing similar behaviour while evaluating 7.8.4 for deployment on our campus grid: occasionally the startd dies and leaves a core (we're running Debian 6.0.6 x86_64), with this sort of message in the StartLog:

10/08/12 09:21:16 slot1: Partitionable slot can't be split to allocate a dynamic slot large enough for the claim
Stack dump for process 7352 at timestamp 1349684476 (4 frames)
/Condor/x86_64/condor-7.8.4-x86_64_deb_6.0-stripped/sbin/../lib/libcondor_utils_7_8_4.so(dprintf_dump_stack+0x131)[0x7f4c0f428051]
/Condor/x86_64/condor-7.8.4-x86_64_deb_6.0-stripped/sbin/../lib/libcondor_utils_7_8_4.so(_Z18linux_sig_coredumpi+0x40)[0x7f4c0f596a00]
/lib64/libpthread.so.0(+0xeff0)[0x7f4c0b208ff0]
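
In case it helps anyone trying to catch the same crash, the test nodes have roughly the following configuration set, so the startd logs verbosely and is allowed to drop a core (a minimal sketch; the log size is just an example):

  # StartLog verbosity and rotation size (illustrative value)
  STARTD_DEBUG = D_FULLDEBUG
  MAX_STARTD_LOG = 50000000

  # Allow the daemons to leave core files when they crash
  CREATE_CORE_FILES = True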

This particular failure happened while trying to use partitionable slots with ParallelSchedulingGroups under the parallel universe.  I know this kind of job can be run from the vanilla universe (and that works), but I need the parallel-universe path to keep working for backward compatibility, in case users carry on using their old submit scripts; a rough sketch of the setup is below.
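
For context, the test setup is along these lines; this is a trimmed-down sketch, and the scheduler host, group name, and resource numbers below are just placeholders:

  # Execute-node config: one partitionable slot covering the whole machine
  NUM_SLOTS = 1
  NUM_SLOTS_TYPE_1 = 1
  SLOT_TYPE_1 = 100%
  SLOT_TYPE_1_PARTITIONABLE = True

  # Advertise the dedicated scheduler and a parallel scheduling group
  # (host and group names are placeholders)
  DedicatedScheduler = "DedicatedScheduler@submit.example.ac.uk"
  ParallelSchedulingGroup = "rack-a"
  STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler, ParallelSchedulingGroup
  START = $(START) || (Scheduler =?= $(DedicatedScheduler))
  RANK = Scheduler =?= $(DedicatedScheduler)

The test job itself is a minimal parallel-universe submit file along the lines of:

  # Parallel-universe test job (numbers are placeholders)
  universe       = parallel
  executable     = /bin/sleep
  arguments      = 120
  machine_count  = 4
  request_cpus   = 1
  request_memory = 512
  +WantParallelSchedulingGroups = True
  queue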

Mark

On 06/10/12 07:51, Ian Cottam wrote:
We are getting a ton of these messages from our Pool after updating from
7.4 to 7.8.4.
Does this mean we are now obliged to run the new daemon that clears out
partitionable slots?
Or is it exposing a bug? That seems likely, as the startd should not
segfault.
-Ian
--
Ian Cottam
IT Services - supporting research
Faculty of EPS
The University of Manchester





On 06/10/2012 04:18, "Owner of Condor Daemons"
<condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:

This is an automated email from the Condor system
on machine "xxx".  Do not reply.

"/usr/sbin/condor_startd" on "e-c07atg105057.it.manchester.ac.uk" died
due to signal 11 (Segmentation fault).
Condor will automatically restart this process in 10 seconds.

*** Last 20 line(s) of file /var/log/condor/StartLog:
10/05/12 20:51:13 slot1_3: State change: claim-activation protocol successful
10/05/12 20:51:13 slot1_3: Changing activity: Idle -> Busy
10/05/12 20:51:13 slot1_1: match_info called
10/05/12 20:51:13 slot1_4: Got activate_claim request from shadow (130.88.203.22)
10/05/12 20:51:13 slot1_4: Remote job ID is 329729.2744
10/05/12 20:51:13 slot1_4: Got universe "VANILLA" (5) from request classad
10/05/12 20:51:13 slot1_4: State change: claim-activation protocol successful
10/05/12 20:51:13 slot1_4: Changing activity: Idle -> Busy
10/06/12 04:18:21 slot1_1: Called deactivate_claim_forcibly()
10/06/12 04:18:21 slot1_1: Changing state and activity: Claimed/Busy -> Preempting/Vacating
10/06/12 04:18:21 Starter pid 2555 exited with status 0
10/06/12 04:18:21 slot1_1: State change: starter exited
10/06/12 04:18:21 slot1_1: State change: No preempting claim, returning to owner
10/06/12 04:18:21 slot1_1: Changing state and activity: Preempting/Vacating -> Owner/Idle
10/06/12 04:18:21 slot1_1: State change: IS_OWNER is false
10/06/12 04:18:21 slot1_1: Changing state: Owner -> Unclaimed
10/06/12 04:18:21 slot1_1: Changing state: Unclaimed -> Delete
10/06/12 04:18:21 slot1_1: Resource no longer needed, deleting
10/06/12 04:18:27 Job no longer matches partitionable slot after MODIFY_REQUEST_EXPR_ edits, retrying w/o edits
10/06/12 04:18:27 slot1: Partitionable slot can't be split to allocate a dynamic slot large enough for the claim
*** End of file StartLog



-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator:
ian.cottam@xxxxxxxxxxxxxxxx
The Official Condor Homepage is http://www.cs.wisc.edu/condor