
Re: [Condor-users] Nodes rejecting jobs after a few runs



Forgot the corresponding entry from SchedLog:

09/13/12 10:47:51 (pid:44237) Shadow pid 51953 for job 149.0 exited with status 4
09/13/12 10:47:51 (pid:44237) ERROR: Shadow exited with job exception code!
09/13/12 10:47:51 (pid:44237) match (slot4@xxxxxxxxxxx <10.0.0.2:51239> for drod) out of jobs; relinquishing
09/13/12 10:47:51 (pid:44237) Completed RELEASE_CLAIM to startd slot4@xxxxxxxxxxx <10.0.0.2:51239> for drod
09/13/12 10:47:51 (pid:44237) Match record (slot4@xxxxxxxxxxx <10.0.0.2:51239> for drod, 149.-1) deleted

Thanks,
	Dmitry

On 2012-09-13, at 11:13 AM, Dmitry Rodionov wrote:

> Good day everyone!
> I have Condor 7.8.2 set up on 6 Mac workstations running 10.6.
> 
> I start a run of 1000 identical simulations, 1 simulation = 1 job, 22 jobs queued (I have 22 cores).
> The working folder is mounted via NFS on all hosts, with all_squash set.
> All hosts are on a 1 Gbps LAN, less than 5 m from the switch.
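> 
> For reference, the submit file looks roughly like this (the executable name and paths below are placeholders, not my real ones):
> 
> universe    = vanilla
> executable  = run_sim
> arguments   = $(Process)
> initialdir  = /nfs/work
> output      = sim.$(Process).out
> error       = sim.$(Process).err
> log         = sim.log
> queue 1000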
> 
> Initial situation: all idle nodes accept jobs and start crunching numbers. So far so good.
> After completing 2-3 jobs or so, nodes stop accepting new jobs "for unknown reasons".
> The submitting node is the last one to start refusing jobs.
> 
> This is all condor_q -global -better-analyze had to say on the subject.
> 
> -- Schedd: sioux.local : <10.0.0.15:62904>
> ---
> 149.000:  Run analysis summary.  Of 22 machines,
>      0 are rejected by your job's requirements 
>      4 reject your job because of their own requirements 
>      2 match but are serving users with a better priority in the pool 
>     16 match but reject the job for unknown reasons 
>      0 match but will not currently preempt their existing job 
>      0 match but are currently offline 
>      0 are available to run your job
> 	Last successful match: Thu Sep 13 10:41:19 2012
> 
> The following attributes are missing from the job ClassAd:
> 
> CheckpointPlatform
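> 
> To see which side is doing the rejecting, I was going to compare the job's Requirements against the machine ads by hand, something like:
> 
> condor_q -long 149.0 | grep -E '^Requirements ='
> condor_status -long slot4@terra.local | grep -E '^(Start|Requirements) ='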
> 
> SchedLog is filled up with variations of this:
> 
> 09/13/12 10:43:40 (pid:44237) Shadow pid 51139 for job 149.0 exited with status 4
> 09/13/12 10:43:40 (pid:44237) Match for cluster 149 has had 5 shadow exceptions, relinquishing.
> 09/13/12 10:43:40 (pid:44237) Match record (slot3@xxxxxxxxxxxxx <10.0.0.30:49774> for drod, 149.0) deleted
> 09/13/12 10:43:40 (pid:44237) Shadow pid 51149 for job 133.0 exited with status 4
> 09/13/12 10:44:20 (pid:44237) Starting add_shadow_birthdate(149.0)
> 09/13/12 10:44:20 (pid:44237) Started shadow for job 149.0 on slot2@xxxxxxxxxx <10.0.0.54:51729> for drod, (shadow pid = 51262)
> 09/13/12 10:44:20 (pid:44237) Shadow pid 51262 for job 149.0 exited with status 4
> 09/13/12 10:44:20 (pid:44237) match (slot2@xxxxxxxxxx <10.0.0.54:51729> for drod) switching to job 149.0
> 09/13/12 10:44:20 (pid:44237) Starting add_shadow_birthdate(149.0)
> 09/13/12 10:44:20 (pid:44237) Started shadow for job 149.0 on slot2@xxxxxxxxxx <10.0.0.54:51729> for drod, (shadow pid = 51296)
> 09/13/12 10:44:20 (pid:44237) Shadow pid 51296 for job 149.0 exited with status 4
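> 
> If I read this right, the schedd gives up on a match after 5 shadow exceptions; I assume that is the MAX_SHADOW_EXCEPTIONS setting at its default, i.e. the equivalent of this in condor_config:
> 
> # my assumption of the default; I have not set this anywhere
> MAX_SHADOW_EXCEPTIONS = 5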
> 
> The ShadowLog has numerous copies of entries like this:
> 
> 09/13/12 10:47:51 Initializing a VANILLA shadow for job 149.0
> 09/13/12 10:47:51 (149.0) (51919): Request to run on slot4@xxxxxxxxxxx <10.0.0.2:51239> was ACCEPTED
> --
> 09/13/12 10:47:51 Setting maximum accepts per cycle 8.
> 09/13/12 10:47:51 ******************************************************
> 09/13/12 10:47:51 ** condor_shadow (CONDOR_SHADOW) STARTING UP
> 09/13/12 10:47:51 ** /condor/condor-installed/sbin/condor_shadow
> 09/13/12 10:47:51 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
> 09/13/12 10:47:51 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
> 09/13/12 10:47:51 ** $CondorVersion: 7.8.2 Aug 08 2012 $
> 09/13/12 10:47:51 ** $CondorPlatform: x86_64_macos_10.7 $
> 09/13/12 10:47:51 ** PID = 51919
> 09/13/12 10:47:51 ** Log last touched 9/13 10:47:51
> --
> 09/13/12 10:47:51 (149.0) (51919): ERROR "Can no longer talk to condor_starter <10.0.0.2:51239>" at line 219 in file /Volumes/disk1/condor/execute/slot1/dir_49805/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
> 
> Where is "/Volumes/disk1/condor/" in the line above coming from? There is no such path on any of my systems; Condor is installed in "/condor/condor-installed".
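> 
> To get more detail on why the shadow loses contact with the starter, my plan is to raise the debug levels on both the submit and execute nodes (these look like the standard knobs, as far as I can tell):
> 
> SHADOW_DEBUG  = D_FULLDEBUG
> STARTER_DEBUG = D_FULLDEBUG
> 
> and then run condor_reconfig on the affected hosts.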
> 
> From StarterLog on terra.local:
> 
> 09/13/12 10:47:51 slot4: Got activate_claim request from shadow (10.0.0.15)
> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes overflow, capping to INT_MAX
> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes overflow, capping to INT_MAX
> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes overflow, capping to INT_MAX
> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes overflow, capping to INT_MAX
> 09/13/12 10:47:51 slot4: Remote job ID is 149.0
> 09/13/12 10:47:51 slot4: Got universe "VANILLA" (5) from request classad
> 09/13/12 10:47:51 slot4: State change: claim-activation protocol successful
> 09/13/12 10:47:51 slot4: Changing activity: Idle -> Busy
> 09/13/12 10:47:51 Starter pid 74324 exited with status 1
> 09/13/12 10:47:51 slot4: State change: starter exited
> 09/13/12 10:47:51 slot4: Changing activity: Busy -> Idle
> 09/13/12 10:47:51 slot2: State change: received RELEASE_CLAIM command
> 09/13/12 10:47:51 slot2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
> 09/13/12 10:47:51 slot2: State change: No preempting claim, returning to owner
> 09/13/12 10:47:51 slot2: Changing state and activity: Preempting/Vacating -> Owner/Idle
> 09/13/12 10:47:51 slot2: State change: IS_OWNER is false
> 09/13/12 10:47:51 slot2: Changing state: Owner -> Unclaimed
> 09/13/12 10:47:51 slot1: Got activate_claim request from shadow (10.0.0.15)
> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes overflow, capping to INT_MAX
> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes overflow, capping to INT_MAX
> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes overflow, capping to INT_MAX
> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes overflow, capping to INT_MAX
> 09/13/12 10:47:51 slot1: Remote job ID is 153.0
> 09/13/12 10:47:51 slot1: Got universe "VANILLA" (5) from request classad
> 09/13/12 10:47:51 slot1: State change: claim-activation protocol successful
> 09/13/12 10:47:51 slot1: Changing activity: Idle -> Busy
> 09/13/12 10:47:51 slot3: State change: received RELEASE_CLAIM command
> 09/13/12 10:47:51 slot3: Changing state and activity: Claimed/Idle -> Preempting/Vacating
> 09/13/12 10:47:51 slot3: State change: No preempting claim, returning to owner
> 09/13/12 10:47:51 slot3: Changing state and activity: Preempting/Vacating -> Owner/Idle
> 09/13/12 10:47:51 slot3: State change: IS_OWNER is false
> 09/13/12 10:47:51 slot3: Changing state: Owner -> Unclaimed
> 09/13/12 10:47:51 slot4: State change: received RELEASE_CLAIM command
> 09/13/12 10:47:51 slot4: Changing state and activity: Claimed/Idle -> Preempting/Vacating
> 09/13/12 10:47:51 slot4: State change: No preempting claim, returning to owner
> 09/13/12 10:47:51 slot4: Changing state and activity: Preempting/Vacating -> Owner/Idle
> 09/13/12 10:47:51 slot4: State change: IS_OWNER is false
> 09/13/12 10:47:51 slot4: Changing state: Owner -> Unclaimed
> 
> What does "sysapi_disk_space_raw: Free disk space kbytes overflow, capping to INT_MAX" mean?
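> 
> The line that worries me most is "Starter pid 74324 exited with status 1" immediately after activation. If I understand the log layout correctly, the per-slot starter log on terra.local should contain the actual error, so I will check something like:
> 
> tail -n 50 "$(condor_config_val LOG)/StarterLog.slot4"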
> 
> Please help me troubleshoot this problem.
> I am new to Condor and not sure where to start.
> 
> Thanks!
> 	Dmitry
>