
Re: [Condor-users] Nodes rejecting jobs after a few runs



Dmitry - 

Comments are inline below.

----- Original Message -----
> From: "Dmitry Rodionov" <d.rodionov@xxxxxxxxx>
> To: "Condor-Users Mail List" <condor-users@xxxxxxxxxxx>
> Sent: Thursday, September 13, 2012 10:13:55 AM
> Subject: [Condor-users] Nodes rejecting jobs after a few runs
> 
> Good day everyone!
> I have Condor 7.8.2 set up on 6 Mac workstations running 10.6.
> 
> I start a job consisting of 1000 identical simulations, 1 simulation = 1 job,
> 22 jobs queued (I have 22 cores).
> The working folder is mounted via nfs across all hosts. All_squash.
> All hosts are on a 1 Gbps LAN, less than 5 m away from the switch.
> 
> Initial situation: all idle nodes accept jobs and start crunching
> numbers. So far, so good.
> After completing 2-3 jobs or so nodes stop accepting new jobs "for
> unknown reason".
> The submitting node is the last one to start refusing jobs.
> 
> This is all condor_q -global -better-analyze had to say on the
> subject.
> 
> -- Schedd: sioux.local : <10.0.0.15:62904>
> ---
> 149.000:  Run analysis summary.  Of 22 machines,
>       0 are rejected by your job's requirements
>       4 reject your job because of their own requirements
>       2 match but are serving users with a better priority in the
>       pool
>      16 match but reject the job for unknown reasons
>       0 match but will not currently preempt their existing job
>       0 match but are currently offline
>       0 are available to run your job
> 	Last successful match: Thu Sep 13 10:41:19 2012
> 
> The following attributes are missing from the job ClassAd:
> 
> CheckpointPlatform
> 
> SchedLog is filled up with variations of this:
> 
> 09/13/12 10:43:40 (pid:44237) Shadow pid 51139 for job 149.0 exited
> with status 4
> 09/13/12 10:43:40 (pid:44237) Match for cluster 149 has had 5 shadow
> exceptions, relinquishing.
> 09/13/12 10:43:40 (pid:44237) Match record (slot3@xxxxxxxxxxxxx
> <10.0.0.30:49774> for drod, 149.0) deleted
> 09/13/12 10:43:40 (pid:44237) Shadow pid 51149 for job 133.0 exited
> with status 4
> 09/13/12 10:44:20 (pid:44237) Starting add_shadow_birthdate(149.0)
> 09/13/12 10:44:20 (pid:44237) Started shadow for job 149.0 on
> slot2@xxxxxxxxxx <10.0.0.54:51729> for drod, (shadow pid = 51262)
> 09/13/12 10:44:20 (pid:44237) Shadow pid 51262 for job 149.0 exited
> with status 4
> 09/13/12 10:44:20 (pid:44237) match (slot2@xxxxxxxxxx
> <10.0.0.54:51729> for drod) switching to job 149.0
> 09/13/12 10:44:20 (pid:44237) Starting add_shadow_birthdate(149.0)
> 09/13/12 10:44:20 (pid:44237) Started shadow for job 149.0 on
> slot2@xxxxxxxxxx <10.0.0.54:51729> for drod, (shadow pid = 51296)
> 09/13/12 10:44:20 (pid:44237) Shadow pid 51296 for job 149.0 exited
> with status 4
> 
> Shadow log has numerous copies of stuff like
> 
> 09/13/12 10:47:51 Initializing a VANILLA shadow for job 149.0
> 09/13/12 10:47:51 (149.0) (51919): Request to run on
> slot4@xxxxxxxxxxx <10.0.0.2:51239> was ACCEPTED
> --
> 09/13/12 10:47:51 Setting maximum accepts per cycle 8.
> 09/13/12 10:47:51
> ******************************************************
> 09/13/12 10:47:51 ** condor_shadow (CONDOR_SHADOW) STARTING UP
> 09/13/12 10:47:51 ** /condor/condor-installed/sbin/condor_shadow
> 09/13/12 10:47:51 ** SubsystemInfo: name=SHADOW type=SHADOW(6)
> class=DAEMON(1)
> 09/13/12 10:47:51 ** Configuration: subsystem:SHADOW local:<NONE>
> class:DAEMON
> 09/13/12 10:47:51 ** $CondorVersion: 7.8.2 Aug 08 2012 $
> 09/13/12 10:47:51 ** $CondorPlatform: x86_64_macos_10.7 $
> 09/13/12 10:47:51 ** PID = 51919
> 09/13/12 10:47:51 ** Log last touched 9/13 10:47:51
> --
> 09/13/12 10:47:51 (149.0) (51919): ERROR "Can no longer talk to
> condor_starter <10.0.0.2:51239>" at line 219 in file
> /Volumes/disk1/condor/execute/slot1/dir_49805/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
> 
> Where is "/Volumes/disk1/condor/" in the line above coming from?
> There is no such thing on my systems. Condor is in
> "/condor/condor-installed"

The error above indicates a network-connectivity problem between the condor_shadow on the submit machine and the condor_starter on the execute machine; it can have many causes (firewalls, etc.). 
The path in the error message is the source-file location recorded when that Condor binary was built; it does not refer to a directory on your systems. 
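
If you want to dig further, here is a rough sketch of what I would try. The address below is just the one from your ShadowLog (with the default dynamic ports it will change after a Condor restart), and the config knobs go in the local config file on the submit and execute machines:

  # from the submit machine, check that the execute node's condor port is reachable
  nc -vz 10.0.0.2 51239

  # get more detail in the shadow and starter logs, then resubmit;
  # add these to the local config and run condor_reconfig on each machine
  SHADOW_DEBUG = D_FULLDEBUG
  STARTER_DEBUG = D_FULLDEBUG

With the extra debug output, the shadow and starter logs usually say more about why the connection was lost.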

> 
> From StarterLog on terra.local:
> 
> 09/13/12 10:47:51 slot4: Got activate_claim request from shadow
> (10.0.0.15)
> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes
> overflow, capping to INT_MAX
> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes
> overflow, capping to INT_MAX
> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes
> overflow, capping to INT_MAX
> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes
> overflow, capping to INT_MAX
> 09/13/12 10:47:51 slot4: Remote job ID is 149.0
> 09/13/12 10:47:51 slot4: Got universe "VANILLA" (5) from request
> classad
> 09/13/12 10:47:51 slot4: State change: claim-activation protocol
> successful
> 09/13/12 10:47:51 slot4: Changing activity: Idle -> Busy
> 09/13/12 10:47:51 Starter pid 74324 exited with status 1
> 09/13/12 10:47:51 slot4: State change: starter exited
> 09/13/12 10:47:51 slot4: Changing activity: Busy -> Idle
> 09/13/12 10:47:51 slot2: State change: received RELEASE_CLAIM command
> 09/13/12 10:47:51 slot2: Changing state and activity: Claimed/Idle ->
> Preempting/Vacating
> 09/13/12 10:47:51 slot2: State change: No preempting claim, returning
> to owner
> 09/13/12 10:47:51 slot2: Changing state and activity:
> Preempting/Vacating -> Owner/Idle
> 09/13/12 10:47:51 slot2: State change: IS_OWNER is false
> 09/13/12 10:47:51 slot2: Changing state: Owner -> Unclaimed
> 09/13/12 10:47:51 slot1: Got activate_claim request from shadow
> (10.0.0.15)
> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes
> overflow, capping to INT_MAX
> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes
> overflow, capping to INT_MAX
> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes
> overflow, capping to INT_MAX
> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes
> overflow, capping to INT_MAX
> 09/13/12 10:47:51 slot1: Remote job ID is 153.0
> 09/13/12 10:47:51 slot1: Got universe "VANILLA" (5) from request
> classad
> 09/13/12 10:47:51 slot1: State change: claim-activation protocol
> successful
> 09/13/12 10:47:51 slot1: Changing activity: Idle -> Busy
> 09/13/12 10:47:51 slot3: State change: received RELEASE_CLAIM command
> 09/13/12 10:47:51 slot3: Changing state and activity: Claimed/Idle ->
> Preempting/Vacating
> 09/13/12 10:47:51 slot3: State change: No preempting claim, returning
> to owner
> 09/13/12 10:47:51 slot3: Changing state and activity:
> Preempting/Vacating -> Owner/Idle
> 09/13/12 10:47:51 slot3: State change: IS_OWNER is false
> 09/13/12 10:47:51 slot3: Changing state: Owner -> Unclaimed
> 09/13/12 10:47:51 slot4: State change: received RELEASE_CLAIM command
> 09/13/12 10:47:51 slot4: Changing state and activity: Claimed/Idle ->
> Preempting/Vacating
> 09/13/12 10:47:51 slot4: State change: No preempting claim, returning
> to owner
> 09/13/12 10:47:51 slot4: Changing state and activity:
> Preempting/Vacating -> Owner/Idle
> 09/13/12 10:47:51 slot4: State change: IS_OWNER is false
> 09/13/12 10:47:51 slot4: Changing state: Owner -> Unclaimed
> 
> What does "sysapi_disk_space_raw: Free disk space kbytes overflow,
> capping to INT_MAX" mean?

The above is not really a problem; it just means the machine has more free disk space than fits in a 32-bit count of kbytes. For example, a 3 TB volume has roughly 2.9 billion free kbytes, which exceeds INT_MAX (2,147,483,647), so Condor caps the reported value. It is harmless noise and unrelated to the job failures. 
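
If you want to confirm that, a quick check on one of the execute nodes (assuming the Condor execute directory lives under /condor; adjust the path to your layout):

  # free space in kbytes for the filesystem holding the execute directory
  df -k /condor
  # an "Available" value above 2147483647 (INT_MAX) is what triggers that log line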

> 
> Please help me troubleshoot this problem.
> I am new to Condor and not sure where to start.

Please see: https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=GettingStarted
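
Beyond the wiki, the place to start with a problem like this is the daemon logs on both sides. Roughly (the log directory comes from condor_config_val, so this works wherever Condor is installed):

  # find the log directory on each machine
  condor_config_val LOG

  # on the submit machine, look at SchedLog and ShadowLog in that directory;
  # on the execute machines, look at StartLog and the per-slot StarterLog.slotN

  # per-job analysis for a specific job id
  condor_q -better-analyze 149.0

It also helps to put a "log = <file>" line in your submit description if you have not already; the job event log written there usually records shadow exceptions along with a reason.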

> 
> Thanks!
> 	Dmitry
> 
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
> with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
>