
Re: [Condor-users] Nodes rejecting jobs after a few runs



Hi Tim,
Thanks for answering so quickly!

>> All hosts are on a 1 Gbps LAN, less than 5 m from the switch.
I find it hard to believe that the network itself is to blame. All 7-8 nodes? These workstations have user folders mounted via NFS and have been working without a glitch for a couple of years. By "working" I mean that people handle large (3-10 GB) datasets over the network.
Also, as I mentioned, the submitting node eventually drops out as well, and the same thing happens if I submit from the manager. Correct me if I am wrong, but a combined manager/execution node is self-sufficient and should be able to run jobs by itself.
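To double-check that, I have been looking at the slots on the submit host roughly like this (just a sketch; the hostname is an example from my pool):

    # do the slots on sioux.local still show up, and in what state/activity?
    condor_status sioux.local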


I don't see anything on network troubleshooting in the wiki.
The only idea I get from the manual is setting
UPDATE_COLLECTOR_WITH_TCP = True
but it seems people are advised against using it unless the hosts are on a WAN, which is not the case here.
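For reference, this is what I would have tried (only a sketch; I am guessing the local config file path from where Condor is installed here):

    # e.g. /condor/condor-installed/etc/condor_config.local (path is a guess)
    UPDATE_COLLECTOR_WITH_TCP = True

    # then push the change to the running daemons on this host
    condor_reconfig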

Could you please point me to the article you have in mind?

Thanks!
	Dmitry

On 2012-09-13, at 11:21 AM, Tim St Clair wrote:

> Dmitry - 
> 
> comments are inline below. 
> 
> ----- Original Message -----
>> From: "Dmitry Rodionov" <d.rodionov@xxxxxxxxx>
>> To: "Condor-Users Mail List" <condor-users@xxxxxxxxxxx>
>> Sent: Thursday, September 13, 2012 10:13:55 AM
>> Subject: [Condor-users] Nodes rejecting jobs after a few runs
>> 
>> Good day everyone!
>> I have Condor 7.8.2 set up on 6 Mac workstations running 10.6.
>> 
>> I start a job consisting of 1000 identical simulations, 1 simulation = 1 job,
>> 22 jobs queued (I have 22 cores).
>> The working folder is mounted via NFS on all hosts, with all_squash set.
>> All hosts are on a 1 Gbps LAN, less than 5 m from the switch.
>> 
>> Initial situation: all idle nodes accept jobs and start crunching
>> numbers. So far, so good.
>> After completing 2-3 jobs or so, the nodes stop accepting new jobs "for
>> unknown reasons".
>> The submitting node is the last one to start refusing jobs.
>> 
>> This is all condor_q -global -better-analyze had to say on the
>> subject.
>> 
>> -- Schedd: sioux.local : <10.0.0.15:62904>
>> ---
>> 149.000:  Run analysis summary.  Of 22 machines,
>>      0 are rejected by your job's requirements
>>      4 reject your job because of their own requirements
>>      2 match but are serving users with a better priority in the
>>      pool
>>     16 match but reject the job for unknown reasons
>>      0 match but will not currently preempt their existing job
>>      0 match but are currently offline
>>      0 are available to run your job
>> 	Last successful match: Thu Sep 13 10:41:19 2012
>> 
>> The following attributes are missing from the job ClassAd:
>> 
>> CheckpointPlatform
>> 
>> SchedLog is filled up with variations of this:
>> 
>> 09/13/12 10:43:40 (pid:44237) Shadow pid 51139 for job 149.0 exited
>> with status 4
>> 09/13/12 10:43:40 (pid:44237) Match for cluster 149 has had 5 shadow
>> exceptions, relinquishing.
>> 09/13/12 10:43:40 (pid:44237) Match record (slot3@xxxxxxxxxxxxx
>> <10.0.0.30:49774> for drod, 149.0) deleted
>> 09/13/12 10:43:40 (pid:44237) Shadow pid 51149 for job 133.0 exited
>> with status 4
>> 09/13/12 10:44:20 (pid:44237) Starting add_shadow_birthdate(149.0)
>> 09/13/12 10:44:20 (pid:44237) Started shadow for job 149.0 on
>> slot2@xxxxxxxxxx <10.0.0.54:51729> for drod, (shadow pid = 51262)
>> 09/13/12 10:44:20 (pid:44237) Shadow pid 51262 for job 149.0 exited
>> with status 4
>> 09/13/12 10:44:20 (pid:44237) match (slot2@xxxxxxxxxx
>> <10.0.0.54:51729> for drod) switching to job 149.0
>> 09/13/12 10:44:20 (pid:44237) Starting add_shadow_birthdate(149.0)
>> 09/13/12 10:44:20 (pid:44237) Started shadow for job 149.0 on
>> slot2@xxxxxxxxxx <10.0.0.54:51729> for drod, (shadow pid = 51296)
>> 09/13/12 10:44:20 (pid:44237) Shadow pid 51296 for job 149.0 exited
>> with status 4
>> 
>> Shadow log has numerous copies of stuff like
>> 
>> 09/13/12 10:47:51 Initializing a VANILLA shadow for job 149.0
>> 09/13/12 10:47:51 (149.0) (51919): Request to run on
>> slot4@xxxxxxxxxxx <10.0.0.2:51239> was ACCEPTED
>> --
>> 09/13/12 10:47:51 Setting maximum accepts per cycle 8.
>> 09/13/12 10:47:51
>> ******************************************************
>> 09/13/12 10:47:51 ** condor_shadow (CONDOR_SHADOW) STARTING UP
>> 09/13/12 10:47:51 ** /condor/condor-installed/sbin/condor_shadow
>> 09/13/12 10:47:51 ** SubsystemInfo: name=SHADOW type=SHADOW(6)
>> class=DAEMON(1)
>> 09/13/12 10:47:51 ** Configuration: subsystem:SHADOW local:<NONE>
>> class:DAEMON
>> 09/13/12 10:47:51 ** $CondorVersion: 7.8.2 Aug 08 2012 $
>> 09/13/12 10:47:51 ** $CondorPlatform: x86_64_macos_10.7 $
>> 09/13/12 10:47:51 ** PID = 51919
>> 09/13/12 10:47:51 ** Log last touched 9/13 10:47:51
>> --
>> 09/13/12 10:47:51 (149.0) (51919): ERROR "Can no longer talk to
>> condor_starter <10.0.0.2:51239>" at line 219 in file
>> /Volumes/disk1/condor/execute/slot1/dir_49805/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
>> 
>> Where is "/Volumes/disk1/condor/" in the line above coming from?
>> There is no such path on my systems. Condor is installed in
>> "/condor/condor-installed".
> 
> The error above indicates a network connectivity problem, which can have many causes (firewalls, etc.).
> The path in the error message is a source-code location from the machine where Condor was built; it does not refer to anything on your systems.
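> As a quick sanity check (just a sketch; substitute the starter address from your own ShadowLog, since the port below is ephemeral and will have changed by now):
> 
>     # from the submit host (10.0.0.15): is the execute host reachable at all,
>     # and does anything answer on the port the starter advertised?
>     ping -c 3 10.0.0.2
>     nc -vz 10.0.0.2 51239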
> 
>> 
>> From StarterLog on terra.local:
>> 
>> 09/13/12 10:47:51 slot4: Got activate_claim request from shadow
>> (10.0.0.15)
>> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes
>> overflow, capping to INT_MAX
>> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes
>> overflow, capping to INT_MAX
>> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes
>> overflow, capping to INT_MAX
>> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes
>> overflow, capping to INT_MAX
>> 09/13/12 10:47:51 slot4: Remote job ID is 149.0
>> 09/13/12 10:47:51 slot4: Got universe "VANILLA" (5) from request
>> classad
>> 09/13/12 10:47:51 slot4: State change: claim-activation protocol
>> successful
>> 09/13/12 10:47:51 slot4: Changing activity: Idle -> Busy
>> 09/13/12 10:47:51 Starter pid 74324 exited with status 1
>> 09/13/12 10:47:51 slot4: State change: starter exited
>> 09/13/12 10:47:51 slot4: Changing activity: Busy -> Idle
>> 09/13/12 10:47:51 slot2: State change: received RELEASE_CLAIM command
>> 09/13/12 10:47:51 slot2: Changing state and activity: Claimed/Idle ->
>> Preempting/Vacating
>> 09/13/12 10:47:51 slot2: State change: No preempting claim, returning
>> to owner
>> 09/13/12 10:47:51 slot2: Changing state and activity:
>> Preempting/Vacating -> Owner/Idle
>> 09/13/12 10:47:51 slot2: State change: IS_OWNER is false
>> 09/13/12 10:47:51 slot2: Changing state: Owner -> Unclaimed
>> 09/13/12 10:47:51 slot1: Got activate_claim request from shadow
>> (10.0.0.15)
>> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes
>> overflow, capping to INT_MAX
>> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes
>> overflow, capping to INT_MAX
>> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes
>> overflow, capping to INT_MAX
>> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes
>> overflow, capping to INT_MAX
>> 09/13/12 10:47:51 slot1: Remote job ID is 153.0
>> 09/13/12 10:47:51 slot1: Got universe "VANILLA" (5) from request
>> classad
>> 09/13/12 10:47:51 slot1: State change: claim-activation protocol
>> successful
>> 09/13/12 10:47:51 slot1: Changing activity: Idle -> Busy
>> 09/13/12 10:47:51 slot3: State change: received RELEASE_CLAIM command
>> 09/13/12 10:47:51 slot3: Changing state and activity: Claimed/Idle ->
>> Preempting/Vacating
>> 09/13/12 10:47:51 slot3: State change: No preempting claim, returning
>> to owner
>> 09/13/12 10:47:51 slot3: Changing state and activity:
>> Preempting/Vacating -> Owner/Idle
>> 09/13/12 10:47:51 slot3: State change: IS_OWNER is false
>> 09/13/12 10:47:51 slot3: Changing state: Owner -> Unclaimed
>> 09/13/12 10:47:51 slot4: State change: received RELEASE_CLAIM command
>> 09/13/12 10:47:51 slot4: Changing state and activity: Claimed/Idle ->
>> Preempting/Vacating
>> 09/13/12 10:47:51 slot4: State change: No preempting claim, returning
>> to owner
>> 09/13/12 10:47:51 slot4: Changing state and activity:
>> Preempting/Vacating -> Owner/Idle
>> 09/13/12 10:47:51 slot4: State change: IS_OWNER is false
>> 09/13/12 10:47:51 slot4: Changing state: Owner -> Unclaimed
>> 
>> What does "sysapi_disk_space_raw: Free disk space kbytes overflow,
>> capping to INT_MAX" mean?
> 
> The above should not really be a problem; it just means the free disk space, in kilobytes, does not fit in a 32-bit integer, i.e. there is plenty of space available.
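> To see where the number comes from (a sketch; the exact figures depend on your volumes):
> 
>     # df -k reports free space in 1K blocks; a volume with 2 TB or more free
>     # has at least 2147483648 blocks, which exceeds INT_MAX (2147483647),
>     # so Condor caps the reported value
>     df -k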
> 
>> 
>> Please help me troubleshoot this problem.
>> I am new to Condor and not sure where to start.
> 
> Please see: https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=GettingStarted
> 
>> 
>> Thanks!
>> 	Dmitry
>> 