
Re: [Condor-users] jobs won't run: workstations get timeouts on matching



We can try that. Thanks ... but should this be done on all
workstation nodes, on the submitter's workstation, or on the central manager
machine? It sounds like you mean that each workstation node should change this
value so it won't time out so soon. But it does seem odd that every machine
was too busy and timed out, since they actually weren't busy at all
with other things (it was the middle of the night and the previous Condor job
had completed). So one might assume that it was the central manager
(or the submitter's machine) that was too slow to do something. It's not
clear to me, since I don't know the intimate details of how Condor
works.
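For the record, here is the change I understand is being suggested. I am assuming MATCH_TIMEOUT is a condor_startd setting read from each workstation's local condor_config, and the 600 below is just an illustrative value, not a recommendation:

```
# Local condor_config on each execute (workstation) node.
# The default appears to be 120 seconds, which would explain the
# two-minute Matched -> Owner transitions in the StartLog excerpts below.
MATCH_TIMEOUT = 600
```

If that assumption is right, running condor_reconfig on each node afterwards should make the startd pick up the new value without a full restart.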

Bob Orchard
National Research Council Canada      Conseil national de recherches Canada
Institute for Information Technology  Institut de technologie de l'information
1200 Montreal Road, Building M-50     M50, 1200 chemin Montréal
Ottawa, ON, Canada K1A 0R6            Ottawa (Ontario) Canada K1A 0R6
(613) 993-8557 
(613) 952-0215 Fax / télécopieur
bob.orchard@xxxxxxxxxxxxxx 
Government of Canada | Gouvernement du Canada



> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of Dan Bradley
> Sent: Wednesday, December 14, 2005 10:46 AM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] jobs won't run: workstations get timeouts on matching
> 
> 
> Bob,
> 
> This is an indication of schedd problems, possibly just a busy schedd
> that doesn't get around to claiming machines that it got matched to.
> There are various things you can tune to improve schedd performance,
> but you can also simply increase the value of MATCH_TIMEOUT in your
> condor configuration.  I would suggest starting with that, since it
> is a simple thing to do.
> 
> --Dan
> 
> Orchard, Bob wrote:
> 
> >When we restart the Condor Central Manager and submit a large number
> >of jobs, all of the workstation nodes get allocated jobs. But slowly,
> >over the course of 10 to 15 hours, most of the workstations stop being
> >assigned jobs. Below is some output from the StartLog of one of the
> >workstation nodes showing that jobs are matched to the node but then
> >the match 'times out'. Is there a simple explanation for this? Could it
> >be a network problem, or is the central manager too slow to assign a
> >job to the node?
> >
> >12/14 07:00:31 DaemonCore: Command received via UDP from host <10.10.6.33:33730>
> >12/14 07:00:31 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
> >12/14 07:00:31 match_info called
> >12/14 07:00:31 Received match <10.10.23.122:1622>#7916121240
> >12/14 07:00:31 State change: match notification protocol successful
> >12/14 07:00:31 Changing state: Unclaimed -> Matched
> >12/14 07:02:31 State change: match timed out
> >12/14 07:02:31 Changing state: Matched -> Owner
> >12/14 07:02:31 State change: IS_OWNER is false
> >12/14 07:02:31 Changing state: Owner -> Unclaimed
> >12/14 07:05:53 DaemonCore: Command received via TCP from host <10.10.23.135:1551>
> >12/14 07:05:53 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
> >12/14 07:05:53 Error: can't find resource with capability (<10.10.23.122:1622>#7916121240)
> >12/14 07:05:59 DaemonCore: Command received via UDP from host <10.10.23.135:1599>
> >12/14 07:05:59 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
> >12/14 07:05:59 Error: can't find resource with capability (<10.10.23.122:1622>#7916121240)
> >12/14 07:16:23 DaemonCore: Command received via UDP from host <10.10.6.33:33733>
> >12/14 07:16:23 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
> >12/14 07:16:23 match_info called
> >12/14 07:16:23 Received match <10.10.23.122:1622>#2674321328
> >12/14 07:16:23 State change: match notification protocol successful
> >12/14 07:16:23 Changing state: Unclaimed -> Matched
> >12/14 07:18:23 State change: match timed out
> >12/14 07:18:23 Changing state: Matched -> Owner
> >12/14 07:18:23 State change: IS_OWNER is false
> >12/14 07:18:23 Changing state: Owner -> Unclaimed
> >12/14 07:21:45 DaemonCore: Command received via TCP from host <10.10.23.135:1637>
> >12/14 07:21:45 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
> >12/14 07:21:45 Error: can't find resource with capability (<10.10.23.122:1622>#2674321328)
> >12/14 07:21:48 DaemonCore: Command received via UDP from host <10.10.23.135:1687>
> >12/14 07:21:48 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
> >12/14 07:21:48 Error: can't find resource with capability (<10.10.23.122:1622>#2674321328)
> >12/14 07:32:23 DaemonCore: Command received via UDP from host <10.10.6.33:33734>
> >12/14 07:32:23 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
> >12/14 07:32:23 match_info called
> >12/14 07:32:23 Received match <10.10.23.122:1622>#2231421672
> >12/14 07:32:23 State change: match notification protocol successful
> >12/14 07:32:23 Changing state: Unclaimed -> Matched
> >12/14 07:34:23 State change: match timed out
> >12/14 07:34:23 Changing state: Matched -> Owner
> >12/14 07:34:23 State change: IS_OWNER is false
> >12/14 07:34:23 Changing state: Owner -> Unclaimed
> >12/14 07:37:58 DaemonCore: Command received via TCP from host <10.10.23.135:1717>
> >12/14 07:37:58 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
> >12/14 07:37:58 Error: can't find resource with capability (<10.10.23.122:1622>#2231421672)
> >12/14 07:38:01 DaemonCore: Command received via UDP from host <10.10.23.135:1767>
> >12/14 07:38:01 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
> >12/14 07:38:01 Error: can't find resource with capability (<10.10.23.122:1622>#2231421672)
> >
> >
> >Thanks, Bob.
> >
> >Bob Orchard
> >National Research Council Canada      Conseil national de recherches Canada
> >Institute for Information Technology  Institut de technologie de l'information
> >1200 Montreal Road, Building M-50     M50, 1200 chemin Montréal
> >Ottawa, ON, Canada K1A 0R6            Ottawa (Ontario) Canada K1A 0R6
> >(613) 993-8557 
> >(613) 952-0215 Fax / télécopieur
> >bob.orchard@xxxxxxxxxxxxxx 
> >Government of Canada | Gouvernement du Canada
> >
> >
> >
> >  
> >
> >>-----Original Message-----
> >>From: condor-users-bounces@xxxxxxxxxxx
> >>[mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of Gabriel Mateescu
> >>Sent: Wednesday, December 07, 2005 9:58 PM
> >>To: Condor-Users Mail List
> >>Subject: Re: [Condor-users] jobs won't run: MY.Rank > MY.CurrentRank
> >>
> >>
> >>
> >>Hi,
> >>
> >>The changes in the IP addresses need to be reflected
> >>in the HOST_ALLOW_* entries in the condor_config file
> >>on the central manager. The central manager runs the
> >>negotiator and collector daemons, and the collector
> >>will only accept requests from machines listed
> >>in the HOST_ALLOW_* list.
> >>
> >>Additionally, on the submission machine, the job log
> >>file and the sched daemon log file may be helpful.
> >>
> >>Gabriel
> >>
> >>
> >>
> >>    
> >>
> >>>There have been a number of changes in the ip addresses in the past
> >>>few weeks. These changes were made and the latest version of condor
> >>>installed (6.6.10). Then they did accept at least one job before
> >>>entering the unclaimed/idle state. I will try to access the log files
> >>>on the server and try to trace activity for one of these machines. It
> >>>certainly could be related to that (in fact we are suspicious of this
> >>>network change but are not sure how to trace it or fix it ... one
> >>>option is to stop all machines including the master and restart
> >>>everything).
> >>>
> >>>Bob Orchard
> >>>National Research Council Canada      Conseil national de recherches Canada
> >>>Institute for Information Technology  Institut de technologie de l'information
> >>>1200 Montreal Road, Building M-50     M50, 1200 chemin Montréal
> >>>Ottawa, ON, Canada K1A 0R6            Ottawa (Ontario) Canada K1A 0R6
> >>    
> >>
> >>>(613) 993-8557
> >>>(613) 952-0215 Fax / télécopieur
> >>>bob.orchard@xxxxxxxxxxxxxx
> >>>Government of Canada | Gouvernement du Canada
> >>>
> >>>
> >>>
> >>>-----Original Message-----
> >>>From: condor-users-bounces@xxxxxxxxxxx
> >>>[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Gabriel Mateescu
> >>>Sent: Wednesday, December 07, 2005 8:19 PM
> >>>To: Condor-Users Mail List
> >>>Cc: Condor-Users Mail List
> >>>Subject: Re: [Condor-users] jobs won't run: MY.Rank > MY.CurrentRank
> >>>
> >>>
> >>>      
> >>>
> >>>>We have a similar problem (not as many machines) but many seem to
> >>>>get stuck in the unclaimed/idle state and will not run jobs. An
> >>>>analyze shows the 'reject the job for unknown reasons' for these
> >>>>machines. They ran jobs yesterday for a while but no longer will.
> >>>>
> >>>>Bob Orchard
> >>>>
> >>>>        
> >>>>
> >>>Hi,
> >>>
> >>>Did something in the environment change, such
> >>>as IP addresses or host names?
> >>>
> >>>When "analyze" does not give helpful information,
> >>>there are additional places to check:
> >>>
> >>>  1. the job log file;
> >>>  2. the sched daemon log file
> >>>  3. the negotiator daemon log file.
> >>>
> >>>Gabriel
> >>>
> >>>_______________________________________________
> >>>Condor-users mailing list
> >>>Condor-users@xxxxxxxxxxx
> >>>https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >>>
> >
> 