[Condor-users] jobs won't run: workstations get timeouts on matching



When we restart the Condor Central Manager and submit a large number of jobs,
all of the workstation nodes are allocated jobs. But over the course
of 10 to 15 hours, most of the workstations stop being assigned jobs. Below
is some output from the StartLog of one of the workstation nodes, showing that
jobs are matched to the node but the match then 'times out'. Is there
a simple explanation for this? Could it be a network problem, or is the
central manager too slow to assign a job to the node?

12/14 07:00:31 DaemonCore: Command received via UDP from host <10.10.6.33:33730>
12/14 07:00:31 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
12/14 07:00:31 match_info called
12/14 07:00:31 Received match <10.10.23.122:1622>#7916121240
12/14 07:00:31 State change: match notification protocol successful
12/14 07:00:31 Changing state: Unclaimed -> Matched
12/14 07:02:31 State change: match timed out
12/14 07:02:31 Changing state: Matched -> Owner
12/14 07:02:31 State change: IS_OWNER is false
12/14 07:02:31 Changing state: Owner -> Unclaimed
12/14 07:05:53 DaemonCore: Command received via TCP from host <10.10.23.135:1551>
12/14 07:05:53 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
12/14 07:05:53 Error: can't find resource with capability (<10.10.23.122:1622>#7916121240)
12/14 07:05:59 DaemonCore: Command received via UDP from host <10.10.23.135:1599>
12/14 07:05:59 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
12/14 07:05:59 Error: can't find resource with capability (<10.10.23.122:1622>#7916121240)
12/14 07:16:23 DaemonCore: Command received via UDP from host <10.10.6.33:33733>
12/14 07:16:23 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
12/14 07:16:23 match_info called
12/14 07:16:23 Received match <10.10.23.122:1622>#2674321328
12/14 07:16:23 State change: match notification protocol successful
12/14 07:16:23 Changing state: Unclaimed -> Matched
12/14 07:18:23 State change: match timed out
12/14 07:18:23 Changing state: Matched -> Owner
12/14 07:18:23 State change: IS_OWNER is false
12/14 07:18:23 Changing state: Owner -> Unclaimed
12/14 07:21:45 DaemonCore: Command received via TCP from host <10.10.23.135:1637>
12/14 07:21:45 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
12/14 07:21:45 Error: can't find resource with capability (<10.10.23.122:1622>#2674321328)
12/14 07:21:48 DaemonCore: Command received via UDP from host <10.10.23.135:1687>
12/14 07:21:48 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
12/14 07:21:48 Error: can't find resource with capability (<10.10.23.122:1622>#2674321328)
12/14 07:32:23 DaemonCore: Command received via UDP from host <10.10.6.33:33734>
12/14 07:32:23 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
12/14 07:32:23 match_info called
12/14 07:32:23 Received match <10.10.23.122:1622>#2231421672
12/14 07:32:23 State change: match notification protocol successful
12/14 07:32:23 Changing state: Unclaimed -> Matched
12/14 07:34:23 State change: match timed out
12/14 07:34:23 Changing state: Matched -> Owner
12/14 07:34:23 State change: IS_OWNER is false
12/14 07:34:23 Changing state: Owner -> Unclaimed
12/14 07:37:58 DaemonCore: Command received via TCP from host <10.10.23.135:1717>
12/14 07:37:58 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
12/14 07:37:58 Error: can't find resource with capability (<10.10.23.122:1622>#2231421672)
12/14 07:38:01 DaemonCore: Command received via UDP from host <10.10.23.135:1767>
12/14 07:38:01 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
12/14 07:38:01 Error: can't find resource with capability (<10.10.23.122:1622>#2231421672)
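
One way to put numbers on the pattern in the excerpt above is to measure, for each match, how long the startd stayed in the Matched state before timing out and how long after the match the first claim attempt for that capability arrived. The following is a minimal Python sketch, not a Condor tool. The function name `match_delays`, the regular expressions, and the default year are assumptions made for illustration (the StartLog lines carry no year; 2005 is taken from the dates of the quoted messages).

```python
import re
from datetime import datetime

# A few lines copied from the StartLog excerpt (first and third cycles).
SAMPLE = """\
12/14 07:00:31 Received match <10.10.23.122:1622>#7916121240
12/14 07:00:31 Changing state: Unclaimed -> Matched
12/14 07:02:31 State change: match timed out
12/14 07:05:53 Error: can't find resource with capability (<10.10.23.122:1622>#7916121240)
12/14 07:05:59 Error: can't find resource with capability (<10.10.23.122:1622>#7916121240)
12/14 07:32:23 Received match <10.10.23.122:1622>#2231421672
12/14 07:34:23 State change: match timed out
12/14 07:37:58 Error: can't find resource with capability (<10.10.23.122:1622>#2231421672)
"""

# "MM/DD HH:MM:SS message" as it appears in the excerpt.
LINE = re.compile(r"^(\d\d/\d\d \d\d:\d\d:\d\d) (.*)$")
# Capability string such as <10.10.23.122:1622>#7916121240.
CAPABILITY = re.compile(r"<[\d.]+:\d+>#\d+")


def match_delays(log_text, year=2005):
    """Return one record per match with the seconds until the timeout
    and the seconds until the first failed claim for that capability.

    The year is an assumption: the log timestamps do not include one.
    """
    by_capability = {}
    records = []
    current = None
    for raw in log_text.splitlines():
        parsed = LINE.match(raw.strip())
        if not parsed:
            continue
        when = datetime.strptime(f"{year} {parsed.group(1)}", "%Y %m/%d %H:%M:%S")
        message = parsed.group(2)
        cap = CAPABILITY.search(message)
        if message.startswith("Received match") and cap:
            current = {
                "capability": cap.group(0),
                "matched": when,
                "timeout_s": None,
                "late_claim_s": None,
            }
            by_capability[cap.group(0)] = current
            records.append(current)
        elif "match timed out" in message and current is not None:
            # Attribute the timeout to the most recent match.
            current["timeout_s"] = (when - current["matched"]).total_seconds()
        elif "can't find resource with capability" in message and cap:
            record = by_capability.get(cap.group(0))
            # Keep only the first failed lookup (the REQUEST_CLAIM attempt).
            if record is not None and record["late_claim_s"] is None:
                record["late_claim_s"] = (when - record["matched"]).total_seconds()
    return records


if __name__ == "__main__":
    for rec in match_delays(SAMPLE):
        print(rec["capability"], rec["timeout_s"], rec["late_claim_s"])
```

For the first cycle in the excerpt this gives a timeout 120 seconds after the match and a claim attempt 322 seconds after the match, that is, the claim request from 10.10.23.135 reaches the node only after the match has already expired.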


Thanks, Bob.

Bob Orchard
National Research Council Canada      Conseil national de recherches Canada
Institute for Information Technology  Institut de technologie de l'information
1200 Montreal Road, Building M-50     M50, 1200 chemin Montréal
Ottawa, ON, Canada K1A 0R6            Ottawa (Ontario) Canada K1A 0R6
(613) 993-8557 
(613) 952-0215 Fax / télécopieur
bob.orchard@xxxxxxxxxxxxxx 
Government of Canada | Gouvernement du Canada



> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of Gabriel Mateescu
> Sent: Wednesday, December 07, 2005 9:58 PM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] jobs won't run: MY.Rank > MY.CurrentRank
> 
> 
> 
> Hi,
> 
> The changes in the IP addresses need to be reflected
> in the HOSTALLOW_* entries in the condor_config file
> on the central manager. The central manager runs the
> negotiator and collector daemons, and the collector
> will only accept requests from machines listed
> in the HOSTALLOW_* entries.
> 
> Additionally, on the submission machine, the job log
> file and the sched daemon log file may be helpful.
> 
> Gabriel
> 
> 
> 
> > There have been a number of changes in the IP addresses in the past
> > few weeks. These changes were made and the latest version of Condor
> > installed (6.6.10). Then they did accept at least one job before
> > entering the unclaimed/idle state. I will try to access the log files
> > on the server and try to trace activity for one of these machines. It
> > certainly could be related to that (in fact we are suspicious of this
> > network change but are not sure how to trace it or fix it ... one
> > option is to stop all machines including the master and restart
> > everything).
> >
> > Bob Orchard
> >
> >
> >
> > -----Original Message-----
> > From: condor-users-bounces@xxxxxxxxxxx
> > [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Gabriel Mateescu
> > Sent: Wednesday, December 07, 2005 8:19 PM
> > To: Condor-Users Mail List
> > Cc: Condor-Users Mail List
> > Subject: Re: [Condor-users] jobs won't run: MY.Rank > MY.CurrentRank
> >
> >
> >> We have a similar problem (not as many machines) but many seem to get
> >> stuck in the unclaimed/idle state and will not run jobs. An analyze
> >> shows the 'reject the job for unknown reasons' for these machines.
> >> They ran jobs yesterday for a while but no longer will.
> >>
> >> Bob Orchard
> >>
> >
> > Hi,
> >
> > Did something in the environment change, such
> > as IP addresses or host names?
> >
> > When "analyze" does not give helpful information,
> > there are additional places to check:
> >
> >   1. the job log file;
> >   2. the sched daemon log file;
> >   3. the negotiator daemon log file.
> >
> > Gabriel
> >
> > _______________________________________________
> > Condor-users mailing list
> > Condor-users@xxxxxxxxxxx
> > https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >