Re: [Condor-users] jobs won't run: workstations get timeouts on matching



In case the last message did not display properly, I'm enclosing the StartLog
output as a text file.


Bob Orchard
National Research Council Canada      Conseil national de recherches Canada
Institute for Information Technology  Institut de technologie de l'information
1200 Montreal Road, Building M-50     M50, 1200 chemin Montréal
Ottawa, ON, Canada K1A 0R6            Ottawa (Ontario) Canada K1A 0R6
(613) 993-8557 
(613) 952-0215 Fax / télécopieur
bob.orchard@xxxxxxxxxxxxxx 
Government of Canada | Gouvernement du Canada



> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of Orchard, Bob
> Sent: Wednesday, December 14, 2005 8:04 AM
> To: Condor-Users Mail List
> Subject: [Condor-users] jobs won't run: workstations get timeouts on matching
> 
> 
> When we restart the Condor Central Manager and submit a large number of jobs,
> all of the workstation nodes get allocated jobs. But slowly, over the course
> of 10 to 15 hours, most of the workstations stop being assigned jobs. Below
> is some output from the StartLog of one of the workstation nodes showing that
> jobs are matched to the node but then the match 'times out'. Is there
> a simple explanation for this? Could it be a network problem, or is the
> central manager too slow to assign a job to the node?
> 
> 12/14 07:00:31 DaemonCore: Command received via UDP from host <10.10.6.33:33730>
> 12/14 07:00:31 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
> 12/14 07:00:31 match_info called
> 12/14 07:00:31 Received match <10.10.23.122:1622>#7916121240
> 12/14 07:00:31 State change: match notification protocol successful
> 12/14 07:00:31 Changing state: Unclaimed -> Matched
> 12/14 07:02:31 State change: match timed out
> 12/14 07:02:31 Changing state: Matched -> Owner
> 12/14 07:02:31 State change: IS_OWNER is false
> 12/14 07:02:31 Changing state: Owner -> Unclaimed
> 12/14 07:05:53 DaemonCore: Command received via TCP from host <10.10.23.135:1551>
> 12/14 07:05:53 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
> 12/14 07:05:53 Error: can't find resource with capability (<10.10.23.122:1622>#7916121240)
> 12/14 07:05:59 DaemonCore: Command received via UDP from host <10.10.23.135:1599>
> 12/14 07:05:59 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
> 12/14 07:05:59 Error: can't find resource with capability (<10.10.23.122:1622>#7916121240)
> 12/14 07:16:23 DaemonCore: Command received via UDP from host <10.10.6.33:33733>
> 12/14 07:16:23 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
> 12/14 07:16:23 match_info called
> 12/14 07:16:23 Received match <10.10.23.122:1622>#2674321328
> 12/14 07:16:23 State change: match notification protocol successful
> 12/14 07:16:23 Changing state: Unclaimed -> Matched
> 12/14 07:18:23 State change: match timed out
> 12/14 07:18:23 Changing state: Matched -> Owner
> 12/14 07:18:23 State change: IS_OWNER is false
> 12/14 07:18:23 Changing state: Owner -> Unclaimed
> 12/14 07:21:45 DaemonCore: Command received via TCP from host <10.10.23.135:1637>
> 12/14 07:21:45 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
> 12/14 07:21:45 Error: can't find resource with capability (<10.10.23.122:1622>#2674321328)
> 12/14 07:21:48 DaemonCore: Command received via UDP from host <10.10.23.135:1687>
> 12/14 07:21:48 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
> 12/14 07:21:48 Error: can't find resource with capability (<10.10.23.122:1622>#2674321328)
> 12/14 07:32:23 DaemonCore: Command received via UDP from host <10.10.6.33:33734>
> 12/14 07:32:23 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
> 12/14 07:32:23 match_info called
> 12/14 07:32:23 Received match <10.10.23.122:1622>#2231421672
> 12/14 07:32:23 State change: match notification protocol successful
> 12/14 07:32:23 Changing state: Unclaimed -> Matched
> 12/14 07:34:23 State change: match timed out
> 12/14 07:34:23 Changing state: Matched -> Owner
> 12/14 07:34:23 State change: IS_OWNER is false
> 12/14 07:34:23 Changing state: Owner -> Unclaimed
> 12/14 07:37:58 DaemonCore: Command received via TCP from host <10.10.23.135:1717>
> 12/14 07:37:58 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
> 12/14 07:37:58 Error: can't find resource with capability (<10.10.23.122:1622>#2231421672)
> 12/14 07:38:01 DaemonCore: Command received via UDP from host <10.10.23.135:1767>
> 12/14 07:38:01 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
> 12/14 07:38:01 Error: can't find resource with capability (<10.10.23.122:1622>#2231421672)
> 
> 
> Thanks, Bob.
> 
> 
> > -----Original Message-----
> > From: condor-users-bounces@xxxxxxxxxxx
> > [mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of Gabriel Mateescu
> > Sent: Wednesday, December 07, 2005 9:58 PM
> > To: Condor-Users Mail List
> > Subject: Re: [Condor-users] jobs won't run: MY.Rank > MY.CurrentRank
> > 
> > 
> > 
> > Hi,
> > 
> > The changes in the IP addresses need to be reflected
> > in the HOST_ALLOW_* entries in the condor_config file
> > on the central manager. The central manager runs the
> > negotiator and collector daemons, and the collector
> > will only accept requests from machines listed
> > in the HOST_ALLOW_* list.
> > 
> > Additionally, on the submission machine, the job log
> > file and the sched daemon log file may be helpful.
> > 
> > Gabriel
> > 
> > 
> > 
> > > There have been a number of changes in the IP addresses in the past few
> > > weeks. These changes were made and the latest version of Condor (6.6.10)
> > > was installed. They then accepted at least one job before entering the
> > > unclaimed/idle state. I will try to access the log files on the server
> > > and try to trace activity for one of these machines. It certainly could
> > > be related to that (in fact we are suspicious of this network change
> > > but are not sure how to trace it or fix it ... one option is to
> > > stop all machines including the master and restart everything).
> > >
> > > Bob Orchard
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: condor-users-bounces@xxxxxxxxxxx
> > > [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Gabriel Mateescu
> > > Sent: Wednesday, December 07, 2005 8:19 PM
> > > To: Condor-Users Mail List
> > > Cc: Condor-Users Mail List
> > > Subject: Re: [Condor-users] jobs won't run: MY.Rank > MY.CurrentRank
> > >
> > >
> > >> We have a similar problem (not as many machines) but many seem to get
> > >> stuck in the unclaimed/idle state and will not run jobs. An analyze
> > >> shows the 'reject the job for unknown reasons' for these machines.
> > >> They ran jobs yesterday for a while but no longer will.
> > >>
> > >> Bob Orchard
> > >>
> > >
> > > Hi,
> > >
> > > Did something in the environment change, such
> > > as IP addresses or host names?
> > >
> > > When "analyze" does not give helpful information,
> > > there are additional places to check:
> > >
> > >   1. the job log file;
> > >   2. the sched daemon log file;
> > >   3. the negotiator daemon log file.
> > >
> > > Gabriel
> > >
> > > _______________________________________________
> > > Condor-users mailing list
> > > Condor-users@xxxxxxxxxxx
> > > https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> > >
12/14 07:00:31 DaemonCore: Command received via UDP from host <10.10.6.33:33730>
12/14 07:00:31 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
12/14 07:00:31 match_info called
12/14 07:00:31 Received match <10.10.23.122:1622>#7916121240
12/14 07:00:31 State change: match notification protocol successful
12/14 07:00:31 Changing state: Unclaimed -> Matched
12/14 07:02:31 State change: match timed out
12/14 07:02:31 Changing state: Matched -> Owner
12/14 07:02:31 State change: IS_OWNER is false
12/14 07:02:31 Changing state: Owner -> Unclaimed
12/14 07:05:53 DaemonCore: Command received via TCP from host <10.10.23.135:1551>
12/14 07:05:53 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
12/14 07:05:53 Error: can't find resource with capability (<10.10.23.122:1622>#7916121240)
12/14 07:05:59 DaemonCore: Command received via UDP from host <10.10.23.135:1599>
12/14 07:05:59 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
12/14 07:05:59 Error: can't find resource with capability (<10.10.23.122:1622>#7916121240)
12/14 07:16:23 DaemonCore: Command received via UDP from host <10.10.6.33:33733>
12/14 07:16:23 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
12/14 07:16:23 match_info called
12/14 07:16:23 Received match <10.10.23.122:1622>#2674321328
12/14 07:16:23 State change: match notification protocol successful
12/14 07:16:23 Changing state: Unclaimed -> Matched
12/14 07:18:23 State change: match timed out
12/14 07:18:23 Changing state: Matched -> Owner
12/14 07:18:23 State change: IS_OWNER is false
12/14 07:18:23 Changing state: Owner -> Unclaimed
12/14 07:21:45 DaemonCore: Command received via TCP from host <10.10.23.135:1637>
12/14 07:21:45 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
12/14 07:21:45 Error: can't find resource with capability (<10.10.23.122:1622>#2674321328)
12/14 07:21:48 DaemonCore: Command received via UDP from host <10.10.23.135:1687>
12/14 07:21:48 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
12/14 07:21:48 Error: can't find resource with capability (<10.10.23.122:1622>#2674321328)
12/14 07:32:23 DaemonCore: Command received via UDP from host <10.10.6.33:33734>
12/14 07:32:23 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
12/14 07:32:23 match_info called
12/14 07:32:23 Received match <10.10.23.122:1622>#2231421672
12/14 07:32:23 State change: match notification protocol successful
12/14 07:32:23 Changing state: Unclaimed -> Matched
12/14 07:34:23 State change: match timed out
12/14 07:34:23 Changing state: Matched -> Owner
12/14 07:34:23 State change: IS_OWNER is false
12/14 07:34:23 Changing state: Owner -> Unclaimed
12/14 07:37:58 DaemonCore: Command received via TCP from host <10.10.23.135:1717>
12/14 07:37:58 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
12/14 07:37:58 Error: can't find resource with capability (<10.10.23.122:1622>#2231421672)
12/14 07:38:01 DaemonCore: Command received via UDP from host <10.10.23.135:1767>
12/14 07:38:01 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
12/14 07:38:01 Error: can't find resource with capability (<10.10.23.122:1622>#2231421672)
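
One way to quantify the pattern in the StartLog above is to measure, for each match, how long the node waited before the match timed out and how long it took for the first command that could no longer find the match to arrive. The following is a minimal Python sketch, not a Condor tool. It assumes the line format shown in the log above, takes the year 2005 from the date of this message (the log stamps carry no year), and the name match_delays is made up for illustration.

```python
import re
from datetime import datetime

# Timestamped StartLog line, e.g. "12/14 07:00:31 match_info called".
LINE = re.compile(r"^(\d\d/\d\d \d\d:\d\d:\d\d) (.*)$")
# Match identifier as logged, e.g. "<10.10.23.122:1622>#7916121240".
MATCH = re.compile(r"Received match (\S+)")
MISSING = re.compile(r"can't find resource with capability \((\S+)\)")


def match_delays(lines, year=2005):
    """Return (match_id, secs_to_timeout, secs_to_first_failed_command) per match.

    Hypothetical helper: `year` is an assumption, needed only because the
    log stamps omit it. A missing event is reported as None.
    """
    received, timed_out, first_miss = {}, {}, {}
    current = None  # most recent match seen; a timeout line names no id
    for raw in lines:
        m = LINE.match(raw.strip())
        if m is None:
            continue  # skip anything that is not a timestamped log line
        when = datetime.strptime(f"{year}/{m.group(1)}", "%Y/%m/%d %H:%M:%S")
        text = m.group(2)
        got = MATCH.search(text)
        miss = MISSING.search(text)
        if got:
            current = got.group(1)
            received[current] = when
        elif "match timed out" in text and current is not None:
            timed_out.setdefault(current, when)
        elif miss:
            # Keep only the first command that failed to find this match.
            first_miss.setdefault(miss.group(1), when)

    def elapsed(start, end):
        return None if end is None else (end - start).total_seconds()

    return [
        (cap, elapsed(start, timed_out.get(cap)), elapsed(start, first_miss.get(cap)))
        for cap, start in received.items()
    ]


# First match cycle from the log above.
SAMPLE = """\
12/14 07:00:31 Received match <10.10.23.122:1622>#7916121240
12/14 07:00:31 Changing state: Unclaimed -> Matched
12/14 07:02:31 State change: match timed out
12/14 07:05:53 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
12/14 07:05:53 Error: can't find resource with capability (<10.10.23.122:1622>#7916121240)
""".splitlines()

for cap, to_timeout, to_command in match_delays(SAMPLE):
    print(cap, to_timeout, to_command)
# prints: <10.10.23.122:1622>#7916121240 120.0 322.0
```

Going by the timestamps in the full log, each of the three matches times out after 120 seconds, while the REQUEST_CLAIM for it arrives 322, 322 and 335 seconds after the match. The sketch only measures these gaps; it does not say whether the delay comes from the network, the submitting machine or the central manager.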