
Re: [Condor-users] jobs won't run: workstations get timeouts on matching



Bob,

This indicates a schedd problem, possibly just a busy schedd that does not get around to claiming the machines it was matched to before the match expires. There are various things you can tune to improve schedd performance, but you can also simply increase the value of MATCH_TIMEOUT in your Condor configuration. I would suggest starting with that, since it is a simple change.
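
To pick a value, it may help to measure how late the claim requests actually arrive at the startd. What follows is a rough Python sketch (not part of Condor) that reads StartLog lines in the format quoted in this thread and, for each match, reports the seconds between "Received match" and the first "can't find resource with capability" error for the same capability string. Pairing the two lines this way, the regular expressions, and the function name claim_delays are assumptions based only on the excerpt in this thread, so treat it as a starting point rather than a supported tool.

```python
import re
from datetime import datetime

# Assumed line layout, taken from the StartLog excerpt in this thread:
#   "MM/DD HH:MM:SS message text"
LINE = re.compile(r"^(\d\d/\d\d \d\d:\d\d:\d\d) (.*)$")
MATCH = re.compile(r"Received match (\S+)")
LATE = re.compile(r"can't find resource with capability \((\S+)\)")


def claim_delays(log_text):
    """Map each capability string to the seconds between its match
    and the first failed claim attempt that mentions it."""
    matched_at = {}
    delays = {}
    for raw in log_text.splitlines():
        parsed = LINE.match(raw.strip())
        if not parsed:
            continue
        # The log carries no year; a fixed leap year is prepended so that
        # differences within one year work and 02/29 still parses.
        stamp = datetime.strptime("2004/" + parsed.group(1), "%Y/%m/%d %H:%M:%S")
        message = parsed.group(2)
        got = MATCH.search(message)
        if got:
            matched_at[got.group(1)] = stamp
            continue
        late = LATE.search(message)
        if late:
            cap = late.group(1)
            # Only the first error per capability is recorded.
            if cap in matched_at and cap not in delays:
                delays[cap] = (stamp - matched_at[cap]).total_seconds()
    return delays


# First match cycle from the StartLog excerpt quoted in this thread.
SAMPLE = """\
12/14 07:00:31 Received match <10.10.23.122:1622>#7916121240
12/14 07:02:31 State change: match timed out
12/14 07:05:53 Error: can't find resource with capability (<10.10.23.122:1622>#7916121240)
12/14 07:05:59 Error: can't find resource with capability (<10.10.23.122:1622>#7916121240)
"""

delays = claim_delays(SAMPLE)
print(max(delays.values()))  # 322.0
```

In the quoted excerpt the match times out after two minutes, while the claim request shows up more than five minutes after the match, so MATCH_TIMEOUT would have needed to exceed the reported delay for that claim to succeed.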

--Dan

Orchard, Bob wrote:

When we restart the Condor Central Manager and submit a large number of jobs,
all of the workstation nodes get allocated jobs. But slowly, over the course
of 10 to 15 hours, most of the workstations stop being assigned jobs. Below
is some output from the StartLog of one of the workstation nodes showing that
jobs are matched to the node but then the match 'times out'. Is there
a simple explanation for this? Could it be a network problem, or is the
central manager too slow to assign a job to the node?

12/14 07:00:31 DaemonCore: Command received via UDP from host <10.10.6.33:33730>
12/14 07:00:31 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
12/14 07:00:31 match_info called
12/14 07:00:31 Received match <10.10.23.122:1622>#7916121240
12/14 07:00:31 State change: match notification protocol successful
12/14 07:00:31 Changing state: Unclaimed -> Matched
12/14 07:02:31 State change: match timed out
12/14 07:02:31 Changing state: Matched -> Owner
12/14 07:02:31 State change: IS_OWNER is false
12/14 07:02:31 Changing state: Owner -> Unclaimed
12/14 07:05:53 DaemonCore: Command received via TCP from host <10.10.23.135:1551>
12/14 07:05:53 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
12/14 07:05:53 Error: can't find resource with capability (<10.10.23.122:1622>#7916121240)
12/14 07:05:59 DaemonCore: Command received via UDP from host <10.10.23.135:1599>
12/14 07:05:59 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
12/14 07:05:59 Error: can't find resource with capability (<10.10.23.122:1622>#7916121240)
12/14 07:16:23 DaemonCore: Command received via UDP from host <10.10.6.33:33733>
12/14 07:16:23 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
12/14 07:16:23 match_info called
12/14 07:16:23 Received match <10.10.23.122:1622>#2674321328
12/14 07:16:23 State change: match notification protocol successful
12/14 07:16:23 Changing state: Unclaimed -> Matched
12/14 07:18:23 State change: match timed out
12/14 07:18:23 Changing state: Matched -> Owner
12/14 07:18:23 State change: IS_OWNER is false
12/14 07:18:23 Changing state: Owner -> Unclaimed
12/14 07:21:45 DaemonCore: Command received via TCP from host <10.10.23.135:1637>
12/14 07:21:45 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
12/14 07:21:45 Error: can't find resource with capability (<10.10.23.122:1622>#2674321328)
12/14 07:21:48 DaemonCore: Command received via UDP from host <10.10.23.135:1687>
12/14 07:21:48 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
12/14 07:21:48 Error: can't find resource with capability (<10.10.23.122:1622>#2674321328)
12/14 07:32:23 DaemonCore: Command received via UDP from host <10.10.6.33:33734>
12/14 07:32:23 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
12/14 07:32:23 match_info called
12/14 07:32:23 Received match <10.10.23.122:1622>#2231421672
12/14 07:32:23 State change: match notification protocol successful
12/14 07:32:23 Changing state: Unclaimed -> Matched
12/14 07:34:23 State change: match timed out
12/14 07:34:23 Changing state: Matched -> Owner
12/14 07:34:23 State change: IS_OWNER is false
12/14 07:34:23 Changing state: Owner -> Unclaimed
12/14 07:37:58 DaemonCore: Command received via TCP from host <10.10.23.135:1717>
12/14 07:37:58 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
12/14 07:37:58 Error: can't find resource with capability (<10.10.23.122:1622>#2231421672)
12/14 07:38:01 DaemonCore: Command received via UDP from host <10.10.23.135:1767>
12/14 07:38:01 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
12/14 07:38:01 Error: can't find resource with capability (<10.10.23.122:1622>#2231421672)


Thanks, Bob.

Bob Orchard
National Research Council Canada      Conseil national de recherches Canada
Institute for Information Technology  Institut de technologie de l'information
1200 Montreal Road, Building M-50     M50, 1200 chemin Montréal
Ottawa, ON, Canada K1A 0R6            Ottawa (Ontario) Canada K1A 0R6
(613) 993-8557
(613) 952-0215 Fax / télécopieur
bob.orchard@xxxxxxxxxxxxxx
Government of Canada | Gouvernement du Canada



-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of Gabriel Mateescu
Sent: Wednesday, December 07, 2005 9:58 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] jobs won't run: MY.Rank > MY.CurrentRank



Hi,

The changes in the IP addresses need to be reflected
in the HOST_ALLOW_* entries in the condor_config file
on the central manager. The central manager runs the
negotiator and collector daemons, and the collector
will only accept requests from machines listed
in the HOST_ALLOW_* list.

Additionally, on the submission machine, the job log
file and the sched daemon log file may be helpful.

Gabriel



There have been a number of changes in the IP addresses in the past few
weeks. These changes were made and the latest version of Condor installed
(6.6.10). The machines then accepted at least one job before entering the
unclaimed/idle state. I will try to access the log files on the server
and try to trace activity for one of these machines. It certainly could
be related to that (in fact we are suspicious of this network change
but are not sure how to trace it or fix it ... one option is to
stop all machines including the master and restart everything).

Bob Orchard



-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of
Gabriel Mateescu
Sent: Wednesday, December 07, 2005 8:19 PM
To: Condor-Users Mail List
Cc: Condor-Users Mail List
Subject: Re: [Condor-users] jobs won't run: MY.Rank > MY.CurrentRank


We have a similar problem (with fewer machines): many of them seem to get
stuck in the unclaimed/idle state and will not run jobs. An analyze
reports 'reject the job for unknown reasons' for these machines.
They ran jobs yesterday for a while but no longer will.

Bob Orchard

Hi,

Did something in the environment change, such
as IP addresses or host names?

When "analyze" does not give helpful information,
there are additional places to check:

 1. the job log file;
 2. the sched daemon log file;
 3. the negotiator daemon log file.

Gabriel

_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users
