Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [Condor-users] jobs won't run: workstations get timeouts on matching

Date: Wed, 14 Dec 2005 09:45:31 -0600
From: Dan Bradley <dan@xxxxxxxxxxxx>
Subject: Re: [Condor-users] jobs won't run: workstations get timeouts on matching

Bob,

This is an indication of schedd problems, possibly just a busy schedd that doesn't get around to claiming machines that it got matched to. There are various things you can tune to improve schedd performance, but you can also simply increase the value of MATCH_TIMEOUT in your condor configuration. I would suggest starting with that, since it is a simple thing to do.
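
To pick a value, it may help to measure how late the claim requests actually arrive. The following is a minimal Python sketch, not part of Condor, that reads StartLog lines like the ones quoted below and reports, for each match, how long the startd waited before timing out and how long after the match the first request for that capability showed up. The function name `claim_delays`, the regular expressions, and the fixed year used for parsing are assumptions made for illustration; the sample lines are copied from the quoted log.

```python
import re
from datetime import datetime

# Lines copied from the StartLog excerpt quoted in this thread
# (one match, one timeout, and the first late request per attempt).
SAMPLE_LOG = """\
12/14 07:00:31 Received match <10.10.23.122:1622>#7916121240
12/14 07:02:31 State change: match timed out
12/14 07:05:53 Error: can't find resource with capability (<10.10.23.122:1622>#7916121240)
12/14 07:16:23 Received match <10.10.23.122:1622>#2674321328
12/14 07:18:23 State change: match timed out
12/14 07:21:45 Error: can't find resource with capability (<10.10.23.122:1622>#2674321328)
12/14 07:32:23 Received match <10.10.23.122:1622>#2231421672
12/14 07:34:23 State change: match timed out
12/14 07:37:58 Error: can't find resource with capability (<10.10.23.122:1622>#2231421672)
"""

LINE_RE = re.compile(r"^(\d\d/\d\d \d\d:\d\d:\d\d) (.*)$")
MATCH_RE = re.compile(r"Received match (\S+)")
MISSING_RE = re.compile(r"can't find resource with capability \((\S+)\)")


def parse_time(stamp):
    # The log stamps carry no year; a fixed year (an assumption) is enough
    # for computing differences within the same day.
    return datetime.strptime("2005/" + stamp, "%Y/%m/%d %H:%M:%S")


def claim_delays(log_text):
    """Return (capability, seconds_until_timeout, seconds_until_first_request)."""
    matched_at = {}   # capability -> time the match was received
    timeouts = {}     # capability -> seconds from match to "match timed out"
    delays = {}       # capability -> seconds from match to first late request
    pending = None    # capability currently in the Matched state
    for line in log_text.splitlines():
        parsed = LINE_RE.match(line)
        if not parsed:
            continue
        when, msg = parse_time(parsed.group(1)), parsed.group(2)
        found = MATCH_RE.search(msg)
        if found:
            pending = found.group(1)
            matched_at[pending] = when
            continue
        if "match timed out" in msg and pending is not None:
            timeouts[pending] = (when - matched_at[pending]).total_seconds()
            pending = None
            continue
        missing = MISSING_RE.search(msg)
        if missing:
            cap = missing.group(1)
            # Only the first request per capability is of interest.
            if cap in matched_at and cap not in delays:
                delays[cap] = (when - matched_at[cap]).total_seconds()
    return [(cap, timeouts.get(cap), delays.get(cap)) for cap in matched_at]


for cap, timeout_s, delay_s in claim_delays(SAMPLE_LOG):
    print(cap, "timed out after", timeout_s, "s; first request after", delay_s, "s")
```

On the quoted excerpt, each match times out after 120 seconds, while the first request for the same capability arrives 322 to 335 seconds after the match. Under that reading, a MATCH_TIMEOUT somewhat above the largest observed delay would be a reasonable first value to try on the execute machines; the right number for a given pool is a judgment call.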


--Dan

Orchard, Bob wrote:

When we restart the Condor Central Manager and submit a large number of jobs,
all of the workstation nodes get allocated jobs. But slowly over the course
of 10 to 15 hours most of the workstations stop being assigned jobs. Below
is some output from the StartLog of one of the workstation nodes showing that
jobs are matched to the node but then the match 'times out'. Is there
a simple explanation for this? Could it be a network problem or is the
central manager too slow to assign a job to the node?

12/14 07:00:31 DaemonCore: Command received via UDP from host <10.10.6.33:33730>
12/14 07:00:31 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
12/14 07:00:31 match_info called
12/14 07:00:31 Received match <10.10.23.122:1622>#7916121240
12/14 07:00:31 State change: match notification protocol successful
12/14 07:00:31 Changing state: Unclaimed -> Matched
12/14 07:02:31 State change: match timed out
12/14 07:02:31 Changing state: Matched -> Owner
12/14 07:02:31 State change: IS_OWNER is false
12/14 07:02:31 Changing state: Owner -> Unclaimed
12/14 07:05:53 DaemonCore: Command received via TCP from host <10.10.23.135:1551>
12/14 07:05:53 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
12/14 07:05:53 Error: can't find resource with capability (<10.10.23.122:1622>#7916121240)
12/14 07:05:59 DaemonCore: Command received via UDP from host <10.10.23.135:1599>
12/14 07:05:59 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
12/14 07:05:59 Error: can't find resource with capability (<10.10.23.122:1622>#7916121240)
12/14 07:16:23 DaemonCore: Command received via UDP from host <10.10.6.33:33733>
12/14 07:16:23 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
12/14 07:16:23 match_info called
12/14 07:16:23 Received match <10.10.23.122:1622>#2674321328
12/14 07:16:23 State change: match notification protocol successful
12/14 07:16:23 Changing state: Unclaimed -> Matched
12/14 07:18:23 State change: match timed out
12/14 07:18:23 Changing state: Matched -> Owner
12/14 07:18:23 State change: IS_OWNER is false
12/14 07:18:23 Changing state: Owner -> Unclaimed
12/14 07:21:45 DaemonCore: Command received via TCP from host <10.10.23.135:1637>
12/14 07:21:45 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
12/14 07:21:45 Error: can't find resource with capability (<10.10.23.122:1622>#2674321328)
12/14 07:21:48 DaemonCore: Command received via UDP from host <10.10.23.135:1687>
12/14 07:21:48 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
12/14 07:21:48 Error: can't find resource with capability (<10.10.23.122:1622>#2674321328)
12/14 07:32:23 DaemonCore: Command received via UDP from host <10.10.6.33:33734>
12/14 07:32:23 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
12/14 07:32:23 match_info called
12/14 07:32:23 Received match <10.10.23.122:1622>#2231421672
12/14 07:32:23 State change: match notification protocol successful
12/14 07:32:23 Changing state: Unclaimed -> Matched
12/14 07:34:23 State change: match timed out
12/14 07:34:23 Changing state: Matched -> Owner
12/14 07:34:23 State change: IS_OWNER is false
12/14 07:34:23 Changing state: Owner -> Unclaimed
12/14 07:37:58 DaemonCore: Command received via TCP from host <10.10.23.135:1717>
12/14 07:37:58 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
12/14 07:37:58 Error: can't find resource with capability (<10.10.23.122:1622>#2231421672)
12/14 07:38:01 DaemonCore: Command received via UDP from host <10.10.23.135:1767>
12/14 07:38:01 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
12/14 07:38:01 Error: can't find resource with capability (<10.10.23.122:1622>#2231421672)


Thanks, Bob.

Bob Orchard
National Research Council Canada      Conseil national de recherches Canada
Institute for Information Technology  Institut de technologie de l'information
1200 Montreal Road, Building M-50     M50, 1200 chemin Montréal
Ottawa, ON, Canada K1A 0R6            Ottawa (Ontario) Canada K1A 0R6

(613) 993-8557
(613) 952-0215 Fax / télécopieur
bob.orchard@xxxxxxxxxxxxxx
Government of Canada | Gouvernement du Canada

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx]On Behalf Of Gabriel Mateescu
Sent: Wednesday, December 07, 2005 9:58 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] jobs won't run: MY.Rank > MY.CurrentRank

Hi,

The changes in the IP addresses need to be reflected
in the HOST_ALLOW_* entries in the condor_config file
on the central manager. The central manager runs the
negotiator and collector daemons, and the collector
will only accept requests from machines listed
in the HOST_ALLOW_* list.
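
As a rough way to spot machines that an updated allow list no longer covers, the check can be approximated in a few lines of Python. This is only a sketch: it treats each entry as a shell-style wildcard via `fnmatch`, which is an approximation of how Condor evaluates HOST_ALLOW_* entries and does not handle hostnames or netmasks. The function name `uncovered_hosts` and the allow list are hypothetical; the addresses are the ones that appear in the StartLog excerpt in this thread.

```python
from fnmatch import fnmatch


def uncovered_hosts(allow_patterns, host_ips):
    """Return the IPs that match none of the wildcard patterns."""
    return [
        ip for ip in host_ips
        if not any(fnmatch(ip, pattern) for pattern in allow_patterns)
    ]


# Hypothetical allow list in which only one subnet was updated.
allow = ["10.10.23.*"]
# Addresses seen in the StartLog excerpt in this thread.
hosts = ["10.10.6.33", "10.10.23.122", "10.10.23.135"]

print(uncovered_hosts(allow, hosts))
```

Any address this reports would be worth comparing by hand against the actual HOST_ALLOW_* entries on the central manager.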

Additionally, on the submission machine, the job log
file and the sched daemon log file may be helpful.

Gabriel

There have been a number of changes in the IP addresses in the past few weeks.
These changes were made and the latest version of Condor installed
(6.6.10). Then they did accept at least one job before entering the
unclaimed/idle state. I will try to access the log files on the server
and try to trace activity for one of these machines. It certainly could
be related to that (in fact we are suspicious of this network change
but are not sure how to trace it or fix it ... one option is to
stop all machines including the master and restart everything).

Bob Orchard
National Research Council Canada      Conseil national de recherches Canada
Institute for Information Technology  Institut de technologie de l'information
1200 Montreal Road, Building M-50     M50, 1200 chemin Montréal
Ottawa, ON, Canada K1A 0R6            Ottawa (Ontario) Canada K1A 0R6

(613) 993-8557
(613) 952-0215 Fax / télécopieur
bob.orchard@xxxxxxxxxxxxxx
Government of Canada | Gouvernement du Canada



-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Gabriel Mateescu
Sent: Wednesday, December 07, 2005 8:19 PM
To: Condor-Users Mail List
Cc: Condor-Users Mail List
Subject: Re: [Condor-users] jobs won't run: MY.Rank > MY.CurrentRank

We have a similar problem (not as many machines) but many seem to get
stuck in the unclaimed/idle state and will not run jobs. An analyze
shows the 'reject the job for unknown reasons' for these machines.
They ran jobs yesterday for a while but no longer will.

Bob Orchard

Hi,

Did something in the environment change, such
as IP addresses or host names?

When "analyze" does not give helpful information,
there are additional places to check:

 1. the job log file;
 2. the sched daemon log file
 3. the negotiator daemon log file.

Gabriel

_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

