
Re: [Condor-users] jobs won't run: workstations get timeouts on matching



Bob,

Yes, you want to change MATCH_TIMEOUT on the startd (worker) nodes. This lets the startd wait longer when it is expecting to be contacted by the schedd after a match has been made. The theory is that it was the schedd (submitter) that was bogged down, not the startd; but since it is the startd that enforces the timeout, the configuration change has to be made on the startd side.
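For example, in the condor_config (or local config file) on each execute node you could raise the timeout from the default 120 seconds, which matches the two-minute gap between "Matched" and "match timed out" in your StartLog, to something like:

    MATCH_TIMEOUT = 300

(300 here is just an illustrative value, not a recommendation) and then run condor_reconfig on those nodes so the startds pick up the change.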

--Dan

Orchard, Bob wrote:

We can try that, thanks ... but is this to be done on all workstation nodes, on the submitter's workstation, or on the central manager machine? It sounds like you mean changing this value on each workstation node so it won't time out so soon. But it does seem odd that every machine was too busy and timed out, since they actually weren't busy at all with other things (it was the middle of the night and the previous Condor job had completed). So one might assume that it was the central manager (or the submitter's machine) that was too slow to do something. It's not clear to me, since I don't know the intimate details of how Condor works.

Bob Orchard
National Research Council Canada      Conseil national de recherches Canada
Institute for Information Technology  Institut de technologie de l'information
1200 Montreal Road, Building M-50     M50, 1200 chemin Montréal
Ottawa, ON, Canada K1A 0R6            Ottawa (Ontario) Canada K1A 0R6
(613) 993-8557                        (613) 952-0215 Fax / télécopieur
bob.orchard@xxxxxxxxxxxxxx
Government of Canada | Gouvernement du Canada



-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Dan Bradley
Sent: Wednesday, December 14, 2005 10:46 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] jobs won't run: workstations get timeouts on matching


Bob,

This is an indication of schedd problems, possibly just a busy schedd that doesn't get around to claiming the machines it was matched with. There are various things you can tune to improve schedd performance, but you can also simply increase the value of MATCH_TIMEOUT in your Condor configuration. I would suggest starting with that, since it is a simple change.

--Dan

Orchard, Bob wrote:

When we restart the Condor central manager and submit a large number of jobs, all of the workstation nodes get allocated jobs. But slowly, over the course of 10 to 15 hours, most of the workstations stop being assigned jobs. Below is some output from the StartLog of one of the workstation nodes, showing that jobs are matched to the node but then the match 'times out'. Is there a simple explanation for this? Could it be a network problem, or is the central manager too slow to assign a job to the node?

12/14 07:00:31 DaemonCore: Command received via UDP from host <10.10.6.33:33730>
12/14 07:00:31 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
12/14 07:00:31 match_info called
12/14 07:00:31 Received match <10.10.23.122:1622>#7916121240
12/14 07:00:31 State change: match notification protocol successful
12/14 07:00:31 Changing state: Unclaimed -> Matched
12/14 07:02:31 State change: match timed out
12/14 07:02:31 Changing state: Matched -> Owner
12/14 07:02:31 State change: IS_OWNER is false
12/14 07:02:31 Changing state: Owner -> Unclaimed
12/14 07:05:53 DaemonCore: Command received via TCP from host <10.10.23.135:1551>
12/14 07:05:53 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
12/14 07:05:53 Error: can't find resource with capability (<10.10.23.122:1622>#7916121240)
12/14 07:05:59 DaemonCore: Command received via UDP from host <10.10.23.135:1599>
12/14 07:05:59 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
12/14 07:05:59 Error: can't find resource with capability (<10.10.23.122:1622>#7916121240)
12/14 07:16:23 DaemonCore: Command received via UDP from host <10.10.6.33:33733>
12/14 07:16:23 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
12/14 07:16:23 match_info called
12/14 07:16:23 Received match <10.10.23.122:1622>#2674321328
12/14 07:16:23 State change: match notification protocol successful
12/14 07:16:23 Changing state: Unclaimed -> Matched
12/14 07:18:23 State change: match timed out
12/14 07:18:23 Changing state: Matched -> Owner
12/14 07:18:23 State change: IS_OWNER is false
12/14 07:18:23 Changing state: Owner -> Unclaimed
12/14 07:21:45 DaemonCore: Command received via TCP from host <10.10.23.135:1637>
12/14 07:21:45 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
12/14 07:21:45 Error: can't find resource with capability (<10.10.23.122:1622>#2674321328)
12/14 07:21:48 DaemonCore: Command received via UDP from host <10.10.23.135:1687>
12/14 07:21:48 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
12/14 07:21:48 Error: can't find resource with capability (<10.10.23.122:1622>#2674321328)
12/14 07:32:23 DaemonCore: Command received via UDP from host <10.10.6.33:33734>
12/14 07:32:23 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
12/14 07:32:23 match_info called
12/14 07:32:23 Received match <10.10.23.122:1622>#2231421672
12/14 07:32:23 State change: match notification protocol successful
12/14 07:32:23 Changing state: Unclaimed -> Matched
12/14 07:34:23 State change: match timed out
12/14 07:34:23 Changing state: Matched -> Owner
12/14 07:34:23 State change: IS_OWNER is false
12/14 07:34:23 Changing state: Owner -> Unclaimed
12/14 07:37:58 DaemonCore: Command received via TCP from host <10.10.23.135:1717>
12/14 07:37:58 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
12/14 07:37:58 Error: can't find resource with capability (<10.10.23.122:1622>#2231421672)
12/14 07:38:01 DaemonCore: Command received via UDP from host <10.10.23.135:1767>
12/14 07:38:01 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
12/14 07:38:01 Error: can't find resource with capability (<10.10.23.122:1622>#2231421672)
Thanks, Bob.






-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Gabriel Mateescu
Sent: Wednesday, December 07, 2005 9:58 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] jobs won't run: MY.Rank > MY.CurrentRank



Hi,

The changes in the IP addresses need to be reflected in the HOSTALLOW_* entries in the condor_config file on the central manager. The central manager runs the negotiator and collector daemons, and the collector will only accept requests from machines listed in the HOSTALLOW_* settings.
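As a sketch (in Condor 6.6 the macros are spelled HOSTALLOW_READ and HOSTALLOW_WRITE, and the host patterns below are placeholders, not your actual domain), the central manager's condor_config might contain something like:

    HOSTALLOW_READ  = *.your.domain
    HOSTALLOW_WRITE = *.your.domain

followed by condor_reconfig on the central manager so the collector picks up the new list.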

Additionally, on the submission machine, the job log file and the schedd daemon log file may be helpful.

Gabriel



There have been a number of changes in the IP addresses in the past few weeks. These changes were made and the latest version of Condor installed (6.6.10). The machines did then accept at least one job before entering the unclaimed/idle state. I will try to access the log files on the server and trace activity for one of these machines. It certainly could be related to that (in fact we are suspicious of this network change but are not sure how to trace it or fix it ... one option is to stop all machines, including the master, and restart everything).




-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Gabriel Mateescu
Sent: Wednesday, December 07, 2005 8:19 PM
To: Condor-Users Mail List
Cc: Condor-Users Mail List
Subject: Re: [Condor-users] jobs won't run: MY.Rank > MY.CurrentRank
We have a similar problem (not as many machines), but many seem to get stuck in the unclaimed/idle state and will not run jobs. Running analyze shows 'reject the job for unknown reasons' for these machines. They ran jobs yesterday for a while but no longer will.

Bob Orchard

Hi,

Did something in the environment change, such
as IP addresses or host names?

When "analyze" does not give helpful information,
there are additional places to check:

1. the job log file;
2. the sched daemon log file
3. the negotiator daemon log file.
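If it helps, the locations of those daemon logs can be looked up with condor_config_val on the relevant machine (the paths returned are just whatever your configuration defines):

    condor_config_val LOG
    condor_config_val SCHEDD_LOG
    condor_config_val NEGOTIATOR_LOG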

Gabriel

_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users
