Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [Condor-users] Job run time limit ?

Date: Fri, 1 Oct 2004 15:22:53 +0200
From: Jérôme Jaglale <Jerome.jaglale@xxxxxxxxxxxxxxxx>
Subject: Re: [Condor-users] Job run time limit ?

Hello,

We still have this problem: Condor jobs stop after running for a few days.

In the Shadow log:
9/12 09:12:10 (44.7) (10025): ERROR "Can no longer talk to condor_starter on execute machine (192.168.1.23)" at line 63 in file NTreceivers.C
9/12 09:12:57 (44.4) (10013): ERROR "Can no longer talk to condor_starter on execute machine (192.168.1.22)" at line 63 in file NTreceivers.C
9/12 09:13:04 (44.6) (10023): ERROR "Can no longer talk to condor_starter on execute machine (192.168.1.23)" at line 63 in file NTreceivers.C

In the StartLog of the execution machine: it received a "RELEASE_CLAIM" command (what is it?) from the central manager, and afterwards lost the connection with it.
Any idea about that error? It prevents us from launching really big simulations.

9/12 09:10:40 DaemonCore: Command received via UDP from host <172.18.45.80:64684>
9/12 09:10:40 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
9/12 09:10:40 vm1: State change: received RELEASE_CLAIM command
9/12 09:10:40 vm1: Changing state and activity: Claimed/Busy -> Preempting/Vacating
9/12 09:10:40 Can't connect to <192.168.1.15:49252>:0, errno = 61
9/12 09:10:40 Will keep trying for 10 seconds...
9/12 09:10:50 Connect failed for 10 seconds; returning FALSE
9/12 09:10:50 ERROR:
SECMAN:2003:TCP connection to <192.168.1.15:49252> failed

9/12 09:10:50 Send_Signal: ERROR Connect to <192.168.1.15:49252> failed.
9/12 09:10:50 vm1: Error sending signal to starter, errno = 22 (Unknown error: 0)
9/12 09:10:52 vm1: State change: Error sending signals to starter
9/12 09:10:52 vm1: Changing state and activity: Preempting/Vacating -> Owner/Idle
9/12 09:10:52 vm1: State change: IS_OWNER is false
9/12 09:10:52 vm1: Changing state: Owner -> Unclaimed
9/12 09:10:53 State change: RunBenchmarks is TRUE
9/12 09:10:53 vm1: Changing activity: Idle -> Benchmarking
9/12 09:10:57 State change: benchmarks completed
9/12 09:10:57 vm1: Changing activity: Benchmarking -> Idle
9/12 09:10:57 DaemonCore: Command received via UDP from host <172.18.45.80:64685>
9/12 09:10:57 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
9/12 09:10:57 Error: can't find resource with capability (<192.168.1.15:49234>#9327191778)
9/12 09:10:57 Starter pid 454 died on signal 10 (signal 10)
9/12 09:14:33 Starter pid 490 died on signal 10 (signal 10)
9/12 09:14:35 vm2: State change: starter exited
9/12 09:14:36 vm2: Changing activity: Busy -> Idle
9/12 09:14:36 DaemonCore: Command received via UDP from host <172.18.45.80:64725>
9/12 09:14:36 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
9/12 09:14:36 vm2: State change: received RELEASE_CLAIM command
9/12 09:14:36 vm2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
9/12 09:14:36 vm2: State change: No preempting claim, returning to owner
9/12 09:14:36 vm2: Changing state and activity: Preempting/Vacating -> Owner/Idle
9/12 09:14:36 vm2: State change: IS_OWNER is false
9/12 09:14:36 vm2: Changing state: Owner -> Unclaimed
9/12 09:14:36 DaemonCore: Command received via UDP from host <172.18.45.80:64726>
9/12 09:14:36 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
9/12 09:14:36 Error: can't find resource with capability (<192.168.1.15:
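
To make the sequence above easier to follow, here is a minimal Python sketch that pulls only the "Changing state/activity" lines out of a StartLog excerpt. It assumes the line layout shown in the log above; the names `TRANSITION` and `transitions` are made up for this example and are not part of Condor.

```python
import re

# Matches lines such as:
#   9/12 09:10:40 vm1: Changing state and activity: Claimed/Busy -> Preempting/Vacating
# The layout is assumed from the StartLog excerpt quoted in this message.
TRANSITION = re.compile(
    r"^(?P<date>\d+/\d+) (?P<time>\d+:\d+:\d+) (?P<vm>vm\d+): "
    r"Changing (?P<kind>state and activity|state|activity): "
    r"(?P<old>\S+) -> (?P<new>\S+)$"
)

def transitions(lines):
    """Return (date, time, vm, kind, old, new) tuples for each transition line."""
    found = []
    for line in lines:
        m = TRANSITION.match(line.strip())
        if m:
            found.append(m.group("date", "time", "vm", "kind", "old", "new"))
    return found

# A few lines copied from the log above.
sample = [
    "9/12 09:10:40 vm1: State change: received RELEASE_CLAIM command",
    "9/12 09:10:40 vm1: Changing state and activity: Claimed/Busy -> Preempting/Vacating",
    "9/12 09:10:52 vm1: Changing state and activity: Preempting/Vacating -> Owner/Idle",
    "9/12 09:10:52 vm1: Changing state: Owner -> Unclaimed",
    "9/12 09:10:53 vm1: Changing activity: Idle -> Benchmarking",
]

for t in transitions(sample):
    print(*t)
```

Lines that are not transitions (for example the "State change:" lines) are simply skipped, so the output is a compact per-VM timeline.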

Our execution machine configuration file:

HOSTALLOW_READ = 192.168.1.*, 172.18.45.*, 10.0.1.*
HOSTALLOW_WRITE = 192.168.1.*, 172.18.45.*, 10.0.1.*
NETWORK_INTERFACE = $(ADRESSE_CARTE)
MAIL = /usr/bin/mail
NEGOTIATOR = $(SBIN)/condor_negotiator
START = TRUE
LOCAL_DIR = /Users/cluster/projets/condor-6.6.5/local.E6-G5
VACATE =
UID_DOMAIN = $(ADRESSE_CARTE)
RELEASE_DIR = /Users/cluster/projets/condor-6.6.5
JAVA_MAXHEAP_ARGUMENT =
CONDOR_IDS = 503.503
DAEMON_LIST = MASTER,STARTD
COLLECTOR_NAME = cluster-collector
COLLECTOR = $(SBIN)/condor_collector
PREEMPT = false
LOCK = /tmp/condor-lock.E6-G50.472081086455972
StartIdleTime = 0
FILESYSTEM_DOMAIN = $(ADRESSE_CARTE)
CONDOR_ADMIN = root@$(ADRESSE_CARTE)
JAVA = /usr/bin/java
SUSPEND = false
CONDOR_HOST = 172.18.45.80
MEMORY = 2048
NUM_VIRTUAL_MACHINES = 2

Our central manager configuration file:

NEGOTIATOR = $(SBIN)/condor_negotiator
RELEASE_DIR = /Users/condor/Programmes/condor-6.6.6
COLLECTOR_NAME = cluster-collector
SUSPEND = False
LOCAL_DIR = /Users/condor/Programmes/condor-6.6.6/local.$(HOSTNAME)
CONDOR_HOST = 172.18.45.80
JAVA_MAXHEAP_ARGUMENT =
CONDOR_IDS = 504.504
START = True
COLLECTOR = $(SBIN)/condor_collector
DAEMON_LIST = MASTER,COLLECTOR,NEGOTIATOR,SCHEDD
UID_DOMAIN = 172.18.45.80
NETWORK_INTERFACE = 172.18.45.80
JAVA = /usr/bin/java
PREEMPT = False
FILESYSTEM_DOMAIN = 172.18.45.80
CONDOR_ADMIN = root@xxxxxxxxxxxx
MAIL = /usr/bin/mail
LOCK = /tmp/condor-lock.$(HOSTNAME)0.267434544951783
DEFAULT_UNIVERSE = vanilla
HOSTALLOW_READ = 192.168.1.*, 172.18.45.*
HOSTALLOW_WRITE = 192.168.1.*, 172.18.45.*
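
Since the two files are long, a small Python sketch like the following can list the settings that differ between them. It is only an illustration: `parse_config` and `diff_configs` are hypothetical helpers, the parser handles nothing beyond plain `NAME = value` lines, and comparing names and values without regard to case is an assumption made here so that `false` and `False` are not reported as a difference.

```python
def parse_config(text):
    """Parse plain 'NAME = value' lines into a dict; values may be empty."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        # Assumption: names are compared case-insensitively.
        settings[name.strip().upper()] = value.strip()
    return settings

def diff_configs(a, b):
    """Return {name: (value_in_a, value_in_b)} for settings that differ.

    A setting missing from one side is reported with None on that side.
    """
    result = {}
    for name in sorted(set(a) | set(b)):
        if (name in a) != (name in b) or a[name].lower() != b[name].lower():
            result[name] = (a.get(name), b.get(name))
    return result

# A few lines taken from the two files above.
execute = """\
HOSTALLOW_READ = 192.168.1.*, 172.18.45.*, 10.0.1.*
PREEMPT = false
CONDOR_HOST = 172.18.45.80
NUM_VIRTUAL_MACHINES = 2
"""
manager = """\
HOSTALLOW_READ = 192.168.1.*, 172.18.45.*
PREEMPT = False
CONDOR_HOST = 172.18.45.80
DEFAULT_UNIVERSE = vanilla
"""

for name, (ours, theirs) in diff_configs(parse_config(execute), parse_config(manager)).items():
    print(f"{name}: execute={ours!r} manager={theirs!r}")
```

With the sample lines this reports `DEFAULT_UNIVERSE`, `HOSTALLOW_READ` and `NUM_VIRTUAL_MACHINES`, and leaves out `PREEMPT` and `CONDOR_HOST`.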

Thanks for your help,
Jérôme Jaglale
