
[HTCondor-users] Need help for job disconnection and reconnection failure! Argent...



I submitted jobs to my cluster, but none of them can run because they all get disconnected. Here is my Condor version (I am using Rocks to manage my cluster):
[kyle@imagegrid ~]$ condor_version
$CondorVersion: 7.8.5 Oct 09 2012 BuildID: 68720 $
$CondorPlatform: x86_64_rhap_6.3 $
[kyle@imagegrid ~]$ condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot10@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:05
slot11@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:06
slot12@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:07
slot13@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:08
slot14@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:09
slot15@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:10
slot16@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:03
slot1@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:00:04
slot2@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:00:05
slot3@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:00:06
slot4@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:00:06
slot5@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.020   499  0+00:25:08
slot6@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:09
slot7@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:10
slot8@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:03
slot9@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:04
slot10@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:06
slot11@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:07
slot12@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:08
slot13@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:09
slot14@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:10
slot15@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:11
slot16@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:04
slot1@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:14:41
slot2@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:06
slot3@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:07
slot4@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:08
slot5@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:09
slot6@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:10
slot7@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:11
slot8@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:04
slot9@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:05
                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX    32     0       0        32       0          0        0
               Total    32     0       0        32       0          0        0
[kyle@imagegrid ~]$ condor_q
-- Submitter: imagegrid.otitan.com : <192.168.1.100:40073> : imagegrid.otitan.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD              
   2.0   kyle            5/14 23:24   0+00:00:00 I  0   0.0  showpwd.sh       
   2.1   kyle            5/14 23:24   0+00:00:08 I  0   0.0  showpwd.sh       
   2.2   kyle            5/14 23:24   0+00:00:17 I  0   0.0  showpwd.sh       
   2.3   kyle            5/14 23:24   0+00:00:01 I  0   0.0  showpwd.sh       
4 jobs; 0 completed, 0 removed, 4 idle, 0 running, 0 held, 0 suspended
 
The log content of my job is:
[kyle@imagegrid ~]$ cat showpwd.log
000 (002.000.000) 05/14 23:24:57 Job submitted from host: <192.168.1.100:40073>
...
000 (002.001.000) 05/14 23:24:57 Job submitted from host: <192.168.1.100:40073>
...
000 (002.002.000) 05/14 23:24:57 Job submitted from host: <192.168.1.100:40073>
...
000 (002.003.000) 05/14 23:24:57 Job submitted from host: <192.168.1.100:40073>
...
022 (002.000.000) 05/14 23:24:57 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot1@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
024 (002.000.000) 05/14 23:24:57 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot1@xxxxxxxxxxxxxxx, rescheduling job
...
022 (002.001.000) 05/14 23:24:57 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot2@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
024 (002.001.000) 05/14 23:24:57 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot2@xxxxxxxxxxxxxxx, rescheduling job
...
022 (002.002.000) 05/14 23:24:58 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot3@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
022 (002.003.000) 05/14 23:24:58 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot4@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
024 (002.003.000) 05/14 23:24:58 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot4@xxxxxxxxxxxxxxx, rescheduling job
...
024 (002.002.000) 05/14 23:25:06 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot3@xxxxxxxxxxxxxxx, rescheduling job
...
022 (002.000.000) 05/14 23:26:58 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot1@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
024 (002.000.000) 05/14 23:26:58 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot1@xxxxxxxxxxxxxxx, rescheduling job
...
022 (002.001.000) 05/14 23:26:58 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot2@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
022 (002.002.000) 05/14 23:26:58 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot3@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
022 (002.003.000) 05/14 23:26:58 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot4@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
024 (002.003.000) 05/14 23:26:58 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot4@xxxxxxxxxxxxxxx, rescheduling job
...
024 (002.001.000) 05/14 23:27:06 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot2@xxxxxxxxxxxxxxx, rescheduling job
...
024 (002.002.000) 05/14 23:27:06 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot3@xxxxxxxxxxxxxxx, rescheduling job
...
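
To make the repeating pattern in the log above easier to see, the events can be tallied per job. This is only a minimal Python sketch, assuming the event log keeps the layout shown above (a three-digit event code, the job ID in parentheses, then date, time and message). The names summarize_events, EVENT_RE and SAMPLE are made up for illustration and are not part of HTCondor.

```python
import re
from collections import Counter, defaultdict

# Header line of an event, for example:
# "022 (002.000.000) 05/14 23:24:57 Job disconnected, attempting to reconnect"
# Indented detail lines and the "..." separators do not match and are skipped.
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) \S+ \S+ (.*)$")


def summarize_events(log_text):
    """Count event codes per job (cluster.proc) in a job event log."""
    counts = defaultdict(Counter)
    for line in log_text.splitlines():
        match = EVENT_RE.match(line)
        if match:
            code, cluster, proc, _message = match.groups()
            job_id = f"{int(cluster)}.{int(proc)}"
            counts[job_id][code] += 1
    return counts


# A short excerpt copied from the log above, used as sample input.
SAMPLE = """\
000 (002.000.000) 05/14 23:24:57 Job submitted from host: <192.168.1.100:40073>
...
022 (002.000.000) 05/14 23:24:57 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot1@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
024 (002.000.000) 05/14 23:24:57 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot1@xxxxxxxxxxxxxxx, rescheduling job
...
"""

if __name__ == "__main__":
    for job_id, codes in sorted(summarize_events(SAMPLE).items()):
        print(job_id, dict(codes))
```

Run against the full showpwd.log, this would show how many 022 (disconnected) and 024 (reconnection failed) events each job has accumulated.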
 
I can see that after submission some slots become Claimed, but after a few seconds they go back to Unclaimed.
Here is my local configuration (generated by Rocks):
 
ALLOW_WRITE = $(HOSTALLOW_WRITE)
AMAZON_GAHP = $(SBIN)/amazon_gahp
AMAZON_GAHP_LOG = /tmp/AmazonGahpLog.$(USERNAME)
COLLECTOR_NAME = Collector at imagegrid.otitan.com
COLLECTOR_SOCKET_CACHE_SIZE = 1000
CONDOR_ADMIN = condor@xxxxxxxxxxxxxxxxxxxx
CONDOR_DEVELOPERS = NONE
CONDOR_DEVELOPERS_COLLECTOR = NONE
CONDOR_HOST = imagegrid.otitan.com
CONDOR_IDS = 407.500
CONDOR_SSHD = /usr/sbin/sshd
CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
CONTINUE = True
DAEMON_LIST = MASTER, STARTD
EMAIL_DOMAIN = $(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = otitan.com
HIGHPORT = 50000
HOSTALLOW_WRITE = imagegrid.otitan.com, *.local, *.local
JAVA = /usr/bin/java
KILL = False
LOCAL_DIR = /var/opt/condor
LOCK = /tmp/condor-lock.$(HOSTNAME)
LOWPORT = 40000
MAIL = /bin/mail
NEGOTIATOR_INTERVAL = 120
NETWORK_INTERFACE = 10.255.255.254
PREEMPT = False
RANK = None
RELEASE_DIR = /opt/condor
SOAP_SSL_CA_FILE = /etc/pki/tls/cert.pem
START = True
STARTD_EXPRS = $(STARTD_EXPRS)
SUSPEND = False
UID_DOMAIN = local
UPDATE_COLLECTOR_WITH_TCP = True
WANT_SUSPEND = False
WANT_VACATE = False
# First set JAVA_MAXHEAP_ARGUMENT to null, to disable the default of max RAM
JAVA_MAXHEAP_ARGUMENT =
JAVA_EXTRA_ARGUMENTS = -Xmx1906m
Can someone help me? Thanks!