Re: [HTCondor-users] Need help for job disconnection and reconnection failure! Argent...



On Tue, May 14, 2013 at 3:56 AM, 钱晓明 <kyleqian@xxxxxxxxx> wrote:
> I submitted jobs to my cluster, but none of them can run because they all
> get disconnected. Here is my Condor version (I am using Rocks to manage my
> cluster):
> [kyle@imagegrid ~]$ condor_version
> $CondorVersion: 7.8.5 Oct 09 2012 BuildID: 68720 $
> $CondorPlatform: x86_64_rhap_6.3 $
> [kyle@imagegrid ~]$ condor_status
> Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
> slot10@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:05
> slot11@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:06
> slot12@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:07
> slot13@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:08
> slot14@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:09
> slot15@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:10
> slot16@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:03
> slot1@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:00:04
> slot2@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:00:05
> slot3@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:00:06
> slot4@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:00:06
> slot5@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.020   499  0+00:25:08
> slot6@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:09
> slot7@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:10
> slot8@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:03
> slot9@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:25:04
> slot10@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:06
> slot11@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:07
> slot12@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:08
> slot13@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:09
> slot14@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:10
> slot15@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:11
> slot16@xxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:04
> slot1@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:14:41
> slot2@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:06
> slot3@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:07
> slot4@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:08
> slot5@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:09
> slot6@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:10
> slot7@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:11
> slot8@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:04
> slot9@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   499  0+00:15:05
>                      Total Owner Claimed Unclaimed Matched Preempting Backfill
>         X86_64/LINUX    32     0       0        32       0          0        0
>                Total    32     0       0        32       0          0        0
> [kyle@imagegrid ~]$ condor_q
> -- Submitter: imagegrid.otitan.com : <192.168.1.100:40073> : imagegrid.otitan.com
>  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
>    2.0   kyle            5/14 23:24   0+00:00:00 I  0   0.0  showpwd.sh
>    2.1   kyle            5/14 23:24   0+00:00:08 I  0   0.0  showpwd.sh
>    2.2   kyle            5/14 23:24   0+00:00:17 I  0   0.0  showpwd.sh
>    2.3   kyle            5/14 23:24   0+00:00:01 I  0   0.0  showpwd.sh
> 4 jobs; 0 completed, 0 removed, 4 idle, 0 running, 0 held, 0 suspended
>
> The log content of my job is:
> [kyle@imagegrid ~]$ cat showpwd.log
> 000 (002.000.000) 05/14 23:24:57 Job submitted from host: <192.168.1.100:40073>
> ...
> 000 (002.001.000) 05/14 23:24:57 Job submitted from host: <192.168.1.100:40073>
> ...
> 000 (002.002.000) 05/14 23:24:57 Job submitted from host: <192.168.1.100:40073>
> ...
> 000 (002.003.000) 05/14 23:24:57 Job submitted from host: <192.168.1.100:40073>
> ...
> 022 (002.000.000) 05/14 23:24:57 Job disconnected, attempting to reconnect
>     Socket between submit and execute hosts closed unexpectedly
>     Trying to reconnect to slot1@xxxxxxxxxxxxxxx <10.255.255.254:45256>
> ...
> 024 (002.000.000) 05/14 23:24:57 Job reconnection failed
>     Job not found at execution machine
>     Can not reconnect to slot1@xxxxxxxxxxxxxxx, rescheduling job
> ...
> 022 (002.001.000) 05/14 23:24:57 Job disconnected, attempting to reconnect
>     Socket between submit and execute hosts closed unexpectedly
>     Trying to reconnect to slot2@xxxxxxxxxxxxxxx <10.255.255.254:45256>
> ...
> 024 (002.001.000) 05/14 23:24:57 Job reconnection failed
>     Job not found at execution machine
>     Can not reconnect to slot2@xxxxxxxxxxxxxxx, rescheduling job
> ...
> 022 (002.002.000) 05/14 23:24:58 Job disconnected, attempting to reconnect
>     Socket between submit and execute hosts closed unexpectedly
>     Trying to reconnect to slot3@xxxxxxxxxxxxxxx <10.255.255.254:45256>
> ...
> 022 (002.003.000) 05/14 23:24:58 Job disconnected, attempting to reconnect
>     Socket between submit and execute hosts closed unexpectedly
>     Trying to reconnect to slot4@xxxxxxxxxxxxxxx <10.255.255.254:45256>
> ...
> 024 (002.003.000) 05/14 23:24:58 Job reconnection failed
>     Job not found at execution machine
>     Can not reconnect to slot4@xxxxxxxxxxxxxxx, rescheduling job
> ...
> 024 (002.002.000) 05/14 23:25:06 Job reconnection failed
>     Job not found at execution machine
>     Can not reconnect to slot3@xxxxxxxxxxxxxxx, rescheduling job
> ...
> 022 (002.000.000) 05/14 23:26:58 Job disconnected, attempting to reconnect
>     Socket between submit and execute hosts closed unexpectedly
>     Trying to reconnect to slot1@xxxxxxxxxxxxxxx <10.255.255.254:45256>
> ...
> 024 (002.000.000) 05/14 23:26:58 Job reconnection failed
>     Job not found at execution machine
>     Can not reconnect to slot1@xxxxxxxxxxxxxxx, rescheduling job
> ...
> 022 (002.001.000) 05/14 23:26:58 Job disconnected, attempting to reconnect
>     Socket between submit and execute hosts closed unexpectedly
>     Trying to reconnect to slot2@xxxxxxxxxxxxxxx <10.255.255.254:45256>
> ...
> 022 (002.002.000) 05/14 23:26:58 Job disconnected, attempting to reconnect
>     Socket between submit and execute hosts closed unexpectedly
>     Trying to reconnect to slot3@xxxxxxxxxxxxxxx <10.255.255.254:45256>
> ...
> 022 (002.003.000) 05/14 23:26:58 Job disconnected, attempting to reconnect
>     Socket between submit and execute hosts closed unexpectedly
>     Trying to reconnect to slot4@xxxxxxxxxxxxxxx <10.255.255.254:45256>
> ...
> 024 (002.003.000) 05/14 23:26:58 Job reconnection failed
>     Job not found at execution machine
>     Can not reconnect to slot4@xxxxxxxxxxxxxxx, rescheduling job
> ...
> 024 (002.001.000) 05/14 23:27:06 Job reconnection failed
>     Job not found at execution machine
>     Can not reconnect to slot2@xxxxxxxxxxxxxxx, rescheduling job
> ...
> 024 (002.002.000) 05/14 23:27:06 Job reconnection failed
>     Job not found at execution machine
>     Can not reconnect to slot3@xxxxxxxxxxxxxxx, rescheduling job
> ...
>
> I can see that after submission some slots become Claimed, but after a few
> seconds they become Unclaimed again.
> Here is my local configuration (generated by Rocks):
>
> ALLOW_WRITE = $(HOSTALLOW_WRITE)
> AMAZON_GAHP = $(SBIN)/amazon_gahp
> AMAZON_GAHP_LOG = /tmp/AmazonGahpLog.$(USERNAME)
> COLLECTOR_NAME = Collector at imagegrid.otitan.com
> COLLECTOR_SOCKET_CACHE_SIZE = 1000
> CONDOR_ADMIN = condor@xxxxxxxxxxxxxxxxxxxx
> CONDOR_DEVELOPERS = NONE
> CONDOR_DEVELOPERS_COLLECTOR = NONE
> CONDOR_HOST = imagegrid.otitan.com
> CONDOR_IDS = 407.500
> CONDOR_SSHD = /usr/sbin/sshd
> CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
> CONTINUE = True
> DAEMON_LIST = MASTER, STARTD
> EMAIL_DOMAIN = $(FULL_HOSTNAME)
> FILESYSTEM_DOMAIN = otitan.com
> HIGHPORT = 50000
> HOSTALLOW_WRITE = imagegrid.otitan.com, *.local, *.local
> JAVA = /usr/bin/java
> KILL = False
> LOCAL_DIR = /var/opt/condor
> LOCK = /tmp/condor-lock.$(HOSTNAME)
> LOWPORT = 40000
> MAIL = /bin/mail
> NEGOTIATOR_INTERVAL = 120
> NETWORK_INTERFACE = 10.255.255.254
> PREEMPT = False
> RANK = None
> RELEASE_DIR = /opt/condor
> SOAP_SSL_CA_FILE = /etc/pki/tls/cert.pem
> START = True
> STARTD_EXPRS = $(STARTD_EXPRS)
> SUSPEND = False
> UID_DOMAIN = local
> UPDATE_COLLECTOR_WITH_TCP = True
> WANT_SUSPEND = False
> WANT_VACATE = False
> # First set JAVA_MAXHEAP_ARGUMENT to null, to disable the default of max RAM
> JAVA_MAXHEAP_ARGUMENT =
> JAVA_EXTRA_ARGUMENTS = -Xmx1906m
> Can someone help me? Thanks!

Two things to check:

- Did you enable file transfer in your submit file? Please send it so
we can take a look.
- Is the ALLOW_WRITE parameter set correctly? It has to grant write
access to the network your servers are on.
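
For reference, here is a rough sketch of what I mean by both points. It
is only an illustration, not a confirmed fix. The executable and log
names come from your output above; the output/error file names and the
address patterns in ALLOW_WRITE are my assumptions, so adjust them to
your site.

```
# Submit description file (the file name is up to you).
# should_transfer_files / when_to_transfer_output turn on HTCondor's
# file transfer, so the job does not depend on a shared filesystem.
universe                = vanilla
executable              = showpwd.sh
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = showpwd.$(Process).out   # assumed name
error                   = showpwd.$(Process).err   # assumed name
log                     = showpwd.log
queue 4

# Local configuration file on every node.
# Assumption: the submit side (192.168.1.100) and the execute side
# (10.255.255.254), as seen in your job log, both need write access.
# The two address patterns below are guesses based on those addresses.
ALLOW_WRITE = imagegrid.otitan.com, *.local, 192.168.1.*, 10.*
```

After changing the configuration, running condor_reconfig should make
the daemons pick it up, and "condor_config_val ALLOW_WRITE" shows the
value actually in effect on a given machine.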

--
Diego Bello Carreño