
[Condor-users] condor_submit hangs when Queue > 1



Hi,

When I submit a vanilla job with a submit file like this

Executable = /bin/sleep
Arguments  = 30
Universe   = vanilla
output     = runner.out
error      = runner.error
Log        = runner.log
Queue 1

everything is fine. However, when I change the last line to 'Queue 2' or
any other number larger than 1, the job is never submitted:
condor_submit hangs. strace shows that it is waiting to read from a
socket, and SchedLog has this:

12/29/10 21:10:41 (pid:12071) condor_read(): timeout reading 5 bytes from <10.0.0.1:50781>.
12/29/10 21:10:41 (pid:12071) IO: Failed to read packet header
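
For anyone trying to match these entries against their own SchedLog, here is a minimal sketch in Python (not part of Condor) that pulls out the condor_read() timeout lines and the peer address they refer to. The regular expression is an assumption based only on the log lines quoted in this message; other Condor versions may format them differently, and the function name find_read_timeouts is made up for illustration.

```python
import re

# Pattern assumed from the SchedLog lines quoted in this message; the
# "(pid:N)" part is optional because it is absent in the D_FULLDEBUG log.
TIMEOUT_RE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?:\(pid:(?P<pid>\d+)\) )?"
    r"condor_read\(\): timeout reading (?P<nbytes>\d+) bytes "
    r"from <(?P<peer>[^>]+)>"
)


def find_read_timeouts(lines):
    """Return (timestamp, pid or None, byte count, peer) per timeout line."""
    hits = []
    for line in lines:
        m = TIMEOUT_RE.match(line)
        if m:
            pid = int(m.group("pid")) if m.group("pid") else None
            hits.append(
                (m.group("stamp"), pid, int(m.group("nbytes")), m.group("peer"))
            )
    return hits


sample = [
    "12/29/10 21:10:41 (pid:12071) condor_read(): timeout reading 5 bytes from <10.0.0.1:50781>.",
    "12/29/10 21:10:41 (pid:12071) IO: Failed to read packet header",
]

for hit in find_read_timeouts(sample):
    print(hit)
```

The peer address it reports (10.0.0.1:50781 here) is the client end of the connection, which can then be looked up in netstat to see which process owns it.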

It seems that Schedd cannot talk to condor_submit:

mih@head1 ~/debian/condor % sudo netstat -anp |grep 53992
tcp        0      0 0.0.0.0:53992           0.0.0.0:*               LISTEN      12071/condor_schedd
tcp        0      0 10.0.0.1:57682          10.0.0.1:53992          ESTABLISHED 12072/condor_negoti
tcp        0      0 10.0.0.1:53992          10.0.0.1:57682          ESTABLISHED 12071/condor_schedd
tcp        1      0 10.0.0.1:50781          10.0.0.1:53992          CLOSE_WAIT  20398/condor_submit
udp        0      0 0.0.0.0:53992           0.0.0.0:*                           12071/condor_schedd
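
The line of interest is the one in CLOSE_WAIT: in TCP that state means the remote end (here the schedd on port 53992) has already closed the connection, while the local process (condor_submit) has not yet closed its socket. A rough sketch in Python for picking such sockets out of netstat output follows; it assumes the column layout of 'netstat -anp' shown above, and close_wait_sockets is a name made up for this example.

```python
def close_wait_sockets(netstat_lines):
    """Return (local, remote, pid/program) for TCP sockets in CLOSE_WAIT."""
    found = []
    for line in netstat_lines:
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local, remote, state, pid/program
        if (
            len(fields) >= 7
            and fields[0].startswith("tcp")
            and fields[5] == "CLOSE_WAIT"
        ):
            found.append((fields[3], fields[4], fields[6]))
    return found


netstat_sample = [
    "tcp        0      0 0.0.0.0:53992           0.0.0.0:*               LISTEN      12071/condor_schedd",
    "tcp        0      0 10.0.0.1:53992          10.0.0.1:57682          ESTABLISHED 12071/condor_schedd",
    "tcp        1      0 10.0.0.1:50781          10.0.0.1:53992          CLOSE_WAIT  20398/condor_submit",
    "udp        0      0 0.0.0.0:53992           0.0.0.0:*                           12071/condor_schedd",
]

for sock in close_wait_sockets(netstat_sample):
    print(sock)
```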

Enabling some debugging in condor_submit doesn't shed more light:

mih@head1 ~/debian/condor % _TOOL_DEBUG=D_ALL ; condor_submit -debug job
12/29/10 21:10:21 Can't find CondorPlatform in classad for schedd head1.xxxxx.xxxxxxxxx.xxx
Submitting job(s).
[hangs]

Submission happens on the central manager of the pool -- deviation from
the default configuration is fairly minimal:

DAEMON_LIST = MASTER, SCHEDD, COLLECTOR, NEGOTIATOR
UID_DOMAIN = xxxxx.xxxxxxxxx.xxx
FILESYSTEM_DOMAIN = xxxxx.xxxxxxxxx.xxx
ALLOW_WRITE = *xxxxx.xxxxxxxxx.xxx
NETWORK_INTERFACE = 10.0.0.1

Enabling D_FULLDEBUG for SCHEDD doesn't add much more:

12/29/10 21:21:56 Adding to resolved authorization table: mih@xxxxxxxxxxxxxxxxxxx/10.0.0.1: WRITE
12/29/10 21:21:56 Received TCP command 1112 (QMGMT_WRITE_CMD) from mih@xxxxxxxxxxxxxxxxxxx <10.0.0.1:38560>, access level WRITE
12/29/10 21:21:56 OwnerCheck retval 1 (success),no ad
12/29/10 21:21:56 OwnerCheck retval 1 (success),no ad
12/29/10 21:21:56 OwnerCheck retval 1 (success),no ad
12/29/10 21:22:16 condor_read(): timeout reading 5 bytes from <10.0.0.1:38560>.
12/29/10 21:22:16 IO: Failed to read packet header
12/29/10 21:22:16 QMGR Connection closed


I'd be glad if somebody could point me to the problem.

Thanks in advance,

Michael

-- 
Michael Hanke
http://mih.voxindeserto.de