
[HTCondor-users] Firewalls, shared port and standard universe jobs



I have a problem running standard universe jobs on machines with a firewall enabled.

I am running Condor 8.0.3 (condor-8.0.3-174914-deb_7_amd64.deb) on Ubuntu 12.04. Shared port is configured...

SHARED_PORT_ARGS = -p 4080
USE_SHARED_PORT = TRUE
DAEMON_LIST = SHARED_PORT, $(DAEMON_LIST)

...and the firewall on the desktop machines permits UDP and TCP traffic on port 4080.
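To confirm what the firewall actually lets through, a quick TCP probe run from another machine can help. The following is only a minimal sketch: the helper name `tcp_port_open` is made up for illustration, it tests TCP only (not UDP), and it says nothing about what HTCondor itself does on the wire.

```python
import socket

def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out)
        # when the connection cannot be established.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `tcp_port_open("129.94.130.201", 4080)` should return True if the shared port is reachable, while probing a port that the firewall blocks should return False.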

Vanilla universe jobs run fine.

Standard universe jobs are always stuck in the idle state. The ShadowLog looks like this...

09/27/13 11:34:47 (?.?) (24947):******* Standard Shadow starting up *******
09/27/13 11:34:47 (?.?) (24947):** $CondorVersion: 8.0.3 Sep 19 2013 BuildID: 174914 $
09/27/13 11:34:47 (?.?) (24947):** $CondorPlatform: x86_64_Debian7 $
09/27/13 11:34:47 (?.?) (24947):*******************************************
09/27/13 11:34:47 (?.?) (24947):uid=0, euid=118, gid=0, egid=127
09/27/13 11:34:47 (?.?) (24947):Hostname = "<129.94.130.201:4080?sock=30414_70f6_4>", Job = 4.0
09/27/13 11:34:47 (4.0) (24947):Requesting Primary Starter
09/27/13 11:34:47 (4.0) (24947):Shadow: Request to run a job was ACCEPTED
09/27/13 11:34:47 (4.0) (24947):connect returns -1, errno = 113
09/27/13 11:34:47 (4.0) (24947):failed to connect to scheduler on <129.94.130.201:39910>
09/27/13 11:34:47 (4.0) (24947):Shadow: DoCleanup: unlinking TmpCkpt '/var/lib/condor/spool/4/0/cluster4.proc0.subproc0.tmp'
09/27/13 11:34:47 (4.0) (24947):Trying to unlink /var/lib/condor/spool/4/0/cluster4.proc0.subproc0.tmp
09/27/13 11:34:47 (4.0) (24947):Can't get address for checkpoint server host (NULL): Success
09/27/13 11:34:47 (4.0) (24947):********** Shadow Exiting(108) **********

I was a little confused to see this line...

09/27/13 11:34:47 (4.0) (24947):failed to connect to scheduler on <129.94.130.201:39910>

...doesn't that imply that shared port is not being used?
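To make the comparison concrete, the addresses in the log can be pulled out and checked for a `sock=` parameter. This is a rough sketch that assumes a `sock=` parameter in the address marks a connection routed through the shared port daemon. The regular expression and the `find_addresses` helper are illustrative only, not part of HTCondor.

```python
import re

# Matches addresses of the form <host:port> or <host:port?params>.
SINFUL_RE = re.compile(
    r"<(?P<host>[^:<>]+):(?P<port>\d+)(?:\?(?P<params>[^<>]*))?>"
)

def find_addresses(log_text):
    """Return (host, port, has_sock_param) for each address in log_text."""
    results = []
    for m in SINFUL_RE.finditer(log_text):
        params = m.group("params") or ""
        # Collect the parameter names, e.g. {"sock"} for "?sock=30414_70f6_4".
        keys = {p.split("=", 1)[0] for p in params.split("&") if p}
        results.append((m.group("host"), int(m.group("port")), "sock" in keys))
    return results

sample = '''Hostname = "<129.94.130.201:4080?sock=30414_70f6_4>", Job = 4.0
failed to connect to scheduler on <129.94.130.201:39910>'''

for host, port, shared in find_addresses(sample):
    print(host, port, "shared port" if shared else "direct port")
```

On the two lines above this reports port 4080 as a shared port address and port 39910 as a direct one, which is what prompted the question.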

When the firewall is disabled, the standard universe jobs run fine...

09/27/13 11:37:48 (?.?) (25695):******* Standard Shadow starting up *******
09/27/13 11:37:48 (?.?) (25695):** $CondorVersion: 8.0.3 Sep 19 2013 BuildID: 174914 $
09/27/13 11:37:48 (?.?) (25695):** $CondorPlatform: x86_64_Debian7 $
09/27/13 11:37:48 (?.?) (25695):*******************************************
09/27/13 11:37:48 (?.?) (25695):uid=0, euid=118, gid=0, egid=127
09/27/13 11:37:48 (?.?) (25695):Hostname = "<129.94.130.201:4080?sock=30414_70f6_4>", Job = 4.0
09/27/13 11:37:48 (4.0) (25695):Requesting Primary Starter
09/27/13 11:37:48 (4.0) (25695):Shadow: Request to run a job was ACCEPTED
09/27/13 11:37:48 (4.0) (25695):Shadow: RSC_SOCK connected, fd = 17
09/27/13 11:37:48 (4.0) (25695):Shadow: CLIENT_LOG connected, fd = 18
09/27/13 11:37:48 (4.0) (25695):My_Filesystem_Domain = "maths.unsw.edu.au"
09/27/13 11:37:48 (4.0) (25695):My_UID_Domain = "unsw.edu.au"
09/27/13 11:37:48 (4.0) (25695):Can't get address for checkpoint server host (NULL): Success
09/27/13 11:37:48 (4.0) (25695):        Entering pseudo_get_file_stream
09/27/13 11:37:48 (4.0) (25695): file = "/var/lib/condor/spool/4/cluster4.ickpt.subproc0"
09/27/13 11:37:48 (4.0) (25695):        Entering pseudo_get_file_stream
09/27/13 11:37:48 (4.0) (25695): file = "/var/lib/condor/spool/4/cluster4.ickpt.subproc0"
09/27/13 11:37:48 (4.0) (25695):        Entering pseudo_get_file_stream
09/27/13 11:37:48 (4.0) (25695): file = "/var/lib/condor/spool/4/cluster4.ickpt.subproc0"

I'd be very grateful if anyone could offer advice on how to debug this further.

Many thanks

Martin