
[Condor-users] Condor 7.6 - Windows Parallel Universe problems



I noticed some problems with jobs running in the "Parallel" universe
on Windows. My cluster of Windows Server 2003 R2 machines works
without any problems if I submit jobs to the "Vanilla" universe.
The problems appear when I submit jobs to the parallel universe.
These jobs are configured identically to the ones I submitted to the
"vanilla" universe before: I just reserve a number of nodes, all
nodes except node 0 exit, and node 0 runs an executable that spawns
processes on the reserved machines. (When a job needs just one node,
I start the executable in the "vanilla" universe.) The issues are as
follows:

1. Every time I try to set a job attribute using

condor_chirp set_job_attr ...

from within my job this fails with "Error: 13 (Permission denied)". In
the ShadowLog I find the entries (related to this problem):

01/20/12 16:38:21 (80.0) (4436): SECMAN: command 1112 QMGMT_WRITE_CMD
to schedd at <10.2.10.7:3376> from TCP port 3839 (blocking).
01/20/12 16:38:21 (80.0) (4436): SECMAN: using session
sword07:3732:1327073895:22 for {<10.2.10.7:3376>,<1112>}.
01/20/12 16:38:21 (80.0) (4436): SECMAN: resume, other side is
$CondorVersion: 7.6.6 Jan 17 2012 BuildID: 401976 $, NOT
reauthenticating.
01/20/12 16:38:21 (80.0) (4436): SECMAN: successfully enabled message
authenticator!
01/20/12 16:38:21 (80.0) (4436): SECMAN: successfully enabled encryption!
01/20/12 16:38:21 (80.0) (4436): SECMAN: startCommand succeeded.
01/20/12 16:38:21 (80.0) (4436): SetEffectiveOwner(FelixWolfheimer)
failed with errno=13: Permission denied.
01/20/12 16:38:21 (80.0) (4436): QmgrJobUpdater::updateAttr: failed to
update (AppstarterCommPort = "31001"): ConnectQ() failed

In the "Vanilla" universe, setting the same job ClassAd attributes works perfectly fine.
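
For reference, the call from within the job can be wrapped so that the tool's own error text is kept when it fails. This is only a minimal sketch: the helper name chirp_set_job_attr and the injectable run parameter are illustrative and not part of Condor, and it assumes condor_chirp is on the PATH and that the job was submitted with chirp access enabled.

```python
import subprocess


def chirp_set_job_attr(name, value, run=subprocess.run):
    """Run `condor_chirp set_job_attr NAME VALUE` and raise on failure.

    `value` is passed through verbatim, so a ClassAd string must
    already carry its double quotes, e.g. '"31001"'.
    `run` is injectable (hypothetical parameter) so the helper can be
    exercised without a Condor installation.
    """
    argv = ["condor_chirp", "set_job_attr", name, value]
    result = run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        # Keep whatever the tool printed, e.g. "Error: 13 (Permission denied)".
        detail = (result.stderr or result.stdout or "").strip()
        raise RuntimeError(
            f"{' '.join(argv)} failed ({result.returncode}): {detail}"
        )
    return argv
```

Inside the job this would be called as chirp_set_job_attr("AppstarterCommPort", '"31001"'), matching the attribute shown in the ShadowLog excerpt above.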

2. The second problem seems to be related to the first one as it leads
to similar error messages in the ShadowLog. When the job finishes
(successfully!), the SetEffectiveOwner function fails again and
condor_shadow ends with error code 107 which requeues the job. This
leads to an endless loop as it happens again the next time when the
job finishes. The entries in the ShadowLog related to this problem are:

01/20/12 16:39:06 (80.0) (4436): SECMAN: command 1112 QMGMT_WRITE_CMD
to schedd at <10.2.10.7:3376> from TCP port 3898 (blocking).
01/20/12 16:39:06 (80.0) (4436): SECMAN: using session
sword07:3732:1327073895:22 for {<10.2.10.7:3376>,<1112>}.
01/20/12 16:39:06 (80.0) (4436): SECMAN: resume, other side is
$CondorVersion: 7.6.6 Jan 17 2012 BuildID: 401976 $, NOT
reauthenticating.
01/20/12 16:39:06 (80.0) (4436): SECMAN: successfully enabled message
authenticator!
01/20/12 16:39:06 (80.0) (4436): SECMAN: successfully enabled encryption!
01/20/12 16:39:06 (80.0) (4436): SECMAN: startCommand succeeded.
01/20/12 16:39:06 (80.0) (4436): SetEffectiveOwner(FelixWolfheimer)
failed with errno=13: Permission denied.
01/20/12 16:39:06 (80.0) (4436): Failed to perform final update to job queue!
01/20/12 16:39:06 (80.0) (4436): condor_read() failed: recv() returned
-1, errno = 10054 , reading 21 bytes from startd at <10.2.10.7:9619>.
01/20/12 16:39:06 (80.0) (4436): IO: Failed to read packet header
01/20/12 16:39:06 (80.0) (4436): condor_read() failed: recv() returned
-1, errno = 10054 , reading 21 bytes from startd slot1@xxxxxxxxxxxxxxx
01/20/12 16:39:06 (80.0) (4436): IO: Failed to read packet header
01/20/12 16:39:16 (80.0) (4436): DaemonCore::Wake_up_select called
from an unknown thread. windows tid = 4240
01/20/12 16:39:16 (80.0) (4436): DC_AUTHENTICATE: received DC_AUTHENTICATE from
<10.2.10.7:3907>
01/20/12 16:39:16 (80.0) (4436): DC_AUTHENTICATE: resuming session id
f914fef94f70e070eefec587c671f28a1d0e72104b533974:
01/20/12 16:39:16 (80.0) (4436): DC_AUTHENTICATE: message
authenticator enabled with key id
f914fef94f70e070eefec587c671f28a1d0e72104b533974.
01/20/12 16:39:16 (80.0) (4436): DC_AUTHENTICATE: encryption enabled
for session f914fef94f70e070eefec587c671f28a1d0e72104b533974
01/20/12 16:39:16 (80.0) (4436): DC_AUTHENTICATE: Success.
01/20/12 16:39:16 (80.0) (4436): PERMISSION GRANTED to condor@parent
from host 10.2.10.7 for command 60000 (DC_RAISESIGNAL), access level
DAEMON: reason: DAEMON authorization has been made automatic for
condor@parent
01/20/12 16:39:36 (80.0) (4436): Retrying job cleanup, calling terminateJob()
01/20/12 16:39:36 (80.0) (4436): SECMAN: command 1112 QMGMT_WRITE_CMD
to schedd at <10.2.10.7:3376> from TCP port 3928 (blocking).
01/20/12 16:39:36 (80.0) (4436): SECMAN: using session
sword07:3732:1327073895:22 for {<10.2.10.7:3376>,<1112>}.
01/20/12 16:39:36 (80.0) (4436): SECMAN: resume, other side is
$CondorVersion: 7.6.6 Jan 17 2012 BuildID: 401976 $, NOT
reauthenticating.
01/20/12 16:39:36 (80.0) (4436): SECMAN: successfully enabled message
authenticator!
01/20/12 16:39:36 (80.0) (4436): SECMAN: successfully enabled encryption!
01/20/12 16:39:36 (80.0) (4436): SECMAN: startCommand succeeded.
01/20/12 16:39:36 (80.0) (4436): SetEffectiveOwner(FelixWolfheimer)
failed with errno=13: Permission denied.
01/20/12 16:39:36 (80.0) (4436): Failed to perform final update to job queue!
01/20/12 16:40:06 (80.0) (4436): Retrying job cleanup, calling terminateJob()
01/20/12 16:40:06 (80.0) (4436): SECMAN: command 1112 QMGMT_WRITE_CMD
to schedd at <10.2.10.7:3376> from TCP port 3930 (blocking).
01/20/12 16:40:06 (80.0) (4436): SECMAN: using session
sword07:3732:1327073895:22 for {<10.2.10.7:3376>,<1112>}.
01/20/12 16:40:06 (80.0) (4436): SECMAN: resume, other side is
$CondorVersion: 7.6.6 Jan 17 2012 BuildID: 401976 $, NOT
reauthenticating.
01/20/12 16:40:06 (80.0) (4436): SECMAN: successfully enabled message
authenticator!
01/20/12 16:40:06 (80.0) (4436): SECMAN: successfully enabled encryption!
01/20/12 16:40:06 (80.0) (4436): SECMAN: startCommand succeeded.

... repeated several times ...

01/20/12 16:41:36 (80.0) (4436): SetEffectiveOwner(FelixWolfheimer)
failed with errno=13: Permission denied.
01/20/12 16:41:36 (80.0) (4436): Failed to perform final update to job queue!
01/20/12 16:41:36 (80.0) (4436): Maximum number of job cleanup retry
attempts (SHADOW_MAX_JOB_CLEANUP_RETRIES=5) reached; Forcing job
requeue!
01/20/12 16:41:36 (80.0) (4436): SharedPortEndpoint: Destructor:
Problem in thread shutdown notification: 0
01/20/12 16:41:36 (80.0) (4436): **** condor_shadow (condor_SHADOW)
pid 4436 EXITING WITH STATUS 107
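
To see how often this cycle repeats, the ShadowLog can be summarized with a small script. The following is a rough sketch only: the function name summarize_shadow_log and the result keys are made up for illustration, and the patterns are copied from the excerpts above rather than from any Condor specification, so they may need adjusting for other versions.

```python
import re

# Message fragments taken from the ShadowLog excerpts in this post.
# \s+ tolerates the line wrapping introduced by the mail client.
PATTERNS = {
    "set_effective_owner_failed": re.compile(
        r"SetEffectiveOwner\(\S+?\)\s+failed\s+with\s+errno=\d+"
    ),
    "final_update_failed": re.compile(
        r"Failed\s+to\s+perform\s+final\s+update\s+to\s+job\s+queue!"
    ),
    "forced_requeue": re.compile(r"Forcing\s+job\s+requeue!"),
}
EXIT_STATUS = re.compile(r"EXITING\s+WITH\s+STATUS\s+(\d+)")


def summarize_shadow_log(text):
    """Count the failure messages and collect shadow exit statuses."""
    summary = {key: len(pat.findall(text)) for key, pat in PATTERNS.items()}
    summary["exit_statuses"] = [int(s) for s in EXIT_STATUS.findall(text)]
    return summary
```

Reading the log file into a string and passing it to summarize_shadow_log gives one count per message type, which makes it easy to check whether every forced requeue is preceded by the same SetEffectiveOwner failure.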

As the "vanilla" universe is working properly, I suppose that my setup
is not completely insane. Has anyone else seen this problem?
Is there any setting that might cause such behavior? BTW: For all my
jobs (vanilla and parallel) I'm using the "condor_submit -remote"
command to submit them to the dedicated scheduler. I don't know
whether this matters.