Hi Condor-Users!
I have written about this before and I will write about it again
because I am still stumbling on this issue.
I am using software that requires Condor to always run jobs under the same
user account. Right now I have a small Condor pool of 3 machines running
Windows.
In the condor_config file of each machine I have set SLOT1_USER
= domain\user_account, the account that Condor should use to run the jobs.
I have also set DEDICATED_EXECUTE_ACCOUNT_REGEXP = True.
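To make this concrete, the relevant lines in my condor_config look roughly like the sketch below. The account name is a placeholder, not the real one:

```
# condor_config on each machine in the pool
# (domain\user_account is a placeholder for the real account)
SLOT1_USER = domain\user_account
DEDICATED_EXECUTE_ACCOUNT_REGEXP = True
```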
The problem occurs when I try to store the credentials on each machine.
My software supplier instructed me to run condor_store_cred add on every
machine in the pool.
But when I issue the command condor_store_cred add, I get the error
“Make sure your ALLOW_WRITE setting includes this host”. And yes, it does:
ALLOW_WRITE is set to *, so it includes this host.
When the software supplier contacted the Condor Team about this
issue, they got the following answer:
This is a common problem people encounter when setting up Condor
on Windows. This error indicates that there is a communication problem between
the condor_store_cred tool and the condor_schedd daemon. The first thing you
want to do is verify that the schedd is in fact running on the machine from
which you are executing condor_store_cred. If it is, the SchedLog is the place
to look for details on why the communication is failing. A common reason is
because of a misconfigured security setup, which is why the error message
refers to HOSTALLOW_WRITE. Of course, there may be other problems. Adding the
D_SECURITY flag to the SCHEDD_DEBUG configuration macro will allow you to get
the most information out of your SchedLog.
Hope this helps. Let me know if you need any more help tracking
this down.
Thanks,
Greg Quinn
Condor Team
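If I understand Greg's suggestion correctly, the debug setting would look something like the sketch below in the condor_config of the machine where the schedd runs. I assume the daemons need a condor_reconfig (or a restart) afterwards for it to take effect:

```
# Sketch of the setting Greg describes: more security detail in the SchedLog
SCHEDD_DEBUG = D_SECURITY
```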
Greg wrote that “The first thing you want to do is verify that the schedd is
in fact running on the machine from which you are executing condor_store_cred.”
I have checked, and no, the schedd daemon is not running on the machine
where I am executing condor_store_cred. It is only running on the central
manager, and there condor_store_cred worked fine.
Shouldn't condor_schedd only need to run on the machines that jobs are
submitted from, which in my case is the central manager?
1. Is it really necessary to execute condor_store_cred add on
every machine of my pool?
2. If yes, is it necessary that condor_schedd runs on every
machine of the condor pool?
3. If yes, how do I configure Condor so that condor_schedd runs on every
machine?
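As far as I can tell from the manual, the daemons started on a machine are controlled by DAEMON_LIST, so I assume the change for question 3 would look something like the sketch below. It is untested, and I do not know whether it is the recommended setup:

```
# Hypothetical local config for an execute machine:
# start a schedd alongside the master and startd
DAEMON_LIST = MASTER, SCHEDD, STARTD
```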
Below I include the SchedLog excerpt for the submitted job with ID 29.
10.110.44.12 is the central manager, where the jobs are submitted
from; condor_schedd is running on this machine, and condor_store_cred add
worked fine there.
10.110.44.19 is the execute machine; condor_store_cred add did not
work on this machine, and condor_schedd is not running on it.
Any idea what is happening?
Cheers,
Sónia
9/24 11:14:33 Starting add_shadow_birthdate(29.0)
9/24 11:14:33 Started shadow for job 29.0 on "<10.110.44.19:1232>", (shadow pid = 4172)
9/24 11:14:34 DC_AUTHENTICATE: received DC_AUTHENTICATE from <10.110.44.12:53509>
9/24 11:14:34 DC_AUTHENTICATE: added session id o2f-sth-lap-016:4012:1285319674:43 to cache for 8640000 seconds!
9/24 11:14:34 DC_AUTHENTICATE: received UDP packet from <10.110.44.12:51224>.
9/24 11:14:34 DC_AUTHENTICATE: received DC_AUTHENTICATE from <10.110.44.12:51224>
9/24 11:14:34 DC_AUTHENTICATE: resuming session id o2f-sth-lap-016:4012:1285319674:43 given to <10.110.44.12:53509>:
9/24 11:14:34 DC_AUTHENTICATE: Success.
9/24 11:14:34 STARTCOMMAND: starting 60001 to <10.110.44.12:53238> on UDP port 51225.
9/24 11:14:34 SECMAN: command 60001 to <10.110.44.12:53238> on UDP port 51225.
9/24 11:14:34 SECMAN: Cookie="2E5D02CE7113BE0793647B61D1BB49E8ED987ADCAD1C68D33A6E0DA5F1E9396F856B4060CBB7A93FEC1430B3470CE5D97C2FFB12626DEAD2320F8D124A7FD4C"
9/24 11:14:34 SECMAN: startCommand succeeded.
9/24 11:14:34 DC_AUTHENTICATE: received UDP packet from <10.110.44.12:51225>.
9/24 11:14:34 DC_AUTHENTICATE: received DC_AUTHENTICATE from <10.110.44.12:51225>
9/24 11:14:34 DC_AUTHENTICATE: Success.
9/24 11:14:34 DaemonCore: Command received via UDP from host <10.110.44.12:51225>
9/24 11:14:34 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
9/24 11:14:34 Shadow pid 4172 for job 29.0 exited with status 4
9/24 11:14:34 ERROR: Shadow exited with job exception code!
9/24 11:14:36 ERROR: SetHandleInformation() failed in SetFDInheritFlag(0,0),err=87
9/24 11:14:36 ERROR: SetHandleInformation() failed in SetFDInheritFlag(1,0),err=87
9/24 11:14:36 ERROR: SetHandleInformation() failed in SetFDInheritFlag(2,0),err=87
9/24 11:14:36 Starting add_shadow_birthdate(29.0)
9/24 11:14:36 Started shadow for job 29.0 on "<10.110.44.19:1232>", (shadow pid = 3280)
9/24 11:14:37 STARTCOMMAND: starting 1 to <10.110.44.12:9618> on UDP port 51226.
9/24 11:14:37 SECMAN: command 1 to <10.110.44.12:9618> on UDP port 51226.
9/24 11:14:37 SECMAN: using session o2f-sth-lap-016:2396:1285318979:3 for {<10.110.44.12:9618>,<1>}.
9/24 11:14:37 SECMAN: UDP, have_session == 1, can_neg == 1
9/24 11:14:37 SECMAN: startCommand succeeded.
9/24 11:14:37 Sent ad to central manager for o2f_sonlil@xxxxxxxxxxxxxxxxxxxx
9/24 11:14:37 STARTCOMMAND: starting 11 to <10.110.44.12:9618> on UDP port 51227.
9/24 11:14:37 SECMAN: command 11 to <10.110.44.12:9618> on UDP port 51227.
9/24 11:14:37 SECMAN: using session o2f-sth-lap-016:2396:1285318979:3 for {<10.110.44.12:9618>,<11>}.
9/24 11:14:37 SECMAN: UDP, have_session == 1, can_neg == 1
9/24 11:14:37 SECMAN: startCommand succeeded.
9/24 11:14:37 Sent ad to 1 collectors for o2f_sonlil@xxxxxxxxxxxxxxxxxxxx
9/24 11:14:38 DC_AUTHENTICATE: received DC_AUTHENTICATE from <10.110.44.12:53518>
9/24 11:14:38 DC_AUTHENTICATE: added session id o2f-sth-lap-016:4012:1285319678:44 to cache for 8640000 seconds!
9/24 11:14:38 DC_AUTHENTICATE: received UDP packet from <10.110.44.12:51228>.
9/24 11:14:38 DC_AUTHENTICATE: received DC_AUTHENTICATE from <10.110.44.12:51228>
9/24 11:14:38 DC_AUTHENTICATE: resuming session id o2f-sth-lap-016:4012:1285319678:44 given to <10.110.44.12:53518>:
9/24 11:14:38 DC_AUTHENTICATE: Success.
9/24 11:14:38 STARTCOMMAND: starting 60001 to <10.110.44.12:53238> on UDP port 51229.
9/24 11:14:38 SECMAN: command 60001 to <10.110.44.12:53238> on UDP port 51229.
9/24 11:14:38 SECMAN: Cookie="2E5D02CE7113BE0793647B61D1BB49E8ED987ADCAD1C68D33A6E0DA5F1E9396F856B4060CBB7A93FEC1430B3470CE5D97C2FFB12626DEAD2320F8D124A7FD4C"
9/24 11:14:38 SECMAN: startCommand succeeded.
9/24 11:14:38 DC_AUTHENTICATE: received UDP packet from <10.110.44.12:51229>.
9/24 11:14:38 DC_AUTHENTICATE: received DC_AUTHENTICATE from <10.110.44.12:51229>
9/24 11:14:38 DC_AUTHENTICATE: Success.
9/24 11:14:38 DaemonCore: Command received via UDP from host <10.110.44.12:51229>
9/24 11:14:38 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
9/24 11:14:38 Shadow pid 3280 for job 29.0 exited with status 4
9/24 11:14:38 ERROR: Shadow exited with job exception code!
9/24 11:14:38 Match for cluster 29 has had 5 shadow exceptions, relinquishing.
9/24 11:14:38 STARTCOMMAND: starting 443 to <10.110.44.19:1232> on UDP port 51230.
9/24 11:14:38 SECMAN: command 443 to <10.110.44.19:1232> on UDP port 51230.
9/24 11:14:38 SECMAN: using session O2F-sth-LAP-002:588:1285319520:39 for {<10.110.44.19:1232>,<443>}.
9/24 11:14:38 SECMAN: UDP, have_session == 1, can_neg == 1
9/24 11:14:38 SECMAN: startCommand succeeded.
9/24 11:14:38 STARTCOMMAND: starting 443 to <10.110.44.19:1232> on UDP port 51231.
9/24 11:14:38 SECMAN: command 443 to <10.110.44.19:1232> on UDP port 51231.
9/24 11:14:38 SECMAN: using session O2F-sth-LAP-002:588:1285319520:39 for {<10.110.44.19:1232>,<443>}.
9/24 11:14:38 SECMAN: UDP, have_session == 1, can_neg == 1
9/24 11:14:38 SECMAN: startCommand succeeded.
9/24 11:14:38 Sent RELEASE_CLAIM to startd on <10.110.44.19:1232>
9/24 11:14:38 Match record (<10.110.44.19:1232>, 29, 0) deleted
9/24 11:14:38 DC_AUTHENTICATE: received DC_AUTHENTICATE from <10.110.44.19:1544>
9/24 11:14:38 DC_AUTHENTICATE: resuming session id o2f-sth-lap-016:4012:1285319536:25 given to <10.110.44.19:1526>:
9/24 11:14:38 DC_AUTHENTICATE: Success.
9/24 11:14:38 DaemonCore: Command received via TCP from host <10.110.44.19:1544>
9/24 11:14:38 DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
9/24 11:14:38 Got VACATE_SERVICE from <10.110.44.19:1544>
Sónia Liléo
O2 Strandvägen 5B 114 51 Stockholm
Tel: +46 8 559 310 37 Mobile: +46 73 752 95 74
www.o2.se