
Re: [Condor-users] COD troubleshooting



Thanks Dan. I added both 'IWD' and 'User' to the job ad, without success. However, I found a startd that does not crash and that writes a more verbose StartdLog:

12/27 14:33:52 DaemonCore: Command received via TCP from condor@fcdfcaf445 from host <131.225.240.106:35843>
12/27 14:33:52 DaemonCore: received command 1000 (CA_AUTH_CMD), calling handler (command_classad_handler)
12/27 14:33:52 Serving request for CA_ACTIVATE_CLAIM by user 'condor'
12/27 14:33:52 vm2: State change: Suspending because a COD job is now running
12/27 14:33:52 vm2: Changing activity: Retiring -> Suspended
12/27 14:33:52 vm2: cannot use glexec to spawn starter: no proxy (is GLEXEC_STARTER set in the shadow?)
12/27 14:33:52 vm2: writeJobAd: Write_Pipe failed
12/27 14:33:52 vm2: ERROR: exec_starter returned 0
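
To separate the failure lines from the surrounding DaemonCore messages, a small filter along these lines may help. This is only a sketch: the function name `problem_lines`, the keyword list, and the assumed line layout (timestamp, optional `vmN:` prefix, message) are my own guesses based on the excerpt above, not anything Condor defines.

```python
import re

# Assumed layout, taken from the excerpt above:
#   "MM/DD HH:MM:SS [vmN: ]message"
LINE_RE = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (?:(vm\d+): )?(.*)$")

# Keywords are an assumption drawn from this one excerpt, not an official list.
PROBLEM_RE = re.compile(r"ERROR|failed|cannot", re.IGNORECASE)


def problem_lines(log_text):
    """Return (timestamp, slot, message) tuples for lines that look like failures."""
    hits = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and PROBLEM_RE.search(m.group(3)):
            hits.append((m.group(1), m.group(2), m.group(3)))
    return hits


# Sample input copied from the StartdLog excerpt above.
SAMPLE = """\
12/27 14:33:52 Serving request for CA_ACTIVATE_CLAIM by user 'condor'
12/27 14:33:52 vm2: State change: Suspending because a COD job is now running
12/27 14:33:52 vm2: Changing activity: Retiring -> Suspended
12/27 14:33:52 vm2: cannot use glexec to spawn starter: no proxy (is GLEXEC_STARTER set in the shadow?)
12/27 14:33:52 vm2: writeJobAd: Write_Pipe failed
12/27 14:33:52 vm2: ERROR: exec_starter returned 0
"""

if __name__ == "__main__":
    for ts, slot, msg in problem_lines(SAMPLE):
        print(ts, slot, msg)
```

On the sample it keeps only the glexec, Write_Pipe, and exec_starter lines; for a real log, read the file contents and pass them to `problem_lines` instead of `SAMPLE`.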

It looks like the gLExec code path is also used when activating a COD claim. I did not mention before that gLExec is enabled in my configuration. The error seems to be caused by a missing X509 proxy. Is the mechanism for transporting the X509 proxy the same as for universe=grid jobs? Is it possible to specify which X509 proxy the COD job should run under?

Thanks
Renzo


On Dec 27, 2006, at 12:29 PM, Dan Bradley wrote:
Hello,

I have a hunch that some of the ClassAd attributes that the COD manual
claims are optional are actually required.

--Dan

Renzo Borgatti wrote:
Hi,

I have a problem activating claims using COD (Condor 6.9.0). This is
what I'm doing:

condor_cod request -addr "<131.225.212.148:39446>" -classad ci.out
Successfully sent CA_REQUEST_CLAIM to startd at <131.225.212.148:39446>
Result ClassAd written to ci.out
ID of new claim is: "<131.225.212.148:39446>#1167216341#4"

condor_cod activate -id "<131.225.212.148:39446>#1167216341#4" -classad ci.out -jobad TestCod
Attempt to send CA_ACTIVATE_CLAIM to startd <131.225.212.148:39446> failed
Reply ClassAd returned 'Failure' but does not have the ErrorString attribute

On the worker node, I can see the following two lines in the
StartdLog right before the startd crashes:

12/27 11:50:05 DaemonCore: Command received via TCP from condor@fcdfcaf444 from host <131.225.240.106:45123>
12/27 11:50:05 DaemonCore: received command 1000 (CA_AUTH_CMD), calling handler (command_classad_handler)

while in the MasterLog:

12/27 11:55:30 The STARTD (pid 15721) died due to signal 11
12/27 11:55:30 All daemons are gone.  Exiting.
12/27 11:55:32 **** condor_master (condor_MASTER) EXITING WITH STATUS 0

TestCod is a file with the following 2 lines:

Cmd="/bin/ps"
Args="-aux"

Am I using condor_cod the right way? Is there a way to get more
debugging information to understand what happened?

Thanks
Renzo