
Re: [Condor-users] COD troubleshooting



On Dec 27, 2006, at 3:43 PM, Dan Bradley wrote:
Unfortunately, COD doesn't currently support transferring any files,
including the x509 proxy file.  Is it possible for you to rely upon a
shared filesystem for this purpose?

No, unfortunately, because the Condor installation is glide-in based (probably the strongest use case for having gLExec in place). I also don't know how the starter searches for the X509 proxy. Is this done through the shadow on the head node, or should the X509 proxy be somewhere on the worker?

I haven't thought through what permissions would be necessary in order for this to work for gLExec.

AFAIK, a COD job is just a special kind of job with short latency, because there is no negotiation and preemption is always on. It should probably work like any other job when gLExec is active, so the startd is doing the right thing. Without knowing the internals of the mechanism, I can't say whether it just needs an environment variable in place or something else.

Can you or someone else confirm that the COD feature is not yet ready to be used in conjunction with gLExec? I just want to be sure that there is nothing else I can do on my side.

Renzo


--Dan

Renzo Borgatti wrote:
Thanks Dan. I added both 'IWD' and 'User' without success. But I
found a startd that does not crash and whose StartdLog is more verbose:

12/27 14:33:52 DaemonCore: Command received via TCP from condor@fcdfcaf445 from host <131.225.240.106:35843>
12/27 14:33:52 DaemonCore: received command 1000 (CA_AUTH_CMD), calling handler (command_classad_handler)
12/27 14:33:52 Serving request for CA_ACTIVATE_CLAIM by user 'condor'
12/27 14:33:52 vm2: State change: Suspending because a COD job is now running
12/27 14:33:52 vm2: Changing activity: Retiring -> Suspended
12/27 14:33:52 vm2: cannot use glexec to spawn starter: no proxy (is GLEXEC_STARTER set in the shadow?)
12/27 14:33:52 vm2: writeJobAd: Write_Pipe failed
12/27 14:33:52 vm2: ERROR: exec_starter returned 0
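
For anyone chasing the same failure, here is a rough Python sketch for pulling the problem lines for one slot out of a StartdLog excerpt like this one. It is only an illustration: the helper names (parse_log, slot_problems) and the keyword list are my own assumptions, not anything Condor provides, and the parser assumes the "MM/DD HH:MM:SS message" layout seen in this excerpt. It also rejoins lines that a mail client may have wrapped.

```python
import re

# Assumed layout, taken from the excerpt in this message: "MM/DD HH:MM:SS message".
LOG_LINE = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")

# Assumed keywords that mark a line as a problem; adjust to taste.
PROBLEM_WORDS = ("ERROR", "cannot", "failed")


def parse_log(text):
    """Return (timestamp, message) pairs, rejoining wrapped continuation lines."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = LOG_LINE.match(line)
        if match:
            entries.append([match.group(1), match.group(2)])
        elif entries:
            # No timestamp: treat the line as the wrapped tail of the previous entry.
            entries[-1][1] += " " + line
    return [(ts, msg) for ts, msg in entries]


def slot_problems(entries, slot):
    """Return the problem messages logged for one slot, e.g. 'vm2'."""
    prefix = slot + ": "
    return [
        msg[len(prefix):]
        for _, msg in entries
        if msg.startswith(prefix) and any(word in msg for word in PROBLEM_WORDS)
    ]


sample = """\
12/27 14:33:52 vm2: State change: Suspending because a COD job is now
running
12/27 14:33:52 vm2: Changing activity: Retiring -> Suspended
12/27 14:33:52 vm2: cannot use glexec to spawn starter: no proxy (is
GLEXEC_STARTER set in the shadow?)
12/27 14:33:52 vm2: writeJobAd: Write_Pipe failed
12/27 14:33:52 vm2: ERROR: exec_starter returned 0
"""

problems = slot_problems(parse_log(sample), "vm2")
for message in problems:
    print(message)
```

On the sample above this keeps the glexec, Write_Pipe and exec_starter lines and drops the two state-change lines.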

It looks like gLExec is also used when activating a COD job. I didn't
mention before that gLExec is active in my configuration. The error
has something to do with the X509 proxy not being present. Is the
mechanism to transport the X509 proxy the same as for universe=grid
jobs? Is it possible to specify which X509 proxy the COD job should
run under?

Thanks
Renzo


On Dec 27, 2006, at 12:29 PM, Dan Bradley wrote:

Hello,

I have a hunch that some of the ClassAd attributes that the COD manual
claims are optional are actually required.
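
To make the hunch concrete, here is a small Python sketch that writes a job ad with the extra attributes filled in, using the same Name="value" layout as the TestCod file quoted further down. The attribute names 'IWD' and 'User' are the ones Renzo reports trying in his reply; the values given for them and the helper name render_job_ad are purely hypothetical placeholders, and nothing here confirms which attributes the startd actually requires.

```python
def render_job_ad(attrs):
    """Render a dict as one Name="value" line per attribute (hypothetical helper)."""
    lines = []
    for name, value in attrs.items():
        if isinstance(value, str):
            # Quote string values; embedded double quotes are backslash-escaped
            # (an assumption about how the ad parser expects them).
            escaped = value.replace('"', '\\"')
            lines.append(f'{name}="{escaped}"')
        else:
            # Non-string values (e.g. integers) are written bare.
            lines.append(f"{name}={value}")
    return "\n".join(lines) + "\n"


job_ad = {
    "Cmd": "/bin/ps",        # from the original TestCod file
    "Args": "-aux",          # from the original TestCod file
    "IWD": "/tmp",           # placeholder working directory (assumption)
    "User": "someuser",      # placeholder user name (assumption)
}

# Write the ad to the file that is later passed to condor_cod via -jobad.
with open("TestCod", "w") as handle:
    handle.write(render_job_ad(job_ad))

print(render_job_ad(job_ad), end="")
```

The resulting file would be passed to condor_cod activate with -jobad TestCod, exactly as in the original command.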

--Dan

Renzo Borgatti wrote:

Hi,

I have a problem activating claims using COD (Condor 6.9.0). This is
what I'm doing:


condor_cod request -addr "<131.225.212.148:39446>" -classad ci.out

Successfully sent CA_REQUEST_CLAIM to startd at <131.225.212.148:39446>
Result ClassAd written to ci.out
ID of new claim is: "<131.225.212.148:39446>#1167216341#4"


condor_cod activate -id "<131.225.212.148:39446>#1167216341#4" -classad ci.out -jobad TestCod

Attempt to send CA_ACTIVATE_CLAIM to startd <131.225.212.148:39446> failed
Reply ClassAd returned 'Failure' but does not have the ErrorString attribute
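
The request/activate sequence above can be sketched in Python as follows. This is only an illustration of how the claim ID printed by condor_cod request feeds into condor_cod activate. The helper names (extract_claim_id, activate_argv) are made up for this sketch, the regular expression assumes the exact "ID of new claim is:" wording shown above, and only the flags that appear in the commands above (-id, -classad, -jobad) are used. The sketch builds the command line but does not run it.

```python
import re
import shlex

# Assumes the output wording shown in this message.
CLAIM_ID = re.compile(r'ID of new claim is: "([^"]+)"')


def extract_claim_id(request_output):
    """Return the claim ID from `condor_cod request` output, or None if absent."""
    match = CLAIM_ID.search(request_output)
    return match.group(1) if match else None


def activate_argv(claim_id, classad_out, job_ad):
    """Build the argument list for `condor_cod activate` (not executed here)."""
    return [
        "condor_cod", "activate",
        "-id", claim_id,
        "-classad", classad_out,
        "-jobad", job_ad,
    ]


request_output = """\
Successfully sent CA_REQUEST_CLAIM to startd at <131.225.212.148:39446>
Result ClassAd written to ci.out
ID of new claim is: "<131.225.212.148:39446>#1167216341#4"
"""

claim = extract_claim_id(request_output)
argv = activate_argv(claim, "ci.out", "TestCod")

# shlex.join quotes the claim ID so the '<', '>' and '#' survive a shell.
print(shlex.join(argv))
```

Passing the list to subprocess.run instead of printing it would avoid shell quoting issues altogether, since the claim ID contains characters the shell treats specially.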

On the worker node, I can see the following two lines in the
StartdLog right before the startd crashes:

12/27 11:50:05 DaemonCore: Command received via TCP from condor@fcdfcaf444 from host <131.225.240.106:45123>
12/27 11:50:05 DaemonCore: received command 1000 (CA_AUTH_CMD), calling handler (command_classad_handler)

while in the MasterLog:

12/27 11:55:30 The STARTD (pid 15721) died due to signal 11
12/27 11:55:30 All daemons are gone.  Exiting.
12/27 11:55:32 **** condor_master (condor_MASTER) EXITING WITH STATUS 0

TestCod is a file with the following 2 lines:

Cmd="/bin/ps"
Args="-aux"

Am I using condor_cod the right way? Is there a way to get more
debugging information to understand what happened?

Thanks
Renzo