
Re: [Condor-users] Standard Universe and Job Hooks (condor_starter vs condor_starter.std)



Joan J. Piles wrote:
Hi all,

We have a hook that must be called for each job running in our cluster, an instance of xxxxx_HOOK_JOB_EXIT. In the Vanilla universe (the one most of our jobs use), there is no problem, and it works almost as expected (I say almost because the exit reason is shown as "evict" even when "condor_rm" is used, but that's not an important problem for us).
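For readers unfamiliar with the mechanism, here is a minimal sketch of what such a job-exit hook could look like. It assumes the usual hook calling convention: the exit reason ("exit", "remove", "hold" or "evict") arrives as the first command-line argument and the job ClassAd arrives on standard input as "Attr = value" lines. The function names and the log format are hypothetical, and the ClassAd parsing is deliberately naive.

```python
#!/usr/bin/env python3
"""Sketch of a job-exit hook (hypothetical names, naive ClassAd parsing)."""
import sys

# Exit reasons the starter is expected to pass as argv[1].
KNOWN_REASONS = {"exit", "remove", "hold", "evict"}


def parse_classad(text):
    """Turn 'Attr = value' lines into a dict of raw strings."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # skip lines that contain no '=' at all
            ad[name.strip()] = value.strip().strip('"')
    return ad


def summarize(reason, ad):
    """Build a one-line summary of how the job ended."""
    if reason not in KNOWN_REASONS:
        reason = "unknown"
    job_id = "{}.{}".format(ad.get("ClusterId", "?"), ad.get("ProcId", "?"))
    return "job {} finished: {}".format(job_id, reason)


def main(argv, stdin, out):
    """Read the reason and the ClassAd, then write the summary to 'out'."""
    reason = argv[1] if len(argv) > 1 else "unknown"
    out.write(summarize(reason, parse_classad(stdin.read())) + "\n")
    return 0

# A deployed hook would finish with:
#     sys.exit(main(sys.argv, sys.stdin, sys.stderr))
```

Logging the raw reason this way is also a simple way to confirm the "evict" versus "remove" behaviour described above.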

We have recently found that this hook is completely ignored for Standard universe jobs. According to the documentation it should work, and it is condor_starter's job to run the hooks. However, there seem to be two condor_starter executables, one for most jobs, and another one (condor_starter.std) for Standard universe jobs. Furthermore, in the source code there are two completely different implementations, and the Standard universe one seems to have no hook capability at all, so I don't know if this is a bug or a feature ;-)

What are our options for implementing hooks for Standard Universe jobs? Is this being worked on (in development versions), or should we find a workaround? We already tried ditching condor_starter.std, but the default condor_starter doesn't seem to be able to start Standard Universe jobs.

Thanks in advance,

Joan


Hi Joan -

You are correct, standard universe has its own shadow/starter pair that does not support a bunch of mechanisms found in the newer shadow/starter pair that supports other universes like Vanilla, Java, etc. Besides hooks, other features like ssh_to_job and CCB do not work in standard universe for this reason.

We are actively looking at moving some functionality from the standard universe starter/shadow into the newer starter/shadow. (For some details, see some thinking we did on this a couple of weeks ago at https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1956,67 ).

Question: do you primarily use standard universe for checkpointing, or do you rely on remote system calls as well? I ask because another option we are considering is to add support to the vanilla universe to easily handle standalone checkpointing where some signal is sent periodically to create a ckpt file in the vanilla job's output sandbox, whether the executable is linked w/ Condor's standalone checkpointing library or some other one.
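To make the signal-driven idea concrete, here is a rough sketch of how a vanilla-universe job might cooperate with such a scheme. Everything in it is an assumption made for illustration: the class and function names, the JSON checkpoint format, and the choice of signal are hypothetical, and it does not use Condor's standalone checkpointing library. The handler only sets a flag; the main loop writes the checkpoint file atomically at a safe point and resumes from it on restart.

```python
import json
import os
import signal


class Checkpointer:
    """Writes job state to 'path' when a checkpoint has been requested."""

    def __init__(self, path):
        self.path = path
        self.requested = False

    def request(self, signum=None, frame=None):
        # Signal handler: only set a flag, do the file I/O in the main loop.
        self.requested = True

    def save_if_requested(self, state):
        if not self.requested:
            return False
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self.path)  # atomic rename, never a half-written file
        self.requested = False
        return True


def install(ckpt, signum):
    """Register the checkpoint request handler for the chosen signal."""
    signal.signal(signum, ckpt.request)


def load_state(path, default):
    """Resume from an existing checkpoint file, or start from 'default'."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(total_steps, ckpt):
    """Toy workload: sum the step indices, checkpointing between steps."""
    state = load_state(ckpt.path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        state["acc"] += state["step"]
        state["step"] += 1
        ckpt.save_if_requested(state)
    return state
```

On a Unix host the job would call something like install(ckpt, signal.SIGUSR2) at startup and keep the checkpoint file in its working directory so that it ends up in the output sandbox.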

regards,
Todd