Re: [Condor-users] Standard Universe and Job Hooks (condor_starter vs condor_starter.std)



Hi Todd,

Our primary focus is checkpointing. We have a shared filesystem, so I don't think remote I/O is needed at all (I believe checkpointing takes care of re-opening file descriptors). A checkpointing system for Vanilla universe jobs would be the perfect solution for us. In fact, we ran some tests with dmtcp and some bash wrappers, but it wasn't yet ready for our needs.

Furthermore, as our job sizes can be fairly big (~20G isn't that unusual here), we want to avoid periodic checkpointing and checkpoint only on eviction (just in case this makes things easier).

If there is any other way to run a command whenever a job finishes (with access to the ClassAd of the job that just ran), it would fit our needs as well. Perhaps some magic with USER_JOB_WRAPPER and bash scripting? I don't know if a wrapper makes sense for a Standard universe job.
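
A minimal sketch of the wrapper idea, written in Python rather than bash only to keep it easy to test. It assumes the wrapper is invoked with the job's command line as its arguments and that the starter exports the path of the job ClassAd file in the _CONDOR_JOB_AD environment variable; both should be checked against the Condor version in use. The names run_wrapped and log_exit are hypothetical, and whether the Standard universe starter honours USER_JOB_WRAPPER at all is not established here.

```python
#!/usr/bin/env python3
"""Hypothetical USER_JOB_WRAPPER: run the job, then run a command on exit."""
import os
import signal
import subprocess
import sys


def run_wrapped(job_argv, on_exit, forward_signals=()):
    """Run job_argv as a child process, then call on_exit(ad_path, status)."""
    child = subprocess.Popen(job_argv)

    def forward(signum, _frame):
        # Pass eviction/removal signals on to the real job.
        child.send_signal(signum)

    previous = {s: signal.signal(s, forward) for s in forward_signals}
    try:
        status = child.wait()
    finally:
        # Restore whatever handlers were installed before.
        for s, handler in previous.items():
            signal.signal(s, handler)

    # Assumption: the starter sets _CONDOR_JOB_AD to the job ClassAd file.
    # The value is None when the variable is not set.
    on_exit(os.environ.get("_CONDOR_JOB_AD"), status)
    return status


def log_exit(job_ad_path, status):
    """Placeholder for the site-specific command (hypothetical)."""
    print(f"job finished: status={status} ad={job_ad_path}", file=sys.stderr)


if __name__ == "__main__" and len(sys.argv) > 1:
    code = run_wrapped(sys.argv[1:], log_exit, (signal.SIGTERM,))
    # A negative status means the job died from signal N; report 128+N.
    sys.exit(code if code >= 0 else 128 - code)
```

Unlike a wrapper that simply execs the job, this one has to stay alive as the parent so it can act after the job exits, which is why it forwards SIGTERM instead of letting the signal kill only the wrapper. The message from log_exit ends up in the job's own stderr.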

Regards,

Joan

On 13/04/11 17:56, Todd Tannenbaum wrote:
Joan J. Piles wrote:
Hi all,

We have a hook that must be called for each job running in our cluster, an instance of xxxxx_HOOK_JOB_EXIT. In the Vanilla universe (the one most of our jobs use), there is no problem, and it works almost as expected (I say almost because the exit reason is shown as "evict" even when "condor_rm" is used, but that's not an important problem for us).

We have recently found that this hook is completely ignored for Standard universe jobs. According to the documentation it should work, and it is condor_starter's job to run the hooks. However, there seem to be two condor_starter executables, one for most jobs, and another one (condor_starter.std) for Standard universe jobs. Furthermore, in the source code there are two completely different implementations, and the Standard universe one seems to have no hook capability at all, so I don't know if this is a bug or a feature ;-)

What are our options for implementing hooks for Standard universe jobs? Is this being worked on (in development versions), or should we find a workaround? We already tried ditching condor_starter.std, but the default condor_starter doesn't seem to be able to start Standard universe jobs.

Thanks in advance,

Joan


Hi Joan -

You are correct, standard universe has its own shadow/starter pair that does not support a bunch of mechanisms found in the newer shadow/starter pair that supports other universes like Vanilla, Java, etc. Besides hooks, other features like ssh_to_job and CCB do not work in standard universe for this reason.

We are currently actively looking at moving some functionality from the standard universe starter/shadow into the newer starter/shadow. ( For some details, see some thinking we did on this a couple weeks ago at https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1956,67 ).

Question: do you primarily use standard universe for checkpointing, or do you rely on remote system calls as well? I ask because another option we are considering is to add support to the vanilla universe to easily handle standalone checkpointing where some signal is sent periodically to create a ckpt file in the vanilla job's output sandbox, whether the executable is linked w/ Condor's standalone checkpointing library or some other one.

regards,
Todd


--
--------------------------------------------------------------------------
Joan Josep Piles Contreras -  Systems Analyst
I3A - Instituto de Investigación en Ingeniería de Aragón
Tel: 976 76 10 00 (ext. 5454)
http://i3a.unizar.es -- jpiles@xxxxxxxxx
--------------------------------------------------------------------------