
Re: [Condor-users] Checkpointing Condor's vanilla universe jobs.



Actually, I have adapted your version to my environment and added
this "feature". We are now testing it on a production cluster with about
15 users and 150 jobs.
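
For anyone reading this in the archive, here is a rough illustration of the signal-trapping idea. It is not the modified script itself, and it does not use BLCR or Parrot: it is a plain application-level sketch in Python. It assumes the vacate signal is SIGTERM (as far as I know the usual default for vanilla jobs, changeable with kill_sig in the submit file). The checkpoint file name and the "step" counter are made up for the example.

```python
import json
import os
import signal
import tempfile

# Hypothetical checkpoint location; not taken from the original notes.
CHECKPOINT = os.path.join(tempfile.gettempdir(), "job_checkpoint.json")

vacate_requested = False


def on_vacate(signum, frame):
    # Keep the handler tiny: just record that a vacate was requested.
    global vacate_requested
    vacate_requested = True


def save_checkpoint(state, path=CHECKPOINT):
    # Write to a temporary file and rename it, so a partial write
    # never replaces an earlier good checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path=CHECKPOINT):
    # Resume from a previous checkpoint if one exists, else start fresh.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}


def run(total_steps, path=CHECKPOINT):
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if vacate_requested:
            save_checkpoint(state, path)
            return False  # stopped early; checkpoint written
        state["step"] += 1  # stand-in for one unit of real work
    return True  # finished all steps


# Assumption: the job is vacated with SIGTERM.
signal.signal(signal.SIGTERM, on_vacate)
```

The job would call run(...) as its main loop; if it returns False the process exits, and the next time the job starts it picks up from the saved step instead of from zero.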

Thanks for your time.
--
José.

On Tuesday 26 February 2008 10:48:20 Mark Calleja wrote:
> Hi José,
>
> Thanks for your feedback. What you suggest should not be too difficult
> to add, so if you're interested I can send you a modified version of the
> code for you to test.
>
> Cheers,
> Mark
>
> José M. Martín wrote:
> > Thanks for your contribution.
> >
> > I used it on my Condor cluster and it works very well. Just one suggestion:
> > you could trap Condor's signals to force your programs to make a
> > checkpoint. When Condor vacates a program, it sends it a signal (killsig):
> > http://www.cs.wisc.edu/condor/manual/v7.0/2_7Priorities_Preemption.html#SECTION00373000000000000000
> > By trapping this signal, programs could make a checkpoint before stopping.
> >
> > Cheers,
> > José
> >
> > On Monday 04 February 2008 12:37:43 Mark Calleja wrote:
> >> Hi,
> >>
> >> In case it's of use or interest to anyone else on this mailing list,
> >> I've written some notes on how one can use Parrot and the BLCR kernel
> >> modules to transparently checkpoint Condor's vanilla universe jobs. The
> >> link is:
> >>
> >> http://www.escience.cam.ac.uk/projects/camgrid/blcr.html
> >>
> >> This is recent/ongoing work, so feedback and/or bug reports back to me
> >> please.
> >>
> >> Cheers,
> >> Mark