
Re: [Condor-users] Detecting checkpoint.



     Suppose I have a computation that I know is going nowhere
if it is still running after an hour of CPU time on a 2GHz Intel
processor.  I can install code that checks the CPU time
occasionally and shuts the job down after an hour on that 2GHz
processor.  Keep in mind that the code has no milestones that
are consistently linear in CPU time.
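
A minimal sketch of the periodic check described above, assuming
times(2) is the CPU-time source.  The names cpu_seconds_used and
over_budget are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <sys/times.h>
#include <unistd.h>

/* CPU seconds (user + system) used by this process so far,
 * or -1.0 if the clock could not be read. */
double cpu_seconds_used(void)
{
    struct tms t;
    long ticks_per_sec = sysconf(_SC_CLK_TCK);

    if (ticks_per_sec <= 0 || times(&t) == (clock_t)-1)
        return -1.0;
    return (double)(t.tms_utime + t.tms_stime) / (double)ticks_per_sec;
}

/* Returns 1 once used_seconds exceeds budget_seconds, else 0. */
int over_budget(double used_seconds, double budget_seconds)
{
    assert(budget_seconds > 0.0);
    return used_seconds > budget_seconds;
}
```

The job would call over_budget(cpu_seconds_used(), 3600.0) every so
often from its main loop and exit cleanly when it returns 1.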

     Now, I throw the computation into the seas of Condor
where it finds a slower processor.  Hence, I want to increase
the amount of time I allow before shutting down.  I can measure
CPU speed by any number of means (as is done in NAMD).  Once
I'm checkpointed and restarted on another machine, however,
that measurement is compromised.  To resolve the problem, I
need to be able to detect that I may have changed processors
during execution so I can adjust my shutdown criteria.
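
One way to picture the adjustment, as a sketch only: time a fixed
workload on the current host, compare it with the time the same
workload took on the reference machine, and stretch the budget by
that factor.  The workload, the loop count, and the names
benchmark_seconds, speed_ratio and scaled_budget are all assumptions
made for illustration; this is not how NAMD measures speed.

```c
#include <assert.h>
#include <time.h>

/* CPU seconds this host needs for a fixed, arbitrary workload. */
double benchmark_seconds(void)
{
    volatile double x = 0.0;  /* volatile keeps the loop from being removed */
    clock_t start = clock();

    for (long i = 0; i < 20000000L; i++)
        x += (double)i * 0.5;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Speed relative to the reference machine: 1.0 = same, 0.5 = half as fast. */
double speed_ratio(double ref_bench_s, double local_bench_s)
{
    assert(ref_bench_s > 0.0 && local_bench_s > 0.0);
    return ref_bench_s / local_bench_s;
}

/* CPU time to allow on this host for a budget tuned on the reference. */
double scaled_budget(double ref_budget_s, double ratio)
{
    assert(ratio > 0.0);
    return ref_budget_s / ratio;
}
```

For example, a host that needs twice as long for the benchmark has a
ratio of 0.5, so a one-hour reference budget becomes two hours.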

     I'd already looked over Section 4 of the Manual before
I wrote the initial question, by the way.  I'd also mistakenly
thought that getpid() would return something useful on the
assumption that a checkpointed application restarted as a new
process (dumb idea I guess).

     I know I can constrain processor speed through the
ClassAd, but that restricts where the job can execute.  I'd
rather be able to adapt the application to the environment
and have more open options for execution.

     I haven't checked yet, but since you may know off the
top of your head, does gethostid() work in standard universe?
Is it a good indicator of a processor change?
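
A sketch of what such a check might look like, under two untested
assumptions: that gethostid() reports the execute machine in
standard universe (the open question here), and that static data
survives a checkpoint along with the rest of the process image.
The name host_changed is hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1  /* ask glibc to declare gethostid() */
#endif
#include <unistd.h>

static long last_host_id;     /* host id seen on the previous call */
static int have_host_id = 0;  /* 0 until the first call has run */

/* Returns 1 if gethostid() differs from the value seen on the
 * previous call, else 0.  The first call always returns 0. */
int host_changed(void)
{
    long now = gethostid();
    int changed = have_host_id && now != last_host_id;

    last_host_id = now;
    have_host_id = 1;
    return changed;
}
```

A changed host id would only say that the machine may be different,
not that its speed is, so the safe response is simply to recalibrate.
Two machines may also report the same host id, in which case a move
between them would go unnoticed.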

     I've used times(2) in standard universe and it appears
to return sensible values through checkpoint so all I need
is the means to know when to recalibrate.
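
Given some signal that the host may have changed, the recalibration
itself could be bookkeeping along these lines.  This is a sketch
under assumptions: the struct, its fields, and the functions
recalibrate and reference_seconds are hypothetical, and speed_ratio
stands for whatever speed measurement the job already makes
(1.0 = the reference 2GHz machine, 0.5 = half as fast).

```c
#include <assert.h>

struct budget_state {
    double work_done;      /* reference-machine seconds banked on earlier hosts */
    double segment_start;  /* CPU-seconds reading when this host was calibrated */
    double speed_ratio;    /* this host's speed relative to the reference */
};

/* Bank the work done on the old host, then start a new segment
 * at cpu_now using the new host's speed ratio. */
void recalibrate(struct budget_state *s, double cpu_now, double new_ratio)
{
    assert(new_ratio > 0.0);
    s->work_done += (cpu_now - s->segment_start) * s->speed_ratio;
    s->segment_start = cpu_now;
    s->speed_ratio = new_ratio;
}

/* Total CPU time so far, expressed in reference-machine seconds. */
double reference_seconds(const struct budget_state *s, double cpu_now)
{
    return s->work_done + (cpu_now - s->segment_start) * s->speed_ratio;
}
```

The shutdown test then compares reference_seconds() against the
original one-hour figure, so time spent on a slow host counts for
less than time spent on a fast one.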

				Thanks,
				Phil

				P. A. Cheeseman
				aai@xxxxxxxxxx
				http://web.ics.purdue.edu/~aai/
				765.496.8224
 

> -----Original Message-----
> From: Erik Paulson [mailto:epaulson@xxxxxxxxxxx] 
> Sent: Wednesday, July 19, 2006 2:17 PM
> To: aai@xxxxxxxxxx; Condor-Users Mail List
> Subject: Re: [Condor-users] Detecting checkpoint.
> 
> On Wed, Jul 19, 2006 at 09:51:37AM -0400, P. A. Cheeseman wrote:
> > 
> >      Is there any means by which a standard universe executable
> > can check to determine if it has been checkpointed?
> > 
> 
> There's no API to find out if you've been checkpointed. You could 
> have your job force a checkpoint, in which case you could know that
> you were checkpointed at least once.
> 
> Why do you want your job to know?
> 
> -Erik
>