
Re: [Condor-users] State transition for preempted jobs and its implication with Condor-G



On Feb 15, 2008, at 1:56 PM, Barnett P. Chiu wrote:

When a job is temporarily suspended in favor of a higher-priority job, what state does it go into? My impression is that the job's state becomes idle and the job sits in the queue, waiting for a match again. But does it pass through the 'hold' state before becoming 'idle', and if so, is this transition (R -> (H?) -> I) reflected in condor_q?

I guess one way a preempted job could enter the 'hold' state is if it is being checkpointed (file staging is involved, hence the hold state).

The startd may choose to suspend a running job for a number of reasons (configurable by the admin), one of which may be a job running on a different slot on that machine. In this case, the suspended job will be marked as Running in the job queue.

A startd may also decide to evict a job from the execution machine. One reason for this is there's another job the startd would rather run in that slot. In this case, the evicted job returns to Idle status, awaiting another match.

This reminds me of another question: when a job is submitted through Condor-G, the grid manager on the remote gatekeeper forwards the job to Condor (assuming the underlying batch system is Condor) and lets it schedule the job. In which universe will the site's native Condor run the job?

The default for GRAM is to submit the job in the vanilla universe.

If the job ends up being scheduled as a vanilla job, how would it receive checkpointing service? Does the jobmanager also somehow watch over the job while it executes on the worker node, so that checkpointing can still be achieved even though it runs as a vanilla job?

Of course, the thoughts above are based on my impression that Condor-G does support checkpointing, but I am not sure at which level it is achieved. Or do Condor-G jobs not support checkpointing at all?

Condor-G and GRAM do not directly support checkpointing of jobs. The batch scheduler behind GRAM may support it, though.

Is there a possibility that the jobmanager on the gatekeeper could somehow "inform" its native Condor to schedule jobs in a universe other than vanilla?


When Condor is the batch system behind GRAM, the client can make additions and modifications to the Condor submit file that GRAM writes with the 'condor_submit' RSL attribute. Here's an example of how to use it in a Condor-G submit file:

globus_rsl = (condor_submit=(universe standard)(priority 10))
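
For context, here is a minimal sketch of how that line might sit in a complete Condor-G submit file. The gatekeeper host name, the executable name, and the output file names are placeholders made up for illustration, and the gt2 resource type with a jobmanager-condor contact string is an assumption about how the site is set up. Note that a standard universe job only works if the executable has been relinked with condor_compile and can run on the remote pool.

```
# Condor-G submit file (sketch; host and file names are hypothetical)
universe      = grid
# Assumes a pre-WS GRAM (gt2) gatekeeper fronting a Condor pool
grid_resource = gt2 gatekeeper.example.edu/jobmanager-condor

# Placeholder executable; must be relinked with condor_compile
# for the standard universe requested below
executable    = my_job
output        = my_job.out
error         = my_job.err
log           = my_job.log

# Passed through GRAM into the submit file written on the remote side
globus_rsl    = (condor_submit=(universe standard)(priority 10))

queue
```

Without the globus_rsl line, the remote job would be submitted in the vanilla universe, as noted above.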

Thanks and regards,
Jaime Frey
UW-Madison Condor Team