
[Condor-users] Job eviction on submitting DAG machines



Hey all,

I was hoping to get some advice on this problem:

We have some submit machines that occasionally refuse to run a DAG.  In other words, a user submits a DAG job and the condor_dagman job just flips between IDLE and RUNNING, getting evicted over and over.

e.g. (from the dagman output):

001 (4634.000.000) 11/18 11:03:49 Job executing on host: <10.10.xxx.xxx:1118>
...
004 (4634.000.000) 11/18 11:03:49 Job was evicted.
        (0) Job was not checkpointed.
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job


(this continues to repeat over and over)...
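
In case it helps anyone spot the same pattern in their own logs, here is a rough Python sketch that counts how often a job is evicted in the same second it started. It assumes the event header layout shown in the excerpt above (001 for execute, 004 for evicted); the function name count_instant_evictions is made up for illustration and is not part of Condor.

```python
import re
from collections import Counter

# Matches an event header line as it appears in the excerpt, e.g.
# "004 (4634.000.000) 11/18 11:03:49 Job was evicted."
EVENT_RE = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.\d+\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<text>.*)$"
)

EXECUTE = "001"  # "Job executing on host"
EVICTED = "004"  # "Job was evicted"


def count_instant_evictions(log_text):
    """Count, per (cluster, proc), evictions that carry the same
    timestamp as the execute event immediately preceding them."""
    last_start = {}       # job -> (date, time) of its latest execute event
    flapping = Counter()  # job -> number of same-second evictions
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if not match:
            continue  # detail lines, "..." separators, blank lines
        job = (int(match["cluster"]), int(match["proc"]))
        stamp = (match["date"], match["time"])
        if match["code"] == EXECUTE:
            last_start[job] = stamp
        elif match["code"] == EVICTED:
            # pop() so one execute event is paired with at most one eviction
            if last_start.pop(job, None) == stamp:
                flapping[job] += 1
    return flapping
```

Calling count_instant_evictions(open(path).read()) on the log file would return a Counter keyed by (cluster, proc); a large count for the DAGMan job would match the start/evict loop described here.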

These machines only submit jobs and do not execute any.  I don't know whether a DAGMan submission follows the same START rules as machines in the Condor pool, but how do I ensure that a user's machine never evicts the DAGMan job, regardless of circumstances?

(We are using 7.0.4 on most user submit machines, but have been upgrading their condor_submit_dag executables to the latest version.  I'm pretty sure this issue has been seen by users on either version.)

As always, appreciate the assistance :),
Steve

