I was hoping to get some advice on this problem:
We have some submit machines on which a DAG occasionally refuses to run. The user submits a DAG job, and the condor_dagman job then cycles between IDLE and RUNNING, being evicted as soon as it starts.
e.g. (from the dagman output):
001 (4634.000.000) 11/18 11:03:49 Job executing on host: <10.10.xxx.xxx:1118>
004 (4634.000.000) 11/18 11:03:49 Job was evicted.
    (0) Job was not checkpointed.
    Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
    Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
    0 - Run Bytes Sent By Job
    0 - Run Bytes Received By Job
(this continues to repeat over and over)...
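In case it helps anyone check a log for the same pattern, here is a minimal sketch in Python (not part of Condor) that tallies the event codes per job from header lines like the ones above. The names EVENT_RE and count_events are made up for illustration, and the line format is assumed from this excerpt only. A job with as many 004 (evicted) events as 001 (executing) events, and nothing else, is stuck in the loop described here.

```python
import re
from collections import Counter

# Header lines look like (format assumed from the excerpt above):
#   001 (4634.000.000) 11/18 11:03:49 Job executing on host: <...>
# Hypothetical pattern: 3-digit event code, then (cluster.proc.subproc).
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) ")

def count_events(lines):
    """Return {"cluster.proc": Counter({event_code: count})}."""
    counts = {}
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if m is None:
            continue  # detail line (usage, bytes sent, ...), not an event header
        code = m.group(1)
        job_id = f"{int(m.group(2))}.{int(m.group(3))}"
        counts.setdefault(job_id, Counter())[code] += 1
    return counts

# The excerpt from this post, used as sample input.
sample = [
    "001 (4634.000.000) 11/18 11:03:49 Job executing on host: <10.10.xxx.xxx:1118>",
    "004 (4634.000.000) 11/18 11:03:49 Job was evicted.",
    "(0) Job was not checkpointed.",
    "Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage",
    "0 - Run Bytes Sent By Job",
]

for job, codes in count_events(sample).items():
    print(job, dict(codes))
```

To run it against a real log, pass the open file instead of the sample list, e.g. count_events(open(path)) where path is wherever your log file lives.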
These machines only submit jobs; they do not execute any. I don't know whether a DAGMan job is subject to the same START rules as jobs on the execute machines in the Condor pool, but how do I ensure that, regardless of circumstances, a user's submit machine will never evict the condor_dagman job?
(We are using 7.0.4 on most user submit machines, but have been upgrading their condor_submit_dag executables to the latest version. I'm fairly sure this issue has been seen by users on either version.)
As always, appreciate the assistance :),