
[Condor-users] Unexplainable Job Eviction



My group uses Condor on a local cluster.  Our setup is relatively simple and our use of Condor is rather basic.  The types of calculations we run cannot be checkpointed in the way that Condor expects, so we have turned off all preemption options.

Here's our unexplainable event.  Occasionally, jobs stop mid-stream and the Condor log file reports:

--------------------------START OF FILE-----------------------------
000 (12841.000.000) 12/13 14:45:37 Job submitted from host: <10.79.133.101:33346>
...
001 (12841.000.000) 12/13 14:45:42 Job executing on host: <10.79.133.112:32812>
...
006 (12841.000.000) 12/13 14:45:50 Image size of job updated: 1729736
...
006 (12841.000.000) 12/13 15:05:50 Image size of job updated: 1731508
...
007 (12841.000.000) 12/15 23:51:00 Shadow exception!
        Can no longer talk to condor_starter on execute machine (10.79.133.112)
        0  -  Run Bytes Sent By Job
        6891  -  Run Bytes Received By Job
...
------------------------END OF FILE-------------------------------------

Can someone explain what message 007 means and what sorts of pathologies this indicates?


Regards,
HPH

--
Hrant P. Hratchian, Ph.D.
E. R. Davidson Fellow
Department of Chemistry
Indiana University
Bloomington, Indiana 47405-7102
812.856.0829
hhratchi@xxxxxxxxxxx