
Re: [Condor-users] Unexplainable Job Eviction



Regarding the error message below:

It means that the condor_starter process on the execute node suddenly and unexpectedly disappeared.  In condor_config, is
   KILL = FALSE
?

Considering this happened about 10 minutes before midnight (23:51), could your machines be rebooting at that time due to patch installations or something similar?
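
One way to test the reboot theory is to pull every 007 event out of your job logs and see whether they cluster at the same time of day. Here is a rough Python sketch. It assumes the event header looks exactly like the one in the log quoted below (event number, job id in parentheses, MM/DD, HH:MM:SS, text); the function name shadow_exceptions is made up for illustration and is not part of Condor.

```python
import re
from datetime import datetime

# Event header as it appears in the quoted user log, e.g.
# "007 (12841.000.000) 12/15 23:51:00 Shadow exception!"
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.\d+\.\d+\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$"
)

def shadow_exceptions(log_text):
    """Return (cluster_id, 'MM/DD', time_of_day) for each 007 event.

    Hypothetical helper: it only looks at event header lines and
    ignores the indented detail lines that follow them.
    """
    hits = []
    for line in log_text.splitlines():
        m = EVENT_RE.match(line.strip())
        if m and m.group(1) == "007":
            # The user log carries no year, so only the time of day is parsed.
            time_of_day = datetime.strptime(m.group(4), "%H:%M:%S").time()
            hits.append((int(m.group(2)), m.group(3), time_of_day))
    return hits
```

If most of the times it returns fall in the same few minutes each night, a scheduled reboot or cron job on the execute nodes becomes a much stronger suspect.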

Also, check for clues in the StarterLog on the execute machine... What do these log files say happened at that time?
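
If the StarterLog is large, something like the following Python sketch can narrow it down to the minutes around the failure. It assumes each log line starts with a "MM/DD HH:MM:SS " timestamp; check that against your own files. The function name lines_near, the 10 minute window, and the default year are my own choices for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed line prefix: "MM/DD HH:MM:SS " (no year in the log itself).
STAMP_RE = re.compile(r"^(\d{1,2})/(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) ")

def lines_near(log_lines, month, day, hour, minute,
               window_minutes=10, year=2006):
    """Return the log lines stamped within window_minutes of the target time.

    Hypothetical helper; the year must be supplied because the
    timestamps do not include one.
    """
    target = datetime(year, month, day, hour, minute)
    window = timedelta(minutes=window_minutes)
    matches = []
    for line in log_lines:
        m = STAMP_RE.match(line)
        if not m:
            continue  # continuation line or unexpected format
        mo, d, h, mi, s = (int(g) for g in m.groups())
        try:
            stamp = datetime(year, mo, d, h, mi, s)
        except ValueError:
            continue  # skip lines with an impossible date
        if abs(stamp - target) <= window:
            matches.append(line.rstrip("\n"))
    return matches

# Usage (the path is a placeholder; use the log directory on your execute node):
# with open("/path/to/StarterLog") as f:
#     for line in lines_near(f, 12, 15, 23, 51):
#         print(line)
```

Lines without a leading timestamp are skipped, so multi-line entries will show only their first line.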

---
Todd Tannenbaum
Dept of Computer Sciences
University of Wisconsin-Madison
..Sent from a Palm Treo 680...

-----Original Message-----

From:  "Hrant P. Hratchian" <hhratchi@xxxxxxxxxxx>
Subj:  [Condor-users] Unexplainable Job Eviction
Date:  Tue Dec 19, 2006 10:21 am
To:  "Condor Users" <condor-users@xxxxxxxxxxx>

My group uses Condor on a local cluster.  Our set-up is relatively simple
and our use of Condor is rather basic.  The types of calculations we run
cannot be checkpointed in the way that Condor would like, so we have turned
off all preemption options.

Here's our unexplainable event.  Occasionally, jobs stop mid-stream and the
Condor log file reports:

--------------------------START OF FILE-----------------------------
000 (12841.000.000) 12/13 14:45:37 Job submitted from host: <10.79.133.101:33346>
...
001 (12841.000.000) 12/13 14:45:42 Job executing on host: <10.79.133.112:32812>
...
006 (12841.000.000) 12/13 14:45:50 Image size of job updated: 1729736
...
006 (12841.000.000) 12/13 15:05:50 Image size of job updated: 1731508
...
007 (12841.000.000) 12/15 23:51:00 Shadow exception!
        Can no longer talk to condor_starter on execute machine (10.79.133.112)
        0  -  Run Bytes Sent By Job
        6891  -  Run Bytes Received By Job
...
------------------------END OF FILE-------------------------------------

Can someone explain what message 007 means and what sorts of pathologies
this indicates?


Regards,
HPH

-- 
Hrant P. Hratchian, Ph.D.

--- message truncated ---