
[Condor-users] Complete jobs occasionally do not leave run state



Dear Condor community,

I am using DAGMan to submit a batch of standard universe jobs
heterogeneously (SunOS + Linux) on a shared filesystem. My Condor
version is 6.6.0.

In rare circumstances that I unfortunately cannot reproduce, a job
will complete but remain in the Run state indefinitely.

I know the job has completed because the following line appears in
its stderr log: "Terminating at example 1". The final three lines of
main() in the program's C++ code are:
        cerr << "Terminating at example " << example_count << "\n";
        cerr.flush();
        return 0;
The only explanation I can think of is that something went wrong
during cleanup (e.g. static destructors or memory deallocation)
after main() returned.

This particular time I encountered this behavior, the pertinent job
line from condor_q is:
50813.0   turian          6/23 07:22   0+03:27:32 R  0   200.0
weak-hypothesis.$$
condor_q -long says:
50813.000:  Request is being serviced

I grepped for 50813 in the logs; the only thing out of the ordinary is
in the DAGMan log:
007 (50813.000.000) 06/23 10:24:39 Shadow exception!
        Failed to connect to schedd!
        8396914  -  Run Bytes Sent By Job
        14539960  -  Run Bytes Received By Job
...
001 (50813.000.000) 06/23 10:24:40 Job executing on host: <128.122.140.86:32773>

What does this error mean? How can I avoid this in the future?

NB: I removed the job from the queue and set up a new DAGMan job to
start where it left off (i.e., assuming the job completed successfully,
since its output was intact), so I can no longer query this particular
errant job. But if anyone can suggest diagnostics I can perform the
next time this occurs, I'm all ears.

Thanks,
   Joseph

-- 
http://www.cs.nyu.edu/~turian/