
Re: [HTCondor-users] job cannot reconnect to starter running MPI



I was monitoring the logs in the execution nodes as suggested earlier and I got some errors just after HTCondor set the job from Running -> Idle.

My question is, why is the job going idle? What was in the starter log before it caught the SIGQUIT (why is it being killed)? The ShadowLog you quoted below runs from 14:37:13 to 14:40:04; what do the StartLog and StarterLog.* say for that period of time on those nodes? Why is the StartLog you showed full of deactivate_claim_forcibly()?
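
To pull just that window out of each log, something like the following could help. It is a minimal sketch, not an HTCondor tool: it assumes every log line of interest carries an HH:MM:SS timestamp near its start, and the function name lines_in_window is made up for illustration. Lines without a timestamp (for example continuation lines) are skipped, and the date is ignored, so it only makes sense on a log that covers a single day.

```python
import re
from datetime import time

# Assumption: each relevant log line contains an HH:MM:SS timestamp.
TIME_RE = re.compile(r"\b(\d{2}):(\d{2}):(\d{2})\b")


def lines_in_window(lines, start, end):
    """Yield lines whose first HH:MM:SS timestamp lies in [start, end]."""
    for line in lines:
        match = TIME_RE.search(line)
        if match is None:
            continue  # no timestamp on this line, skip it
        hour, minute, second = map(int, match.groups())
        if hour > 23 or minute > 59 or second > 59:
            continue  # looks like a timestamp but is not a valid time
        if start <= time(hour, minute, second) <= end:
            yield line
```

Reading a StartLog or StarterLog.* file line by line and passing the lines with start=time(14, 37, 13) and end=time(14, 40, 4) would give the entries that overlap the ShadowLog excerpt.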

- ToddM