
Re: [HTCondor-users] SIGQUIT / debugging



On Feb 19, 2013, at 7:34 AM, "Shrum, Donald C" <DCShrum@xxxxxxxxxxxxx> wrote:

> I periodically see jobs that fail with a SIGQUIT.
> 
> In the scheduler:
> SchedLog:02/18/13 19:47:35 (pid:25985) match (slot3@xxxxxxxxxxxxxxxxxx <10.178.6.101:54726> for nmg11) switching to job 5911.734
> SchedLog:02/18/13 19:47:35 (pid:25985) Started shadow for job 5911.734 on slot3@xxxxxxxxxxxxxxxxxx <10.178.6.101:54726> for nmg11, (shadow pid = 14851)
> SchedLog:02/18/13 19:47:37 (pid:25985) Negotiating for owner: nmg11@xxxxxxxxx
> SchedLog:02/18/13 19:47:37 (pid:25985) Finished negotiating for nmg11 in local pool: 0 matched, 1 rejected
> 
> On the processing node (slot3@xxxxxxxxxxxxxxxxxx in this case) I see:
> 02/18/13 19:47:36 Create_Process succeeded, pid=5788
> 02/18/13 21:10:27 Process exited, pid=5788, status=0
> 02/18/13 21:10:27 Got SIGQUIT.  Performing fast shutdown.
> 02/18/13 21:10:27 ShutdownFast all jobs.
> 02/18/13 21:10:27 **** condor_starter (condor_STARTER) pid 5785 EXITING WITH STATUS 0
> 
> 
> I'm inclined to think the job crashed or failed and that the SIGQUIT was sent to condor as a result of the crash.  Is there something else going on that I should debug?  Google has not been much help thus far  :)


These logs show that the job completed normally with exit code 0. The SIGQUIT is sent to the condor_starter process as part of the cleanup after the job completes. There's no sign here of anything unusual. Are there any other indications you're seeing that suggest that the job crashed?
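
If you want to check a batch of starter logs for this same pattern (job exit line first, SIGQUIT afterwards), a small Python sketch along these lines might help. It assumes the log lines look like the ones quoted above; the function name summarize_starter_log and the sample list are made up for illustration and are not part of HTCondor.

```python
import re

# Matches the job-exit line as it appears in the quoted starter log.
EXIT_RE = re.compile(r"Process exited, pid=(\d+), status=(\d+)")


def summarize_starter_log(lines):
    """Return (pid, status, sigquit_after_exit) for the first job exit
    found in the given log lines, or None if no exit line is present."""
    exit_info = None
    sigquit_after_exit = False
    for line in lines:
        match = EXIT_RE.search(line)
        if match and exit_info is None:
            exit_info = (int(match.group(1)), int(match.group(2)))
        elif "Got SIGQUIT" in line and exit_info is not None:
            # SIGQUIT seen only after the job had already exited.
            sigquit_after_exit = True
    if exit_info is None:
        return None
    return exit_info + (sigquit_after_exit,)


# Sample lines copied from the log excerpt in this thread.
sample = [
    "02/18/13 19:47:36 Create_Process succeeded, pid=5788",
    "02/18/13 21:10:27 Process exited, pid=5788, status=0",
    "02/18/13 21:10:27 Got SIGQUIT.  Performing fast shutdown.",
    "02/18/13 21:10:27 ShutdownFast all jobs.",
]

print(summarize_starter_log(sample))
```

A status of 0 followed by the SIGQUIT is the normal completion sequence described above. A nonzero status, or a SIGQUIT with no preceding exit line, would be worth a closer look.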

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project