Re: [HTCondor-users] SIGQUIT / debugging
- Date: Wed, 20 Feb 2013 03:45:02 +0000
- From: "Shrum, Donald C" <DCShrum@xxxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] SIGQUIT / debugging
Thanks for the reply, Jaime. Here is some more detail. The program I am running is a simple test program.
The problem seems to occur on only one submit node, so perhaps I will figure this out before getting a reply :)
FSU Research Computing Center
When I submit a job I see this on the scheduler -
02/19/13 22:32:15 (pid:2939) Sent ad to central manager for dcshrum@xxxxxxxxxxxxxxxxxx
02/19/13 22:32:17 (pid:2939) Finished negotiating for dcshrum in local pool: 1 matched, 0 rejected
02/19/13 22:32:18 (pid:2939) match (slot5@xxxxxxxxxxxxxxxxx <10.178.6.41:46528> for dcshrum) out of jobs; relinquishing
02/19/13 22:32:18 (pid:2939) Completed RELEASE_CLAIM to startd slot5@xxxxxxxxxxxxxxxxx <10.178.6.41:46528> for dcshrum
02/19/13 22:32:18 (pid:2939) Match record (slot5@xxxxxxxxxxxxxxxxx <10.178.6.41:46528> for dcshrum, 12.0) deleted
I get an error back in my job log that looks like this -
000 (012.000.000) 02/19 22:32:15 Job submitted from host: <10.178.6.3:46701>
001 (012.000.000) 02/19 22:32:18 Job executing on host: <10.178.6.41:46528>
005 (012.000.000) 02/19 22:32:18 Job terminated.
(0) Abnormal termination (signal 11)
(0) No core file
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
7543 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
7543 - Total Bytes Received By Job
Partitionable Resources :    Usage  Request  Allocated
   Cpus                 :                 1          1
   Disk (KB)            :       15       10   55363437
   Memory (MB)          :                 1       6033
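As an aside, the signal number in the "Abnormal termination" line can be turned into a symbolic name, which makes the failure easier to search for. The following is only a minimal sketch: the function name `abnormal_signals` and the regular expression are my own assumptions, based solely on the log text pasted above, and are not part of any HTCondor tool. On Linux, signal 11 maps to SIGSEGV (a segmentation fault).

```python
import re
import signal

# Pattern assumed from the job log excerpt above; not an official HTCondor format spec.
TERMINATION = re.compile(r"Abnormal termination \(signal (\d+)\)")


def abnormal_signals(log_text):
    """Return symbolic names for every signal reported as an abnormal termination."""
    names = []
    for match in TERMINATION.finditer(log_text):
        number = int(match.group(1))
        try:
            # signal.Signals maps a number to its platform-specific name.
            names.append(signal.Signals(number).name)
        except ValueError:
            # Unknown on this platform: fall back to the raw number.
            names.append(f"signal {number}")
    return names


sample = """005 (012.000.000) 02/19 22:32:18 Job terminated.
(0) Abnormal termination (signal 11)
(0) No core file"""

print(abnormal_signals(sample))
```

The mapping comes from the machine that runs the script, so it should be run on the same platform as the execute node.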
If I look at the processing node (10.178.6.41 in this case) I see this in the logs...
02/19/13 22:32:18 About to exec /var/lib/condor/execute/dir_28037/condor_exec.exe
02/19/13 22:32:18 Create_Process succeeded, pid=28040
02/19/13 22:32:18 Process exited, pid=28040, signal=11
02/19/13 22:32:18 Got SIGQUIT. Performing fast shutdown.
02/19/13 22:32:18 ShutdownFast all jobs.
02/19/13 22:32:18 **** condor_starter (condor_STARTER) pid 28037 EXITING WITH STATUS 0
From: htcondor-users-bounces@xxxxxxxxxxx [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Jaime Frey
Sent: Tuesday, February 19, 2013 12:32 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] SIGQUIT / debugging
On Feb 19, 2013, at 7:34 AM, "Shrum, Donald C" <DCShrum@xxxxxxxxxxxxx> wrote:
> I periodically see jobs that fail with a SIGQUIT
> In the scheduler:
> SchedLog:02/18/13 19:47:35 (pid:25985) match (slot3@xxxxxxxxxxxxxxxxxx <10.178.6.101:54726> for nmg11) switching to job 5911.734
> SchedLog:02/18/13 19:47:35 (pid:25985) Started shadow for job 5911.734 on slot3@xxxxxxxxxxxxxxxxxx <10.178.6.101:54726> for nmg11, (shadow pid = 14851)
> SchedLog:02/18/13 19:47:37 (pid:25985) Negotiating for owner: nmg11@xxxxxxxxx
> SchedLog:02/18/13 19:47:37 (pid:25985) Finished negotiating for nmg11 in local pool: 0 matched, 1 rejected
> On the processing node (slot3@xxxxxxxxxxxxxxxxxx in this case) I see:
> 02/18/13 19:47:36 Create_Process succeeded, pid=5788
> 02/18/13 21:10:27 Process exited, pid=5788, status=0
> 02/18/13 21:10:27 Got SIGQUIT. Performing fast shutdown.
> 02/18/13 21:10:27 ShutdownFast all jobs.
> 02/18/13 21:10:27 **** condor_starter (condor_STARTER) pid 5785 EXITING WITH STATUS 0
> I'm inclined to think the job crashed or failed and that the SIGQUIT was sent to condor as a result of the crash. Is there something else going on that I should debug? Google has not been much help thus far :)
These logs show that the job completed normally with exit code 0. The SIGQUIT is sent to the condor_starter process as part of the cleanup after the job completes. There's no sign here of anything unusual. Are there any other indications you're seeing that suggest that the job crashed?
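To make the distinction concrete, here is a rough sketch of how one might scan starter log lines for the job's real outcome. It looks only at the "Process exited" lines and ignores the "Got SIGQUIT" line, since that one belongs to routine cleanup. The helper name `classify_exits` and the regular expression are illustrative assumptions derived from the log lines quoted in this thread, not an HTCondor API.

```python
import re

# Matches lines such as "Process exited, pid=5788, status=0"
# or "Process exited, pid=28040, signal=11" (format assumed from this thread).
EXIT_LINE = re.compile(r"Process exited, pid=(\d+), (status|signal)=(\d+)")


def classify_exits(starter_log_lines):
    """Return (pid, description) for each job process exit found in the lines."""
    results = []
    for line in starter_log_lines:
        match = EXIT_LINE.search(line)
        if match is None:
            # "Got SIGQUIT" and other starter housekeeping lines are skipped.
            continue
        pid = int(match.group(1))
        kind = match.group(2)
        value = int(match.group(3))
        if kind == "signal":
            results.append((pid, f"killed by signal {value}"))
        else:
            results.append((pid, f"exited with status {value}"))
    return results


lines = [
    "02/18/13 21:10:27 Process exited, pid=5788, status=0",
    "02/18/13 21:10:27 Got SIGQUIT. Performing fast shutdown.",
    "02/19/13 22:32:18 Process exited, pid=28040, signal=11",
]

for pid, outcome in classify_exits(lines):
    print(pid, outcome)
```

A "status=" line means the job exited on its own, while a "signal=" line means the job itself was killed by that signal.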
Thanks and regards,
UW-Madison HTCondor Project