
Re: [HTCondor-users] SIGQUIT / debugging



What do you mean by "simple test program"? Note that the signal it is receiving is SIGSEGV, i.e. segmentation fault. Seeing that your program terminates immediately, there is very likely a problem with the environment or hardware on your worker node. Can you run the program manually there?
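For anyone repeating the manual run suggested above, here is a minimal Python sketch that runs a command and reports whether it exited normally or was killed by a signal. The helper names describe_exit and run_and_describe are made up for this illustration, and the signal-number-to-name mapping assumes a typical Linux worker node, where 11 is SIGSEGV.

```python
import signal
import subprocess


def describe_exit(returncode: int) -> str:
    """Describe a subprocess return code in plain words.

    On POSIX systems, subprocess reports death by signal N as -N.
    """
    if returncode < 0:
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known to this platform.
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {returncode}"


def run_and_describe(cmd: list) -> str:
    """Run cmd (a list of arguments) and describe how it ended."""
    result = subprocess.run(cmd)
    return describe_exit(result.returncode)
```

Calling run_and_describe(["./my_test_program"]) on the worker node (the path is a placeholder) would show whether the crash also happens outside HTCondor.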

-Max

On 02/20/2013 04:45 AM, Shrum, Donald C wrote:
Thanks for the reply, Jaime. Here is some more detail. The program I am running is a simple test program.
The problem seems to occur on only one submit node so perhaps I will figure this out prior to getting a reply  :)

--Donny
FSU Research Computing Center

When I submit a job, I see this on the scheduler:
02/19/13 22:32:15 (pid:2939) Sent ad to central manager for dcshrum@xxxxxxxxxxxxxxxxxx
<snip>
02/19/13 22:32:17 (pid:2939) Finished negotiating for dcshrum in local pool: 1 matched, 0 rejected
<snip>
02/19/13 22:32:18 (pid:2939) match (slot5@xxxxxxxxxxxxxxxxx <10.178.6.41:46528> for dcshrum) out of jobs; relinquishing
02/19/13 22:32:18 (pid:2939) Completed RELEASE_CLAIM to startd slot5@xxxxxxxxxxxxxxxxx <10.178.6.41:46528> for dcshrum
02/19/13 22:32:18 (pid:2939) Match record (slot5@xxxxxxxxxxxxxxxxx <10.178.6.41:46528> for dcshrum, 12.0) deleted

I get an error back in my job log that looks like this:
000 (012.000.000) 02/19 22:32:15 Job submitted from host: <10.178.6.3:46701>
001 (012.000.000) 02/19 22:32:18 Job executing on host: <10.178.6.41:46528>
005 (012.000.000) 02/19 22:32:18 Job terminated.
         (0) Abnormal termination (signal 11)
         (0) No core file
                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                 Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                 Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
         0  -  Run Bytes Sent By Job
         7543  -  Run Bytes Received By Job
         0  -  Total Bytes Sent By Job
         7543  -  Total Bytes Received By Job
         Partitionable Resources :    Usage  Request Allocated
            Cpus                 :                       1         1
            Disk (KB)            :       15       10  55363437
            Memory (MB)          :              1      6033

If I look at the processing node (10.178.6.41 in this case) I see this in the logs...
02/19/13 22:32:18 About to exec /var/lib/condor/execute/dir_28037/condor_exec.exe
02/19/13 22:32:18 Create_Process succeeded, pid=28040
02/19/13 22:32:18 Process exited, pid=28040, signal=11
02/19/13 22:32:18 Got SIGQUIT.  Performing fast shutdown.
02/19/13 22:32:18 ShutdownFast all jobs.
02/19/13 22:32:18 **** condor_starter (condor_STARTER) pid 28037 EXITING WITH STATUS 0





-----Original Message-----
From: htcondor-users-bounces@xxxxxxxxxxx [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Jaime Frey
Sent: Tuesday, February 19, 2013 12:32 PM
To: HTCondor-Users Mail List
Cc: condor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] SIGQUIT / debugging

On Feb 19, 2013, at 7:34 AM, "Shrum, Donald C" <DCShrum@xxxxxxxxxxxxx> wrote:

I periodically see jobs that fail with a SIGQUIT

In the scheduler:
SchedLog:02/18/13 19:47:35 (pid:25985) match (slot3@xxxxxxxxxxxxxxxxxx <10.178.6.101:54726> for nmg11) switching to job 5911.734
SchedLog:02/18/13 19:47:35 (pid:25985) Started shadow for job 5911.734 on slot3@xxxxxxxxxxxxxxxxxx <10.178.6.101:54726> for nmg11, (shadow pid = 14851)
SchedLog:02/18/13 19:47:37 (pid:25985) Negotiating for owner: nmg11@xxxxxxxxx
SchedLog:02/18/13 19:47:37 (pid:25985) Finished negotiating for nmg11 in local pool: 0 matched, 1 rejected

On the processing node (slot3@xxxxxxxxxxxxxxxxxx in this case) I see:
02/18/13 19:47:36 Create_Process succeeded, pid=5788
02/18/13 21:10:27 Process exited, pid=5788, status=0
02/18/13 21:10:27 Got SIGQUIT.  Performing fast shutdown.
02/18/13 21:10:27 ShutdownFast all jobs.
02/18/13 21:10:27 **** condor_starter (condor_STARTER) pid 5785 EXITING WITH STATUS 0


I'm inclined to think the job crashed or failed and the SIGQUIT was sent to condor as a result of the crash. Is there something else going on that I should debug? Google has not been much help thus far :)

These logs show that the job completed normally with exit code 0. The SIGQUIT is sent to the condor_starter process as part of the cleanup after the job completes. There's no sign here of anything unusual. Are there any other indications you're seeing that suggest that the job crashed?
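The distinction drawn above (a "Process exited" line with status=0 versus one with signal=11) can be checked mechanically. The following is a rough Python sketch that assumes the starter log lines look exactly like the ones quoted in this thread; the regular expression and the function name classify_exit are illustrative, not part of HTCondor.

```python
import re

# Matches lines such as:
#   02/18/13 21:10:27 Process exited, pid=5788, status=0
#   02/19/13 22:32:18 Process exited, pid=28040, signal=11
EXIT_LINE = re.compile(r"Process exited, pid=(\d+), (status|signal)=(\d+)")


def classify_exit(line):
    """Return (pid, description) for a 'Process exited' line, else None."""
    match = EXIT_LINE.search(line)
    if match is None:
        return None
    pid = int(match.group(1))
    kind = match.group(2)
    value = int(match.group(3))
    if kind == "signal":
        # The job itself was killed by a signal (11 is SIGSEGV on Linux).
        return (pid, f"killed by signal {value}")
    # The job ran to completion and returned an exit status.
    return (pid, f"exited with status {value}")
```

The "Got SIGQUIT" line that follows in the starter log is not matched on purpose, since it refers to the starter's own cleanup rather than to the job.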

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

