
Re: [HTCondor-users] SIGQUIT / debugging



Hi Max, 

I thought the same thing, so I ran the program manually on that node and it ran just fine.

Here is the program; as you can see, it's pretty simple :)

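#include <stdio.h>
#include <stdlib.h>

/* Steps through the decimal expansion of 355/113 by long division. */
int main(void)
{
        int mod = 355 % 113;            /* remainder after the integer part */
        int next = (mod * 10) / 113;    /* next decimal digit */

        long int cnt = 0;
        long int max = 1000;

        /* Stop when the division terminates or after max digits. */
        while ((mod > 0) && (cnt < max))
        {
                cnt++;
                mod = (mod * 10) % 113;
                next = (mod * 10) / 113;
        }

        (void)next;     /* computed but never printed */
        printf("Done\n");

        return 0;
}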



-----Original Message-----
From: htcondor-users-bounces@xxxxxxxxxxx [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Max Fischer
Sent: Wednesday, February 20, 2013 8:50 AM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] SIGQUIT / debugging

What do you mean by "simple test program"? Note that the signal it is receiving is SIGSEGV, i.e. segmentation fault. Seeing that your program terminates immediately, there is very likely a problem with the environment or hardware on your worker node. Can you run the program manually there?

-Max
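
The two starter log lines quoted in this thread ("Process exited, pid=5788, status=0" and "Process exited, pid=28040, signal=11") correspond to the two outcomes a parent process can read from a child's wait status. The following is a minimal C sketch of that distinction using fork() and waitpid(). It is an illustration only, not how condor_starter is implemented; describe_status and run_and_wait are hypothetical names, and reading signal 11 as SIGSEGV assumes Linux signal numbering.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Format a waitpid() status so that a normal exit ("status=N") can be
 * told apart from death by signal ("signal=N"). Hypothetical helper. */
void describe_status(int status, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    if (WIFEXITED(status))
        snprintf(buf, len, "exited, status=%d", WEXITSTATUS(status));
    else if (WIFSIGNALED(status))
        snprintf(buf, len, "exited, signal=%d", WTERMSIG(status));
    else
        snprintf(buf, len, "unrecognized wait status %d", status);
}

/* Run argv[0] (looked up in PATH) in a child process and return its
 * raw wait status, or -1 if fork() or waitpid() fails. Hypothetical helper. */
int run_and_wait(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);             /* only reached if exec failed */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
```

Calling run_and_wait with the path to the test executable and passing the result to describe_status on the affected node would show whether the crash reproduces outside of HTCondor. If it only happens under HTCondor, the difference is more likely in the job's environment than in the program itself.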

On 02/20/2013 04:45 AM, Shrum, Donald C wrote:
> Thanks for the reply, Jaime. Here is some more detail. The program I am running is a simple test program.
> The problem seems to occur on only one submit node so perhaps I will 
> figure this out prior to getting a reply  :)
>
> --Donny
> FSU Research Computing Center
>
> When I submit a job I see this on the scheduler -
> 02/19/13 22:32:15 (pid:2939) Sent ad to central manager for 
> dcshrum@xxxxxxxxxxxxxxxxxx <snip>
> 02/19/13 22:32:17 (pid:2939) Finished negotiating for dcshrum in local 
> pool: 1 matched, 0 rejected <snip>
> 02/19/13 22:32:18 (pid:2939) match (slot5@xxxxxxxxxxxxxxxxx 
> <10.178.6.41:46528> for dcshrum) out of jobs; relinquishing
> 02/19/13 22:32:18 (pid:2939) Completed RELEASE_CLAIM to startd 
> slot5@xxxxxxxxxxxxxxxxx <10.178.6.41:46528> for dcshrum
> 02/19/13 22:32:18 (pid:2939) Match record (slot5@xxxxxxxxxxxxxxxxx 
> <10.178.6.41:46528> for dcshrum, 12.0) deleted
>
> I get an error back in my job log that looks like this -
> 000 (012.000.000) 02/19 22:32:15 Job submitted from host: 
> <10.178.6.3:46701>
> 001 (012.000.000) 02/19 22:32:18 Job executing on host: 
> <10.178.6.41:46528>
> 005 (012.000.000) 02/19 22:32:18 Job terminated.
>          (0) Abnormal termination (signal 11)
>          (0) No core file
>                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
>                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
>                  Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
>          0  -  Run Bytes Sent By Job
>          7543  -  Run Bytes Received By Job
>          0  -  Total Bytes Sent By Job
>          7543  -  Total Bytes Received By Job
>          Partitionable Resources :    Usage  Request Allocated
>             Cpus                 :                       1         1
>             Disk (KB)            :             15       10  55363437
>             Memory (MB)          :              1      6033
>
> If I look at the processing node (10.178.6.41 in this case) I see this in the logs...
> 02/19/13 22:32:18 About to exec 
> /var/lib/condor/execute/dir_28037/condor_exec.exe
> 02/19/13 22:32:18 Create_Process succeeded, pid=28040
> 02/19/13 22:32:18 Process exited, pid=28040, signal=11
> 02/19/13 22:32:18 Got SIGQUIT.  Performing fast shutdown.
> 02/19/13 22:32:18 ShutdownFast all jobs.
> 02/19/13 22:32:18 **** condor_starter (condor_STARTER) pid 28037 
> EXITING WITH STATUS 0
>
>
>
>
>
> -----Original Message-----
> From: htcondor-users-bounces@xxxxxxxxxxx 
> [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Jaime Frey
> Sent: Tuesday, February 19, 2013 12:32 PM
> To: HTCondor-Users Mail List
> Cc: condor-users@xxxxxxxxxxx
> Subject: Re: [HTCondor-users] SIGQUIT / debugging
>
> On Feb 19, 2013, at 7:34 AM, "Shrum, Donald C" <DCShrum@xxxxxxxxxxxxx> wrote:
>
>> I periodically see jobs that fail with a SIGQUIT.
>>
>> In the scheduler:
>> SchedLog:02/18/13 19:47:35 (pid:25985) match 
>> (slot3@xxxxxxxxxxxxxxxxxx <10.178.6.101:54726> for nmg11) switching 
>> to job 5911.734
>> SchedLog:02/18/13 19:47:35 (pid:25985) Started shadow for job 
>> 5911.734 on slot3@xxxxxxxxxxxxxxxxxx <10.178.6.101:54726> for nmg11, 
>> (shadow pid = 14851)
>> SchedLog:02/18/13 19:47:37 (pid:25985) Negotiating for owner: 
>> nmg11@xxxxxxxxx
>> SchedLog:02/18/13 19:47:37 (pid:25985) Finished negotiating for nmg11 
>> in local pool: 0 matched, 1 rejected
>>
>> On the processing node (slot3@xxxxxxxxxxxxxxxxxx in this case) I see:
>> 02/18/13 19:47:36 Create_Process succeeded, pid=5788
>> 02/18/13 21:10:27 Process exited, pid=5788, status=0
>> 02/18/13 21:10:27 Got SIGQUIT.  Performing fast shutdown.
>> 02/18/13 21:10:27 ShutdownFast all jobs.
>> 02/18/13 21:10:27 **** condor_starter (condor_STARTER) pid 5785 
>> EXITING WITH STATUS 0
>>
>>
>> I'm inclined to think the job crashed or failed and the SIGQUIT was 
>> sent to condor as a result of the crash.  Is there something else 
>> going on that I should debug?  Google has not been much help thus far  
>> :)
>
> These logs show that the job completed normally with exit code 0. The SIGQUIT is sent to the condor_starter process as part of the cleanup after the job completes. There's no sign here of anything unusual. Are there any other indications you're seeing that suggest that the job crashed?
>
> Thanks and regards,
> Jaime Frey
> UW-Madison HTCondor Project
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx 
> with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/
>
>
