
Re: [Condor-users] Condor SOAP hanging schedd.



Please capture some pstack output from the schedd when it is hung and report back.

(as root)
while [ 1 ]; do date; pstack $(pidof condor_schedd); sleep 3; done | tee schedd.$(pidof condor_schedd).pstack
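
To decide when to start the capture, a rough sketch like the one below might help. It scans SchedLog for an HTTP request that was started but never logged as completed, using the message strings from the log excerpts quoted below. The function name pending_soap_request is hypothetical and not part of Condor, and the message strings are assumed to match your version. A request that is still in progress also shows up as pending, so compare the printed timestamp with the current time before treating it as a hang.

```shell
#!/bin/sh
# Hypothetical helper (not part of Condor): look for an HTTP/SOAP request in
# a SchedLog file that was started but has no later completion message.
# Prints the date and time of the pending request and returns 0 if one is
# found; prints nothing and returns 1 otherwise.
pending_soap_request() {
    awk '
        # A request starts being served: remember its date and time fields.
        /About to serve HTTP request/      { pending = $1 " " $2 }
        # The request finished: nothing is pending any more.
        /Completed servicing HTTP request/ { pending = "" }
        END {
            if (pending != "") { print pending; exit 0 }
            exit 1
        }
    ' "$1"
}
```

For example, "pending_soap_request /var/log/condor/SchedLog" (the log path from the original report) prints the start time of the last unfinished request, if there is one.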

Best,


matt

On 06/21/2010 02:19 PM, Patrick Armstrong wrote:
> Has anyone else ever seen this problem? Is there any more information I
> can provide?
> 
> 
> 
> On 16-Jun-10, at 11:41 AM, Patrick Armstrong wrote:
> 
>> I've been having some trouble with Condor SOAP queries hanging my
>> schedd. I have Condor 7.5.2 installed, with a pool of about 200
>> workers and about 10000 jobs in my queue. Every ten minutes or
>> so, a script of mine queries the schedd through the SOAP interface.
>> Normally, this takes about two minutes and looks like this in the log:
>>
>> 06/16/10 10:39:51 Received HTTP POST connection from <127.0.0.1:59318>
>> 06/16/10 10:39:51 Current Socket bufsize=85k
>> 06/16/10 10:39:51 Current Socket bufsize=49k
>> 06/16/10 10:39:51 About to serve HTTP request...
>> 06/16/10 10:39:51 SOAP entered getJobAds(), transaction: 0
>> 06/16/10 10:39:53 SOAP leaving getJobAds() result=0
>> 06/16/10 10:41:20 Completed servicing HTTP request
>>
>>
>> However, I'll occasionally see the schedd get stuck, and not do
>> anything until I send it SIGKILL. The log looks like this:
>>
>>
>> [root@canfarpool ~]# tail /var/log/condor/SchedLog
>> 06/16/10 10:56:20 Received UDP command 60008 (DC_CHILDALIVE) from <142.104.63.28:48906>, access level DAEMON
>> 06/16/10 10:56:20 Received UDP command 60008 (DC_CHILDALIVE) from <142.104.63.28:48906>, access level DAEMON
>> 06/16/10 10:56:20 Received UDP command 60008 (DC_CHILDALIVE) from <142.104.63.28:48906>, access level DAEMON
>> 06/16/10 10:56:20 Received UDP command 60008 (DC_CHILDALIVE) from <142.104.63.28:48906>, access level DAEMON
>> 06/16/10 10:58:11 Received HTTP POST connection from <127.0.0.1:34416>
>> 06/16/10 10:58:11 Current Socket bufsize=85k
>> 06/16/10 10:58:11 Current Socket bufsize=49k
>> 06/16/10 10:58:11 About to serve HTTP request...
>> 06/16/10 10:58:11 SOAP entered getJobAds(), transaction: 0
>> 06/16/10 10:58:14 SOAP leaving getJobAds() result=0
>> [root@canfarpool ~]# date
>> Wed Jun 16 11:39:57 PDT 2010
>>
>> As you can see, it's been stuck for about 40 minutes.
>>
>>
>> Has anyone else run into this?
>>
>> --patrick
>>
>> _______________________________________________
>> Condor-users mailing list
>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/condor-users/