
Re: [Condor-users] Condor SOAP hanging schedd.



Sorry it took so long for me to reply. Here's the output from running pstack for a while:

http://pastebin.com/GUzUVDYa

thanks

--patrick

On 24-Jun-10, at 5:39 AM, Matthew Farrellee wrote:

Please capture some pstack output from the schedd when it is hung and report back.

(as root)
while [ 1 ]; do date; pstack $(pidof condor_schedd); sleep 3; done | tee schedd.$(pidof condor_schedd).pstack

Best,


matt

On 06/21/2010 02:19 PM, Patrick Armstrong wrote:
Has anyone else ever seen this problem? Is there any more information I
can provide?



On 16-Jun-10, at 11:41 AM, Patrick Armstrong wrote:

I've been having some trouble with Condor SOAP queries hanging my
schedd. I have Condor 7.5.2 installed, with a pool of about 200
workers and about 10000 jobs in my queue. Every ten minutes or so,
a script of mine queries the schedd through the SOAP interface.
Normally, this takes about two minutes, and looks like this in the log:

06/16/10 10:39:51 Received HTTP POST connection from <127.0.0.1:59318>
06/16/10 10:39:51 Current Socket bufsize=85k
06/16/10 10:39:51 Current Socket bufsize=49k
06/16/10 10:39:51 About to serve HTTP request...
06/16/10 10:39:51 SOAP entered getJobAds(), transaction: 0
06/16/10 10:39:53 SOAP leaving getJobAds() result=0
06/16/10 10:41:20 Completed servicing HTTP request


However, I'll occasionally see the schedd get stuck, and not do
anything until I send it SIGKILL. The log looks like this:


[root@canfarpool ~]# tail /var/log/condor/SchedLog
06/16/10 10:56:20 Received UDP command 60008 (DC_CHILDALIVE) from <142.104.63.28:48906>, access level DAEMON
06/16/10 10:56:20 Received UDP command 60008 (DC_CHILDALIVE) from <142.104.63.28:48906>, access level DAEMON
06/16/10 10:56:20 Received UDP command 60008 (DC_CHILDALIVE) from <142.104.63.28:48906>, access level DAEMON
06/16/10 10:56:20 Received UDP command 60008 (DC_CHILDALIVE) from <142.104.63.28:48906>, access level DAEMON
06/16/10 10:58:11 Received HTTP POST connection from <127.0.0.1:34416>
06/16/10 10:58:11 Current Socket bufsize=85k
06/16/10 10:58:11 Current Socket bufsize=49k
06/16/10 10:58:11 About to serve HTTP request...
06/16/10 10:58:11 SOAP entered getJobAds(), transaction: 0
06/16/10 10:58:14 SOAP leaving getJobAds() result=0
[root@canfarpool ~]# date
Wed Jun 16 11:39:57 PDT 2010

As you can see, it's been stuck for about 40 minutes.
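
For anyone wanting to spot this state automatically, here is a rough sketch (not part of Condor) that scans SchedLog for an "About to serve HTTP request..." line with no later "Completed servicing HTTP request" line, and reports how long that request has been outstanding. The function names are made up for illustration, and it assumes the timestamp format and message text shown in the excerpts above; other Condor versions or debug settings may log differently.

```python
import sys
from datetime import datetime

TS_FORMAT = "%m/%d/%y %H:%M:%S"   # e.g. "06/16/10 10:58:11" (17 characters)
START = "About to serve HTTP request"
DONE = "Completed servicing HTTP request"


def pending_soap_request(lines):
    """Return the start time of the last HTTP request with no completion line, or None."""
    pending = None
    for line in lines:
        try:
            stamp = datetime.strptime(line[:17], TS_FORMAT)
        except ValueError:
            continue  # wrapped or non-log lines carry no leading timestamp
        if START in line:
            pending = stamp
        elif DONE in line:
            pending = None
    return pending


def stuck_seconds(lines, now):
    """Seconds between the unfinished request's start and `now`, or None if nothing is pending."""
    start = pending_soap_request(lines)
    return None if start is None else (now - start).total_seconds()


if __name__ == "__main__":
    # Log path taken from the tail command above; pass another path as argv[1].
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/condor/SchedLog"
    with open(path) as log:
        waited = stuck_seconds(log, datetime.now())
    if waited is None:
        print("no HTTP request outstanding")
    else:
        print("HTTP request outstanding for %d seconds" % waited)
```

A normal query is also "outstanding" for a minute or two, so any alerting built on this would need a threshold comfortably above the usual service time.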


Has anyone else run into this?

--patrick

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/
