
Re: [Condor-users] Condor SOAP hanging schedd.



Patrick,

A quick look shows that every sample has the same stack:

#0  0x00d2d402 in __kernel_vsyscall ()
#1  0x004f967d in ___newselect_nocancel () from /lib/i686/nosegneg/libc.so.6
#2  0x08532c14 in __wrap_select ()
#3  0x083343aa in ?? ()
#4  0x083275ed in soap_flush_raw ()
#5  0x0832c726 in soap_flush ()
#6  0x0832c7e4 in soap_send_raw ()
#7  0x0832e3fe in soap_element_start_end_out ()
#8  0x081c2f82 in soap_out_condor__ClassAdStructAttr ()
#9  0x081c316a in soap_out_condor__ClassAdStruct ()
#10 0x081c332d in soap_out_condor__ClassAdStructArray ()
#11 0x081c33ec in soap_out_condor__ClassAdStructArrayAndStatus ()
#12 0x081c347d in soap_out_condor__getJobAdsResponse ()
#13 0x081c857f in soap_put_condor__getJobAdsResponse ()
#14 0x081db518 in soap_serve_condor__getJobAds ()
#15 0x081dde8f in soap_serve_request ()
#16 0x081de1aa in soap_serve ()
#17 0x081e3511 in dc_soap_serve ()
#18 0x081f5e38 in DaemonCore::HandleReq ()

If the Schedd is using CPU and I/O bandwidth while it is in this state, then the response to getJobAds is probably just taking a long time to write. If this happened on every query, I'd suspect the Schedd is simply slow. Since it is intermittent, it could instead be that your client is periodically reading slowly, for example by interleaving reads with computation. That would leave the Schedd waiting in select() inside soap_flush_raw() until the socket can accept more data.
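
To illustrate the slow-reader case, here is a minimal sketch in which a local pipe stands in for the connection between the Schedd and the SOAP client. It is only an analogy, not a reproduction of the gSOAP code path. The function name writer_stall_seconds and the 1 MB payload are assumptions chosen for the example.

```shell
# Illustrative only: a pipe stands in for the Schedd-to-client socket.
# writer_stall_seconds is a hypothetical helper, not part of Condor.
writer_stall_seconds() {
    delay=${1:-2}            # seconds the reader waits before its first read
    start=$(date +%s)
    # The writer sends 1 MB, more than a pipe buffer holds, so it blocks
    # until the reader starts draining. fd 3 carries the writer's finish time
    # out of the pipeline and into the command substitution.
    done_at=$(
        {
            { head -c 1000000 /dev/zero; date +%s >&3; } |
            { sleep "$delay"; cat >/dev/null; }
        } 3>&1
    )
    # Whole seconds the writer needed to finish its write.
    echo $((done_at - start))
}
```

Running writer_stall_seconds 2 should report a stall of at least the reader's delay, because the writer cannot finish until the reader starts consuming. A client that reads the whole response first and processes it afterwards avoids holding the writer up this way.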

It's also possible there's a bug in gSOAP at the data scale of 10,000 jobs, but that would be somewhat surprising.

Best,


matt

Patrick Armstrong wrote:
Sorry it took so long for me to reply. Here's the output from running pstack for a while:

http://pastebin.com/GUzUVDYa

thanks

--patrick

On 24-Jun-10, at 5:39 AM, Matthew Farrellee wrote:

Please capture some pstack output from the schedd when it is hung and report back.

(as root)
while [ 1 ]; do date; pstack $(pidof condor_schedd); sleep 3; done | tee schedd.$(pidof condor_schedd).pstack
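
Once a capture exists, a small helper can show whether every sample is sitting in the same place. This is a rough sketch: summarize_pstack is a hypothetical name, the default frame soap_flush_raw is only an example, and it assumes the frame appears at most once per stack.

```shell
# Hypothetical helper: compare the number of stacks in a pstack capture
# with the number of lines that mention a given frame.
summarize_pstack() {
    file=$1
    frame=${2:-soap_flush_raw}
    # Every stack printed by pstack begins with a "#0 " frame line.
    total=$(grep -c '^#0 ' "$file" || true)
    # Lines mentioning the frame of interest (assumed once per stack).
    hits=$(grep -c -- "$frame" "$file" || true)
    echo "$hits of $total stacks contain $frame"
}
```

For example, summarize_pstack schedd.12345.pstack, where 12345 is a placeholder pid. If the two counts match, the schedd was in that frame in every sample.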

Best,


matt

On 06/21/2010 02:19 PM, Patrick Armstrong wrote:
Has anyone else ever seen this problem? Is there any more information I
can provide?



On 16-Jun-10, at 11:41 AM, Patrick Armstrong wrote:

I've been having some trouble with Condor SOAP queries hanging my
schedd. I have Condor 7.5.2 installed, with a pool of about 200
workers and about 10000 jobs in my queue. Every ten minutes or
so, a script of mine queries the schedd through the SOAP interface.
Normally, this takes about two minutes, and looks like this in the log:

06/16/10 10:39:51 Received HTTP POST connection from <127.0.0.1:59318>
06/16/10 10:39:51 Current Socket bufsize=85k
06/16/10 10:39:51 Current Socket bufsize=49k
06/16/10 10:39:51 About to serve HTTP request...
06/16/10 10:39:51 SOAP entered getJobAds(), transaction: 0
06/16/10 10:39:53 SOAP leaving getJobAds() result=0
06/16/10 10:41:20 Completed servicing HTTP request


However, I'll occasionally see the schedd get stuck, and not do
anything until I send it SIGKILL. The log looks like this:


[root@canfarpool ~]# tail /var/log/condor/SchedLog
06/16/10 10:56:20 Received UDP command 60008 (DC_CHILDALIVE) from <142.104.63.28:48906>, access level DAEMON
06/16/10 10:56:20 Received UDP command 60008 (DC_CHILDALIVE) from <142.104.63.28:48906>, access level DAEMON
06/16/10 10:56:20 Received UDP command 60008 (DC_CHILDALIVE) from <142.104.63.28:48906>, access level DAEMON
06/16/10 10:56:20 Received UDP command 60008 (DC_CHILDALIVE) from <142.104.63.28:48906>, access level DAEMON
06/16/10 10:58:11 Received HTTP POST connection from <127.0.0.1:34416>
06/16/10 10:58:11 Current Socket bufsize=85k
06/16/10 10:58:11 Current Socket bufsize=49k
06/16/10 10:58:11 About to serve HTTP request...
06/16/10 10:58:11 SOAP entered getJobAds(), transaction: 0
06/16/10 10:58:14 SOAP leaving getJobAds() result=0
[root@canfarpool ~]# date
Wed Jun 16 11:39:57 PDT 2010

As you can see, it's been stuck for about 40 minutes.


Has anyone else run into this?

--patrick

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/
