
Re: [Condor-users] Condor SOAP bug: stopping server when there are pending transactions hangs daemons

Someone with a better understanding of Condor internals, please let me know if I'm wrong here.


On May 2, 2006, at 9:30 AM, David E. Konerding wrote:
Matthew Farrellee wrote:
On May 1, 2006, at 5:42 PM, David E. Konerding wrote:

I am noticing a very inconvenient bug with Condor SOAP:

If a transaction is begun, and has not yet expired, stopping the
master causes all the daemons to go to a zombie state and hang around.

This is probably the same problem as the condor_q issue below. All
Condor daemons are single threaded, so if there is a SOAP transaction
active no one can talk to the Schedd. I'm guessing that the Master
just gives up trying to tell its children to shut down at some point
and exits. If the children are shut down serially then a "hanging"
Schedd at the beginning of the child list would account for this.

I'm confused by this answer. There is nothing in a single threaded
application which prevents a server from maintaining more than one
simultaneous transaction (database servers do this all the time). Nor
is there anything that prevents a server from listening on a port and
responding to multiple requests (nearly) simultaneously.

This is certainly true, and I think I was wrong with what I said about the port being an issue.

So does this mean that the Condor source base itself has the limitation
of one transaction at a time?

Right now, I believe it does. This is often not a problem, and you normally would not see it because in local submissions data transfer is deferred and therefore the operation is very quick. You should try doing a remote condor_submit (use the -r or -s options) twice, simultaneously, with a large input file. I think you'll find that the submits are serialized.

What's happening when condor_q is run multiple times, or run while
something is being submitted: is condor_submit using transactions
internally, and does condor_q block while submits are in progress?

I believe that condor_q will definitely block. I think the same thing will happen if you have a large number of jobs in the queue and you run two condor_q operations (say, start one "condor_q" and then another "condor_q -constraint FALSE").
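The blocking behavior above can be illustrated with a minimal sketch (this is not Condor source, just a model of a single-threaded daemon): requests queue up and are serviced one at a time, so a second query cannot start until the first finishes.

```python
import queue
import threading
import time

# Hypothetical model of a single-threaded daemon like the Schedd:
# one service loop, so concurrent queries are served strictly in order.

def single_threaded_daemon(requests, results):
    """Serve queued requests one after another, recording start/end times."""
    while True:
        name, work_seconds = requests.get()
        if name is None:  # shutdown sentinel
            return
        start = time.monotonic()
        time.sleep(work_seconds)  # simulate scanning a large job queue
        results.append((name, start, time.monotonic()))

requests, results = queue.Queue(), []
daemon = threading.Thread(target=single_threaded_daemon,
                          args=(requests, results))
daemon.start()

# Two condor_q-style queries arrive back to back.
requests.put(("condor_q", 0.05))
requests.put(("condor_q -constraint FALSE", 0.05))
requests.put((None, 0))
daemon.join()

first, second = results
# The second query could not start until the first had finished.
assert second[1] >= first[2]
```

The same model explains the original report: if the "request" being serviced is a long-lived SOAP transaction, everything queued behind it (including a shutdown command) waits.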

Finally, what does this mean for me writing a web service job submitter
and monitor where multiple submitters and monitors will be accessing
the same Condor SOAP server? From my perspective, it means all my
client code has to be aware of the single-transaction limit and has to
retry operations, and be aggressive about asking for long transaction
times (because I'm doing file transfers and there could be network
timeouts, and I don't want to lose an entire job submission and file
transfer transaction just because there was a network dropout), yet be
careful to close down those transactions. If a single client crashes
with a long transaction outstanding, it'll hose all the other clients.

It means that you have to deal with a failure case at BeginTransaction, where the "maximum number of transactions may have been exceeded." I put that in quotes because until some patches go in (in the next few weeks) the max number of transactions is set to 1. This error case is no different than a connection limit on a database -- a case everyone seems to ignore on the web.

As for an aggressive/long transaction timeout, keep in mind that each call you make to the Schedd in a transaction will extend your transaction for the number of seconds you passed to BeginTransaction. So say you call BeginTransaction(30) and 25 seconds later you call NewCluster(), once the Schedd receives NewCluster() you have 30 seconds again before your transaction may be aborted, not 5. This means you can keep a relatively low timeout and still carry out a lot of work, without fear that a faulty client can hose everyone.
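The keep-alive semantics described above can be sketched as follows; the Transaction class and call names (NewCluster, NewJob, and so on) are illustrative stand-ins, not the real Condor SOAP API.

```python
import time

# Hypothetical model of the deadline-extension behavior: every call made
# inside the transaction pushes the deadline out by the FULL timeout
# passed to BeginTransaction, not just the time remaining.

class Transaction:
    def __init__(self, timeout):
        self.timeout = timeout
        self.deadline = time.monotonic() + timeout

    def call(self, name):
        """Any Schedd call inside the transaction renews the deadline."""
        if time.monotonic() > self.deadline:
            raise RuntimeError("transaction aborted")
        self.deadline = time.monotonic() + self.timeout  # full extension
        return name

# A short 0.2s timeout still supports ~0.5s of total work, because each
# call renews the clock before it runs out.
txn = Transaction(timeout=0.2)
completed = []
for step in ["NewCluster", "NewJob", "SendFile", "Submit", "Commit"]:
    time.sleep(0.1)  # stay inside the window between calls
    completed.append(txn.call(step))
```

This is why a low timeout is safe for an active client but still bounds the damage from a crashed one: the deadline only advances while calls keep arriving.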

I hope this helps.