
Re: [HTCondor-users] RuntimeError: Failed to commmit and disconnect from queue.



This has nothing to do with negotiation.

 

You managed to time out the connection to the Schedd before the submit transaction was allowed to complete.

The timeout for holding open a transaction to the schedd without making any forward progress is 20 seconds.

 

So the transaction failed and no jobs were submitted.
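
One way to stay clear of the timeout is to keep only the queue call inside the transaction and do anything slow after the 'with' block has exited. The following is a minimal sketch, not an official recipe: the helper names queue_in_short_transaction and submit_then_wait are hypothetical, and schedd and sub are assumed to be the htcondor.Schedd and htcondor.Submit objects from the original post.

```python
import time


def queue_in_short_transaction(schedd, sub):
    """Queue one submit description and commit right away.

    Assumption: `schedd` behaves like htcondor.Schedd and `sub` like
    htcondor.Submit, as used in the original post. The helper name is
    hypothetical.
    """
    with schedd.transaction() as txn:
        # Only the queue call lives inside the transaction.
        cluster_id = sub.queue(txn)
    # Leaving the `with` block commits the transaction.
    return cluster_id


def submit_then_wait(schedd, sub, wait_seconds):
    """Submit first, then do the slow part outside the transaction."""
    cluster_id = queue_in_short_transaction(schedd, sub)
    # The sleep stands in for any slow work; it runs after the commit,
    # so it cannot stall the open transaction.
    time.sleep(wait_seconds)
    return cluster_id
```

With the objects from the original post, submit_then_wait(schedd, sub, 30) would sleep only after the jobs have been committed to the queue.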

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of don_vanchos
Sent: Wednesday, August 14, 2019 12:52 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] RuntimeError: Failed to commmit and disconnect from queue.

 

Hello,

 

I use Python with HTCondor, and when I submit the following task, everything works fine:

 

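import classad
import htcondor

# Assumed: a local schedd; the original post does not show how schedd was created.
schedd = htcondor.Schedd()

sub = htcondor.Submit({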
"executable": "/bin/echo",
"arguments": "hello_world",
"universe": "vanilla",
"should_transfer_files": "NO",
"transfer_executable": "False",
"output": "stdout.txt",
"initialdir": "/tmpdir",
"run_as_owner": "True",
"+Owner": classad.quote("user"),
})

with schedd.transaction() as schedd_transaction:
    cluster_id = sub.queue(schedd_transaction)

But then I add another line inside the 'with' block (and put 'import time' at the beginning of the file), so the code becomes:

with schedd.transaction() as schedd_transaction:
    cluster_id = sub.queue(schedd_transaction)
    time.sleep(30)

 

This second version does not work. The error is:

        with schedd.transaction() as schedd_transaction:
            cluster_id = sub.queue(schedd_transaction)
>           time.sleep(30)
E           RuntimeError: Failed to commmit and disconnect from queue.

 

The question is, why is this error happening? And how does it relate to the NEGOTIATOR_INTERVAL setting? (A 30-second sleep triggers the error when the setting is 60 (the default), and time.sleep(1) triggers it when NEGOTIATOR_INTERVAL=5.)

 

Log:

08/14/19 17:38:45 (fd:5) (pid:25536) (D_NETWORK) condor_write(fd=4 schedd at <192.168.128.5:9618>,,size=13,timeout=0,flags=0,non_blocking=0)
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4a60 resetting
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4a60 adding fd 4 (socket:[1904821])
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4a60 adding fd 4 (socket:[1904821])
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4a60 adding fd 4 (socket:[1904821])
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4ac0 resetting
08/14/19 17:38:45 (fd:5) (pid:25536) (D_NETWORK) condor_read(fd=4 schedd at <192.168.128.5:9618>,,size=5,timeout=0,flags=0,non_blocking=0)
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4ac0 adding fd 4 (socket:[1904821])
08/14/19 17:38:45 (fd:5) (pid:25536) (D_ALWAYS:2) condor_read(): Socket closed when trying to read 5 bytes from schedd at <192.168.128.5:9618>
08/14/19 17:38:45 (fd:5) (pid:25536) (D_ALWAYS:2) IO: EOF reading packet header
08/14/19 17:38:45 (fd:5) (pid:25536) (D_NETWORK) Stream::get(int) failed to read padding
08/14/19 17:38:45 (fd:5) (pid:25536) (D_NETWORK) condor_write(fd=4 schedd at <192.168.128.5:9618>,,size=13,timeout=0,flags=0,non_blocking=0)
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4a40 resetting
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4a40 adding fd 4 (socket:[1904821])
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4a40 adding fd 4 (socket:[1904821])
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4a40 adding fd 4 (socket:[1904821])
08/14/19 17:38:45 (fd:5) (pid:25536) (D_ALWAYS) condor_write() failed: send() 13 bytes to schedd at <192.168.128.5:9618> returned -1, timeout=0, errno=32 Broken pipe.
08/14/19 17:38:45 (fd:5) (pid:25536) (D_ALWAYS) Buf::write(): condor_write() failed
08/14/19 17:38:45 (fd:5) (pid:25536) (D_NETWORK) condor_write(fd=4 schedd at <192.168.128.5:9618>,,size=13,timeout=0,flags=0,non_blocking=0)
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4b10 resetting
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4b10 adding fd 4 (socket:[1904821])
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4b10 adding fd 4 (socket:[1904821])
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4b10 adding fd 4 (socket:[1904821])
08/14/19 17:38:45 (fd:5) (pid:25536) (D_ALWAYS) condor_write() failed: send() 13 bytes to schedd at <192.168.128.5:9618> returned -1, timeout=0, errno=32 Broken pipe.
08/14/19 17:38:45 (fd:5) (pid:25536) (D_ALWAYS) Buf::write(): condor_write() failed
08/14/19 17:38:45 (fd:5) (pid:25536) (D_NETWORK) CLOSE TCP <192.168.128.2:43967> fd=4

 

--

Sincerely yours,
Ivan Ergunov                                                 mailto:hozblok@xxxxxxxxx