
Re: [HTCondor-users] RuntimeError: Failed to commmit and disconnect from queue.



Hi John,

Thank you for your answer! Is there any setting to manage this timeout (20 sec)?


On Wed, 14 Aug 2019 at 21:25, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:

This has nothing to do with negotiation.

You managed to time out the connection to the Schedd before the submit transaction was allowed to complete.

The timeout for holding open a transaction to the schedd without making any forward progress is 20 seconds.

So the transaction failed and no jobs were submitted.
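
One way to stay under that limit is to keep slow work out of the transaction block entirely. The following is only a minimal sketch of that idea, not an official recipe: `submit_quickly`, `before`, and `after` are hypothetical names, and `schedd` and `sub` are assumed to behave like the `schedd` and `htcondor.Submit` objects in the original message quoted further down.

```python
import time


def submit_quickly(schedd, sub, before=None, after=None):
    """Queue `sub`, keeping the schedd transaction open as briefly as possible.

    `submit_quickly`, `before`, and `after` are hypothetical names for this sketch.
    """
    if before is not None:
        before()  # slow preparation runs before the transaction opens

    opened = time.monotonic()
    with schedd.transaction() as txn:
        # Only the queue call lives inside the block, so the connection
        # is not left idle while unrelated work runs.
        cluster_id = sub.queue(txn)
    held = time.monotonic() - opened  # includes the commit on block exit

    if after is not None:
        after(cluster_id)  # slow follow-up work runs after the commit

    return cluster_id, held
```

Applied to the failing example, the `time.sleep(30)` call would move into `after`, for instance `submit_quickly(schedd, sub, after=lambda cid: time.sleep(30))`, so the sleep happens only once the transaction has been committed. The returned `held` value can be compared against the 20-second limit when diagnosing slow submissions.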

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of don_vanchos
Sent: Wednesday, August 14, 2019 12:52 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] RuntimeError: Failed to commmit and disconnect from queue.

Hello,

I use Python with HTCondor, and when I submit the following task, everything works fine:

import classad
import htcondor

sub = htcondor.Submit({
    "executable": "/bin/echo",
    "arguments": "hello_world",
    "universe": "vanilla",
    "should_transfer_files": "NO",
    "transfer_executable": "False",
    "output": "stdout.txt",
    "initialdir": "/tmpdir",
    "run_as_owner": "True",
    "+Owner": classad.quote("user"),
})

with schedd.transaction() as schedd_transaction:
  cluster_id = sub.queue(schedd_transaction)

But then I add another line inside the 'with' block (and put 'import time' at the beginning of the file). The result is the following:

with schedd.transaction() as schedd_transaction:
  cluster_id = sub.queue(schedd_transaction)
  time.sleep(30)

This second version does not work (and yes, 'import time' is at the beginning of the file). The error is:

    with schedd.transaction() as schedd_transaction:
      cluster_id = sub.queue(schedd_transaction)
>     time.sleep(30)
E     RuntimeError: Failed to commmit and disconnect from queue.

The question is: why does this error happen, and how does it relate to the NEGOTIATOR_INTERVAL setting? (A 30-second sleep triggers the error when the setting is 60 (the default), and time.sleep(1) triggers it when NEGOTIATOR_INTERVAL=5.)

Log:

08/14/19 17:38:45 (fd:5) (pid:25536) (D_NETWORK) condor_write(fd=4 schedd at <192.168.128.5:9618>,,size=13,timeout=0,flags=0,non_blocking=0)
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4a60 resetting
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4a60 adding fd 4 (socket:[1904821])
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4a60 adding fd 4 (socket:[1904821])
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4a60 adding fd 4 (socket:[1904821])
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4ac0 resetting
08/14/19 17:38:45 (fd:5) (pid:25536) (D_NETWORK) condor_read(fd=4 schedd at <192.168.128.5:9618>,,size=5,timeout=0,flags=0,non_blocking=0)
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4ac0 adding fd 4 (socket:[1904821])
08/14/19 17:38:45 (fd:5) (pid:25536) (D_ALWAYS:2) condor_read(): Socket closed when trying to read 5 bytes from schedd at <192.168.128.5:9618>
08/14/19 17:38:45 (fd:5) (pid:25536) (D_ALWAYS:2) IO: EOF reading packet header
08/14/19 17:38:45 (fd:5) (pid:25536) (D_NETWORK) Stream::get(int) failed to read padding
08/14/19 17:38:45 (fd:5) (pid:25536) (D_NETWORK) condor_write(fd=4 schedd at <192.168.128.5:9618>,,size=13,timeout=0,flags=0,non_blocking=0)
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4a40 resetting
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4a40 adding fd 4 (socket:[1904821])
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4a40 adding fd 4 (socket:[1904821])
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4a40 adding fd 4 (socket:[1904821])
08/14/19 17:38:45 (fd:5) (pid:25536) (D_ALWAYS) condor_write() failed: send() 13 bytes to schedd at <192.168.128.5:9618> returned -1, timeout=0, errno=32 Broken pipe.
08/14/19 17:38:45 (fd:5) (pid:25536) (D_ALWAYS) Buf::write(): condor_write() failed
08/14/19 17:38:45 (fd:5) (pid:25536) (D_NETWORK) condor_write(fd=4 schedd at <192.168.128.5:9618>,,size=13,timeout=0,flags=0,non_blocking=0)
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4b10 resetting
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4b10 adding fd 4 (socket:[1904821])
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4b10 adding fd 4 (socket:[1904821])
08/14/19 17:38:45 (fd:5) (pid:25536) (D_DAEMONCORE:2) selector 0x7ffdfddc4b10 adding fd 4 (socket:[1904821])
08/14/19 17:38:45 (fd:5) (pid:25536) (D_ALWAYS) condor_write() failed: send() 13 bytes to schedd at <192.168.128.5:9618> returned -1, timeout=0, errno=32 Broken pipe.
08/14/19 17:38:45 (fd:5) (pid:25536) (D_ALWAYS) Buf::write(): condor_write() failed
08/14/19 17:38:45 (fd:5) (pid:25536) (D_NETWORK) CLOSE TCP <192.168.128.2:43967> fd=4
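
The pattern in this log (an EOF on read, followed by writes failing with errno=32 Broken pipe) is what a client typically sees when the peer has already closed the connection. As a rough illustration only, the sketch below reproduces that sequence with a local socket pair instead of a real schedd; `read_then_write_after_peer_close` is a hypothetical helper, and the behavior shown assumes a POSIX system.

```python
import errno
import socket


def read_then_write_after_peer_close():
    """Return (bytes_read, errno_name) observed after the peer closes its end."""
    client, server = socket.socketpair()
    try:
        server.close()  # the peer drops the connection first

        # Reading now hits end-of-file: recv() returns b"" instead of 5 bytes,
        # comparable to "Socket closed when trying to read 5 bytes".
        data = client.recv(5)

        try:
            client.send(b"x" * 13)  # writing to a peer that is gone
            send_errno = None
        except OSError as exc:
            send_errno = exc.errno

        # Map the numeric errno to its symbolic name (None if send succeeded).
        return data, errno.errorcode.get(send_errno)
    finally:
        client.close()
```

On Linux the call is expected to return an empty byte string and the name `EPIPE`, which is errno 32, the same "Broken pipe" reported by `condor_write()` above.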

--

Sincerely yours,
Ivan Ergunov                         mailto:hozblok@xxxxxxxxx
