
Re: [HTCondor-users] Failed to Connect



Hi Jaime.

192.168.1.206 is the master machine that I'm using to start the Condor jobs, so it should be able to ping itself.

Once the error has appeared for the first time, it occurs every time. If I restart Condor, the error still occurs. If I reboot 192.168.1.206, Condor runs again. I need to check other Condor commands the next time it happens, but as I remember, condor_q didn't work either.


--
Kind regards,

Justin Fisher.

On Wed, Aug 2, 2017 at 10:07 PM, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:
On Jul 28, 2017, at 4:27 AM, Justin Fisher <justin0419@xxxxxxxxx> wrote:

I occasionally get this error. 192.168.1.206 is the machine I use to submit the jobs. I think it's some kind of network issue, but I'm not sure. My workaround is to reboot the submit machine, but is there a less drastic method?

I can ping all the other machines on the network and the NFS shares needed for Condor are all there.

ERROR: Failed to connect to local queue manager

This looks like an error message that condor_submit prints.
When this error occurs, does it happen every time, or does condor_submit still work sometimes? Do other commands that talk to the schedd (e.g. condor_q, condor_rm) also fail?

You say you can ping all of the other machines on the network. Can you ping this machine (192.168.1.206) when the errors occur? If the machine is otherwise healthy, you can try restarting just the HTCondor daemons.
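The checks above could be scripted so they are ready to run the next time the error appears. What follows is only a rough sketch, not an official HTCondor tool: the helper names (`run`, `diagnose`, `report`) are made up for illustration, the `-c` flag assumes a Linux or macOS `ping`, and it assumes `condor_q` and `condor_status` are on the PATH of the submit machine.

```python
import shutil
import subprocess

SUBMIT_HOST = "192.168.1.206"  # submit machine address taken from this thread


def run(cmd, timeout=30):
    """Run a command and return (ok, output).

    ok is False if the tool is missing, exits non-zero, or times out.
    """
    if shutil.which(cmd[0]) is None:
        return False, f"{cmd[0]}: not found on PATH"
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False, "timed out"
    return proc.returncode == 0, (proc.stdout + proc.stderr).strip()


def diagnose(runner=run, host=SUBMIT_HOST):
    """Run each check and return {check name: (ok, output)}."""
    checks = {
        # Assumption: Linux/macOS ping, where -c sets the packet count.
        "ping": ["ping", "-c", "1", host],
        # Talks to the local schedd, like condor_submit does.
        "condor_q": ["condor_q"],
        # Asks the collector which schedds it currently knows about.
        "condor_status_schedd": ["condor_status", "-schedd"],
    }
    return {name: runner(cmd) for name, cmd in checks.items()}


def report(results):
    """Format the results as one line per check, plus output for failures."""
    lines = []
    for name, (ok, output) in results.items():
        lines.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok and output:
            lines.append(f"  {output}")
    return "\n".join(lines)
```

Running `print(report(diagnose()))` on the submit machine while the error is occurring would show whether only the schedd queries fail or the host itself is unreachable.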

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@cs.wisc.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/