Re: [HTCondor-users] Problems with hibernation



Hi Ben,

I've tried both with and without COLLECTOR_PERSISTENT_AD_LOG defined, and the problem occurs in both cases. When it is defined, details about the hibernating machine are successfully added to the file, but the "update_with_ack: Failed to send query EOM to collector host" error still occurs.

Below I show an extract from the CollectorLog containing the error (*) with COLLECTOR_DEBUG set to D_ALL.
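
As a rough sketch, the collector-side settings discussed here would look something like the following. Only the two macro names and the D_ALL value come from this thread; the file path is a placeholder chosen for illustration, not the value actually in use.

```
# Collector configuration sketch (path is hypothetical)

# File in which the collector persists ads for offline/hibernating machines.
# The user running condor_collector must be able to write to it.
COLLECTOR_PERSISTENT_AD_LOG = $(SPOOL)/OfflineLog

# Maximum verbosity for the CollectorLog; very noisy, intended for debugging only.
COLLECTOR_DEBUG = D_ALL
```

After changing these, running condor_reconfig on the collector host should be enough for the new debug level to take effect.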

Thanks,
Andrew.

(*)
02/13/15 18:04:56 (fd:13) (pid:62918) (D_ALWAYS) State = FDS_READY
02/13/15 18:04:56 (fd:13) (pid:62918) (D_ALWAYS) max_fd = 11
02/13/15 18:04:56 (fd:13) (pid:62918) (D_ALWAYS) Selection FD's
02/13/15 18:04:56 (fd:13) (pid:62918) (D_ALWAYS) 	Read {4 6 7 9 10 11 } = 6
02/13/15 18:04:56 (fd:13) (pid:62918) (D_ALWAYS) 	Write {} = 0
02/13/15 18:04:56 (fd:13) (pid:62918) (D_ALWAYS) 	Except {} = 0
02/13/15 18:04:56 (fd:13) (pid:62918) (D_ALWAYS) Ready FD's
02/13/15 18:04:56 (fd:13) (pid:62918) (D_ALWAYS) 	Read {9 } = 1
02/13/15 18:04:56 (fd:13) (pid:62918) (D_ALWAYS) 	Write {} = 0
02/13/15 18:04:56 (fd:13) (pid:62918) (D_ALWAYS) 	Except {} = 0
02/13/15 18:04:56 (fd:13) (pid:62918) (D_ALWAYS) Timeout = 21.000000 seconds
02/13/15 18:04:56 (fd:14) (pid:62918) (D_NETWORK) ACCEPT bound to <a.b.e.f:9618> fd=13 peer=<a.b.c.d:45636>
02/13/15 18:04:56 (fd:14) (pid:62918) (D_NETWORK) condor_read(fd=13 <a.b.c.d:45636>,,size=5,timeout=1,flags=2,non_blocking=0)
02/13/15 18:04:56 (fd:14) (pid:62918) (D_NETWORK) condor_read(fd=13 <a.b.c.d:45636>,,size=5,timeout=0,flags=0,non_blocking=1)
02/13/15 18:04:56 (fd:14) (pid:62918) (D_NETWORK) condor_read(fd=13 <a.b.c.d:45636>,,size=624,timeout=0,flags=0,non_blocking=1)
02/13/15 18:04:56 (fd:14) (pid:62918) (D_SECURITY) DC_AUTHENTICATE: received DC_AUTHENTICATE from <a.b.c.d:45636>
02/13/15 18:04:56 (fd:14) (pid:62918) (D_SECURITY) DC_AUTHENTICATE: resuming session id condor-test01:62918:1423848553:240 with return address <a.b.c.d:47955>:
02/13/15 18:04:56 (fd:14) (pid:62918) (D_SECURITY) SECMAN: other side is $CondorVersion: 8.2.6 Dec 10 2014 BuildID: 287355 $, NOT reauthenticating.
02/13/15 18:04:56 (fd:14) (pid:62918) (D_SECURITY) DC_AUTHENTICATE: message authenticator enabled with key id condor-test01:62918:1423848553:240.
02/13/15 18:04:56 (fd:14) (pid:62918) (D_SECURITY) DC_AUTHENTICATE: encryption enabled for session condor-test01:62918:1423848553:240
02/13/15 18:04:56 (fd:14) (pid:62918) (D_SECURITY) DC_AUTHENTICATE: Success.
02/13/15 18:04:56 (fd:14) (pid:62918) (D_SECURITY) IPVERIFY: matched user condor_pool.domain from a.b.* to allow list
02/13/15 18:04:56 (fd:14) (pid:62918) (D_ALWAYS) PERMISSION GRANTED to condor_pool.domain from host a.b.c.d for command 60 (UPDATE_STARTD_AD_WITH_ACK), access level ADVERTISE_STARTD: reason: ADVERTISE_STARTD authorization policy allows IP address a.b.c.d; identifiers used for this remote host: a.b.c.d,wn-test04.domain
02/13/15 18:04:56 (fd:14) (pid:62918) (D_COMMAND) Received TCP command 60 (UPDATE_STARTD_AD_WITH_ACK) from condor_pool.domain <a.b.c.d:45636>, access level ADVERTISE_STARTD
02/13/15 18:04:56 (fd:14) (pid:62918) (D_NETWORK) condor_read(fd=13 <a.b.c.d:45636>,,size=21,timeout=20,flags=0,non_blocking=1)
02/13/15 18:04:56 (fd:14) (pid:62918) (D_NETWORK) condor_read(fd=13 <a.b.c.d:45636>,,size=4075,timeout=20,flags=0,non_blocking=1)
02/13/15 18:04:56 (fd:14) (pid:62918) (D_DAEMONCORE) In DaemonCore Timeout()
02/13/15 18:04:56 (fd:14) (pid:62918) (D_DAEMONCORE) DaemonCore Timeout() Complete, returning 19 
02/13/15 18:04:56 (fd:14) (pid:62918) (D_ALWAYS) PERF: entering select
02/13/15 18:04:56 (fd:14) (pid:62918) (D_ALWAYS) PERF: leaving select
02/13/15 18:04:56 (fd:14) (pid:62918) (D_ALWAYS) State = FDS_READY
02/13/15 18:04:56 (fd:14) (pid:62918) (D_ALWAYS) max_fd = 13
02/13/15 18:04:56 (fd:14) (pid:62918) (D_ALWAYS) Selection FD's
02/13/15 18:04:56 (fd:14) (pid:62918) (D_ALWAYS) 	Read {4 6 7 9 10 11 13 } = 7
02/13/15 18:04:56 (fd:14) (pid:62918) (D_ALWAYS) 	Write {} = 0
02/13/15 18:04:56 (fd:14) (pid:62918) (D_ALWAYS) 	Except {} = 0
02/13/15 18:04:56 (fd:14) (pid:62918) (D_ALWAYS) Ready FD's
02/13/15 18:04:56 (fd:14) (pid:62918) (D_ALWAYS) 	Read {13 } = 1
02/13/15 18:04:56 (fd:14) (pid:62918) (D_ALWAYS) 	Write {} = 0
02/13/15 18:04:56 (fd:14) (pid:62918) (D_ALWAYS) 	Except {} = 0
02/13/15 18:04:56 (fd:14) (pid:62918) (D_ALWAYS) Timeout = 19.000000 seconds
02/13/15 18:04:56 (fd:14) (pid:62918) (D_DAEMONCORE) Calling Handler <DaemonCore::HandleReqPayloadReady> for Socket <Waiting for command 60 payload>
02/13/15 18:04:56 (fd:14) (pid:62918) (D_COMMAND) Calling Handler <DaemonCore::HandleReqPayloadReady> (4)
02/13/15 18:04:56 (fd:14) (pid:62918) (D_DAEMONCORE) Cancel_Socket: cancelled socket 4 <Waiting for command 60 payload> 0x24bffc0
02/13/15 18:04:56 (fd:14) (pid:62918) (D_COMMAND) Calling HandleReq <receive_update_expect_ack> (0) for command 60 (UPDATE_STARTD_AD_WITH_ACK) from condor_pool.domain <a.b.c.d:45636>
02/13/15 18:04:56 (fd:14) (pid:62918) (D_NETWORK) condor_read(fd=13 <a.b.c.d:45636>,,size=21,timeout=1,flags=0,non_blocking=0)
02/13/15 18:04:56 (fd:14) (pid:62918) (D_NETWORK) condor_read(fd=13 <a.b.c.d:45636>,,size=2308,timeout=1,flags=0,non_blocking=0)
02/13/15 18:04:56 (fd:14) (pid:62918) (D_ALWAYS) Want private ads, but no socket given!
02/13/15 18:04:56 (fd:14) (pid:62918) (D_NETWORK) condor_write(fd=13 <a.b.c.d:45636>,,size=29,timeout=5,flags=0,non_blocking=0)
02/13/15 18:04:56 (fd:14) (pid:62918) (D_ALWAYS) Added ad to persistent store key=<slot1@xxxxxxxxxxxxxxxx,a.b.c.d>
02/13/15 18:04:56 (fd:14) (pid:62918) (D_COMMAND) Return from HandleReq <receive_update_expect_ack> (handler: 0.052006s, sec: 0.008s, payload: 0.001s)
02/13/15 18:04:56 (fd:14) (pid:62918) (D_NETWORK) CLOSE <a.b.e.f:9618> fd=13
02/13/15 18:04:56 (fd:13) (pid:62918) (D_COMMAND) Return from Handler <DaemonCore::HandleReqPayloadReady> 0.052220s
02/13/15 18:04:56 (fd:13) (pid:62918) (D_DAEMONCORE) In DaemonCore Timeout()
02/13/15 18:04:56 (fd:13) (pid:62918) (D_DAEMONCORE) DaemonCore Timeout() Complete, returning 19 
02/13/15 18:04:56 (fd:13) (pid:62918) (D_ALWAYS) PERF: entering select


________________________________________
From: HTCondor-users [htcondor-users-bounces@xxxxxxxxxxx] on behalf of Ben Cotton [ben.cotton@xxxxxxxxxxxxxxxxxx]
Sent: Friday, February 13, 2015 5:03 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Problems with hibernation

On Fri, Feb 13, 2015 at 11:29 AM,  <andrew.lahiff@xxxxxxxxxx> wrote:
> Has anyone seen this before or know what's going on?

I have not, unfortunately. I would expect a different failure mode if
it were the case, but in the interests of covering the obvious
questions, do you have COLLECTOR_PERSISTENT_AD_LOG set on your
collector and can the user running condor_collector write to it?

If you set COLLECTOR_DEBUG to be chattier (I'm not sure what you're at
right now), does it give any extra clues?


Thanks,
BC

--
Ben Cotton
main: 888.292.5320

Cycle Computing
Better Answers. Faster.

http://www.cyclecomputing.com
twitter: @cyclecomputing