
[Condor-users] Startd couldn't change state to UnClaimed after job finished



Hi all,

	I am testing a job in my test Condor pool, which consists of one computer. After the test job finishes, the Activity of the machine changes to Idle, but the State stays Claimed. The StartLog shows a condor_write() error right after the job ends, and 600 seconds later the schedd sends a release command.
	Thanks.
		
My test job

Universe=parallel
Executable = /bin/hostname
Output=h.out.$(NODE)
Log = h.log
machine_count=1
Queue

	
SchedLog
		
4/3 06:02:56 (pid:7396) Called reschedule_negotiator()
4/3 06:03:01 (pid:7396) Sent ad to central manager for zhaokun@xxxxxxxxxxxx
4/3 06:03:01 (pid:7396) Sent ad to 1 collectors for zhaokun@xxxxxxxxxxxx
4/3 06:03:01 (pid:7396) Inserting new attribute Scheduler into non-active cluster cid=29 acid=-1
4/3 06:03:11 (pid:7396) Negotiating for owner: DedicatedScheduler@xxxxxxxxxxxxxxxxx
4/3 06:03:11 (pid:7396) Out of requests - 1 reqs matched, 0 reqs idle
4/3 06:03:11 (pid:7396) Sent REQUEST_CLAIM to startd mgt1.hotsim.local <172.16.0.1:56568> for DedicatedScheduler
4/3 06:03:11 (pid:7396) Inserting new attribute Scheduler into non-active cluster cid=29 acid=-1
4/3 06:03:11 (pid:7396) Starting add_shadow_birthdate(29.0)
4/3 06:03:11 (pid:7396) Started shadow for job 29.0 on mgt1.hotsim.local <172.16.0.1:56568> for DedicatedScheduler, (shadow pid = 7839)
4/3 06:03:13 (pid:7396) In DedicatedScheduler::reaper pid 7839 has status 25600
4/3 06:03:13 (pid:7396) Shadow pid 7839 exited with status 100
4/3 06:03:13 (pid:7396) DedicatedScheduler::deallocMatchRec
4/3 06:03:13 (pid:7396) DedicatedScheduler::deallocMatchRec
4/3 06:03:31 (pid:7396) Sent owner (0 jobs) ad to 1 collectors
4/3 06:13:13 (pid:7396) Resource mgt1.hotsim.local has been unused for 600 seconds, limit is 600, releasing
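
A side note on the two shadow status lines above: 25600 and 100 look like the same value in two forms, assuming the reaper prints the raw POSIX wait status, where the exit code sits in bits 8 to 15. A minimal Python sketch of that decoding (the function name is made up for illustration, it is not part of Condor):

```python
def exit_code_from_wait_status(status):
    """Return the exit code stored in a raw POSIX-style wait status.

    Assumption: the process exited normally, so the low 7 bits
    (the terminating signal number) are zero.
    """
    if status & 0x7F:
        raise ValueError("process was terminated by a signal, no exit code")
    # The exit code occupies bits 8..15 of the wait status.
    return (status >> 8) & 0xFF


# Value taken from the "reaper pid 7839 has status 25600" line.
print(exit_code_from_wait_status(25600))  # prints 100
```

So the shadow seems to have exited normally with code 100 rather than being killed by a signal.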

StartLog
	
4/3 06:03:11 match_info called
4/3 06:03:11 Received match <172.16.0.1:56568>#1175551221#1#...
4/3 06:03:11 State change: match notification protocol successful
4/3 06:03:11 Changing state: Unclaimed -> Matched
4/3 06:03:11 Request accepted.
4/3 06:03:11 Remote owner is DedicatedScheduler@xxxxxxxxxxxxxxxxx
4/3 06:03:11 State change: claiming protocol successful
4/3 06:03:11 Changing state: Matched -> Claimed
4/3 06:03:12 Got activate_claim request from shadow (<172.16.0.1:55514>)
4/3 06:03:12 Remote job ID is 29.0
4/3 06:03:13 Got universe "PARALLEL" (11) from request classad
4/3 06:03:13 State change: claim-activation protocol successful
4/3 06:03:13 Changing activity: Idle -> Busy
4/3 06:03:13 Called deactivate_claim_forcibly()
4/3 06:03:13 Starter pid 7844 exited with status 0
4/3 06:03:13 State change: starter exited
4/3 06:03:13 Changing activity: Busy -> Idle
4/3 06:03:13 Called deactivate_claim()
4/3 06:03:13 condor_write(): Socket closed when trying to write 56 bytes to <172.16.0.1:59635>, fd is 7
4/3 06:03:13 Buf::write(): condor_write() failed
4/3 06:13:13 State change: received RELEASE_CLAIM command
4/3 06:13:13 Changing state and activity: Claimed/Idle -> Preempting/Vacating
4/3 06:13:13 State change: No preempting claim, returning to owner
4/3 06:13:13 Changing state and activity: Preempting/Vacating -> Owner/Idle
4/3 06:13:13 State change: IS_OWNER is false
4/3 06:13:13 Changing state: Owner -> Unclaimed
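
To check how long the startd sat in Claimed/Idle, the timestamps of the two relevant StartLog lines can be subtracted. This is only a rough sketch: the helper names are my own, and because the log lines carry no year, a placeholder year is assumed for parsing.

```python
from datetime import datetime

ASSUMED_YEAR = 2000  # placeholder only: the log lines do not include a year


def parse_stamp(line):
    """Parse the leading 'M/D HH:MM:SS' timestamp of a Condor log line."""
    month_day, clock = line.split()[:2]
    return datetime.strptime(
        f"{ASSUMED_YEAR} {month_day} {clock}", "%Y %m/%d %H:%M:%S"
    )


def seconds_between(first_line, second_line):
    """Return the whole seconds elapsed between two log lines."""
    delta = parse_stamp(second_line) - parse_stamp(first_line)
    return int(delta.total_seconds())


# Lines copied from the StartLog above.
idle_since = "4/3 06:03:13 Changing activity: Busy -> Idle"
released = "4/3 06:13:13 State change: received RELEASE_CLAIM command"

print(seconds_between(idle_since, released))  # prints 600
```

The result matches the "unused for 600 seconds, limit is 600" message in the SchedLog, so the claim is only given up when that limit expires.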


	Thanks.
      	 Zhaokun
			   Beijing Hotsim Technology Co.,Ltd
			   zhaokun@xxxxxxxxxxxxx
          2009-02-05