
Re: [Condor-users] Startd couldn't change state to UnClaimed after jobfinished



Dear all,

	I tested a vanilla job in this pool, and there was no error. The difference shows in the StartLog. After the vanilla job finished, the startd logged only "Called deactivate_claim_forcibly()" and then successfully "received RELEASE_CLAIM command". After the parallel job finished, it first logged "Called deactivate_claim_forcibly()", then "Called deactivate_claim()", and then an error occurred: "condor_write(): Socket closed when trying to write 56 bytes to <172.16.0.1:59635>, fd is 7".

	How can I solve this?
	
	Any help would be appreciated, thanks.
	
	my job
	
Executable = /bin/hostname
Universe=vanilla
Log = s1.log
Output = s1.out
Queue

	
	StartLog
	
4/3 18:08:41 slot1: Got universe "VANILLA" (5) from request classad
4/3 18:08:41 slot1: State change: claim-activation protocol successful
4/3 18:08:41 slot1: Changing activity: Idle -> Busy
4/3 18:08:42 slot1: Called deactivate_claim_forcibly()
4/3 18:08:42 Starter pid 9618 exited with status 0
4/3 18:08:42 slot1: State change: starter exited
4/3 18:08:42 slot1: Changing activity: Busy -> Idle
4/3 18:08:42 slot1: State change: received RELEASE_CLAIM command
4/3 18:08:42 slot1: Changing state and activity: Claimed/Idle -> Preempting/Vacating
4/3 18:08:42 slot1: State change: No preempting claim, returning to owner
4/3 18:08:42 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle
4/3 18:08:42 slot1: State change: IS_OWNER is false
4/3 18:08:42 slot1: Changing state: Owner -> Unclaimed
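
For anyone who wants to compare the two StartLogs mechanically, here is a rough Python sketch. It is only an illustration, not a Condor tool: the function names (parse_startlog, release_delay, has_write_error) are made up for this example, the log text is copied from the excerpts in this thread, and the year is a placeholder because the log lines do not carry one. It measures how long the slot sat in Claimed/Idle before RELEASE_CLAIM arrived and checks for the condor_write() error.

```python
import re
from datetime import datetime

# Excerpts copied from the two StartLogs quoted in this thread.
VANILLA_LOG = """\
4/3 18:08:42 slot1: Called deactivate_claim_forcibly()
4/3 18:08:42 Starter pid 9618 exited with status 0
4/3 18:08:42 slot1: State change: starter exited
4/3 18:08:42 slot1: Changing activity: Busy -> Idle
4/3 18:08:42 slot1: State change: received RELEASE_CLAIM command
"""

PARALLEL_LOG = """\
4/3 06:03:13 Changing activity: Busy -> Idle
4/3 06:03:13 Called deactivate_claim()
4/3 06:03:13 condor_write(): Socket closed when trying to write 56 bytes to <172.16.0.1:59635>, fd is 7
4/3 06:03:13 Buf::write(): condor_write() failed
4/3 06:13:13 State change: received RELEASE_CLAIM command
"""

# Timestamp, an optional "slotN: " prefix, then the message text.
LINE_RE = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) (?:slot\d+: )?(.*)$")

# Assumption: the log lines carry no year, so a placeholder is used for parsing.
PLACEHOLDER_YEAR = 2000


def parse_startlog(text):
    """Return a list of (timestamp, message) pairs for lines that match LINE_RE."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip blank or unrecognized lines
        stamp = datetime.strptime(
            f"{PLACEHOLDER_YEAR}/{match.group(1)}", "%Y/%m/%d %H:%M:%S"
        )
        events.append((stamp, match.group(2)))
    return events


def release_delay(events):
    """Seconds between the last Busy -> Idle change and the next RELEASE_CLAIM."""
    idle_at = None
    for stamp, message in events:
        if message == "Changing activity: Busy -> Idle":
            idle_at = stamp
        elif "received RELEASE_CLAIM" in message and idle_at is not None:
            return (stamp - idle_at).total_seconds()
    return None  # no release seen after the slot went idle


def has_write_error(events):
    """True if any message reports a condor_write() failure."""
    return any(message.startswith("condor_write():") for _, message in events)


if __name__ == "__main__":
    for name, text in (("vanilla", VANILLA_LOG), ("parallel", PARALLEL_LOG)):
        events = parse_startlog(text)
        print(name, release_delay(events), has_write_error(events))
```

On these excerpts the vanilla job is released in the same second it goes idle, while the parallel job waits 600 seconds and is the only one with the condor_write() error.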

	
	


			

	Thanks.
      	 Zhaokun
			   Beijing Hotsim Technology Co.,Ltd
			   zhaokun@xxxxxxxxxxxxx
          2009-02-06
=======From 2009-02-05 15:41:39 =======

>Hi all,
>
>	I ran a job in my test Condor pool (1 computer). After the test job finished, the machine's Activity changed to Idle, but its State stayed Claimed. The StartLog shows a condor_write error, and only after 600 seconds did the schedd send a release command.
>	Thanks.
>		
>my test job
>
>Universe=parallel
>Executable = /bin/hostname
>Output=h.out.$(NODE)
>Log = h.log
>machine_count=1
>Queue
>
>	
>SchedLog
>		
>4/3 06:02:56 (pid:7396) Called reschedule_negotiator()
>4/3 06:03:01 (pid:7396) Sent ad to central manager for zhaokun@xxxxxxxxxxxx
>4/3 06:03:01 (pid:7396) Sent ad to 1 collectors for zhaokun@xxxxxxxxxxxx
>4/3 06:03:01 (pid:7396) Inserting new attribute Scheduler into non-active cluster cid=29 acid=-1
>4/3 06:03:11 (pid:7396) Negotiating for owner: DedicatedScheduler@xxxxxxxxxxxxxxxxx
>4/3 06:03:11 (pid:7396) Out of requests - 1 reqs matched, 0 reqs idle
>4/3 06:03:11 (pid:7396) Sent REQUEST_CLAIM to startd mgt1.hotsim.local <172.16.0.1:56568> for DedicatedScheduler
>4/3 06:03:11 (pid:7396) Inserting new attribute Scheduler into non-active cluster cid=29 acid=-1
>4/3 06:03:11 (pid:7396) Starting add_shadow_birthdate(29.0)
>4/3 06:03:11 (pid:7396) Started shadow for job 29.0 on mgt1.hotsim.local <172.16.0.1:56568> for DedicatedScheduler, (shadow pid = 7839)
>4/3 06:03:13 (pid:7396) In DedicatedScheduler::reaper pid 7839 has status 25600
>4/3 06:03:13 (pid:7396) Shadow pid 7839 exited with status 100
>4/3 06:03:13 (pid:7396) DedicatedScheduler::deallocMatchRec
>4/3 06:03:13 (pid:7396) DedicatedScheduler::deallocMatchRec
>4/3 06:03:31 (pid:7396) Sent owner (0 jobs) ad to 1 collectors
>4/3 06:13:13 (pid:7396) Resource mgt1.hotsim.local has been unused for 600 seconds, limit is 600, releasing
>
>StartLog
>	
>4/3 06:03:11 match_info called
>4/3 06:03:11 Received match <172.16.0.1:56568>#1175551221#1#...
>4/3 06:03:11 State change: match notification protocol successful
>4/3 06:03:11 Changing state: Unclaimed -> Matched
>4/3 06:03:11 Request accepted.
>4/3 06:03:11 Remote owner is DedicatedScheduler@xxxxxxxxxxxxxxxxx
>4/3 06:03:11 State change: claiming protocol successful
>4/3 06:03:11 Changing state: Matched -> Claimed
>4/3 06:03:12 Got activate_claim request from shadow (<172.16.0.1:55514>)
>4/3 06:03:12 Remote job ID is 29.0
>4/3 06:03:13 Got universe "PARALLEL" (11) from request classad
>4/3 06:03:13 State change: claim-activation protocol successful
>4/3 06:03:13 Changing activity: Idle -> Busy
>4/3 06:03:13 Called deactivate_claim_forcibly()
>4/3 06:03:13 Starter pid 7844 exited with status 0
>4/3 06:03:13 State change: starter exited
>4/3 06:03:13 Changing activity: Busy -> Idle
>4/3 06:03:13 Called deactivate_claim()
>4/3 06:03:13 condor_write(): Socket closed when trying to write 56 bytes to <172.16.0.1:59635>, fd is 7
>4/3 06:03:13 Buf::write(): condor_write() failed
>4/3 06:13:13 State change: received RELEASE_CLAIM command
>4/3 06:13:13 Changing state and activity: Claimed/Idle -> Preempting/Vacating
>4/3 06:13:13 State change: No preempting claim, returning to owner
>4/3 06:13:13 Changing state and activity: Preempting/Vacating -> Owner/Idle
>4/3 06:13:13 State change: IS_OWNER is false
>4/3 06:13:13 Changing state: Owner -> Unclaimed
>
>
>	Thanks.
>      	 Zhaokun
>			   Beijing Hotsim Technology Co.,Ltd
>			   zhaokun@xxxxxxxxxxxxx
>          2009-02-05
