Re: [Condor-users] Startd couldn't change state to UnClaimed after job finished

Zhaokun,

Did you find a resolution to this issue? I am having the same problem:
parallel job runs and exits fine, startd says condor_write() failed,
and state remains Claimed until schedd releases it after 600 seconds.
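
To put a number on that delay, a StartLog can be scanned for the time between the slot going idle and the RELEASE_CLAIM arriving. The following is only a minimal sketch, not part of Condor: the function names, the regular expression, and the fixed year are assumptions made for illustration (the log timestamps in this thread carry no year), and the message strings are the ones that appear in the StartLog excerpts quoted below.

```python
import re
from datetime import datetime

# Assumption: the StartLog timestamps have no year, so a fixed leap year is
# prepended purely so that differences between timestamps can be computed.
ASSUMED_YEAR = 2000

# Matches lines such as "4/3 06:03:13 slot1: Changing activity: Busy -> Idle".
# Leading ">" quote marks and whitespace are tolerated so that lines pasted
# from a mail thread still parse; the "slotN: " prefix is optional.
LINE_RE = re.compile(
    r"^[>\s]*(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) (?:slot\d+: )?(.*)$"
)


def parse_line(line):
    """Return (timestamp, message), or None if the line is not a log entry."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    stamp = datetime.strptime(
        f"{ASSUMED_YEAR}/{match.group(1)}", "%Y/%m/%d %H:%M:%S"
    )
    return stamp, match.group(2).strip()


def release_delays(lines):
    """Seconds from each 'Busy -> Idle' change to the next RELEASE_CLAIM."""
    delays = []
    idle_since = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, message = parsed
        if message == "Changing activity: Busy -> Idle":
            idle_since = stamp
        elif "received RELEASE_CLAIM command" in message and idle_since is not None:
            delays.append((stamp - idle_since).total_seconds())
            idle_since = None
    return delays


if __name__ == "__main__":
    sample = [
        "4/3 06:03:13 Changing activity: Busy -> Idle",
        "4/3 06:03:13 Called deactivate_claim()",
        "4/3 06:13:13 State change: received RELEASE_CLAIM command",
    ]
    print(release_delays(sample))
```

A delay near zero would correspond to the vanilla-universe behaviour, while a delay of about 600 seconds would correspond to the parallel-universe symptom described here.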

Gideon

2009/2/8 zhaokun <zhaokun@xxxxxxxxxxxxx>:
> Dear all,
>
>        Nobody has replied to my question. Is anybody receiving my mails?
>
>        Thanks.
>        Zhaokun
>                           Beijing Hotsim Technology Co.,Ltd
>                           zhaokun@xxxxxxxxxxxxx
>           2009-02-09
> =======From 2009-02-06 11:50:18 =======
>
>>Dear all,
>>
>>       I tested a vanilla job in this pool, and it finished without any error.
>>       The StartLog differs between the two cases. After the vanilla job
>>       finished, the startd logged only "Called deactivate_claim_forcibly()"
>>       and then successfully "received RELEASE_CLAIM command". After the
>>       parallel job finished, it logged "Called deactivate_claim_forcibly()"
>>       first, then "Called deactivate_claim()", and then an error occurred:
>>       "condor_write(): Socket closed when trying to write 56 bytes to <172.16.0.1:59635>, fd is 7".
>>
>>       How can I solve this?
>>
>>       Any help is appreciated, thanks.
>>
>>       my job
>>
>>Executable = /bin/hostname
>>Universe=vanilla
>>Log = s1.log
>>Output = s1.out
>>Queue
>>
>>
>>       StartLog
>>
>>4/3 18:08:41 slot1: Got universe "VANILLA" (5) from request classad
>>4/3 18:08:41 slot1: State change: claim-activation protocol successful
>>4/3 18:08:41 slot1: Changing activity: Idle -> Busy
>>4/3 18:08:42 slot1: Called deactivate_claim_forcibly()
>>4/3 18:08:42 Starter pid 9618 exited with status 0
>>4/3 18:08:42 slot1: State change: starter exited
>>4/3 18:08:42 slot1: Changing activity: Busy -> Idle
>>4/3 18:08:42 slot1: State change: received RELEASE_CLAIM command
>>4/3 18:08:42 slot1: Changing state and activity: Claimed/Idle -> Preempting/Vacating
>>4/3 18:08:42 slot1: State change: No preempting claim, returning to owner
>>4/3 18:08:42 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle
>>4/3 18:08:42 slot1: State change: IS_OWNER is false
>>4/3 18:08:42 slot1: Changing state: Owner -> Unclaimed
>>
>>       Thanks.
>>       Zhaokun
>>                          Beijing Hotsim Technology Co.,Ltd
>>                          zhaokun@xxxxxxxxxxxxx
>>          2009-02-06
>>=======From 2009-02-05 15:41:39 =======
>>
>>>hi all,
>>>
>>>      I ran a test job in my test Condor pool, which consists of one computer. After the test job finished, the machine's Activity changed to Idle, but its State remained Claimed. The StartLog shows a condor_write() error, and the schedd sent a release command only after 600 seconds.
>>>      Thanks.
>>>
>>>my test job
>>>
>>>Universe=parallel
>>>Executable = /bin/hostname
>>>Output=h.out.$(NODE)
>>>Log = h.log
>>>machine_count=1
>>>Queue
>>>
>>>
>>>SchedLog
>>>
>>>4/3 06:02:56 (pid:7396) Called reschedule_negotiator()
>>>4/3 06:03:01 (pid:7396) Sent ad to central manager for zhaokun@xxxxxxxxxxxx
>>>4/3 06:03:01 (pid:7396) Sent ad to 1 collectors for zhaokun@xxxxxxxxxxxx
>>>4/3 06:03:01 (pid:7396) Inserting new attribute Scheduler into non-active cluster cid=29 acid=-1
>>>4/3 06:03:11 (pid:7396) Negotiating for owner: DedicatedScheduler@xxxxxxxxxxxxxxxxx
>>>4/3 06:03:11 (pid:7396) Out of requests - 1 reqs matched, 0 reqs idle
>>>4/3 06:03:11 (pid:7396) Sent REQUEST_CLAIM to startd mgt1.hotsim.local <172.16.0.1:56568> for DedicatedScheduler
>>>4/3 06:03:11 (pid:7396) Inserting new attribute Scheduler into non-active cluster cid=29 acid=-1
>>>4/3 06:03:11 (pid:7396) Starting add_shadow_birthdate(29.0)
>>>4/3 06:03:11 (pid:7396) Started shadow for job 29.0 on mgt1.hotsim.local <172.16.0.1:56568> for DedicatedScheduler, (shadow pid = 7839)
>>>4/3 06:03:13 (pid:7396) In DedicatedScheduler::reaper pid 7839 has status 25600
>>>4/3 06:03:13 (pid:7396) Shadow pid 7839 exited with status 100
>>>4/3 06:03:13 (pid:7396) DedicatedScheduler::deallocMatchRec
>>>4/3 06:03:13 (pid:7396) DedicatedScheduler::deallocMatchRec
>>>4/3 06:03:31 (pid:7396) Sent owner (0 jobs) ad to 1 collectors
>>>4/3 06:13:13 (pid:7396) Resource mgt1.hotsim.local has been unused for 600 seconds, limit is 600, releasing
>>>
>>>StartLog
>>>
>>>4/3 06:03:11 match_info called
>>>4/3 06:03:11 Received match <172.16.0.1:56568>#1175551221#1#...
>>>4/3 06:03:11 State change: match notification protocol successful
>>>4/3 06:03:11 Changing state: Unclaimed -> Matched
>>>4/3 06:03:11 Request accepted.
>>>4/3 06:03:11 Remote owner is DedicatedScheduler@xxxxxxxxxxxxxxxxx
>>>4/3 06:03:11 State change: claiming protocol successful
>>>4/3 06:03:11 Changing state: Matched -> Claimed
>>>4/3 06:03:12 Got activate_claim request from shadow (<172.16.0.1:55514>)
>>>4/3 06:03:12 Remote job ID is 29.0
>>>4/3 06:03:13 Got universe "PARALLEL" (11) from request classad
>>>4/3 06:03:13 State change: claim-activation protocol successful
>>>4/3 06:03:13 Changing activity: Idle -> Busy
>>>4/3 06:03:13 Called deactivate_claim_forcibly()
>>>4/3 06:03:13 Starter pid 7844 exited with status 0
>>>4/3 06:03:13 State change: starter exited
>>>4/3 06:03:13 Changing activity: Busy -> Idle
>>>4/3 06:03:13 Called deactivate_claim()
>>>4/3 06:03:13 condor_write(): Socket closed when trying to write 56 bytes to <172.16.0.1:59635>, fd is 7
>>>4/3 06:03:13 Buf::write(): condor_write() failed
>>>4/3 06:13:13 State change: received RELEASE_CLAIM command
>>>4/3 06:13:13 Changing state and activity: Claimed/Idle -> Preempting/Vacating
>>>4/3 06:13:13 State change: No preempting claim, returning to owner
>>>4/3 06:13:13 Changing state and activity: Preempting/Vacating -> Owner/Idle
>>>4/3 06:13:13 State change: IS_OWNER is false
>>>4/3 06:13:13 Changing state: Owner -> Unclaimed
>>>
>>>
>>>      Thanks.
>>>              Zhaokun
>>>                         Beijing Hotsim Technology Co.,Ltd
>>>                         zhaokun@xxxxxxxxxxxxx
>>>          2009-02-05
>>>_______________________________________________
>>>Condor-users mailing list
>>>To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>>>subject: Unsubscribe
>>>You can also unsubscribe by visiting
>>>https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>>
>>>The archives can be found at:
>>>https://lists.cs.wisc.edu/archive/condor-users/
>>
>