
Re: [Condor-users] Job went to X state without issuing condor_rm.



Is it possible that the job actually completed right before the hold took effect? You could check the job ad to see if it has all the indicators of a completed job, e.g. JobStatus = 4, a CompletionDate, and ExitCode/ExitStatus/ExitBySignal.
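
For example (a quick sketch from memory, so adjust the attribute list as needed), something like

  condor_q -long 514.0 | egrep 'JobStatus|CompletionDate|ExitCode|ExitStatus|ExitBySignal'

or, if the job has already left the queue,

  condor_history -l 514.0

should tell you whether it looks like a normal completion (JobStatus = 4 is Completed, 3 is Removed).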

IIRC, Jaime added code that ensures that if a job is completed/removed and it gets held, it goes back into completed/removed instead of idle.

Is this an ongoing problem or just a strange anomaly?

Best,


matt

Johnson koil Raj wrote:
> Hi Matt,
>         No, it is not in that state. condor_rm was not issued; we only
> issued condor_release. The schedd log shows that clearly, but the job
> still went into the X state.
> 
> 8/9 21:03:03 (pid:5215) abort_job_myself: 514.0 action:Hold
> log_hold:true notify:true
> 8/9 21:03:03 (pid:5215) Found shadow record for job 514.0, host =
> <192.168.10.92:9620>
> 8/9 21:03:14 (pid:5215) No HoldReasonSubCode found for job 514.0  
> We have faced the same issue a few times before.
> 
> thanks
> Johnson
> 
> Matthew Farrellee wrote:
>> Is the job X because of LeaveJobInQueue?
>>
>> http://www.cs.wisc.edu/condor/manual/v7.3/4_5Application_Program.html#SECTION00551300000000000000
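>>
>> A quick way to check (just a sketch, untested) would be to look for that
>> attribute on the job itself:
>>
>>   condor_q -long 514.0 | grep -i LeaveJobInQueue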
>>
>>
>> Best,
>>
>>
>> matt
>>
>> Johnson koil Raj wrote:
>>  
>>> Hi,
>>>     One of the jobs went into the X state as soon as it was released
>>> using condor_release. The job was first held by condor_hold, it seems
>>> just before the shadow exited; to restart the job we issued
>>> condor_release. condor_release succeeded, but condor_q now shows the
>>> job in the X state.
>>>
>>> The job was submitted through the SOAP API. We are using version 7.2.3.
>>>
>>> I think the logs below will help to find what went wrong and sent the
>>> job to the X state.
>>>
>>> In Schedd log:
>>> 8/9 21:03:03 (pid:5215) abort_job_myself: 514.0 action:Hold
>>> log_hold:true notify:true
>>> 8/9 21:03:03 (pid:5215) Found shadow record for job 514.0, host =
>>> <192.168.10.92:9620>
>>> 8/9 21:03:14 (pid:5215) No HoldReasonSubCode found for job 514.0
>>> 8/9 21:03:16 (pid:5215) Writing record to user
>>> logfile=/mail/condor/log/VM_514_0.log owner=idealgrid
>>> 8/9 21:03:19 (pid:5215) FileLock object is updating timestamp on:
>>> /mail/condor/log/VM_514_0.log
>>> 8/9 21:03:19 (pid:5215) FileLock::obtain(1) - @1249831999.611700 lock
>>> on /mail/condor/log/VM_514_0.log now WRITE
>>> 8/9 21:03:21 (pid:5215) FileLock::obtain(2) - @1249832001.150186 lock
>>> on /mail/condor/log/VM_514_0.log now UNLOCKED
>>> 8/9 21:03:22 (pid:5215) Shadow pid 6457 for job 514.0 exited with
>>> status 102
>>> 8/9 21:03:22 (pid:5215) Deleting shadow rec for PID 6457, job (514.0)
>>> 8/9 21:03:22 (pid:5215) Writing record to user
>>> logfile=/mail/condor/log/VM_514_0.log owner=idealgrid
>>> 8/9 21:03:22 (pid:5215) FileLock object is updating timestamp on:
>>> /mail/condor/log/VM_514_0.log
>>> 8/9 21:03:22 (pid:5215) FileLock::obtain(1) - @1249832002.754296 lock
>>> on /mail/condor/log/VM_514_0.log now WRITE
>>> 8/9 21:03:24 (pid:5215) FileLock::obtain(2) - @1249832004.133541 lock
>>> on /mail/condor/log/VM_514_0.log now UNLOCKED
>>> 8/9 21:03:24 (pid:5215) Job 514.0 is finished
>>> 8/9 21:03:24 (pid:5215) Job cleanup for 514.0 will not block, calling
>>> jobIsFinished() directly
>>> 8/9 21:03:24 (pid:5215) jobIsFinished() completed, calling
>>> DestroyProc(514.0)
>>>
>>>
>>> In ShadowLog:
>>> 8/9 21:03:03 (514.0) (6457): In handleJobRemoval(), sig 10
>>> 8/9 21:03:03 (514.0) (6457): setting exit reason on
>>> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx to 102
>>> 8/9 21:03:03 (514.0) (6457): Resource
>>> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx changing state from EXECUTING to
>>> FINISHED
>>> 8/9 21:03:03 (514.0) (6457): Entering
>>> DCStartd::deactivateClaim(forceful)
>>> 8/9 21:03:04 (514.0) (6457): DCStartd::deactivateClaim: successfully
>>> sent command
>>> 8/9 21:03:04 (514.0) (6457): Killed starter (fast) at
>>> <192.168.10.92:9620>
>>> 8/9 21:03:16 (514.0) (6457): Inside RemoteResource::updateFromStarter()
>>> 8/9 21:03:19 (514.0) (6457): Inside RemoteResource::resourceExit()
>>> 8/9 21:03:19 (514.0) (6457): setting exit reason on
>>> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx to 107
>>> 8/9 21:03:19 (514.0) (6457): Resource
>>> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx changing state from FINISHED to
>>> FINISHED
>>> 8/9 21:03:19 (514.0) (6457): Job 514.0 is being evicted from
>>> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
>>> 8/9 21:03:19 (514.0) (6457): FileLock::obtain(1) - @1249831999.591092
>>> lock on /mail/condor/log/VM_514_0.log now WRITE
>>> 8/9 21:03:19 (514.0) (6457): FileLock::obtain(2) - @1249831999.610029
>>> lock on /mail/condor/log/VM_514_0.log now UNLOCKED
>>> 8/9 21:03:21 (514.0) (6457): Updating Job Queue:
>>> SetAttribute(LastJobLeaseRenewal = 1249831999)
>>> 8/9 21:03:21 (514.0) (6457): Updating Job Queue:
>>> SetAttribute(RemoteSysCpu = 4.000000)
>>> 8/9 21:03:21 (514.0) (6457): Updating Job Queue:
>>> SetAttribute(RemoteUserCpu = 3435.000000)
>>> 8/9 21:03:21 (514.0) (6457): Updating Job Queue:
>>> SetAttribute(LastVacateTime = 1249831999)
>>> 8/9 21:03:21 (514.0) (6457): Updating Job Queue:
>>> SetAttribute(BytesSent = 0.000000)
>>> 8/9 21:03:21 (514.0) (6457): Updating Job Queue:
>>> SetAttribute(BytesRecvd = 9785.000000)
>>> 8/9 21:03:22 (514.0) (6457): **** condor_shadow (condor_SHADOW) pid
>>> 6457 EXITING WITH STATUS 102
>>>
>>>
>>> In Starter Log:
>>> 8/9 21:03:04 ProcAPI::buildFamily() Found daddypid on the system: 11157
>>> 8/9 21:03:08 Got SIGQUIT.  Performing fast shutdown.
>>> 8/9 21:03:08 ShutdownFast all jobs.
>>> 8/9 21:03:08 Inside VMProc::ShutdownFast()
>>> 8/9 21:03:08 Inside VMProc::StopVM
>>> 8/9 21:03:08 VMGAHP[11157] <- 'CONDOR_VM_STOP 243 1'
>>> 8/9 21:03:09 VMGAHP[11157] -> 'S'
>>> 8/9 21:03:10 VMGAHP[11157] <- 'RESULTS'
>>> 8/9 21:03:11 VMGAHP[11157] -> 'R'
>>> 8/9 21:03:11 VMGAHP[11157] -> 'S' '1'
>>> 8/9 21:03:11 VMGAHP[11157] -> '243' '0' 'NULL'
>>> 8/9 21:03:11 PID for VM is changed from [23754] to [0]
>>> 8/9 21:03:12 Inside VM_GAHP_SERVER::cleanup()
>>> 8/9 21:03:12 VMGAHP[11157] <- 'QUIT'
>>> 8/9 21:03:17 VMGAHP[11157] -> 'S'
>>> 8/9 21:03:18 VMGahpServer::killVM() failed!
>>> 8/9 21:03:18 End of VM_GAHP_SERVER::cleanup
>>> 8/9 21:03:19 Inside VMProc::cleanup()
>>> 8/9 21:03:19 ProcAPI::buildFamily() Found daddypid on the system: 11157
>>>
>>> In UserLog:
>>> 001 (514.000.000) 08/09 15:39:59 Job executing on host:
>>> <192.168.10.92:9620>
>>> ...
>>> 004 (514.000.000) 08/09 21:03:19 Job was evicted.
>>>     (0) Job was not checkpointed.
>>>         Usr 0 00:57:15, Sys 0 00:00:04  -  Run Remote Usage
>>>         Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>>>     0  -  Run Bytes Sent By Job
>>>     1957  -  Run Bytes Received By Job
>>> ...
>>> 013 (514.000.000) 08/09 21:03:19 Job was released.
>>>     via condor_release (by user daemon)
>>> ...
>>> 009 (514.000.000) 08/09 21:03:22 Job was aborted by the user.
>>> ...
>>>
>>>
>>>
>>> thanks
>>> Johnson
>>>
>>> _______________________________________________
>>> Condor-users mailing list
>>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
>>> with a
>>> subject: Unsubscribe
>>> You can also unsubscribe by visiting
>>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>>
>>> The archives can be found at:
>>> https://lists.cs.wisc.edu/archive/condor-users/
>>>     
>>
>>
>>   
>