
Re: [Condor-users] Job went to X state without issuing condor_rm.



Hi Matt,

This is a VM job, and it did not complete. The Starter and User logs given below show that the job was evicted.

I have copied some of the job ad attributes for that job below.

JobStatus = 3
ExitStatus = 0
ExitBySignal = FALSE
CompletionDate = 0
LastReleaseReason = "via condor_release (by user daemon)"
ReleaseReason = "via condor_release (by user daemon)"
LastHoldReason = "via condor_hold (by user daemon)"
LastHoldReasonCode = 1
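
For readers decoding the attributes above: JobStatus = 3 is the Removed state, which condor_q displays as X. A minimal Python sketch of that mapping follows. The dictionary and the function name status_letter are illustrative only and are not part of Condor; only the commonly documented codes 1 to 5 are listed.

```python
# Commonly documented Condor JobStatus codes and the letter condor_q shows.
# This table and the helper below are illustrative, not a Condor API.
JOB_STATUS = {
    1: ("Idle", "I"),
    2: ("Running", "R"),
    3: ("Removed", "X"),
    4: ("Completed", "C"),
    5: ("Held", "H"),
}


def status_letter(job_status):
    """Return (name, letter) for a JobStatus value, or ("Unknown", "?")."""
    return JOB_STATUS.get(job_status, ("Unknown", "?"))


print(status_letter(3))
```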

We do not see this problem frequently, but we have noticed it 3 to 4 times.

thanks
johnson

Matthew Farrellee wrote:
Is it possible that the job actually completed right before the hold took effect? You could check the job ad to see if it has all the indicators of a completed job, e.g. JobStatus = 4, a CompletionDate, ExitCode/Status/BySignal.
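
The check described above can be sketched in Python, assuming the job ad is available as plain "Name = value" text (for example, pasted from a long listing). The helper names parse_ad and looks_completed are hypothetical, and the sample values are the ones quoted earlier in this thread.

```python
def parse_ad(text):
    """Parse 'Name = value' lines into a dict of raw string values."""
    ad = {}
    for line in text.splitlines():
        if "=" in line:
            name, value = line.split("=", 1)
            ad[name.strip()] = value.strip()
    return ad


def looks_completed(ad):
    """True only if the ad has JobStatus 4 and a non-zero CompletionDate."""
    return ad.get("JobStatus") == "4" and int(ad.get("CompletionDate", "0")) > 0


# Attribute values as reported for job 514.0 in this thread.
ad = parse_ad("""JobStatus = 3
ExitStatus = 0
ExitBySignal = FALSE
CompletionDate = 0""")

print(looks_completed(ad))
```

For the ad quoted in this thread the check is false, which matches the report that the job did not complete.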

IIRC, Jaime added code that ensures that if you're completed/removed and you get held, you go back into completed/removed instead of idle.

Is this an ongoing problem or just a strange anomaly?

Best,


matt

Johnson koil Raj wrote:
Hi Matt,
        No, the job is not in that state. condor_rm was not issued; we issued
only condor_release. The schedd log shows that clearly, but the job went
into the X state.

8/9 21:03:03 (pid:5215) abort_job_myself: 514.0 action:Hold
log_hold:true notify:true
8/9 21:03:03 (pid:5215) Found shadow record for job 514.0, host =
<192.168.10.92:9620>
8/9 21:03:14 (pid:5215) No HoldReasonSubCode found for job 514.0

We have faced the same issue a few times before.
thanks
Johnson

Matthew Farrellee wrote:
Is the job X because of LeaveJobInQueue?

http://www.cs.wisc.edu/condor/manual/v7.3/4_5Application_Program.html#SECTION00551300000000000000


Best,


matt

Johnson koil Raj wrote:
Hi,
    One of our jobs went into the X state as soon as it was released
with condor_release. The job was first held with condor_hold, apparently
before the shadow exited, and we issued condor_release to start it
again. condor_release succeeded, but condor_q shows the job in the X
state.

The job was submitted through the SOAP API. We are using version 7.2.3.

I think the logs below will help to find what sent the job to the X
state.

In Schedd log:
8/9 21:03:03 (pid:5215) abort_job_myself: 514.0 action:Hold
log_hold:true notify:true
8/9 21:03:03 (pid:5215) Found shadow record for job 514.0, host =
<192.168.10.92:9620>
8/9 21:03:14 (pid:5215) No HoldReasonSubCode found for job 514.0
8/9 21:03:16 (pid:5215) Writing record to user
logfile=/mail/condor/log/VM_514_0.log owner=idealgrid
8/9 21:03:19 (pid:5215) FileLock object is updating timestamp on:
/mail/condor/log/VM_514_0.log
8/9 21:03:19 (pid:5215) FileLock::obtain(1) - @1249831999.611700 lock
on /mail/condor/log/VM_514_0.log now WRITE
8/9 21:03:21 (pid:5215) FileLock::obtain(2) - @1249832001.150186 lock
on /mail/condor/log/VM_514_0.log now UNLOCKED
8/9 21:03:22 (pid:5215) Shadow pid 6457 for job 514.0 exited with
status 102
8/9 21:03:22 (pid:5215) Deleting shadow rec for PID 6457, job (514.0)
8/9 21:03:22 (pid:5215) Writing record to user
logfile=/mail/condor/log/VM_514_0.log owner=idealgrid
8/9 21:03:22 (pid:5215) FileLock object is updating timestamp on:
/mail/condor/log/VM_514_0.log
8/9 21:03:22 (pid:5215) FileLock::obtain(1) - @1249832002.754296 lock
on /mail/condor/log/VM_514_0.log now WRITE
8/9 21:03:24 (pid:5215) FileLock::obtain(2) - @1249832004.133541 lock
on /mail/condor/log/VM_514_0.log now UNLOCKED
8/9 21:03:24 (pid:5215) Job 514.0 is finished
8/9 21:03:24 (pid:5215) Job cleanup for 514.0 will not block, calling
jobIsFinished() directly
8/9 21:03:24 (pid:5215) jobIsFinished() completed, calling
DestroyProc(514.0)


In ShadowLog:
8/9 21:03:03 (514.0) (6457): In handleJobRemoval(), sig 10
8/9 21:03:03 (514.0) (6457): setting exit reason on
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx to 102
8/9 21:03:03 (514.0) (6457): Resource
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx changing state from EXECUTING to
FINISHED
8/9 21:03:03 (514.0) (6457): Entering
DCStartd::deactivateClaim(forceful)
8/9 21:03:04 (514.0) (6457): DCStartd::deactivateClaim: successfully
sent command
8/9 21:03:04 (514.0) (6457): Killed starter (fast) at
<192.168.10.92:9620>
8/9 21:03:16 (514.0) (6457): Inside RemoteResource::updateFromStarter()
8/9 21:03:19 (514.0) (6457): Inside RemoteResource::resourceExit()
8/9 21:03:19 (514.0) (6457): setting exit reason on
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx to 107
8/9 21:03:19 (514.0) (6457): Resource
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx changing state from FINISHED to
FINISHED
8/9 21:03:19 (514.0) (6457): Job 514.0 is being evicted from
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
8/9 21:03:19 (514.0) (6457): FileLock::obtain(1) - @1249831999.591092
lock on /mail/condor/log/VM_514_0.log now WRITE
8/9 21:03:19 (514.0) (6457): FileLock::obtain(2) - @1249831999.610029
lock on /mail/condor/log/VM_514_0.log now UNLOCKED
8/9 21:03:21 (514.0) (6457): Updating Job Queue:
SetAttribute(LastJobLeaseRenewal = 1249831999)
8/9 21:03:21 (514.0) (6457): Updating Job Queue:
SetAttribute(RemoteSysCpu = 4.000000)
8/9 21:03:21 (514.0) (6457): Updating Job Queue:
SetAttribute(RemoteUserCpu = 3435.000000)
8/9 21:03:21 (514.0) (6457): Updating Job Queue:
SetAttribute(LastVacateTime = 1249831999)
8/9 21:03:21 (514.0) (6457): Updating Job Queue:
SetAttribute(BytesSent = 0.000000)
8/9 21:03:21 (514.0) (6457): Updating Job Queue:
SetAttribute(BytesRecvd = 9785.000000)
8/9 21:03:22 (514.0) (6457): **** condor_shadow (condor_SHADOW) pid
6457 EXITING WITH STATUS 102


In StarterLog:
8/9 21:03:04 ProcAPI::buildFamily() Found daddypid on the system: 11157
8/9 21:03:08 Got SIGQUIT.  Performing fast shutdown.
8/9 21:03:08 ShutdownFast all jobs.
8/9 21:03:08 Inside VMProc::ShutdownFast()
8/9 21:03:08 Inside VMProc::StopVM
8/9 21:03:08 VMGAHP[11157] <- 'CONDOR_VM_STOP 243 1'
8/9 21:03:09 VMGAHP[11157] -> 'S'
8/9 21:03:10 VMGAHP[11157] <- 'RESULTS'
8/9 21:03:11 VMGAHP[11157] -> 'R'
8/9 21:03:11 VMGAHP[11157] -> 'S' '1'
8/9 21:03:11 VMGAHP[11157] -> '243' '0' 'NULL'
8/9 21:03:11 PID for VM is changed from [23754] to [0]
8/9 21:03:12 Inside VM_GAHP_SERVER::cleanup()
8/9 21:03:12 VMGAHP[11157] <- 'QUIT'
8/9 21:03:17 VMGAHP[11157] -> 'S'
8/9 21:03:18 VMGahpServer::killVM() failed!
8/9 21:03:18 End of VM_GAHP_SERVER::cleanup
8/9 21:03:19 Inside VMProc::cleanup()
8/9 21:03:19 ProcAPI::buildFamily() Found daddypid on the system: 11157

In UserLog:
001 (514.000.000) 08/09 15:39:59 Job executing on host:
<192.168.10.92:9620>
...
004 (514.000.000) 08/09 21:03:19 Job was evicted.
    (0) Job was not checkpointed.
        Usr 0 00:57:15, Sys 0 00:00:04  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
    0  -  Run Bytes Sent By Job
    1957  -  Run Bytes Received By Job
...
013 (514.000.000) 08/09 21:03:19 Job was released.
    via condor_release (by user daemon)
...
009 (514.000.000) 08/09 21:03:22 Job was aborted by the user.
...
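
To make the order of events in the user log above easier to see, here is a minimal Python sketch that pulls out the event header lines. The regular expression, the EVENT_NAMES table and the function parse_events are assumptions written for this excerpt, not a Condor API; the sample text is an abridged copy of the log above.

```python
import re

# A few well-known user log event numbers; anything else is reported as "Other".
EVENT_NAMES = {0: "Submit", 1: "Execute", 4: "Evicted", 5: "Terminated",
               9: "Aborted", 12: "Held", 13: "Released"}

# Header line shape: "NNN (cluster.proc.subproc) MM/DD HH:MM:SS message"
HEADER = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$"
)


def parse_events(log_text):
    """Return a list of (timestamp, event_name, job_id) for each header line."""
    events = []
    for line in log_text.splitlines():
        m = HEADER.match(line)
        if m:
            code = int(m.group(1))
            job_id = f"{int(m.group(2))}.{int(m.group(3))}"
            events.append((m.group(5), EVENT_NAMES.get(code, "Other"), job_id))
    return events


SAMPLE = """001 (514.000.000) 08/09 15:39:59 Job executing on host:
<192.168.10.92:9620>
...
004 (514.000.000) 08/09 21:03:19 Job was evicted.
...
013 (514.000.000) 08/09 21:03:19 Job was released.
...
009 (514.000.000) 08/09 21:03:22 Job was aborted by the user.
..."""

for event in parse_events(SAMPLE):
    print(event)
```

On this excerpt the release event is followed three seconds later by the abort event, even though no condor_rm was issued.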



thanks
Johnson

Please do not print this email unless it is absolutely necessary.
The information contained in this electronic message and any
attachments to this message are intended for the exclusive use of the
addressee(s) and may contain proprietary, confidential or privileged
information. If you are not the intended recipient, you should not
disseminate, distribute or copy this e-mail. Please notify the sender
immediately and destroy all copies of this message and any attachments.
WARNING: Computer viruses can be transmitted via email. The recipient
should check this email and any attachments for the presence of
viruses. The company accepts no liability for any damage caused by
any virus transmitted by this email.
www.wipro.com
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/



