
Re: [Condor-users] Job went to X state without issuing condor_rm.



Hi Matt,
No, the job is not in that state. condor_rm was not issued; we ran only condor_release. The schedd log shows that clearly, yet the job went into the X state.

8/9 21:03:03 (pid:5215) abort_job_myself: 514.0 action:Hold log_hold:true notify:true
8/9 21:03:03 (pid:5215) Found shadow record for job 514.0, host = <192.168.10.92:9620>
8/9 21:03:14 (pid:5215) No HoldReasonSubCode found for job 514.0

We have run into the same issue a few times before.

thanks
Johnson

Matthew Farrellee wrote:
Is the job X because of LeaveJobInQueue?

http://www.cs.wisc.edu/condor/manual/v7.3/4_5Application_Program.html#SECTION00551300000000000000

Best,


matt

Johnson koil Raj wrote:
Hi,
One of our jobs went into the X state as soon as it was released with condor_release. The job was first held with condor_hold; then, apparently before the shadow had exited, we issued condor_release to start the job again. condor_release succeeded, but condor_q then showed the job in the X state.

The job was submitted through the SOAP API. We are using version 7.2.3.

I think the logs below will help to find what sent the job to the X state.

In Schedd log:
8/9 21:03:03 (pid:5215) abort_job_myself: 514.0 action:Hold log_hold:true notify:true
8/9 21:03:03 (pid:5215) Found shadow record for job 514.0, host = <192.168.10.92:9620>
8/9 21:03:14 (pid:5215) No HoldReasonSubCode found for job 514.0
8/9 21:03:16 (pid:5215) Writing record to user logfile=/mail/condor/log/VM_514_0.log owner=idealgrid
8/9 21:03:19 (pid:5215) FileLock object is updating timestamp on: /mail/condor/log/VM_514_0.log
8/9 21:03:19 (pid:5215) FileLock::obtain(1) - @1249831999.611700 lock on /mail/condor/log/VM_514_0.log now WRITE
8/9 21:03:21 (pid:5215) FileLock::obtain(2) - @1249832001.150186 lock on /mail/condor/log/VM_514_0.log now UNLOCKED
8/9 21:03:22 (pid:5215) Shadow pid 6457 for job 514.0 exited with status 102
8/9 21:03:22 (pid:5215) Deleting shadow rec for PID 6457, job (514.0)
8/9 21:03:22 (pid:5215) Writing record to user logfile=/mail/condor/log/VM_514_0.log owner=idealgrid
8/9 21:03:22 (pid:5215) FileLock object is updating timestamp on: /mail/condor/log/VM_514_0.log
8/9 21:03:22 (pid:5215) FileLock::obtain(1) - @1249832002.754296 lock on /mail/condor/log/VM_514_0.log now WRITE
8/9 21:03:24 (pid:5215) FileLock::obtain(2) - @1249832004.133541 lock on /mail/condor/log/VM_514_0.log now UNLOCKED
8/9 21:03:24 (pid:5215) Job 514.0 is finished
8/9 21:03:24 (pid:5215) Job cleanup for 514.0 will not block, calling jobIsFinished() directly
8/9 21:03:24 (pid:5215) jobIsFinished() completed, calling DestroyProc(514.0)


In ShadowLog:
8/9 21:03:03 (514.0) (6457): In handleJobRemoval(), sig 10
8/9 21:03:03 (514.0) (6457): setting exit reason on slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx to 102
8/9 21:03:03 (514.0) (6457): Resource slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx changing state from EXECUTING to FINISHED
8/9 21:03:03 (514.0) (6457): Entering DCStartd::deactivateClaim(forceful)
8/9 21:03:04 (514.0) (6457): DCStartd::deactivateClaim: successfully sent command
8/9 21:03:04 (514.0) (6457): Killed starter (fast) at <192.168.10.92:9620>
8/9 21:03:16 (514.0) (6457): Inside RemoteResource::updateFromStarter()
8/9 21:03:19 (514.0) (6457): Inside RemoteResource::resourceExit()
8/9 21:03:19 (514.0) (6457): setting exit reason on slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx to 107
8/9 21:03:19 (514.0) (6457): Resource slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx changing state from FINISHED to FINISHED
8/9 21:03:19 (514.0) (6457): Job 514.0 is being evicted from slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx
8/9 21:03:19 (514.0) (6457): FileLock::obtain(1) - @1249831999.591092 lock on /mail/condor/log/VM_514_0.log now WRITE
8/9 21:03:19 (514.0) (6457): FileLock::obtain(2) - @1249831999.610029 lock on /mail/condor/log/VM_514_0.log now UNLOCKED
8/9 21:03:21 (514.0) (6457): Updating Job Queue: SetAttribute(LastJobLeaseRenewal = 1249831999)
8/9 21:03:21 (514.0) (6457): Updating Job Queue: SetAttribute(RemoteSysCpu = 4.000000)
8/9 21:03:21 (514.0) (6457): Updating Job Queue: SetAttribute(RemoteUserCpu = 3435.000000)
8/9 21:03:21 (514.0) (6457): Updating Job Queue: SetAttribute(LastVacateTime = 1249831999)
8/9 21:03:21 (514.0) (6457): Updating Job Queue: SetAttribute(BytesSent = 0.000000)
8/9 21:03:21 (514.0) (6457): Updating Job Queue: SetAttribute(BytesRecvd = 9785.000000)
8/9 21:03:22 (514.0) (6457): **** condor_shadow (condor_SHADOW) pid 6457 EXITING WITH STATUS 102


In Starter Log
8/9 21:03:04 ProcAPI::buildFamily() Found daddypid on the system: 11157
8/9 21:03:08 Got SIGQUIT.  Performing fast shutdown.
8/9 21:03:08 ShutdownFast all jobs.
8/9 21:03:08 Inside VMProc::ShutdownFast()
8/9 21:03:08 Inside VMProc::StopVM
8/9 21:03:08 VMGAHP[11157] <- 'CONDOR_VM_STOP 243 1'
8/9 21:03:09 VMGAHP[11157] -> 'S'
8/9 21:03:10 VMGAHP[11157] <- 'RESULTS'
8/9 21:03:11 VMGAHP[11157] -> 'R'
8/9 21:03:11 VMGAHP[11157] -> 'S' '1'
8/9 21:03:11 VMGAHP[11157] -> '243' '0' 'NULL'
8/9 21:03:11 PID for VM is changed from [23754] to [0]
8/9 21:03:12 Inside VM_GAHP_SERVER::cleanup()
8/9 21:03:12 VMGAHP[11157] <- 'QUIT'
8/9 21:03:17 VMGAHP[11157] -> 'S'
8/9 21:03:18 VMGahpServer::killVM() failed!
8/9 21:03:18 End of VM_GAHP_SERVER::cleanup
8/9 21:03:19 Inside VMProc::cleanup()
8/9 21:03:19 ProcAPI::buildFamily() Found daddypid on the system: 11157
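
Because the hold, the release, and the shadow exit all fall within about twenty seconds, it can help to interleave the three daemon logs into one timeline. The following is only a rough sketch of that idea, not a Condor tool: the function names `parse_log` and `merge_timeline` are made up for illustration, the year 2009 is an assumption (the log lines carry only month and day), and the sample strings are a few lines copied from the excerpts above.

```python
import re
from datetime import datetime

# Daemon log lines in the excerpts start with "M/D HH:MM:SS ".
STAMP = re.compile(r"^(\d{1,2})/(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) (.*)$")

def parse_log(source, text, year=2009):
    """Return (timestamp, source, message) tuples; the year is an assumption."""
    entries = []
    for line in text.splitlines():
        m = STAMP.match(line.strip())
        if not m:
            continue  # skip blank or continuation lines
        month, day, hh, mm, ss = (int(g) for g in m.groups()[:5])
        entries.append((datetime(year, month, day, hh, mm, ss), source, m.group(6)))
    return entries

def merge_timeline(*logs):
    """Merge parsed logs into one list ordered by timestamp (stable sort)."""
    merged = [entry for log in logs for entry in log]
    merged.sort(key=lambda entry: entry[0])
    return merged

schedd = parse_log("schedd", """\
8/9 21:03:03 (pid:5215) abort_job_myself: 514.0 action:Hold log_hold:true notify:true
8/9 21:03:22 (pid:5215) Shadow pid 6457 for job 514.0 exited with status 102
""")
shadow = parse_log("shadow", """\
8/9 21:03:04 (514.0) (6457): Killed starter (fast) at <192.168.10.92:9620>
""")
starter = parse_log("starter", """\
8/9 21:03:08 Got SIGQUIT.  Performing fast shutdown.
8/9 21:03:18 VMGahpServer::killVM() failed!
""")

timeline = merge_timeline(schedd, shadow, starter)
for when, source, message in timeline:
    print(when.strftime("%H:%M:%S"), f"[{source}]", message)
```

Timestamps have one-second resolution, so entries from different logs that share a second keep the order in which the logs were passed in; that order is not evidence of which event actually happened first.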

In UserLog
001 (514.000.000) 08/09 15:39:59 Job executing on host: <192.168.10.92:9620>
...
004 (514.000.000) 08/09 21:03:19 Job was evicted.
    (0) Job was not checkpointed.
        Usr 0 00:57:15, Sys 0 00:00:04  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
    0  -  Run Bytes Sent By Job
    1957  -  Run Bytes Received By Job
...
013 (514.000.000) 08/09 21:03:19 Job was released.
    via condor_release (by user daemon)
...
009 (514.000.000) 08/09 21:03:22 Job was aborted by the user.
...
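
For anyone who wants to check the order of events in a user log like the one above, here is a minimal sketch that pulls out the event header lines. It assumes the header layout seen in this excerpt (three-digit event number, job id in parentheses, date, time, text); `parse_events` and `SAMPLE` are names made up for this illustration, not part of Condor.

```python
import re

# Event header as it appears in the excerpt:
# "<event number> (<cluster>.<proc>.<subproc>) <MM/DD> <HH:MM:SS> <text>"
EVENT_HEADER = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$"
)

def parse_events(text):
    """Return (event number, job id, date, time, text) for each header line."""
    events = []
    for line in text.splitlines():
        m = EVENT_HEADER.match(line)
        if m:  # indented detail lines and "..." separators do not match
            code, cluster, proc, _sub, date, time, message = m.groups()
            events.append((code, f"{int(cluster)}.{int(proc)}", date, time, message))
    return events

SAMPLE = """\
001 (514.000.000) 08/09 15:39:59 Job executing on host: <192.168.10.92:9620>
...
004 (514.000.000) 08/09 21:03:19 Job was evicted.
    (0) Job was not checkpointed.
...
013 (514.000.000) 08/09 21:03:19 Job was released.
    via condor_release (by user daemon)
...
009 (514.000.000) 08/09 21:03:22 Job was aborted by the user.
...
"""

events = parse_events(SAMPLE)
for code, job, date, time, message in events:
    print(code, job, date, time, message)
```

Run on the excerpt, this lists the release event (013) three seconds before the abort event (009), which is the sequence in question here.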



thanks
Johnson

Please do not print this email unless it is absolutely necessary. The information contained in this electronic message and any attachments to this message are intended for the exclusive use of the addressee(s) and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately and destroy all copies of this message and any attachments. WARNING: Computer viruses can be transmitted via email. The recipient should check this email and any attachments for the presence of viruses. The company accepts no liability for any damage caused by any virus transmitted by this email.
www.wipro.com
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at: https://lists.cs.wisc.edu/archive/condor-users/



