
Re: [HTCondor-users] Job finished with status 115



On 2/11/2021 3:34 AM, Jean-Claude CHEVALEYRE wrote:
Hello,

I have some ATLAS jobs that are failing. I have looked in the log files.
For example, I can see that job number 93742.0 finished with a status 115. What exactly does this status mean?


Hi Jean-Claude,

Looking at your investigation below (thank you for including this), I think the confusion here is that the job did not exit with status 115.  The condor_shadow process (a component of the HTCondor service) exited with status 115, but that is not the job process.
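
To make the distinction concrete, here is a minimal sketch that pulls both numbers out of the two ShadowLog lines quoted later in this message. The temporary file and the variable names are only for illustration; on a real system you would point sed at the ShadowLog itself.

```shell
# Copy of the two relevant ShadowLog lines from the excerpt quoted in this thread.
shadow_log=$(mktemp)
cat > "$shadow_log" <<'EOF'
02/10/21 23:38:21 (93742.0) (3863434): Job 93742.0 terminated: exited with status 0
02/10/21 23:38:21 (93742.0) (3863434): **** condor_shadow (condor_SHADOW) pid 3863434 EXITING WITH STATUS 115
EOF

# Exit status of the job process itself.
job_status=$(sed -n 's/.*Job 93742\.0 terminated: exited with status \([0-9][0-9]*\).*/\1/p' "$shadow_log")

# Exit status of the condor_shadow daemon that managed the job.
shadow_status=$(sed -n 's/.*condor_shadow .* EXITING WITH STATUS \([0-9][0-9]*\).*/\1/p' "$shadow_log")

echo "job exit status:    $job_status"
echo "shadow exit status: $shadow_status"
```

The two values come from different processes, so the 115 says nothing about how the job itself ended.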

To see the exit status for a job, you could look in the EventLog or use the condor_history command.

Below I see that you grepped the event log and found a Job Terminated event for job 93742.0; the exit status for that job appears on the line after it.  In other words, events in the event log span multiple lines, so your grep did not show it.
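
As a sketch of how to keep the continuation line, grep can be asked for trailing context with -A (supported by GNU and BSD grep). The sample event text below is a hand-written stand-in modeled on the usual layout of a terminate event, not a copy of this site's EventLog, and the function name show_terminate_event is made up for the example.

```shell
# Illustrative stand-in for condor/EventLog (assumed layout, not real site data).
event_log=$(mktemp)
cat > "$event_log" <<'EOF'
006 (93742.000.000) 02/09 19:33:15 Image size of job updated: 1887892
...
005 (93742.000.000) 02/10 23:38:21 Job terminated.
    (1) Normal termination (return value 0)
...
EOF

# Print the "005" (job terminated) header for one job plus the line after it.
# $1 = job id in event-log form, e.g. 93742.000.000; $2 = path to the event log.
show_terminate_event() {
    grep -A 1 -F "005 ($1)" "$2"
}

show_terminate_event "93742.000.000" "$event_log"
```

Against the real log this would be something like `grep -A 1 -F '005 (93742.000.000)' condor/EventLog`; raise the -A count to see more of the event.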

Alternatively, you can use the "condor_history" command.  This command is similar to condor_q, but lets you see attributes of jobs that have left the queue (due to completion or removal).  From your submit machine, enter the following to see the exit code:

      condor_history 93742.0 -limit 1 -af exitcode

Or, to see all attributes of this completed job:

      condor_history 93742.0 -limit 1 -l

See the condor_history manual page (man condor_history) for more options.  Documentation for most of the available job attributes can be found in the Manual appendix here:
  https://htcondor.readthedocs.io/en/latest/classad-attributes/job-classad-attributes.html#job-classad-attributes
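
If you want to run the exit-code lookup from a script, a small wrapper along these lines may help. It is only a sketch: job_exit_code is a made-up helper name, and the only HTCondor call it makes is the condor_history command shown above.

```shell
# Print the exit code recorded for a job that has left the queue.
# $1 = job id, e.g. 93742.0
job_exit_code() {
    # Fail early with a clear message when not on a submit machine.
    if ! command -v condor_history >/dev/null 2>&1; then
        echo "condor_history not found in PATH" >&2
        return 127
    fi
    condor_history "$1" -limit 1 -af exitcode
}

# Usage (on the submit machine):
#   job_exit_code 93742.0
```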

Hope the above helps,
Todd


Below are some extracts from the logs:

[root@gridarcce01 log]# grep -RH '93742' arc/arex-jobs* | more
arc/arex-jobs.log-20210211:2021-02-10 23:45:00 Finished - job id: 6PwKDm5cYTynOUEdEnzo691oABFKDmABFKDmzcfXDmDBFKDmDTZXHm, unix user: 41000:1307, name: "arc_pilot", owner: "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1", lrms: condor, queue: grid, lrmsid: 93742.gridarcce01


[root@gridarcce01 log]# grep -RH '93742' condor/EventLog | more

condor/EventLog:        937428  -  ResidentSetSize of job (KB)
condor/EventLog:006 (24968.000.000) 12/18 10:32:49 Image size of job updated: 937424
condor/EventLog:006 (26125.000.000) 12/19 11:22:07 Image size of job updated: 937424
condor/EventLog:006 (26254.000.000) 12/19 16:32:57 Image size of job updated: 937424
condor/EventLog:006 (26254.000.000) 12/19 16:37:57 Image size of job updated: 937424
condor/EventLog:        937424  -  ResidentSetSize of job (KB)
condor/EventLog:        937420  -  ResidentSetSize of job (KB)
condor/EventLog:006 (71776.000.000) 01/21 00:35:38 Image size of job updated: 937428
condor/EventLog:006 (73442.000.000) 01/22 02:29:37 Image size of job updated: 937428
condor/EventLog:        937428  -  ResidentSetSize of job (KB)
condor/EventLog:006 (78058.000.000) 01/26 02:56:24 Image size of job updated: 937428
condor/EventLog:000 (93742.000.000) 02/09 04:12:28 Job submitted from host: <193.55.252.153:9618?addrs=193.55.252.153-9618&noUDP&sock=3115801_e73c_4>
condor/EventLog:001 (93742.000.000) 02/09 19:03:03 Job executing on host: <193.55.252.169:9618?addrs=193.55.252.169-9618&noUDP&sock=2279_c86d_3>
condor/EventLog:006 (93742.000.000) 02/09 19:03:11 Image size of job updated: 2304
condor/EventLog:006 (93742.000.000) 02/09 19:08:11 Image size of job updated: 67160
condor/EventLog:006 (93742.000.000) 02/09 19:13:12 Image size of job updated: 110340
condor/EventLog:006 (93742.000.000) 02/09 19:18:13 Image size of job updated: 1410420
condor/EventLog:006 (93742.000.000) 02/09 19:23:13 Image size of job updated: 1887892
condor/EventLog:006 (93742.000.000) 02/09 19:33:15 Image size of job updated: 1887892
condor/EventLog:005 (93742.000.000) 02/10 23:38:21 Job terminated.


condor/ShadowLog.old:02/10/21 11:43:04 (93742.0) (3863434): Time to redelegate short-lived proxy to starter.
condor/ShadowLog.old:02/10/21 23:38:21 (93742.0) (3863434): File transfer completed successfully.
condor/ShadowLog.old:02/10/21 23:38:21 (93742.0) (3863434): Job 93742.0 terminated: exited with status 0
condor/ShadowLog.old:02/10/21 23:38:21 (93742.0) (3863434): WriteUserLog checking for event log rotation, but no lock
condor/ShadowLog.old:02/10/21 23:38:21 (93742.0) (3863434): **** condor_shadow (condor_SHADOW) pid 3863434 EXITING WITH STATUS 115


[root@gridarcce01 log]# grep -RH '93742' condor/SchedLog | more
condor/SchedLog:02/10/21 23:38:21 (pid:3115849) Shadow pid 3863434 for job 93742.0 exited with status 115
condor/SchedLog:02/10/21 23:38:21 (pid:3115849) Match record (slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx <193.55.252.169:9618?addrs=193.55.252.169-9618&noUDP&sock=2279_c86d_3> for group_ATLAS.atlasprd_score.atlasprd, 93742.0) deleted


Any ideas are welcome.

Thanks
Jean-Claude

------------------------------------------------------------------------
Jean-Claude Chevaleyre < Jean-Claude.Chevaleyre(at)clermont.in2p3.fr >
Laboratoire de Physique Clermont
Campus Universitaire des Cézeaux
4 Avenue Blaise Pascal
TSA 60026
CS 60026
63178 Aubière Cedex

Tel : 04 73 40 73 60

-------------------------------------------------------------------------
