
Re: [HTCondor-users] Job finished with status 115



Hello Todd,

That is very clear to me now.
Many thanks for the good explanation.

Best Regards
Jean-Claude

----- Original Message -----
From: "Todd Tannenbaum" <tannenba@xxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>, "Chevaleyre Jean-Claude" <jean-claude.chevaleyre@xxxxxxxxxxxxxxxxx>
Cc: "Jean-Claude CHEVALEYRE" <chevaleyre@xxxxxxxxxxxxxxxxx>
Sent: Thursday, February 11, 2021 16:49:12
Subject: Re: [HTCondor-users] Job finished with status 115

On 2/11/2021 3:34 AM, Jean-Claude CHEVALEYRE wrote:
> Hello,
>
> I have some ATLAS jobs that are failing. I have looked in the log files.
> For example, job number 93742.0 finished with a status of 115. What exactly does this
> status mean?
>

Hi Jean-Claude,

Looking at your investigation below (thank you for including this), I think the confusion is that the job itself did not
exit with status 115. The condor_shadow process (a component of the HTCondor service) exited with status 115, but that
is not the job process.

To see the exit status for a job, you could look in the EventLog or use the condor_history command.

Below I see that you grepped the event log and there is a Job Terminate event for job 93742.0... the exit status for 
that job will appear in the next line. In other words, events in the event log are multi-line, and thus your grep did 
not show it.
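As a rough sketch of how to see whole events rather than only their first line, the shell function below prints every complete event for one job. It assumes the usual event log layout, where each event starts with a header such as "005 (93742.000.000) ..." and ends with a line containing only "...". The function name show_job_events is made up for this illustration.

```shell
# Print every complete (multi-line) event for one job from an HTCondor event log.
# Assumption: each event begins with "NNN (cluster.proc.subproc) ..." and
# ends with a line that contains only "...".
show_job_events() {
    job="$1"        # id as written in the log, e.g. 93742.000.000
    logfile="$2"    # e.g. condor/EventLog
    awk -v id="($job)" '
        index($0, id) == 5 { printing = 1 }   # header: 3-digit code, space, then "(id)"
        printing           { print }          # emit header and continuation lines
        $0 == "..."        { printing = 0 }   # event terminator
    ' "$logfile"
}

# Usage (path taken from the grep commands quoted below):
#   show_job_events 93742.000.000 condor/EventLog
```

For a quick look, something like grep -A 2 '^005 (93742' condor/EventLog should also work, since -A prints the lines that follow each match.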

Alternatively, you can use the "condor_history" command. This command is similar to condor_q, but allows you to see 
attributes about jobs that have left the queue (due to completion or removal). From your submit machine enter the 
following to see the exitcode:

condor_history 93742.0 -limit 1 -af exitcode

Or to see all attributes about this completed job do:

condor_history 93742.0 -limit 1 -l

See the condor_history manual page (man condor_history) for more options, and documentation about most of the available 
job attributes can be found in the Manual appendix here:
https://htcondor.readthedocs.io/en/latest/classad-attributes/job-classad-attributes.html#job-classad-attributes
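If you check exit codes often, a small wrapper along these lines might be convenient. It is only a sketch: ExitBySignal, ExitCode and ExitSignal are job ClassAd attributes listed in the appendix above, but the function name job_exit_summary and its output wording are invented here, and it assumes that -af prints an unset attribute as "undefined".

```shell
# Summarise how a completed job ended, using condor_history.
# Hypothetical helper; run it on the submit machine.
job_exit_summary() {
    jobid="$1"    # e.g. 93742.0
    # -af prints the requested attributes space-separated on one line.
    # Word-splitting the output into $1 $2 $3 is intentional.
    set -- $(condor_history "$jobid" -limit 1 -af ExitBySignal ExitCode ExitSignal)
    if [ "$#" -eq 0 ]; then
        echo "job $jobid: not found in history"
    elif [ "$1" = "true" ]; then
        echo "job $jobid: killed by signal $3"
    else
        echo "job $jobid: exited with code $2"
    fi
}

# Usage:
#   job_exit_summary 93742.0
```

Checking ExitBySignal first matters because ExitCode is only meaningful when the job exited on its own rather than being killed by a signal.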

Hope the above helps,
Todd


> Below are some extracts from the logs:
>
> *[root@gridarcce01 log]# grep -RH '93742' arc/arex-jobs* | more*
> *arc/arex-jobs.log-20210211:2021-02-10 23:45:00 Finished - job id: 
> 6PwKDm5cYTynOUEdEnzo691oABFKDmABFKDmzcfXDmDBFKDmDTZXHm, unix user: 41000:1307, name: "arc_pilot", owner: 
> "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN*
> *=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1", lrms: condor, queue: grid, lrmsid: 93742.gridarcce01*
>
>
> *[root@gridarcce01 log]# grep -RH '93742' condor/EventLog | more*
>
> *condor/EventLog:    937428  -  ResidentSetSize of job (KB)*
> *condor/EventLog:006 (24968.000.000) 12/18 10:32:49 Image size of job updated: 937424*
> *condor/EventLog:006 (26125.000.000) 12/19 11:22:07 Image size of job updated: 937424*
> *condor/EventLog:006 (26254.000.000) 12/19 16:32:57 Image size of job updated: 937424*
> *condor/EventLog:006 (26254.000.000) 12/19 16:37:57 Image size of job updated: 937424*
> *condor/EventLog:    937424  -  ResidentSetSize of job (KB)*
> *condor/EventLog:    937420  -  ResidentSetSize of job (KB)*
> *condor/EventLog:006 (71776.000.000) 01/21 00:35:38 Image size of job updated: 937428*
> *condor/EventLog:006 (73442.000.000) 01/22 02:29:37 Image size of job updated: 937428*
> *condor/EventLog:    937428  -  ResidentSetSize of job (KB)*
> *condor/EventLog:006 (78058.000.000) 01/26 02:56:24 Image size of job updated: 937428*
> *condor/EventLog:000 (93742.000.000) 02/09 04:12:28 Job submitted from host: 
> <193.55.252.153:9618?addrs=193.55.252.153-9618&noUDP&sock=3115801_e73c_4>*
> *condor/EventLog:001 (93742.000.000) 02/09 19:03:03 Job executing on host: 
> <193.55.252.169:9618?addrs=193.55.252.169-9618&noUDP&sock=2279_c86d_3>*
> *condor/EventLog:006 (93742.000.000) 02/09 19:03:11 Image size of job updated: 2304*
> *condor/EventLog:006 (93742.000.000) 02/09 19:08:11 Image size of job updated: 67160*
> *condor/EventLog:006 (93742.000.000) 02/09 19:13:12 Image size of job updated: 110340*
> *condor/EventLog:006 (93742.000.000) 02/09 19:18:13 Image size of job updated: 1410420*
> *condor/EventLog:006 (93742.000.000) 02/09 19:23:13 Image size of job updated: 1887892*
> *condor/EventLog:006 (93742.000.000) 02/09 19:33:15 Image size of job updated: 1887892*
> *condor/EventLog:005 (93742.000.000) 02/10 23:38:21 Job terminated.*
>
>
> *condor/ShadowLog.old:02/10/21 11:43:04 (93742.0) (3863434): Time to redelegate short-lived proxy to starter.*
> *condor/ShadowLog.old:02/10/21 23:38:21 (93742.0) (3863434): File transfer completed successfully.*
> *condor/ShadowLog.old:02/10/21 23:38:21 (93742.0) (3863434): Job 93742.0 terminated: exited with status 0*
> *condor/ShadowLog.old:02/10/21 23:38:21 (93742.0) (3863434): WriteUserLog checking for event log rotation, but no lock*
> *condor/ShadowLog.old:02/10/21 23:38:21 (93742.0) (3863434): **** condor_shadow (condor_SHADOW) pid 3863434 EXITING 
> WITH STATUS 115*
>
>
> *[root@gridarcce01 log]# grep -RH '93742' condor/SchedLog | more*
> *condor/SchedLog:02/10/21 23:38:21 (pid:3115849) Shadow pid 3863434 for job 93742.0 exited with status 115*
> *condor/SchedLog:02/10/21 23:38:21 (pid:3115849) Match record (slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 
> <193.55.252.169:9618?addrs=193.55.252.169-9618&noUDP&sock=2279_c86d_3> for group_ATLAS.atlasprd_score.atlasprd, 937*
> *42.0) deleted*
>
>
> Any ideas are welcome.
>
> Thanks
> Jean-Claude
>
> ------------------------------------------------------------------------
> Jean-Claude Chevaleyre < Jean-Claude.Chevaleyre(at)clermont.in2p3.fr >
> Laboratoire de Physique Clermont
> Campus Universitaire des Cézeaux
> 4 Avenue Blaise Pascal
> TSA 60026
> CS 60026
> 63178 Aubière Cedex
>
> Tel : 04 73 40 73 60
>
> -------------------------------------------------------------------------
>