
Re: [Condor-users] Logging what compute node a job executed/failed on



Cheers Dan,

Shaun


-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Dan Bradley
Sent: 26 October 2006 16:08
To: Condor-Users Mail List
Subject: Re: [Condor-users] Logging what compute node a job executed/failed on


The schedd history file (in your SPOOL directory) contains a record of 
completed jobs, including LastRemoteHost. You can either scan through 
this file with your own script, or you can run queries with 
condor_history. Example:

condor_history -format "%s" ClusterId -format ".%s" ProcId -format " %s\n" LastRemoteHost
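
For the first option, scanning the history file with your own script, something like the following might serve as a starting point. It is only a sketch under assumptions: it assumes each job record is a run of "Name = Value" lines and that records are separated by a line beginning with "***". Check a few records in your own history file before relying on it, since the layout may differ between Condor versions. The function names iter_job_ads and last_hosts are made up for this example.

```python
def iter_job_ads(lines):
    """Yield one dict of raw attribute strings per job record.

    ASSUMPTION: records are "Name = Value" lines, and a line starting
    with "***" closes the current record. Verify against your own file.
    """
    ad = {}
    for line in lines:
        line = line.strip()
        if line.startswith("***"):
            # Assumed record separator: emit what we have collected.
            if ad:
                yield ad
            ad = {}
        elif " = " in line:
            name, value = line.split(" = ", 1)
            # String values are quoted in the file; drop the quotes.
            ad[name] = value.strip('"')
    if ad:
        # Last record, in case the file does not end with a separator.
        yield ad


def last_hosts(lines):
    """Map "cluster.proc" to LastRemoteHost (None if the job has none)."""
    result = {}
    for ad in iter_job_ads(lines):
        if "ClusterId" in ad and "ProcId" in ad:
            job = ad["ClusterId"] + "." + ad["ProcId"]
            result[job] = ad.get("LastRemoteHost")
    return result
```

To use it, open the history file from your SPOOL directory and pass the file object straight in, e.g. last_hosts(open(path)), where path is wherever your configuration puts the file.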

If you do use condor_history, be aware that it is much more efficient to
run one big bulk query than to run condor_history individually for a
long list of jobs. Also be aware that the history file may be
periodically rotated, depending on your configuration.
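
To illustrate the bulk-query point, here is a rough sketch that runs the condor_history command shown above exactly once and then answers lookups for any number of jobs from the parsed output. The helper names (build_query, parse_hosts, hosts_for, run_query) are invented for this example, and it assumes condor_history is on your PATH and prints one "cluster.proc host" line per job, as the format strings above request.

```python
import subprocess


def build_query():
    """Argument list for a single bulk query over the history."""
    return [
        "condor_history",
        "-format", "%s", "ClusterId",
        "-format", ".%s", "ProcId",
        "-format", " %s\n", "LastRemoteHost",
    ]


def parse_hosts(output):
    """Map "cluster.proc" to host from lines like "21.0 somehost"."""
    hosts = {}
    for line in output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            hosts[parts[0]] = parts[1].strip()
    return hosts


def hosts_for(job_ids, output):
    """Look up many jobs in one parsed result; None if not found."""
    hosts = parse_hosts(output)
    return {job: hosts.get(job) for job in job_ids}


def run_query():
    """Run condor_history once and return its text output."""
    done = subprocess.run(build_query(), capture_output=True,
                          text=True, check=True)
    return done.stdout
```

A typical use would be hosts_for(failed_jobs, run_query()), where failed_jobs is your own list of "cluster.proc" ids gathered from the return codes. Jobs that have already moved into a rotated history file will come back as None from this query.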

--Dan

Shaun J. O'Callaghan wrote:
>
> Is there a way to get a little more information about condor jobs and 
> where they run, exactly what happened other than having separate log 
> files for each job e.g.
>
> Log = log_$(PROCESS).log
>
> In the submit file?
>
> There's an issue when we're submitting 1000+ jobs and we need to know 
> which ones failed, and where they executed. We can of course get the 
> failures via the return codes and error output but it would be helpful
> to know exactly where this job executed. All we have at the minute is
>
> 001 (021.000.000) 09/29 09:58:54 Job executing on host: <xxx.xxx.xxx.xxx:1104>
>
> And while this is useful, it would be helpful to have the execute node
> actually in the following:
>
> 005 (021.000.000) 09/29 09:58:55 Job terminated.
> (0) Abnormal termination (signal 53)
> (0) No core file
> Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
> 0 - Run Bytes Sent By Job
> 384684 - Run Bytes Received By Job
> 0 - Total Bytes Sent By Job
> 384684 - Total Bytes Received By Job
> .
>
> Rather than just the job id. E.g. what about:
>
> 005 (021.000.000) 09/29 09:58:55 Job terminated (after executing on node xxx.xxx.xxx.xxx)
>
> This probably seems trivial, but if anyone can suggest other methods 
> I'd be more than happy to hear them.
>
> Kind Regards,
>
> Shaun
>
>
------------------------------------------------------------------------
>
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at either
> https://lists.cs.wisc.edu/archive/condor-users/
> http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR