
Re: [HTCondor-users] [ExternalEmail] Grid Universe with Condor as Grid Resource - Bug with run times?



Hello... anyone? Or should I send this to htcondor-admin?

Cheers

Greg

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Greg.Hitchen@xxxxxxxx
Sent: Thursday, 14 August 2014 3:27 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [ExternalEmail] [HTCondor-users] Grid Universe with Condor as Grid Resource - Bug with run times?

Hi

I've been experimenting with running Condor jobs on a "remote" schedd using the grid universe, with a submit file like:

universe = grid
grid_resource = condor another-schedd.csiro.au central-manager.csiro.au
etc.
etc.
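
In case the detail matters, the full submit file is essentially the following (a sketch: the executable and arguments match the test job in the output below, but the transfer settings and log/output file names here are illustrative placeholders, not my real ones):

# grid universe, pointing at the remote schedd and its pool's central manager
universe        = grid
grid_resource   = condor another-schedd.csiro.au central-manager.csiro.au

# cpubound3 just chews up CPU for the given number of minutes
executable      = cpubound3.exe
arguments       = 5

# placeholder file names
log             = cpubound3.$(Process).log
output          = cpubound3.$(Process).out
error           = cpubound3.$(Process).err

should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

queue 10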

This is all working OK, except for the job run times. Watching with condor_q on both the originating submit node
and the "remote" schedd resource, the times match reasonably well while the jobs run. However, as soon as the
remote schedd shows a job as C (complete), the run time on the originating submit schedd suddenly jumps to twice
(2x) the actual value. The same doubled values appear in the history on each schedd (viewed with condor_history).
Example below; the test executable just chews up CPU for 5 minutes. Note that the originating submit node shows
run times greater than the difference between the SUBMITTED and COMPLETED times, which should be impossible.
The job log files on the originating submit node show the correct run times.

Originating submit node 
 
 ID      OWNER            SUBMITTED     RUN_TIME ST   COMPLETED CMD
 157.9   hit023          8/13 16:49   0+00:10:23 C   8/13 16:55 C:\Data\condor_
 157.5   hit023          8/13 16:49   0+00:10:23 C   8/13 16:55 C:\Data\condor_
 157.1   hit023          8/13 16:49   0+00:10:24 C   8/13 16:55 C:\Data\condor_
 157.8   hit023          8/13 16:49   0+00:10:23 C   8/13 16:55 C:\Data\condor_
 157.4   hit023          8/13 16:49   0+00:10:25 C   8/13 16:55 C:\Data\condor_
 157.0   hit023          8/13 16:49   0+00:10:24 C   8/13 16:55 C:\Data\condor_
 157.7   hit023          8/13 16:49   0+00:10:23 C   8/13 16:55 C:\Data\condor_
 157.3   hit023          8/13 16:49   0+00:10:25 C   8/13 16:55 C:\Data\condor_
 157.6   hit023          8/13 16:49   0+00:10:18 C   8/13 16:55 C:\Data\condor_
 157.2   hit023          8/13 16:49   0+00:10:19 C   8/13 16:55 C:\Data\condor_

Remote grid node 
 
 ID      OWNER            SUBMITTED     RUN_TIME ST   COMPLETED CMD
   1.0   hit023          8/13 16:49   0+00:05:02 C   8/13 16:55 cpubound3.exe 5
   2.0   hit023          8/13 16:49   0+00:05:02 C   8/13 16:55 cpubound3.exe 5
   3.0   hit023          8/13 16:49   0+00:05:01 C   8/13 16:55 cpubound3.exe 5
   4.0   hit023          8/13 16:49   0+00:05:02 C   8/13 16:55 cpubound3.exe 5
   5.0   hit023          8/13 16:49   0+00:05:03 C   8/13 16:55 cpubound3.exe 5
   6.0   hit023          8/13 16:49   0+00:05:01 C   8/13 16:55 cpubound3.exe 5
   7.0   hit023          8/13 16:49   0+00:05:01 C   8/13 16:55 cpubound3.exe 5
   8.0   hit023          8/13 16:49   0+00:05:01 C   8/13 16:55 cpubound3.exe 5
   9.0   hit023          8/13 16:49   0+00:05:01 C   8/13 16:55 cpubound3.exe 5
  10.0   hit023          8/13 16:49   0+00:05:01 C   8/13 16:55 cpubound3.exe 5
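
If it helps with diagnosis, the raw attributes behind the RUN_TIME column can be pulled straight out of the history
on each schedd. A sketch, assuming -af (autoformat) is available in these versions; the attributes are the standard
job-ad ones, and which remote job matches which local one is my guess from the ordering:

# on the originating submit node (one of the doubled jobs)
condor_history 157.9 -af RemoteWallClockTime NumJobStarts QDate CompletionDate

# on the remote schedd (the matching job)
condor_history 1.0 -af RemoteWallClockTime NumJobStarts QDate CompletionDate

As far as I know, RUN_TIME is derived from RemoteWallClockTime, so if the originating schedd's value really is
doubled, it should show up directly in that attribute.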

I have run this using several Condor versions and OSes (originating submit node to remote grid node) and always get the same result:

Win7 8.0.6 to Linux 7.6.7
Win7 8.0.6 to Linux 8.0.6
Win7 8.0.6 to Win2008 8.0.6
Win7 8.2.1 to Linux 8.2.1
Linux 8.2.1 to Linux 7.6.7
Linux 8.2.1 to Win2008 8.0.6


Thanks for any info/help.

Cheers

Greg

P.S. One test run (with 10-minute jobs) had a job that was evicted after ~5 minutes and restarted, so that when it
finished it showed a run time of ~15 minutes on the grid resource (27.0 below) but ~30 minutes on the originating
schedd (161.3). Output from the originating submit node first, then the remote grid node:

 ID      OWNER            SUBMITTED     RUN_TIME ST   COMPLETED CMD
 161.0   hit023          8/14 13:30   0+00:21:00 C   8/14 13:49 C:\Data\condor_
 161.1   hit023          8/14 13:30   0+00:19:43 C   8/14 13:47 C:\Data\condor_
 161.2   hit023          8/14 13:30   0+00:20:31 C   8/14 13:42 C:\Data\condor_
 161.3   hit023          8/14 13:30   0+00:32:30 C   8/14 13:49 C:\Data\condor_
 161.4   hit023          8/14 13:30   0+00:20:28 C   8/14 13:41 C:\Data\condor_
 161.5   hit023          8/14 13:30   0+00:20:26 C   8/14 13:41 C:\Data\condor_
 161.6   hit023          8/14 13:30   0+00:20:26 C   8/14 13:41 C:\Data\condor_
 161.7   hit023          8/14 13:30   0+00:20:27 C   8/14 13:41 C:\Data\condor_
 161.8   hit023          8/14 13:30   0+00:20:26 C   8/14 13:41 C:\Data\condor_
 161.9   hit023          8/14 13:30   0+00:20:26 C   8/14 13:41 C:\Data\condor_

 ID      OWNER            SUBMITTED     RUN_TIME ST   COMPLETED CMD
  24.0   hit023          8/14 13:30   0+00:10:22 C   8/14 13:49 cpubound3.exe 1
  25.0   hit023          8/14 13:30   0+00:10:14 C   8/14 13:47 cpubound3.exe 1
  26.0   hit023          8/14 13:30   0+00:10:08 C   8/14 13:42 cpubound3.exe 1
  27.0   hit023          8/14 13:30   0+00:15:42 C   8/14 13:49 cpubound3.exe 1
  28.0   hit023          8/14 13:30   0+00:10:03 C   8/14 13:41 cpubound3.exe 1
  29.0   hit023          8/14 13:30   0+00:10:01 C   8/14 13:41 cpubound3.exe 1
  30.0   hit023          8/14 13:30   0+00:10:01 C   8/14 13:41 cpubound3.exe 1
  31.0   hit023          8/14 13:30   0+00:10:01 C   8/14 13:41 cpubound3.exe 1
  32.0   hit023          8/14 13:30   0+00:10:01 C   8/14 13:41 cpubound3.exe 1
  33.0   hit023          8/14 13:30   0+00:10:01 C   8/14 13:41 cpubound3.exe 1
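
Note that 161.3's 32:30 is roughly twice the 15:42 recorded on the grid resource, so the whole accumulated wall
clock appears to be doubled, not just the final execution. To compare the full ads for that pair (a sketch; the
grep is just to cut the output down, so run it on a Linux node):

# originating submit node: the doubled entry
condor_history -l 161.3 | grep -iE 'RemoteWallClockTime|NumJobStarts'

# remote grid node: the matching job
condor_history -l 27.0 | grep -iE 'RemoteWallClockTime|NumJobStarts'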


