
Re: [HTCondor-users] [ExternalEmail] Grid Universe with Condor as Grid Resource - Bug with run times?



I revisited this problem recently and it still exists with Condor version 8.2.6.

The original submit node runs a condor_schedd on Win7 (32-bit); the remote grid resource runs a condor_schedd on Win2008 (64-bit).

condor_history on the original submit node shows roughly DOUBLE the actual job run time:

ID     OWNER          SUBMITTED   RUN_TIME     ST COMPLETED   CMD
5692.0   hit023          2/10 09:57   0+00:17:47 C   2/10 10:06 C:\Data\condor_
5692.1   hit023          2/10 09:57   0+00:19:38 C   2/10 10:07 C:\Data\condor_
5692.2   hit023          2/10 09:57   0+00:17:44 C   2/10 10:07 C:\Data\condor_
5692.3   hit023          2/10 09:57   0+00:17:23 C   2/10 10:07 C:\Data\condor_
5692.4   hit023          2/10 09:57   0+00:16:29 C   2/10 10:06 C:\Data\condor_
5692.5   hit023          2/10 09:57   0+00:21:50 C   2/10 10:08 C:\Data\condor_
5692.6   hit023          2/10 09:57   0+00:21:50 C   2/10 10:08 C:\Data\condor_
5692.7   hit023          2/10 09:57   0+00:19:17 C   2/10 10:07 C:\Data\condor_
5692.8   hit023          2/10 09:57   0+00:20:36 C   2/10 10:08 C:\Data\condor_
5692.9   hit023          2/10 09:57   0+00:19:04 C   2/10 10:07 C:\Data\condor_

condor_history on the remote schedd (the grid resource) shows the CORRECT job run time:

ID     OWNER          SUBMITTED   RUN_TIME     ST COMPLETED   CMD
594117.0   hit023          2/10 12:57   0+00:08:27 C   2/10 13:06 cpubound_mult
594118.0   hit023          2/10 12:57   0+00:09:16 C   2/10 13:07 cpubound_mult
594119.0   hit023          2/10 12:57   0+00:08:55 C   2/10 13:07 cpubound_mult
803967.0   hit023          2/10 12:57   0+00:08:54 C   2/10 13:07 cpubound_mult
594120.0   hit023          2/10 12:57   0+00:08:11 C   2/10 13:06 cpubound_mult
594121.0   hit023          2/10 12:57   0+00:10:27 C   2/10 13:08 cpubound_mult
594122.0   hit023          2/10 12:57   0+00:10:27 C   2/10 13:08 cpubound_mult
803968.0   hit023          2/10 12:57   0+00:09:46 C   2/10 13:07 cpubound_mult
594123.0   hit023          2/10 12:57   0+00:10:14 C   2/10 13:08 cpubound_mult
803969.0   hit023          2/10 12:57   0+00:09:33 C   2/10 13:07 cpubound_mult
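
As a quick check on the "roughly double" figure, the RUN_TIME columns of the two listings above can be totalled and compared. This is only a minimal sketch: the helper name run_time_seconds is made up for illustration, and the values are copied by hand from the two listings rather than read from condor_history. Only 5692.0 and 594117.0 are tied together explicitly (by the GridJobId in the job log below), so the sketch compares that one pair plus the two column totals, which does not depend on how the remaining rows pair up.

```python
def run_time_seconds(text):
    """Convert a condor_history RUN_TIME such as '0+00:17:47' to seconds."""
    days, clock = text.split("+")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((int(days) * 24 + hours) * 60 + minutes) * 60 + seconds

# RUN_TIME values copied from the two listings above: cluster 5692 on the
# original submit node, then the jobs listed on the remote schedd.
original = ["0+00:17:47", "0+00:19:38", "0+00:17:44", "0+00:17:23", "0+00:16:29",
            "0+00:21:50", "0+00:21:50", "0+00:19:17", "0+00:20:36", "0+00:19:04"]
remote = ["0+00:08:27", "0+00:09:16", "0+00:08:55", "0+00:08:54", "0+00:08:11",
          "0+00:10:27", "0+00:10:27", "0+00:09:46", "0+00:10:14", "0+00:09:33"]

original_total = sum(run_time_seconds(t) for t in original)
remote_total = sum(run_time_seconds(t) for t in remote)

# 5692.0 corresponds to 594117.0 according to the GridJobId in the job log.
pair_ratio = run_time_seconds(original[0]) / run_time_seconds(remote[0])
total_ratio = original_total / remote_total

print(f"5692.0 vs 594117.0: {pair_ratio:.2f}x")
print(f"all ten jobs:       {total_ratio:.2f}x")
```

Both ratios come out slightly above 2, consistent with the doubling described above.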

Job log file for 5692.0 on original submit node:

000 (5692.000.000) 02/10 09:57:33 Job submitted from host: <***.***.***.*:14815>
...
027 (5692.000.000) 02/10 09:57:54 Job submitted to grid resource
    GridResource: condor win2008-gjh2-vc.*****.*****.** condor-***.*****.**
    GridJobId: condor win2008-gjh2-vc.*****.*****.** condor-***.*****.** 594117.0
...
001 (5692.000.000) 02/10 09:58:15 Job executing on host: condor win2008-gjh2-vc.*****.*****.** condor-***.*****.**
...
005 (5692.000.000) 02/10 10:07:46 Job terminated.
	(1) Normal termination (return value 0)
		Usr 0 00:02:33, Sys 0 00:02:16  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:02:33, Sys 0 00:02:16  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	0  -  Run Bytes Sent By Job
	0  -  Run Bytes Received By Job
	0  -  Total Bytes Sent By Job
	0  -  Total Bytes Received By Job
...
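
The event timestamps in this log give an independent wall-clock figure for 5692.0. Below is a minimal sketch that uses only the 001 (executing) and 005 (terminated) times shown above; the helper name clock_seconds_between is made up for illustration, and it assumes both events fall on the same day, as they do here.

```python
from datetime import datetime

def clock_seconds_between(start, end):
    """Seconds between two same-day event log stamps like '02/10 09:58:15'."""
    start_date, start_clock = start.split()
    end_date, end_clock = end.split()
    if start_date != end_date:
        raise ValueError("this sketch only handles events on the same day")
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end_clock, fmt) - datetime.strptime(start_clock, fmt)
    return int(delta.total_seconds())

# Timestamps copied from the 001 and 005 events for job 5692.0 above.
log_seconds = clock_seconds_between("02/10 09:58:15", "02/10 10:07:46")

# RUN_TIME reported by condor_history for the same job on each schedd.
original_history_seconds = 17 * 60 + 47   # 0+00:17:47 on the original submit node
remote_history_seconds = 8 * 60 + 27      # 0+00:08:27 on the remote schedd

print(f"job log, executing to terminated: {log_seconds} s")
print(f"condor_history, original node:    {original_history_seconds} s")
print(f"condor_history, remote schedd:    {remote_history_seconds} s")
```

The log interval (9 min 31 s) is within about a minute of the remote RUN_TIME and nowhere near the 0+00:17:47 that condor_history reports on the original submit node.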

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Greg.Hitchen@xxxxxxxx
Sent: Thursday, 14 August 2014 3:27 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [ExternalEmail] [HTCondor-users] Grid Universe with Condor as Grid Resource - Bug with run times?

Hi

I've been playing with running condor jobs on a "remote" schedd using the grid universe, with a submit file like:

universe = grid
grid_resource = condor another-schedd.csiro.au central-manager.csiro.au
etc.
etc.

This is all working OK, except for the job run times. While the jobs are running, condor_q on the originating submit node
and on the "remote" schedd resource show run times that match reasonably well. However, as soon as the remote schedd shows
the jobs as C (completed), the run time on the originating submit schedd suddenly jumps to twice (2X) that value. This is also
reflected in the history on each schedd (viewed with condor_history). Example below; the test executable just chews up CPU for 5 mins.
Note that the originating submit node shows run times greater than the difference between the SUBMITTED and COMPLETED times.
The job log files on the originating submit node show the correct run times.

Originating submit node 
 
 ID      OWNER            SUBMITTED     RUN_TIME ST   COMPLETED CMD
 157.9   hit023          8/13 16:49   0+00:10:23 C   8/13 16:55 C:\Data\condor_
 157.5   hit023          8/13 16:49   0+00:10:23 C   8/13 16:55 C:\Data\condor_
 157.1   hit023          8/13 16:49   0+00:10:24 C   8/13 16:55 C:\Data\condor_
 157.8   hit023          8/13 16:49   0+00:10:23 C   8/13 16:55 C:\Data\condor_
 157.4   hit023          8/13 16:49   0+00:10:25 C   8/13 16:55 C:\Data\condor_
 157.0   hit023          8/13 16:49   0+00:10:24 C   8/13 16:55 C:\Data\condor_
 157.7   hit023          8/13 16:49   0+00:10:23 C   8/13 16:55 C:\Data\condor_
 157.3   hit023          8/13 16:49   0+00:10:25 C   8/13 16:55 C:\Data\condor_
 157.6   hit023          8/13 16:49   0+00:10:18 C   8/13 16:55 C:\Data\condor_
 157.2   hit023          8/13 16:49   0+00:10:19 C   8/13 16:55 C:\Data\condor_

Remote grid node 
 
 ID      OWNER            SUBMITTED     RUN_TIME ST   COMPLETED CMD
   1.0   hit023          8/13 16:49   0+00:05:02 C   8/13 16:55 cpubound3.exe 5
   2.0   hit023          8/13 16:49   0+00:05:02 C   8/13 16:55 cpubound3.exe 5
   3.0   hit023          8/13 16:49   0+00:05:01 C   8/13 16:55 cpubound3.exe 5
   4.0   hit023          8/13 16:49   0+00:05:02 C   8/13 16:55 cpubound3.exe 5
   5.0   hit023          8/13 16:49   0+00:05:03 C   8/13 16:55 cpubound3.exe 5
   6.0   hit023          8/13 16:49   0+00:05:01 C   8/13 16:55 cpubound3.exe 5
   7.0   hit023          8/13 16:49   0+00:05:01 C   8/13 16:55 cpubound3.exe 5
   8.0   hit023          8/13 16:49   0+00:05:01 C   8/13 16:55 cpubound3.exe 5
   9.0   hit023          8/13 16:49   0+00:05:01 C   8/13 16:55 cpubound3.exe 5
  10.0   hit023          8/13 16:49   0+00:05:01 C   8/13 16:55 cpubound3.exe 5
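
The observation that RUN_TIME exceeds the SUBMITTED-to-COMPLETED window can be checked mechanically. Below is a minimal sketch with a few rows copied from the two 8/13 listings above; the helper names are made up for illustration. Because condor_history prints these timestamps only to the minute, the window is padded by one minute to allow for the missing seconds.

```python
def run_time_seconds(text):
    """Convert a condor_history RUN_TIME such as '0+00:10:23' to seconds."""
    days, clock = text.split("+")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((int(days) * 24 + hours) * 60 + minutes) * 60 + seconds

def window_seconds(submitted, completed):
    """Same-day window between two 'HH:MM' clock times, padded by 60 s."""
    def minutes_of(clock):
        hours, minutes = (int(part) for part in clock.split(":"))
        return hours * 60 + minutes
    return (minutes_of(completed) - minutes_of(submitted)) * 60 + 60

# (job, SUBMITTED clock, COMPLETED clock, RUN_TIME) from the 8/13 listings.
rows = [
    ("157.9 (originating)", "16:49", "16:55", "0+00:10:23"),
    ("157.6 (originating)", "16:49", "16:55", "0+00:10:18"),
    ("1.0 (remote)", "16:49", "16:55", "0+00:05:02"),
    ("10.0 (remote)", "16:49", "16:55", "0+00:05:01"),
]

# A job cannot have run for longer than the time between submit and completion.
impossible = [job for job, submitted, completed, run_time in rows
              if run_time_seconds(run_time) > window_seconds(submitted, completed)]

print("run time longer than the submit-to-complete window:", impossible)
```

Only the rows from the originating submit node are flagged; the remote rows fit inside the window.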

I have run this with several Condor versions and several OSes, and always get the same result:

Win7 8.0.6 to Linux 7.6.7
Win7 8.0.6 to Linux 8.0.6
Win7 8.0.6 to Win2008 8.0.6
Win7 8.2.1 to Linux 8.2.1
Linux 8.2.1 to Linux 7.6.7
Linux 8.2.1 to Win2008 8.0.6


Thanks for any info/help.

Cheers

Greg

P.S. In one test run (10-minute jobs), one job was evicted after ~5 mins and restarted,
so that when it finished it showed a run time of ~15 mins on the grid resource
but ~30 mins on the originating schedd. See below (originating schedd first, then the
grid resource; compare 161.3 with 27.0):

 161.0   hit023          8/14 13:30   0+00:21:00 C   8/14 13:49 C:\Data\condor_
 161.1   hit023          8/14 13:30   0+00:19:43 C   8/14 13:47 C:\Data\condor_
 161.2   hit023          8/14 13:30   0+00:20:31 C   8/14 13:42 C:\Data\condor_
 161.3   hit023          8/14 13:30   0+00:32:30 C   8/14 13:49 C:\Data\condor_
 161.4   hit023          8/14 13:30   0+00:20:28 C   8/14 13:41 C:\Data\condor_
 161.5   hit023          8/14 13:30   0+00:20:26 C   8/14 13:41 C:\Data\condor_
 161.6   hit023          8/14 13:30   0+00:20:26 C   8/14 13:41 C:\Data\condor_
 161.7   hit023          8/14 13:30   0+00:20:27 C   8/14 13:41 C:\Data\condor_
 161.8   hit023          8/14 13:30   0+00:20:26 C   8/14 13:41 C:\Data\condor_
 161.9   hit023          8/14 13:30   0+00:20:26 C   8/14 13:41 C:\Data\condor_

  24.0   hit023          8/14 13:30   0+00:10:22 C   8/14 13:49 cpubound3.exe 1
  25.0   hit023          8/14 13:30   0+00:10:14 C   8/14 13:47 cpubound3.exe 1
  26.0   hit023          8/14 13:30   0+00:10:08 C   8/14 13:42 cpubound3.exe 1
  27.0   hit023          8/14 13:30   0+00:15:42 C   8/14 13:49 cpubound3.exe 1
  28.0   hit023          8/14 13:30   0+00:10:03 C   8/14 13:41 cpubound3.exe 1
  29.0   hit023          8/14 13:30   0+00:10:01 C   8/14 13:41 cpubound3.exe 1
  30.0   hit023          8/14 13:30   0+00:10:01 C   8/14 13:41 cpubound3.exe 1
  31.0   hit023          8/14 13:30   0+00:10:01 C   8/14 13:41 cpubound3.exe 1
  32.0   hit023          8/14 13:30   0+00:10:01 C   8/14 13:41 cpubound3.exe 1
  33.0   hit023          8/14 13:30   0+00:10:01 C   8/14 13:41 cpubound3.exe 1


