
[Condor-users] CondorLoadAvg and memory information on RHEL3 versus FC4



We have a mix of RHEL 3 and FC4 machines in our pool
and are experiencing problems with jobs suspending and
then being evicted on the Fedora Core 4 machines, which
run the latest stable release, 6.6.11. The CondorLoadAvg
and memory information do not appear to be accurate:
 
VirtualMemory = 1073741824
Memory = 3
TotalVirtualMemory = 2147483647
...
CondorLoadAvg = 0.000000
 
The memory-related data may be due to differences in
the meminfo format between the platforms. We've tested
this with the development release 6.7.18 for FC4, and
there the meminfo data accurately represents what is on
the machines:
 
VirtualMemory = 3144718
Memory = 1009
TotalVirtualMemory = 6289436
TotalMemory = 2019
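
For anyone wanting to cross-check these figures by hand, here is a
minimal sketch that reads "Key: value kB" lines of the kind found in
/proc/meminfo and converts MemTotal to megabytes. The helper names
(parse_meminfo, total_memory_mb) and the sample text are hypothetical,
and this is only an illustration, not how Condor itself reads the file.

```python
def parse_meminfo(text):
    """Parse 'Key:   value kB' lines into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only simple 'Key: number [unit]' lines; skip headers and
        # multi-column summary rows that some kernels also print.
        if not sep or not fields or len(fields) > 2:
            continue
        if not fields[0].isdigit():
            continue
        values[key.strip()] = int(fields[0])
    return values


def total_memory_mb(values):
    """Return MemTotal in whole megabytes (value is assumed to be in kB)."""
    return values["MemTotal"] // 1024


# Illustrative sample only; on a real machine read open("/proc/meminfo").
sample = "MemTotal:      2067456 kB\nSwapTotal:     4192956 kB\n"
print(total_memory_mb(parse_meminfo(sample)))
```

If the value printed on a given machine disagrees with the TotalMemory
attribute in the machine ad, that would point at the parsing rather
than at the hardware.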

The CondorLoadAvg appears to work intermittently on
some of the FC4 machines. Here is the load from our
jobs:

vm1@hoeplx144 LINUX       INTEL  Claimed    Busy       1.000  1009  0+00:14:33
vm2@hoeplx144 LINUX       INTEL  Claimed    Busy       1.020  1009  0+00:14:43
vm1@hoeplx144 LINUX       INTEL  Claimed    Busy       1.000  1009  0+00:14:16
vm2@hoeplx144 LINUX       INTEL  Claimed    Busy       1.200  1009  0+00:14:05
vm1@hoeplx144 LINUX       INTEL  Claimed    Busy       1.000  1009  0+00:14:21
vm2@hoeplx144 LINUX       INTEL  Claimed    Busy       1.400  1009  0+00:14:09
vm1@hoeplx144 LINUX       INTEL  Claimed    Busy       1.050  1009  0+00:10:22
vm2@hoeplx144 LINUX       INTEL  Claimed    Busy       1.030  1009  0+00:14:32
vm1@hoeplx145 LINUX       INTEL  Claimed    Busy       1.150  1009  0+00:14:32
vm2@hoeplx145 LINUX       INTEL  Claimed    Busy       1.140  1009  0+00:14:18
vm1@hoeplx145 LINUX       INTEL  Claimed    Busy       1.050  1009  0+00:10:29
vm2@hoeplx145 LINUX       INTEL  Claimed    Busy       1.050  1009  0+00:14:14
vm1@hoeplx145 LINUX       INTEL  Claimed    Busy       1.000  1009  0+00:14:34
vm2@hoeplx145 LINUX       INTEL  Claimed    Busy       1.630  1009  0+00:14:29
vm1@hoeplx145 LINUX       INTEL  Claimed    Busy       1.000  1009  0+00:14:23
vm2@hoeplx145 LINUX       INTEL  Claimed    Busy       1.540  1009  0+00:14:07

Here are the CondorLoadAvg and TotalCondorLoadAvg values
for the same timeframe. These are dedicated machines
that run only our jobs.

TotalCondorLoadAvg = 0.010000
CondorLoadAvg = 0.000000
TotalCondorLoadAvg = 0.010000
CondorLoadAvg = 0.000000
TotalCondorLoadAvg = 0.010000
CondorLoadAvg = 0.000000
TotalCondorLoadAvg = 0.010000
CondorLoadAvg = 0.000000
TotalCondorLoadAvg = 0.010000
CondorLoadAvg = 0.000000
TotalCondorLoadAvg = 0.010000
CondorLoadAvg = 1.030000
TotalCondorLoadAvg = 2.070000
CondorLoadAvg = 1.030000
TotalCondorLoadAvg = 2.070000
CondorLoadAvg = 1.140000
TotalCondorLoadAvg = 2.280000
CondorLoadAvg = 1.140000
TotalCondorLoadAvg = 2.280000
CondorLoadAvg = 1.040000
TotalCondorLoadAvg = 2.090000
CondorLoadAvg = 1.050000
TotalCondorLoadAvg = 2.090000
CondorLoadAvg = 0.000000
TotalCondorLoadAvg = 0.010000
CondorLoadAvg = 0.000000
TotalCondorLoadAvg = 0.010000
CondorLoadAvg = 0.000000
TotalCondorLoadAvg = 0.010000
CondorLoadAvg = 0.000000
TotalCondorLoadAvg = 0.010000
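
To put a number on how often the attribute reads zero, here is a
minimal sketch that pulls every sample of one attribute out of
"Attr = value" text such as the listing above (for example, saved
output of condor_status -l) and reports the fraction that is exactly
zero. The function names are hypothetical and the script is only an
illustration of the check.

```python
import re

# Matches lines such as "CondorLoadAvg = 1.030000".
ATTR_LINE = re.compile(r"^\s*(\w+)\s*=\s*([-+]?\d+(?:\.\d+)?)\s*$")


def attribute_samples(text, name):
    """Return every numeric value recorded for attribute `name`, in order."""
    samples = []
    for line in text.splitlines():
        match = ATTR_LINE.match(line)
        if match and match.group(1) == name:
            samples.append(float(match.group(2)))
    return samples


def fraction_zero(samples):
    """Return the share of samples equal to 0.0 (0.0 if there are none)."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if s == 0.0) / len(samples)


# Four lines copied from the listing above, used as a small example.
listing = """TotalCondorLoadAvg = 0.010000
CondorLoadAvg = 0.000000
TotalCondorLoadAvg = 2.070000
CondorLoadAvg = 1.030000
"""
loads = attribute_samples(listing, "CondorLoadAvg")
print(loads, fraction_zero(loads))
```

Run against the full listing, this separates the samples where the
load is attributed to Condor from those where it is reported as zero
while the slots are Claimed/Busy.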

Any suggestions on how to resolve the inconsistency we
are seeing with CondorLoadAvg? Or is there something
else we should investigate that could be causing our
jobs to suspend and then be evicted?

 Thanks, Jeff
 

