
Re: [Condor-users] exec failure



I've seen this failure mentioned before, but 
haven't been able to resolve it.

I have two RHE4 machines with a shared file system
containing the condor binaries and libraries.

Each machine has its own condor user, group, and
home directory.

Install and startup are fine, and condor_status behaves as
expected on each machine. I'm able to submit standalone jobs
on each of the machines, but when I submit a job from one
machine to the other, it fails with the following entry in
the start log:

EXEC of user process failed, probably insufficient swap

RESERVED_SWAP is set to 0 in all config files. Machine 1
has 512M of swap; machine 2 has over 3G.
------------------------------------------------------------------
[condor@geronimo log]$ free
             total       used       free     shared    buffers     cached
Mem:        256060     227312      28748          0      41984     138356
-/+ buffers/cache:      46972     209088
Swap:       514040        144     513896
----------------------------------------------------------------
[condor@chinle test]$ free
             total       used       free     shared    buffers     cached
Mem:       1555884     875876     680008          0      71016     502888
-/+ buffers/cache:     301972    1253912
Swap:      3068372          0    3068372
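
As a quick cross-check of the listings above, here is a minimal sketch (not part of Condor) that pulls the free-swap figure out of `free` output. The function name `swap_free_kib` is made up for illustration, and it assumes `free` is reporting in its default unit of kilobytes.

```python
def swap_free_kib(free_output: str) -> int:
    """Return the 'free' column of the Swap: row from `free` output."""
    for line in free_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "Swap:":
            # Columns after the label are: total, used, free.
            return int(fields[3])
    raise ValueError("no Swap: row found")


# Swap rows copied from the two listings above.
geronimo = "Swap:       514040        144     513896\n"
chinle = "Swap:      3068372          0    3068372\n"

for host, text in (("geronimo", geronimo), ("chinle", chinle)):
    print(f"{host}: {swap_free_kib(text)} KiB swap free")
```

Both machines show nearly all of their swap as free, so on these numbers the "insufficient swap" message does not look like a literal shortage.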

I'm trying to submit the following job from machine 2 to
machine 1 (built with condor_compile gcc -o tester test.o).  I've
also tried submitting vanilla jobs on the NFS mount, with the same
result.  The job runs fine standalone on each of the machines.

Executable      = tester
Universe        = standard
Log             = tester.log
output          = tester.out
error           = tester.error
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Requirements = machine == "geronimo.localdomain"

machine 1 "central manager"
------------------------------
MyType = "Scheduler"
TargetType = ""
CondorVersion = "$CondorVersion: 6.7.16 Feb  2 2006 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
Machine = "geronimo.localdomain"
QuillEnabled = FALSE
ScheddIpAddr = "<192.168.1.132:32871>"
NumUsers = 0
MaxJobsRunning = 200
StartLocalUniverse = TRUE
StartSchedulerUniverse = TRUE
Name = "geronimo.localdomain"
VirtualMemory = 2147483647
TotalIdleJobs = 0
TotalRunningJobs = 0
TotalJobAds = 0
TotalHeldJobs = 0
TotalFlockedJobs = 0
TotalRemovedJobs = 0
MonitorSelfTime = 1140705379
MonitorSelfCPUUsage = 0.004182
MonitorSelfImageSize = 7992.000000
MonitorSelfResidentSetSize = 3812
MonitorSelfAge = 47761
WantResAd = TRUE
DaemonStartTime = 1140705434
UpdateSequenceNumber = 0
MyAddress = "<192.168.1.132:32871>"
ServerTime = 1140705434
LastHeardFrom = 1140705434
UpdatesTotal = 569
UpdatesSequenced = 568
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"

machine 2 dedicated node
------------------------
MyType = "Scheduler"
TargetType = ""
CondorVersion = "$CondorVersion: 6.7.16 Feb  2 2006 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
Machine = "chinle.localdomain"
QuillEnabled = FALSE
ScheddIpAddr = "<192.168.1.130:34439>"
MyAddress = "<192.168.1.130:34439>"
NumUsers = 1
MaxJobsRunning = 200
StartLocalUniverse = TRUE
StartSchedulerUniverse = TRUE
Name = "chinle.localdomain"
VirtualMemory = 2147483647
TotalIdleJobs = 1
TotalRunningJobs = 0
TotalJobAds = 1
TotalHeldJobs = 0
TotalFlockedJobs = 0
TotalRemovedJobs = 0
MonitorSelfTime = 1140705915
MonitorSelfCPUUsage = 0.008333
MonitorSelfImageSize = 7992.000000
MonitorSelfResidentSetSize = 3784
MonitorSelfAge = 960
WantResAd = TRUE
DaemonStartTime = 1140704955
UpdateSequenceNumber = 7
ServerTime = 1140706088
LastHeardFrom = 1140706088
UpdatesTotal = 335
UpdatesSequenced = 334
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"

Thanks,
Todd Applewhite