
Re: [Condor-users] Why are my jobs sent a HUP signal instantly?




As a test, try setting START = TRUE in the config of your execute machines to see if that makes a difference, and take out any start requirements that you set in the submit file.

JW

Tim Robertson wrote:

Sorry for the repost, but I'm getting a bit desperate -- can anyone help with this?

I've done some more investigation, and I'm not finding a good reason for the behaviour I'm seeing. For some unknown reason, my remote jobs are being sent a SIGHUP as soon as they begin running (return 129). I can run the binaries on the nodes in question without condor, and I've checked every obvious thing I can think of (swap, disk space, file permissions, memory).
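For anyone checking the arithmetic behind reading 129 as a SIGHUP, here is a minimal Python sketch. It is not taken from Condor. The thread does not say whether the StarterLog prints a shell-style code (128 + signal number) or a raw wait() status word, so the sketch tries both readings. The function names are made up for illustration.

```python
import os
import signal

def shell_style_signal(status):
    """Read status as 128 + signal number (the convention shells use)."""
    if status > 128:
        # Raises ValueError if the number is not a known signal here.
        return signal.Signals(status - 128).name
    return None

def wait_status_signal(status):
    """Read status as a raw wait() status word (Unix only)."""
    if os.WIFSIGNALED(status):
        return signal.Signals(os.WTERMSIG(status)).name
    return None

# 129 is the status reported in the StarterLog in this thread.
print(shell_style_signal(129))
print(wait_status_signal(129))
```

On Linux both readings of 129 come out as signal 1 (SIGHUP), so the interpretation above holds under either convention.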

I reconfigured to do more complete logging for the Starter and the Shadow processes, and the only thing I see that looks suspicious is the following entry in the StartLog:

1/12 16:22:44 Error: can't find resource with capability (<10.0.1.200:32798>#5315939804)

Do any of you have an explanation for this, or at least, another way to diagnose the problem?

Thanks,
Tim

Tim Robertson wrote:

Hi,

I'm trying to test a newly-configured condor pool (condor version 6.6.7, all machines use Fedora Core 1) using a few binaries in standard universe. When I submit jobs, however, only the submitting machine can execute -- all other jobs are matched to idle nodes, begin to execute, and are immediately vacated from the nodes.

When I examine the logs of these machines, I always see the following lines in the StarterLog file:

 > Process XXXXX exited with status 129
 > EXEC of user process failed, probably with insufficient swap

They always occur within 1 second of the execve call.

I found this thread in the mailing list archives, dealing with a similar problem:

http://lists.cs.wisc.edu/archive/condor-users/pre-2004-June/msg00253.shtml

But (wouldn't you know it), the thread goes dead before any useful information is given about the problem. Sigh.

What could be going on here? It isn't related to the binaries, as far as I can tell (I can log into the nodes and run the programs without condor), so I'm at a loss.

Thanks in advance,
Tim

PS: If it helps anyone, I've copied the result of running condor_status -l on one of the nodes below.

------------------------------------ condor_status -l :

MyType = "Machine"
TargetType = "Job"
Name = "baloo1.bagley069.varanilab"
Machine = "baloo1.bagley069.varanilab"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "agni.bagley069.varanilab"
CondorVersion = "$CondorVersion: 6.6.7 Oct 11 2004 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
VirtualMachineID = 1
VirtualMemory = 1831400
Disk = 33307820
CondorLoadAvg = 0.000000
LoadAvg = 0.000000
KeyboardIdle = 2899
ConsoleIdle = 41895557
Memory = 945
Cpus = 1
StartdIpAddr = "<xxxxxxx:32798>"
Arch = "INTEL"
OpSys = "LINUX"
UidDomain = "localdomain"
FileSystemDomain = "localdomain"
Subnet = "10.0.1"
HasIOProxy = TRUE
TotalVirtualMemory = 1831400
TotalDisk = 33307820
KFlops = 723715
Mips = 2587
LastBenchmark = 1105527071
TotalLoadAvg = 0.000000
TotalCondorLoadAvg = 0.000000
ClockMin = 200
ClockDay = 3
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasMPI = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
HasPVM = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin,HasPVM,HasRemoteSyscalls,HasCheckpointing"
CpuBusyTime = 0
CpuIsBusy = FALSE
State = "Unclaimed"
EnteredCurrentState = 1105528090
Activity = "Idle"
EnteredCurrentActivity = 1105528090
Start = ((KeyboardIdle > 15 * 60) && (((LoadAvg - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" && State != "Owner")))
Requirements = START
CurrentRank = 0.000000
DaemonStartTime = 1104733528
UpdateSequenceNumber = 2657
MyAddress = "<xxxxxxxx:32798>"
LastHeardFrom = 1105531215
UpdatesTotal = 2237
UpdatesSequenced = 2233
UpdatesLost = 36
UpdatesHistory = "0x00000000008808000000000100000000"
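
To see what the Start expression in this ad works out to, here is a rough hand translation into Python using the values from the listing above. This is only a sketch and not a ClassAd evaluator. The function and variable names are invented for illustration, and string comparison is handled the plain Python way.

```python
# Values copied from the condor_status -l output above.
ad = {
    "KeyboardIdle": 2899,
    "LoadAvg": 0.0,
    "CondorLoadAvg": 0.0,
    "State": "Unclaimed",
}

def start_allows_job(ad):
    """Hand translation of the Start expression from the machine ad."""
    keyboard_idle_long_enough = ad["KeyboardIdle"] > 15 * 60
    non_condor_load_low = (ad["LoadAvg"] - ad["CondorLoadAvg"]) <= 0.3
    already_in_use = ad["State"] != "Unclaimed" and ad["State"] != "Owner"
    return keyboard_idle_long_enough and (non_condor_load_low or already_in_use)

print(start_allows_job(ad))
```

With these numbers the expression comes out true (2899 seconds of keyboard idle is more than 900, and the non-Condor load is 0), so in this snapshot the Start policy would not by itself refuse a job. The snapshot was taken while the machine was Unclaimed, though, so it says nothing certain about the values at the moment a job was vacated.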


_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
http://lists.cs.wisc.edu/mailman/listinfo/condor-users
