
Re: [Condor-users] jobs stuck in queue



On Wednesday, 24 August 2011, at 01:21:37, Koller, Garrett wrote:
> Mr. Cannini,
> 
> You're receiving these errors because Condor is trying to be cautious with
> the power you give it.  "With great power comes great responsibility." 
> Root processes have the power to change their effective user and group IDs
> while they are running.  So, even though Condor is being run as root at
> first, Condor only uses that power when it needs it.  When Condor is doing
> normal Condor stuff that doesn't need the extra permissions, it changes
> its effective user and group IDs to be 'condor'.  That is why when you
> check the Condor processes with ps or top, they almost always are listed
> as being owned by the 'condor' user and group.  When Condor needs the
> extra permissions, it changes its effective user ID to be root but then
> changes back to 'condor' when it's done doing the dangerous stuff.

Yes, I understand that, but like I said, I need to make it work first and then 
lock it down.
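
(For anyone reading along: the effective-UID switching Garrett describes is the
usual seteuid() dance. A minimal Python sketch of the idea, not Condor's actual
code, assuming a local 'condor' account and a process started as root:)

import os
import pwd

# assumes a local 'condor' account exists and this script runs as root
condor_uid = pwd.getpwnam("condor").pw_uid

def drop_to_condor():
    # only the effective UID changes; the real/saved UID stays 0,
    # which is what lets the process come back to root later
    os.seteuid(condor_uid)

def regain_root():
    os.seteuid(0)

drop_to_condor()
# ... normal daemon work happens as 'condor' (this is what ps/top show) ...
regain_root()
# ... the occasional privileged operation happens here ...
drop_to_condor()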

> Because of this, perhaps the '/var/spool/condor/' directory or one of its
> subdirectories needs to be owned by root:root.  I have mine owned by
> condor:condor, though, so I don't know why this is a problem.  Try
> chowning it to 'root:root' and see if that helps. For a similar reason,
> perhaps '/var/lib/condor/execute/' needs to be owned by root:root. 
> (Root-squashed usually refers to not giving special permissions to a local
> 'root' user on a shared filesystem that doesn't care about root, I think.)
>  Why does this directory have the sticky bit set, though?  (According to the
> "t" in the "drwx-rwx-rwt" permissions.)  Try unsetting the sticky bit in
> '/var/lib/condor/execute/' by running 'chmod -t /var/lib/condor/execute'
> as root.  My execute directory doesn't have the sticky bit set, so I think
> it's safe to unset it (I don't think it's set by default, that is).

Tried unsetting the sticky bit and changing ownership, but no dice.
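
(For reference, this is roughly how I double-check ownership and the sticky bit
on the directories Garrett mentioned; a quick Python sketch, nothing more:)

import grp
import os
import pwd
import stat

for path in ("/var/spool/condor", "/var/lib/condor/execute"):
    st = os.stat(path)
    owner = pwd.getpwuid(st.st_uid).pw_name
    group = grp.getgrgid(st.st_gid).gr_name
    sticky = bool(st.st_mode & stat.S_ISVTX)
    print("%s  %s:%s  mode %o  sticky=%s"
          % (path, owner, group, stat.S_IMODE(st.st_mode), sticky))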

> Hopefully, this will fix your problems or at least get you that much closer
> to figuring it all out once and for all.  I don't know why the job stays
> stuck on the queue.  Unfortunately, I'm not yet familiar with the parallel
> universe.  What I do know is that after you make these changes and correct
> the most recent errors in your log files, restart Condor and try again. 
> If they still stay in the queue, run 'condor_q -better-analyze' to see
> if you get more information this time.  Before, it mentioned that your job
> didn't match any resource constraints, which tells me that the
> Requirements of the job and the capabilities of the machine don't quite
> match up right.  Look through the log files I mentioned again to see if
> you get any new errors.  If 'condor_q -better-analyze' and the log files
> don't help, give me the output of 'condor_q -long' for the appropriate
> cluster/job and 'condor_status -long' for the appropriate machines
> (node-01 and node-02?).
> 
> Best Regards,
>  ~ Garrett K.
> condor.cs.wlu.edu

Here it is.
example job
===============================
universe             = parallel
Error                = err-$(node).log
Output               = out-$(node).log
Log                  = log-$(node).log

executable           = /usr/bin/mpirun
arguments            = /home/user/hw -np 8 -host $NODE

machine_count        = 1

WhenToTransferOutput = ON_EXIT

transfer_input_files = /home/user/hw

Queue
===============================


Output of 'condor_q -long 57'
+++++++++++++++++++++++++++++++
-- Submitter: master.internal.domain : <172.17.8.121:9632> : master.internal.domain
PeriodicRemove = false
CommittedSlotTime = 0
Out = "out-#pArAlLeLnOdE#.log"
WantIOProxy = true
ImageSize_RAW = 51
NumCkpts_RAW = 0
JobRequiresSandbox = true
EnteredCurrentStatus = 1314306012
CommittedSuspensionTime = 0
WhenToTransferOutput = "ON_EXIT"
NumSystemHolds = 0
StreamOut = false
NumRestarts = 0
ImageSize = 75
Cmd = "/usr/bin/mpirun"
Scheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx"
CurrentHosts = 0
Iwd = "/home/user"
CumulativeSlotTime = 0
ExecutableSize_RAW = 51
CondorVersion = "$CondorVersion: 7.6.0 Apr 19 2011 BuildID: Debian [7.6.0-1~nd60+1] $"
RemoteUserCpu = 0.0
NumCkpts = 0
JobStatus = 1
RemoteSysCpu = 0.0
OnExitRemove = true
BufferBlockSize = 32768
ClusterId = 58
In = "/dev/null"
LocalUserCpu = 0.0
MinHosts = 1
Environment = ""
JobUniverse = 11
RequestDisk = DiskUsage
RootDir = "/"
NumJobStarts = 0
WantRemoteIO = true
RequestMemory = ceiling(ifThenElse(JobVMMemory =!= undefined,JobVMMemory,ImageSize / 1024.000000))
GlobalJobId = "master.internal.domain#58.0#1314306012"
LocalSysCpu = 0.0
PeriodicRelease = false
DiskUsage = 75
CumulativeSuspensionTime = 0
JobLeaseDuration = 1200
TransferInput = "/home/user/hw"
UserLog = "/home/user/log-#pArAlLeLnOdE#.log"
KillSig = "SIGTERM"
ExecutableSize = 75
MaxHosts = 1
ServerTime = 1314306330
CoreSize = 0
DiskUsage_RAW = 61
ProcId = 0
TransferFiles = "ONEXIT"
ShouldTransferFiles = "YES"
CommittedTime = 0
TotalSuspensions = 0
Err = "err-#pArAlLeLnOdE#.log"
RequestCpus = 1
StreamErr = false
NiceUser = false
RemoteWallClockTime = 0.0
TargetType = "Machine"
PeriodicHold = false
QDate = 1314306012
OnExitHold = false
Rank = 0.0
ExitBySignal = false
CondorPlatform = "$CondorPlatform: X86_64-Debian_6.0 $"
JobPrio = 0
LastSuspensionTime = 0
Args = "/home/user/hw -np 8 -host $NODE"
CurrentTime = time()
JobNotification = 2
User = "user@xxxxxxxxxxxxxxx"
BufferSize = 524288
WantRemoteSyscalls = false
LeaveJobInQueue = false
ExitStatus = 0
CompletionDate = 0
MyType = "Job"
Requirements = ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= DiskUsage ) && ( ( TARGET.Memory * 1024 ) >= ImageSize ) && ( ( RequestMemory * 1024 ) >= ImageSize ) && ( TARGET.HasFileTransfer )
WantCheckpoint = false
Owner = "user"
LastJobStatus = 0
TransferIn = false
+++++++++++++++++++++++++++++++
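
(For what it's worth, translating that Requirements line into plain arithmetic
with the numbers from the ad shows the job asks for very little; a rough Python
sketch, with a made-up machine ad just for illustration:)

import math

# values taken from the condor_q -long output above
job = {"ImageSize": 75, "DiskUsage": 75}
job["RequestMemory"] = int(math.ceil(job["ImageSize"] / 1024.0))   # -> 1 (MB)

# hypothetical machine ad, only to show what the expression needs
machine = {"Arch": "X86_64", "OpSys": "LINUX", "Disk": 1000000,
           "Memory": 2048, "HasFileTransfer": True}

matches = (machine["Arch"] == "X86_64"
           and machine["OpSys"] == "LINUX"
           and machine["Disk"] >= job["DiskUsage"]
           and machine["Memory"] * 1024 >= job["ImageSize"]
           and job["RequestMemory"] * 1024 >= job["ImageSize"]
           and machine["HasFileTransfer"])
print(matches)   # True for this made-up machine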