
[HTCondor-users] Idle Jobs



A new problem, maybe two, popped up this morning.

1. Jobs that ran when submitted on Wed. are not running today. I also tried a simple sleep job with no success. They are all stuck in the idle state.

2. Also, the only machine that condor_status shows in the pool is the CM/Submit machine. When we run condor_status on another machine, we receive this error:
CEDAR:6001:Failed to connect to (IP of server:9618)

condor_q -better-analyze
-- Schedd: igoi00 : <0.0.0.0:9618?...
The Requirements expression for job 1168.000 is

  ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.HasFileTransfer )

Job 1168.000 defines the following attributes:

  DiskUsage = 1
  ImageSize = 1
  RequestDisk = DiskUsage
  RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,( ImageSize + 1023 ) / 1024)

slot1@igoi00 has the following attributes:

  TARGET.Arch = "X86_64"
  TARGET.Disk = 5515256
  TARGET.HasFileTransfer = true
  TARGET.Memory = 64156
  TARGET.OpSys = "LINUX"

The Requirements expression for job 1168.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           1  TARGET.Arch == "X86_64"
[1]           1  TARGET.OpSys == "LINUX"
[3]           1  TARGET.Disk >= RequestDisk
[5]           1  TARGET.Memory >= RequestMemory
[7]           1  TARGET.HasFileTransfer

No successful match recorded.
Last failed match: Fri May 19 16:06:32 2017

Reason for last match failure: no match found

1168.000:  Run analysis summary ignoring user priority. Of 1 machines,
   0 are rejected by your job's requirements
   0 reject your job because of their own requirements
   0 match and are already running your jobs
   0 match but are serving other users
   0 are available to run your job
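
A minimal sketch, assuming the attribute values printed above, of checking the five requirement clauses by hand. The dictionaries and function names are hypothetical helpers for illustration, not part of HTCondor, and the integer division in the RequestMemory default is an assumption about how the expression evaluates for integer attributes.

```python
# Values copied from the condor_q -better-analyze output above.
job = {"DiskUsage": 1, "ImageSize": 1}  # MemoryUsage is not listed, so treat it as undefined
slot = {
    "Arch": "X86_64",
    "OpSys": "LINUX",
    "Disk": 5515256,
    "Memory": 64156,
    "HasFileTransfer": True,
}


def request_memory(job):
    """Mirror ifthenelse(MemoryUsage =!= undefined, MemoryUsage, (ImageSize + 1023) / 1024)."""
    if "MemoryUsage" in job:
        return job["MemoryUsage"]
    # Assumption: integer operands divide as integers.
    return (job["ImageSize"] + 1023) // 1024


def requirement_results(job, slot):
    """Evaluate each clause of the job's Requirements against one slot."""
    request_disk = job["DiskUsage"]  # RequestDisk = DiskUsage
    return {
        "Arch": slot["Arch"] == "X86_64",
        "OpSys": slot["OpSys"] == "LINUX",
        "Disk": slot["Disk"] >= request_disk,
        "Memory": slot["Memory"] >= request_memory(job),
        "HasFileTransfer": bool(slot["HasFileTransfer"]),
    }


if __name__ == "__main__":
    for clause, ok in requirement_results(job, slot).items():
        print(f"{clause:16} {'matches' if ok else 'FAILS'}")
```

Under these assumptions every clause is satisfied by slot1@igoi00, which agrees with the "Slots Matched" column showing 1 for each step; the sketch says nothing about why the summary still reports 0 machines available.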



Here is an error from the CollectorLog.
DaemonCore: Can't receive command request from (Server IP) (perhaps a timeout?)

The MasterLog has this in it.
DefaultReaper unexpectedly called on pid 5435, status 0.
05/19/17 12:54:09 PERMISSION DENIED to unauthenticated@unmapped from host 0.0.0.0 for command 60005 (DC_OFF_GRACEFUL), access level ADMINISTRATOR: reason: ADMINISTRATOR authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 0.0.0.0,igoi00, hostname size = 1, original ip address = 0.0.0.0
05/19/17 12:54:09 DC_AUTHENTICATE: Command not authorized, done!

Permission denied? What is this referring to? And what does "DC_AUTHENTICATE: Command not authorized" mean?


NegotiatorLog
Rejected 1165.0 .... 9618&noUDP&sock=1739_6348_4>: no match found

startd_history
BadputCausedByDraining = false
StarterExitStatus = 0
ExitCode = 0
JobDuration = 423.628746
JobPid = 608905
MyAddress = "<0.0.0.0:9618?addrs=0.0.0.0-9618+[--1]-9618&noUDP&sock=491051_9a1a_1431>"
CommittedTime = 0
Cmd = "/home/smccalla/step3_outformat_6.sh"
NumPids = 0
MyType = "Job"
CumulativeSuspensionTime = 0
NumCkpts = 0
WantRemoteIO = true
In = "/dev/null"
WhenToTransferOutput = "ON_EXIT"
NumCkpts_RAW = 0
StreamErr = true
DiskUsage = 1
ImageSize_RAW = 1
CompletionDate = 1495068052
BadputCausedByPreemption = false
ServerTime = 1495067628
JobNotification = 0
Iwd = "/home/smccalla/GAIM_28/data/fasta_files/fastqjoin.join.fna.formatted"
NumJobCompletions = 0
LastSuspensionTime = 0
PeriodicRemove = false
Owner = "smccalla"
LocalSysCpu = 0.0
TransferInputSizeMB = 0
NumSystemHolds = 0
StreamOut = true
BlockWriteKbytes = 0
LastJobStatus = 1
ImageSize = 788952
NumRestarts = 0
RemoteSysCpu = 12.0
JobRunCount = 1
RemoteWallClockTime = 423
WantCheckpoint = false
AutoClusterId = 8
OrigMaxHosts = 1
NiceUser = false
Out = "/home/smccalla/GAIM_28/data/fasta_files/fastqjoin.join.fna.formatted/blast.out.group_761"
RequestDisk = 1
UserLog = "/home/smccalla/GAIM_28/data/fasta_files/fastqjoin.join.fna.formatted/blast.log"
TotalSuspensions = 0
JobStatus = 4
JobCurrentStartDate = 1495067628
PeriodicHold = false
TargetType = "Machine"
LocalUserCpu = 0.0
RemoteHost = "slot1_12@igoi00"
RootDir = "/"
LastMatchTime = 1495067628
RemoteSlotID = 1
CommittedSlotTime = 0
ExitBySignal = false
ProvisionedResources = "Cpus Memory Disk Swap"
Err = "/home/smccalla/GAIM_28/data/fasta_files/fastqjoin.join.fna.formatted/logblast.error.761"
StartdIpAddr = "<0.0.0.0:9618?addrs=0.0.0.0-9618+[--1]-9618&noUDP&sock=491002_1c55_5>"
ProcId = 761
JobUniverse = 5
RecentStatsLifetime = 415
BufferSize = 524288
MachineAttrCpus0 = 1
EnteredCurrentStatus = 1495067628
CoreSize = 0
TransferIn = false
BlockWrites = 0
RemoteUserCpu = 412.0
BufferBlockSize = 32768
AutoClusterAttrs = "JobUniverse,LastCheckpointPlatform,NumCkpts,MachineLastMatchTime,_condor_RequestCpus,_condor_RequestDisk,_condor_RequestMemory,RequestCpus,RequestDisk,RequestMemory,ConcurrencyLimits,NiceUser,Rank,Requirements,DiskUsage,FileSystemDomain,ImageSize"
WantRemoteSyscalls = false
FileSystemDomain = "igoi00"
ExecutableSize = 1
CumulativeRemoteSysCpu = 0.0
CumulativeSlotTime = 0
BlockReads = 3
ClusterId = 1149
TotalSubmitProcs = 765
PeriodicRelease = false
LastJobLeaseRenewal = 1495067628
StartdPrincipal = "execute-side@matchsession/0.0.0.0"
RunAsOwner = true
CpusProvisioned = 1
CommittedSuspensionTime = 0
CumulativeRemoteUserCpu = 0.0
PublicClaimId = "<0.0.0.0:9618>#1494530289#1577#..."
ClaimId = "<0.0.0.0:9618>#1494530289#1577#[CryptoMethods=\"3DES\";Encryption=\"NO\";Integrity=\"NO\";]7b2767b8f5f70ae52984ba401410e6c6f309364e"
Environment = ""
RecentBlockWrites = 0
CondorVersion = "$CondorVersion: 8.6.2 Apr 23 2017 BuildID: 404257 $"
MaxHosts = 1
CondorPlatform = "$CondorPlatform: x86_64_RedHat7 $"
CurrentHosts = 1
NumShadowStarts = 1
JobLeaseDuration = 2400
RequestCpus = 1
DiskUsage_RAW = 1
QDate = 1495056728
Requirements = ( ( Machine == "igoi00" ) ) && ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == MY.FileSystemDomain ) )
NumJobMatches = 1
MinHosts = 1
JobPrio = 0
MachineAttrSlotWeight0 = 1
JobStartDate = 1495067629
RequestMemory = 1
LeaveJobInQueue = false
NumJobStarts = 0
ExecutableSize_RAW = 1
ShouldTransferFiles = "IF_NEEDED"
EncryptExecuteDirectory = false
StatsLifetime = 423
ExitStatus = 0
MemoryProvisioned = 1
User = "smccalla@igoi00"
DiskProvisioned = 64558
Rank = 0.0
Arguments = "761"
ShadowBday = 1495067628
RecentBlockReadKbytes = 200
StarterIpAddr = "<0.0.0.0:9618?addrs=0.0.0.0-9618+[--1]-9618&noUDP&sock=491052_4e7a_2045>"
JobState = "Exited"
MemoryUsage = ( ( ResidentSetSize + 1023 ) / 1024 )
RecentBlockReads = 3
GlobalJobId = "igoi00#1149.761#1495056729"
ResidentSetSize = 348900
BlockReadKbytes = 200
RecentBlockWriteKbytes = 0
*** Offset = 2219429 ClusterId = 1149 ProcId = 761 Owner = "smccalla" CompletionDate = 1495068052
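
The date attributes in the record above are Unix epoch seconds. A small sketch, assuming only the three values shown in this record, that converts them to UTC so they can be lined up against the log timestamps (which are presumably in the server's local time zone):

```python
from datetime import datetime, timezone

# Epoch values copied from the startd_history record above.
stamps = {
    "QDate": 1495056728,
    "JobCurrentStartDate": 1495067628,
    "CompletionDate": 1495068052,
}


def to_utc(epoch_seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


# Wall-clock seconds between job start and completion.
run_seconds = stamps["CompletionDate"] - stamps["JobCurrentStartDate"]

if __name__ == "__main__":
    for name, value in stamps.items():
        print(f"{name:20} {to_utc(value).isoformat()}")
    print(f"start to completion: {run_seconds} s")
```

The start-to-completion difference is within a second of the JobDuration and RemoteWallClockTime values in the same record.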


StartLog
05/19/17 10:47:25 Using config source: /etc/condor/condor_config
05/19/17 10:47:25 Using local config sources:
05/19/17 10:47:25   /etc/condor/config.d/00condor_config_Daemon_Resources
05/19/17 10:47:25   /etc/condor/config.d/01condor_config_IP
05/19/17 10:47:25   /etc/condor/config.d/01condor_config_IP_Host
05/19/17 10:47:25   /etc/condor/config.d/02condor_config_Access
05/19/17 10:47:25   /etc/condor/config.d/03condor_config_flocking
05/19/17 10:47:25   /etc/condor/config.d/04condor_config_Docker
05/19/17 10:47:25   /etc/condor/config.d/condor_config_Groups
05/19/17 10:47:25   /etc/condor/condor_config.local
05/19/17 10:47:25 config Macros = 85, Sorted = 85, StringBytes = 2558, TablesBytes = 3156
05/19/17 10:47:25 CLASSAD_CACHING is ENABLED
05/19/17 10:47:25 Daemon Log is logging: D_ALWAYS D_ERROR
05/19/17 10:47:25 SharedPortEndpoint: waiting for connections to named socket 1739_6348_5
05/19/17 10:47:25 DaemonCore: command socket at <0.0.0.0:9618?addrs=0.0.0.0-9618+[--1]-9618&noUDP&sock=1739_6348_5>
05/19/17 10:47:25 DaemonCore: private command socket at <0.0.0.0:9618?addrs=159.189.88.24-9618+[--1]-9618&noUDP&sock=1739_6348_5>
05/19/17 10:47:25 my_popenv: Failed to exec in child, errno=2 (No such file or directory)
05/19/17 10:47:25 my_popenv failed
05/19/17 10:47:32 my_popenv: Failed to exec in child, errno=2 (No such file or directory)
05/19/17 10:47:32 Failed to execute /usr/sbin/condor_starter.std, ignoring
05/19/17 10:47:32 VM-gahp server reported an internal error
05/19/17 10:47:32 VM universe will be tested to check if it is available
05/19/17 10:47:32 History file rotation is enabled.
05/19/17 10:47:32   Maximum history file size is: 20971520 bytes
05/19/17 10:47:32   Number of rotated history files is: 2
05/19/17 10:47:32 Allocating auto shares for slot type 1: Cpus: 32.000000, Memory: auto, Swap: auto, Disk: auto
slot type 1: Cpus: 32.000000, Memory: 64156, Swap: 100.00%, Disk: 100.00%
05/19/17 10:47:32 slot1: New machine resource of type 1 allocated
05/19/17 10:47:32 Setting up slot pairings
05/19/17 10:47:32 my_popenv: Failed to exec in child, errno=2 (No such file or directory)
05/19/17 10:47:32 my_popenv failed
05/19/17 10:47:32 Unable to calculate keyboard/mouse idle time due to them both being USB or not present, assuming infinite idle time for these devices.
05/19/17 10:47:32 CronJobList: Adding job 'mips'
05/19/17 10:47:32 CronJobList: Adding job 'kflops'
05/19/17 10:47:32 CronJob: Initializing job 'mips' (/usr/libexec/condor/condor_mips)
05/19/17 10:47:32 CronJob: Initializing job 'kflops' (/usr/libexec/condor/condor_kflops)
05/19/17 10:47:32 slot1: State change: IS_OWNER is false
05/19/17 10:47:32 slot1: Changing state: Owner -> Unclaimed
05/19/17 10:47:32 State change: RunBenchmarks is TRUE
05/19/17 10:47:32 slot1: Changing activity: Idle -> Benchmarking
05/19/17 10:47:32 BenchMgr:StartBenchmarks()
05/19/17 10:47:35 Initial update sent to collector(s)
05/19/17 10:47:35 Sending DC_SET_READY message to master <0.0.0.0:9618?addrs=159.189.88.24-9618+[--1]-9618&noUDP&sock=1739_6348>
05/19/17 10:47:59 State change: benchmarks completed
05/19/17 10:47:59 slot1: Changing activity: Benchmarking -> Idle
05/19/17 10:54:09 Got SIGHUP. Re-reading config files.
05/19/17 10:54:09 History file rotation is enabled.
05/19/17 10:54:09   Maximum history file size is: 20971520 bytes
05/19/17 10:54:09   Number of rotated history files is: 2
05/19/17 10:54:09 my_popenv: Failed to exec in child, errno=2 (No such file or directory)
05/19/17 10:54:09 my_popenv failed
05/19/17 10:54:15 my_popenv: Failed to exec in child, errno=2 (No such file or directory)
05/19/17 10:54:15 Failed to execute /usr/sbin/condor_starter.std, ignoring
05/19/17 11:27:16 Got SIGHUP. Re-reading config files.
05/19/17 11:27:16 History file rotation is enabled.
05/19/17 11:27:16   Maximum history file size is: 20971520 bytes
05/19/17 11:27:16   Number of rotated history files is: 2
05/19/17 11:27:16 my_popenv: Failed to exec in child, errno=2 (No such file or directory)
05/19/17 11:27:16 my_popenv failed
05/19/17 11:27:22 my_popenv: Failed to exec in child, errno=2 (No such file or directory)
05/19/17 11:27:22 Failed to execute /usr/sbin/condor_starter.std, ignoring
05/19/17 11:52:22 Unable to calculate keyboard/mouse idle time due to them both being USB or not present, assuming infinite idle time for these devices.
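
The command-socket strings in the StartLog above can be split into parts to compare what each daemon is advertising. This is a rough sketch, assuming the `<host:port?key=value&...>` layout seen in these log lines; `parse_sinful` is a hypothetical helper, it does not handle bracketed IPv6 hosts, and it leaves the `addrs` value as an opaque string.

```python
def parse_sinful(sinful):
    """Split a '<host:port?key=value&...>' address string into host, port, params."""
    inner = sinful.strip().lstrip("<").rstrip(">")
    hostport, _, query = inner.partition("?")
    host, _, port = hostport.rpartition(":")
    params = {}
    for item in filter(None, query.split("&")):
        key, _, value = item.partition("=")  # flags such as noUDP get an empty value
        params[key] = value
    return host, int(port), params


# The two addresses logged by the startd at 10:47:25.
command = "<0.0.0.0:9618?addrs=0.0.0.0-9618+[--1]-9618&noUDP&sock=1739_6348_5>"
private = "<0.0.0.0:9618?addrs=159.189.88.24-9618+[--1]-9618&noUDP&sock=1739_6348_5>"

if __name__ == "__main__":
    for label, address in (("command", command), ("private", private)):
        host, port, params = parse_sinful(address)
        print(f"{label:8} host={host} port={port} addrs={params.get('addrs')}")
```

Run against the two lines above, the only difference it shows is in `addrs`: the command socket lists 0.0.0.0 while the private command socket lists 159.189.88.24.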

Not sure what's going on. Hopefully someone can point me in the right direction.


Thanks

Jon Knudson