
Re: [HTCondor-users] Newbie issue: all jobs are idle!



Hi Edier,

Thank you! I did the tests you mentioned. I also did a (not so) little searching, and it seems that the issue is with user authentication (the log is below).

Is there an example of authentication settings for a single-machine cluster? (Later I'll expand, but one problem at a time...)

I put the following lines in condor_config. I'm not sure whether they are correct...


ALLOW_ADMINISTRATOR = $(CONDOR_HOST)
ALLOW_OWNER = $(FULL_HOSTNAME), $(ALLOW_ADMINISTRATOR)
ALLOW_READ = *
ALLOW_WRITE = *
ALLOW_NEGOTIATOR = $(COLLECTOR_HOST)
ALLOW_NEGOTIATOR_SCHEDD = $(COLLECTOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
ALLOW_WRITE_COLLECTOR = $(ALLOW_WRITE), $(FLOCK_FROM)
ALLOW_WRITE_STARTD    = $(ALLOW_WRITE), $(FLOCK_FROM)
ALLOW_READ_COLLECTOR  = $(ALLOW_READ), $(FLOCK_FROM)
ALLOW_READ_STARTD     = $(ALLOW_READ), $(FLOCK_FROM)
ALLOW_CLIENT = *

The /var/log/condor/NegotiatorLog gives:

12/01/17 09:10:24 ---------- Started Negotiation Cycle ----------
12/01/17 09:10:24 Phase 1:  Obtaining ads from collector ...
12/01/17 09:10:24   Getting startd private ads ...
12/01/17 09:10:24   Getting Scheduler, Submitter and Machine ads ...
12/01/17 09:10:24   Sorting 4 ads ...
12/01/17 09:10:24 Got ads: 4 public and 2 private
12/01/17 09:10:24 Public ads include 1 submitter, 2 startd
12/01/17 09:10:24 Phase 2:  Performing accounting ...
12/01/17 09:10:24 Phase 3:  Sorting submitter ads by priority ...
12/01/17 09:10:24 Phase 4.1:  Negotiating with schedds ...
12/01/17 09:10:24   Negotiating with tavares@xxxxxxxxx at <200.xxx.xxx.xxx:31601?addrs=200.xxx.xxx.xxx-31601>
12/01/17 09:10:24 0 seconds so far for this submitter
12/01/17 09:10:24 0 seconds so far for this schedd
12/01/17 09:10:24 SECMAN: FAILED: Received "DENIED" from server for user unauthenticated@unmapped using method (no authentication).
12/01/17 09:10:24 ERROR: SECMAN:2010:Received "DENIED" from server for user unauthenticated@unmapped using method (no authentication).
12/01/17 09:10:24     Failed to send NEGOTIATE command to tavares@xxxxxxxxxxxxxxxxx (<200.xxx.xxx.xxx:31601?addrs=200.xxx.xxx.xxx-31601>)
12/01/17 09:10:24   Error: Ignoring submitter for this cycle
12/01/17 09:10:24  negotiateWithGroup resources used scheddAds length 0
12/01/17 09:10:24 ---------- Finished Negotiation Cycle ----------
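
For anyone hunting for the same symptom in a longer log, here is a minimal Python sketch that pulls the distinct "DENIED" entries out of negotiator log text. The sample lines are copied from the excerpt above; the function name `find_denials` and the regular expression are my own assumptions, written only against the message format shown here, so other HTCondor versions may word the line differently.

```python
import re

# Sample lines copied from the NegotiatorLog excerpt above.
SAMPLE_LOG = """\
12/01/17 09:10:24 Phase 4.1:  Negotiating with schedds ...
12/01/17 09:10:24 SECMAN: FAILED: Received "DENIED" from server for user unauthenticated@unmapped using method (no authentication).
12/01/17 09:10:24 ERROR: SECMAN:2010:Received "DENIED" from server for user unauthenticated@unmapped using method (no authentication).
12/01/17 09:10:24   Error: Ignoring submitter for this cycle
"""

# Assumed pattern: it matches only the wording seen in the excerpt.
DENIED_RE = re.compile(
    r'Received "DENIED" from server for user (?P<user>\S+) '
    r'using method \((?P<method>[^)]*)\)'
)


def find_denials(log_text):
    """Return a sorted list of distinct (user, method) pairs from DENIED lines."""
    found = set()
    for line in log_text.splitlines():
        match = DENIED_RE.search(line)
        if match:
            found.add((match.group("user"), match.group("method")))
    return sorted(found)


if __name__ == "__main__":
    for user, method in find_denials(SAMPLE_LOG):
        print(f"denied: user={user} method={method}")
```

To run it against a real file, the text of /var/log/condor/NegotiatorLog could be read and passed to `find_denials` in place of `SAMPLE_LOG`.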

Thank you!

On Thu, Nov 30, 2017 at 1:17 PM, Edier Zapata <edalzap@xxxxxxxxx> wrote:
Hi Roberto,
 First, run condor_q to check the job status and IDs.
 Second, run condor_q -better-analyze with the ClusterId of your jobs.
 The ClusterId is the number before the period. For example, 1.2 has ClusterId = 1, ProcessId = 2 and JobId = 1.2.
 condor_q -better-analyze will show you why your jobs don't run and will give you some ideas for fixing it.
 You can post the condor_q output here too.
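
The job ID convention above can be sketched in a few lines of Python. This is only an illustration: `split_job_id` and `analyze_command` are hypothetical helper names, and the command they build is just the condor_q -better-analyze call mentioned above, followed by the cluster number.

```python
def split_job_id(job_id):
    """Split a 'Cluster.Proc' job ID such as '1.2' into a pair of ints."""
    cluster, sep, proc = job_id.partition(".")
    if not sep or not cluster.isdigit() or not proc.isdigit():
        raise ValueError(f"expected an ID like '1.2', got {job_id!r}")
    return int(cluster), int(proc)


def analyze_command(job_id):
    """Build the argument list for analyzing the cluster of the given job."""
    cluster, _proc = split_job_id(job_id)
    return ["condor_q", "-better-analyze", str(cluster)]


if __name__ == "__main__":
    print(split_job_id("1.2"))
    print(" ".join(analyze_command("1.2")))
```

The argument list is in the form that `subprocess.run` accepts, in case the analysis should be launched from a script.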

Bye


On Wed, Nov 29, 2017 at 11:00 AM, Roberto Tavares <tavares@xxxxxxxxxxxxx> wrote:
Hello,

I'm having a little trouble getting my jobs to run. I've installed condor from the deb package (v. 8.6.7). I'm trying to run a job, and it is accepted. The file to be executed is chmod'ed to 777.

The idea is: I have condor installed on one machine, and for now I'd like to run jobs on 2 of the 4 available cores.

The main issue is that I don't know what to look for. condor_status -long gives me the following output.

What should the next steps be for debugging this?

Thank you!

Roberto

HasTDP = true
TotalLoadAvg = 0.07000000000000001
HasMPI = true
has_ssse3 = true
Disk = 66646642
HibernationSupportedStates = "S4"
SlotID = 1
OpSysLegacy = "LINUX"
HasEncryptExecuteDirectory = true
SlotTypeID = 0
Rank = 0.0
MyType = "Machine"
JobUserPrioPreemptions = 0
HasVM = false
TotalSlotCpus = 1
LoadAvg = 0.07000000000000001
Cpus = 1
OpSysShortName = "LINUX"
TotalVirtualMemory = 16074816
CondorLoadAvg = 0.0
EnteredCurrentState = 1511970509
PrivateNetworkName = "the domain seems ok"
Memory = 3943
HasIOProxy = true
SlotWeight = Cpus
TotalCpus = 2.0
Name = "slot1@the domain seems ok"
IsWakeAble = true
SlotType = "Static"
MyAddress = "<the ip seems ok:17345?addrs=the ip seems ok-17345>"
Machine = "the domain seems ok"
COLLECTOR_HOST_STRING = "the domain seems ok"
WakeOnLanSupportedFlags = "Physical Packet,UniCast Packet,MultiCast Packet,BroadCast Packet,Magic Packet"
UpdateSequenceNumber = 3
IsWakeOnLanEnabled = true
OpSysAndVer = "LINUX0"
CondorPlatform = "$CondorPlatform: x86_64_Ubuntu14 $"
Unhibernate = MY.MachineLastMatchTime =!= undefined
CpuBusy = ( ( LoadAvg - CondorLoadAvg ) >= 0.5 )
UidDomain = "the domain seems ok"
CpuBusyTime = 0
StartdIpAddr = "<the ip seems ok:17345?addrs=the ip seems ok-17345>"
TargetType = "Job"
AddressV1 = "{[ p=\"primary\"; a=\"the ip seems ok\"; port=17345; n=\"Internet\"; ], [ p=\"IPv4\"; a=\"the ip seems ok\"; port=17345; n=\"Internet\"; ]}"
TotalCondorLoadAvg = 0.0
TotalMemory = 7886
ConsoleIdle = 300
TotalTimeUnclaimedBenchmarking = 26
NumPids = 0
UtsnameRelease = "3.13.0-37-generic"
ExpectedMachineGracefulDrainingBadput = 0
TotalSlotMemory = 3943
RecentJobUserPrioPreemptions = 0
JavaSpecificationVersion = "1.7"
IsValidCheckpointPlatform = ( TARGET.JobUniverse =!= 1 || ( ( MY.CheckpointPlatform =!= undefined ) && ( ( TARGET.LastCheckpointPlatform =?= MY.CheckpointPlatform ) || ( TARGET.NumCkpts == 0 ) ) ) )
Activity = "Idle"
ClockMin = 833
ExpectedMachineQuickDrainingCompletion = 1511970509
RecentJobPreemptions = 0
WakeOnLanEnabledFlags = "Magic Packet"
HasCheckpointing = true
TotalDisk = 133293284
TotalSlots = 2
DetectedMemory = 7886
HasJava = true
JavaVendor = "Oracle Corporation"
HasFileTransferPluginMethods = "file,ftp,http,data"
HasReconnect = true
has_sse4_1 = true
EnteredCurrentActivity = 1511970535
HasPerFileEncryption = true
HasJICLocalStdin = true
TotalSlotDisk = 66646632.0
VirtualMemory = 8037408
CurrentRank = 0.0
RetirementTimeRemaining = 0
HardwareAddress = "a4:1f:72:fa:59:42"
AuthenticatedIdentity = "unauthenticated@unmapped"
Start = true
MonitorSelfResidentSetSize = 4968
UtsnameMachine = "x86_64"
MonitorSelfSecuritySessions = 3
HasFileTransfer = true
SubnetMask = "255.255.255.0"
UpdatesLost = 0
MonitorSelfRegisteredSocketCount = 1
IsWakeOnLanSupported = true
MonitorSelfCPUUsage = 0.0
CanHibernate = true
DaemonCoreDutyCycle = 3.960616506737402E-05
LastHeardFrom = 1511970813
ExpectedMachineGracefulDrainingCompletion = 1511970509
LastBenchmark = 1511970535
HasJobDeferral = true
UpdatesSequenced = 6
CpuIsBusy = false
UtsnameVersion = "#64-Ubuntu SMP Mon Sep 22 21:28:38 UTC 2014"
HasJICLocalConfig = true
CheckpointPlatform = "LINUX X86_64 3.13.0-37-generic normal 0x2aaaaaaab000 ssse3 sse4_1 sse4_2"
JavaVersion = "1.7.0_151"
KeyboardIdle = 17
LastFetchWorkSpawned = 0
JobRankPreemptions = 0
HibernationLevel = 0
HibernationState = "NONE"
KFlops = 1703298
OpSysName = "LINUX"
UpdatesHistory = "00000000000000000000000000000000"
JobPreemptions = 0
RecentJobStarts = 0
IsLocalStartd = false
RecentJobRankPreemptions = 0
CondorVersion = "$CondorVersion: 8.4.12 Jul 06 2017 BuildID: 409562 $"
TotalTimeUnclaimedIdle = 278
UpdatesTotal = 7
RecentDaemonCoreDutyCycle = 3.960616506737402E-05
StarterAbilityList = "HasEncryptExecuteDirectory,HasJava,HasFileTransfer,HasTDP,HasPerFileEncryption,HasVM,HasReconnect,HasMPI,HasFileTransferPluginMethods,HasJobDeferral,HasJICLocalStdin,HasJICLocalConfig,HasRemoteSyscalls,HasCheckpointing"
JobStarts = 0
UtsnameNodename = "dot"
UtsnameSysname = "Linux"
LastUpdate = 1511970535
ClockDay = 3
HasRemoteSyscalls = true
Arch = "X86_64"
MonitorSelfImageSize = 45360
has_sse4_2 = true
OpSys = "LINUX"
FileSystemDomain = "the domain seems ok"
JavaMFlops = 1819.134766
MachineResources = "Cpus Memory Disk Swap"
ExpectedMachineQuickDrainingBadput = 0
DaemonStartTime = 1511970503
State = "Unclaimed"
Mips = 27304
DetectedCpus = 4
MyCurrentTime = 1511970813
LastFetchWorkCompleted = 0
Requirements = ( START ) && ( IsValidCheckpointPlatform )
NextFetchWorkDelay = -1
MachineMaxVacateTime = 10 * 60
TimeToLive = 2147483647
MonitorSelfTime = 1511970749
MonitorSelfAge = 246
MaxJobRetirementTime = 0
OpSysLongName = "Unknown"

HasTDP = true
TotalLoadAvg = 0.07000000000000001
HasMPI = true
has_ssse3 = true
Disk = 66646642
HibernationSupportedStates = "S4"
SlotID = 2
OpSysLegacy = "LINUX"
HasEncryptExecuteDirectory = true
SlotTypeID = 0
Rank = 0.0
MyType = "Machine"
JobUserPrioPreemptions = 0
HasVM = false
TotalSlotCpus = 1
LoadAvg = 0.0
Cpus = 1
OpSysShortName = "LINUX"
TotalVirtualMemory = 16074816
CondorLoadAvg = 0.0
EnteredCurrentState = 1511970509
PrivateNetworkName = "the domain seems ok"
Memory = 3943
HasIOProxy = true
SlotWeight = Cpus
TotalCpus = 2.0
Name = "slot2@the domain seems ok"
IsWakeAble = true
SlotType = "Static"
MyAddress = "<the ip seems ok:17345?addrs=the ip seems ok-17345>"
Machine = "the domain seems ok"
COLLECTOR_HOST_STRING = "the domain seems ok"
WakeOnLanSupportedFlags = "Physical Packet,UniCast Packet,MultiCast Packet,BroadCast Packet,Magic Packet"
UpdateSequenceNumber = 3
IsWakeOnLanEnabled = true
OpSysAndVer = "LINUX0"
CondorPlatform = "$CondorPlatform: x86_64_Ubuntu14 $"
Unhibernate = MY.MachineLastMatchTime =!= undefined
CpuBusy = ( ( LoadAvg - CondorLoadAvg ) >= 0.5 )
UidDomain = "the domain seems ok"
CpuBusyTime = 0
StartdIpAddr = "<the ip seems ok:17345?addrs=the ip seems ok-17345>"
TargetType = "Job"
AddressV1 = "{[ p=\"primary\"; a=\"the ip seems ok\"; port=17345; n=\"Internet\"; ], [ p=\"IPv4\"; a=\"the ip seems ok\"; port=17345; n=\"Internet\"; ]}"
TotalCondorLoadAvg = 0.0
TotalMemory = 7886
MaxJobRetirementTime = 0
MonitorSelfAge = 246
UtsnameRelease = "3.13.0-37-generic"
NumPids = 0
ExpectedMachineGracefulDrainingBadput = 0
TotalSlotMemory = 3943
RecentJobUserPrioPreemptions = 0
JavaSpecificationVersion = "1.7"
IsValidCheckpointPlatform = ( TARGET.JobUniverse =!= 1 || ( ( MY.CheckpointPlatform =!= undefined ) && ( ( TARGET.LastCheckpointPlatform =?= MY.CheckpointPlatform ) || ( TARGET.NumCkpts == 0 ) ) ) )
Activity = "Idle"
ClockMin = 833
ExpectedMachineQuickDrainingCompletion = 1511970509
RecentJobPreemptions = 0
WakeOnLanEnabledFlags = "Magic Packet"
HasCheckpointing = true
TotalDisk = 133293284
TotalSlots = 2
DetectedMemory = 7886
HasJava = true
JavaVendor = "Oracle Corporation"
HasFileTransferPluginMethods = "file,ftp,http,data"
HasReconnect = true
has_sse4_1 = true
EnteredCurrentActivity = 1511970509
HasPerFileEncryption = true
HasJICLocalStdin = true
TotalSlotDisk = 66646632.0
VirtualMemory = 8037408
CurrentRank = 0.0
RetirementTimeRemaining = 0
HardwareAddress = "a4:1f:72:fa:59:42"
AuthenticatedIdentity = "unauthenticated@unmapped"
Start = true
MonitorSelfResidentSetSize = 4968
UtsnameMachine = "x86_64"
MonitorSelfSecuritySessions = 3
HasFileTransfer = true
SubnetMask = "255.255.255.0"
UpdatesLost = 0
MonitorSelfRegisteredSocketCount = 1
IsWakeOnLanSupported = true
MonitorSelfCPUUsage = 0.0
CanHibernate = true
DaemonCoreDutyCycle = 4.42108100822125E-05
LastHeardFrom = 1511970814
ExpectedMachineGracefulDrainingCompletion = 1511970509
LastBenchmark = 1511970535
HasJobDeferral = true
UpdatesSequenced = 6
CpuIsBusy = false
UtsnameVersion = "#64-Ubuntu SMP Mon Sep 22 21:28:38 UTC 2014"
HasJICLocalConfig = true
CheckpointPlatform = "LINUX X86_64 3.13.0-37-generic normal 0x2aaaaaaab000 ssse3 sse4_1 sse4_2"
JavaVersion = "1.7.0_151"
KeyboardIdle = 17
LastFetchWorkSpawned = 0
JobRankPreemptions = 0
HibernationLevel = 0
HibernationState = "NONE"
KFlops = 1703298
OpSysName = "LINUX"
UpdatesHistory = "00000000000000000000000000000000"
JobPreemptions = 0
RecentJobStarts = 0
IsLocalStartd = false
RecentJobRankPreemptions = 0
CondorVersion = "$CondorVersion: 8.4.12 Jul 06 2017 BuildID: 409562 $"
TotalTimeUnclaimedIdle = 305
UpdatesTotal = 7
RecentDaemonCoreDutyCycle = 4.42108100822125E-05
StarterAbilityList = "HasEncryptExecuteDirectory,HasJava,HasFileTransfer,HasTDP,HasPerFileEncryption,HasVM,HasReconnect,HasMPI,HasFileTransferPluginMethods,HasJobDeferral,HasJICLocalStdin,HasJICLocalConfig,HasRemoteSyscalls,HasCheckpointing"
JobStarts = 0
UtsnameNodename = "dot"
ConsoleIdle = 300
ClockDay = 3
HasRemoteSyscalls = true
Arch = "X86_64"
MonitorSelfImageSize = 45360
has_sse4_2 = true
OpSys = "LINUX"
FileSystemDomain = "the domain seems ok"
JavaMFlops = 1819.134766
MachineResources = "Cpus Memory Disk Swap"
ExpectedMachineQuickDrainingBadput = 0
LastUpdate = 1511970535
UtsnameSysname = "Linux"
DaemonStartTime = 1511970503
State = "Unclaimed"
Mips = 27304
DetectedCpus = 4
OpSysLongName = "Unknown"
MyCurrentTime = 1511970814
LastFetchWorkCompleted = 0
Requirements = ( START ) && ( IsValidCheckpointPlatform )
NextFetchWorkDelay = -1
MachineMaxVacateTime = 10 * 60
TimeToLive = 2147483647
MonitorSelfTime = 1511970749
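
A dump like the one above is easier to scan once it is reduced to a few attributes per slot. The following is a rough Python sketch, assuming the ads are blocks of `Attribute = Value` lines separated by blank lines, as they appear here. `parse_ads` and `summarize` are hypothetical helper names, the sample is a shortened copy of the two ads above, and values are kept as raw strings rather than evaluated as ClassAd expressions.

```python
# Shortened copy of the two slot ads shown above.
SAMPLE = """\
SlotID = 1
Name = "slot1@the domain seems ok"
State = "Unclaimed"
Activity = "Idle"
Cpus = 1
DetectedCpus = 4

SlotID = 2
Name = "slot2@the domain seems ok"
State = "Unclaimed"
Activity = "Idle"
Cpus = 1
DetectedCpus = 4
"""


def parse_ads(text):
    """Split blank-line-separated 'Attr = Value' blocks into a list of dicts."""
    ads, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the ad collected so far.
            if current:
                ads.append(current)
                current = {}
            continue
        # Split on the first ' = ' only, so expressions using '=!=' stay intact.
        key, sep, value = line.partition(" = ")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        ads.append(current)
    return ads


def summarize(ads):
    """Return (Name, State, Activity) per ad with surrounding quotes removed."""
    fields = ("Name", "State", "Activity")
    return [tuple(ad.get(f, "").strip('"') for f in fields) for ad in ads]


if __name__ == "__main__":
    for name, state, activity in summarize(parse_ads(SAMPLE)):
        print(f"{name}: {state}/{activity}")
```

In this dump both slots come out as Unclaimed/Idle, which matches the State and Activity lines of the two ads.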


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxx.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/


