
Re: [HTCondor-users] Daylight savings put all our jobs on hold?



Starting to think it's a permissions problem - we did swap some storage around before it broke.
Seeing this in the ShadowLog:

04/08/14 07:55:06 ******************************************************
04/08/14 07:55:06 ** condor_shadow (CONDOR_SHADOW) STARTING UP
04/08/14 07:55:06 ** /usr/sbin/condor_shadow
04/08/14 07:55:06 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
04/08/14 07:55:06 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
04/08/14 07:55:06 ** $CondorVersion: 7.8.2 Aug 08 2012 $
04/08/14 07:55:06 ** $CondorPlatform: x86_64_rhap_6.3 $
04/08/14 07:55:06 ** PID = 34959
04/08/14 07:55:06 ** Log last touched 4/8 07:55:06
04/08/14 07:55:06 ******************************************************
04/08/14 07:55:06 Using config source: /etc/condor/condor_config
04/08/14 07:55:06 Using local config sources:
04/08/14 07:55:06    /etc/condor/condor_config.local
04/08/14 07:55:06 DaemonCore: command socket at <147.158.130.183:48277?noUDP>
04/08/14 07:55:06 DaemonCore: private command socket at <147.158.130.183:48277>
04/08/14 07:55:06 Setting maximum accepts per cycle 8.
04/08/14 07:55:06 ERROR "reading ClassAd from (STDIN): file is empty" at line 202 in file /slots/04/dir_64295/userdir/src/condor_shadow.V6.1/shadow_v61_main.cpp
04/08/14 07:55:06 ******************************************************

Any idea which dir I should be looking for?
The dir it mentions is part of the src so I don't think it's an actual dir I have control over.
I can't see any condor_shadow processes running. Shouldn't there be one per job that was submitted?
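
For anyone following along, here is a minimal sketch of pulling the ERROR lines out of a ShadowLog and separating the message from the source location. The regular expression is only an assumption based on the single ERROR line quoted above, and the names `ERROR_RE`, `find_errors` and `SAMPLE` are made up for this example. It makes it easier to see that the path in the message is the source file where the exception was raised, not a directory on the local machine.

```python
import re

# Pattern modelled on the one ERROR line quoted above (an assumption,
# not a complete description of the HTCondor log format).
ERROR_RE = re.compile(
    r'^(?P<stamp>\d\d/\d\d/\d\d \d\d:\d\d:\d\d) ERROR "(?P<message>.*)" '
    r'at line (?P<line>\d+) in file (?P<source>\S+)$'
)


def find_errors(log_text):
    """Return one dict per ERROR line found in the given log text."""
    errors = []
    for raw in log_text.splitlines():
        match = ERROR_RE.match(raw.strip())
        if match:
            entry = match.groupdict()
            entry["line"] = int(entry["line"])
            errors.append(entry)
    return errors


# Two lines copied from the ShadowLog excerpt above.
SAMPLE = (
    '04/08/14 07:55:06 Setting maximum accepts per cycle 8.\n'
    '04/08/14 07:55:06 ERROR "reading ClassAd from (STDIN): file is empty" '
    'at line 202 in file '
    '/slots/04/dir_64295/userdir/src/condor_shadow.V6.1/shadow_v61_main.cpp\n'
)

if __name__ == "__main__":
    for err in find_errors(SAMPLE):
        print(err["stamp"], err["message"])
        print("  raised at", err["source"], "line", err["line"])
```

The same function could be pointed at the contents of the real ShadowLog file to list every exception, rather than just the sample lines used here.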

--Russell

 :-(

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Smithies, Russell
Sent: Monday, 7 April 2014 3:41 p.m.
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Daylight savings put all our jobs on hold?

It's a simple test job that just returns the hostnames.
Starting to think it's not related to daylight savings - that may have been a coincidence.
But no new jobs will run - all sit at 'idle'

--Russell
------------------------
intrepid$ cq -bet 115920.0


-- Schedd: inbfop03.agresearch.co.nz : <147.158.130.182:37044>
        Last successful match: Mon Apr  7 15:36:23 2014

The Requirements expression for your job is:

( TARGET.Site == MY.Site ) && ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == MY.FileSystemDomain ) )

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( TARGET.Site == "invermay" )     68
2   ( TARGET.Arch == "X86_64" )       70
3   ( TARGET.OpSys == "LINUX" )       70
4   ( TARGET.Disk >= 1 )              70
5   ( TARGET.Memory >= ifthenelse(MemoryUsage isnt undefined,MemoryUsage,1) )
                                      70
6   ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == "agresearch.co.nz" ) )
                                      70
-------------------------------

intrepid$ cq 115951.0


-- Schedd: inbfop03.agresearch.co.nz : <147.158.130.182:37044>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
115951.0   smithiesr       4/7  15:18   0+00:00:03 I  0   0.0  do_stuff.sh

---------------------------------------------------

intrepid$ cq -l 115951.0


-- Schedd: inbfop03.agresearch.co.nz : <147.158.130.182:37044>
MaxHosts = 1
User = "smithiesr@xxxxxxxxxxxxxxxx"
OnExitHold = false
CoreSize = 0
MachineAttrCpus0 = 1
WantRemoteSyscalls = false
MyType = "Job"
Rank = ( 64 / CPUs )
CumulativeSuspensionTime = 0
MinHosts = 1
PeriodicHold = false
PeriodicRemove = false
Err = "error.0.txt"
ProcId = 0
EnteredCurrentStatus = 1396841845
UserLog = "/home/smithiesr/condor/all.log"
NumShadowExceptions = 72
NumJobStarts = 0
AutoClusterAttrs = "JobUniverse,LastCheckpointPlatform,NumCkpts,_condor_RequestCpus,_condor_RequestDisk,_condor_RequestMemory,RequestCpus,RequestDisk,RequestMemory,Site,DiskUsage,ImageSize,Requirements,NiceUser,ConcurrencyLimits"
JobUniverse = 5
AutoClusterId = 2
In = "/dev/null"
Requirements = ( TARGET.Site == MY.Site ) && ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.HasFileTransfer )
ClusterId = 115951
WhenToTransferOutput = "ON_EXIT_OR_EVICT"
LastMatchTime = 1396841845
CompletionDate = 0
BufferSize = 524288
Environment = ""
StartdPrincipal = "unauthenticated@unmapped/147.158.128.82"
TargetType = "Machine"
LeaveJobInQueue = false
JobNotification = 1
Owner = "smithiesr"
CondorPlatform = "$CondorPlatform: x86_64_rhap_6.3 $"
CommittedTime = 0
QDate = 1396840718
JobLeaseDuration = 1200
TransferIn = false
ExitStatus = 0
NumCkpts_RAW = 0
RootDir = "/"
NumJobMatches = 72
JobCurrentStartDate = 1396841845
CurrentHosts = 0
GlobalJobId = "inbfop03.agresearch.co.nz#115951.0#1396840719"
RemoteSysCpu = 0.0
TotalSuspensions = 0
WantCheckpoint = false
LastJobLeaseRenewal = 1396841845
LastRemoteHost = "slot1@xxxxxxxxxxxxxxxxxxxxxxx"
PeriodicRelease = false
LastPublicClaimId = "<147.158.128.82:60954>#1396667299#1133#..."
CondorVersion = "$CondorVersion: 7.8.2 Aug 08 2012 $"
Out = "out.0.txt"
ShouldTransferFiles = "YES"
DiskUsage = 1
JobRunCount = 72
CumulativeSlotTime = 3.000000
CommittedSlotTime = 0
LocalUserCpu = 0.0
DiskUsage_RAW = 1
JobStartDate = 1396840721
ExitBySignal = false
StreamErr = false
NumSystemHolds = 0
NumRestarts = 0
RequestDisk = DiskUsage
OrigMaxHosts = 1
JobPrio = 0
NumCkpts = 0
BufferBlockSize = 32768
ImageSize = 1
CommittedSuspensionTime = 0
ExecutableSize_RAW = 1
Cmd = "/home/smithiesr/condor/do_stuff.sh"
LocalSysCpu = 0.0
Iwd = "/home/smithiesr/condor"
ServerTime = 1396841890
ImageSize_RAW = 1
LastSuspensionTime = 0
JobStatus = 1
ExecutableSize = 1
MachineAttrSlotWeight0 = 1
Site = "invermay"
RemoteWallClockTime = 3.000000
OnExitRemove = true
Arguments = ""
StreamOut = false
CurrentTime = time()
RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,( ImageSize + 1023 ) / 1024)
RemoteUserCpu = 0.0
NiceUser = false
NumShadowStarts = 72
RequestCpus = 1
JobLastStartDate = 1396841845
WantRemoteIO = true
LastJobStatus = 2


------------------------------------------------
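
A rough sketch of how the shadow counters in the long listing above could be checked automatically. It assumes one `Name = value` attribute per line, and the helper names `parse_classad_long` and `shadow_summary` are invented for this example. For this job the listing shows NumShadowStarts = 72, NumShadowExceptions = 72 and NumJobStarts = 0, which reads as though every shadow that was launched failed before the job actually started.

```python
def parse_classad_long(text):
    """Parse 'Name = value' lines into a dict of raw string values.

    Lines that do not look like a simple attribute assignment
    (for example the '-- Schedd:' header) are skipped.
    """
    attrs = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if sep and name.strip().isidentifier():
            attrs[name.strip()] = value.strip()
    return attrs


def shadow_summary(attrs):
    """Summarise the shadow-related counters of one job."""
    starts = int(attrs.get("NumShadowStarts", 0))
    exceptions = int(attrs.get("NumShadowExceptions", 0))
    job_starts = int(attrs.get("NumJobStarts", 0))
    return {
        "shadow_starts": starts,
        "shadow_exceptions": exceptions,
        "job_starts": job_starts,
        # True when every shadow launched ended in an exception
        # and the job itself never got going.
        "every_shadow_failed": (
            starts > 0 and starts == exceptions and job_starts == 0
        ),
    }


# A few attributes copied from the listing above.
SAMPLE = """-- Schedd: inbfop03.agresearch.co.nz : <147.158.130.182:37044>
NumShadowExceptions = 72
NumJobStarts = 0
NumShadowStarts = 72
JobStatus = 1
"""

if __name__ == "__main__":
    print(shadow_summary(parse_classad_long(SAMPLE)))
```

Values are kept as raw strings in the parsed dict so that expressions such as the Requirements line are not misinterpreted; only the three counters are converted to integers.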





-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Ben Cotton
Sent: Monday, 7 April 2014 1:01 p.m.
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Daylight savings put all our jobs on hold?

Russell,

When the U.S. switched to daylight saving time last month, I had a customer who discovered that all of their execute nodes died. The culprit there turned out to be that condor_master thought the timestamp of the condor_master.exe binary had changed and attempted to restart. HTCondor was running as an Active Directory user that wasn't correctly configured, so condor_master was not able to restart. Ticket 3572[1] has some information on that.

Without knowing what version of HTCondor and what OS you're running, I can't say if that's in any way related. Generally, I'm not sure why the time change would cause your jobs to remain in idle state. Would it be possible for you to share the output of a representative condor_q -bet?

[1] https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3572


Thanks,
BC

--
Ben Cotton
main: 888.292.5320

Cycle Computing
Leader in Utility HPC Software

http://www.cyclecomputing.com
twitter: @cyclecomputing
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
