
[HTCondor-users] ShadowLog not accessible for Shadows, jobs not starting/not ending up in the right slot for their Shadow



Hi all,

we are currently wondering about an odd behaviour of schedds on our ARC grid CEs.

Job shadows seem to be no longer able to update their local status; the
symptom is that they cannot write to the ShadowLog anymore [1].

Such a shadow process tries to open the ShadowLog but seems to fail with
an 'Operation not permitted' error [2].
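For context: chmod(2) returns EPERM when the calling process neither owns the file nor has the needed privilege, and on Linux also when the file carries the immutable attribute. A minimal sketch of the ownership case, on a scratch file rather than the real ShadowLog:

```python
import os
import stat
import tempfile

# Demo on a throwaway file, not /var/log/condor/ShadowLog.
fd, path = tempfile.mkstemp()
os.close(fd)

# As the file's owner, chmod() succeeds ...
os.chmod(path, 0o644)
assert stat.S_IMODE(os.stat(path).st_mode) == 0o644

# ... whereas a non-owner, unprivileged process would get EPERM
# (raised as PermissionError in Python) -- which is what the strace
# in [2] shows for the shadow's chmod("ShadowLog", 0644).
os.remove(path)
```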

However, all file system operations work when we try them manually; the
VM admins insist that there are no problems with the underlying
hypervisors (HVs); and there is no significant CPU steal or I/O wait
either. So it is not clear why the shadows really fail.
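A few checks that could narrow down the EPERM (a sketch only; the log path and the assumption that the shadows run as user 'condor' may need adjusting, and the script falls back to a scratch file if the real log is absent):

```shell
LOG=${LOG:-/var/log/condor/ShadowLog}
[ -e "$LOG" ] || { LOG=$(mktemp); echo "demo file: $LOG"; }

# chmod() requires the caller to own the file (or be privileged):
stat -c 'owner=%U group=%G mode=%a' "$LOG"

# On Linux, chmod() also fails with EPERM on immutable ('i') or
# append-only ('a') files; not every filesystem supports these flags.
lsattr "$LOG" 2>/dev/null || echo "lsattr not supported for $LOG"

# Reproduce the failing call as the shadows' user (assumed: condor):
# sudo -u condor chmod 644 "$LOG"
```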

The job itself seems to run happily on its node [3]. The odd thing is
that the shadow does not know about its proper sub-slot but only about
the partitionable parent slot [4]. Also, the remote wall clock time
etc. are 0.0, which does not look healthy [5], as the job has already
been running on the node for ~3h.

Our best guess at the moment is that the shadows cannot update their
states on the schedds.
As the schedd machines themselves look healthy (Condor versions in [6]),
it is currently a mystery to us what upsets the shadows.
The schedd VMs run on different HVs with SSD backends, yet two different
machines show the same behaviour.

Maybe somebody has seen this before and has an idea for us?

Cheers and thanks,
  Thomas

btw: attached are the local starter log excerpt for the job PID and the
job's local ClassAd files.



[1]
02/04/20 16:20:27 (22252523.0) (1701263): WriteUserLog checking for
event log rotation, but no lock
02/04/20 16:20:27 (22252523.0) (1701263): WriteUserLog checking for
event log rotation, but no lock

[2]
root@grid-arcce0: [/var/log/condor] strace -p 1701263
strace: Process 1701263 attached
select(21, [1 6 20], [], [], {3, 241992}) = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC_RAW, {23071, 974658071}) = 0
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV PROF RTMIN
RT_1], NULL, 8) = 0
clock_gettime(CLOCK_MONOTONIC_RAW, {23071, 974887253}) = 0
read(1, 0x7ffe3a5d1b60, 8)              = -1 EAGAIN (Resource
temporarily unavailable)
clock_gettime(CLOCK_MONOTONIC_RAW, {23071, 975123273}) = 0
chmod("/var/log/condor/ShadowLog", 0644) = -1 EPERM (Operation not
permitted)
clock_gettime(CLOCK_MONOTONIC_RAW, {23071, 975352330}) = 0
clock_gettime(CLOCK_MONOTONIC_RAW, {23071, 975439287}) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(21, [1 6 20], [], [], {2, 0})    = 0 (Timeout)
clock_gettime(CLOCK_MONOTONIC_RAW, {23073, 981660291}) = 0
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV PROF RTMIN
RT_1], NULL, 8) = 0
clock_gettime(CLOCK_MONOTONIC_RAW, {23073, 981850254}) = 0
read(1, 0x7ffe3a5d1b60, 8)              = -1 EAGAIN (Resource
temporarily unavailable)
clock_gettime(CLOCK_MONOTONIC_RAW, {23073, 982017237}) = 0
clock_gettime(CLOCK_MONOTONIC_RAW, {23073, 982136875}) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(21, [1 6 20], [], [], {13, 0}^Cstrace: Process 1701263 detached
 <detached ...>


[3]
[root@batch0819 ~]# condor_who | grep 22251616.0
belleprd004@xxxxxxx grid-arcce0.desy.de 1_20 22251616.0   0+03:02:45
7203       /var/lib/condor/execute/dir_6929/condor_exec.exe

[4]
> condor_q -l 22251616.0 | grep -i host
CurrentHosts = 1
MaxHosts = 1
MinHosts = 1
OrigMaxHosts = 1
RemoteHost = "slot1@xxxxxxxxxxxxxxxxx"

>  condor_q -l 22251616.0 | grep -i remote
CumulativeRemoteSysCpu = 287.0
CumulativeRemoteUserCpu = 9136.0
PeriodicRemove = false || RemoteUserCpu + RemoteSysCpu > JobCpuLimit ||
RemoteWallClockTime > JobTimeLimit
RemoteHost = "slot1@xxxxxxxxxxxxxxxxx"
RemoteSlotID = 1
RemoteSysCpu = 287.0
RemoteUserCpu = 9136.0
RemoteWallClockTime = 0.0
WantRemoteIO = true
WantRemoteSyscalls = false


[5]
condor_q -l 22251616.0 | grep -i time
AutoClusterAttrs =
"_cp_orig_RequestCpus,_cp_orig_RequestDisk,_cp_orig_RequestMemory,JobUniverse,LastCheckpointPlatform,MachineLastMatchTime,NumCkpts,RequestCpus,RequestDisk,RequestMemory,ConcurrencyLimits,FlockTo,NiceUser,Rank,Requirements,NODE_IS_HEALTHY,DESYAcctGroup,DESYAcctSubGroup,DiskUsage,Owner,x509UserProxyVOName"
CommittedSlotTime = 0
CommittedSuspensionTime = 0
CommittedTime = 0
CumulativeSlotTime = 0
CumulativeSuspensionTime = 0
JobTimeLimit = 172800
LastMatchTime = 1580819160
LastRejMatchTime = 1580819160
LastSuspensionTime = 0
PeriodicRemove = false || RemoteUserCpu + RemoteSysCpu > JobCpuLimit ||
RemoteWallClockTime > JobTimeLimit
RecentStatsLifetimeStarter = 1200
RemoteWallClockTime = 0.0
ServerTime = 1580830525
StatsLifetimeStarter = 10528


[6]
condor-external-libs-8.9.5-1.el7.x86_64
condor-8.9.5-1.el7.x86_64
python2-condor-8.9.5-1.el7.x86_64
condor-classads-8.9.5-1.el7.x86_64
condor-procd-8.9.5-1.el7.x86_64

Attachment: 6929.tar.xz
Description: application/xz
