
Re: [HTCondor-users] Trying to understand DedicatedScheduler related problems



Hi again,

We just had another job behave like this.

It was submitted (requesting 32 nodes, which were free at that point),
and one could watch

condor_status -const 'PartitionableSlot isnt true' -af ClientMachine RemoteUser Cpus JobId

report a rising number of slots with an undefined JobId until it reached
30. At that point condor_q showed the job as running, but within seconds
it went back to 'idle', and from condor_status' point of view 12 nodes
were left without a defined JobId.
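
In case it matters: that query was simply re-run every few seconds in a
loop; a minimal sketch of it (the extra State and Activity columns are
only an addition of mine here to make the claim state visible, they were
not part of the original query):

while sleep 5; do
    condor_status -const 'PartitionableSlot isnt true' \
        -af ClientMachine RemoteUser Cpus State Activity JobId
    echo '---'
done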

Looking further through the logs, not much can be seen, e.g. in the SchedLog:

10/02/20 14:43:44 (pid:1398) Starting add_shadow_birthdate(969.0)
10/02/20 14:43:44 (pid:1398) Started shadow for job 969.0 on slot1@xxxxxxxxxxxxxxxxx <10.10.82.1:9618?addrs=10.10.82.1-9618&noUDP&sock=2209_c4cc_3> for DedicatedScheduler, (shadow pid = 1864058)
10/02/20 14:43:45 (pid:1398) Received a superuser command
10/02/20 14:43:45 (pid:1398) Number of Active Workers 0
10/02/20 14:43:46 (pid:1398) In DedicatedScheduler::reaper pid 1864058 has status 27648
10/02/20 14:43:46 (pid:1398) Shadow pid 1864058 exited with status 108
10/02/20 14:43:46 (pid:1398) Dedicated job abnormally ended, releasing claim
10/02/20 14:43:46 (pid:1398) Dedicated job abnormally ended, releasing claim
[..]
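
(The 27648 reported by the reaper is just the raw wait status, i.e.
108 << 8, so it matches the shadow exit status 108 on the next line.)

The only further idea I have so far is to raise the debug levels on both
sides and try to reproduce; a rough sketch, assuming the standard
HTCondor knobs and log files:

# submit side (local config on the schedd host), then condor_reconfig:
SCHEDD_DEBUG = D_FULLDEBUG D_COMMAND
SHADOW_DEBUG = D_FULLDEBUG

# execute side (on the claimed nodes), then condor_reconfig:
STARTD_DEBUG = D_FULLDEBUG D_COMMAND
STARTER_DEBUG = D_FULLDEBUG

and then compare SchedLog/ShadowLog on the submit side with StartLog and
StarterLog on the released nodes around the timestamps above.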

So I am still puzzled by this. Does anyone have an idea where else to
dig for information about what may have gone wrong?

Cheers
Carsten
