
Re: [HTCondor-users] Shadow pid <> for job <> exited with status 108



Hi Christoph. 

Are you saying that there is nothing in the StartLog or StarterLog* files on bird664.desy.de for these failures?

If there is nothing in those files, perhaps there is something in the SharedPortLog?
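
A quick way to check all three logs at once might look like the sketch below. It is only an illustration: the function name `find_in_logs` is made up, the log directory is assumed to be the same /var/log/condor seen in the SchedLog path (it may differ on the worker), and the search string (for example the timestamp of the failed start) is up to you.

```python
from pathlib import Path

# Log file names mentioned above; StarterLog* covers per-slot files.
LOG_GLOBS = ("StartLog", "StarterLog*", "SharedPortLog")


def find_in_logs(log_dir, needle, globs=LOG_GLOBS):
    """Return (file name, line) pairs for every log line containing needle.

    log_dir is an assumption: pass the worker's actual log directory.
    """
    hits = []
    for pattern in globs:
        for path in sorted(Path(log_dir).glob(pattern)):
            if not path.is_file():
                continue
            # errors="replace" keeps the scan going on odd bytes in a log
            with path.open(errors="replace") as fh:
                for line in fh:
                    if needle in line:
                        hits.append((path.name, line.rstrip("\n")))
    return hits
```

For example, `find_in_logs("/var/log/condor", "12/22/21 22:57:38")` would list whatever the worker logged in the second the shadow exited. An empty result would confirm that these logs really contain nothing for that moment.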

-tj


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Beyer, Christoph <christoph.beyer@xxxxxxx>
Sent: Thursday, December 23, 2021 4:48 AM
To: htcondor-users <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Shadow pid <> for job <> exited with status 108
 
Hi,

I see a lot of jobs starting up to a hundred shadows before they run successfully. IMHO the worker refuses to start the job, maybe because conditions that were previously considered fulfilled are no longer met (?)

The job leaves no trace at all on the worker node, hence it must be something that happens very early, once the claim on the worker node is activated?


/var/log/condor/SchedLog:12/22/21 22:57:38 (pid:3577) Starting add_shadow_birthdate(28709962.0)
/var/log/condor/SchedLog:12/22/21 22:57:38 (pid:3577) Started shadow for job 28709962.0 on slot2@xxxxxxxxxxxxxxx <131.169.163.103:33302?addrs=131.169.163.103-33302&alias=bird664.desy.de> for BIRD_cms.lite.uid, (shadow pid = 1596023)
/var/log/condor/SchedLog:12/22/21 22:57:38 (pid:3577) Shadow pid 1596023 for job 28709962.0 exited with status 108
/var/log/condor/SchedLog:12/22/21 22:57:38 (pid:3577) Match record (slot2@xxxxxxxxxxxxxxx <131.169.163.103:33302?addrs=131.169.163.103-33302&alias=bird664.desy.de> for BIRD_cms.lite.uid, 28709962.0) deleted
/var/log/condor/SchedLog:12/22/21 22:57:38 (pid:3577) match (slot2@xxxxxxxxxxxxxxxxx <131.169.160.194:33133?addrs=131.169.160.194-33133&alias=batch1188.desy.de> for BIRD_cms.lite.uid) switching to job 28709962.0
/var/log/condor/SchedLog:12/22/21 22:57:38 (pid:3577) Starting add_shadow_birthdate(28709962.0)
/var/log/condor/SchedLog:12/22/21 22:57:38 (pid:3577) Started shadow for job 28709962.0 on slot2@xxxxxxxxxxxxxxxxx <131.169.160.194:33133?addrs=131.169.160.194-33133&alias=batch1188.desy.de> for BIRD_cms.lite.uid, (shadow pid = 1596024)
/var/log/condor/SchedLog:12/22/21 22:57:38 (pid:3577) Shadow pid 1596024 for job 28709962.0 exited with status 108
/var/log/condor/SchedLog:12/22/21 22:57:38 (pid:3577) Match record (slot2@xxxxxxxxxxxxxxxxx <131.169.160.194:33133?addrs=131.169.160.194-33133&alias=batch1188.desy.de> for BIRD_cms.lite.uid, 28709962.0) deleted

I would love to get this down to a more reasonable number, as it is irritating and clogs the log files ...

Any hints?
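
To put a number on it, a small script along these lines could tally the failed shadows per job and per worker. This is a rough sketch that relies only on the two SchedLog line formats shown above ("Started shadow for job ..." and "Shadow pid ... exited with status ..."); the function name `count_shadow_exits` and the regular expressions are my own and may need adjusting for other log formats.

```python
import re
from collections import Counter

# Patterns modelled on the SchedLog lines quoted above (an assumption:
# other HTCondor versions may format these lines differently).
STARTED = re.compile(
    r"Started shadow for job (?P<job>\d+\.\d+) on "
    r".*?alias=(?P<host>[^>&]+)>.*shadow pid = (?P<pid>\d+)"
)
EXITED = re.compile(
    r"Shadow pid (?P<pid>\d+) for job (?P<job>\d+\.\d+) "
    r"exited with status (?P<status>\d+)"
)


def count_shadow_exits(lines, status=108):
    """Count shadow exits with the given status, per job and per worker."""
    pid_to_host = {}      # shadow pid -> worker alias from the start line
    per_job = Counter()
    per_host = Counter()
    for line in lines:
        started = STARTED.search(line)
        if started:
            pid_to_host[started["pid"]] = started["host"]
            continue
        exited = EXITED.search(line)
        if exited and int(exited["status"]) == status:
            per_job[exited["job"]] += 1
            per_host[pid_to_host.get(exited["pid"], "unknown")] += 1
    return per_job, per_host
```

Feeding it the open SchedLog file (for instance `count_shadow_exits(open("/var/log/condor/SchedLog"))`) and printing `per_host.most_common(10)` should show whether the status 108 exits cluster on a few workers or are spread evenly.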

Best
Christoph

--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx