
[HTCondor-users] HTCondor-CE held job due to expired user proxy




Dear all,

We are experiencing issues since the recent migration to HTCondor on the CBPF grid. Frequently, multiple jobs end up held by the HTCondor-CE with a hold reason indicating an expired user proxy. This is strange behavior for LHCb jobs, as can be seen below:

[root@ce01 spool]# pwd
/var/lib/condor-ce/spool
[root@ce01 spool]# ll -R 3839/|tail
3839/9:
total 8
drwx------ 2 simple simple 4096 Jan  2 15:12 cluster3839.proc9.subproc0
drwx------ 2 simple simple 4096 Jan  2 15:12 cluster3839.proc9.subproc0.tmp

3839/9/cluster3839.proc9.subproc0:
total 0

3839/9/cluster3839.proc9.subproc0.tmp:
total 0
[root@ce01 spool]# condor_ce_q 3839


-- Schedd: ce01.cat.cbpf.br : <10.0.0.10:5015> @ 01/02/20 17:39:36
OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
simple ID: 3839     1/2  14:42      _      _     40     40 3839.0-39

Total for query: 40 jobs; 0 completed, 0 removed, 40 idle, 0 running, 0 held, 0 suspended
Total for all users: 2118 jobs; 1033 completed, 0 removed, 365 idle, 720 running, 0 held, 0 suspended

[root@ce01 spool]#
-----------------------------------------------------------------------------
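
For context, the jobs that do end up held can be listed together with their
hold reasons with a query along these lines (a sketch; JobStatus == 5 means
"Held", and the attribute names are the standard HTCondor ones):

-----------------------------------------------------------------------------
# List currently held jobs on the CE with their hold reason and code;
# the output columns follow the attribute order given here.
condor_ce_q -constraint 'JobStatus == 5' \
            -af ClusterId ProcId HoldReasonCode HoldReason
-----------------------------------------------------------------------------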

The HTCondor-CE thinks the 40 subjobs of job 3839 are waiting to run,
yet all their input files are absent, despite SchedLog messages like these:

-----------------------------------------------------------------------------
[root@ce01 condor-ce]# pwd
/var/log/condor-ce
[root@ce01 condor-ce]# grep -w 3839.9 SchedLog
01/02/20 14:42:41 (cid:503887) Submitting new job 3839.9
01/02/20 14:42:50 New job: 3839.9
01/02/20 14:42:50 Writing record to user logfile=/var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0/3839.9.log owner=simple
01/02/20 14:42:50 WriteUserLog::initialize: opened /var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0/3839.9.log successfully
01/02/20 14:42:50 New job: 3839.9, Duplicate Keys: 2, Total Keys: 18
01/02/20 14:43:08 (cid:503905) Transferring files for jobs 3839.0, 3839.1, 3839.2, 3839.3, 3839.4, 3839.5, 3839.6, 3839.7, 3839.8, 3839.9, 3839.10, 3839.11, 3839.12, 3839.13, 3839.14, 3839.15, 3839.16, 3839.17, 3839.18, 3839.19, 3839.20, 3839.21, 3839.22, 3839.23, 3839.24, 3839.25, 3839.26, 3839.27, 3839.28, 3839.29, 3839.30, 3839.31, 3839.32, 3839.33, 3839.34, 3839.35, 3839.36, 3839.37, 3839.38, 3839.39
01/02/20 14:43:17 generalJobFilesWorkerThread(): transfer files for job 3839.9
SUBMIT_UserLog = "/data/HTCondor/work/08D/18A/3839.9.log"
Iwd = "/var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0"
GlobalJobId = "ce01.cat.cbpf.br#3839.9#1577986961"
Environment = "HTCONDOR_JOBID=3839.9"
Err = "3839.9.err"
Out = "3839.9.out"
UserLog = "3839.9.log"
01/02/20 14:43:17 Sending GoAhead for 188.184.80.201 to send /var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0.tmp/tmpsNBYHw and all further files.
01/02/20 14:43:17 Received GoAhead from peer to receive /var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0.tmp/tmpsNBYHw and all further files.
01/02/20 14:43:18 get_file(): going to write to filename /var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0.tmp/DIRAC_RTwVjf_pilotwrapper.py
01/02/20 14:43:18 (cid:503905) Received proxy for job 3839.9
01/02/20 14:43:18 (cid:503905) proxy path: /var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0/tmpsNBYHw
01/02/20 14:43:51 No HoldReasonSubCode found for job 3839.9
01/02/20 14:43:51 Job 3839.9 released from hold: Data files spooled
[root@ce01 condor-ce]#
-----------------------------------------------------------------------------

So, it seems the files were there at some point, then mysteriously disappeared...
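
To reproduce the check, something along these lines compares what the job ad
still claims with what is actually left on disk (a sketch, using job 3839.9
from the excerpt above; Iwd, x509userproxy and TransferInput are standard
job ad attributes):

-----------------------------------------------------------------------------
# What the CE job ad says about the working directory, the proxy and
# the input files...
condor_ce_q 3839.9 -af Iwd x509userproxy TransferInput
# ...versus what is actually present in the spool directory:
ls -la "$(condor_ce_q 3839.9 -af Iwd)"
-----------------------------------------------------------------------------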

Meanwhile, HTCondor will eventually try to schedule the job, which fails
immediately because the proxy is absent. That failure is not treated as
fatal, so the job keeps being retried every so often, until after many
days it finally gets cleaned up, probably once it is known that the proxy
must have expired by then (LHCb proxies appear to have lifetimes of 7 days).
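
For what it's worth, the proxy lifetime can be checked along these lines
while the file is still present (a sketch; the path is the one reported in
the SchedLog excerpt above, and voms-proxy-info has to be available on the
CE):

-----------------------------------------------------------------------------
# Expiration time recorded in the job ad (Unix epoch seconds):
condor_ce_q 3839.9 -af x509userproxyexpiration
# Remaining lifetime (in seconds) of the spooled proxy file itself:
voms-proxy-info -file /var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0/tmpsNBYHw -timeleft
# The same information via plain openssl:
openssl x509 -in /var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0/tmpsNBYHw -noout -enddate
-----------------------------------------------------------------------------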

I have not yet found any logged activity that would explain why those files
were removed while the job itself was left around...
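
The kind of thing I am checking next looks like this (a sketch; the idea
that condor_preen might be removing files it considers orphaned under SPOOL
is purely a guess on my side):

-----------------------------------------------------------------------------
# Any mention of the vanished spool directory in the CE daemon logs?
grep -r 'cluster3839.proc9.subproc0' /var/log/condor-ce/
# How and how often does the CE run condor_preen, which may delete
# files it considers orphaned under SPOOL (again, just a guess):
condor_ce_config_val -v PREEN PREEN_INTERVAL PREEN_ARGS
-----------------------------------------------------------------------------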

Can anyone help?

Cheers,

Daniel Rodrigues de Silos Moraes
CBPF/MCTIC - Grid Site Manager