
[HTCondor-users] HTCondor-CE held job due to expired user proxy




Dear all,

We are experiencing issues since the recent migration to HTCondor on the CBPF grid. Frequently, multiple jobs end up held by the HTCondor-CE with a hold reason indicating an expired user proxy. This is strange behavior for LHCb jobs, as can be seen below:

[root@ce01 spool]# pwd
/var/lib/condor-ce/spool
[root@ce01 spool]# ll -R 3839/|tail
3839/9:
total 8
drwx------ 2 simple simple 4096 Jan  2 15:12 cluster3839.proc9.subproc0
drwx------ 2 simple simple 4096 Jan  2 15:12 cluster3839.proc9.subproc0.tmp

3839/9/cluster3839.proc9.subproc0:
total 0

3839/9/cluster3839.proc9.subproc0.tmp:
total 0
[root@ce01 spool]# condor_ce_q 3839


-- Schedd: ce01.cat.cbpf.br : <10.0.0.10:5015> @ 01/02/20 17:39:36
OWNER  BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
simple ID: 3839     1/2  14:42      _      _     40     40 3839.0-39

Total for query: 40 jobs; 0 completed, 0 removed, 40 idle, 0 running, 0 held, 0 suspended
Total for all users: 2118 jobs; 1033 completed, 0 removed, 365 idle, 720 running, 0 held, 0 suspended

[root@ce01 spool]#
-----------------------------------------------------------------------------
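
For context, the jobs that do end up held can be listed together with their
hold reasons with a query along these lines (a sketch; JobStatus == 5 means
"Held", and the attribute names are the standard HTCondor ones):

-----------------------------------------------------------------------------
# List currently held jobs on the CE with their hold reason and code;
# the output columns follow the attribute order given here.
condor_ce_q -constraint 'JobStatus == 5' \
            -af ClusterId ProcId HoldReasonCode HoldReason
-----------------------------------------------------------------------------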

The HTCondor-CE thinks the 40 subjobs of job 3839 are waiting to run,
yet all their input files are absent, despite SchedLog messages like these:

-----------------------------------------------------------------------------
[root@ce01 condor-ce]# pwd
/var/log/condor-ce
[root@ce01 condor-ce]# grep -w 3839.9 SchedLog
01/02/20 14:42:41 (cid:503887) Submitting new job 3839.9
01/02/20 14:42:50 New job: 3839.9
01/02/20 14:42:50 Writing record to user logfile=/var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0/3839.9.log owner=simple
01/02/20 14:42:50 WriteUserLog::initialize: opened /var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0/3839.9.log successfully
01/02/20 14:42:50 New job: 3839.9, Duplicate Keys: 2, Total Keys: 18
01/02/20 14:43:08 (cid:503905) Transferring files for jobs 3839.0, 3839.1, 3839.2, 3839.3, 3839.4, 3839.5, 3839.6, 3839.7, 3839.8, 3839.9, 3839.10, 3839.11, 3839.12, 3839.13, 3839.14, 3839.15, 3839.16, 3839.17, 3839.18, 3839.19, 3839.20, 3839.21, 3839.22, 3839.23, 3839.24, 3839.25, 3839.26, 3839.27, 3839.28, 3839.29, 3839.30, 3839.31, 3839.32, 3839.33, 3839.34, 3839.35, 3839.36, 3839.37, 3839.38, 3839.39
01/02/20 14:43:17 generalJobFilesWorkerThread(): transfer files for job 3839.9
SUBMIT_UserLog = "/data/HTCondor/work/08D/18A/3839.9.log"
Iwd = "/var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0"
GlobalJobId = "ce01.cat.cbpf.br#3839.9#1577986961"
Environment = "HTCONDOR_JOBID=3839.9"
Err = "3839.9.err"
Out = "3839.9.out"
UserLog = "3839.9.log"
01/02/20 14:43:17 Sending GoAhead for 188.184.80.201 to send /var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0.tmp/tmpsNBYHw and all further files.
01/02/20 14:43:17 Received GoAhead from peer to receive /var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0.tmp/tmpsNBYHw and all further files.
01/02/20 14:43:18 get_file(): going to write to filename /var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0.tmp/DIRAC_RTwVjf_pilotwrapper.py
01/02/20 14:43:18 (cid:503905) Received proxy for job 3839.9
01/02/20 14:43:18 (cid:503905) proxy path: /var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0/tmpsNBYHw
01/02/20 14:43:51 No HoldReasonSubCode found for job 3839.9
01/02/20 14:43:51 Job 3839.9 released from hold: Data files spooled
[root@ce01 condor-ce]#
-----------------------------------------------------------------------------

So, it seems the files were there at some point, then mysteriously disappeared...
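
To reproduce the check, something along these lines compares what the job ad
still claims with what is actually left on disk (a sketch, using job 3839.9
from the excerpt above; Iwd, x509userproxy and TransferInput are standard
job ad attributes):

-----------------------------------------------------------------------------
# What the CE job ad says about the working directory, the proxy and
# the input files...
condor_ce_q 3839.9 -af Iwd x509userproxy TransferInput
# ...versus what is actually present in the spool directory:
ls -la "$(condor_ce_q 3839.9 -af Iwd)"
-----------------------------------------------------------------------------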

Meanwhile, HTCondor will eventually try to schedule the job, which fails
immediately because the proxy is absent. That failure is not treated as
fatal, so the job keeps being retried every so often, until after many
days it finally gets cleaned up, probably once it is known that the proxy
must have expired by then (LHCb proxies appear to have lifetimes of 7 days).
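
For what it's worth, the proxy lifetime can be checked along these lines
while the file is still present (a sketch; the path is the one reported in
the SchedLog excerpt above, and voms-proxy-info has to be available on the
CE):

-----------------------------------------------------------------------------
# Expiration time recorded in the job ad (Unix epoch seconds):
condor_ce_q 3839.9 -af x509userproxyexpiration
# Remaining lifetime (in seconds) of the spooled proxy file itself:
voms-proxy-info -file /var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0/tmpsNBYHw -timeleft
# The same information via plain openssl:
openssl x509 -in /var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0/tmpsNBYHw -noout -enddate
-----------------------------------------------------------------------------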

I have not yet found any logged activity that would explain why those files
were removed while the job itself was left around...
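
The kind of thing I am checking next looks like this (a sketch; the idea
that condor_preen might be removing files it considers orphaned under SPOOL
is purely a guess on my side):

-----------------------------------------------------------------------------
# Any mention of the vanished spool directory in the CE daemon logs?
grep -r 'cluster3839.proc9.subproc0' /var/log/condor-ce/
# How and how often does the CE run condor_preen, which may delete
# files it considers orphaned under SPOOL (again, just a guess):
condor_ce_config_val -v PREEN PREEN_INTERVAL PREEN_ARGS
-----------------------------------------------------------------------------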

Can anyone help?

Cheers,

Daniel Rodrigues de Silos Moraes
CBPF/MCTIC - Grid Site Manager