
Re: [HTCondor-users] HTCondor-CE held job due to expired user proxy



Hi Daniel,

What kind of filesystem does the SPOOL directory live on? Are you seeing 
this for all jobs or just some jobs?
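
For example, the following should show where SPOOL lives and what
filesystem type backs it (paths below assume the default HTCondor-CE
layout; adjust if yours differs):

  condor_ce_config_val SPOOL
  df -T /var/lib/condor-ce/spool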

HTCondor-CE and the HTCondor schedd on the same host share the SPOOL 
directory. So if you're not seeing anything in the CE logs, you may also 
be able to glean some more information in /var/log/condor/SchedLog, 
looking for the routed job IDs.
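
For example, something like this should map the CE job to its routed
job and pull the relevant schedd-side log lines (RoutedToJobId is the
attribute the CE's job router records on the source job):

  # which local schedd job did the CE route 3839.9 to?
  condor_ce_q 3839.9 -af RoutedToJobId
  # then search the batch schedd's log for that ID
  grep '<routed job id>' /var/log/condor/SchedLog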

Thanks,
Brian

On 1/9/20 10:09 AM, Daniel Rodrigues de Silos Moraes wrote:
>
> Dear all,
>
> We are experiencing issues following our recent migration to HTCondor
> in the CBPF GRID. Often, multiple jobs end up held by the HTCondor-CE
> because of an expired user proxy. This strange behavior shows up in
> LHCb jobs, as can be seen below:
>
> [root@ce01 spool]# pwd
> /var/lib/condor-ce/spool
> [root@ce01 spool]# ll -R 3839/|tail
> 3839/9:
> total 8
> drwx------ 2 simple simple 4096 Jan 2 15:12 cluster3839.proc9.subproc0
> drwx------ 2 simple simple 4096 Jan 2 15:12 cluster3839.proc9.subproc0.tmp
>
> 3839/9/cluster3839.proc9.subproc0:
> total 0
>
> 3839/9/cluster3839.proc9.subproc0.tmp:
> total 0
> [root@ce01 spool]# condor_ce_q 3839
>
>
> -- Schedd: ce01.cat.cbpf.br : <10.0.0.10:5015> @ 01/02/20 17:39:36
> OWNER  BATCH_NAME   SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
> simple ID: 3839     1/2  14:42      _      _     40     40 3839.0-39
>
> Total for query: 40 jobs; 0 completed, 0 removed, 40 idle, 0 running, 
> 0 held, 0 suspended
> Total for all users: 2118 jobs; 1033 completed, 0 removed, 365 idle, 
> 720 running, 0 held, 0 suspended
>
> [root@ce01 spool]#
> ----------------------------------------------------------------------------- 
>
>
> The HTCondor-CE thinks the 40 subjobs of cluster 3839 are waiting to
> run, yet all their input files are absent, despite log messages like
> these:
>
> ----------------------------------------------------------------------------- 
>
> [root@ce01 condor-ce]# pwd
> /var/log/condor-ce
> [root@ce01 condor-ce]# grep -w 3839.9 SchedLog
> 01/02/20 14:42:41 (cid:503887) Submitting new job 3839.9
> 01/02/20 14:42:50 New job: 3839.9
> 01/02/20 14:42:50 Writing record to user logfile=/var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0/3839.9.log owner=simple
> 01/02/20 14:42:50 WriteUserLog::initialize: opened /var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0/3839.9.log successfully
> 01/02/20 14:42:50 New job: 3839.9, Duplicate Keys: 2, Total Keys: 18
> 01/02/20 14:43:08 (cid:503905) Transferring files for jobs 3839.0, 
> 3839.1, 3839.2, 3839.3, 3839.4, 3839.5, 3839.6, 3839.7, 3839.8, 
> 3839.9, 3839.10, 3839.11, 3839.12, 3839.13, 3839.14, 3839.15, 3839.16, 
> 3839.17, 3839.18, 3839.19, 3839.20, 3839.21, 3839.22, 3839.23, 
> 3839.24, 3839.25, 3839.26, 3839.27, 3839.28, 3839.29, 3839.30, 
> 3839.31, 3839.32, 3839.33, 3839.34, 3839.35, 3839.36, 3839.37, 
> 3839.38, 3839.39
> 01/02/20 14:43:17 generalJobFilesWorkerThread(): transfer files for job 3839.9
> SUBMIT_UserLog = "/data/HTCondor/work/08D/18A/3839.9.log"
> Iwd = "/var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0"
> GlobalJobId = "ce01.cat.cbpf.br#3839.9#1577986961"
> Environment = "HTCONDOR_JOBID=3839.9"
> Err = "3839.9.err"
> Out = "3839.9.out"
> UserLog = "3839.9.log"
> 01/02/20 14:43:17 Sending GoAhead for 188.184.80.201 to send /var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0.tmp/tmpsNBYHw and all further files.
> 01/02/20 14:43:17 Received GoAhead from peer to receive /var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0.tmp/tmpsNBYHw and all further files.
> 01/02/20 14:43:18 get_file(): going to write to filename /var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0.tmp/DIRAC_RTwVjf_pilotwrapper.py
> 01/02/20 14:43:18 (cid:503905) Received proxy for job 3839.9
> 01/02/20 14:43:18 (cid:503905) proxy path: /var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0/tmpsNBYHw
> 01/02/20 14:43:51 No HoldReasonSubCode found for job 3839.9
> 01/02/20 14:43:51 Job 3839.9 released from hold: Data files spooled
> [root@ce01 condor-ce]#
> ----------------------------------------------------------------------------- 
>
>
> So, it seems the files were there at some point, then mysteriously 
> disappeared...
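>
> A stat of the spool directory should at least show when its contents
> last changed, e.g.:
>
> stat /var/lib/condor-ce/spool/3839/9/cluster3839.proc9.subproc0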
>
> Meanwhile, HTCondor will eventually try to schedule the job, which
> fails immediately because the proxy is absent. Since that is not
> treated as a fatal error, it keeps retrying after a while, until
> after many days the job finally gets cleaned up, probably once it is
> known that the proxy must have expired (LHCb proxies appear to have
> lifetimes of 7 days).
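>
> For reference, a check along these lines should show when the spooled
> proxy for a job would expire (assuming the job ad carries the usual
> x509UserProxyExpiration attribute):
>
> condor_ce_q 3839.9 -af x509UserProxyExpiration
> date -d @<epoch value printed above>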
>
> I have not yet found any logged activity that would explain why those
> files were removed while the job was left around...
>
> Can anyone help?
>
> Cheers,
>
> Daniel Rodrigues de Silos Moraes
> CBPF/MCTIC - Grid Site Manager