Thanks Zach, I'll give your suggestion a try.
So many of my scripts/programs include kludges as workarounds for certain situations!
I'd be surprised if any piece of code doesn't have a hack in it somewhere.
We don't currently have anything like _ALLOW or _DENY for NETWORK_INTERFACE. And actually, I think your solution of enforcing this at the Central Manager is the better solution, as it prevents people from potentially running their own personal HTCondor with their own configuration on the laptop (through the VPN).
If HTCondor is sitting idle on the laptop, I don't believe it would be using a lot of resources but it would still be attempting to send updates every five minutes, so you probably don't want it to be running at all.
My best suggestion there is to put something like this in condor_config:
MASTER.DAEMON_SHUTDOWN = (regexp("xxx\.yyy\.zzz\.", MyAddress))
It's a bit of a hack maybe, but seems to work in a quick test. Hope that helps!
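For anyone wanting to sanity-check the pattern before rolling it out, here is a minimal Python sketch of what the expression is meant to do. It assumes that regexp() matches anywhere in the string, and that MyAddress looks like the "sinful strings" seen in HTCondor logs. The 10.20.30. prefix and the sample addresses are made-up stand-ins for xxx.yyy.zzz., not real values.

```python
import re

# Hypothetical VPN subnet prefix standing in for xxx.yyy.zzz. (an assumption).
VPN_PATTERN = re.compile(r"10\.20\.30\.")

# Illustrative MyAddress values in "sinful string" form (made up for this sketch).
addresses = {
    "office desktop": "<10.20.7.15:9618?addrs=10.20.7.15-9618&noUDP&sock=master_1234_ab>",
    "laptop on VPN": "<10.20.30.44:9618?addrs=10.20.30.44-9618&noUDP&sock=master_5678_cd>",
}


def would_shut_down(my_address: str) -> bool:
    """Approximate regexp(pattern, MyAddress): true if the pattern matches anywhere."""
    return VPN_PATTERN.search(my_address) is not None


for label, addr in addresses.items():
    print(f"{label}: shutdown={would_shut_down(addr)}")
```

Since the pattern is unanchored, an address such as 210.20.30.4 would also match in this sketch. Anchoring on the leading "<" of the address (for example ^<10\.20\.30\.) may be worth testing if that matters in your address space.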
Just wondering if there is some way of having the equivalent of allow and deny for NETWORK_INTERFACE
Due to COVID-19 and the major increase in staff working from home, our organisation has mandated that the default
machine allocated to new workers is now a laptop, rather than a desktop. They have also rolled out a desktop-to-laptop
replacement program for existing workers. Everyone has had, or will have, a desk/chair/webcam/dock/keyboard/mouse/wireless headphones
delivered to their house.
Previously we have only included desktops in our HTCondor pools. Our HTCondor deployment script now includes laptops. This presents
some issues we need to deal with as we do NOT want laptops at home as part of the pools. This can work mostly OK by using:
NETWORK_INTERFACE = xxx.yyy.*
where xxx.yyy.* is the internal IP subnet space of our organisation. If a laptop at home boots up it will initially have a "home" IP of
192.168.something, or maybe 10.0.something. So HTCondor will not even start.
Once this laptop connects to work via VPN, then that's still OK. However if HTCondor is then somehow started it will be
quite happy to join the pool, as it will have an IP xxx.yyy.zzz.*, where xxx.yyy.zzz.* is the specific subnet for VPN connections.
So ideally it would be nice to be able to do something like the following on the execute nodes:
NETWORK_INTERFACE_ALLOW = xxx.yyy.*
NETWORK_INTERFACE_DENY = xxx.yyy.zzz.*
We currently kludge around this by having the Central Manager Collector deny VPN IPs:
ALLOW_READ = xxx.yyy.*
ALLOW_WRITE = xxx.yyy.*
DENY_READ = xxx.yyy.zzz.*
DENY_WRITE = xxx.yyy.zzz.*
so that laptops with a VPN IP are not "seen" in the pool, even though they are running the HTCondor service.
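The intended effect of the ALLOW/DENY lists above can be sketched in a few lines of Python. This is only an illustration of the logic, assuming that a DENY match takes precedence over an ALLOW match. It is not HTCondor's actual implementation, and the 10.20.* ranges are hypothetical stand-ins for xxx.yyy.* and xxx.yyy.zzz.*.

```python
from fnmatch import fnmatchcase

# Hypothetical stand-ins for xxx.yyy.* and xxx.yyy.zzz.* (assumptions).
ALLOW = ["10.20.*"]
DENY = ["10.20.30.*"]


def is_authorized(ip: str, allow=ALLOW, deny=DENY) -> bool:
    """Return True if ip matches an allow pattern and no deny pattern."""
    # Deny is checked first, so a VPN address is rejected even though
    # it also falls inside the wider allowed range.
    if any(fnmatchcase(ip, pattern) for pattern in deny):
        return False
    return any(fnmatchcase(ip, pattern) for pattern in allow)


for ip in ("10.20.7.15", "10.20.30.44", "192.168.1.10"):
    print(ip, "->", "allowed" if is_authorized(ip) else "denied")
```

In this sketch the office address is allowed, the VPN address is denied by the more specific pattern, and the home address is denied because it matches nothing in the allow list.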
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Todd Tannenbaum
Sent: Thursday, 11 February 2021 11:49 PM
Cc: Jean-Claude CHEVALEYRE <chevaleyre@xxxxxxxxxxxxxxxxx>
Subject: Re: [HTCondor-users] Job finished with status 115
On 2/11/2021 3:34 AM, Jean-Claude CHEVALEYRE wrote:
I have some ATLAS jobs that are failing. I have looked in the log files.
For example, for job number 93742.0, I can see that the job finished with a status 115. What exactly does this status mean?
Looking at your investigation below (thank you for including this), I think the confusion here is the job did not exit with a status 115. The condor_shadow process (a component of the HTCondor service) exited with a status 115, but that is not the job process.
To see the exit status for a job, you could look in the EventLog or use the condor_history command.
Below I see that you grepped the event log and there is a Job Terminate event for job 93742.0... the exit status for that job will appear in the next line. In other words, events in the event log are multi-line, and thus your grep did not show it.
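As a rough illustration of why a plain grep misses the exit status, here is a small Python sketch that collects whole events instead of single lines. It assumes each event ends with a line containing only "...". The sample text is made up for this sketch in the usual event log layout and is not copied from the log in question. With GNU grep, adding -A with a small number of context lines is another way to see the lines that follow a match.

```python
# Illustrative event log text (made up for this sketch, not real log output).
SAMPLE_LOG = """\
006 (24968.000.000) 12/18 10:32:49 Image size of job updated: 937424
\t937424  -  ResidentSetSize of job (KB)
...
005 (93742.000.000) 02/10 23:38:21 Job terminated.
\t(1) Normal termination (return value 0)
...
"""


def events_for_job(log_text: str, job_id: str) -> list[str]:
    """Return complete multi-line events whose header line names job_id."""
    events, current = [], []
    for line in log_text.splitlines():
        if line.strip() == "...":
            # End of one event: keep it only if its first line is for this job.
            if current and f"({job_id})" in current[0]:
                events.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # Any trailing lines without a closing "..." are ignored in this sketch.
    return events


for event in events_for_job(SAMPLE_LOG, "93742.000.000"):
    print(event)
```

Matching on the parenthesised job ID in the header line also avoids the false hits that a bare grep for 93742 picks up from unrelated lines, such as image sizes of 937424.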
Alternatively, you can use the "condor_history" command. This command is similar to condor_q, but allows you to see attributes about jobs that have left the queue (due to completion or removal). From your submit machine enter the following to see the exitcode:
condor_history 93742.0 -limit 1 -af exitcode
Or to see all attributes about this completed job do:
condor_history 93742.0 -limit 1 -l
See the condor_history manual page (man condor_history) for more options, and documentation about most of the available job attributes can be found in the Manual appendix here:
Hope the above helps,
Below are some extracts of the log output:
[root@gridarcce01 log]# grep -RH '93742' arc/arex-jobs* | more
arc/arex-jobs.log-20210211:2021-02-10 23:45:00 Finished - job id: 6PwKDm5cYTynOUEdEnzo691oABFKDmABFKDmzcfXDmDBFKDmDTZXHm, unix user: 41000:1307, name: "arc_pilot", owner: "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1", lrms: condor, queue: grid, lrmsid: 93742.gridarcce01
[root@gridarcce01 log]# grep -RH '93742' condor/EventLog | more
condor/EventLog: 937428 - ResidentSetSize of job (KB)
condor/EventLog:006 (24968.000.000) 12/18 10:32:49 Image size of job updated: 937424
condor/EventLog:006 (26125.000.000) 12/19 11:22:07 Image size of job updated: 937424
condor/EventLog:006 (26254.000.000) 12/19 16:32:57 Image size of job updated: 937424
condor/EventLog:006 (26254.000.000) 12/19 16:37:57 Image size of job updated: 937424
condor/EventLog: 937424 - ResidentSetSize of job (KB)
condor/EventLog: 937420 - ResidentSetSize of job (KB)
condor/EventLog:006 (71776.000.000) 01/21 00:35:38 Image size of job updated: 937428
condor/EventLog:006 (73442.000.000) 01/22 02:29:37 Image size of job updated: 937428
condor/EventLog: 937428 - ResidentSetSize of job (KB)
condor/EventLog:006 (78058.000.000) 01/26 02:56:24 Image size of job updated: 937428
condor/EventLog:000 (93742.000.000) 02/09 04:12:28 Job submitted from host: <22.214.171.124:9618?addrs=126.96.36.199-9618&noUDP&sock=3115801_e73c_4>
condor/EventLog:001 (93742.000.000) 02/09 19:03:03 Job executing on host: <188.8.131.52:9618?addrs=184.108.40.206-9618&noUDP&sock=2279_c86d_3>
condor/EventLog:006 (93742.000.000) 02/09 19:03:11 Image size of job updated: 2304
condor/EventLog:006 (93742.000.000) 02/09 19:08:11 Image size of job updated: 67160
condor/EventLog:006 (93742.000.000) 02/09 19:13:12 Image size of job updated: 110340
condor/EventLog:006 (93742.000.000) 02/09 19:18:13 Image size of job updated: 1410420
condor/EventLog:006 (93742.000.000) 02/09 19:23:13 Image size of job updated: 1887892
condor/EventLog:006 (93742.000.000) 02/09 19:33:15 Image size of job updated: 1887892
condor/EventLog:005 (93742.000.000) 02/10 23:38:21 Job terminated.
condor/ShadowLog.old:02/10/21 11:43:04 (93742.0) (3863434): Time to redelegate short-lived proxy to starter.
condor/ShadowLog.old:02/10/21 23:38:21 (93742.0) (3863434): File transfer completed successfully.
condor/ShadowLog.old:02/10/21 23:38:21 (93742.0) (3863434): Job 93742.0 terminated: exited with status 0
condor/ShadowLog.old:02/10/21 23:38:21 (93742.0) (3863434): WriteUserLog checking for event log rotation, but no lock
condor/ShadowLog.old:02/10/21 23:38:21 (93742.0) (3863434): **** condor_shadow (condor_SHADOW) pid 3863434 EXITING WITH STATUS 115
[root@gridarcce01 log]# grep -RH '93742' condor/SchedLog | more
condor/SchedLog:02/10/21 23:38:21 (pid:3115849) Shadow pid 3863434 for job 93742.0 exited with status 115
condor/SchedLog:02/10/21 23:38:21 (pid:3115849) Match record (slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx <220.127.116.11:9618?addrs=18.104.22.168-9618&noUDP&sock=2279_c86d_3> for group_ATLAS.atlasprd_score.atlasprd, 937
Any ideas are welcome.
Jean-Claude Chevaleyre < Jean-Claude.Chevaleyre(at)clermont.in2p3.fr >
Laboratoire de Physique Clermont
Campus Universitaire des Cézeaux
4 Avenue Blaise Pascal
63178 Aubière Cedex
Tel : 04 73 40 73 60
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/