
[HTCondor-users] NETWORK_INTERFACE - is there an allow/deny equivalent?



Hi All

 

Just wondering if there is some way of having the equivalent of allow and deny for NETWORK_INTERFACE

 

Our situation:

 

Due to covid-19 and the major increase in staff working from home, our organisation has mandated that the default
machine allocated to new workers is now a laptop, rather than a desktop. They have also rolled out a desktop-to-laptop
replacement program for existing workers. Everyone has had, or will have, a desk/chair/webcam/dock/keyboard/mouse/wireless
headphones delivered to their house.

 

Previously we have only included desktops in our HTCondor pools. Our HTCondor deployment script now includes laptops. This presents
some issues we need to deal with, as we do NOT want laptops at home as part of the pools. This mostly works OK by using:

 

NETWORK_INTERFACE = xxx.yyy.*

 

where xxx.yyy.* is the internal IP subnet space of our organisation. If a laptop at home boots up it will initially have a "home" IP of
192.168.something, or maybe 10.0.something, so HTCondor will not even start.
Once this laptop connects to work via VPN, that is still OK. However, if HTCondor is then somehow started it will be
quite happy to join the pool, as it will have an IP in xxx.yyy.zzz.*, where xxx.yyy.zzz.* is the specific subnet for VPN connections.
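
(For what it is worth, machines in that state should be easy to spot from the Central Manager; assuming the standard Name and MyAddress machine attributes, something along the lines of

condor_status -autoformat Name MyAddress | grep 'xxx\.yyy\.zzz\.'

should list any execute nodes that have registered with a VPN address.)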

 

So ideally it would be nice to be able to do something like the following on the execute nodes:

 

NETWORK_INTERFACE_ALLOW = xxx.yyy.*

NETWORK_INTERFACE_DENY = xxx.yyy.zzz.*

 

We currently kludge around this by having the Central Manager Collector deny VPN IPs:

 

ALLOW_READ = xxx.yyy.*

ALLOW_WRITE = xxx.yyy.*

DENY_READ = xxx.yyy.zzz.*

DENY_WRITE = xxx.yyy.zzz.*

 

so that laptops with a VPN IP are not "seen" in the pool, even though they are running the HTCondor service.
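
If the ADVERTISE-level authorization settings accept the same host patterns as READ/WRITE (an assumption on my part; I have not tested this), presumably the deny could instead be scoped to just the startd advertisements on the Central Manager, so VPN machines could still query the pool but would never show up as execute slots:

ALLOW_ADVERTISE_STARTD = xxx.yyy.*
DENY_ADVERTISE_STARTD = xxx.yyy.zzz.*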

 

Thanks.

 

Cheers

 

Greg

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Todd Tannenbaum
Sent: Thursday, 11 February 2021 11:49 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; Jean-Claude CHEVALEYRE <jean-claude.chevaleyre@xxxxxxxxxxxxxxxxx>
Cc: Jean-Claude CHEVALEYRE <chevaleyre@xxxxxxxxxxxxxxxxx>
Subject: Re: [HTCondor-users] Job finished with status 115

 

On 2/11/2021 3:34 AM, Jean-Claude CHEVALEYRE wrote:

Hello,

I have some ATLAS jobs that are failing. I have looked in the log files.
I can see, for example, that job number 93742.0 finished with status 115. What exactly does this status mean?


Hi Jean-Claude,

Looking at your investigation below (thank you for including this), I think the confusion here is that the job did not exit with status 115.  The condor_shadow process (a component of the HTCondor service) exited with status 115, but that is not the job process.

To see the exit status for a job, you could look in the EventLog or use the condor_history command.

Below I see that you grepped the event log and there is a Job Terminate event for job 93742.0... the exit status for that job will appear in the next line.  In other words, events in the event log are multi-line, and thus your grep did not show it.
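
For example (assuming GNU grep, so the -A context option is available), something like the following should print the lines immediately after the terminate event, which include the job's return value:

      grep -A 3 '005 (93742.000.000)' condor/EventLog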

Alternatively, you can use the "condor_history" command.  This command is similar to condor_q, but allows you to see attributes about jobs that have left the queue (due to completion or removal).  From your submit machine enter the following to see the exitcode:

      condor_history 93742.0 -limit 1 -af exitcode

Or to see all attributes about this completed job do:

      condor_history 93742.0 -limit 1 -l

See the condor_history manual page  (man condor_history) for more options, and documentation about most of the available job attributes can be found in the Manual appendix here:
  https://htcondor.readthedocs.io/en/latest/classad-attributes/job-classad-attributes.html#job-classad-attributes

Hope the above helps,
Todd



Below are some extracts of the log output:

[root@gridarcce01 log]# grep -RH '93742' arc/arex-jobs* | more
arc/arex-jobs.log-20210211:2021-02-10 23:45:00 Finished - job id: 6PwKDm5cYTynOUEdEnzo691oABFKDmABFKDmzcfXDmDBFKDmDTZXHm, unix user: 41000:1307, name: "arc_pilot", owner: "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN
=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1", lrms: condor, queue: grid, lrmsid: 93742.gridarcce01


[root@gridarcce01 log]# grep -RH '93742' condor/EventLog | more

condor/EventLog:        937428  -  ResidentSetSize of job (KB)
condor/EventLog:006 (24968.000.000) 12/18 10:32:49 Image size of job updated: 937424
condor/EventLog:006 (26125.000.000) 12/19 11:22:07 Image size of job updated: 937424
condor/EventLog:006 (26254.000.000) 12/19 16:32:57 Image size of job updated: 937424
condor/EventLog:006 (26254.000.000) 12/19 16:37:57 Image size of job updated: 937424
condor/EventLog:        937424  -  ResidentSetSize of job (KB)
condor/EventLog:        937420  -  ResidentSetSize of job (KB)
condor/EventLog:006 (71776.000.000) 01/21 00:35:38 Image size of job updated: 937428
condor/EventLog:006 (73442.000.000) 01/22 02:29:37 Image size of job updated: 937428
condor/EventLog:        937428  -  ResidentSetSize of job (KB)
condor/EventLog:006 (78058.000.000) 01/26 02:56:24 Image size of job updated: 937428
condor/EventLog:000 (93742.000.000) 02/09 04:12:28 Job submitted from host: <193.55.252.153:9618?addrs=193.55.252.153-9618&noUDP&sock=3115801_e73c_4>
condor/EventLog:001 (93742.000.000) 02/09 19:03:03 Job executing on host: <193.55.252.169:9618?addrs=193.55.252.169-9618&noUDP&sock=2279_c86d_3>
condor/EventLog:006 (93742.000.000) 02/09 19:03:11 Image size of job updated: 2304
condor/EventLog:006 (93742.000.000) 02/09 19:08:11 Image size of job updated: 67160
condor/EventLog:006 (93742.000.000) 02/09 19:13:12 Image size of job updated: 110340
condor/EventLog:006 (93742.000.000) 02/09 19:18:13 Image size of job updated: 1410420
condor/EventLog:006 (93742.000.000) 02/09 19:23:13 Image size of job updated: 1887892
condor/EventLog:006 (93742.000.000) 02/09 19:33:15 Image size of job updated: 1887892
condor/EventLog:005 (93742.000.000) 02/10 23:38:21 Job terminated.


condor/ShadowLog.old:02/10/21 11:43:04 (93742.0) (3863434): Time to redelegate short-lived proxy to starter.
condor/ShadowLog.old:02/10/21 23:38:21 (93742.0) (3863434): File transfer completed successfully.
condor/ShadowLog.old:02/10/21 23:38:21 (93742.0) (3863434): Job 93742.0 terminated: exited with status 0
condor/ShadowLog.old:02/10/21 23:38:21 (93742.0) (3863434): WriteUserLog checking for event log rotation, but no lock
condor/ShadowLog.old:02/10/21 23:38:21 (93742.0) (3863434): **** condor_shadow (condor_SHADOW) pid 3863434 EXITING WITH STATUS 115


[root@gridarcce01 log]# grep -RH '93742' condor/SchedLog | more
condor/SchedLog:02/10/21 23:38:21 (pid:3115849) Shadow pid 3863434 for job 93742.0 exited with status 115
condor/SchedLog:02/10/21 23:38:21 (pid:3115849) Match record (slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx <193.55.252.169:9618?addrs=193.55.252.169-9618&noUDP&sock=2279_c86d_3> for group_ATLAS.atlasprd_score.atlasprd, 93742.0) deleted

 

Any ideas are welcome.

Thanks

Jean-Claude

 

------------------------------------------------------------------------
Jean-Claude Chevaleyre < Jean-Claude.Chevaleyre(at)clermont.in2p3.fr >
Laboratoire de Physique Clermont
Campus Universitaire des Cézeaux
4 Avenue Blaise Pascal
TSA 60026
CS 60026
63178 Aubière Cedex

Tel : 04 73 40 73 60

-------------------------------------------------------------------------



_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
 
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/