
Re: [HTCondor-users] NETWORK_INTERFACE - is there an allow/deny equivalent?



Hi Greg,

 

We don't currently have anything like _ALLOW or _DENY for NETWORK_INTERFACE.  And actually, I think your approach of enforcing this at the Central Manager is the better solution, as it prevents people from potentially running their own personal HTCondor, with their own configuration, on the laptop (through the VPN).

 

If HTCondor is sitting idle on the laptop, I don't believe it would be using a lot of resources, but it would still be attempting to send updates every five minutes, so you probably don't want it to be running at all.

 

My best suggestion there is to put something like this in condor_config:

 

MASTER.DAEMON_SHUTDOWN = (regexp("xxx\.yyy\.zzz\.", MyAddress))

 

It's a bit of a hack, maybe, but it seems to work in a quick test.  Hope that helps!
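
Spelled out a little more (still just a sketch of the same quick-test idea, with the VPN prefix written as xxx.yyy.zzz. as above):

    # In the laptops' condor_config (or a local config file it includes):
    # MyAddress is the daemon's own address string, e.g.
    # "<xxx.yyy.zzz.12:9618?addrs=...>", so a regexp match means the master
    # came up on the VPN subnet.  When DAEMON_SHUTDOWN evaluates to True,
    # the condor_master shuts itself and its child daemons down.
    MASTER.DAEMON_SHUTDOWN = (regexp("xxx\.yyy\.zzz\.", MyAddress))

On a given machine, condor_config_val MASTER.DAEMON_SHUTDOWN should echo the expression back, which is a quick way to confirm the setting was picked up.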

 

 

Cheers,

-zach

 

 

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Hitchen, Greg (IM&T, Kensington WA) <Greg.Hitchen@xxxxxxxx>
Date: Wednesday, February 17, 2021 at 8:26 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] NETWORK_INTERFACE - is there an allow/deny equivalent?

Hi All

 

Just wondering if there is some way to have the equivalent of allow and deny for NETWORK_INTERFACE.

 

Our situation:

 

Due to COVID-19 and the major increase in staff working from home, our organisation has mandated that the default machine allocated to new workers is now a laptop, rather than a desktop. They have also rolled out a desktop-to-laptop replacement program for existing workers. Everyone has had, or will have, a desk/chair/webcam/dock/keyboard/mouse/wireless headphones delivered to their house.

 

Previously we have only included desktops in our HTCondor pools. Our HTCondor deployment script now includes laptops. This presents some issues we need to deal with, as we do NOT want laptops at home as part of the pools. This mostly works OK by using:

 

NETWORK_INTERFACE = xxx.yyy.*

 

where xxx.yyy.* is the internal IP subnet space of our organisation. If a laptop at home boots up, it will initially have a “home” IP of 192.168.something, or maybe 10.0.something, so HTCondor will not even start.

Once this laptop connects to work via VPN, that’s still OK. However, if HTCondor is then somehow started, it will quite happily join the pool, as it will have an xxx.yyy.zzz.* IP, where xxx.yyy.zzz.* is the specific subnet for VPN connections.
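
(As an aside, an easy way to check what a particular laptop is actually configured with is:

    condor_config_val NETWORK_INTERFACE

and then compare the pattern it prints against the machine’s current IP address to predict whether the daemons will start.)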

 

So ideally it would be nice to be able to do something like the following on the execute nodes:

 

NETWORK_INTERFACE_ALLOW = xxx.yyy.*

NETWORK_INTERFACE_DENY = xxx.yyy.zzz.*

 

We currently kludge around this by having the Central Manager Collector deny VPN IPs:

 

ALLOW_READ = xxx.yyy.*

ALLOW_WRITE = xxx.yyy.*

DENY_READ = xxx.yyy.zzz.*

DENY_WRITE = xxx.yyy.zzz.*

 

so that laptops with a VPN IP are not “seen” in the pool, even though they are running the HTCondor service.
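
A quick sanity check from the Central Manager side (just a sketch using standard condor_status options; Name and MyAddress are ordinary machine-ad attributes) is to list the addresses the Collector knows about and grep for the VPN prefix, where no output means no VPN-addressed machines made it into the pool:

    condor_status -af Name MyAddress | grep 'xxx\.yyy\.zzz\.'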

 

Thanks.

 

Cheers

 

Greg

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Todd Tannenbaum

Sent: Thursday, 11 February 2021 11:49 PM

To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; Jean-Claude CHEVALEYRE <jean-claude.chevaleyre@xxxxxxxxxxxxxxxxx>

Cc: Jean-Claude CHEVALEYRE <chevaleyre@xxxxxxxxxxxxxxxxx>

Subject: Re: [HTCondor-users] Job finished with status 115

 

 

 

On 2/11/2021 3:34 AM, Jean-Claude CHEVALEYRE wrote:

 

 

Hello,

 

I have some Atlas jobs that are failing. I have looked in the log files.

I can see, for example, that job number 93742.0 finished with a status of 115. What exactly does this status mean?

 

 

 

 

 

Hi Jean-Claude,

 

Looking at your investigation below (thank you for including this), I think the confusion here is that the job did not exit with a status of 115.  The condor_shadow process (a component of the HTCondor service) exited with status 115, but that is not the job process.

 

To see the exit status for a job, you could look in the EventLog or use the condor_history command.

 

Below I see that you grepped the event log and there is a Job Terminated event for job 93742.0... the exit status for that job appears in the lines immediately after it.  In other words, events in the event log span multiple lines, and thus your grep did not show it.
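
For example, asking grep for a few lines of trailing context makes the return value visible (a quick sketch, reusing the EventLog path from your grep below):

      # -A 3 prints three lines of context after each match; the line
      # "(1) Normal termination (return value N)" follows the
      # "Job terminated." event header.
      grep -A 3 '005 (93742.000.000)' condor/EventLog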

 

Alternatively, you can use the "condor_history" command.  This command is similar to condor_q, but allows you to see attributes of jobs that have left the queue (due to completion or removal).  From your submit machine, enter the following to see the exit code:

 

      condor_history 93742.0 -limit 1 -af exitcode

 

Or to see all attributes about this completed job do:

 

      condor_history 93742.0 -limit 1 -l
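
If you suspect the job was killed by a signal rather than exiting on its own, the related attributes can be pulled the same way (ExitCode, ExitBySignal, and ExitSignal are standard job attributes; ExitCode is only defined when the job exited normally):

      condor_history 93742.0 -limit 1 -af ExitCode ExitBySignal ExitSignal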

 

See the condor_history manual page (man condor_history) for more options; documentation for most of the available job attributes can be found in the Manual appendix here:

 

Hope the above helps,

Todd

 

 

 

 

 

Below are some extracts of the log outputs:

 

[root@gridarcce01 log]# grep -RH '93742' arc/arex-jobs* | more

arc/arex-jobs.log-20210211:2021-02-10 23:45:00 Finished - job id: 6PwKDm5cYTynOUEdEnzo691oABFKDmABFKDmzcfXDmDBFKDmDTZXHm, unix user: 41000:1307, name: "arc_pilot", owner: "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1", lrms: condor, queue: grid, lrmsid: 93742.gridarcce01

 

 

[root@gridarcce01 log]# grep -RH '93742' condor/EventLog | more

 

condor/EventLog:        937428  -  ResidentSetSize of job (KB)

condor/EventLog:006 (24968.000.000) 12/18 10:32:49 Image size of job updated: 937424

condor/EventLog:006 (26125.000.000) 12/19 11:22:07 Image size of job updated: 937424

condor/EventLog:006 (26254.000.000) 12/19 16:32:57 Image size of job updated: 937424

condor/EventLog:006 (26254.000.000) 12/19 16:37:57 Image size of job updated: 937424

condor/EventLog:        937424  -  ResidentSetSize of job (KB)

condor/EventLog:        937420  -  ResidentSetSize of job (KB)

condor/EventLog:006 (71776.000.000) 01/21 00:35:38 Image size of job updated: 937428

condor/EventLog:006 (73442.000.000) 01/22 02:29:37 Image size of job updated: 937428

condor/EventLog:        937428  -  ResidentSetSize of job (KB)

condor/EventLog:006 (78058.000.000) 01/26 02:56:24 Image size of job updated: 937428

condor/EventLog:000 (93742.000.000) 02/09 04:12:28 Job submitted from host: <193.55.252.153:9618?addrs=193.55.252.153-9618&noUDP&sock=3115801_e73c_4>

condor/EventLog:001 (93742.000.000) 02/09 19:03:03 Job executing on host: <193.55.252.169:9618?addrs=193.55.252.169-9618&noUDP&sock=2279_c86d_3>

condor/EventLog:006 (93742.000.000) 02/09 19:03:11 Image size of job updated: 2304

condor/EventLog:006 (93742.000.000) 02/09 19:08:11 Image size of job updated: 67160

condor/EventLog:006 (93742.000.000) 02/09 19:13:12 Image size of job updated: 110340

condor/EventLog:006 (93742.000.000) 02/09 19:18:13 Image size of job updated: 1410420

condor/EventLog:006 (93742.000.000) 02/09 19:23:13 Image size of job updated: 1887892

condor/EventLog:006 (93742.000.000) 02/09 19:33:15 Image size of job updated: 1887892

condor/EventLog:005 (93742.000.000) 02/10 23:38:21 Job terminated.

 

 

condor/ShadowLog.old:02/10/21 11:43:04 (93742.0) (3863434): Time to redelegate short-lived proxy to starter.

condor/ShadowLog.old:02/10/21 23:38:21 (93742.0) (3863434): File transfer completed successfully.

condor/ShadowLog.old:02/10/21 23:38:21 (93742.0) (3863434): Job 93742.0 terminated: exited with status 0

condor/ShadowLog.old:02/10/21 23:38:21 (93742.0) (3863434): WriteUserLog checking for event log rotation, but no lock

condor/ShadowLog.old:02/10/21 23:38:21 (93742.0) (3863434): **** condor_shadow (condor_SHADOW) pid 3863434 EXITING WITH STATUS 115

 

 

[root@gridarcce01 log]# grep -RH '93742' condor/SchedLog | more

condor/SchedLog:02/10/21 23:38:21 (pid:3115849) Shadow pid 3863434 for job 93742.0 exited with status 115

condor/SchedLog:02/10/21 23:38:21 (pid:3115849) Match record (slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx <193.55.252.169:9618?addrs=193.55.252.169-9618&noUDP&sock=2279_c86d_3> for group_ATLAS.atlasprd_score.atlasprd, 93742.0) deleted

 

 

 

Any ideas are welcome.

 

 

 

Thanks

 

Jean-Claude

 

 

 

------------------------------------------------------------------------

Jean-Claude Chevaleyre < Jean-Claude.Chevaleyre(at)clermont.in2p3.fr >

Laboratoire de Physique Clermont

Campus Universitaire des Cézeaux

4 Avenue Blaise Pascal

TSA 60026

CS 60026

63178 Aubière Cedex

 

Tel : 04 73 40 73 60

 

-------------------------------------------------------------------------

 

 

 

 

 

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/