
Re: [HTCondor-users] condor_off broken?



On Wed, Nov 27, 2013 at 01:05:40PM +0100, Pek Daniel wrote:
> So now I have some more information:
> 
> The condor_off command and friends won't work if the machine's hostname is
> set to condorworker02. It has to be set to condorworker02.domain.tld.
> 
> The question: why is that?

In my *opinion*, this should work.

But clearly, it does not.  I will need to investigate the code, but my general
feeling is that at some point the tool (condor_off in this case) gets clever
and "promotes" the short host name to the long host name.  That long name then
doesn't match the name the master advertised to the collector, as you can see
in your condor_status -master output, and you get your "Daemon not found" error.
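
To make the suspected mismatch concrete, here is a minimal Python sketch.  It is
NOT HTCondor code: promote_to_fqdn and find_master are hypothetical helpers, and
the exact-match lookup is only an assumption about what the tool does.  The
advertised names are the ones from the condor_status -master output quoted
below.

```python
# Hypothetical illustration of the suspected behavior; not HTCondor source.

def promote_to_fqdn(name, default_domain):
    """Append the default domain when the name contains no dot (assumed)."""
    return name if "." in name else f"{name}.{default_domain}"


def find_master(requested, advertised_names, default_domain):
    """Return the advertised Name that exactly equals the promoted name,
    or None when the collector has no such entry."""
    target = promote_to_fqdn(requested, default_domain)
    return target if target in advertised_names else None


# Master names as they appear in the collector in this thread.
advertised = {
    "condormaster1",
    "condormaster2",
    "condorworker02",        # VM: advertised under its short hostname
    "lxbrb1815.domain.tld",  # physical node: advertised under its FQDN
}

print(find_master("lxbrb1815", advertised, "domain.tld"))       # lxbrb1815.domain.tld
print(find_master("condorworker02", advertised, "domain.tld"))  # None
```

Under these assumptions the physical node is found because its advertised name
already is the long name, while the VM is missed because only the short name is
in the collector.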


Cheers,
-zach



> 
> 2013/11/26 Pek Daniel <pekdaniel@xxxxxxxxx>
> 
>     OK, here is the problem in a bit more detail:
> 
>     I'm using this version:
>     [root@lxbrb1815 ~]# condor_version
>     $CondorVersion: 8.1.2 Oct 19 2013 BuildID: 189797 $
>     $CondorPlatform: x86_64_RedHat5 $
> 
>     Here's a snippet from condor_status -master output:
>     [root@condormaster1 ~]# condor_status -master
>     Name                
> 
>     condormaster1       
>     condormaster2       
>     condorworker02      
>     lxbrb1815.domain.tld   
>     ...
> 
>     I have physical nodes and VMs as startd nodes. Physical nodes have more
>     than one core, and so more than one job slot, while VMs have only one core.
> 
>     Here's a snippet from condor_status -startd | head:
>     Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
> 
>     condorworker02     LINUX      X86_64 Claimed   Busy      0.000  490  0+00:03:13
>     slot1@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.060 1991  0+00:11:51
>     slot2@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:13
>     ...
> 
>     As you can see, condorworker02 is a VM, while lxbrb1815.domain.tld is a
>     physical node with a lot of cores. And that's the only difference. The
>     config file is exactly the same for both cases, and the condor version as
>     well.
> 
>     Now, my questions:
>     - Why do I see slotID@xxxxxxxxxxxxxxxxxxxxxx for physical nodes but just
>     the hostname for VMs?
>     - Why can't I query the status of a VM, when the same query works for a
>     physical node:
> 
>     [root@condormaster1 ~]# condor_status -startd lxbrb1815
>     Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
> 
>     slot1@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.060 1991  0+00:11:51
>     slot2@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:13
>     slot3@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:14
>     slot4@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:15
>     slot5@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:16
>     slot6@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:17
>     slot7@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:18
>     slot8@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:11
>                          Total Owner Claimed Unclaimed Matched Preempting Backfill
> 
>             X86_64/LINUX     8     0       0         8       0          0        0
> 
>                    Total     8     0       0         8       0          0        0
>     [root@condormaster1 ~]# condor_status -startd lxbrb1815.domain.tld
>     Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
> 
>     slot1@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.060 1991  0+00:11:51
>     slot2@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:13
>     slot3@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:14
>     slot4@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:15
>     slot5@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:16
>     slot6@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:17
>     slot7@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:18
>     slot8@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000 1991  0+00:12:11
>                          Total Owner Claimed Unclaimed Matched Preempting Backfill
> 
>             X86_64/LINUX     8     0       0         8       0          0        0
> 
>                    Total     8     0       0         8       0          0        0
> 
>     [root@condormaster1 ~]# condor_status -startd condorworker02
>     [root@condormaster1 ~]# condor_status -startd condorworker02.domain.tld
>     [root@condormaster1 ~]# 
> 
>     - Why can't I send the condor_off command to VMs, when it works fine for
>     physical nodes:
>     [root@condormaster1 ~]# condor_off -startd lxbrb1815
>     Sent "Kill-Daemon" command for "startd" to master lxbrb1815.domain.tld
> 
>     [root@condormaster1 ~]# condor_off -startd condorworker02
>     Can't find address for master condorworker02.domain.tld
>     Perhaps you need to query another pool.
> 
>     Thanks,
>     Daniel
> 
> 
> 
>     2013/11/26 Zachary Miller <zmiller@xxxxxxxxxxx>
> 
>         On Tue, Nov 26, 2013 at 11:37:48AM +0100, Pek Daniel wrote:
>         > I'm trying to "deactivate" some startd machines:
>         > [root@cm1 ~]# condor_status
>         > Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
>         >
>         > condorworker01     LINUX      X86_64 Unclaimed Idle      0.000 2006  5+16:16:41
>         > condorworker03     LINUX      X86_64 Unclaimed Idle      0.000  490  0+00:21:47
>         > slot1@lxbrl2305    LINUX      X86_64 Unclaimed Idle      1.000 1991  4+18:20:46
>         > slot2@lxbrl2305    LINUX      X86_64 Unclaimed Idle      1.000 1991  4+18:21:07
>         > slot3@lxbrl2305    LINUX      X86_64 Unclaimed Idle      1.000 1991  4+18:21:08
>         > slot4@lxbrl2305    LINUX      X86_64 Unclaimed Idle      1.000 1991  4+18:21:09
>         > slot5@lxbrl2305    LINUX      X86_64 Unclaimed Idle      1.000 1991  4+18:21:10
>         > slot6@lxbrl2305    LINUX      X86_64 Unclaimed Idle      0.960 1991  4+18:21:11
>         > slot7@lxbrl2305    LINUX      X86_64 Unclaimed Idle      0.000 1991  4+18:21:12
>         > slot8@lxbrl2305    LINUX      X86_64 Unclaimed Idle      0.000 1991  4+18:21:05
>         >                      Total Owner Claimed Unclaimed Matched Preempting Backfill
>         >
>         >         X86_64/LINUX    10     0       0        10       0          0        0
>         >
>         >                Total    10     0       0        10       0          0        0
>         >
>         > [root@condormaster1 ~]# condor_off -startd -graceful condorworker01
>         > Can't find address for master condorworker01
> 
>         Hmmm.  What does "condor_status -master" have to say?
> 
> 
>         Cheers,
>         -zach
> 
>         _______________________________________________
>         HTCondor-users mailing list
>         To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
>         with a
>         subject: Unsubscribe
>         You can also unsubscribe by visiting
>         https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
>         The archives can be found at:
>         https://lists.cs.wisc.edu/archive/htcondor-users/
> 
> 
> 
> 
