
[Condor-users] Fw: Problems about condor slots



 
 
2009-12-29

***********************************************
* Hailong Yang, PhD. Candidate
* Sino-German Joint Software Institute,
* School of Computer Science&Engineering, Beihang University
* Phone: (86-010)82315908
* Email: hailong.yang1115@xxxxxxxxx
* Address: G413, New Main Building in Beihang University,
*              No.37 XueYuan Road,HaiDian District,
*              Beijing,P.R.China,100191
***********************************************

From: hailong.yang1115
Sent: 2009-12-27 23:00:08
To: Alain Roy
Cc:
Subject: Problems about condor slots
 
Hi All,
 
We recently installed the latest Condor release, version 7.4.1, on our clusters, and we encountered the following problems on some nodes during the installation:
 
1. On some nodes in the Condor pool, the number of slots does not match the number of logical CPU cores listed in /proc/cpuinfo. For node9, condor_status reports 6 slots, while /proc/cpuinfo shows 4 logical CPU cores. Detailed information can be found in the attachment.
 
[root@monitor ~]# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot1@xxxxxxxxxx   LINUX      INTEL  Unclaimed Idle     0.000   505  0+09:45:54
slot1@xxxxxxxxxx   LINUX      INTEL  Unclaimed Idle     0.140   505  0+00:50:04
slot1@xxxxxxxxxx   LINUX      INTEL  Unclaimed Idle     0.000   168  0+22:46:51
slot2@xxxxxxxxxx   LINUX      INTEL  Unclaimed Idle     0.000   505  0+00:15:05
slot2@xxxxxxxxxx   LINUX      INTEL  Unclaimed Idle     0.000   505  1+00:50:43
slot2@xxxxxxxxxx   LINUX      INTEL  Unclaimed Idle     0.000   168  0+01:10:05
slot3@xxxxxxxxxx   LINUX      INTEL  Unclaimed Idle     0.000   168  1+01:10:33
slot4@xxxxxxxxxx   LINUX      INTEL  Unclaimed Idle     0.000   168  1+01:10:34
slot5@xxxxxxxxxx   LINUX      INTEL  Unclaimed Idle     0.000   168  1+01:10:35
slot6@xxxxxxxxxx   LINUX      INTEL  Unclaimed Idle     0.000   168  1+01:10:36
slot1@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   493  0+23:29:41
slot2@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.040   493  0+01:10:05
slot3@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   493  1+01:11:00
slot4@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   493  1+01:11:01
slot5@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   493  1+01:11:02
slot6@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   493  1+01:11:03
slot7@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   493  1+01:11:04
slot8@xxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle     0.000   493  1+01:10:57
                     Total Owner Claimed Unclaimed Matched Preempting Backfill
         INTEL/LINUX    10     0       0        10       0          0        0
        X86_64/LINUX     8     0       0         8       0          0        0
               Total    18     0       0        18       0          0        0
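
To compare the two numbers on any node, something like the following could be used. It is only a minimal sketch: it assumes the x86 Linux layout of /proc/cpuinfo (one "processor : N" entry per logical CPU) and the default condor_status column layout shown above. The function names count_logical_cpus and slots_per_host are made up for this illustration and are not part of Condor.

```python
import re
from collections import Counter


def count_logical_cpus(cpuinfo_text):
    """Count 'processor : N' entries; Linux on x86 lists one per logical CPU."""
    return len(re.findall(r"^processor\s*:\s*\d+", cpuinfo_text, flags=re.MULTILINE))


def slots_per_host(status_text):
    """Count slotN@host lines per host in plain condor_status output."""
    hosts = re.findall(r"^slot\d+@(\S+)", status_text, flags=re.MULTILINE)
    return Counter(hosts)


if __name__ == "__main__":
    # Assumes this runs on the execute node whose slot count looks wrong.
    with open("/proc/cpuinfo") as f:
        print("logical CPUs:", count_logical_cpus(f.read()))
```

The output of condor_status can be passed to slots_per_host as a string; a host whose count differs from its logical CPU count is the one to look at more closely.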
 
2. After installing Condor on some nodes, we started condor_master but nothing happened. We checked the MasterLog file, which showed the following error:
12/27 10:48:41 ******************************************************
12/27 10:48:41 ** condor_master (CONDOR_MASTER) STARTING UP
12/27 10:48:41 ** /ddgrid/condor/sbin/condor_master
12/27 10:48:41 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
12/27 10:48:41 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
12/27 10:48:41 ** $CondorVersion: 7.4.1 Dec 17 2009 BuildID: 204351 $
12/27 10:48:41 ** $CondorPlatform: I386-LINUX_RHEL3 $
12/27 10:48:41 ** PID = 7012
12/27 10:48:41 ** Log last touched 12/26 23:53:14
12/27 10:48:41 ******************************************************
12/27 10:48:41 Using config source: /ddgrid/condor/etc/condor_config
12/27 10:48:41 Using local config sources: 
12/27 10:48:41    /ddgrid/condor/local.ddgrid/condor_config.local
12/27 10:48:41 ERROR "can't safe_open_wrapper(/tmp/condor-lock.ddgrid0.745993478763015/InstanceLock,O_WRONLY|O_CREAT|O_APPEND,S_IRUSR|S_IWUSR) - errno 2" at line 946 in file master.cpp
 
There seems to be a permission problem related to the condor_config file, but we cannot figure out which part is wrong.
[root@ddgrid local.ddgrid]# pwd
/ddgrid/condor/local.ddgrid
[root@ddgrid local.ddgrid]# ll
total 4
-rw-r--r--  1 root   root 2918 Dec 26 23:36 condor_config.local
drwxrwxrwt  2 condor root    6 Dec 26 23:36 execute
drwxr-xr-x  2 condor root   22 Dec 26 23:40 log
drwxr-xr-x  2 condor root    6 Dec 26 23:36 spool
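
For reference, the "errno 2" in the MasterLog excerpt can be decoded with a few lines of Python. On Linux, errno 2 is ENOENT ("No such file or directory"), whereas a permission failure would normally show up as EACCES (errno 13), so it may be worth checking whether the lock directory itself exists. The sketch below is only a diagnostic aid under that assumption; the helper names explain_errno and first_missing_ancestor are made up for this illustration.

```python
import errno
import os


def explain_errno(code):
    """Return (symbolic name, message) for a numeric errno value."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def first_missing_ancestor(path):
    """Return the topmost component of path that does not exist, or None."""
    path = os.path.abspath(path)
    missing = None
    while not os.path.exists(path):
        missing = path
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            break
        path = parent
    return missing


if __name__ == "__main__":
    print(explain_errno(2))
    # Lock path copied from the MasterLog error above.
    lock = "/tmp/condor-lock.ddgrid0.745993478763015/InstanceLock"
    print("first missing component:", first_missing_ancestor(lock))
```

If the second line prints the condor-lock directory rather than the InstanceLock file, the directory was never created or has since been removed, which would point away from the file permissions listed above.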
 
Best wishes!
 
-Hailong
 
2009-12-27


Attachment: cpuinfo
Description: Binary data