
Re: [Condor-users] multiple VM problem



Hi Dan,

Yes, by "user process" I mean a job run by Condor on behalf of a user. There are no other user processes, because the nodes are dedicated compute nodes.

I did try the D_LOAD setting, but I could not make much of the output. The load values do not seem to agree with the Claimed/Busy status. Here is a clipping from a node for which condor_status reports "vm1 owner/idle loadav=1.000" and "vm2 claimed/busy loadav=0.000". Could you please have a look at the output and see if you find a clue, or advise me what to look out for?


/****
5/9 11:34:33 Load avg: 1.00 1.00 0.96
5/9 11:34:33 vm1: LoadQueue: Adding 5 entries of value 0.000000
5/9 11:34:33 vm1: LoadQueue: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0. 00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0
0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
5/9 11:34:33 vm1: LoadQueue: Size: 60 Avg value: 0.00 Share of system load: 0.00 5/9 11:34:33 vm2: Computing percent CPU usage with pids: 3559 3561 3566 3871 3929 3930 3934 4450 5239 5240 5242 12775 5/9 11:34:33 ProcAPI: new boottime = 1145611802; old_boottime = 1145611802; /proc/stat boottime = 1145611803; /proc/uptime boo
ttime = 1145611802
5/9 11:34:33 vm2: Percent CPU usage for those pids is: 0.000000
5/9 11:34:33 vm2: LoadQueue: Adding 5 entries of value 0.000000
5/9 11:34:33 vm2: LoadQueue: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0. 00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0
0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
5/9 11:34:33 vm2: LoadQueue: Size: 60 Avg value: 0.00 Share of system load: 0.00 5/9 11:34:33 SystemLoad: 1.000 TotalCondorLoad: 0.000 TotalOwnerLoad: 1.000
5/9 11:34:33 vm1: SystemLoad: 1.00  CondorLoad: 0.00  OwnerLoad: 1.00
5/9 11:34:33 vm2: SystemLoad: 0.00  CondorLoad: 0.00  OwnerLoad: 0.00
***/
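
To make the per-VM summary lines in the clipping easier to compare, here is a minimal Python sketch that parses them and checks whether OwnerLoad equals SystemLoad minus CondorLoad. That relation is an assumption inferred from the numbers in this log, not something confirmed from the Condor source, and the names `parse_vm_loads`, `PATTERN` and `LOG` are made up for illustration.

```python
import re

# Per-VM summary lines copied from the D_LOAD clipping above.
LOG = """\
5/9 11:34:33 vm1: SystemLoad: 1.00  CondorLoad: 0.00  OwnerLoad: 1.00
5/9 11:34:33 vm2: SystemLoad: 0.00  CondorLoad: 0.00  OwnerLoad: 0.00
"""

# Hypothetical pattern matching only the per-VM summary lines.
PATTERN = re.compile(
    r"(vm\d+): SystemLoad: ([\d.]+)\s+CondorLoad: ([\d.]+)\s+OwnerLoad: ([\d.]+)"
)


def parse_vm_loads(text):
    """Return {vm_name: (system, condor, owner)} for each per-VM summary line."""
    loads = {}
    for match in PATTERN.finditer(text):
        vm, system, condor, owner = match.groups()
        loads[vm] = (float(system), float(condor), float(owner))
    return loads


loads = parse_vm_loads(LOG)
for vm, (system, condor, owner) in sorted(loads.items()):
    # Assumed relation: owner load is whatever part of the system load
    # is not attributed to Condor.
    consistent = abs(owner - (system - condor)) < 1e-6
    print(f"{vm}: system={system:.2f} condor={condor:.2f} "
          f"owner={owner:.2f} consistent={consistent}")
```

In this clipping CondorLoad is 0.00 on both VMs, including the one reported as claimed/busy, so the whole system load of 1.00 is counted as owner load on vm1.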


Just to summarize my earlier mail: something seems to have changed with the OS upgrade, because with the older OS the same node (same condor_config) ran a user process on each VM. Also, if I turn Hyperthreading on for the same node, two VMs run user processes and the other two are stuck in the "Claimed/Busy" state.

Thanks a lot for helping me with this.

Nagaraj







Dan Bradley wrote:

Nagaraj,

Off the top of my head, I can't think of any reason for this change in behavior. When you say "user process" you are referring to a job run by Condor on behalf of a user, right? You are not talking about processes run by users outside of Condor. Just want to be sure I understand.

You could try adding D_LOAD in your STARTD_DEBUG settings. This will show extra information about what's going on while monitoring system load.

--Dan

On May 8, 2006, at 12:46 PM, P. Nagaraj wrote:

Hi,

We have dual-CPU nodes in our Condor pool. Some of these run an older
version of Linux (RH 7.1), and these take two user jobs as shown below.
There is a user process on each CPU here. CondorVersion is 6.6.5 and
the platforms are all Intel/Linux.

vm1 Claimed/Busy/LoadAv=1.000/Mem=502
vm2 Claimed/Busy/loadAv=1.020/Mem=502

When I upgrade the nodes (Scientific Linux 3), the behaviour changes,
condor_config being unchanged. The VM that is running the user process
shows up as Owner/Idle (vm2 below), while the VM that has no LoadAv shows
up as Claimed/Busy (vm1 below).

vm1  Claimed/Busy/LoadAv=0.000/Mem=500
vm2  Owner/Idle/LoadAv=1.000/Mem=500

Referring to vm1 just above, its ClassAd attributes are shown below.

CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
CondorLoadAvg = 0.000000
LoadAvg = 0.000000
TotalLoadAvg = 1.000000     --this from the other vm which has LoadAv=1
TotalCondorLoadAvg = 0.000000
CpuBusyTime = 0
CpuIsBusy = FALSE
State = "Claimed"
Activity = "Busy"
Start = (((LoadAvg - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed"
&& State != "Owner"))
Requirements = START

Why is there this mismatch between CpuIsBusy = FALSE and Activity = "Busy"?
The load averages indicate that a process can start, but something makes the
Activity "Busy". How do I find out why one VM is always Busy, which was
not so before the OS upgrade?
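
As a rough way to see what the two expressions in the listing evaluate to, here is a small Python sketch that transcribes CpuBusy and Start by hand and applies them to the values quoted above. It is not how Condor evaluates ClassAds. The function names and dictionaries are made up for illustration, and the second VM's CondorLoadAvg of 0.0 is an assumption taken from TotalCondorLoadAvg = 0.

```python
# Attribute values copied from the vm1 ClassAd listing above.
vm1 = {"LoadAvg": 0.0, "CondorLoadAvg": 0.0, "State": "Claimed"}

# The other VM as reported by condor_status (Owner/Idle, LoadAv=1.000).
# CondorLoadAvg = 0.0 is assumed, based on TotalCondorLoadAvg = 0.000000.
vm2 = {"LoadAvg": 1.0, "CondorLoadAvg": 0.0, "State": "Owner"}


def cpu_busy(ad):
    # CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
    return (ad["LoadAvg"] - ad["CondorLoadAvg"]) >= 0.5


def start(ad):
    # Start = (((LoadAvg - CondorLoadAvg) <= 0.300000) ||
    #          (State != "Unclaimed" && State != "Owner"))
    low_load = (ad["LoadAvg"] - ad["CondorLoadAvg"]) <= 0.3
    claimed_elsewhere = ad["State"] != "Unclaimed" and ad["State"] != "Owner"
    return low_load or claimed_elsewhere


for name, ad in (("vm1", vm1), ("vm2", vm2)):
    print(f"{name}: CpuBusy={cpu_busy(ad)} Start={start(ad)}")
```

Neither expression refers to Activity, so with these values CpuBusy being false on vm1 does not by itself contradict Activity = "Busy". Both expressions only look at the load that is not attributed to Condor, which here is charged entirely to the other VM.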

Similarly, if HT is enabled in the BIOS, there are 4 VMs on a node. Two of
these are Claimed/Busy (LoadAv=0) and the other two are Owner/Idle (running
a user process, LoadAv=1).

Thanks in advance for any help on this

Nagaraj


--
+----------------------------------+--------------------------------------+
Nagaraj Panyam                     | Office tel: +91-22-22782610
Dept of High Energy Physics        | Office fax: +91-22-22804610
Tata Instt. of Fundamental Research| Home  tel : +91-22-22804936
Mumbai - 400 005, INDIA            | **Email** : pn@xxxxxxxxxxx
+----------------------------------+--------------------------------------+
_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users



--

+----------------------------------+--------------------------------------+
Nagaraj Panyam                     | Office tel: +91-22-22782610
Dept of High Energy Physics        | Office fax: +91-22-22804610
Tata Instt. of Fundamental Research| Home  tel : +91-22-22804936
Mumbai - 400 005, INDIA            | **Email** : pn@xxxxxxxxxxx
+----------------------------------+--------------------------------------+