
Re: [Condor-users] multiple VM problem



Hi,

We are seeing the same problem here with Fedora 4 on dual Xeon systems.
Has there been any progress on finding a solution?
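In case it helps anyone reproduce the check Dan suggests further down (comparing the startd's watched pids against what "top" shows), here is a rough sketch of pulling the pid list out of a StartLog line. The sample line is copied from the log excerpt later in this thread; note the startd wraps long pid lists across lines, so you may need to join continuation lines first. `watched_pids` is just a name I made up:

```python
import re

# Sample line copied from the StartLog excerpt later in this thread; in
# practice you would take the last such line from your own StartLog.
line = "5/10 12:09:18 vm2: Computing percent CPU usage with pids: 23718 23720 23725"

def watched_pids(log_line):
    """Return the pid list from a 'Computing percent CPU usage' line."""
    m = re.search(r"pids:\s*(.*)$", log_line)
    return [int(p) for p in m.group(1).split()] if m else []

print(watched_pids(line))   # [23718, 23720, 23725]
```

Each pid can then be compared against the live process table (e.g. `ps -p <pid> -o pid,pcpu,comm`) to see whether the heavy process is in the list at all.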

I have been told that newer Linux kernels can migrate processes between
processors at will. Could the problem be that Condor is not aware of the
OS doing this?
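For anyone who wants to measure this outside Condor: as Dan explains further down, the startd reads /proc/<pid>/stat for each watched pid. Here is a minimal sketch of that kind of calculation (field positions per the proc(5) man page; this is only an illustration, not Condor's actual code, and the function names are mine):

```python
import os
import time

def parse_ticks(stat_line):
    """utime+stime (clock ticks) from one /proc/<pid>/stat line.
    comm (field 2) may contain spaces, so split after the last ')'."""
    rest = stat_line.rsplit(')', 1)[1].split()
    return int(rest[11]) + int(rest[12])   # fields 14 (utime) and 15 (stime)

def cpu_ticks(pid):
    """Ticks for a live pid, or None if it has exited
    (the 'ProcAPI: pid N does not exist' case in the logs)."""
    try:
        with open("/proc/%d/stat" % pid) as f:
            return parse_ticks(f.read())
    except OSError:
        return None

def percent_cpu(pids, interval=1.0):
    """Aggregate CPU fraction for a pid set, sampled over `interval`."""
    hz = os.sysconf("SC_CLK_TCK")
    before = {p: cpu_ticks(p) for p in pids}
    time.sleep(interval)
    used = 0
    for p in pids:
        after = cpu_ticks(p)
        if before[p] is not None and after is not None:
            used += after - before[p]
    return used / (hz * interval)
```

Note that if the pid list itself is missing the process that is burning CPU (as with d0gstar2.x in Nagaraj's "top" output), a calculation like this returns 0 no matter how well the stat files are read, which would match the LoadQueue full of zeros in the logs.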

Thanks

Ian

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Dan Bradley
Sent: Wednesday, May 10, 2006 8:00 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] multiple VM problem


I am moving this over into condor-admin@xxxxxxxxxxxx  I'll feed back a
summary of what is found, if it seems of general interest.

Cheers,
--Dan

On May 10, 2006, at 8:15 AM, Nagaraj Panyam wrote:

> Hi,
>
> I have these flags for STARTD_DEBUG
> D_COMMAND D_FULLDEBUG D_MACHINE D_LOAD D_MATCH D_JOB D_ACCOUNTANT
>
> About the ProcAPI error. Yes, it's there. Here is a clip from the log.
> It is sporadic, and appears to occur in bunches, so I missed it until
> you asked.
>
> /**
> 5/10 14:54:33 vm1: Computing percent CPU usage with pids: 5371 5373
> 5378
> 5421 5424 5425 5631 5632
> 5/10 14:54:33 ProcAPI: pid 5421 does not exist.
> 5/10 14:54:33 ProcAPI::getProcSetInfo: Pid 5421 does not exist, 
> ignoring.
> 5/10 14:54:33 ProcAPI: pid 5424 does not exist.
> 5/10 14:54:33 ProcAPI::getProcSetInfo: Pid 5424 does not exist, 
> ignoring.
> 5/10 14:54:33 ProcAPI: pid 5425 does not exist.
> 5/10 14:54:33 ProcAPI::getProcSetInfo: Pid 5425 does not exist, 
> ignoring.
> 5/10 14:54:33 ProcAPI: pid 5631 does not exist.
> 5/10 14:54:33 ProcAPI::getProcSetInfo: Pid 5631 does not exist, 
> ignoring.
> 5/10 14:54:33 ProcAPI: pid 5632 does not exist.
> 5/10 14:54:33 ProcAPI::getProcSetInfo: Pid 5632 does not exist, 
> ignoring.
> **/
>
> I came across the same error in another context. Here is the log, but
> I am not sure whether it is relevant to the problem.
>
> /**
> 5/2 18:08:21 Load avg: 0.00 0.00 0.01
> 5/2 18:08:21 vm1: Computing percent CPU usage with pids: 5976 5978 
> 5983
> 6026 6029
> 5/2 18:08:21 vm1: Percent CPU usage for those pids is: 0.000000
> 5/2 18:08:21 vm1: LoadQueue: Adding 5 entries of value 0.000000
> 5/2 18:08:21 vm1: LoadQueue: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 5/2 18:08:21 vm1: LoadQueue: Size: 60  Avg value: 0.00  Share of 
> system
> load: 0.00
> 5/2 18:08:21 vm2: Computing percent CPU usage with pids: 5974 5975
> 5/2 18:08:21 vm2: Percent CPU usage for those pids is: 0.000000
> 5/2 18:08:21 vm2: LoadQueue: Adding 5 entries of value 0.000000
> 5/2 18:08:21 vm2: LoadQueue: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 5/2 18:08:21 vm2: LoadQueue: Size: 60  Avg value: 0.00  Share of 
> system
> load: 0.00
> 5/2 18:08:21 SystemLoad: 0.000  TotalCondorLoad: 0.000  TotalOwnerLoad:
> 0.000
> 5/2 18:08:21 vm1: SystemLoad: 0.00  CondorLoad: 0.00  OwnerLoad: 0.00
> 5/2 18:08:21 vm2: SystemLoad: 0.00  CondorLoad: 0.00  OwnerLoad: 0.00
> 5/2 18:08:21 vm2: State change: PREEMPT is TRUE
> 5/2 18:08:21 vm2: State change: WANT_VACATE is TRUE
> 5/2 18:08:21 vm2: In Starter::kill() with pid 5974, sig 101
> (DC_SIGCONTINUE)
> 5/2 18:08:21 STARTD_TIMEOUT_MULTIPLIER is undefined, using default 
> value of 0
> 5/2 18:08:21 SEC_DEBUG_PRINT_KEYS is undefined, using default value of
> False
> 5/2 18:08:21 vm2: Canceled claim timer (134)
> 5/2 18:08:21 vm2: Changing state and activity: Claimed/Suspended -> 
> Preempting/Vacating
> 5/2 18:08:21 vm2: In Starter::kill() with pid 5974, sig 15 (SIGTERM)
> 5/2 18:08:21 STARTD_TIMEOUT_MULTIPLIER is undefined, using default 
> value of 0
> 5/2 18:08:21 SEC_DEBUG_PRINT_KEYS is undefined, using default value of
> False
> 5/2 18:08:21 DaemonCore: No more children processes to reap.
> 5/2 18:08:21 Starter pid 5974 exited with status 0
> 5/2 18:08:21 ProcAPI::buildFamily failed: parent 5974 not found on 
> system.
> 5/2 18:08:21 ProcAPI: pid 5974 does not exist.
> 5/2 18:08:21 ProcAPI: pid 5975 does not exist.
> 5/2 18:08:21 vm2: State change: starter exited
> 5/2 18:08:21 Entered vacate_client <192.168.55.100:34034> 
> mcf000.tifr-mcfarm...
> 5/2 18:08:21 STARTD_TIMEOUT_MULTIPLIER is undefined, using default 
> value of 0
> 5/2 18:08:21 SEC_DEBUG_PRINT_KEYS is undefined, using default value of
> False
> 5/2 18:08:21 vm2: State change: No preempting claim, returning to 
> owner
> 5/2 18:08:21 vm2: Changing state and activity: Preempting/Vacating -> 
> Owner/Idle
> 5/2 18:08:21 vm2: State change: IS_OWNER is false
> 5/2 18:08:21 vm2: Changing state: Owner -> Unclaimed
> 5/2 18:08:21 DaemonCore: No more children processes to reap.
> **/
>
> I checked the list of pids that appear in the startd log.
> 5/10 12:09:18 vm2: Computing percent CPU usage with pids: 23718 23720
> 23725 24346 24404 24405 24409 26504 27293 27294 27296 1238
>
> /** clippings from "top"
> 23718 root      15   0  1964 1964  1552 S     0.0  0.0   0:10   0
> condor_starter
> 23720 samgrid   25  10   272  272   228 S N   0.0  0.0   0:00   1 time
> 23725 samgrid   25  10  1068 1068   900 S N   0.0  0.0   0:00   1
> sandbox_mgr
> 24346 samgrid   25  10  1104 1104   912 S N   0.0  0.0   0:00   1
> jim_rapper.sh
> 24404 samgrid   25  10   344  344   288 S N   0.0  0.0   0:00   1 time
> 24405 samgrid   25  10  1136 1136   924 S N   0.0  0.0   0:00   1
> mc_sam_rapper.s
> 24409 samgrid   29  10  1060 1060   904 S N   0.0  0.0   0:00   0
> accelerator.sh
> 26504 samgrid   25  10 13684  13M  2248 S N   0.0  0.6   0:03   0  
> python
> 27293 samgrid   25  10   988  988   852 S N   0.0  0.0   0:00   0 sh
> 27294 samgrid   25  10  1016 1016   876 S N   0.0  0.0   0:00   0
> Tata20737488711
> 27296 samgrid   25  10  9272 9272  1744 S N   0.0  0.4   0:00   1  
> python
>  1238 samgrid   29  10   504  504   428 S N   0.0  0.0   0:00   0 sleep
> **/
>
> But I notice that the pids listed against vm2 (shown above) are
> all related to the user process running on vm1. The one big user
> process (shown below) that takes up 100% of vm1 is not listed in the
> log against vm1.
>
> /**
> 27356 samgrid   35  10  156M 156M 28424 R N  50.0  7.7 600:31   1  
> d0gstar2.x
> **/
>
> This does not seem right: on the node with the older Linux, the log
> shows about 20 pids (for one user job) against the corresponding vm.
> On the problem nodes, Condor seems to be watching only about half of
> the user processes that "top" lists.
>
> I hope I am reporting all that is relevant, though this mail has 
> become rather long.
> Thank you sincerely.
>
> Nagaraj
>
>
> Dan Bradley wrote:
>
>> Nagaraj,
>>
>> Condor reads /proc/<pid>/stat to get the CPU utilization of a process.
>> I've observed that this works under Scientific Linux 3, so I don't 
>> know why it is reporting no load for your processes.
>>
>> Looking at the code that reads /proc/<pid>/stat, I see that some
>> failures to read the file are only reported if you add D_FULLDEBUG to
>> your STARTD_DEBUG list.  Could you please do that and see if there
>> are errors containing the string "ProcAPI::getProcInfo()"?
>>
>> Also, does the list of pids that it is checking make sense?  In your
>> log below, there is a message "Computing percent CPU usage with pids:
>> 3559 3561 3566 3871 3929 3930 3934 4450 5239 5240 5242 12775".
>> Presumably one or more of these was generating the cpu load.  Can you please
>> verify that?  The question I am wondering about is whether the 
>> problem is the reading of process stat information or whether it is 
>> the list of pids itself that is the trouble.  If there are pids 
>> associated with the job that are generating cpu load but which are 
>> not in the list, that would cause a problem similar to what you are 
>> observing.  If you just use 'top' to see the top few pids and verify 
>> that these are in the list being watched by Condor, that should be 
>> enough to see if this is really the problem or not.
>>
>> --Dan
>>
>> Nagaraj Panyam wrote:
>>
>>
>>
>>> Hi Dan,
>>>
>>> Yes, by "user process" I refer to job run by condor on behalf of a 
>>> user.
>>> There is no other user process, because the nodes are dedicated 
>>> compute nodes.
>>>
>>> I did try the D_LOAD setting, but I could not make much of the
>>> output; the load values don't seem to agree with the Claimed/Busy
>>> status. Here is a clipping from a node for which condor_status
>>> reports "vm1 owner/idle loadav=1.000" and "vm2 claimed/busy
>>> loadav=0.000". Please have a look at the output and see if you find
>>> a clue, or advise me what to look out for.
>>>
>>>
>>> /****
>>> 5/9 11:34:33 Load avg: 1.00 1.00 0.96
>>> 5/9 11:34:33 vm1: LoadQueue: Adding 5 entries of value 0.000000
>>> 5/9 11:34:33 vm1: LoadQueue: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>> 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>> 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>> 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>> 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>> 5/9 11:34:33 vm1: LoadQueue: Size: 60  Avg value: 0.00  Share of 
>>> system
>>> load: 0.00
>>> 5/9 11:34:33 vm2: Computing percent CPU usage with pids: 3559 3561
>>> 3566
>>> 3871 3929 3930 3934 4450 5239 5240 5242 12775
>>> 5/9 11:34:33 ProcAPI: new boottime = 1145611802; old_boottime =
>>> 1145611802; /proc/stat boottime = 1145611803; /proc/uptime boottime
>>> = 1145611802
>>> 5/9 11:34:33 vm2: Percent CPU usage for those pids is: 0.000000
>>> 5/9 11:34:33 vm2: LoadQueue: Adding 5 entries of value 0.000000
>>> 5/9 11:34:33 vm2: LoadQueue: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>> 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>> 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>> 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>> 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>>> 5/9 11:34:33 vm2: LoadQueue: Size: 60  Avg value: 0.00  Share of 
>>> system
>>> load: 0.00
>>> 5/9 11:34:33 SystemLoad: 1.000  TotalCondorLoad: 0.000   
>>> TotalOwnerLoad:
>>> 1.000
>>> 5/9 11:34:33 vm1: SystemLoad: 1.00  CondorLoad: 0.00  OwnerLoad: 
>>> 1.00
>>> 5/9 11:34:33 vm2: SystemLoad: 0.00  CondorLoad: 0.00  OwnerLoad: 0.00
>>> ***/
>>>
>>>
>>> Just to summarize my earlier mail - something seems to have changed
>>> with the OS upgrade, because with the older OS the same node (same
>>> condor_config) ran a user process on each vm. Also, if I turn
>>> Hyperthreading on for the same node, two vm's run user processes,
>>> and two are stuck in the "claimed/busy" state.
>>>
>>> Thanks a lot for helping me with this.
>>>
>>> Nagaraj
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Dan Bradley wrote:
>>>
>>>
>>>
>>>
>>>
>>>> Nagaraj,
>>>>
>>>> Off the top of my head, I can't think of any reason for this change
>>>> in behavior.  When you say "user process" you are referring to a
>>>> job run by Condor on behalf of a user, right?  You are not talking
>>>> about processes run by users outside of Condor.  Just want to be
>>>> sure I understand.
>>>>
>>>> You could try adding D_LOAD in your STARTD_DEBUG settings.  This
>>>> will show extra information about what's going on while monitoring
>>>> system load.
>>>>
>>>> --Dan
>>>>
>>>> On May 8, 2006, at 12:46 PM, P. Nagaraj wrote:
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> Hi,
>>>>>
>>>>> We have dual cpu nodes in our condor pool. Some of these run an
>>>>> older version of Linux (RH 7.1), and these take two user jobs as
>>>>> shown below. There is a user process on each cpu here. CondorVersion
>>>>> is 6.6.5 and platforms are all Intel/Linux.
>>>>>
>>>>> vm1 Claimed/Busy/LoadAv=1.000/Mem=502
>>>>> vm2 Claimed/Busy/loadAv=1.020/Mem=502
>>>>>
>>>>> When I upgrade the nodes (Scientific Linux 3), the behaviour  
>>>>> changes,
>>>>> condor_config being unchanged. The VM that is running the user  
>>>>> process
>>>>> shows up as vm2 below, while the vm that has no LoadAv shows up as
>>>>> Busy.
>>>>>
>>>>> vm1  Claimed/Busy/LoadAv=0.000/Mem=500
>>>>> vm2  Owner/Idle/LoadAv=1.000/Mem=500
>>>>>
>>>>> Referring to the vm1 just above, its classads are as shown below.
>>>>>
>>>>> CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
>>>>> CondorLoadAvg = 0.000000
>>>>> LoadAvg = 0.000000
>>>>> TotalLoadAvg = 1.000000     --this from the other vm which has  
>>>>> LoadAv=1
>>>>> TotalCondorLoadAvg = 0.000000
>>>>> CpuBusyTime = 0
>>>>> CpuIsBusy = FALSE
>>>>> State = "Claimed"
>>>>> Activity = "Busy"
>>>>> Start = (((LoadAvg - CondorLoadAvg) <= 0.300000) || (State !=
>>>>> "Unclaimed"
>>>>> && State != "Owner"))
>>>>> Requirements = START
>>>>>
>>>>> Why is there this mismatch - CpuIsBusy="FALSE" and Activity="Busy"?
>>>>> The load averages indicate that a process can start, but something
>>>>> makes the Activity "Busy". How do I find out why one vm is always
>>>>> Busy, which was not so before the OS upgrade?
>>>>>
>>>>> Similarly, if HT is enabled in BIOS, there are 4 vm's on a node.
>>>>> Two of these are Claimed/Busy (LoadAv=0) and the other two are
>>>>> Owner/Idle (running a user process, LoadAv=1).
>>>>>
>>>>> Thanks in advance for any help on this
>>>>>
>>>>> Nagaraj
>>>>>
>>>>>
>>>>> --   
>>>>>
>>>>> +----------------------------------+--------------------------------------+
>>>>> Nagaraj Panyam                     | Office tel: +91-22-22782610
>>>>> Dept of High Energy Physics        | Office fax: +91-22-22804610
>>>>> Tata Instt. of Fundamental Research| Home  tel : +91-22-22804936
>>>>> Mumbai - 400 005, INDIA            | **Email** : pn@xxxxxxxxxxx
>>>>> +----------------------------------+--------------------------------------+
>>>
>>>
>>>
>>>
>
>
> --  
>
> +----------------------------------+--------------------------------------+
> Nagaraj Panyam                     | Office tel: +91-22-22782610
> Dept of High Energy Physics        | Office fax: +91-22-22804610
> Tata Instt. of Fundamental Research| Home  tel : +91-22-22804936
> Mumbai - 400 005, INDIA            | **Email** : pn@xxxxxxxxxxx
> +----------------------------------+--------------------------------------+
>

_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users