
Re: [Condor-users] condor mail error notification - Condor Error messages when running resource demanding jobs.



Thank you David, Matt, Greg and Michael for your replies.
The error messages only appear when I submit one particular job
that is very resource-intensive, especially in RAM. I don't know why
this job triggers all these error messages, but I think turning on
debugging to collect more information is the best way to tackle
this problem.

Matt,
When you suggest setting the ABORT_ON_EXCEPTION param to TRUE and
CREATE_CORE_FILE = TRUE, should that be done on all the machines or only
on the execute nodes?

Michael,
Should the change you suggested ("STARTD_DEBUG = D_JOB") go in the
config file of all the Condor nodes (scheduler, startd and
negotiator) or only on the startd machines?

I don't know whether these two changes can coexist or will conflict.
Thanks for your input.
 

David,
The subject of my original message looked like that because I was
having problems with my e-mail client, so I resent the message from a
sent e-mail that never made it to the Condor mailing list.
Still, being more creative with the message's subject is a good and
valid suggestion.
Alex



-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Michael Moore
Sent: Thursday, August 20, 2009 2:10 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] condor mail error notification

Alex,

I've also seen the "Can't find WANT_SUSPEND in internal ClassAd" errors.

In trying to debug the issue I noticed that setting the Startd debug 
flag prevented the issue from occurring. Specifically, I added 
"STARTD_DEBUG = D_JOB" to the config and haven't seen the issue since.
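
For reference, a minimal sketch of that change as it might appear in an
execute node's local configuration file. The file name and location are
assumptions, and D_JOB is simply the flag named above. Since the setting
only controls logging for the condor_startd daemon, it should only matter
on machines that run a startd.

```
# Hypothetical excerpt from an execute node's local config file
# (for example condor_config.local; the real path depends on the install).
# Turns on the D_JOB debug category for the startd's log (StartLog).
STARTD_DEBUG = D_JOB
```

After editing the file, running condor_reconfig on that node should make
the startd pick up the new debug setting.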

Michael
-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Matthew Farrellee
Sent: Thursday, August 20, 2009 11:59 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] condor mail error notification

Alex,

The ERROR means that Condor excepted. I've seen this particular error
sporadically, and never often enough that I could properly debug it. How
often does it happen for you?

You might try setting the ABORT_ON_EXCEPTION param to TRUE and
CREATE_CORE_FILE = TRUE to get more information about where this is
happening and what the startd's state is at the time.
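
A sketch of those two settings, assuming they go in the local
configuration file of the machine whose startd is excepting (the file
placement is an assumption). Note that the Condor manual spells the
second knob CREATE_CORE_FILES, in the plural, so it is worth checking the
name against the manual for the installed version.

```
# Hypothetical excerpt from the local config file on the affected node.
# Ask Condor daemons to write a core file in the LOG directory on a crash.
CREATE_CORE_FILES = True
# On a fatal internal exception, abort (so a core file is produced)
# instead of only logging the error and exiting.
ABORT_ON_EXCEPTION = True
```

These settings control crash behavior while STARTD_DEBUG controls log
verbosity, so nothing in their descriptions suggests they would conflict
if set together.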

Best,


matt

Alas, Alex [FEDI] wrote:
> Hello to all!,
> 
> Not trying to be annoying but I really don't have a clue of how to
> attack this issue, any ideas are welcome, 
> 
> Thanks again,
> 
> Alex 
> 
>  
> 
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Alas, Alex [FEDI]
> Sent: Wednesday, August 19, 2009 12:31 PM
> To: Condor-Users Mail List
> Subject: [Condor-users] condor mail error notification
> 
>  
> 
> Hello to all,
> 
> I launched some jobs through my condor pool. I have a mixed farm of
> windows 2003 and windows XP boxes. The second ones are Virtual machines
> running on Linux hosts. The jobs I ran last night are still running but
> I am receiving several e-mail notifications from all the windows XP
> machines. I launched the jobs from a computer that belonged to another
> pool using "condor_submit -pool negotiation -name scheduler
> condor_submission_filename.sub"; The error message is the following:
> 
>
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
> XXXXX
> 
> This is an automated email from the Condor system on machine
> "vm4-condor-xp.earthdata.com".  Do not reply.
> 
>  
> 
> "C:\Condor/bin/condor_startd.exe" on "vm4-condor-xp.earthdata.com"
> exited with status 4.
> 
> Condor will automatically restart this process in 10 seconds.
> 
>  
> 
> *** Last 20 line(s) of file C:\Condor/log/StartLog:
> 
> 8/18 18:07:08 slot1: State change: No preempting claim, returning to
> owner
> 
> 8/18 18:07:08 slot1: Changing state and activity: Preempting/Vacating ->
> Owner/Idle
> 
> 8/18 18:07:08 slot1: State change: IS_OWNER is false
> 
> 8/18 18:07:08 slot1: Changing state: Owner -> Unclaimed
> 
> 8/18 18:11:59 slot1: match_info called
> 
> 8/18 18:11:59 slot1: Received match <10.2.168.99:1578>#1250626520#7#...
> 
> 8/18 18:11:59 slot1: State change: match notification protocol
> successful
> 
> 8/18 18:11:59 slot1: Changing state: Unclaimed -> Matched
> 
> 8/18 18:11:59 slot1: Request accepted.
> 
> 8/18 18:11:59 slot1: Remote owner is aalas@xxxxxxxxxxxxx
> 
> 8/18 18:11:59 slot1: State change: claiming protocol successful
> 
> 8/18 18:11:59 slot1: Changing state: Matched -> Claimed
> 
> 8/18 18:11:59 ERROR "Can't find WANT_SUSPEND in internal ClassAd" at
> line 1226 in file ..\src\condor_startd.V6\Resource.cpp
> 
> 8/18 18:11:59 slot1: Changing state and activity: Claimed/Idle ->
> Preempting/Killing
> 
> 8/18 18:11:59 slot1: State change: No preempting claim, returning to
> owner
> 
> 8/18 18:11:59 slot1: Changing state and activity: Preempting/Killing ->
> Owner/Idle
> 
> 8/18 18:11:59 slot1: State change: IS_OWNER is false
> 
> 8/18 18:11:59 slot1: Changing state: Owner -> Unclaimed
> 
> 8/18 18:11:59 slot2: Changing state and activity: Claimed/Busy ->
> Preempting/Killing
> 
> 8/18 18:11:59 startd exiting because of fatal exception.
> 
> *** End of file StartLog
> 
>
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
> XXXXXXXXXXX
> 
> I am not an expert on Condor, so I don't know how to interpret this
> error message. Any ideas?
> 
> Thanks in advance for your help,
> 
>  
> 
> Respectfully,
> 
> Alex Alas 
> Fugro EarthData Inc.
> 
>  
> 
> 
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> 
> The archives can be found at: 
> https://lists.cs.wisc.edu/archive/condor-users/
