
Re: [HTCondor-users] Daylight savings put all our jobs on hold?



After much testing (including strace wrappers) we've concluded that some inter-process communication was getting lost/blocked/failing somewhere :-(
So we upgraded to v8.0.6 and everything started working again.
Successful outcome but no idea what broke it in the first place.

--Russell

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Smithies, Russell
Sent: Tuesday, 8 April 2014 2:10 p.m.
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Daylight savings put all our jobs on hold?

I had a look through the source and found the message. I think it means the ClassAd isn't available, and there's this "writeJobAd: Write_Pipe failed" error below, but I'm not sure where it would be trying to write it.
I have debugging enabled and this looks promising in SchedLog:

------------------------------------
04/08/14 13:53:01 (pid:24041) Tables are consistent
04/08/14 13:53:01 (pid:24041) Rebuilt prioritized runnable job list in 0.000s.
04/08/14 13:53:01 (pid:24041) match (slot1@xxxxxxxxxxxxxxxxxxxxxxx <147.158.128.131:40939> for smithiesr) switching to job 10228.0
04/08/14 13:53:01 (pid:24041) Starting add_shadow_birthdate(10228.0)
04/08/14 13:53:01 (pid:24041) writeJobAd: Write_Pipe failed
04/08/14 13:53:01 (pid:24041) Started shadow for job 10228.0 on slot1@xxxxxxxxxxxxxxxxxxxxxxx <147.158.128.131:40939> for smithiesr, (shadow pid = 4413)
04/08/14 13:53:01 (pid:24041) Shadow pid 4413 for job 10228.0 exited with status 4
04/08/14 13:53:01 (pid:24041) ERROR: Shadow exited with job exception code!
04/08/14 13:53:01 (pid:24041) Checking consistency running and runnable jobs
04/08/14 13:53:01 (pid:24041) Tables are consistent
04/08/14 13:53:01 (pid:24041) Rebuilt prioritized runnable job list in 0.000s.
04/08/14 13:53:01 (pid:24041) match (slot1@xxxxxxxxxxxxxxxxxxxxxxx <147.158.128.131:40939> for smithiesr) switching to job 10228.0
04/08/14 13:53:01 (pid:24041) Starting add_shadow_birthdate(10228.0)
04/08/14 13:53:01 (pid:24041) writeJobAd: Write_Pipe failed
04/08/14 13:53:01 (pid:24041) Started shadow for job 10228.0 on slot1@xxxxxxxxxxxxxxxxxxxxxxx <147.158.128.131:40939> for smithiesr, (shadow pid = 4414)
04/08/14 13:53:01 (pid:24041) Shadow pid 4414 for job 10228.0 exited with status 4
04/08/14 13:53:01 (pid:24041) ERROR: Shadow exited with job exception code!
04/08/14 13:53:01 (pid:24041) Match for cluster 10228 has had 5 shadow exceptions, relinquishing.
04/08/14 13:53:01 (pid:24041) Completed RELEASE_CLAIM to startd slot1@xxxxxxxxxxxxxxxxxxxxxxx <147.158.128.131:40939> for smithiesr
04/08/14 13:53:01 (pid:24041) Match record (slot1@xxxxxxxxxxxxxxxxxxxxxxx <147.158.128.131:40939> for smithiesr, 10228.0) deleted
-----------------------------------------
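
For anyone wanting to check whether their own SchedLog shows the same pattern, here is a rough Python sketch that counts the "Write_Pipe failed" lines and tallies shadow exits per job and exit status. It only assumes the line layout seen in the excerpt above; the function name summarize_schedlog and the regular expression are illustrative and not part of HTCondor.

```python
import re
from collections import Counter

# Matches lines shaped like the excerpt above, e.g.
# "04/08/14 13:53:01 (pid:24041) Shadow pid 4413 for job 10228.0 exited with status 4"
# (assumption: other HTCondor versions or debug levels may format this differently)
SHADOW_EXIT = re.compile(
    r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} \(pid:\d+\) "
    r"Shadow pid (?P<shadow_pid>\d+) for job (?P<job>\d+\.\d+) "
    r"exited with status (?P<status>\d+)"
)
WRITE_PIPE_FAILED = "writeJobAd: Write_Pipe failed"


def summarize_schedlog(lines):
    """Return (number of Write_Pipe failures, Counter keyed by (job, status))."""
    pipe_failures = 0
    exits = Counter()
    for line in lines:
        if WRITE_PIPE_FAILED in line:
            pipe_failures += 1
        match = SHADOW_EXIT.match(line)
        if match:
            exits[(match.group("job"), int(match.group("status")))] += 1
    return pipe_failures, exits


if __name__ == "__main__":
    # Two lines copied from the excerpt above, used as a small self-check.
    sample = [
        "04/08/14 13:53:01 (pid:24041) writeJobAd: Write_Pipe failed",
        "04/08/14 13:53:01 (pid:24041) Shadow pid 4413 for job 10228.0 exited with status 4",
    ]
    failures, exits = summarize_schedlog(sample)
    print("Write_Pipe failures:", failures)
    for (job, status), count in sorted(exits.items()):
        print(f"job {job}: {count} shadow exit(s) with status {status}")
```

To run it against a real log, pass an open file object instead of the sample list; the location of SchedLog depends on your local configuration.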

--Russell

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Ben Cotton
Sent: Tuesday, 8 April 2014 9:51 a.m.
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Daylight savings put all our jobs on hold?

On Mon, Apr 7, 2014 at 4:11 PM, Smithies, Russell <Russell.Smithies@xxxxxxxxxxxxxxxx> wrote:
> Any idea which dir I should be looking for?
> The dir it mentions is part of the src so I don't think it's an actual dir I have control over.

The directory it mentions points to where to find that error in the source:
https://github.com/htcondor/htcondor/blob/master/src/condor_shadow.V6.1/shadow_v61_main.cpp

I don't know enough C++ to decipher this for you, unfortunately. It's been a long day, but I don't see anything immediately problematic in the job ad. Can your execute nodes access /home/smithiesr/condor and does the UID_DOMAIN of the schedd and execute node match? The logs for slot1@xxxxxxxxxxxxxxxxxxxxxxx might also shed some light on what's going on here.

> I can't see any condor_shadow processes running, shouldn't there be one per job that was submitted?

One per job that's running.


Thanks,
BC

--
Ben Cotton
main: 888.292.5320

Cycle Computing
Leader in Utility HPC Software

http://www.cyclecomputing.com
twitter: @cyclecomputing
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
