Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Jobs getting held for no obvious reason

Date: Thu, 20 Nov 2008 07:56:40 -0600
From: Dan Bradley <dan@xxxxxxxxxxxx>
Subject: Re: [Condor-users] Jobs getting held for no obvious reason

Ian,

I'm afraid I don't have any ideas about what could be causing "Permission denied" in the transfer of output files to the spool directory. If you hit a dead end in tracking that down, it may be necessary to add more information to the shadow debug log when it hits this problem.

Just to be clear: how were these jobs submitted? Are these vanilla jobs submitted with condor_submit -s? Or SOAP?

When you look at the files in the job's spool directory, what ownership do you see? While the job is running, I would expect the files to be owned by the user. At other times, I would expect to see the files owned by condor.
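
One way to collect that ownership information is a small script that walks the spool directory and reports the owner and mode of each entry. This is only a sketch under some assumptions: the names `owner_name` and `list_ownership` are made up for illustration, and it relies on the Unix-only `pwd` module, so it is meant for the submit host rather than the Windows execute machines.

```python
import os
import pwd


def owner_name(uid):
    """Return the user name for uid, or the numeric uid if there is no passwd entry."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)


def list_ownership(top):
    """Yield (path, owner, octal permission bits) for top and everything beneath it."""
    paths = [top]
    for dirpath, dirnames, filenames in os.walk(top):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)
    for path in paths:
        st = os.lstat(path)  # lstat so symlinks are reported, not followed
        yield path, owner_name(st.st_uid), oct(st.st_mode & 0o777)
```

Calling `list_ownership` on the spool directory named in the hold reason, once while the job is running and once after it has been held, and printing each tuple should show whether the owner switches between the user and condor as expected.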

The apparent failure of SYSTEM_PERIODIC_RELEASE is also mysterious. Things to try:

1. Confirm that the schedd is using the setting you expect:

condor_config_val -schedd SYSTEM_PERIODIC_RELEASE


2. Add D_FULLDEBUG to SCHEDD_DEBUG and check for messages like this:

Evaluated periodic expressions in 1.3s, scheduling next run in 60s
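
If the schedd log is large, a short script can pull out just those messages. The following is a rough sketch: the regular expression is modelled only on the sample message above, the function name `periodic_evaluations` is hypothetical, and the location of the schedd log has to be supplied by you (it is not given in this thread).

```python
import re

# Pattern modelled on the sample message above. It is searched for rather
# than anchored, on the assumption that each log line may start with a prefix.
PERIODIC_RE = re.compile(
    r"Evaluated periodic expressions in (?P<took>[\d.]+)s, "
    r"scheduling next run in (?P<next>\d+)s"
)


def periodic_evaluations(lines):
    """Return (seconds_taken, next_run_seconds) for each matching log line."""
    found = []
    for line in lines:
        match = PERIODIC_RE.search(line)
        if match:
            found.append((float(match.group("took")), int(match.group("next"))))
    return found
```

Passing an open file object for the schedd log to `periodic_evaluations` returns one tuple per evaluation. An empty result after D_FULLDEBUG has been enabled would suggest the periodic expressions are not being evaluated at all.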

--Dan

Smith, Ian wrote:

Hi,

I've noticed that a lot of jobs on our pool are being held for no obvious
reason. It seems to happen to the longer running jobs ( > 1 day )
but there's no apparent pattern. The hold reason is typically given as:

HoldReason = "Error from starter on slot2@xxxxxxxxxxxxxxxxxxxxxxx: STARTER at
138.253.234.21 failed to send file(s) to <138.253.100.27:64942>; SHADOW at
138.253.100.27 failed to write to file
/opt1/condor/mws_pool_spool/cluster9648.proc0.subproc0.tmp/time194: (errno 13)
Permission denied"

and the job log file shows:

        Error from starter on slot2@xxxxxxxxxxxxxxxxxxxxxxx: STARTER at
138.253.234.21 failed to send file(s) to <138.253.100.27:64942>; SHADOW at
138.253.100.27 failed to write to file
/opt1/condor/mws_pool_spool/cluster9648.proc0.subproc0.tmp/time194: (errno 13)
Permission denied
        58428532  -  Run Bytes Sent By Job
        178835  -  Run Bytes Received By Job
...
012 (9648.000.000) 11/07 17:08:37 Job was held.
        Error from starter on slot2@xxxxxxxxxxxxxxxxxxxxxxx: STARTER at
138.253.234.21 failed to send file(s) to <138.253.100.27:64942>; SHADOW at
138.253.100.27 failed to write to file
/opt1/condor/mws_pool_spool/cluster9648.proc0.subproc0.tmp/time194: (errno 13)
Permission denied
        Code 12 Subcode 13

But the directory in question is there and the permissions are OK. I'm running
the central manager/submit host on a Sun V440 with Solaris 10. Execute hosts
are all Win XP SP2 and everything is Condor 7.0.2.

As a workaround I placed this in the config file:

#ICS workaround for "failed to write to file ... permission denied problem"
#ICS release the job up to 10 times if on hold for over 10 minutes
SYSTEM_PERIODIC_RELEASE = (JobRunCount < 10 && CurrentTime - EnteredCurrentStatus > 600) && \
                          (HoldReasonCode == 12 || HoldReasonSubCode == 13)

but as far as I can see the jobs aren't getting released automatically.
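
To reason about when the expression should become true, it can help to restate it outside Condor. The sketch below is a plain Python restatement, not a ClassAd evaluator: it ignores ClassAd semantics such as undefined attributes, the function name `should_release` is made up, and the attribute values in the example are invented apart from the hold codes 12 and 13 reported in the job log.

```python
def should_release(job, now):
    """Plain-Python mirror of the SYSTEM_PERIODIC_RELEASE expression.

    job is a dict of job attributes; now stands in for CurrentTime.
    """
    under_retry_limit = job["JobRunCount"] < 10
    held_long_enough = now - job["EnteredCurrentStatus"] > 600
    matching_hold = job["HoldReasonCode"] == 12 or job["HoldReasonSubCode"] == 13
    return under_retry_limit and held_long_enough and matching_hold


# Hypothetical held job: only the hold codes come from the job log.
held_job = {
    "JobRunCount": 1,
    "EnteredCurrentStatus": 1000,
    "HoldReasonCode": 12,
    "HoldReasonSubCode": 13,
}

print(should_release(held_job, now=1500))  # held for 500s, not yet eligible
print(should_release(held_job, now=1700))  # held for 700s, eligible
```

Note that because the last clause uses `||`, a hold with code 12 and any subcode, or with subcode 13 and any code, satisfies it.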

Any help would be most appreciated -  this has me baffled.

-ian.

--------------------------------------------
Dr Ian C. Smith,
e-Science Team,
The University Of Liverpool,
Computing Services Department,

