
Re: [Condor-users] Job fails to run / Job leaves around unkillable processes



Here is everything in the starter log from the last 2 seconds of running that process.  As the log below shows, IWD is set to C:\condor\execute\dir_6728, and the starter later fails to delete that same directory, which it created itself.  Again, usernames and domains have been changed to protect the guilty.  I'm not sure why the starter can create a directory and copy an executable into it, but then can't run the executable or delete the directory afterwards.  This is very strange.

10/29 12:47:27 Calling client FileTransfer handler function.
10/29 12:47:27 TokenCache contents: 
USER@DOMAIN
10/29 12:47:27 HOOK_PREPARE_JOB not configured.
10/29 12:47:27 Job 5.0 set to execute immediately
10/29 12:47:27 DaemonCore: return from reaper for pid 6
10/29 12:47:27 Return from Timer handler 6 (FakeCreateThreadReaperCaller::CallReaper())
10/29 12:47:27 Calling Timer handler 9 (deferred job start)
10/29 12:47:27 Starting a VANILLA universe job with ID: 5.0
10/29 12:47:27 In OsProc::OsProc()
10/29 12:47:27 Main job KillSignal: 15 (Unknown)
10/29 12:47:27 Main job RmKillSignal: 15 (Unknown)
10/29 12:47:27 Main job HoldKillSignal: 15 (Unknown)
10/29 12:47:27 in VanillaProc::StartJob()
10/29 12:47:27 in OsProc::StartJob()
10/29 12:47:27 IWD: C:\condor\execute\dir_6728
10/29 12:47:27 TokenCache contents: 
USER@DOMAIN
10/29 12:47:27 Input file: NUL
10/29 12:47:27 Output file: C:\condor\execute\dir_6728\5.0.output
10/29 12:47:27 Error file: C:\condor\execute\dir_6728\5.0.error
10/29 12:47:28 Renice expr "10" evaluated to 10
10/29 12:47:28 About to exec C:\condor\execute\dir_6728\condor_exec.exe 
10/29 12:47:28 Env = TMP=C:\condor\execute\dir_6728 _CONDOR_JOB_IWD=C:\condor\execute\dir_6728 _CONDOR_SLOT=1 _CONDOR_MACHINE_AD=C:\condor\execute\dir_6728\.machine.ad TEMP=C:\condor\execute\dir_6728 TMPDIR=C:\condor\execute\dir_6728 _CONDOR_SCRATCH_DIR=C:\condor\execute\dir_6728 _CONDOR_JOB_AD=C:\condor\execute\dir_6728\.job.ad _CONDOR_JOB_PIDS=
10/29 12:47:28 ENFORCE_CPU_AFFINITY not true, not setting affinity
10/29 12:47:28 In OwnerProfile::update()
10/29 12:47:28 Create_Process(): executable: 'C:\condor\execute\dir_6728\condor_exec.exe'
10/29 12:47:28 GetBinaryType() returned 0
10/29 12:47:28 TokenCache contents: 
USER@DOMAIN
10/29 12:47:28 Create_Process: CreateProcess failed, errno=267
10/29 12:47:28 ERROR "Create_Process(C:\condor\execute\dir_6728\condor_exec.exe,, ...) failed: " at line 530 in file ..\src\condor_starter.V6.1\os_proc.cpp
10/29 12:47:28 ShutdownFast all jobs.
10/29 12:47:28 Got ShutdownFast when no jobs running.
10/29 12:47:28 HOOK_JOB_EXIT not configured.
10/29 12:47:28 Inside JICShadow::transferOutput(void)
10/29 12:47:28 JICShadow::transferOutput(void): Transferring...
10/29 12:47:28 Inside JICShadow::transferOutputMopUp(void)
10/29 12:47:28 Removing C:\condor\execute\dir_6728
10/29 12:47:28 Attempting to remove C:\condor\execute\dir_6728 as SuperUser (system)
10/29 12:47:28 Removing "C:\condor\execute\dir_6728" as SuperUser (system) failed: /bin/rm exited with status 32
10/29 12:47:28 perm::init() starting up for account (SYSTEM) domain (NT AUTHORITY)
10/29 12:47:28 perm::init: Found Account Name SYSTEM
10/29 12:47:28 set_acls() found a matching ACE already in the ACL, so skipping the add
10/29 12:47:28 set_acls() found a matching ACE already in the ACL, so skipping the add
10/29 12:47:28 set_acls() found a matching ACE already in the ACL, so skipping the add
10/29 12:47:28 set_acls() found a matching ACE already in the ACL, so skipping the add
10/29 12:47:28 set_acls() found a matching ACE already in the ACL, so skipping the add
10/29 12:47:28 Attempting to remove C:\condor\execute\dir_6728 as SuperUser (system)
10/29 12:47:28 Removing "C:\condor\execute\dir_6728" as SuperUser (system) failed: /bin/rm exited with status 32
10/29 12:47:28 ERROR: C:\condor\execute\dir_6728 still exists after trying to add Full control to ACLs for PRIV_ROOT
10/29 12:47:28 Deleting the StarterHookMgr


On Mon, Nov 1, 2010 at 8:02 AM, John (TJ) Knoeller <johnkn@xxxxxxxxxxx> wrote:
Is there a line in the starter log that looks like this?

    IWD: <some path>

It would appear before the message that Create_Process failed.  This is the initial working directory; if it's different
from the path to the executable, then it might be the directory that's invalid.

-tj
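
The check suggested above can be sketched in Python. This is only an illustration, not part of Condor: the function name summarize_failures and the regular expressions are assumptions based on the line formats visible in the StarterLog excerpts in this thread, and other Condor versions may log differently. It remembers the most recent IWD and executable lines and reports them next to each CreateProcess failure, so the two paths can be compared.

```python
import ntpath
import re

# Patterns modelled on the StarterLog lines quoted in this thread (assumed formats).
IWD_RE = re.compile(r"^\d+/\d+ \d+:\d+:\d+ IWD: (?P<iwd>.+)$")
EXEC_RE = re.compile(r"Create_Process\(\): executable: '(?P<exe>[^']+)'")
FAIL_RE = re.compile(r"CreateProcess failed, errno=(?P<errno>\d+)")


def summarize_failures(lines):
    """Yield (iwd, executable, errno) for each CreateProcess failure seen."""
    iwd = exe = None
    for line in lines:
        line = line.rstrip("\r\n")
        match = IWD_RE.search(line)
        if match:
            iwd = match.group("iwd").strip()
            continue
        match = EXEC_RE.search(line)
        if match:
            exe = match.group("exe")
            continue
        match = FAIL_RE.search(line)
        if match:
            yield iwd, exe, int(match.group("errno"))


# Three lines copied from the log earlier in this thread.
sample = r"""10/29 12:47:27 IWD: C:\condor\execute\dir_6728
10/29 12:47:28 Create_Process(): executable: 'C:\condor\execute\dir_6728\condor_exec.exe'
10/29 12:47:28 Create_Process: CreateProcess failed, errno=267""".splitlines()

for iwd, exe, errno in summarize_failures(sample):
    # ntpath handles Windows-style paths even when this runs on another OS.
    same_dir = (
        iwd is not None
        and exe is not None
        and ntpath.dirname(exe).lower() == iwd.lower()
    )
    print(f"errno={errno} iwd={iwd} exe={exe} exe_in_iwd={same_dir}")
```

To run it against a real log, pass an open file instead of the sample list, for example summarize_failures(open("StarterLog.slot1")).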


On 10/29/2010 3:00 PM, Torrin Jones wrote:
Thanks for the response.  Good points.  However . . . 

V: is actually a physical hard drive on my computer and, at the moment, condor is only installed on my computer.  I was doing a test to see if my software would work with the latest version.  So everything is contained on my computer, which has V: as a physical hard drive, and condor should be able to get at it.  I also checked that these directories actually exist.  They do and, as far as I can tell, they are accessible by anybody, including condor (which is running as NT AUTHORITY\SYSTEM).

After all this, I wanted to be sure, so I moved everything to c:\temp and changed all paths in the submit description file to relative paths and then submitted to condor to see if anything changed.  Unfortunately, I still have the same problem.

I've attached the new submit description file and the output log file.  IP address, port numbers, usernames, etc. have been changed to protect the guilty.   Below is what came out of StarterLog.slot1.

10/29 12:47:28 Create_Process: CreateProcess failed, errno=267
10/29 12:47:28 ERROR "Create_Process(C:\condor\execute\dir_6728\condor_exec.exe,, ...) failed: " at line 530 in file ..\src\condor_starter.V6.1\os_proc.cpp



On Fri, Oct 29, 2010 at 8:51 AM, John (TJ) Knoeller <johnkn@xxxxxxxxxxx> wrote:
Yep, 267 is "The directory name is invalid".  From looking at your .job file, I'm wondering whether the invalid directory is actually
v:\temp\condor or v:\shared\condor rather than c:\condor\execute\dir_6136, as the error message seems to imply.

I'm guessing that v: is a network drive.  So I gotta wonder, is v: really valid in the context of the job?
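
For reference, the errno lookup can be sketched in Python. This is a minimal illustration, not anything Condor ships: the helper name describe_win32_error and the small table are assumptions, with names and messages taken from the Win32 system error code list. On Windows it asks the OS for the message via ctypes.FormatError; elsewhere it falls back to the table.

```python
import sys

# A few Win32 system error codes (names and English messages as listed in winerror.h).
KNOWN_WIN32_ERRORS = {
    2: ("ERROR_FILE_NOT_FOUND", "The system cannot find the file specified."),
    3: ("ERROR_PATH_NOT_FOUND", "The system cannot find the path specified."),
    5: ("ERROR_ACCESS_DENIED", "Access is denied."),
    267: ("ERROR_DIRECTORY", "The directory name is invalid."),
}


def describe_win32_error(code):
    """Return a readable description for a Win32 error code."""
    if sys.platform == "win32":
        # ctypes.FormatError only exists on Windows; it returns the
        # system's (possibly localized) message for the code.
        import ctypes
        return ctypes.FormatError(code)
    name, text = KNOWN_WIN32_ERRORS.get(code, ("UNKNOWN", "code not in local table"))
    return f"{name}: {text}"


print(describe_win32_error(267))
```

Error 267 refers to a directory rather than to the executable file itself, which is why the working directory handed to CreateProcess is worth checking in addition to the path of condor_exec.exe.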


On 10/29/2010 9:31 AM, Torrin Jones wrote:
Using Condor 7.4.4 on Windows XP.

Any idea what would cause an error 267?

From StarterLog.slot1 . . .

10/28 08:35:33 Create_Process: CreateProcess failed, errno=267
10/28 08:35:33 ERROR "Create_Process(C:\condor\execute\dir_6136\condor_exec.exe,, ...) failed: " at line 530 in file ..\src\condor_starter.V6.1\os_proc.cpp

The MSDN says 267 means, "The directory name is invalid."  However, the directory name is there.  Here is the scenario.  I submit a small job.  condor_dummy.job attached.  All condor_dummy.exe does is print out a line like this . . .

Run by DOMAIN\USER on COMPUTERNAME at DATE TIME.

It's basically a quick condor test.

Anyway, I submit the job and condor tries to run it.  However, it fails and I get the above message in StarterLog.slot1.  Here is the kicker: it will retry and fail, but if I leave it in the queue long enough, it will eventually succeed.  When I ran the job yesterday, it tried 28 times, and the final attempt succeeded.  Here is another thing I'm seeing.  After it succeeded, I looked in Process Explorer and saw 27 condor_exec.exe processes still running.  These processes were unkillable.  I tried every approach I could think of: killing them as Admin, as NT AUTHORITY\SYSTEM, even attaching a debugger and killing them that way.  Nothing works.

So I have 2 issues.

1. The job fails to run.
2. The job leaves around unkillable processes.

Any ideas?  Has anybody seen anything like this?
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/
