Re: [Condor-users] Problem with MSYS command started from condor_starter

I'd recognize that directory path anywhere. :)

If you use a MinGW command on that machine to touch that file while the one process is hung, what happens? Do you get the same "No such file or directory" error? I'm wondering whether MinGW uses some shared library that's left in a bad state by this one job, preventing every other binary that needs that library from working. Something akin to cygwin1.dll.
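One way to run that check is to compare what a native Windows call sees with what an MSYS tool sees for the same path while the process is hung. The following is a minimal Python sketch, not something from the original setup: `probe_path` and its parameters are hypothetical names, and it assumes the MSYS `touch` is on PATH (pass the full path to the binary otherwise).

```python
import os
import subprocess


def probe_path(path, tool_cmd=("touch",), timeout_s=30):
    """Compare the native view of `path` with an external tool's view.

    tool_cmd is the command prefix to run against the path. The default
    ("touch",) is an assumption: it expects the MSYS touch to be on PATH.
    """
    # Native check: goes through Python's own OS calls, not the external tool.
    native_exists = os.path.exists(path)

    try:
        result = subprocess.run(
            list(tool_cmd) + [path],
            timeout=timeout_s,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child when the timeout expires.
        return {"native_exists": native_exists, "tool_status": "hung", "output": ""}

    return {
        "native_exists": native_exists,
        "tool_status": "ok" if result.returncode == 0 else "failed",
        "output": result.stdout.decode(errors="replace"),
    }
```

If `native_exists` is true while `tool_status` is `failed`, the file is visible to Windows but not through the tool, which would point at the tool's runtime rather than at the share itself.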

As for why: what do system resources look like when it hangs? Is I/O to UNC shares also failing? What about the desktop heap on the machine? Exhausted?

- Ian

On 2011-04-14, at 10:59 PM, yap munsoon <soonyee@xxxxxxxxx> wrote:

I have been using Condor for some time as a user. Recently, I started using MSYS (http://www.mingw.org/wiki/msys), calling MSYS commands in some scripts.
We have 8 condor_starter processes running concurrently on an 8-core Windows XP machine.

To reproduce the issue, I created a script that just copies some files and then removes them using MSYS commands.
I submitted 10000 jobs running this script (with MSYS calls) to the Condor farm. After they run concurrently for some time (1 or 2 hours), one MSYS command (e.g. cp.exe) hangs. I attached mingw gdb to the hung command and got the following call stack.

(gdb) where
#0  0x7d61002e in strchr () from C:\WINDOWS\system32\ntdll.dll
#1  0x7d666ea1 in ntdll!RtlCopyUnicodeString ()
   from C:\WINDOWS\system32\ntdll.dll
#2  0x00000000 in ?? ()

While that process is hung, subsequent jobs that call MSYS commands fail mysteriously with the following error.

    cp: cannot stat `s:/data/regtestfiles/main/current/regtest/infrastructure/regutils/reg_run/reg_copy_files//prev/reg.rout': No such file or directory

The file does actually exist. If I kill the hung process, subsequent jobs go back to normal.
I checked in Process Explorer: the hung process and my script are child processes of condor_master->condor_starter.

Btw, it works fine on Windows 7.
I also tried simulating the same workload from 8 cmd prompts (not from condor_starter), and everything ran fine.
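For anyone who wants to repeat the standalone test, a copy-then-remove cycle with hang detection can be sketched roughly as below. This is an illustrative Python sketch, not the actual script used for the jobs: the function names, the 60-second timeout, and the default `cp` / `rm -f` commands (assumed to be the MSYS binaries on PATH) are all assumptions. Note that it kills a command that exceeds the timeout, so it detects a hang rather than preserving the hung state for debugging.

```python
import subprocess
from pathlib import Path


def run_with_timeout(cmd, timeout_s=60):
    """Run cmd and return 'ok', 'failed', or 'hung' (killed after timeout_s)."""
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return "hung"
    return "ok" if result.returncode == 0 else "failed"


def copy_remove_cycle(src, work_dir, cp_cmd=("cp",), rm_cmd=("rm", "-f"),
                      timeout_s=60):
    """Copy src into work_dir with cp_cmd, then delete the copy with rm_cmd.

    Returns 'ok', or a string naming the step that failed or hung.
    The default commands are assumptions (MSYS cp and rm on PATH).
    """
    dst = Path(work_dir) / Path(src).name

    status = run_with_timeout(list(cp_cmd) + [str(src), str(dst)], timeout_s)
    if status != "ok":
        return "cp " + status

    status = run_with_timeout(list(rm_cmd) + [str(dst)], timeout_s)
    if status != "ok":
        return "rm " + status

    return "ok"
```

Calling `copy_remove_cycle` in a loop from several processes at once, and logging any result other than `ok`, approximates the 8 concurrent cmd prompts described above.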

Thanks in advance for your replies and insight.

Mun Soon

