
[Condor-users] Bug in Condor?



Hello,

I've been having some trouble with Condor while experimenting with fault tolerance. I've already posted a few emails to the list about those problems, and I'm still investigating them. Since I couldn't find a solution, I turned my attention to some other tests. The test I was working on was enabling a new execute machine while some jobs were executing (and more were in the queue), i.e. dynamically adding more nodes. My test is simple: the executable is a small C program that sleeps for 5 minutes and then prints the UID of the user running it (which, since I have a common UID domain, is the submitter's UID). If I run it without adding nodes, it works flawlessly. However, if I add a new node while some jobs are executing (with condor_on node-2, for example), the output files for those jobs come back empty. Is this a Condor bug? Here are some outputs:

Executable source code:
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int num;

    sleep(300);    /* Sleep 300 seconds (5 minutes) */

    num = getuid();
   
    printf("UID: %d", num);
   
    return 0;
}
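
A possible variant of the test program could check that the write to standard output actually succeeds, which would help separate "the job never wrote anything" from "the output was written but not transferred back". The sketch below is an assumption, not something that was run as part of this test; write_uid is a hypothetical helper name.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper (not in the original test): print the UID to `out`
 * and flush it, returning 0 on success or -1 if the write or flush failed. */
int write_uid(FILE *out, uid_t uid)
{
    assert(out != NULL);

    if (fprintf(out, "UID: %ld", (long)uid) < 0)
        return -1;
    if (fflush(out) == EOF)
        return -1;
    return 0;
}
```

In main, after the sleep, something like `return write_uid(stdout, getuid()) == 0 ? 0 : 1;` would make a failed write show up as a non-zero return value in the job log instead of as a silent empty file.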

Normal output (no new execute machines added to condor pool):
UID: 500

Erroneous output (new execute machines added to condor pool) is an empty file, as the file sizes show:
$ ls -sh out.*
   0 out.0   4.0K out.11  4.0K out.14  4.0K out.17  4.0K out.2   4.0K out.22  4.0K out.4     0 out.7
   0 out.1   4.0K out.12  4.0K out.15  4.0K out.18  4.0K out.20     0 out.23  4.0K out.5     0 out.8
4.0K out.10  4.0K out.13  4.0K out.16  4.0K out.19  4.0K out.21  4.0K out.3      0 out.6     0 out.9

The pattern of empty files follows my testing method:
- Start with 1 execute machine with two CPUs. Submit the jobs; two begin executing.
- condor_on the second execute machine (also two CPUs) before the jobs are finished (the machines enter Owner state).
- When the jobs complete (no output is transferred), the second machine leaves Owner state and begins executing jobs too.
- After these four jobs are completed (output is transferred), condor_on the third machine (also two CPUs), which enters Owner state.
- 6 jobs finish, no output is transferred. The new machine leaves Owner state and executes jobs too.
- All other jobs finish normally and have their output transferred.

Finally, here are two log files (one from a job whose output was transferred and one from a job whose output was not):
$ cat log.2
000 (263.002.000) 08/15 15:01:45 Job submitted from host: <10.1.1.1:9663>
...
001 (263.002.000) 08/15 15:06:58 Job executing on host: <10.255.255.252:9670>
...
005 (263.002.000) 08/15 15:11:58 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    8  -  Run Bytes Sent By Job
    7075  -  Run Bytes Received By Job
    8  -  Total Bytes Sent By Job
    7075  -  Total Bytes Received By Job
...

$ cat log.0
000 (263.000.000) 08/15 15:01:45 Job submitted from host: <10.1.1.1:9663>
...
001 (263.000.000) 08/15 15:01:55 Job executing on host: <10.255.255.252:9670>
...
005 (263.000.000) 08/15 15:06:55 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    7075  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    7075  -  Total Bytes Received By Job
...
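
The only difference between the two logs above, apart from the timestamps, is the "Bytes Sent By Job" count: 8 for the job whose output came back (the length of "UID: 500") and 0 for the one whose output file is empty. A minimal sketch of a check for that line, assuming the log format shown here, could look like the following; parse_run_bytes_sent is a hypothetical helper name, not part of Condor.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: parse a user-log line of the form
 * "    8  -  Run Bytes Sent By Job".
 * Returns 1 and stores the count in *bytes on a match, 0 otherwise. */
int parse_run_bytes_sent(const char *line, long *bytes)
{
    long value = 0;
    int consumed = 0;

    assert(line != NULL && bytes != NULL);

    /* %n is only reached if every literal before it matched, so
     * consumed stays 0 for "Total Bytes Sent" or "Bytes Received" lines. */
    if (sscanf(line, " %ld - Run Bytes Sent By Job%n", &value, &consumed) == 1
        && consumed > 0) {
        *bytes = value;
        return 1;
    }
    return 0;
}
```

Calling this on each line of every log.N file and reporting the jobs where the count is 0 would list the affected jobs without having to compare file sizes by hand.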

If this is my fault, can someone help me fix it? If it isn't, I hope this report helps. Thank you,

JVFF

