
Re: [Condor-users] Bug in Condor?



Matthew Farrellee wrote:
Janito Ferreira Filho wrote:
Hello,

I've been having some trouble with Condor while experimenting with fault tolerance, and I've posted to the list about it while I keep investigating. Since I couldn't find a solution, I turned my attention to other tests. The test I was working on was enabling a new execute machine while some jobs were executing (and more were queued), i.e. dynamically adding more nodes. My test is simple: the executable is a small C program that sleeps for 5 minutes and then prints the UID of the user executing it (which, since I have a common UID domain, is the submitter's UID). If I run it without adding nodes, it works flawlessly. However, if I add a new node while some jobs are executing (with condor_on node-2, for example), the output files don't get returned to me. Is this a Condor bug? Here are some outputs:

Executable source code:
#include <stdio.h>     /* for printf() */
#include <unistd.h>    /* for sleep() and getuid() */

int main(int argc, char *argv[])
{
    int num;

    sleep(300);    /* Sleep 5 minutes (300 seconds) */

    num = getuid();
    printf("UID: %d", num);
    return 0;
}
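
For reference, a submit description file along these lines reproduces this setup; the executable name and file-transfer settings are illustrative, not copied verbatim from my actual file:

universe                = vanilla
executable              = sleeper
output                  = out.$(Process)
log                     = log.$(Process)
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue 24

With this, each of the 24 jobs writes its stdout to its own out.N file, which is what the listing below shows.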

Normal output (no new execute machines added to the Condor pool):
UID: 500

Erroneous output (new execute machines added to the Condor pool) is an empty file. You can see this from the file sizes:
$ ls -sh out.*
   0 out.0   4.0K out.11  4.0K out.14  4.0K out.17  4.0K out.2   4.0K out.22  4.0K out.4     0 out.7
   0 out.1   4.0K out.12  4.0K out.15  4.0K out.18  4.0K out.20     0 out.23  4.0K out.5     0 out.8
4.0K out.10  4.0K out.13  4.0K out.16  4.0K out.19  4.0K out.21  4.0K out.3      0 out.6     0 out.9

Here you can clearly see my testing method:
- Start with 1 execute machine with two CPUs. Submit the jobs; two begin executing.
- condor_on the second execute machine (also two CPUs) before those jobs are finished (its machines enter the Owner state).
- When those jobs complete (no output is transferred), the second machine leaves the Owner state and begins executing jobs too.
- After these four jobs complete (output is transferred), condor_on the third machine (also two CPUs), which becomes Owner.
- 6 jobs finish with no output transferred. The new machine leaves the Owner state and executes jobs too.
- All other jobs finish normally and have their output transferred.

Finally, here are two log files (one from a job whose output was transferred and one whose output was not):
$ cat log.2
000 (263.002.000) 08/15 15:01:45 Job submitted from host: <10.1.1.1:9663>
...
001 (263.002.000) 08/15 15:06:58 Job executing on host: <10.255.255.252:9670>
...
005 (263.002.000) 08/15 15:11:58 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    8  -  Run Bytes Sent By Job
    7075  -  Run Bytes Received By Job
    8  -  Total Bytes Sent By Job
    7075  -  Total Bytes Received By Job
...

$ cat log.0
000 (263.000.000) 08/15 15:01:45 Job submitted from host: <10.1.1.1:9663>
...
001 (263.000.000) 08/15 15:01:55 Job executing on host: <10.255.255.252:9670>
...
005 (263.000.000) 08/15 15:06:55 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    7075  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    7075  -  Total Bytes Received By Job
...

If this happened through some fault of mine, can someone help me fix it? If it isn't, I hope this report helps. Thank you,

JVFF

This certainly sounds strange. I notice that both proc 0 and 2 ran on the same machine (10.255.255.252).

Can you reproduce this with ALL_DEBUG=D_FULLDEBUG in your configuration files (restart condor after making the change) and check for ERROR, error or WARNING messages? You'll want to look in 10.1.1.1's SchedLog for sure and probably also the StartLog and StarterLog on the exec nodes (10.255.255.252).
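
Something along these lines should do it (a sketch; this assumes your local configuration file is read by both the submit and execute machines, and that the LOG knob points at each machine's log directory):

# Add to the local Condor configuration on the submit and execute machines:
ALL_DEBUG = D_FULLDEBUG

# Restart the daemons so the change takes effect, then scan the logs:
$ condor_restart
$ grep -iE 'error|warn' "$(condor_config_val LOG)"/SchedLog
$ grep -iE 'error|warn' "$(condor_config_val LOG)"/StartLog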

Are your execute machines all sharing the same EXECUTE directory? That would certainly be bad. To check, run this command on all of your execute machines:

condor_config_val EXECUTE
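
If the machines are reachable over ssh, a quick loop saves some typing (the hostnames are placeholders, and this assumes condor_config_val is on the remote PATH):

for h in node-1 node-2 node-3; do echo -n "$h: "; ssh $h condor_config_val EXECUTE; done

Each machine should report a directory local to it; identical paths backed by a shared filesystem would explain jobs stepping on each other's sandboxes.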

--Dan