
Re: [Condor-users] Bad file descriptors?




John Horne wrote:
Hello,

I've installed Condor 6.7.6 and am currently trying to run the example
programs using just one remote client. I'm not sure how long they are
supposed to take, but the first job seems to have been running for about
half an hour and is still going (I think). I noticed the following in the
workstation's 'log' directory on the Condor master server, in the StartLog file:

  3/29 13:17:46 StatInfo::fstat64(/dev/stdin) failed, errno: 9 = Bad
  file descriptor
  3/29 13:17:46 StatInfo::fstat64(/dev/stdout) failed, errno: 9 = Bad
  file descriptor
  3/29 13:17:46 StatInfo::fstat64(/dev/stderr) failed, errno: 9 = Bad
  file descriptor

The workstation is running the Linux Terminal Server Project (LTSP)
version 4.1. I can see no obvious problem with /dev/stdin or the others;
each is a chain of soft links pointing eventually (for stdin) to /dev/vc/1,
which has the attributes:

   crw-------  1 root root  4, 1 Mar 29 13:25 /dev/vc/1

Does anyone have any ideas as to what the problem with the file descriptors might be?


Thanks,

John.

  
Hi John,
We see the same error messages in our cluster as well. However, we are able to submit and run our
jobs successfully, so it is unlikely that this is what is preventing your job from terminating.
  3/29 13:17:46 StatInfo::fstat64(/dev/stdin) failed, errno: 9 = Bad
  file descriptor
  3/29 13:17:46 StatInfo::fstat64(/dev/stdout) failed, errno: 9 = Bad
  file descriptor
  3/29 13:17:46 StatInfo::fstat64(/dev/stderr) failed, errno: 9 = Bad
  file descriptor
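For what it's worth, errno 9 (EBADF) from an fstat-style call usually means the probed file descriptor is closed, not that the device node itself is broken; a daemon that has detached from its controlling terminal can see exactly this when it probes /dev/stdin and friends. A minimal Python sketch (just an illustration of the errno, not Condor code) reproduces the message:

```python
# Illustration only: errno 9 ("Bad file descriptor") arises from using a
# closed descriptor, not from a broken device node.
import errno
import os

fd = os.open("/dev/null", os.O_RDONLY)
os.close(fd)                  # the descriptor is now invalid

try:
    os.fstat(fd)              # same underlying call family as fstat64
except OSError as e:
    err = e.errno

print(f"errno {err} = {os.strerror(err)}")  # errno 9 = Bad file descriptor
```

So the permissions on /dev/vc/1 are likely a red herring; the stat is failing before the device is ever reached.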

Here are a few things you could try.
1. Run a /bin/sleep job with arguments = 60 (one minute). I've attached a job description file for
such a job.

InitialDir = /tmp
Executable = /bin/sleep
Universe   = Vanilla
Output     = /tmp/test.out
Error      = /tmp/test.err
Log        = /tmp/test.log
Arguments  = 60
Queue

2. Check the /tmp/test.log file to see whether the job ran and terminated properly. You should see
something like this:

000 (1358.000.000) 03/24 15:20:28 Job submitted from host: <192.168.25.208:53222>

001 (1358.000.000) 03/24 15:25:15 Job executing on host: <192.168.25.195:41686>

005 (1358.000.000) 03/24 15:25:36 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...
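If you'd rather not eyeball the log, a quick check can look for the termination event. This is just a sketch against the plain-text user-log format shown above (event code 005 plus the "Normal termination" line), not a full log parser:

```python
# Sketch: check a Condor user-log excerpt for a "Job terminated" event
# (code 005) with a normal, zero return value.
import re

SAMPLE = """\
000 (1358.000.000) 03/24 15:20:28 Job submitted from host: <192.168.25.208:53222>
001 (1358.000.000) 03/24 15:25:15 Job executing on host: <192.168.25.195:41686>
005 (1358.000.000) 03/24 15:25:36 Job terminated.
        (1) Normal termination (return value 0)
"""

def job_terminated_ok(log_text: str) -> bool:
    """True if the log records event 005 and 'Normal termination (return value 0)'."""
    terminated = re.search(r"^005 .*Job terminated\.", log_text, re.M)
    normal = "Normal termination (return value 0)" in log_text
    return bool(terminated) and normal

print(job_terminated_ok(SAMPLE))  # True
```

Against the real file you would call job_terminated_ok(open("/tmp/test.log").read()).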

3. If the sleep job runs successfully, try executing your original job by hand at the command
prompt to see whether it runs. How long does it take: is the execution time bounded or completely
non-deterministic? Have you checked the status of the job with condor_q? If the job is not running,
what does condor_q -analyze report? Is it possible that the job starts running and then gets
preempted because of the policy you've configured? If so, that should be reflected in the log file.
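To answer the "bounded or non-deterministic" question empirically, you can time a few runs of the job outside Condor. A small sketch, with /bin/sleep standing in for whatever your actual executable is:

```python
# Sketch: time a few runs of the job outside Condor to see whether its
# runtime is bounded.  /bin/sleep is a stand-in for your real executable.
import subprocess
import time

cmd = ["/bin/sleep", "1"]   # replace with your actual job and arguments

for i in range(3):
    start = time.monotonic()
    result = subprocess.run(cmd)
    elapsed = time.monotonic() - start
    print(f"run {i}: rc={result.returncode} elapsed={elapsed:.2f}s")
```

If the wall-clock times vary wildly between runs, the problem is in the job itself rather than in Condor.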

Let me know how it goes,
-- 
Rajesh Rajamani
Senior Member of Technical Staff
Direct : +1.408.321.9000
Fax    : +1.408.904.5992
Mobile : +1.650.218.9131
raj@xxxxxxxxxx


Optena Corporation
2860 Zanker Road, Suite 201
San Jose, CA 95134
www.optena.com
 
