Re: [Condor-users] condor_submit never return with condor 7.2.1



Hi,

I thought that Condor put all its locks under a directory that we have to configure as local (which I do). All my Condor installations are local to each machine, but people (including me) submit jobs from NFS directories, so this lock ends up on NFS...

The configuration variables
       IGNORE_NFS_LOCK_ERRORS = True
       LOG_ON_NFS_IS_ERROR = False
didn't work. I managed to get the log file onto the local disk, as needed with DAGMan, since we use a script that generates the submit file.
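
For anyone generating submit files the same way, here is a minimal sketch of the idea in Python. It is not our actual script: the function name, the /var/tmp/condor_logs directory and the job name are hypothetical, and it only relies on the standard submit commands executable, arguments, log and queue. The point is simply that the log path is built from a directory on local disk rather than from the NFS directory the job is submitted from.

```python
import os


def make_submit_description(executable, arguments, local_log_dir, job_name):
    """Build a submit description whose user log lives under local_log_dir.

    local_log_dir is assumed to be on a local disk (hypothetical path);
    nothing here checks that, the caller has to choose it correctly.
    """
    log_path = os.path.join(local_log_dir, job_name + ".log")
    lines = [
        "executable = " + executable,
        "arguments = " + arguments,
        "log = " + log_path,  # user log on local disk, not on NFS
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(make_submit_description("/bin/echo", "1",
                                  "/var/tmp/condor_logs", "echo_1"))
```

The job's output and error files can stay wherever they were; only the log line matters for the lock.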

Thanks for pointing out the cause.

Just one idea that could solve this problem for everybody: what about running a lock daemon on the central manager? We could specify which directories are on NFS (or Condor could detect this, if that is not too difficult), and when Condor wants to take a lock on a file that is on NFS, it would take the lock through the central manager instead. That way, there is no lock on NFS.
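
To give an idea of what the detection part could look like, here is a rough Python sketch. It is not something Condor does today as far as I know; it assumes a Linux machine where mounts are listed in /proc/mounts, and the function names are made up for the example. It finds the longest mount point that contains a path and checks whether its file system type is nfs or nfs4.

```python
import os

NFS_TYPES = {"nfs", "nfs4"}


def parse_mounts(text):
    """Return (mount_point, fs_type) pairs from /proc/mounts-style text."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mounts.append((fields[1], fields[2]))
    return mounts


def fs_type_for(path, mounts):
    """Return the fs type of the longest mount point containing path."""
    path = os.path.abspath(path)
    best_point, best_type = "", None
    for point, fs_type in mounts:
        prefix = point.rstrip("/") + "/"
        if path == point or path.startswith(prefix):
            # ">=" lets a later entry for the same mount point win.
            if len(point) >= len(best_point):
                best_point, best_type = point, fs_type
    return best_type


def is_on_nfs(path, mounts):
    """True if path sits under a mount whose type is nfs or nfs4."""
    return fs_type_for(path, mounts) in NFS_TYPES
```

On a real machine it would be called as is_on_nfs(path, parse_mounts(open("/proc/mounts").read())). The sketch ignores symbolic links and mount points with escaped spaces, so a real implementation would need more care.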

Anyway, I don't have the time to do it, so feel free to ignore it.

Thanks again.

Frédéric Bastien

On Tue, Apr 7, 2009 at 3:51 PM, Ian Chesal <ICHESAL@xxxxxxxxxx> wrote:
Frédéric,

> I finally got the output from condor_submit with -debug. I read it, but
> I can't find anything to help me. I hesitate to send it to the mailing
> list as it contains information about the security that we use, so I am
> sending it to you in case you can help me. If not, just tell me.

You actually replied to the list so I'll answer here. If you ask the list admins they might be able to remove your attachment from the list archives so at least the output isn't kept around for all ages.

> I submitted a very small job (echo 1). The condor_submit process is still
> running after 15 minutes. After a few seconds, there is no more output
> from condor_submit. So you have the full output.
>
> Do you have any idea of what could cause this?

These last few lines in your output:

4/6 17:14:14 (fd:2) (pid:841) FileLock object is updating timestamp on: /u/bastienf/testclaude/LOGS.NOBACKUP/echo_1_2009-04-06_17:14:05.784002/condor.log
4/6 17:14:14 (fd:2) (pid:841) PRIV_USER --> PRIV_CONDOR at file_lock.cpp:432
4/6 17:14:14 (fd:2) (pid:841) PRIV_CONDOR --> PRIV_USER at file_lock.cpp:444
4/6 17:14:14 (fd:2) (pid:841) PRIV_USER --> PRIV_UNKNOWN at user_log.cpp:173
4/6 17:14:14 (fd:2) (pid:841) PRIV_UNKNOWN --> PRIV_USER at user_log.cpp:767
4/6 17:14:14 (fd:2) (pid:841) FILE_LOCK_VIA_MUTEX is undefined, using default value of True

They make it look like condor_submit is waiting on a file lock to write something to a shared file.
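
One way to check that theory outside of Condor is to try locking a file in the same directory yourself. Below is a minimal Python sketch, assuming a POSIX system; it uses fcntl.lockf from the standard library, which is not necessarily the same call Condor makes, and the function name and timeout values are made up for the example. It asks for an exclusive lock without blocking and gives up after a timeout.

```python
import errno
import fcntl
import time


def try_lock(path, timeout=5.0, interval=0.2):
    """Try to take an exclusive advisory lock on path.

    Returns True if the lock was obtained (it is released again right
    away), False if another process still held it when the timeout expired.
    The file is opened in append mode, so it is created if it is missing.
    """
    deadline = time.monotonic() + timeout
    with open(path, "a") as handle:
        while True:
            try:
                fcntl.lockf(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError as exc:
                # EACCES/EAGAIN mean "held by someone else"; anything
                # else is a real error and is passed on to the caller.
                if exc.errno not in (errno.EACCES, errno.EAGAIN):
                    raise
                if time.monotonic() >= deadline:
                    return False
                time.sleep(interval)
            else:
                fcntl.lockf(handle, fcntl.LOCK_UN)
                return True
```

Run it against a scratch file next to the job log rather than the log itself. If the call itself never comes back, that already points at the NFS lock service, since on a hard mount even a non-blocking request may stall when the lock manager does not answer.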

Are your logs on a network file system? If so, perhaps the network file system protocol is causing a file locking issue here? There are known problems with logging on NFS-exported file systems in Condor. How are you mounting the /u file system? I usually mount the NFS file systems used by my Condor pools with:

<NAS>:/data   /data       nfs   exec,dev,suid,rw,tcp,hard,vers=3,rsize=32768,wsize=32768,timeo=10,retrans=600    1 1

That works well with the file locking semantics in Condor. I also run with:

       IGNORE_NFS_LOCK_ERRORS = True
       LOG_ON_NFS_IS_ERROR = False

in my condor_config files.

Hope that helps move along your debugging a bit!

Warm regards,
- Ian