
Re: [Condor-users] All the other machines except central manager don't work!!



Is there anyone who can help me?

Our machines still don't work, except for the central manager.

The central manager submits jobs to the clients, but the clients cannot run them.

Which files do you need to see to find out what the problem is?

Please help me.

2010/1/13 Genie Jhang <geniejhang@xxxxxxxxxxx>
Thanks for your reply, Dan.

As you suggested, I changed the permissions of the directory /home/condor/execute on all machines to 777.

And I don't use NFS.

Now I'm getting this kind of error:

--------------------------------------------------

022 (218.000.000) 01/13 18:03:15 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot3@pheko05 <192.168.0.105:33682>
...
024 (218.000.000) 01/13 18:03:15 Job reconnection failed
    Job not found at execution machine 
    Can not reconnect to slot3@pheko05, rescheduling job

-------------------------------------------------------------
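For anyone collecting these events from several jobs, here is a minimal Python sketch that splits a job event log excerpt like the one above into records. It assumes each event starts with a line of the form "NNN (cluster.proc.subproc) MM/DD HH:MM:SS message", followed by indented detail lines, with "..." between events. That layout is inferred from the excerpt, and the function and field names are hypothetical.

```python
import re

# Header layout inferred from the excerpt above (an assumption, not a spec):
# "022 (218.000.000) 01/13 18:03:15 Job disconnected, attempting to reconnect"
HEADER = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$"
)

def parse_events(text):
    """Split a job event log excerpt into a list of event dicts."""
    events = []
    current = None
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            # A header line starts a new event.
            current = {
                "code": m.group(1),
                "cluster": int(m.group(2)),
                "proc": int(m.group(3)),
                "date": m.group(5),
                "time": m.group(6),
                "message": m.group(7).strip(),
                "details": [],
            }
            events.append(current)
        elif line.strip() == "...":
            # "..." closes the current event.
            current = None
        elif current is not None and line.startswith((" ", "\t")) and line.strip():
            # Indented lines belong to the current event.
            current["details"].append(line.strip())
    return events
```

Calling parse_events() on the text of a job's log file returns one dict per event, so the disconnect (022) and reconnection-failure (024) events can be filtered by their "code" field.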

I set:
UID_DOMAIN = 192.168.0.109
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
USE_NFS = False
SOFT_UID_DOMAIN = TRUE



2010/1/13 Dan Bradley <dan@xxxxxxxxxxxx>

Genie,

Is your condor execute directory on NFS with root squashing?  The following line is what makes me guess that it might be:


01/13 06:32:30 get_file(): Failed to open file /home/condor/execute/dir_22496/condor_exec.exe, errno = 13: Permission denied.

If EXECUTE is on an NFS mount with root squashing, then it needs to be world-writable.
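
One way to check this on each execute machine is a small script. The following is only a sketch in Python: the function names are hypothetical, and it inspects nothing beyond the ordinary permission bits (it does not detect NFS or root squashing).

```python
import os
import stat

def is_world_writable(path):
    """Return True if `path` is a directory that grants write permission to 'other'."""
    st = os.stat(path)
    return stat.S_ISDIR(st.st_mode) and bool(st.st_mode & stat.S_IWOTH)

def describe_mode(path):
    """Return the mode in `ls -l` style, e.g. 'drwxrwxrwx'."""
    return stat.filemode(os.stat(path).st_mode)
```

For example, is_world_writable("/home/condor/execute") could be run on each machine, using the path from the log line above.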

--Dan


Genie Jhang wrote:
Hello again.
 Thanks to all of you, I succeeded in getting Condor running and connecting all the machines in our lab.
 But when I finally tried to submit jobs, I found that none of the machines except the central manager work!
 So I dug through the log files.
 Here's the log.
 ----------------------------------------------------------------------------------------------------------------------------------  01/13 06:32:30 ******************************************************
01/13 06:32:30 ** condor_starter (CONDOR_STARTER) STARTING UP
01/13 06:32:30 ** /condor/sbin/condor_starter
01/13 06:32:30 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
01/13 06:32:30 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
01/13 06:32:30 ** $CondorVersion: 7.4.1 Dec 17 2009 BuildID: 204351 $
01/13 06:32:30 ** $CondorPlatform: I386-LINUX_RHEL3 $
01/13 06:32:30 ** PID = 22496
01/13 06:32:30 ** Log last touched time unavailable (No such file or directory)
01/13 06:32:30 ******************************************************
01/13 06:32:30 Using config source: /condor/etc/condor_config
01/13 06:32:30 Using local config sources:
01/13 06:32:30    /home/condor/condor_config.local
01/13 06:32:30 DaemonCore: Command Socket at <192.168.0.105:33714>
01/13 06:32:30 Done setting resource limits
01/13 06:32:30 Communicating with shadow <192.168.0.109:55237>
01/13 06:32:30 Submitting machine is "pheko09"
01/13 06:32:30 setting the orig job name in starter
01/13 06:32:30 setting the orig job iwd in starter
01/13 06:32:30 get_file(): Failed to open file /home/condor/execute/dir_22496/condor_exec.exe, errno = 13: Permission denied.
01/13 06:32:30 get_file(): consumed 28023 bytes of file transmission
01/13 06:32:30 DoDownload: consuming rest of transfer and failing after encountering the following error: STARTER at 192.168.0.105 failed to write to file /home/condor/execute/dir_22496/condor_exec.exe: (errno 13) Permission denied
01/13 06:32:30 WARNING: File /home/condor/execute/dir_22496/condor_exec.exe can not be accessed by Quill file transfer tracking.
01/13 06:32:30 File transfer failed (status=0).
01/13 06:32:30 ERROR "Failed to transfer files" at line 1882 in file jic_shadow.cpp
01/13 06:32:30 ShutdownFast all jobs.
 ------------------------------------------------------------------------------------------------------------------------------------
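 When a starter log is long, the lines that mention an errno value can be pulled out automatically. The following Python sketch does this for the two spellings seen in the log above ("errno = 13" and "errno 13"). The function name is hypothetical, and the symbolic names come from Python's errno module on the machine where the script runs.

```python
import errno
import re

# Matches both "errno = 13" and "(errno 13)" as they appear in the log above.
ERRNO_RE = re.compile(r"errno\s*=?\s*(\d+)")

def find_errno_lines(log_text):
    """Return (number, symbolic name, line) for each log line mentioning an errno."""
    hits = []
    for line in log_text.splitlines():
        m = ERRNO_RE.search(line)
        if m:
            code = int(m.group(1))
            name = errno.errorcode.get(code, "UNKNOWN")
            hits.append((code, name, line.strip()))
    return hits
```

Running find_errno_lines() on the contents of the starter log lists each failing line with its error number, which makes repeated permission failures easy to spot.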
 What on earth is the problem?
 I set ALLOW_WRITE = * in the condor_config file on all the machines.
------------------------------------------------------------------------

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/
 