[Condor-users] Bug



Hello,

We recently installed ROCKS with Condor, and until now everything was fine.
Below are three logs that show the problem, which so far we have not been
able to figure out. In short, Condor does not execute any job submitted with
either the condor_submit command or the condor_submit -n command. All jobs
are submitted in the Vanilla universe.
Two types of error are shown in the logs.
In the first case, when we submit the job using just condor_submit
"executable", the error shows that the directory of the executable file
(here test.out0) cannot be accessed.

ShadowLog:
10/1 17:22:26 ******************************************************
10/1 17:22:26 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/1 17:22:26 ** $CondorVersion: 6.6.0 Nov 13 2003 $
10/1 17:22:26 ** $CondorPlatform: INTEL-LINUX-GLIBC23 $
10/1 17:22:26 ** PID = 17003
10/1 17:22:26 ******************************************************
10/1 17:22:26 Using config file: /opt/condor/etc/condor_config
10/1 17:22:26 Using local config files:
/opt/condor/local.frontend-0/condor_config.local
10/1 17:22:26 DaemonCore: Command Socket at <10.1.1.1:53694>
10/1 17:22:27 Initializing a VANILLA shadow
10/1 17:22:27 (31.0) (17003): Request to run on <10.255.255.253:32775> was ACCEPTED
10/1 17:22:27 (31.0) (17003): ERROR "Error from starter on compute-0-1.local:
Failed to execute '/disk/local/NAMD/NAMD_2.5_Source/Linux-i686-MPI/test/test.out
condor_exec.exe': No such file or directory" at line 659 in file pseudo_ops.C
10/1 17:22:27 (31.0) (17003): Unable to log ULOG_SHADOW_EXCEPTION event

MasterLog:
9/30 17:17:13 Can't send UPDATE_MASTER_AD to collector frontend-0.local
<10.1.1.1:9618>: Failed to send UDP update command to collector
9/30 17:22:13 Can't connect to <10.1.1.1:9618>:0, errno = 111
9/30 17:22:13 Will keep trying for 10 seconds...
9/30 17:22:23 Connect failed for 10 seconds; returning FALSE
9/30 17:22:23 ERROR:
SECMAN:2003:TCP connection to <10.1.1.1:9618> failed

SchedLog:
10/1 17:19:54 Sent ad to central manager for selvan@local
10/1 17:21:34 DaemonCore: Command received via TCP from host <10.1.1.1:53637>
10/1 17:21:34 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler
(actOnJobs)
10/1 17:21:34 UserLog::initialize:
open("/home/condor/spool/cluster30.proc0.subproc0/test.log") failed - errno 13
(Permission denied)
10/1 17:21:34 WARNING: Invalid user log file specified:
/home/condor/spool/cluster30.proc0.subproc0/test.log

In the second case, when using condor_submit -n, the spool directory cannot
be accessed and the job is refused.

ShadowLog:
10/1 17:19:51 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/1 17:19:51 ** $CondorVersion: 6.6.0 Nov 13 2003 $
10/1 17:19:51 ** $CondorPlatform: INTEL-LINUX-GLIBC23 $
10/1 17:19:51 ** PID = 16954
10/1 17:19:51 ******************************************************
10/1 17:19:51 Using config file: /opt/condor/etc/condor_config
10/1 17:19:51 Using local config files:
/opt/condor/local.frontend-0/condor_config.local
10/1 17:19:51 DaemonCore: Command Socket at <10.1.1.1:53531>
10/1 17:19:52 Initializing a VANILLA shadow
10/1 17:19:52 (30.0) (16954): UserLog::initialize:
open("/home/condor/spool/cluster30.proc0.subproc0/test.log") failed - errno 13
(Permission denied)
10/1 17:19:53 (30.0) (16954): Request to run on <10.255.255.254:32774> was REFUSED
10/1 17:19:53 (30.0) (16954): Job 30.0 is being evicted
10/1 17:19:53 (30.0) (16954): logEvictEvent with unknown reason (108),
aborting
10/1 17:19:53 (30.0) (16954): **** condor_shadow (condor_SHADOW) EXITING
WITH STATUS 108

SchedLog:
10/1 17:19:29 DaemonCore: Command received via TCP from host <10.1.1.1:53505>
10/1 17:19:29 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler
(actOnJobs)
10/1 17:19:49 DaemonCore: Command received via TCP from host <10.1.1.1:53525>
10/1 17:19:49 DaemonCore: received command 480 (SPOOL_JOB_FILES), calling
handler (spoolJobFiles)
10/1 17:19:49 Job 30.0 released from hold: Data files spooled
10/1 17:19:49 DaemonCore: Command received via UDP from host <10.1.1.1:41716>
10/1 17:19:49 DaemonCore: received command 421 (RESCHEDULE), calling handler
(reschedule_negotiator)
10/1 17:19:49 Sent ad to central manager for selvan@local
10/1 17:19:49 Called reschedule_negotiator()
10/1 17:19:49 Activity on stashed negotiator socket
10/1 17:19:49 Negotiating for owner: selvan@local
10/1 17:19:49 Checking consistency running and runnable jobs
10/1 17:19:49 Tables are consistent
10/1 17:19:49 Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 0
10/1 17:19:51 Started shadow for job 30.0 on "<10.255.255.254:32774>", (shadow
pid = 16954)
10/1 17:19:53 Sent RELEASE_CLAIM to startd on <10.255.255.254:32774>
10/1 17:19:53 Match record (<10.255.255.254:32774>, 30, 0) deleted

MasterLog:
9/30 20:39:05 ******************************************************
9/30 20:39:05 ** condor_master (CONDOR_MASTER) STARTING UP
9/30 20:39:05 ** $CondorVersion: 6.6.0 Nov 13 2003 $
9/30 20:39:05 ** $CondorPlatform: INTEL-LINUX-GLIBC23 $
9/30 20:39:05 ** PID = 31349
9/30 20:39:05 ******************************************************
9/30 20:39:05 Using config file: /opt/condor/etc/condor_config
9/30 20:39:05 Using local config files:
/opt/condor/local.frontend-0/condor_config.local
9/30 20:39:05 DaemonCore: Command Socket at <10.1.1.1:36360>
9/30 20:39:05 Started DaemonCore process "/opt/condor/sbin/condor_collector",
pid and pgroup = 31350
9/30 20:39:05 Started DaemonCore process "/opt/condor/sbin/condor_negotiator",
pid and pgroup = 31351
9/30 20:39:05 Started DaemonCore process "/opt/condor/sbin/condor_schedd", pid
and pgroup = 31352
9/30 20:39:14 DaemonCore: Command received via TCP from host <10.1.1.1:36378>
9/30 20:39:14 DaemonCore: received command 455 (DAEMONS_ON), calling handler
(admin_command_handler)
9/30 20:40:24 DaemonCore: Command received via TCP from host <10.1.1.1:36499>
9/30 20:40:24 DaemonCore: received command 455 (DAEMONS_ON), calling handler
(admin_command_handler)
9/30 21:39:05 Preen pid is 32186
9/30 21:39:15 Child 32186 died, but not a daemon -- Ignored

(Sorry for the differing timestamps, but the events shown correspond to each
other.)

 
Could anyone give me a clue about what the problem might be, or at least
some directions on how and where to start debugging?

Also, all machines are configured locally; no AFS is being used.
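
In case it helps anyone reading the logs, the errno values in them can be
decoded with a short script. This is only a sketch: the helper name
explain_errnos and the sample text are made up for illustration, and the
number-to-name mapping comes from the operating system of the machine running
the script (Linux is assumed here), not from Condor itself.

```python
import errno
import os
import re

# Matches both spellings seen in the logs: "errno = 111" and "errno 13".
ERRNO_RE = re.compile(r"errno\s*=?\s*(\d+)")


def explain_errnos(log_text):
    """Return sorted (number, symbolic name, message) tuples for errnos in log_text.

    Hypothetical helper, not part of Condor. Names and messages are looked up
    on the local platform, so they are only meaningful if that platform
    matches the one that wrote the log.
    """
    found = {int(m.group(1)) for m in ERRNO_RE.finditer(log_text)}
    return [
        (n, errno.errorcode.get(n, "UNKNOWN"), os.strerror(n))
        for n in sorted(found)
    ]


# Two lines copied from the MasterLog and SchedLog excerpts in this message.
sample = """\
9/30 17:22:13 Can't connect to <10.1.1.1:9618>:0, errno = 111
open("/home/condor/spool/cluster30.proc0.subproc0/test.log") failed - errno 13
"""

for number, name, message in explain_errnos(sample):
    print(number, name, message)
```

Running it on the same kind of machine that produced the logs gives the
symbolic name for each code, which makes it easier to search for the cause.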

thank you

regards

martin lukac