
Re: [Condor-users] Trouble running jobs in the parallel universe



Can anyone reproduce this? At this point, I can't tell whether it's a configuration issue or a legitimate bug.

thanks

Mark Visser wrote:
I have configured a dedicated scheduler as explained in the manual (http://www.cs.wisc.edu/condor/manual/v7.4/2_9Parallel_Applications.html).
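For reference, the startd-side settings from that section of the manual look roughly like this (the hostname is a placeholder, not my real host; my actual config follows the same pattern):

# Designate the dedicated scheduler on each execute node
# ("mysched.example.com" is a placeholder hostname).
DedicatedScheduler = "DedicatedScheduler@mysched.example.com"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
# Prefer jobs coming from the dedicated scheduler.
RANK = Scheduler =?= $(DedicatedScheduler)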

I can submit the following job to the remote scheduler with "condor_submit -r <mysched> test.submit":
universe = parallel
executable = /bin/env
machine_count = 4
output = output.$(Node).txt
queue
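(With machine_count = 4, the $(Node) macro expands to 0 through 3, one value per node, so I expect four files: output.0.txt through output.3.txt.)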

The job appears in the remote scheduler's queue, is matched (machines appear in the RemoteHosts attribute of the job ClassAd), begins to run, then immediately goes on hold.

According to condor_q -analyze:
> ...
> 026.000:  Request is held.
>
> Hold reason: Error from slot2@xxxxxxxxxxxxxxxxxxxxxxxxxx: Failed to open '/var/spool/condor/spool/cluster26.proc0.subproc0/output.0.txt' as standard output: No such file or directory (errno 2)

On rm1li025120, /var/spool/condor/spool/ is empty.
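For anyone trying to reproduce: the spool location and the sandbox directory the starter complained about can be checked with something like this (a sketch; run on the execute host, rm1li025120):

# Ask Condor where SPOOL points on this host
condor_config_val SPOOL
# Inspect the directory named in the hold reason
ls -ld /var/spool/condor/spool
ls -l /var/spool/condor/spool/cluster26.proc0.subproc0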

Here's the relevant part of the startd log from that host:
> 02/08 13:57:07 slot2: Got activate_claim request from shadow (<192.168.100.85:34979>)
> 02/08 13:57:07 slot2: Remote job ID is 26.0
> 02/08 13:57:07 slot2: Got universe "PARALLEL" (11) from request classad
> 02/08 13:57:07 slot2: State change: claim-activation protocol successful
> 02/08 13:57:07 slot2: Changing activity: Idle -> Busy
> 02/08 13:57:07 slot2: Called deactivate_claim_forcibly()
> 02/08 13:57:07 Starter pid 10463 exited with status 0
> 02/08 13:57:07 slot2: State change: starter exited
> 02/08 13:57:07 slot2: Changing activity: Busy -> Idle
> 02/08 13:57:07 condor_write(): Socket closed when trying to write 56 bytes to <192.168.100.85:38432>, fd is 6
> 02/08 13:57:07 Buf::write(): condor_write() failed
> 02/08 13:57:07 slot2: Called deactivate_claim()

And the starter log:
> 02/08 13:57:07 ******************************************************
> 02/08 13:57:07 ** condor_starter (CONDOR_STARTER) STARTING UP
> 02/08 13:57:07 ** /dfs1/net/studio/noarch/free/condor/condor-7.4.1/sbin/condor_starter
> 02/08 13:57:07 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
> 02/08 13:57:07 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
> 02/08 13:57:07 ** $CondorVersion: 7.4.1 Dec 17 2009 BuildID: 204351 $
> 02/08 13:57:07 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
> 02/08 13:57:07 ** PID = 10463
> 02/08 13:57:07 ** Log last touched 2/8 13:55:48
> 02/08 13:57:07 ******************************************************
> 02/08 13:57:07 Using config source: /home/condor/condor_config
> 02/08 13:57:07 Using local config sources:
> 02/08 13:57:07    /home/condor/config/condor_config.local.rm1li025120
> 02/08 13:57:07 DaemonCore: Command Socket at <192.168.25.120:36203>
> 02/08 13:57:07 Done setting resource limits
> 02/08 13:57:07 Communicating with shadow <192.168.100.85:56604>
> 02/08 13:57:07 Submitting machine is "netrender.lumierevfx.com"
> 02/08 13:57:07 setting the orig job name in starter
> 02/08 13:57:07 setting the orig job iwd in starter
> 02/08 13:57:07 Job has WantIOProxy=true
> 02/08 13:57:07 Initialized IO Proxy.
> 02/08 13:57:07 Job 26.0 set to execute immediately
> 02/08 13:57:07 Starting a PARALLEL universe job with ID: 26.0
> 02/08 13:57:07 IWD: /var/spool/condor/spool/cluster26.proc0.subproc0
> 02/08 13:57:07 Failed to open '/var/spool/condor/spool/cluster26.proc0.subproc0/output.0.txt' as standard output: No such file or directory (errno 2)
> 02/08 13:57:07 Failed to open some/all of the std files...
> 02/08 13:57:07 Aborting OsProc::StartJob.
> 02/08 13:57:07 Failed to start job, exiting
> 02/08 13:57:07 ShutdownFast all jobs.
> 02/08 13:57:07 **** condor_starter (condor_STARTER) pid 10463 EXITING WITH STATUS 0

If I comment out the "output = output.$(Node).txt" line, the job still ends up held, but this time with this error:

> Hold reason: Error from slot1@xxxxxxxxxxxxxxxxxxxxxxxxxx: Failed to execute '/var/spool/condor/spool/cluster27.proc0.subproc0/env': No such file or directory

Adding "copy_to_spool = False" to the submit description makes no difference.
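The next thing I plan to try is enabling the file transfer mechanism explicitly, in case the spooled sandbox only gets materialized on the execute nodes when transfer is turned on. A sketch of that variant (untested so far):

universe = parallel
executable = /bin/env
machine_count = 4
output = output.$(Node).txt
# Explicitly enable Condor's file transfer so the job's sandbox is
# created on each execute node instead of being read from the spool path.
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
queue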

As far as I can tell from the documentation, my submit description should work... any ideas?

thanks



--
Mark Visser, Software Director
Lumière VFX
Email: markv@xxxxxxxxxxxxxx
Phone: +1-514-316-1080 x3030