
Re: [condor-users] globus -> condor matching problem



G'day chaps,

First of all, thanks to all who are trying to help; it's much
appreciated.

> >
> > Certainly you should specify:
> >
> >       should_transfer_files = YES
> >       when_to_transfer_output = ON_EXIT
> >
> > If you need to transfer more than the default files though, I'm not sure
> > what to specify, or how to get the list of files from the jobmanager.
> 
> The jobmanager doesn't tell the scheduler script (condor.pm) what files
> the client (Condor-G) requested to be staged in and out. There's also no
> way for the client to specify a list of files that need to be present on
> the execute machine. This isn't much of a problem for output files, as
> Condor can be told to transfer all output files from the job (those
> created or modified after the job began running).
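For concreteness, here's a sketch of the kind of Condor-G submit
description I'm using with those two lines added (the executable,
gatekeeper address, and file names below are placeholders, not my real
paths):

        universe                = globus
        globusscheduler         = gatekeeper.example.ac.uk/jobmanager-condor
        executable              = hello.sh
        output                  = hello.out
        error                   = hello.err
        log                     = hello.log
        should_transfer_files   = YES
        when_to_transfer_output = ON_EXIT
        queue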

OK, I've added the two lines that Alain recommended and also forced all
jobs to be run as vanilla ones, and now the jobs *apparently* run
happily on the remote execution machine (i.e. IRIX gatekeeper handing
off to Linux execution node), and the log file on the Condor-G
submitting node indicates "Normal termination (return value 0)".
However, the contents of the redirected stdout are not returned to this 
machine (recall that this program simply echoes "hello world"), and the
initially touched file remains empty. A quick look at the ShadowLog on
the gatekeeper gives:

10/30 09:08:49 ******************************************************
10/30 09:08:49 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/30 09:08:49 ** $CondorVersion: 6.4.5 Nov 10 2002 $
10/30 09:08:49 ** $CondorPlatform: SGI-IRIX65 $
10/30 09:08:49 ** PID = 12598
10/30 09:08:49 ******************************************************
10/30 09:08:49 DaemonCore: Command Socket at <131.111.41.187:4661>
10/30 09:08:50 Initializing a VANILLA shadow
10/30 09:08:50 (274.0) (12598): Request to run on <131.111.44.138:33963> was ACCEPTED
10/30 09:08:51 (274.0) (12598): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100

Now, this exit code apparently maps to:

100  JOB_EXITED  The job exited (not killed)

Does this include normal termination? Or does it refer to an abnormal
exit? 

Anyway, a look at the StartLog on the execution machine (i.e. the linux
host the job actually runs on) gives:

10/30 09:09:26 DaemonCore: Command received via UDP from host
<131.111.41.187:3647>
10/30 09:09:26 DaemonCore: received command 440 (MATCH_INFO), calling
handler (command_match_info)
10/30 09:09:26 match_info called                                    
10/30 09:09:26 Received match <131.111.44.138:33963>#1229371606
10/30 09:09:26 State change: match notification protocol successful 
10/30 09:09:26 Changing state: Unclaimed -> Matched
10/30 09:09:26 DaemonCore: Command received via TCP from host
<131.111.41.187:4660>
10/30 09:09:26 DaemonCore: received command 442 (REQUEST_CLAIM), calling
handler (command_request_claim)
10/30 09:09:26 Request accepted.                                    
10/30 09:09:26 Remote owner is mcal00@xxxxxxxxxxxxxxxxxxxx
10/30 09:09:26 State change: claiming protocol successful           
10/30 09:09:26 Changing state: Matched -> Claimed
10/30 09:09:29 DaemonCore: Command received via TCP from host
<131.111.41.187:4663>
10/30 09:09:29 DaemonCore: received command 444 (ACTIVATE_CLAIM),
calling handler (command_activate_claim)
10/30 09:09:29 Got activate_claim request from shadow
(<131.111.41.187:4663>)
10/30 09:09:29 Remote job ID is 274.0                               
10/30 09:09:29 Got universe (5) from request classad
10/30 09:09:29 Startd using *_VANILLA control expressions.          
10/30 09:09:29 State change: claim-activation protocol successful
10/30 09:09:29 Changing activity: Idle -> Busy                      
10/30 09:09:30 DaemonCore: Command received via TCP from host <131.111.41.187:4664>
10/30 09:09:30 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
10/30 09:09:30 Called deactivate_claim_forcibly()
10/30 09:09:30 Starter pid 15884 exited with status 0
10/30 09:09:30 State change: starter exited
10/30 09:09:30 Changing activity: Busy -> Idle
10/30 09:09:31 DaemonCore: Command received via UDP from host <131.111.41.187:3653>
10/30 09:09:31 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
10/30 09:09:31 State change: received RELEASE_CLAIM command
10/30 09:09:31 Changing state and activity: Claimed/Idle ->
Preempting/Vacating
10/30 09:09:31 State change: No preempting match, returning to owner
10/30 09:09:31 Changing state and activity: Preempting/Vacating ->
Owner/Idle
10/30 09:09:31 State change: IS_OWNER is false
10/30 09:09:31 Changing state: Owner -> Unclaimed
10/30 09:09:31 DaemonCore: Command received via UDP from host
<131.111.41.187:3654>
10/30 09:09:31 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_handler)
10/30 09:09:31 Error: can't find resource with capability
(<131.111.44.138:33963>#1229371606)

OK, there's a lot there, but the main line I pick up on is the Starter
exiting with status 0, which suggests that the job ran OK, right?

So my question now becomes: is there an obvious reason you chaps can
spot for my redirected stdout not being returned to the Condor-G
submitting machine?
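One thing I'm tempted to try is naming the output files explicitly in
the submit description, rather than relying on the "created or modified
after the job started" default, e.g. (the file name here is just a
placeholder):

        transfer_output_files = hello.out

If the explicitly listed file comes back but the redirected stdout still
doesn't, that would at least narrow the problem down to how stdout is
being staged rather than the transfer mechanism itself.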

Many thanks,

Mark

Condor Support Information:
http://www.cs.wisc.edu/condor/condor-support/
To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
unsubscribe condor-users <your_email_address>