
[Condor-users] No vanilla condor_shadow



I am having trouble submitting vanilla jobs to a personal Condor pool
that has a few startds connected to it.  I am currently running 6.7.18.
The job eventually gets held with this error:
HoldReason = "No condor_shadow installed that supports vanilla jobs on V6.3.3 or newer resources"

The submit description file for the job is:
universe = vanilla
executable = pi-compute
arguments = 5000000
output = out.$(Process)
log = log.$(Process)
queue 1

The same program runs successfully when I relink it with condor_compile
and submit it under the standard universe.  However, this pool is meant
to be used through Condor-C, so being limited to the standard universe
is not acceptable.

The command 'condor_config_val -name rgarver@ STARTER' returns the
path I expect:
/cs/sandbox/student/rgarver/condor_install/sbin/condor_starter

That sbin directory also contains condor_starter.pvm and
condor_starter.std alongside condor_starter.  The pool is started
through dynamic deployment, so I want to confirm that I have all of the
binaries I actually need.  This is what is installed:
/cs/sandbox/student/rgarver/condor_install/bin:
condor_config_val   condor_q            condor_rm
condor_run          condor_status       condor_submit

/cs/sandbox/student/rgarver/condor_install/sbin:
condor_advertise    condor_c-gahp       condor_c-gahp_worker_thread
condor_collector    condor_gridmanager  condor_master
condor_negotiator   condor_off          condor_preen
condor_schedd       condor_shadow       condor_shadow.pvm
condor_shadow.std   condor_startd       condor_starter
condor_starter.pvm  condor_starter.std  gahp_server
gt3_gahp            gt4_gahp
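An inventory like the one above could also be checked mechanically.
The following is only a minimal sketch: the function name
missing_binaries and the EXPECTED list are hypothetical, and treating
those six names (taken from the sbin listing) as the required set is an
assumption, not a rule stated by Condor.

```python
import os

# Names copied from the sbin listing; assuming these six cover the
# vanilla, standard, and PVM shadow/starter pairs is a guess.
EXPECTED = [
    "condor_shadow", "condor_shadow.std", "condor_shadow.pvm",
    "condor_starter", "condor_starter.std", "condor_starter.pvm",
]


def missing_binaries(sbin_dir, expected=EXPECTED):
    """Return the expected names that are absent or not executable."""
    missing = []
    for name in expected:
        path = os.path.join(sbin_dir, name)
        # A usable binary must be a regular file with execute permission.
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(name)
    return missing
```

Calling missing_binaries("/cs/sandbox/student/rgarver/condor_install/sbin")
on the submit machine should return an empty list if every file in the
listing is present and executable.  It only checks that the files
exist; it says nothing about which universes each binary supports.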

Let me know if any other information might be useful.  Thanks.

SchedLog
5/4 09:37:44 Trying to run a VANILLA job on a 6.3.3 or later resource, but you do not have condor_shadow that will work, aborting.
5/4 09:37:44 Job 2.0 put on hold: No condor_shadow installed that supports vanilla jobs on V6.3.3 or newer resources
5/4 09:37:44 match (<128.111.43.200:41928>#1146758835#4) out of jobs (cluster id 2); relinquishing
5/4 09:37:44 Sent RELEASE_CLAIM to startd on <128.111.43.200:41928>
5/4 09:37:44 Match record (<128.111.43.200:41928>, 2, -1) deleted
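
For a longer SchedLog, the hold events can be pulled out with a short
script.  This is a sketch only: held_jobs and HOLD_RE are hypothetical
names, and the pattern is modeled solely on the "put on hold" line
quoted above, so the format in other Condor versions is an assumption.
It expects each log entry on a single line, as in the log file itself
rather than as re-wrapped by a mail client.

```python
import re

# Modeled on: "5/4 09:37:44 Job 2.0 put on hold: <reason>"
HOLD_RE = re.compile(r"Job (\d+)\.(\d+) put on hold: (.*)")


def held_jobs(log_text):
    """Return (cluster, proc, reason) for each hold line in log_text."""
    results = []
    for line in log_text.splitlines():
        match = HOLD_RE.search(line)
        if match:
            cluster, proc, reason = match.groups()
            results.append((int(cluster), int(proc), reason.strip()))
    return results
```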

StartLog
5/4 09:37:40 (fd:6) (pid:11833) vm1: Rank of this claim is: 0.000000
5/4 09:37:40 (fd:6) (pid:11833) vm1: Request accepted.
5/4 09:37:40 (fd:6) (pid:11833) Trying to find full hostname for "dizzy"
5/4 09:37:40 (fd:6) (pid:11833) Calling gethostbyname(dizzy)
5/4 09:37:40 (fd:6) (pid:11833) Trying to find full hostname from hostent
5/4 09:37:40 (fd:6) (pid:11833) Main name in hostent "dizzy" contains no '.', checking aliases
5/4 09:37:40 (fd:6) (pid:11833) No host alias is fully qualified, looking for DEFAULT_DOMAIN_NAME
5/4 09:37:40 (fd:6) (pid:11833) DEFAULT_DOMAIN_NAME not defined
5/4 09:37:40 (fd:6) (pid:11833) Failed to find full hostname for "dizzy", returning "dizzy"
5/4 09:37:40 (fd:6) (pid:11833) vm1: Remote owner is rgarver@dizzy
5/4 09:37:40 (fd:6) (pid:11833) CLOSE <128.111.43.200:41928> fd=5
5/4 09:37:40 (fd:5) (pid:11833) vm1: State change: claiming protocol successful
5/4 09:37:40 (fd:5) (pid:11833) vm1: Changing state: Matched -> Claimed
5/4 09:37:40 (fd:5) (pid:11833) vm1: Started ClaimLease timer (26) w/ 1800 second lease duration
5/4 09:37:40 (fd:5) (pid:11833) STARTD_SHOULD_WRITE_CLAIM_ID_FILE is undefined, using default value of True
5/4 09:37:44 (fd:5) (pid:11833) RECV 44 bytes at <128.111.43.200:41928> from <128.111.43.200:42268>
5/4 09:37:44 (fd:5) (pid:11833)         Full msg [44 bytes]
5/4 09:37:44 (fd:5) (pid:11833) DC_AUTHENTICATE: received UDP packet from <128.111.43.200:42268>.
5/4 09:37:44 (fd:5) (pid:11833) DaemonCore received UNAUTHENTICATED command 443.
5/4 09:37:44 (fd:5) (pid:11833) DaemonCore: Command received via UDP from host <128.111.43.200:42268>
5/4 09:37:44 (fd:5) (pid:11833) DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
5/4 09:37:44 (fd:5) (pid:11833) vm1: State change: received RELEASE_CLAIM command
5/4 09:37:44 (fd:5) (pid:11833) In cancel_timer(), id=26
5/4 09:37:44 (fd:5) (pid:11833) vm1: Canceled ClaimLease timer (26)
5/4 09:37:44 (fd:5) (pid:11833) vm1: Changing state and activity: Claimed/Idle -> Preempting/Vacating
5/4 09:37:44 (fd:5) (pid:11833) Entered vacate_client <128.111.43.200:54324> dizzy...
5/4 09:37:44 (fd:5) (pid:11833) STARTD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
5/4 09:37:44 (fd:5) (pid:11833) New Daemon obj (schedd) name: "NULL", pool: "NULL", addr: "<128.111.43.200:54324>"
5/4 09:37:44 (fd:6) (pid:11833) PRIV_CONDOR --> PRIV_ROOT at sock.C:506
5/4 09:37:44 (fd:6) (pid:11833) PRIV_ROOT --> PRIV_CONDOR at sock.C:512
5/4 09:37:44 (fd:6) (pid:11833) CONNECT src=<128.111.43.200:35453> fd=5 dst=<128.111.43.200:54324>
5/4 09:37:44 (fd:6) (pid:11833) STARTCOMMAND: starting 443 to <128.111.43.200:54324> on TCP port 35453.

-- 
Ryan Garver
<rgarver@xxxxxxxxxxx>