
[Condor-users] condor pvm



I cannot get Condor-PVM to work with the 32-bit Condor distribution, and I
find it is not included in the 64-bit distribution at all. We use:

condor-6.8.4-linux-x86-glibc23-dynamic.tar.gz on AMD Athlon
condor-6.8.4-linux-x86_64-rhel3-dynamic.tar.gz on Xeon

I installed PVM both out of the box (3.4.5) and as the Condor-modified
version from the web site (condor-dist-pvm3.4.1.tar.gz), and I tried the
hello and master1 programs shipped with pvm3 as well as the master_sum
program from the Condor distribution. All of them work directly when I
start pvmd on the local node, and all of them fail under Condor in the same
way that has been reported in the email archive over the last two years,
i.e. an exit status of 110, empty output, and ShadowLog and execute-node
log entries along the lines of those already posted (please see below).
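For reference, the kind of test program involved is just the standard PVM
enrol/work/exit pattern; a rough sketch (not the exact hello or master_sum
source that ships with pvm3/Condor) looks like this:

/* minimal PVM test sketch: enrol, do trivial work on stdin/stdout, exit */
#include <stdio.h>
#include "pvm3.h"

int main(void)
{
    int mytid, n, sum = 0;

    mytid = pvm_mytid();            /* enrol with the local pvmd */
    if (mytid < 0) {
        pvm_perror("pvm_mytid");    /* no pvmd reachable */
        return 1;
    }

    /* master_sum-style work: read integers on stdin, print the total */
    while (scanf("%d", &n) == 1)
        sum += n;
    printf("tid t%x, sum = %d\n", mytid, sum);

    pvm_exit();                     /* leave PVM cleanly */
    return 0;
}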

I have tried making the pvm distribution available on all nodes and adding
transfer_input_files lines to the example submit file; neither had any
effect on the result.

The nodes I am testing on are dual-CPU; in this case I am only requesting
one vm. The parallel universe (MPICH and LAM MPI) is working fine. I built
pvm to use ssh.
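
For what it is worth, the direct (non-Condor) runs that do work are done
roughly like this on the local node (paths here are illustrative, not our
exact install locations):

export PVM_ROOT=/usr/local/pvm3            # wherever pvm3 is installed
export PVM_RSH=/usr/bin/ssh                # use ssh rather than rsh
$PVM_ROOT/lib/pvmd &                       # start the local pvm daemon by hand
./master_sum < in_sum > out_sum 2> err_sum
echo halt | $PVM_ROOT/lib/pvm              # shut the daemon down again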

My latest attempt is this submit file:

universe = PVM
executable = master_sum
input = in_sum
output = out_sum
error = err_sum
log = log_sum
machine_count = 1..1
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = master_sum, in_sum
Requirements = (Arch == "INTEL") && (OpSys == "LINUX")
queue
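
(machine_count = 1..1 is the PVM-universe min..max form, i.e. I am asking
for exactly one host.)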


and the ShadowLog says

2/19 13:24:07 (?.?) (14583):First Line: 2676 0 1
2/19 13:24:07 (2676.0) (14583):Created class:
2/19 13:24:07 (2676.0) (14583):#0: 0 (1, 1) has 0
2/19 13:24:07 (2676.0) (14583):New process for proc 0
2/19 13:24:07 (2676.0) (14583):AllocProc() returning 0
2/19 13:24:07 (2676.0) (14583):Machine from schedd: <172.24.89.73:35809> <172.24.89.73:35809>#1170767595#7441 0
2/19 13:24:07 (2676.0) (14583):Machine Line: vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 0
2/19 13:24:07 (2676.0) (14583):Machines now cur = 1 desire = 1
2/19 13:24:07 (2676.0) (14583):Updated class:
2/19 13:24:07 (2676.0) (14583):#0: 0 (1, 1) has 1
2/19 13:24:07 (2676.0) (14583):Starting pvmd: /usr/local/condor/sbin/condor_pvmd -d0x11c
2/19 13:24:07 (2676.0) (14583):PVM is pid 14584
2/19 13:24:07 (2676.0) (14583):pvmd response: /tmp/filenX1mPQ
2/19 13:24:07 (2676.0) (14583):PVMSOCK=/tmp/filenX1mPQ
2/19 13:24:07 (2676.0) (14583):pvm_fd = 4, mytid = t40001
2/19 13:24:07 (2676.0) (14583):Entered StartWaitingHosts()
2/19 13:24:07 (2676.0) (14583):Ok to start waiting hosts
2/19 13:24:07 (2676.0) (14583):PVMd message is SM_STHOST from t80000000
2/19 13:24:07 (2676.0) (14583):SM_STHOST: 80000 "" "172.24.89.73" "$PVM_ROOT/lib/pvmd -s -d0x11c -nvm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 1 836f2422:836e 4080 2 ac185949:0000"
2/19 13:24:07 (2676.0) (14583):New process for proc 0
2/19 13:24:07 (2676.0) (14583):AllocProc() returning 0
2/19 13:24:07 (2676.0) (14583):Shadow: Entering multi_send_job(vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
2/19 13:24:07 (2676.0) (14583):Requesting Alternate Starter 1
2/19 13:24:07 (2676.0) (14583):Shadow: Request to run a job was ACCEPTED
2/19 13:24:07 (2676.0) (14583):Shadow: RSC_SOCK connected, fd = 6
2/19 13:24:07 (2676.0) (14583):Multi_Shadow: CLIENT_LOG connected, fd = 7
2/19 13:24:07 (2676.0) (14583):in new_timer()
2/19 13:24:07 (2676.0) (14583):Timer List
2/19 13:24:07 (2676.0) (14583):^^^^^ ^^^^
2/19 13:24:07 (2676.0) (14583):id = 0, when = 180
2/19 13:24:07 (2676.0) (14583):Shadow: send_pvm_job_info
2/19 13:24:07 (2676.0) (14583):send_pvm_job_info: arg = -s -d0x11c -nvm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 1 836f2422:836e 4080 2 ac185949:0000 -f
2/19 13:24:07 (2676.0) (14583):On LogSock for host vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx:
2/19 13:24:07 (2676.0) (14583):In cancel_timer()
2/19 13:24:07 (2676.0) (14583):Timer List
2/19 13:24:07 (2676.0) (14583):^^^^^ ^^^^
2/19 13:24:07 (2676.0) (14583):Received PVM info from vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
2/19 13:24:07 (2676.0) (14583):Adding host vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx to STARTACK msg.
2/19 13:24:07 (2676.0) (14583):Num Hosts to pack = 1
2/19 13:24:07 (2676.0) (14583):Packing tid t80000 with reply ddpro<2316> arch<LINUX> ip<ac185949:848f> mtu<4080> dsig<4229185>
2/19 13:24:07 (2676.0) (14583):Sending SM_STHOSTACK to PVMd
2/19 13:24:07 (2676.0) (14583):PVMd message is SM_ADDACK from t80000000
2/19 13:24:07 (2676.0) (14583):Host #0(vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx) has been added to PVM, pvmd_tid = 80080000
2/19 13:24:07 (2676.0) (14583):SendNotification(kind = 3, tid = t80080000)
2/19 13:24:07 (2676.0) (14583):pvm_machines_starting = 0(should be 0)
2/19 13:24:07 (2676.0) (14583):StartLocalProcess: = /home/jcjb/pvm-test/master_sum < in_sum > out_sum >& err_sum
2/19 13:24:07 (2676.0) (14583):open_max = 1024
2/19 13:24:07 (2676.0) (14583):Local PVM process pid = 14585
2/19 13:24:07 (2676.0) (14583):Entered StartWaitingHosts()
2/19 13:24:07 (2676.0) (14583):Ok to start waiting hosts
2/19 13:24:07 (2676.0) (14583):deadpid = 14585
2/19 13:24:07 (2676.0) (14583):Local process for job 2676.0 died with status 0x6e00
2/19 13:24:07 (2676.0) (14583):SendNotification(kind = 1, tid = t0)
2/19 13:24:07 (2676.0) (14583):Multi_Shadow: Shutting down...
2/19 13:24:07 (2676.0) (14583):Updated class:
2/19 13:24:07 (2676.0) (14583):#0: 0 (1, 1) has 1
2/19 13:24:07 (2676.0) (14583):signal_startd( vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, 443 )
2/19 13:24:07 (2676.0) (14583):in new_timer()
2/19 13:24:07 (2676.0) (14583):Timer List
2/19 13:24:07 (2676.0) (14583):^^^^^ ^^^^
2/19 13:24:07 (2676.0) (14583):id = 1, when = 300
2/19 13:24:07 (2676.0) (14583):deadpid = 14584
2/19 13:24:07 (2676.0) (14583):Lost local pvmd termsig = 9, retcode = 0
2/19 13:24:07 (2676.0) (14583):deadpid = -1
2/19 13:24:07 (2676.0) (14583):No more dead processes(errno = 10)
2/19 13:24:07 (2676.0) (14583):deadpid = -1
2/19 13:24:07 (2676.0) (14583):No more dead processes(errno = 10)
2/19 13:24:07 (2676.0) (14583):Subproc 32767 exited, termsig = 0, coredump = 0, retcode = 15
2/19 13:24:07 (2676.0) (14583):ru_utime = 0.000000
2/19 13:24:07 (2676.0) (14583):ru_stime = 0.000000
2/19 13:24:07 (2676.0) (14583):Job 0, process 0 logsock appears to be closed, removing..
2/19 13:24:07 (2676.0) (14583):Removing Proc 0(t0) from Job
2/19 13:24:07 (2676.0) (14583):remove starter for host vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, removing the host too
2/19 13:24:07 (2676.0) (14583):RemoveHost: Sending HostDelete notify on t80080000
2/19 13:24:07 (2676.0) (14583):SendNotification(kind = 2, tid = t80080000)
2/19 13:24:07 (2676.0) (14583):signal_startd( vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, 443 )
2/19 13:24:07 (2676.0) (14583):Updated class:
2/19 13:24:07 (2676.0) (14583):#0: 0 (1, 1) has 0
2/19 13:24:07 (2676.0) (14583):Trying to unlink /home/condor/spool/cluster2676.proc0.subproc0
2/19 13:24:07 (2676.0) (14583):All processes completed, job should be deleted
2/19 13:24:07 (2676.0) (14583):MultiShadow Exiting!!!
2/19 13:24:07 (2676.0) (14583):user_time = 4 ticks
2/19 13:24:07 (2676.0) (14583):sys_time = 2 ticks
2/19 13:24:07 (2676.0) (14583):Entering multi_update_job_status()
2/19 13:24:07 (2676.0) (14583):Shadow: marked job status "COMPLETED"
2/19 13:24:07 (2676.0) (14583):multi_update_job_status() returns 0
2/19 13:24:07 (2676.0) (14583):Shadow: Job exited normally with status 110
2/19 13:24:07 (2676.0) (14583):Notification = "
2/19 13:24:07 (2676.0) (14583):********** Shadow Parent Exiting(100) **********


The StarterLog.vm1 on the execute node says

2/19 13:24:07 ********** STARTER starting up ***********
2/19 13:24:07 ** $CondorVersion: 6.8.4 Feb  1 2007 $
2/19 13:24:07 ** $CondorPlatform: I386-LINUX_RH9 $
2/19 13:24:07 ******************************************
2/19 13:24:07 Submitting machine is "talpa--bio.grid.private.cam.ac.uk"
2/19 13:24:07 EventHandler {
2/19 13:24:07   func = 0x8102a9a
2/19 13:24:07   mask = SIGALRM SIGHUP SIGINT SIGTERM SIGUSR1 SIGUSR2 SIGCHLD SIGTSTP
2/19 13:24:07 }
2/19 13:24:07 Done setting resource limits
2/19 13:24:07   *FSM* Transitioning to state "GET_PROC"
2/19 13:24:07   *FSM* Executing state func "get_proc()" [  ]
2/19 13:24:07 Entering get_proc()
2/19 13:24:07 In get_job_info(): About to request job!
2/19 13:24:07 In get_job_info(): got job 32767
2/19 13:24:07 User uid set to 65534
2/19 13:24:07 User gid set to 65533
2/19 13:24:07 This platform doesn't implement checkpointing yet
2/19 13:24:07 User Process -1.-1 {
2/19 13:24:07   in = /dev/null
2/19 13:24:07   out = /dev/null
2/19 13:24:07   err = /dev/null
2/19 13:24:07   rootdir = /
2/19 13:24:07   cmd = $(PVMD)
2/19 13:24:07   args = -s -d0x11c -nvm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 1 836f2422:836e 4080 2 ac185949:0000 -f
2/19 13:24:07   env =
2/19 13:24:07   orig_ckpt = weasel--bio:$(PVMD)
2/19 13:24:07   target_ckpt =
2/19 13:24:07   local_dir = dir_16529
2/19 13:24:07   cur_ckpt = dir_16529/condor_exec.-1.-1
2/19 13:24:07   tmp_ckpt = dir_16529/condor_exec.-1.-1.tmp
2/19 13:24:07   core_name = dir_16529/core
2/19 13:24:07   uid = 65534, gid = 65533
2/19 13:24:07   v_pid = 32767
2/19 13:24:07   pid = (NOT CURRENTLY EXECUTING)
2/19 13:24:07   exit_status_valid = FALSE
2/19 13:24:07   exit_status = (NEVER BEEN EXECUTED)
2/19 13:24:07   ckpt_wanted = FALSE
2/19 13:24:07   coredump_limit_exists = FALSE
2/19 13:24:07   soft_kill_sig = 15
2/19 13:24:07   job_class = PVMD
2/19 13:24:07   state = NEW
2/19 13:24:07   new_ckpt_created = FALSE
2/19 13:24:07   ckpt_transferred = FALSE
2/19 13:24:07   core_created = FALSE
2/19 13:24:07   core_transferred = FALSE
2/19 13:24:07   exit_requested = FALSE
2/19 13:24:07   image_size = 10000 blocks
2/19 13:24:07   user_time = 0
2/19 13:24:07   sys_time = 0
2/19 13:24:07   guaranteed_user_time = 0
2/19 13:24:07   guaranteed_sys_time = 0
2/19 13:24:07 }
2/19 13:24:07   *FSM* Executing transition function "init_transfer()"
2/19 13:24:07   *FSM* Transitioning to state "GET_EXEC"
2/19 13:24:07   *FSM* Executing state func "get_exec()" [ SUSPEND VACATE DIE  ]
2/19 13:24:07 Entering get_exec()
2/19 13:24:07 Executable is located on this host
2/19 13:24:07 Expanded executable name is "/usr/local/condor/sbin/condor_pvmd"
2/19 13:24:07 Created sym link from "/usr/local/condor/sbin/condor_pvmd" to "dir_16529/condor_exec.-1.-1"
2/19 13:24:07   *FSM* Executing transition function "spawn_all"
2/19 13:24:07 Entering function spawn_all()
2/19 13:24:07 Executing proc 32767.
2/19 13:24:07 In PVMdProc::execute().
2/19 13:24:07 Calling execve( "/home/condor/execute/dir_16529/condor_exec.-1.-1", "condor_exec.-1.-1", "-s", "-d0x11c", "-nvm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "1", "836f2422:836e", "4080", "2", "ac185949:0000", "-f", 0, "PVM_ROOT=/home/condor/execute/dir_16529", "CONDOR_VM=vm1", 0 )
2/19 13:24:07 "/tmp/pvml.65534" does not exist
2/19 13:24:07 "/tmp/pvmd.65534" does not exist
2/19 13:24:07 pipes: 0 & 1
2/19 13:24:07 Started pvmd - PID = 16530
2/19 13:24:07 PVMd response: ddpro<2316> arch<LINUX> ip<ac185949:848f> mtu<4080> dsig<4229185>
2/19 13:24:07 PVMSOCK=/tmp/fileRzsbXM
2/19 13:24:07 Starter tid = t80001
2/19 13:24:07 Shadow tid = t40001
2/19 13:24:07   *FSM* Transitioning to state "SUPERVISE"
2/19 13:24:07   *FSM* Executing state func "supervise_all()" [ GET_NEW_PROC SUSPEND VACATE TERMSIG ALARM DIE CHILD_EXIT PERIODIC_CKPT  ]
2/19 13:24:07   *FSM* Got asynchronous event "VACATE"
2/19 13:24:07   *FSM* Executing transition function "req_vacate"
2/19 13:24:07   *FSM* Transitioning to state "TERMINATE"
2/19 13:24:07   *FSM* Executing state func "terminate_all()" [  ]
2/19 13:24:07 terminate_all: PVMD state is EXECUTING
2/19 13:24:07 Sent signal SIGTERM to user job 16530
2/19 13:24:07   *FSM* Transitioning to state "TERMINATE_WAIT"
2/19 13:24:07   *FSM* Executing state func "asynch_wait()" [ SUSPEND ALARM DIE CHILD_EXIT  ]
2/19 13:24:07   *FSM* Got asynchronous event "CHILD_EXIT"
2/19 13:24:07   *FSM* Executing transition function "reaper"
2/19 13:24:07 Process 16530 exited with status 15
2/19 13:24:07   *FSM* Transitioning to state "TERMINATE"
2/19 13:24:07   *FSM* Executing state func "terminate_all()" [  ]
2/19 13:24:07 terminate_all: PVMD state is NORMAL_EXIT
2/19 13:24:07   *FSM* Transitioning to state "SEND_STATUS_ALL"
2/19 13:24:07   *FSM* Executing state func "dispose_all()" [  ]
2/19 13:24:07 Sending final status for process -1.-1
2/19 13:24:07 STATUS encoded as NORMAL
2/19 13:24:07 User time = 0.000000 seconds
2/19 13:24:07 System time = 0.000000 seconds
2/19 13:24:07 Unlinked "/tmp/pvml.65534"
2/19 13:24:07 "/tmp/pvmd.65534" does not exist
2/19 13:24:07 Unlinked "dir_16529/condor_exec.-1.-1"
2/19 13:24:07 "dir_16529/condor_exec.-1.-1.tmp" does not exist
2/19 13:24:07 "dir_16529/core" does not exist
2/19 13:24:07 Removed directory "dir_16529"
2/19 13:24:07   *FSM* Reached state "END"
2/19 13:24:07 ********* STARTER terminating normally **********