Re: [Condor-users] Condor PVM examples



Hi,

I don't know why, but my PVM job always appears to be dead.

The StarterLog shows that everything is OK.
This is the shadow log:

Start-up was unsuccessful on 0 machines."
11/5 12:13:26 ( 53.0) (2212):********** Shadow Parent Exiting(100) **********
11/5 12:18:09 (?.?) (2240):********** Multi_Shadow starting up **********
11/5 12:18:09 (?.?) (2240):uid=0, euid=501, gid=0, egid=501
11/5 12:18:09 (?.?) (2240):My_Filesystem_Domain = "local"
11/5 12:18:09 (?.?) (2240):My_UID_Domain = "local"
11/5 12:18:09 (?.?) (2240):Shadow reading via ASCII
11/5 12:18:09 (?.?) (2240):First Line: 56 0 1
11/5 12:18:09 (56.0) (2240):Created class:
11/5 12:18:09 ( 56.0) (2240):#0: 0 (1, 1) has 0
11/5 12:18:09 (56.0) (2240):New process for proc 0
11/5 12:18:09 (56.0) (2240):AllocProc() returning 0
11/5 12:18:09 (56.0) (2240):Machine from schedd: < 192.168.0.8:44516> <192.168.0.8:44516>#1193825611#10 0
11/5 12:18:09 (56.0) (2240):Machine Line: vm1@xxxxxxxxxxxxxxx 0
11/5 12:18:09 ( 56.0) (2240):Machines now cur = 1 desire = 1
11/5 12:18:09 (56.0) (2240):Updated class:
11/5 12:18:09 (56.0) (2240):#0: 0 (1, 1) has 1
11/5 12:18:09 (56.0) (2240):Starting pvmd: /usr/local/condor/sbin/condor_pvmd -d0x11c
11/5 12:18:09 (56.0) (2240):PVM is pid 2241
11/5 12:18:09 (56.0) (2240):pvmd response: /tmp/fileegqsdX
11/5 12:18:09 (56.0) (2240):PVMSOCK=/tmp/fileegqsdX
11/5 12:18:09 (56.0) (2240):pvm_fd = 4, mytid = t40001
11/5 12:18:09 (56.0) (2240):Entered StartWaitingHosts()
11/5 12:18:09 (56.0) (2240):Ok to start waiting hosts
11/5 12:18:09 (56.0) (2240):PVMd message is SM_STHOST from t80000000
11/5 12:18:09 (56.0) (2240):SM_STHOST: 80000 "" " 192.168.0.8" "$PVM_ROOT/lib/pvmd -s -d0x11c -nvm1@xxxxxxxxxxxxxxx 1 c0a80001:8022 4080 2 c0a80008:0000"
11/5 12:18:09 (56.0) (2240):New process for proc 0
11/5 12:18:09 (56.0) (2240):AllocProc() returning 0
11/5 12:18:09 (56.0) (2240):Shadow: Entering multi_send_job(vm1@xxxxxxxxxxxxxxx)
11/5 12:18:09 (56.0) (2240):Requesting Alternate Starter 1
11/5 12:18:09 (56.0) (2240):Shadow: Request to run a job was ACCEPTED
11/5 12:18:09 (56.0) (2240):Shadow: RSC_SOCK connected, fd = 6
11/5 12:18:09 (56.0) (2240):Multi_Shadow: CLIENT_LOG connected, fd = 7
11/5 12:18:09 ( 56.0) (2240):in new_timer()
11/5 12:18:09 (56.0) (2240):Timer List
11/5 12:18:09 (56.0) (2240):^^^^^ ^^^^
11/5 12:18:09 (56.0) (2240):id = 0, when = 180
11/5 12:18:09 (56.0) (2240):Shadow: send_pvm_job_info
11/5 12:18:09 (56.0) (2240):send_pvm_job_info: arg = -s -d0x11c -nvm1@xxxxxxxxxxxxxxx 1 c0a80001:8022 4080 2 c0a80008:0000 -f
11/5 12:18:09 (56.0) (2240):On LogSock for host vm1@xxxxxxxxxxxxxxx:
-> [pvmd pid9686] 11/05 12:17:59 version 3.4.2
11/5 12:18:09 (56.0) (2240):On LogSock for host vm1@xxxxxxxxxxxxxxx:
-> [pvmd pid9686] 11/05 12:17:59 ddpro 2316 tdpro 1318
-> [pvmd pid9686] 11/05 12:17:59 main() debug mask is 0x11c (tsk,slv,hst,sch)
11/5 12:18:09 (56.0) (2240):In cancel_timer()
11/5 12:18:09 (56.0) (2240):Timer List
11/5 12:18:09 (56.0) (2240):^^^^^ ^^^^
11/5 12:18:09 (56.0) (2240):Received PVM info from vm1@xxxxxxxxxxxxxxx
11/5 12:18:09 (56.0) (2240):Adding host vm1@xxxxxxxxxxxxxxx to STARTACK msg.
11/5 12:18:09 (56.0) (2240):Num Hosts to pack = 1
11/5 12:18:09 (56.0) (2240):Packing tid t80000 with reply ddpro<2316> arch<LINUX> ip<c0a80008:980c> mtu<4080> dsig<4229185>
11/5 12:18:09 ( 56.0) (2240):Sending SM_STHOSTACK to PVMd
11/5 12:18:09 (56.0) (2240):PVMd message is SM_ADDACK from t80000000
11/5 12:18:09 (56.0) (2240):Host #0(vm1@xxxxxxxxxxxxxxx) has been added to PVM, pvmd_tid = 80080000
11/5 12:18:09 (56.0) (2240):SendNotification(kind = 3, tid = t80080000)
11/5 12:18:09 (56.0) (2240):pvm_machines_starting = 0(should be 0)
11/5 12:18:09 (56.0) (2240):StartLocalProcess: = /usr/local/condor/examples/PVM/hello < /dev/null > /home/condor/out_sum >& /home/condor/err_sum
11/5 12:18:09 (56.0) (2240):open_max = 1024
11/5 12:18:09 (56.0) (2240):Local PVM process pid = 2242
11/5 12:18:09 (56.0) (2240):Entered StartWaitingHosts()
11/5 12:18:09 (56.0) (2240):Ok to start waiting hosts
11/5 12:18:09 (56.0) (2240):deadpid = 2242
11/5 12:18:09 (56.0) (2240):Local process for job 56.0 died with status 0x6e00
11/5 12:18:09 (56.0) (2240):SendNotification(kind = 1, tid = t0)
11/5 12:18:09 (56.0) (2240):Multi_Shadow: Shutting down...
11/5 12:18:09 (56.0) (2240):Updated class:
11/5 12:18:09 (56.0) (2240):#0: 0 (1, 1) has 1
11/5 12:18:09 (56.0) (2240):signal_startd( vm1@xxxxxxxxxxxxxxx, 443 )
11/5 12:18:09 (56.0) (2240):in new_timer()
11/5 12:18:09 (56.0) (2240):Timer List
11/5 12:18:09 (56.0) (2240):^^^^^ ^^^^
11/5 12:18:09 (56.0) (2240):id = 1, when = 300
11/5 12:18:09 ( 56.0) (2240):deadpid = 0
11/5 12:18:09 (56.0) (2240):No more dead processes(errno = 115)
11/5 12:18:09 (56.0) (2240):Subproc 32767 exited, termsig = 0, coredump = 0, retcode = 15
11/5 12:18:09 (56.0) (2240):ru_utime = 0.000000
11/5 12:18:09 (56.0) (2240):ru_stime = 0.000000
11/5 12:18:09 (56.0) (2240):Job 0, process 0 logsock appears to be closed, removing..
11/5 12:18:09 (56.0) (2240):Removing Proc 0(t0) from Job
11/5 12:18:09 ( 56.0) (2240):remove starter for host vm1@xxxxxxxxxxxxxxx, removing the host too
11/5 12:18:09 (56.0) (2240):RemoveHost: Sending HostDelete notify on t80080000
11/5 12:18:09 ( 56.0) (2240):SendNotification(kind = 2, tid = t80080000)
11/5 12:18:09 (56.0) (2240):signal_startd( vm1@xxxxxxxxxxxxxxx, 443 )
11/5 12:18:09 (56.0) (2240):Updated class:
11/5 12:18:09 ( 56.0) (2240):#0: 0 (1, 1) has 0
11/5 12:18:09 (56.0) (2240):Trying to unlink /usr/local/condor/local.ksl432-01/spool/cluster56.proc0.subproc0
11/5 12:18:09 (56.0) (2240):All processes completed, job should be deleted
11/5 12:18:09 (56.0) (2240):MultiShadow Exiting!!!
11/5 12:18:09 (56.0) (2240):user_time = 0 ticks
11/5 12:18:09 (56.0) (2240):sys_time = 1 ticks
11/5 12:18:09 (56.0) (2240):Entering multi_update_job_status()
11/5 12:18:09 (56.0) (2240):Shadow: marked job status "COMPLETED"
11/5 12:18:09 (56.0) (2240):multi_update_job_status() returns 0
11/5 12:18:09 (56.0) (2240):Shadow: Job exited normally with status 110
11/5 12:18:09 (56.0) (2240):Notification = "
exited with status 110, and touched 1 machines.
Start-up was unsuccessful on 0 machines."
11/5 12:18:09 (56.0) (2240):********** Shadow Parent Exiting(100) **********
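
A note on the two status values in the log above: "died with status 0x6e00" and "exited with status 110" look like the same number in two forms. The sketch below assumes the conventional Linux wait(2) encoding (exit code in bits 8-15, terminating signal in the low 7 bits). decode_wait_status is a hypothetical helper written for this message, not part of Condor or PVM, and it says nothing about why the program returned that code.

```shell
#!/bin/sh
# decode_wait_status: hypothetical helper, not a Condor or PVM tool.
# Assumes the conventional Linux wait(2) status layout:
#   low 7 bits = terminating signal (0 if the process exited on its own)
#   bits 8-15  = exit code passed to exit()
decode_wait_status() {
    status=$(( $1 ))
    sig=$(( status & 0x7f ))
    code=$(( (status >> 8) & 0xff ))
    if [ "$sig" -eq 0 ]; then
        echo "exit code $code"
    else
        echo "killed by signal $sig"
    fi
}

# Value taken from the "Local process for job 56.0 died" line.
decode_wait_status 0x6e00
```

Under that assumption the local hello process was not killed by a signal; it exited by itself with code 110, which matches the "Job exited normally with status 110" line.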

 

On 11/3/07, Stanislovas Gineitis <s.gineitis@xxxxxxxxx> wrote:
Hi,

I found it on their website: http://www.cs.wisc.edu/condor/pvm/.
There is an explanation there of what modifications should be made to PVM.

Use this command:

ls -l `condor_config_val PVMD`

and it will show you where condor_pvmd is.
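
If the PVMD setting is empty or points at a file that is not there, the ls command above only prints an error. A minimal sketch that makes the check explicit follows. show_pvmd is a hypothetical helper name, and the only Condor command it relies on is the condor_config_val PVMD call already shown above.

```shell
#!/bin/sh
# show_pvmd: hypothetical helper, not shipped with Condor.
# Lists the given file if it exists and is executable,
# otherwise prints a message to stderr and returns 1.
show_pvmd() {
    path=$1
    if [ -n "$path" ] && [ -x "$path" ]; then
        ls -l "$path"
    else
        echo "not found or not executable: $path" >&2
        return 1
    fi
}

# Only query Condor when condor_config_val is actually on the PATH.
if command -v condor_config_val >/dev/null 2>&1; then
    show_pvmd "$(condor_config_val PVMD)"
fi
```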


David Brodbeck wrote:
>
> On Oct 31, 2007, at 1:58 AM, Stanislovas Gineitis wrote:
>
>> Hi,
>>
>> I would like to know whether I have to run condor_pvmd before running
>> the Condor PVM example or any other PVM program, because if I don't,
>> the output files stay empty and condor_q doesn't show anything.
>
> I'm curious where you got condor_pvmd from.  I haven't been able to
> locate it.  The manual says it's in the contrib section of the
> website, but there's nothing pvm related listed there.  I asked about
> this on the list several months ago and got no replies, so I assumed
> it was no longer available, but if you're using it apparently that's
> not the case.
>
>
> David Brodbeck
> Information Technology Specialist 3
> Computational Linguistics
> University of Washington