
Re: [Condor-users] Checkpoint server installation problem.



Thanks, Preston.

Does the checkpoint server only work with the standard universe?

I didn't know that.

Thank you.

I ask because I got this message when I ran condor_compile on my Geant4 simulation:

--------------------
Using global libraries ...
Linking Hadrontherapy_neutron_DMX
LINKING FOR CONDOR : /usr/bin/ld.real -L/condor/lib -Bstatic --eh-frame-hdr -m elf_i386 -dynamic-linker /lib/ld-linux.so.2 -o /home/geniejhang/geant4/bin/Linux-g++/Hadrontherapy_neutron_DMX /condor/lib/condor_rt0.o /usr/lib/gcc/i386-redhat-linux/3.4.6/../../../crti.o /usr/lib/gcc/i386-redhat-linux/3.4.6/crtbeginT.o -L/data/CLHEP-2.0.4.5/lib -L/data/geant4.9.3/lib/Linux-g++ -L/home/geniejhang/geant4/tmp/Linux-g++/Hadrontherapy_neutron_DMX -L/lib -L/usr/X11R6/lib -L/usr/lib/gcc/i386-redhat-linux/3.4.6 -L/usr/lib/gcc/i386-redhat-linux/3.4.6 -L/usr/lib/gcc/i386-redhat-linux/3.4.6/../../.. /home/geniejhang/geant4/tmp/Linux-g++/Hadrontherapy_neutron_DMX/exe/Hadrontherapy_neutron_DMX.o -rpath /home/geniejhang/geant4/tmp/Linux-g++/Hadrontherapy_neutron_DMX -lHadrontherapy_neutron_DMX -lG4Tree -lG4FR -lG4GMocren -lG4visHepRep -lG4RayTracer -lG4VRML -lG4OpenGL -lG4gl2ps -lG4vis_management -lG4modeling -lG4interfaces -lG4persistency -lG4error_propagation -lG4readout -lG4physicslists -lG4run -lG4event -lG4tracking -lG4parmodels -lG4processes -lG4digits_hits -lG4track -lG4particles -lG4geometry -lG4materials -lG4graphics_reps -lG4intercoms -lG4global -lGLU -lGL -lXmu -lXt -lXext -lX11 -lXi -lSM -lICE -lCLHEP -lz /condor/lib/libcondorzsyscall.a /condor/lib/libcondor_z.a /condor/lib/libcomp_libstdc++.a /condor/lib/libcomp_libstdc++.a -lm /condor/lib/libcomp_libgcc.a /condor/lib/libcomp_libgcc_eh.a -lcondor_c -lcondor_nss_files -lcondor_nss_dns -lcondor_resolv -lcondor_c -lcondor_nss_files -lcondor_nss_dns -lcondor_resolv -lcondor_c /condor/lib/libcomp_libgcc.a /condor/lib/libcomp_libgcc_eh.a /usr/lib/gcc/i386-redhat-linux/3.4.6/crtend.o /usr/lib/gcc/i386-redhat-linux/3.4.6/../../../crtn.o
/usr/bin/ld.real: cannot find -lHadrontherapy_neutron_DMX
collect2: ld returned 1 exit status
make: *** [/home/geniejhang/geant4/bin/Linux-g++/Hadrontherapy_neutron_DMX] error 1
-------------------------

I have never used the standard universe.

I'll check whether I get a chance to use the standard universe someday.

Thank you again, Preston.

2010/1/28 Preston Smith <psmith@xxxxxxxxxx>
I think you're fine - the checkpoint server is ready to do its thing,
sending updates to the collector, just waiting for checkpoints to
handle.

Have you tried running a standard universe job, and forcing it to
checkpoint with condor_checkpoint?
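
A minimal sketch of such a test, assuming a small C program relinked with
condor_compile. The names myprog, myprog.c and ckpt_test.submit are
hypothetical and not taken from your setup. The submit description file
might look like this:

```
# ckpt_test.submit (hypothetical file name)
# Standard universe jobs must first be relinked with condor_compile.
universe   = standard
executable = myprog
output     = myprog.out
error      = myprog.err
log        = myprog.log
queue
```

Relink with "condor_compile gcc -o myprog myprog.c", submit with
"condor_submit ckpt_test.submit", and once the job is running, run
"condor_checkpoint <hostname>" with the name of the execute machine to ask
it to checkpoint. If the checkpoint server is in use, files should then
appear under CKPT_SERVER_DIR.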


On Tue, Jan 26, 2010 at 12:59 PM, Genie Jhang <geniejhang@xxxxxxxxxxx> wrote:
> Hi. Genie again.
>
> Sorry for asking questions day after day.
>
> This time it's about the checkpoint server.
>
> I read through the page,
> http://www.cs.wisc.edu/condor/manual/v7.4/3_8Checkpoint_Server.html, to
> learn how to install it.
>
> I'm stuck on the passage below. I don't know what the second line means.
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Described in section 3.3.9. To have the checkpoint server managed by the
> condor_master, the DAEMON_LIST variable's value must list both MASTER and
> CKPT_SERVER.
> Also add STARTD to allow jobs to run on the checkpoint server machine.
> Similarly, add SCHEDD to permit the submission of jobs from the checkpoint
> server machine.
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> I added the lines below to the condor_config file on all our machines.
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> DAEMON_LIST                    = MASTER, STARTD, SCHEDD, CKPT_SERVER
> CKPT_SERVER                   = $(SBIN)/condor_ckpt_server
> USE_CKPT_SERVER          = True
> CKPT_SERVER_HOST        = 192.168.0.109
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> and these lines to the condor_config.local file on 192.168.0.109:
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> CKPT_SERVER_DIR             = /data/ckpt_server
> CKPT_SERVER_LOG            = $(LOG)/CkptServerLog
> MAX_CKPT_SERVER_LOG   = 1000000
> CKPT_SERVER_DEBUG       = D_ALWAYS
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Then, my MasterLog file says
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 01/25 17:47:50 Started process "/condor/sbin/condor_ckpt_server", pid and pgroup = 9895
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> and CkptServerLog file says
>
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 01/25 17:47:50 ******************************************************
> 01/25 17:47:50 ** condor_ckpt_server (CONDOR_CKPT_SERVER) STARTING UP
> 01/25 17:47:50 ** $CondorVersion: 7.4.1 Dec 17 2009 BuildID: 204351 $
> 01/25 17:47:50 ** $CondorPlatform: I386-LINUX_RHEL3 $
> 01/25 17:47:50 ** PID = 9895
> 01/25 17:47:50 ******************************************************
> 01/25 17:47:50 CKPT_SERVER running in directory /data/ckpt_server
> 01/25 17:47:50     Server Initializing
> 01/25 17:47:50     Server:
> 01/25 17:47:50     pheko09
> 01/25 17:47:50     Store Request Port:                5651
> 01/25 17:47:50     Store Request Socket Descriptor:   3
> 01/25 17:47:50     Store Request Buffer Size:         87380
> 01/25 17:47:50     Restore Request Port:              5652
> 01/25 17:47:50     Restore Request Socket Descriptor: 4
> 01/25 17:47:50     Restore Request Buffer Size:       87380
> 01/25 17:47:50     Service Request Port:              5653
> 01/25 17:47:50     Service Request Socket Descriptor: 5
> 01/25 17:47:50     Service Request Buffer Size:       87380
> 01/25 17:47:50     Signal handlers installed:         SIGCHLD
> 01/25 17:47:50                                        SIGUSR1
> 01/25 17:47:50                                        SIGUSR2
> 01/25 17:47:50                                        SIGALRM
> 01/25 17:47:50     Total allowable transfers:         50
> 01/25 17:47:50     Number of storing transfers:       50
> 01/25 17:47:50     Number of restoring transfers:     50
> 01/25 17:47:50 Sending initial ckpt server ad to collector
> 01/25 17:47:50 ----------------------------------------------------
> 01/25 17:47:50     Begin removing stale checkpoint files.
> 01/25 17:47:50     Done removing stale checkpoint files.
> 01/25 17:47:50     Next stale checkpoint file check in 86400 seconds.
> 01/25 17:52:50 Sending ckpt server ad to collector...
> 01/25 17:57:50 Sending ckpt server ad to collector...
> 01/25 18:02:50 Sending ckpt server ad to collector...
> 01/25 18:07:50 Sending ckpt server ad to collector...
> 01/25 18:12:50 Sending ckpt server ad to collector...
> 01/25 18:17:50 Sending ckpt server ad to collector...
> 01/25 18:22:50 Sending ckpt server ad to collector...
> 01/25 18:27:50 Sending ckpt server ad to collector...
> 01/25 18:32:50 Sending ckpt server ad to collector...
> 01/25 18:37:50 Sending ckpt server ad to collector...
> 01/25 18:42:50 Sending ckpt server ad to collector...
> 01/25 18:47:50 Sending ckpt server ad to collector...
> 01/25 18:52:50 Sending ckpt server ad to collector...
> 01/25 18:57:50 Sending ckpt server ad to collector...
> 01/25 19:02:50 Sending ckpt server ad to collector...
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> And there are no files in the /data/ckpt_server directory, even though the
> condor user owns it and its permissions are 755.
>
> What did I do wrong?
>
> Thanks for reading this long mail.
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
>
>



--
Preston M. Smith
psmith@xxxxxxxxxx
Sr. UNIX Systems Administrator
Rosen Center for Advanced Computing, Purdue University