[Condor-users] Checkpoint server installation problem.



Hi. Genie again.
 
Sorry for asking questions day after day.
 
This time it's about the checkpoint server.
 
I read through the page, http://www.cs.wisc.edu/condor/manual/v7.4/3_8Checkpoint_Server.html, to learn how to install it.
 
I'm stuck on the passage below. I don't understand what its second line means.
 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Described in section 3.3.9. To have the checkpoint server managed by the condor_master, the DAEMON_LIST variable's value must list both MASTER and CKPT_SERVER.
Also add STARTD to allow jobs to run on the checkpoint server machine. Similarly, add SCHEDD to permit the submission of jobs from the checkpoint server machine.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 
I added the lines below to the condor_config file on all our machines.
 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
DAEMON_LIST                    = MASTER, STARTD, SCHEDD, CKPT_SERVER
CKPT_SERVER                   = $(SBIN)/condor_ckpt_server
USE_CKPT_SERVER          = True
CKPT_SERVER_HOST        = 192.168.0.109
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 
and the lines below to the condor_config.local file on 192.168.0.109:
 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CKPT_SERVER_DIR             = /data/ckpt_server
CKPT_SERVER_LOG            = $(LOG)/CkptServerLog
MAX_CKPT_SERVER_LOG   = 1000000
CKPT_SERVER_DEBUG       = D_ALWAYS
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 
Then, my MasterLog file says
 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
01/25 17:47:50 Started process "/condor/sbin/condor_ckpt_server", pid and pgroup = 9895
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 
and CkptServerLog file says
 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
01/25 17:47:50 ******************************************************
01/25 17:47:50 ** condor_ckpt_server (CONDOR_CKPT_SERVER) STARTING UP
01/25 17:47:50 ** $CondorVersion: 7.4.1 Dec 17 2009 BuildID: 204351 $
01/25 17:47:50 ** $CondorPlatform: I386-LINUX_RHEL3 $
01/25 17:47:50 ** PID = 9895
01/25 17:47:50 ******************************************************
01/25 17:47:50 CKPT_SERVER running in directory /data/ckpt_server
01/25 17:47:50     Server Initializing
01/25 17:47:50     Server:
01/25 17:47:50     pheko09
01/25 17:47:50     Store Request Port:                5651
01/25 17:47:50     Store Request Socket Descriptor:   3
01/25 17:47:50     Store Request Buffer Size:         87380
01/25 17:47:50     Restore Request Port:              5652
01/25 17:47:50     Restore Request Socket Descriptor: 4
01/25 17:47:50     Restore Request Buffer Size:       87380
01/25 17:47:50     Service Request Port:              5653
01/25 17:47:50     Service Request Socket Descriptor: 5
01/25 17:47:50     Service Request Buffer Size:       87380
01/25 17:47:50     Signal handlers installed:         SIGCHLD
01/25 17:47:50                                        SIGUSR1
01/25 17:47:50                                        SIGUSR2
01/25 17:47:50                                        SIGALRM
01/25 17:47:50     Total allowable transfers:         50
01/25 17:47:50     Number of storing transfers:       50
01/25 17:47:50     Number of restoring transfers:     50
01/25 17:47:50 Sending initial ckpt server ad to collector
01/25 17:47:50 ----------------------------------------------------
01/25 17:47:50     Begin removing stale checkpoint files.
01/25 17:47:50     Done removing stale checkpoint files.
01/25 17:47:50     Next stale checkpoint file check in 86400 seconds.
01/25 17:52:50 Sending ckpt server ad to collector...
01/25 17:57:50 Sending ckpt server ad to collector...
01/25 18:02:50 Sending ckpt server ad to collector...
01/25 18:07:50 Sending ckpt server ad to collector...
01/25 18:12:50 Sending ckpt server ad to collector...
01/25 18:17:50 Sending ckpt server ad to collector...
01/25 18:22:50 Sending ckpt server ad to collector...
01/25 18:27:50 Sending ckpt server ad to collector...
01/25 18:32:50 Sending ckpt server ad to collector...
01/25 18:37:50 Sending ckpt server ad to collector...
01/25 18:42:50 Sending ckpt server ad to collector...
01/25 18:47:50 Sending ckpt server ad to collector...
01/25 18:52:50 Sending ckpt server ad to collector...
01/25 18:57:50 Sending ckpt server ad to collector...
01/25 19:02:50 Sending ckpt server ad to collector...
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
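 
In case it helps with diagnosis, here is a minimal sketch for checking from another machine whether the three ports reported in the log above (5651, 5652, 5653) accept TCP connections. The function name check_ports is made up for illustration, and I am assuming that a plain TCP connect is a fair reachability test for the checkpoint server.

```python
import socket

def check_ports(host, ports, timeout=2.0):
    """Return {port: bool}, True if a TCP connection to host:port succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or no route
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results

# Example call (host from CKPT_SERVER_HOST, ports from CkptServerLog):
# print(check_ports("192.168.0.109", [5651, 5652, 5653]))
```

A False entry would only say that the port cannot be reached from the machine running the script. It does not say anything about whether any job actually tried to write a checkpoint.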
 
And there are no files in the /data/ckpt_server directory, even though the directory is owned by condor and has 755 permissions.
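 
For reference, this is roughly how the directory's owner, permissions, and contents can be double-checked. It is only a sketch: describe_dir is a hypothetical helper name, and it is meant to be run on the checkpoint server machine itself (Unix only, since it uses the pwd module).

```python
import os
import pwd
import stat

def describe_dir(path):
    """Return the owner name, permission bits, and entries of a directory."""
    st = os.stat(path)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        # uid has no passwd entry; fall back to the numeric id
        owner = str(st.st_uid)
    return {
        "owner": owner,
        "mode": oct(stat.S_IMODE(st.st_mode)),  # e.g. '0o755'
        "entries": sorted(os.listdir(path)),
    }

# Example call on the checkpoint server machine:
# print(describe_dir("/data/ckpt_server"))
```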
 
What did I do wrong?
 
Thanks for reading this long mail.