
Re: [HTCondor-users] Jobs are dumped



On 12/1/2017 3:07 PM, Roberto Tavares wrote:
Hello,

I think I'm almost there!

I'm trying to run a simple script:

echo "It Works" >> /tmp/thisshouldwork.txt


Are you specifying "thisshouldwork.txt" as your executable? If so, I would not expect it to work. Does it work from the command prompt without involving HTCondor? (My guess is no.) Instead of

   echo "It Works"

you probably want

   #!/bin/sh
   echo "It Works"

and then run chmod 700 thisshouldwork.txt (to set the executable bit). This is simply how a Linux/Unix environment works; nothing here is specific to HTCondor.
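To make the steps above concrete, here is a minimal sketch you can try from the command prompt before involving HTCondor at all. The file name thisshouldwork.sh and the temporary working directory are just assumptions for illustration; use whatever name and location you like.

```shell
#!/bin/sh
# Hypothetical scratch directory and file name, for illustration only.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Write the two-line script. The "#!/bin/sh" first line tells the
# kernel which interpreter should run the file.
cat > thisshouldwork.sh <<'EOF'
#!/bin/sh
echo "It Works"
EOF

# Set the executable bit for the owner, then run the script directly.
chmod 700 thisshouldwork.sh
result=$(./thisshouldwork.sh)
echo "$result"
```

If the script does not run this way from the shell, it will not run under HTCondor either.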

Take a look at
 http://research.cs.wisc.edu/htcondor/manual/quickstart.html
I think you will find it helpful for getting started; it covers the issues above.

Looking at the log below, it appears you submitted the job to HTCondor's "standard" universe, which you likely do not want to do (unless you have the C or Fortran source code for your program). Instead, you want the 'vanilla' universe, which you select by placing the following into your submit file:

   universe = vanilla

(This is the default on recent HTCondor installs.)
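For context, a complete submit file for a job like this might look roughly like the sketch below, written out from the shell. The file names job.sub, thisshouldwork.sh, job.out, job.err and job.log are assumptions for illustration and are not taken from your setup.

```shell
#!/bin/sh
# Hypothetical names throughout: job.sub, thisshouldwork.sh, job.out/.err/.log.
submit_file=job.sub

# Write a minimal vanilla-universe submit description.
cat > "$submit_file" <<'EOF'
universe   = vanilla
executable = thisshouldwork.sh
output     = job.out
error      = job.err
log        = job.log
queue
EOF

# Show what was written.
cat "$submit_file"

# With HTCondor installed, the job would then be submitted with:
#   condor_submit job.sub
```

The output, error and log lines are optional, but they give you somewhere to look when a job appears to vanish.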

Hope the above helps,
Todd

What happens:

- it goes to the queue
- it is removed from the queue
- it does not run (log files empty and file in tmp is not created) and it seems to fall into some black hole... :(

The only place I could find that shows any error is the ShadowLog file, which gives me:

12/01/17 18:51:25 (?.?) (74915):******* Standard Shadow starting up *******
12/01/17 18:51:25 (?.?) (74915):** $CondorVersion: 8.4.12 Jul 06 2017 BuildID: 409562 $
12/01/17 18:51:25 (?.?) (74915):** $CondorPlatform: x86_64_Ubuntu14 $
12/01/17 18:51:25 (?.?) (74915):*******************************************
12/01/17 18:51:25 (?.?) (74915):uid=0, euid=122, gid=0, egid=131
12/01/17 18:51:25 (?.?) (74915):Hostname = "<xxx.xxx.xxx.xxx:17345?addrs=xxx.xxx.xxx.xxx-17345>", Job = 37.0
12/01/17 18:51:25 (37.0) (74915):Requesting Primary Starter
12/01/17 18:51:25 (37.0) (74915):Shadow: Request to run a job was ACCEPTED
12/01/17 18:51:25 (37.0) (74915):Shadow: RSC_SOCK connected, fd = 17
12/01/17 18:51:25 (37.0) (74915):Shadow: CLIENT_LOG connected, fd = 18
12/01/17 18:51:25 (37.0) (74915):My_Filesystem_Domain = "my domain"
12/01/17 18:51:25 (37.0) (74915):My_UID_Domain = "my domain"
12/01/17 18:51:25 (37.0) (74915):*Can't get address for checkpoint server host (NULL): No such file or directory*
12/01/17 18:51:25 (37.0) (74915):   Entering pseudo_get_file_stream
12/01/17 18:51:25 (37.0) (74915):   file = "/var/lib/condor/spool/37/cluster37.ickpt.subproc0"
12/01/17 18:51:25 (37.0) (74915):Created TCP listen socket <xxx.xxx.xxx.xxx:41412>
12/01/17 18:51:25 (37.0) (74915):Shadow: Job 37.0 exited, termsig = 0, coredump = 128, retcode = 0
12/01/17 18:51:25 (37.0) (74915):user_time = 1 ticks
12/01/17 18:51:25 (37.0) (74915):sys_time = 0 ticks
12/01/17 18:51:25 (37.0) (74915):*Shadow: Cannot notify user( Condor Job 37.0, tavares, w )*
12/01/17 18:51:25 (37.0) (74915):Static Policy: removing job because OnExitRemove has become true
12/01/17 18:51:25 (37.0) (74915):********** Shadow Exiting(102) **********

Just to keep it simple, I'd rather avoid using the checkpoint server. Is that possible?

I'm a little clueless now... can you give me any help on that?

Thank you!!!!

Roberto



_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685