Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] checkpointing the vanilla universe under windows

Date: Tue, 21 May 2013 12:21:46 +0100
From: Antonis Sergis <sergis_antonis@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] checkpointing the vanilla universe under windows

I was thinking of that as well, so I gave the entire network full access to the spool directory of the submitting machine. Still the same problem. Quite frustrating.

A.

-----Original Message-----
From: Ian Cottam
Sent: Monday, May 20, 2013 8:16 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] checkpointing the vanilla universe under windows


Could it be a firewall issue?
-Ian
--
Ian Cottam
IT Services Research Lead
IT Services -- supporting research
The University of Manchester
[ATD - Action This Day - Churchill]





On 20/05/2013 18:47, "Antonis Sergis" <sergis_antonis@xxxxxxxxxxx> wrote:

Hello Ian,

Yes, I have set want_vacate = TRUE in the condor configuration file and
reconfigured condor. Still the same problem. Apparently the
"_condor_stdout" file is never created inside the spool directory, even
though the directory is there. Here is the full log file up to the point
where I force a job to vacate a node:

000 (042.000.000) 05/20 18:43:51 Job submitted from host: <155.***>
...
001 (042.000.000) 05/20 18:44:11 Job executing on host: <155.***>
...
006 (042.000.000) 05/20 18:44:21 Image size of job updated: 750
   1  -  MemoryUsage of job (MB)
   112  -  ResidentSetSize of job (KB)
...
006 (042.000.000) 05/20 18:44:31 Image size of job updated: 5520
   1  -  MemoryUsage of job (MB)
   112  -  ResidentSetSize of job (KB)
...
007 (042.000.000) 05/20 18:44:31 Shadow exception!
   Error from slot1@***: STARTER at 155.*** failed to send file(s) to <155.***>; SHADOW at 155.*** failed to write to file c:\condor/spool\42\0\cluster42.proc0.subproc0.tmp\_condor_stdout: (errno 2) No such file or directory
   179244  -  Run Bytes Sent By Job
   1528896  -  Run Bytes Received By Job
...
012 (042.000.000) 05/20 18:44:31 Job was held.
   Error from slot1@***: STARTER at 155.*** failed to send file(s) to <155.***>; SHADOW at 155.*** failed to write to file c:\condor/spool\42\0\cluster42.proc0.subproc0.tmp\_condor_stdout: (errno 2) No such file or directory
   Code 12 Subcode 2
...


-Antonis

-----Original Message-----
From: Ian Cottam
Sent: Monday, May 20, 2013 6:17 PM
To: HTCondor-Users Mail List
Cc: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] checkpointing the vanilla universe under windows

Did you check every node has
Want_vacate = true
In all local configs? And restart them?
-Ian



On 20 May 2013, at 17:04, "Antonis Sergis" <sergis_antonis@xxxxxxxxxxx> wrote:

Hello Ian,

Thank you for the useful notes. I have now changed my source code so
that I can extract more information about what is going on under condor
(I added some counters to record how many times the code has restarted,
etc.).

To my surprise, when I intentionally vacated the job to see what
happens, the following error was logged by condor each time a job was
vacated (asterisks added to protect privacy):

Error from slot2@***.uk: STARTER at 155.*** failed to send file(s) to <155.***>; SHADOW at 155.*** failed to write to file c:\condor/spool\48\0\cluster48.proc0.subproc0.tmp\_condor_stdout: (errno 2) No such file or directory
  Code 12 Subcode 2

The positive thing though is that I can read and write inside the same
file (read the input flag and then I was able to change it).

Checking the spool directory, the folder
c:\condor/spool\48\0\cluster48.proc0.subproc0.tmp is present but always
empty. It appears that when a job is vacated, the files are not
transferred into the spool directory, so the code cannot be restarted on
a different node. The counters in my source code indicate the same: the
code is never restarted, and it completes only if it is left running
uninterrupted on a single node.

Any ideas?

P.S. periodic_release=True is to enable more frequent automatic releases
of jobs getting held (I am running around 60k jobs).

Antonis

-----Original Message-----
From: Ian Cottam
Sent: Monday, May 20, 2013 1:51 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] checkpointing the vanilla universe under windows

We are Linux based here in Manchester, but we took advice from Liverpool
who are Windows based, so it should all work.

Here is our web page on the topic:

<http://condor.eps.manchester.ac.uk/examples/user-level-checkpointing-an-example-in-c/>
The has_checkpointing is just a local thing: you don't need it if all
your clients are set up identically and to support user checkpointing.
To that end, I believe they need
WANT_VACATE = True
in all their local configs. Then all should work.

Here is the Liverpool page
<http://condor.liv.ac.uk/checkpoint/>

I'm not sure why you have periodic_release=True, we don't use that. But
if you could explain it to me, maybe we should be!

Hope that helps.
-Ian


On 20/05/2013 12:41, "Antonis Sergis" <sergis_antonis@xxxxxxxxxxx> wrote:

Dear all,

I am coming back to the hot "checkpointing the vanilla universe" issue
under windows. I have a Fortran 90 code which can run for a while. For
longer runs, condor's performance drops significantly as jobs get
interrupted by users; with no native checkpointing function and no way
to use the "standard" universe, the code has to restart from the
beginning on a different machine. As a result, jobs seldom manage to
finish. I changed my source code to add a checkpointing feature.

The code reads a "flag" file (which is also one of the initial input
files) and creates a checkpoint file with all the data required to
resume a job from where it left off. The flag file initially contains a
"0". The first checkpoint takes place after one hour of elapsed time,
and then every hour from there onwards. At a checkpoint the flag file is
supposed to be updated to "1" and a "history" file is created holding
the checkpoint data. The idea is that when the code gets evicted and
restarts, it will read the flag file as "1" and then use the "history"
file to read the last checkpoint data and resume from where it left off.

This doesn't seem to be working. I am not sure whether the flag file
gets updated and re-read when the job restarts. I am also not sure
whether condor will be able to read the "history" file, which was
created as an output file and is not in the initial input files list.
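
The flag/history scheme described above can be sketched roughly as
follows. This is only an illustration in Python rather than the Fortran
90 of the real code; the file names "flag" and "history" come from the
description, while the function names, the JSON format of the history
file and the "step"/"restarts" fields are assumptions made for the
example.

```python
import json
import os
import tempfile

FLAG_FILE = "flag"        # from the post: contains "0" or "1"
HISTORY_FILE = "history"  # from the post; JSON content is an assumption


def atomic_write(path, text):
    """Write to a temp file in the same directory, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.replace(tmp, path)


def save_checkpoint(state, workdir="."):
    """Write the history file first, then set the flag to "1"."""
    atomic_write(os.path.join(workdir, HISTORY_FILE), json.dumps(state))
    atomic_write(os.path.join(workdir, FLAG_FILE), "1")


def load_state(workdir="."):
    """Resume from history if the flag is "1", otherwise start fresh."""
    flag_path = os.path.join(workdir, FLAG_FILE)
    history_path = os.path.join(workdir, HISTORY_FILE)
    flag = "0"
    if os.path.exists(flag_path):
        with open(flag_path) as handle:
            flag = handle.read().strip()
    if flag == "1" and os.path.exists(history_path):
        with open(history_path) as handle:
            return json.load(handle)
    # Hypothetical initial state; the real code would use its own data.
    return {"step": 0, "restarts": 0}
```

Writing the history file before the flag means a job interrupted in the
middle of a checkpoint never sees a "1" flag without matching history
data. The sketch covers only the application side; it says nothing about
whether the two files actually travel with the job when it is evicted.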


Any ideas?

This is the current submit file I am using to accommodate the checkpoint
function:

************************
************************
Requirements = (Memory >=900) && (Arch=="X86_64") && (OpSys=="WINDOWS")
Executable = \\htcondor\htcondorjobs\\****\T2\mds.exe
initialdir = \\htcondor\htcondorjobs\\****\T2
transfer_input_files = mds.exe, input, flag
Universe = vanilla
Getenv = False
output = Test_cores.out
error = Test_cores.err
log = Test_cores.log
should_transfer_files = ALWAYS
when_to_transfer_output = ON_EXIT_OR_EVICT
periodic_release = TRUE
Queue 250
************************
************************

Regards
Antonis



--
Ian Cottam x61851
IT Services Research Lead
IT Services -- supporting research
The University of Manchester
[ATD - Action This Day - Churchill]




_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
