
[Condor-users] why does checkpoint size differ for the same job so much?



Hi,
I have two versions of the same job that I submit to Condor. The first version consists of a checkpointable
application along with a Java program of mine that transfers the checkpoints I take periodically. The application
itself is written in C and is not multi-threaded. The executable has been condor_compiled.
The second version has the same application with the same input file, along with a Java program of mine
that is multi-threaded and also transfers the checkpoints I take periodically. The "take checkpoint" signal
is generated by a shell script of mine that I submit as part of the whole job package. I am using the
vanilla universe.

Now my question is: the image size of the first version of the job is considerably smaller than that of the second.
Is this the reason why the checkpoint generated by the first version is around 200MB, whereas the
checkpoint generated by the second version is around 800MB? Is Condor taking a checkpoint of the process spaces
of all the processes (Java and C)? I thought Condor would save only the image of my C application, since that
is the only program that has been condor_compiled. Am I missing something?


Section 7.3 of http://www.cs.wisc.edu/condor/manual/v7.0/7_3Running_Condor.html#SECTION008319000000000000000
states that Condor cannot correctly calculate a job's image size if the job has multiple threads. To clarify again,
my C program does not use threads, but my Java program does (in the second version).

I would really appreciate any direction. Thanks in advance.

Tan

--
Tanzima Zerin Islam
Graduate Student
School of Electrical & Computer Engineering
Purdue University