CCLM failed with error code 1004 – in #9: CCLM

  @rolfzentek in #da0f3fa

Dear Delei,

I have never encountered this kind of variability (4 different outputs) with the same setup on the same computer. I often compare different model versions (when we change something), and from that I know that even after 30 h of simulation the difference of the output fields is zero (at least in ncview), at least when I run them on the same computer.

[ if you want to track down the problem ]
Are the non-NaN values of the 4 simulations the same? (See the sketch after this list for one way to check.)
-> If no, I guess it is a weird chaos effect that causes the crash to happen differently, and I would try to get a stable setup that no longer produces this chaos effect (for example by using the same node of the supercomputer, if the nodes have different configurations).
-> If yes, I have no idea and would start suspecting faulty hardware causing random NaNs.
-> Using only 1 CPU/core (procx=1, procy=1) may also be a setup you could test. As far as I know, this should not change the simulation output, but I had a case (with INT2LM) where a bug occurred only when parallel computing was used.
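
A minimal sketch of how such a comparison could look, assuming the runs write NetCDF output; the file paths and the variable name are hypothetical and need to be adapted to your setup:

import numpy as np
from netCDF4 import Dataset

def compare_runs(file_a, file_b, varname):
    """Report whether the non-NaN values of one variable agree between two runs."""
    with Dataset(file_a) as a, Dataset(file_b) as b:
        # Fill masked/missing points with NaN so both runs use the same convention
        va = np.ma.filled(a.variables[varname][:], np.nan).astype(float)
        vb = np.ma.filled(b.variables[varname][:], np.nan).astype(float)
    # Compare only points that are finite (non-NaN) in both runs
    mask = np.isfinite(va) & np.isfinite(vb)
    same = np.array_equal(va[mask], vb[mask])
    print(f"{varname}: {mask.sum()} comparable points, identical = {same}")
    return same

# Hypothetical output files from two of the 4 runs:
compare_runs("run1/lffd2021010100.nc", "run2/lffd2021010100.nc", "T_2M")

Checking the first output step as well as later ones would also show whether the runs diverge right away or only over time.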

[ if you want a workaround ]
Do you have any working setup right now from which you could slowly deviate towards the setup you need, step by step, always checking at which point the error occurs?

Cheers
Rolf