[ale] Disappointed in the recent climate research hack

Jeff Layton laytonjb at att.net
Mon Nov 23 16:15:33 EST 2009


Sorry for the top posting :)

There are some climate modeling apps that are open-source. You can
cruise around and find them (e.g. WRF, POP, CAM, etc.). Some are
very tightly controlled by the labs (NCAR, UCAR, and universities)
to prevent and/or control problems. For example, anyone could grab
a random data set, run it through the code, and out pop results that
could be used to argue a conclusion one way or the other ("hey, the
application gave me the answers, so that must mean they are right!").

The scientific process, as Jim pointed out, is pretty thorough. Reviewers
look at the data itself, at the application (both the model and how its
results compare to known data sets), and then at the interpretation of
the data, all under a microscope.

This same process is true for many scientific and engineering disciplines.
If you develop a new model or technique, you need to apply it against
known data sets and compare the results. Only after passing these tests
is it accepted by the community. But as Jim points out, that doesn't give
you a free ride - you still have to be very rigorous when applying the
application to other problems. You have to do simple sanity checks, make
quick comparisons to simple models, interpret any "strange" results, and
defend your conclusions. I ran through this process when I was doing
research. You have to be thorough if you want people to listen to you
or believe your data - otherwise you are viewed as a "nut" (I guess the
closest thing in the open-source world would be a troll). While it can be
a very harsh process from the outside, it usually is viewed as a more
rigorous process from the inside. They even argue about the known data
sets ("test cases") to make sure the data is correct (were the sensors
calibrated correctly? How did you prove they were calibrated correctly?
Did you do some basic sanity checks? If so, what were they? What are
the sources of error in the measurements? How were they measured? How
do the error sources propagate into the measurements? Were the data
files checked for errors when they were copied from the test systems?
How was the data checked? On and on).
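
To make a couple of those checks concrete, here is a minimal sketch
(hypothetical file names and tolerances, not from any real project) of
what automating them might look like - verifying a checksum on a data
file copied off a test system, and comparing a model run against a known
reference data set within an error tolerance:

import hashlib

def sha256sum(path, chunk_size=1 << 20):
    # Hash the file in chunks so large data sets don't exhaust memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def copy_is_intact(original_path, copied_path):
    # Did copying the file off the test system corrupt anything?
    return sha256sum(original_path) == sha256sum(copied_path)

def matches_reference(model_out, reference, rel_tol=0.01):
    # Compare model output to a known data set, point by point,
    # and report the worst relative error seen.
    worst = 0.0
    for m, r in zip(model_out, reference):
        if r != 0:
            worst = max(worst, abs(m - r) / abs(r))
    return worst <= rel_tol, worst

Nothing fancy, but it answers two of the questions above mechanically
instead of relying on somebody remembering to check.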

My area is aeronautical/astronautical engineering, and I can tell one story
where NASA ran some wind tunnel tests with their new cryogenic wind
tunnel (it lets them get data at conditions they couldn't achieve
otherwise). Turns out they didn't check the data after the test. The force
measurements should have been the same when the tunnel was not blowing
air before the test started and after the test ended. The problem was that
no one checked that until much later. Oops. NASA ultimately found the
problem and corrected it, but the data was not overly useful at that point,
so they had to go back and redo the tests.
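
For what it's worth, the check they skipped is trivial to automate.
A quick sketch (made-up channel names, readings, and tolerance, purely
to illustrate the idea): compare the wind-off force-balance readings
taken before the run with those taken after, and flag the run if the
zeros drifted:

def check_wind_off_zeros(pre_run, post_run, max_drift=0.05):
    # With no flow, the force balance should read (nearly) the same
    # before and after the run. A large drift means something moved,
    # slipped, or iced up, and the run data is suspect.
    flagged = []
    for channel, before in pre_run.items():
        after = post_run[channel]
        if abs(after - before) > max_drift:
            flagged.append((channel, before, after))
    return flagged

# Hypothetical readings (newtons); the tolerance is illustrative.
pre  = {"normal": 0.01, "axial": -0.02, "pitch": 0.00}
post = {"normal": 0.02, "axial": 0.31, "pitch": 0.01}
print(check_wind_off_zeros(pre, post))  # axial drifted - redo the run

Had something like that run automatically after every test, the bad
data would have been caught the same day instead of much later.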

Jeff

> The source code that runs applications is NOT under embargo rules. It
> is not considered data and is (usually) custom-written,
> semi-proprietary code owned by the lab/researcher who oversaw the
> (intern) person who wrote it. However, the operation of the
> application (i.e. the model itself) is under intense scrutiny.
>
> I'm not sure about the source code to the modeling application itself. I
> do know that the math and theory are public in the research papers
> themselves. I suspect the source code for the final work that is NOAA
> research _is_ public somewhere. However, I do know that much of the
> programming used to run the instruments that collect the data is
> NOT public (my BiL wrote a crapton for his collecting gear [a small
> semi-truck-sized cargo crate that takes up a chunk of a C5A cargo
> plane] which runs XP and thus has the associated mentalities).
>
> On Mon, Nov 23, 2009 at 2:25 PM, Greg Freemyer <greg.freemyer at gmail.com> wrote:
>   
>> Jim,
>>
>> I hope you're right about the embargo process, but the only chunk
>> of source code I saw a reference to was supposedly 1999 code. So if
>> the embargo is 10 years, that is ridiculous. 6 or 12 months would be
>> fine.
>>
>> The few emails I've seen quoted were also 10-year-old emails, but I am
>> not saying I think those should be public.  It is the source code to
>> the models and the data they are using that I think should be handled
>> under an open license of some sort.
>>
>> Greg
>>
>> On Mon, Nov 23, 2009 at 2:03 PM, Jim Kinney <jim.kinney at gmail.com> wrote:
>>     
>>> I have a bit of insight into the research data issue (brother-in-law
>>> works in the field that had the data loss):
>>>
>>> The data when first generated/collected is held in an embargo for a
>>> period of time. This time period varies but is often for 6 months to
>>> one year. This is done to allow time for the research team who did the
>>> work to collect it to also do the work to write it up and present
>>> it. It's pretty much a "geek cred" thing. It also allows time to do a
>>> proper analysis to make sure that the data is not flawed in some way
>>> _before_ it's made public.
>>>
>>> During the embargo time, the researchers with access to the data are
>>> not allowed to discuss the initial findings or disperse data copies.
>>>
>>> Once the embargo period is over, the data is made fully available
>>> along with the research findings and all the supporting papers.
>>>
>>> Science does not (and probably should not) work on a release early,
>>> release often process.
>>>
>>> So the unauthorized data access was of embargoed data. Without
>>> the details of the collection methodology, it is not possible to draw
>>> any valid conclusions from it. That's why the researchers spend so long
>>> on the writeups. They have to explain why certain data is not valid
>>> (hard) and other data is valid (very hard) and why their conclusion is
>>> what it is (extremely hard).
>>>
>>> The schmuck who broke in had an agenda. He (most likely "he") has an
>>> axe to grind and no understanding of the research process or why it is
>>> done the way it is. So now that incomplete data set will be "outed"
>>> and be used to "justify" his cause. It will have little impact on the
>>> actual research but will likely have great influence on the
>>> scientifically illiterate congress critters.
>>>
>>>
>>> On Mon, Nov 23, 2009 at 1:25 PM, Greg Freemyer <greg.freemyer at gmail.com> wrote:
>>>       
>>>> All,
>>>>
>>>> Not sure everyone knows, but a major climate research center was hacked
>>>> recently, and in addition to 1,000 or so emails, some of their source
>>>> code was published!
>>>>
>>>> In this age of OPEN research and government funding, why wasn't that
>>>> code OPEN in the first place?
>>>>
>>>> I don't care which side of the Global Warming debate you sit on; we
>>>> should all feel this is important enough that the modeling code should
>>>> be published under the GPL (or a similar license) and available for
>>>> peer review.
>>>>
>>>> If one of you knows of the "best" license for this kind of use, I want
>>>> to contact my senator and congressman and tell them we need
>>>> legislation that all federally funded climate change research should
>>>> have both the data and the software models released to the public!
>>>>
>>>> I encourage all OSS advocates to do the same. This seems like an
>>>> issue that requires an OSS philosophy more than any other I can think
>>>> of.
>>>>
>>>> After all, if the government thinks climate change is worth
>>>> implementing cap and trade over, then it is important enough to let
>>>> the public know how the models work.
>>>>
>>>> Thanks
>>>> Greg
>>>> _______________________________________________
>>>> Ale mailing list
>>>> Ale at ale.org
>>>> http://mail.ale.org/mailman/listinfo/ale
>>>> See JOBS, ANNOUNCE and SCHOOLS lists at
>>>> http://mail.ale.org/mailman/listinfo
>>>>



More information about the Ale mailing list