I was just thinking about whether it makes sense to use -O2 or -O3 for compiling production software these days. Back, say, 5 years ago, it was easy for me to diss "gcc -O3" because it was so "obviously" full of bugs. But these days, with the rise of fuzz-testing and so on, are we seeing an increase in the reliability of "gcc -O3" to the point where -O3 is just as reliable as -O2?
What I'm really looking for is basically a graph where the X-axis is "GCC version" and/or "year", and the Y-axis is "bugginess", perhaps measured as "Csmith test cases per thousand whose -O3 output differs from their -O0 output".
Even more interesting would be a family of these graphs, for "gcc -O2", "clang -O2", "clang -O3", etc. That's really the part that would allow me to find out whether my anti-O3 bias is (or was ever) justified.
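To make the metric concrete, here is a rough sketch of how one data point on such a graph might be computed. It is only an illustration under assumptions: the function names are hypothetical, the test programs are assumed to be already-generated C files that print a checksum on stdout (as Csmith programs do), and any program that fails to build, crashes, or times out at either level is simply discarded rather than counted.

```python
import subprocess
import tempfile
from pathlib import Path


def run_at_level(compiler, source, opt_level, include_dirs=(), timeout=10):
    """Compile `source` at one optimization level and return the program's
    stdout, or None if compiling or running it fails or times out."""
    with tempfile.TemporaryDirectory() as tmp:
        exe = Path(tmp) / "a.out"
        cmd = [compiler, opt_level, "-w", str(source), "-o", str(exe)]
        cmd += [f"-I{d}" for d in include_dirs]
        try:
            subprocess.run(cmd, check=True, capture_output=True, timeout=120)
            done = subprocess.run([str(exe)], capture_output=True, timeout=timeout)
        except (subprocess.SubprocessError, OSError):
            return None
        return done.stdout if done.returncode == 0 else None


def mismatches_per_thousand(outputs):
    """`outputs` is an iterable of (reference_output, optimized_output) pairs.
    Pairs with a None on either side are skipped (assumption: unusable test)."""
    usable = [(a, b) for a, b in outputs if a is not None and b is not None]
    if not usable:
        return 0.0
    differing = sum(1 for a, b in usable if a != b)
    return 1000.0 * differing / len(usable)


def bugginess(compiler, sources, level="-O3", include_dirs=()):
    """Y-axis value for one (compiler, level) pair, with -O0 as the reference."""
    pairs = (
        (run_at_level(compiler, src, "-O0", include_dirs),
         run_at_level(compiler, src, level, include_dirs))
        for src in sources
    )
    return mismatches_per_thousand(pairs)
```

Calling `bugginess` once per compiler binary and per level (for example "-O2" and "-O3" with each installed gcc and clang) would give the family of curves. Csmith programs include "csmith.h", so `include_dirs` would need to point at wherever the Csmith runtime headers live on your machine. Note that discarding timeouts and crashes hides some real bugs, so this undercounts by design.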
John's September 2013 blog post "Are Compilers Getting More or Less Reliable?" addresses a similar question, but not in exactly the terms I'm interested in. Namely, there are only two data values on his X-axis ("2.7" and "trunk-as-of-2013"), and his Y-axis conflates -O3 bugs with all other kinds of bugs.
Does anyone have any answers (even partial answers) related to the above question?
Or even any spare grad students who can tackle it? ;)