
Sunday, 08 February 2009

Comments


I don't buy it. In ordinary English, the "X quality" of something isn't the quality of X it produces under optimal circumstances - it's something more complicated, something more like the quality of X it can be *expected* to produce. This is where the analogy with height breaks down: "height," when said of people, reliably means "height when standing up, without stilts or elevating shoes, and not traveling at relativistic velocities..." The word "quality" simply isn't used this way.

Dear Mike,

I have a serious problem. I can only afford to buy one book. Looking at the reviews on Amazon, Whyte's book was rated as follows by reviewers:

rating: number of reviewers--

5 star: (45)
4 star: (28)
3 star: (18)
2 star: (11)
1 star: (6)

Kida's book came out:

5 star: (20)
4 star: (9)
3 star: (3)
2 star: (2)
1 star: (1)

Based on this data, which book should I buy?

{g,d,&r}


~ pax \ puzzled Ctein
[ Please excuse any word-salad. MacSpeech in training! ]
======================================
-- Ctein's Online Gallery http://ctein.com 
-- Digital Restorations http://photo-repair.com 
======================================
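For anyone tempted to take the tongue-in-cheek question literally, here is a minimal Python sketch that compares the two rating distributions by a plain weighted average of stars. The counts are the ones quoted above; the function name and the comparison method are purely illustrative, one of many possible ways to reduce the data to a number.

```python
# Compare two Amazon-style star-rating distributions by weighted average.
# The counts are the ones Ctein quotes above; the method is illustrative only.

def weighted_average(counts_by_star):
    """counts_by_star maps a star value (1-5) to the number of reviewers."""
    total_reviews = sum(counts_by_star.values())
    total_stars = sum(star * n for star, n in counts_by_star.items())
    return total_stars / total_reviews

whyte = {5: 45, 4: 28, 3: 18, 2: 11, 1: 6}
kida = {5: 20, 4: 9, 3: 3, 2: 2, 1: 1}

print(f"Whyte: {weighted_average(whyte):.2f} stars from {sum(whyte.values())} reviews")
print(f"Kida:  {weighted_average(kida):.2f} stars from {sum(kida.values())} reviews")
```

On that deliberately naive measure, Kida's book averages higher (about 4.3 stars from 35 reviews) while Whyte's has roughly three times as many reviews at a lower average (about 3.9 from 108), which is exactly the ambiguity the joke trades on.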

Ctein,
...And you're a photographer? Then clearly, you should wait to buy until someone comes out with a book that gets all 5's.

Mike J.

I always get swamped by the "who's buried in Grant's tomb" kind of stuff.
You got me...
dale

In both cases, this and the previous post, I prefer the bokeh of lens number 2. Just pretending, of course.

Hi Mike,
As a statistician I was interested in your question (from a technical point of view). I didn't give an answer, however, as I knew most answers I could think of would be skewed to the more theoretical aspect than the practical (my question of interest would rather have been: "given the scores on each setting, which camera is the better?", which is then really unanswerable because of the lack of repetition, and, really, here I'm again skewed toward the theoretical). I usually avoid score-based reviews (no matter the subject) because they are usually based on a sample of size 1, which can't inform on either the variability of the product or the variability of the reviewer (I suspect the latter is the greater, ha!).

Now, and I can't stress this enough, my comments are coming from a statistical point of view, which is very different from, say, a buyer's point of view, or a reviewer's. I'm often forced to explain to my clients that even differences that are statistically significant can be meaningless in practice (and the opposite can also be true). There has to be a certain level of restraint (some might say conservatism) when interpreting statistical results, and almost everybody misunderstands this from time to time (statisticians, and me, included).

In this particular case, I tend to agree with the answer being the maximum (the goal being to judge what the maximal image quality obtainable is). But then, the caveat is that you lose all information about the minimal image quality attainable, an interesting tidbit for statisticians (I guess it's possible for any reviewer to get a terrible image out of even the best camera out there? And what buyer cares about how bad a picture he can make?). The question that keeps popping into my mind is: are these scores meant to be compared? I suppose that's what a potential buyer would want to do. Then, we might wonder if a buyer can say, based on a single observation of two cameras, which camera is better.

I think the answer here is that if you plan on buying on a score (!), you can't base your decision on a single article; you'd have to survey all reviews of all the cameras you want to compare. I, personally, would much rather buy on experience than on a score, saving me the trouble of analyzing countless articles (mathematicians are famed for their laziness [irony alert]).

In the end, to return to the original question, many of the answers given in the original post weren't invalid. They were simply scoring a different criterion than maximal image quality. And I could have proposed many different and sophisticated scoring methods. However, how many of those scores would have been meaningful? That is a question many statisticians should ask themselves more often.

(Sorry for the rant here but, as much as photography is a hobby, statistics is my passion and any discussion of it is bound to rile me up in one way or another).

I'll have to disagree. You said:
Your task is to assign one rank or grade for "image quality" to each device.

If 9 is the max one device is capable of, then its sensor is better, but that doesn't equate to "image quality," IMO. It would be as misleading without details as any other rating, letting a prospective user (who may never be able to use the '9' setting in practice) think he's getting the device that gives him better image quality.

I think this same argument can be applied to any rating, so I basically believe there is no right answer; no single numeric rating, without the details, is useful.

Your answer makes sense if we take "image quality" quite literally (as in ranking lenses for resolution with total flexibility in how they're used in order to obtain that resolution). I can go back and read the prior post that way. I suppose the lesson to take from this is that if given a task to rank imaging devices based on image quality, clarify the definition of image quality with the person assigning the task :)

Oh, come ON!

Instead, let's assume that the two devices had these scores:

2 2 2 2 9 2 2 2
8 8 8 8 8 8 8 8

According to your reasoning, the first device is superior.

Riiiiight. (Wink, wink, nudge, nudge ...)

"2 2 2 2 9 2 2 2
8 8 8 8 8 8 8 8"

Bill Rogers,
Of course. The first one is capable of 9, the second one is capable of 8. Which is higher, 9 or 8?

The point is that dxomark is gauging the potential performance of the sensor. In that case they really have to look at the highest performance a given device is capable of. How could they possibly rate a 9 less than an 8? It doesn't matter what the average is. They're not measuring average performance.

It's up to reviewers to say things like "this camera is more consistent and will give you better results over a wider range of conditions" (which, by the way, sounds a lot like something I would write...).

Mike J.
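For anyone who wants to see both aggregates side by side, here is a minimal Python sketch (illustrative only) that scores Bill's two hypothetical devices by their peak value and by their mean.

```python
# Bill's hypothetical per-setting scores for the two devices.
device_1 = [2, 2, 2, 2, 9, 2, 2, 2]
device_2 = [8, 8, 8, 8, 8, 8, 8, 8]

for name, scores in (("Device 1", device_1), ("Device 2", device_2)):
    peak = max(scores)                # "what is it capable of?"
    mean = sum(scores) / len(scores)  # "what does it do on average?"
    print(f"{name}: peak = {peak}, mean = {mean:.2f}")

# Output:
#   Device 1: peak = 9, mean = 2.88
#   Device 2: peak = 8, mean = 8.00
# Ranking by peak favors Device 1; ranking by mean favors Device 2.
# The disagreement in the comments is really about which aggregate
# the single "image quality" number is supposed to be.
```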

Dennis said "I'll have to disagree. You said:
Your task is to assign one rank or grade for "image quality" to each device.

If 9 is the max one device is capable of, then its sensor is better, but that doesn't equate to "image quality"

I went for a similar argument, but when I re-read the original post I noted "Imaging Device" and "quality" in the text. I then concluded that Mike was fair in his reference. One thing that I do find troubling is that Mike had an unstated agenda in his "hypothetical" question. The trouble is that the lead-in to the hypothetical statement is geared to the desired outcome. That's a tactic I would consider below Mike's standards, and it falls under the old "liars, damn liars, and statisticians" line (with apologies to Mr. Carmichael).

My camera goes to 11.

I'm with the dissenters here.

If you're in a situation in which setting number five is (for whatever reason) not an option, "just try" to get a 9 out of either device. You can't. But you might be able to get a 5 or 6 out of Device 2.

You ask for one number for the entire device. That number is next to useless without qualifiers of some sort, and it would be flat-out wrong to say that Device 1 has _overall_ better image quality. I understand that your point is that the single rank/grade doesn't imply "overall" in that sense — but I disagree.

To go to your concrete example: if your spy camera catches an important bit of information at the edge of the field of view, suddenly Device 1 wasn't such a good choice after all. And you'll be especially sad if you made that choice based on the impression that Device 1 scored the best without understanding the conditions it needed to obtain that score.

And I'm going to go back to what I said before: if you want an at-a-glance presentation of the differences in these devices, a spider chart is a better way to go than a number.

Matthew Miller,
All true. Which is why I've been saying very consistently for any number of years that tests and reviews that come down to a single rating or rank or number are next to useless.

Mike J.

Matt Needham, meet Bill Rogers. Bill, Matt.

[g]

Mike J.

The maximum value as overall score has at least one thing to recommend itself over using the average — it's at least a number that has a useful meaning once you know what it is. The average of all numbers doesn't tell you a thing about the characteristics of the device, rendering constantly mediocre performance indistinguishable from overall poor results mixed with flashes of brilliance.
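That claim is easy to check with made-up numbers. The two score lists below are invented for illustration (they are not from the original post): two devices can share the same average while one is steadily mediocre and the other mixes poor results with flashes of brilliance, and only the maximum separates them.

```python
# Two illustrative score sets with the same mean but very different character.
steady  = [4, 4, 4, 4, 4, 4, 4, 4]   # consistently mediocre
erratic = [1, 1, 1, 9, 1, 9, 5, 5]   # mostly poor, with flashes of brilliance

# Both average exactly 4.0, so a mean-based score can't tell them apart.
assert sum(steady) / len(steady) == sum(erratic) / len(erratic) == 4.0

print(max(steady), max(erratic))   # 4 9  <- only the maximum distinguishes them
```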

GI=GO.

Dear folks,

Most of you are guilty of overthinking this. Mike asked a very simple and straightforward hypothetical question.

(And, by the way, a hypothetical question may indeed have an unstated goal and an obvious answer. One legitimate, not deceptive, purpose for making a question hypothetical is to divorce it from real-world irrelevancies; that does not make it indeterminate.)

Most of you got distracted by side-issues or went looking for tricks. Instead of taking the question and the data at face value, you looked for a "gotcha." There wasn't one. Many of you directly violated the purpose of such a hypothetical question and tried to tie it back to some unstated real-world situation, and then use that situation to determine your answer. Which, consequently, was usually wrong.

The answer to Mike's question was not a recommendation, an advisory, nor an admonition. It was simply a fact. How you might use that fact would depend upon your particular needs and situation (which would be unknowable in this limited hypothetical case).

By insisting that a fact qualify as a recommendation, you muddle your information.

A couple of readers astutely figured out that this had to do with the DxO website, but others even more astutely figured out that these lessons apply to any review site that is presenting comparison measurements.

The measurements will tell you what the device will do. They will not tell you what you will do with the device. If you insist that they conform to what you, in particular, do with the device, all you'll get will be distorted information, because it's almost a certainty that in some respect the "representative use" will deviate markedly from yours.


~ pax \ Ctein
[ Please excuse any word-salad. MacSpeech in training! ]
======================================
-- Ctein's Online Gallery http://ctein.com 
-- Digital Restorations http://photo-repair.com 
======================================


This whole thing is very silly. Mike you can do better.

If you're trying to make the point that no single-number review can mean anything to users with different requirements, well, just say so! Or give us an example: should the Leica or the Deardorff get a higher score? Clearly, since great people used both, there can't be a unique answer. Done.

If you're trying to claim that there is a preferred way of obtaining a meaningless single number from a set of qualities, then you are wrong; that's the thing about meaningless numbers. (To make a choice of how to weight the qualities into one score, you need extra information about those qualities and the user's desires, which you've assumed you don't have.)

If you're not doing an abstract example but, as it now emerges, discussing DxOMark, then say so. Still, I don't see the photographic interest in reverse-engineering their overall score algorithm. They provide useful measurements (which they do explain) of some things, and if you care about those qualities and are shopping, you would do well to understand their measurements. But clearly the relative importance of them is up to you.

I was getting totally confused by this, so I had to go back and re-read the original post. In retrospect, I now see that by "different settings" Mike was talking about different settings for a single measurement type, for example measuring resolution at different aperture settings. In that case, obviously the device that has the highest value at any setting is the one capable of the better image quality. I had originally read the question as being about a series of measures where each setting was measuring a different aspect of the device's quality (I was wrong to do that, as Mike's question was pretty clearly stated - but that's how I read it).

But as Mike also points out, there is a wide range of human cognitive errors, and in this case they are positively encouraged by the set-up. What is Mike getting at? Why would he pose this question? He's usually opposed to endless pixel-peeping and these kinds of contests.

So my preconceptions, encouraged by reading every post on TOP, are to blame for my misreading, and it's Mike's fault, not mine, that I made such a hash of it! ;-)
Adam

Wait a minute, Mike. I want to know why we should accept the premise that "the device's total score should be the max of its scores in any sub-field." There are, admittedly, circumstances where that would be the correct way to measure something, but without more specific information, I'm not entirely sure we should apply that standard. Suppose I score 300 in math and 770 in English on my SAT. Are colleges supposed to measure me by the max of my scores in any subfield? Normally, we benchmark by coming up with some way to composite the scores beforehand; in the case of the SAT, by simple addition. If the standard is "highest possible potential," then scoring them 9 and 7 would have been correct. I may be a bad reader, but I don't remember finding any indication that "mean score rounded to the nearest integer" wasn't the correct methodology.

Mike, do me a big, big favor. Please change in my last comment the word "answer" to the word "methodology". That sentence makes more sense that way.

Long time reader, somewhat recent commenter.

Thank you.

This would be the point I'd rant at the examiner for marking me down.

As the question postulated: "Your task is to assign one rank or grade for 'image quality' to each device." You've now framed the answer as "potential image quality."
As originally postulated, any method that uses the data logically to come up with a score would be valid: maximum, average, or maximum of the minima.

The fact that DXOMark use maximum potential for their scoring is valid, for them, but outside the bounds of the hypothetical question.

Anyway, as you state, it's all moot because a single score for imaging devices like this is meaningless.

If you are going to be pedantic, you need to be accurately pedantic. You said that the "quality" scores were ..797.. and ..777.. while you asked us to rate "image quality," which, from the data, was of course impossible :-)

We only know that the first camera works poorly in fog, snow, sleet and rain but excels in sunshine while the other ...

Mike, you are getting it wrong. Sorry.

You answer a hypothetical question with practical facts in mind. Without those facts, any answer that doesn't ignore the numbers is right: 41 vs. 45, 5 vs. 6, 9 vs. 7, and so on. In the end this was very much "bla-bla" (as the Germans say).

On a different but related note, I am still waiting for that absolutely foolproof numerical scale for the quality of content, e.g. in images.

Sorry, I disagree with your follow-up to the hypothetical question. In your question, you asked readers to assign one score for "image quality" for each of the devices. In your follow-up, you say the "proper" answer was 9 and 7 because those were the peak quality numbers. Your justification? That one should judge based on the highest attained score, additionally backed up because DxO does it that way.

The problem I have with that answer is that you specifically and intentionally did NOT mention DxO in your hypothetical question, so their approach shouldn't be the determining factor. The same is true for the metric of max quality vs. average quality. You said, "Your task is to assign one rank or grade for 'image quality' to each device." Please point to where that states "maximum attainable image quality," as you interpreted the phrase to mean. An average is easily as justifiable a metric.

This concept is precisely where your original question and your example of the presidential heights diverge--you gave a clear method with the presidential height question, so there is one proper answer. If you had asked people to identify which set had the highest number in it, everyone would have gotten it right, but it would have been pointless to ask.

Going back to dxomark, one more point: I looked at the past comments, and to me it seems Rob's point was not the same as your hypothetical situation. Let's accept the fact that the highest mark a camera gets will be that camera's score in a category, the method I believe they are using and the point of your hypothetical question. It does not resolve Rob's problem.

Compare the Canon 5D II, the Sony A900, and the Nikon D700. Start at "Overview" and look at the "Dynamic Range" score on that page. They are, in order, 11.9, 12.3, 12.2. Now, click on the "Dynamic Range" tab and then click the "Print" tab. These numbers are the ones dxomark supposedly uses for the DR rating. Canon's best mark at any setting is 11.86. Sony's BEST is 11.99. Nikon's best is 12.15. Now, 11.86 rounds to 11.9 and 12.15 rounds to 12.2. I think Rob is correct to ask how Sony gets the 12.3 DR rating when their BEST on the chart is 11.99, and also correct in asking how the Nikon camera has a lower score when at every setting it's higher than the Sony. Rob said all this; I'm just giving my two cents that he's totally correct.

Personally, I would suggest the most likely explanation is an error by dxomark--either their overall number or their best chart number was incorrectly entered. If it is not an error, they certainly should explain how the score is actually calculated, because it's clearly not from these "Print" numbers if no error has occurred. Also, you can't say this discrepancy is from sampling variation because presumably both the chart and the overall number are aggregate scores from the samples they used, so the chart and the score should agree no matter what sample variations exist.
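Rob's consistency check can be written down in a few lines. The sketch below uses only the figures quoted in this comment (nothing newly pulled from DxOMark), and the rule it tests (overall score equals the best per-setting "Print" value rounded to one decimal) is the commenter's assumption about the site's method, not a documented formula.

```python
# Figures as quoted in the comment above; the rule being tested is an
# assumption about how the overall DR score relates to the "Print" chart.
cameras = {
    "Canon 5D II": {"best_print_dr": 11.86, "reported_overall_dr": 11.9},
    "Sony A900":   {"best_print_dr": 11.99, "reported_overall_dr": 12.3},
    "Nikon D700":  {"best_print_dr": 12.15, "reported_overall_dr": 12.2},
}

for name, dr in cameras.items():
    expected = round(dr["best_print_dr"], 1)
    status = "consistent" if expected == dr["reported_overall_dr"] else "INCONSISTENT"
    print(f"{name}: best print {dr['best_print_dr']} -> expected {expected}, "
          f"reported {dr['reported_overall_dr']} ({status})")

# The Canon and Nikon figures come out consistent under this rule; the Sony
# figure does not, which is exactly the discrepancy Rob and this commenter
# are pointing at.
```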

Mike, I think your example is not so fair. In the presidents' example, the criterion was clear. Your image quality example was more similar to: 'Suppose President A has legs that are 1 m long and arms that are 0.6 m long, and President B has legs that are 0.9 m long and arms that are 0.75 m long. If limb length is the sole quantity that defines presidential greatness, which president is greatest?' Same problem. One has more limb length in total, the other has the longest single limb. Both can be valid ways to judge, depending on what you are looking for. As you supply insufficient information, you cannot say one of the answers is 'proper' and the other is not. If it is my task to assign a quality score, I can take any bleeping weighting function I want. Device 1: 2.34, Device 2: 3.6. Just as 'proper' as any other answer.
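To make the "any bleeping weighting function" point concrete, here is a small illustrative sketch that re-uses Bill Rogers's example score lists from earlier in the thread. The two weighting schemes are invented for illustration, and the point is precisely that neither is more "proper" than the other without knowing what the user cares about.

```python
# The same per-setting scores (Bill Rogers's example lists from earlier in the
# thread), aggregated with two equally arbitrary weighting schemes.
device_1 = [2, 2, 2, 2, 9, 2, 2, 2]
device_2 = [8, 8, 8, 8, 8, 8, 8, 8]

def weighted_score(scores, weights):
    """Weighted average; 'weights' is whatever bleeping weighting you fancy."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

equal_weights = [1, 1, 1, 1, 1, 1, 1, 1]   # every setting matters equally
peak_heavy    = [0, 0, 0, 0, 10, 0, 0, 1]  # "I only ever shoot at setting 5"

for label, w in (("equal weights", equal_weights), ("peak-heavy weights", peak_heavy)):
    s1 = weighted_score(device_1, w)
    s2 = weighted_score(device_2, w)
    print(f"{label}: device 1 = {s1:.2f}, device 2 = {s2:.2f}")

# equal weights:      device 1 = 2.88, device 2 = 8.00  -> device 2 "wins"
# peak-heavy weights: device 1 = 8.36, device 2 = 8.00  -> device 1 "wins"
# Without knowing what the user cares about, neither ranking is more "proper".
```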

Go Hoover!!! :)

"Suppose I score a 300 in math and 770 in English on my SAT, are colleges supposed to measure me by the max of my scores in any subfield?"

James Liu,
Well, perhaps some colleges don't tend to do that, but perhaps they should, because that's how excellence is looked for and found in real life. Paul Krugman wins the Nobel Prize for Economics, but we don't care how good a swimmer he is. Michael Phelps is a great swimmer, but it's unreasonable to judge him by his knowledge of economics. In neither case do we care how good they are at carpentry.

Which would you rather have in YOUR college, a kid who got an 800 on the math SAT and a 400 on the verbal, or a kid who got 600 on both? (When I was at Dartmouth, I tutored the former in English [g].)

Mike J.

ctein wrote:

"By insisting that a fact qualify as a recommendation, you muddle your information."

I'll readily admit that this is where I got thrown off track. However, as I said before, I *can* reread Mike's first post as asking for a single number to represent "optimal" or "maximum" image quality ... but while it's been suggested some of us were looking for tricks, I still don't see the original wording as being that straightforward. Honestly, to me it doesn't matter if you're talking about a camera (a system) or a sensor ... a single grade for "image quality" that represents an optimal value only achievable under 1 out of 8 possible settings is as useless (uninformative) as anything else shy of giving 8 separate ratings. OK, I still get a little hung up on "recommendation versus fact," but knowing it's capable of a 9 at some setting doesn't really tell me about its "image quality" when there are 7 other settings. It only tells part of the story.

Relating to dxo, the lesson here is obvious - don't take their ratings as recommendations :)

But nowhere does it imply that either George Bush or Herbert Hoover was a president. Begging the hypothetical question, perhaps?

Ah - the first one gets a nine, eh?
I guess there was another unstated fact, that the scores for the individual settings were each out of ten.
It could have been 9 out of 100 - in which case the final score should have been 1 according to your solution.

Not buying it. What you asked for isn't nearly precise enough to force the answer you want. And the answer you want is, for me, an obvious mistake (one people commonly make).

I do suspect that your hypothetical question may, in fact, be working precisely as you had intended, but that's another question.

Also -- the "post" and "preview" buttons are greyed out. Apparently I have to allow scripts on some non-obvious URL that is not yours and is not obviously typepad to be able to comment here. That's annoying, makes me violate basic security guidelines to comment. I think maybe it's yahooapis.com.

Mike Johnston: "All true. Which is why I've been saying very consistently for any number of years that tests and reviews that come down to a single rating or rank or number are next to useless."

Ctein: "A couple of readers astutely figured out that this had to do with the DxO website, but others even more astutely figured out that these lessons apply to any review site that is presenting comparison measurements."

That's the thing: DxOMark numbers aren't reviews. They're quantified results - not evaluated results.

If anyone out there assigns any significant weight to a single DxOMark score, why should we care about them? Even just looking at the charts on DxOMark's site gives you much more useful information.

DxOMark remains useful as one (small) piece of the fact- and opinion-gathering steps in deciding whether to buy something. Other things matter far more.

I mean no offense by this, but we didn't fail you - the hypothetical failed us. I think it was a poor attempt to make a point you probably could have made much more eloquently. The fact that virtually everyone chose "item two unless you did the majority of your shooting at that 9-rated setting" shows you we're not stupid people who care only about a single-value rating.

Agreed with David Dyer-Bennet: "What you asked for isn't nearly precise enough to force the answer you want. And the answer you want is, for me, an obvious mistake (one people commonly make)."

Dear Dennis,

You didn't go far enough. A point Mike has made repeatedly (and so have I) is that NO single number, from ANYONE is going to give you enough information to make an informed decision.

That said, lots and lots of people love single-number ratings. A high percentage of them can't make sense of anything but single-number ratings. Such is the way of the world.

Nobody says one has to use such numbers. Clearly, folks like thee and me shouldn't. It's better if one can make use of more complex data.

Which was, indeed, a metapoint of the exercise.

pax / Ctein

Dear Tim,

Your responses are good examples of why this needed to be a hypothetical case. Because it's NOT about DxO (that was only the genesis of the disagreement), it's about what is good general methodology and reporting practices for a reviewer.

If I'm assigned to find and report on which is the sharpest lens, film, or paper, and I am constrained to report a single number or a ranked list, I will do that based on each product's BEST performance. Not the average sharpness under all conditions, not the consistency of sharpness, not the typical sharpness. But the BEST sharpness it can produce.

It's incomplete information, but it is the correct answer to the question. Anything else is not. If someone wants maximum film sharpness, they need to know best performance.

If I'm kind, I'll also report the test conditions that produce the best result (like lens aperture or film exposure). If I'm very kind, and I think the audience can handle it, I might even provide tables showing the performance under different conditions. But, it wouldn't change the single parameter rankings.

That is the correct and most useful way to report such results. Constrained by space-time and reader comprehension, it is the best answer to the question. Anything else is not.

pax / Ctein
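Here is a minimal sketch of the reporting rule Ctein describes, using invented product names, apertures, and sharpness figures: rank each product by its single best measured result and, as a kindness, note the conditions that produced it.

```python
# Hypothetical sharpness measurements (lp/mm) at several apertures.
# Product names, apertures, and figures are invented for illustration only.
measurements = {
    "Lens A": {"f/2.8": 55, "f/4": 72, "f/5.6": 80, "f/8": 74},
    "Lens B": {"f/2.8": 68, "f/4": 70, "f/5.6": 71, "f/8": 69},
}

# The rule described above: the reported figure is each product's BEST
# performance, and the ranked list is ordered by that single number.
report = []
for product, by_aperture in measurements.items():
    best_aperture, best_value = max(by_aperture.items(), key=lambda kv: kv[1])
    report.append((best_value, product, best_aperture))

for value, product, aperture in sorted(report, reverse=True):
    # "If I'm kind, I'll also report the test conditions that produce the best result."
    print(f"{product}: {value} lp/mm (best at {aperture})")
```

Providing the full per-aperture table as well, for readers who can use it, would add information without changing the ranking, which is the point made above.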

This reminds me of the magical 0-60 mph times that were quoted in car magazines (probably still are) and which came to carry an absurd importance for car enthusiasts through the 80s and 90s. A car with a very peaky, turbocharged engine and a sticky manual gearbox might take a dozen or more attempts to deliver something close to its best 0-60 time, after the tester had worked out how fast the engine should be revving when the clutch was dropped, when to change up and sometimes how to change up. A competing car with a big, lazy engine and an automatic gearbox might deliver something close to its best time every time, and almost regardless of the tester's skill, even if that figure wasn't quite as impressive.

In both cases, it was the single best result that was delivered. That was the job of the tester: to deliver the number that filled the cell in the table. The job of the reader was to work out which car might be more fun to drive, more suitable to his driving style, more reliable and more practical, and that demanded a different kind of report from a different kind of car magazine, on top of a few test drives and perhaps a rental. The 0-60 times still had their part to play. I can still quote them from memory, in a way that some younger readers today might be able to quote their ideal cameras' DxOMark numbers. :)

Bahi,
That's pretty funny about the 0-60 times. It reminds me of one of my all-time favorite "comebacks" in the Car & Driver letters section. The writer had written that he'd "blown the doors off" a certain BMW, and a reader, obviously in a high state of umbrage, had written in armed with all the magazine's own stats and times for the respective cars. Having made his point, he rather triumphantly asked, "So, tell me, at exactly what point did the BMW get its 'doors blown off'?" "Ed.'s" brief response was, "At the point it realized it was being driven by an inferior driver."

Still makes me chuckle....

Mike J.

Dear Bahi,

Wow, that's a really important point with regards to any review tests. If the tester is really good at what they're doing, they're going to get better results than most of the readers will be able to, because most of the readers will not have perfected their technique.

I have alluded to something like that in many of my reviews; for example when reporting on enlarging lenses, I've emphasized the importance of perfect enlarger alignment and really accurate focusing. But until your post, I hadn't thought about the fact that there was a more general principle involved.

Thanks for the insight.


~ pax \ Ctein
[ Please excuse any word-salad. MacSpeech in training! ]
======================================
-- Ctein's Online Gallery http://ctein.com 
-- Digital Restorations http://photo-repair.com 
======================================


