
lithnin:

kontextmaschine:

[image: graph of favorability ratings, with the left-right position of the red/blue circles marking each party’s average]

That’s a good graph format I’ve never seen before (the L-R location of the red/blue circles represents an average, too)

Lots of mirror images but some differences – partisan-marked stuff tends to have an “additive value” of around 100, races around 150; Hillary is at 84, which Trump has on Rs alone (plus 8 from Ds, even with so many of them at 1 or 0), but Obama was at 112

“the Alt right” and “antifa” are lowest at a mirrored 57/58; their own-side teams are lukewarm, but the opposing team hates them

Reblogging for good data visualization

(via jiskblr)

nostalgebraist's review of The Book of Why →

I wrote a review of Judea Pearl’s “The Book of Why.”

SPOILERS: I didn’t like it!

nostalgebraist:

I know next to nothing about bioinformatics, and have a very basic question about it:

I keep seeing those studies that try to identify genetic factors for traits or diseases in big data sets, and end up with results like “we identified 300 SNPs that collectively explain 5% of the variance.”  And this happens with things that are thought to be very heritable based on other evidence, so people talk like there’s something missing.

There are at least three missing pieces that could cause such a gap:

(1) factors not present in the SNP data at all (this is sort of my catch-all category)

(2) SNP-SNP interactions (nonlinearity)

(3) linear effects of so many SNPs that we can’t get significance for most of them with our sample size

Do we have any sense of which is the main factor?  I’m particularly interested in (2) – I know people use random forests and stuff sometimes, and it’d be interesting if we could get good cross-validation performance out of a nonlinear model, even if it wasn’t interpretable or didn’t have a well-defined hypothesis testing framework.

(@raginrayguns​ may know?)
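To make (2) concrete, here’s roughly the experiment I have in mind – a minimal sketch on simulated genotypes (everything below is invented; scikit-learn assumed), just asking whether a nonlinear model beats a purely additive one out of sample:

```python
# Hypothetical sketch: simulate 0/1/2 genotypes, a trait with small additive
# effects plus one interaction, and compare cross-validated R^2 of an additive
# model vs. a random forest.  All numbers are made up for illustration.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_people, n_snps = 2000, 300
X = rng.binomial(2, 0.3, size=(n_people, n_snps)).astype(float)  # genotypes

beta = rng.normal(0, 0.05, size=n_snps)            # many tiny additive effects
y = X @ beta + 0.5 * X[:, 0] * X[:, 1] + rng.normal(size=n_people)  # + one interaction

additive = Ridge(alpha=1.0)
forest = RandomForestRegressor(n_estimators=100, random_state=0)
print("additive CV R^2:", cross_val_score(additive, X, y, cv=5, scoring="r2").mean())
print("forest   CV R^2:", cross_val_score(forest, X, y, cv=5, scoring="r2").mean())
```

If a model like the forest reliably won this comparison on real data, that would be at least suggestive evidence for (2), even without an interpretable hypothesis-testing story attached.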

Now that I think about it, (3) is poorly phrased, since whether or not you get significance depends on how you correct for multiple comparisons.  Since some of these studies use conservative corrections like Bonferroni, they have deliberately low power, focusing on identifying SNPs that are really associated with the trait and having few false positives.  If we expect something to be caused by very many SNPs, it’s not surprising that this will find relatively few of them.
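To put a number on that (my own back-of-the-envelope, not from any particular study): at the usual genome-wide threshold of p < 5×10⁻⁸, even a single SNP explaining 0.05% of the variance takes a very large sample to detect.

```python
# Rough power calculation for one SNP explaining a fraction f of the variance,
# tested at a Bonferroni-style genome-wide threshold.  Back-of-envelope only;
# the noncentrality n*f is an approximation.
from scipy import stats, optimize

alpha, f = 5e-8, 0.0005                      # threshold, variance explained
crit = stats.chi2.ppf(1 - alpha, df=1)       # ~29.7

def power(n):
    return 1 - stats.ncx2.cdf(crit, df=1, nc=n * f)

n_needed = optimize.brentq(lambda n: power(n) - 0.8, 1e3, 1e7)
print(int(n_needed))                         # on the order of 80,000 people
```

So a trait driven by thousands of SNPs of roughly that size will mostly go undetected at realistic sample sizes, even if the heritability is sitting right there in the SNP data.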

If that’s the case, I guess I’m wondering why I only see these studies and not their complement, the studies that just try to predict outcomes.  Perhaps those studies exist, but if so it’s strange that (reputable) people do prediction on the basis of the conservative studies.  E.g. https://dna.land/ lets you upload genetic data and see a “Trait Prediction Report,” but their trait predictions seem to rely on studies like those I described in the OP, which identify a few SNPs with confidence but have little predictive value.

In a recent post, Scott linked an interesting paper about controlling for statistical confounders.  The paper draws some pretty damning conclusions, all based on the simple idea that you’re never really controlling for X, you’re controlling for your imperfect proxy for X.  Since the proxy is imperfect, if you’ve measured some associated variable Z, it’ll usually give you information about the true value of X above and beyond what your proxy tells you, and the usual approach mistakes this for an independent effect of Z above and beyond its association with X.
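The paper’s point is easy to reproduce in a toy simulation (my own construction, not theirs; statsmodels assumed): X causes Y, the thing we “control for” is a noisy measurement of X, and Z is just another noisy correlate of X with no effect of its own – yet Z comes out looking like an independent predictor.

```python
# Toy illustration: controlling for a noisy proxy of X lets an X-correlated
# variable Z look like it has an independent effect on Y.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=n)                      # the true confounder
proxy = X + rng.normal(size=n)              # what we actually measured
Z = X + rng.normal(size=n)                  # correlated with X, no effect on Y
Y = 2.0 * X + rng.normal(size=n)            # Y depends on X only

fit = sm.OLS(Y, sm.add_constant(np.column_stack([proxy, Z]))).fit()
print(fit.params)    # Z gets a coefficient around 0.6-0.7...
print(fit.pvalues)   # ...with a vanishingly small p-value, despite doing nothing
```

The “effect” of Z here is entirely an artifact of the proxy’s measurement error.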

That’s very interesting, but it strikes me as just one facet of a bigger issue with statistical controls which has always unsettled me.  There is something oddly backwards about the whole idea.


You, the scientist, want to publish an exciting new study about some variable of interest called Y.  Everyone knows about ten different variables that “obviously” affect Y; call these, collectively, X.  A study saying “X affects Y!” would not be new or exciting.  No, you want to say that some other variable, Z, affects Y.  No one has discovered that yet.

A problem arises: Z is also associated, in various ways, with various of the ten components of X.  What if the correlation between Z and Y (or nonzero regression coefficient, or whatever) is just due to the already known X-to-Y association?  How can you tell?

The usual answer is: make some sort of model predicting Y from both X and Z, and show that the model uses some information from Z to predict Y, even though it knows about X, too.  Success!  Now you can claim that Z is associated with Y.  You are now free to forget about your model, which was merely a tool you used to draw this conclusion.  You didn’t really care about predicting Y, and you don’t care whether your model is the best model for predicting Y, or even a good one.  It has served its purpose, and into the dumpster it goes.


As I said, there is something backwards about this.  Your claim about Z and Y depended entirely on Z helping some model predict Y.  Clearly, the strength of your argument must depend on the quality of this model.  If the model is a bad model of the relationship between X and Y, before Z is even added to the picture, then it’s hard to conclude anything from what happens when you add in Z; if your model doesn’t capture the relationships we think are there in the first place, its use of Z could just be an attempt to “put them back in.”

(For example, someone’s BMI is inversely proportional to the square of their height.  The electrostatic force between an electron on someone’s head and an electron on their heel is also inversely proportional to the square of their height.  Suppose, absurdly, that someone tries to model the relationship between height and BMI by doing linear regression on the two.  This will fare poorly, because the relationship is inverse-square, not linear.  But if they add in the electrostatic force as a regressor, it will of course have a nonzero coefficient, and predict BMI much better than the height term.  This does not show that this force is associated with BMI “even controlling for height”!)
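(If you want to watch that happen, it’s a five-minute simulation – invented numbers, statsmodels assumed, with an exaggerated height range so the curvature is visible, and with weight drawn independently of height so the only real relationship is the inverse-square one:)

```python
# The BMI thought experiment: a *linear* control for height is misspecified, so
# an inverse-square "electrostatic force" regressor soaks up the leftover
# height effect and looks wildly significant.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
height = rng.uniform(1.0, 2.2, size=n)      # metres (exaggerated range)
weight = rng.normal(70, 10, size=n)         # kg, independent of height here
bmi = weight / height**2
force = 1.0 / height**2                     # anything proportional to 1/h^2

linear_only = sm.OLS(bmi, sm.add_constant(height)).fit()
with_force = sm.OLS(bmi, sm.add_constant(np.column_stack([height, force]))).fit()
print(linear_only.rsquared, with_force.rsquared)  # the 1/h^2 term "helps" a lot
print(with_force.pvalues)                         # and is hugely "significant",
                                                  # despite being a function of height
```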


This was brought forcefully to my attention when I was reading a recent study about alcohol consumption and mortality.  The big punchline was that, in a huge meta-analysis, it only took something like 7 standard drinks / week (not the 14 specified in the US guidelines) to negatively impact mortality.

There was a big problem with this claim that has nothing to do with this post, namely that the researchers meant “the confidence interval for 7 drinks / wk just barely excluded no effect” (it was nearly symmetric about a hazard ratio of 1.0).  This is the same old problem where people try to figure out when an effect “turns on” or “turns off” by noticing when they start being able to reject the null, which is the kind of thing you are taught not to do in Stats 101 but which is nonetheless endemic in the medical literature.

But anyway, even after facepalming over that, I was curious about how the study adjusted for confounders.  So many things are associated with mortality, and so many things are associated with alcohol consumption – how do you disentangle it all?  And the authors clearly tried to do their due diligence on this front.  My eyes started to glaze over as I read the list of confounders they controlled for:

HRs were adjusted for usual levels of available potential confounders or mediators, including body-mass index (BMI), systolic blood pressure, high-density-lipoprotein cholesterol (HDL-C), low-density-lipoprotein cholesterol (LDL-C), total cholesterol, fibrinogen, and baseline measures for smoking amount (in pack-years), level of education reached (no schooling or primary education only vs secondary education vs university), occupation (not working vs manual vs office vs other), self-reported physical activity level (inactive vs moderately inactive vs moderately active vs active), self-reported general health (scaled 0–1 where low scores indicate poorer health), self-reported red meat consumption, and self-reported use of anti-hypertensive drugs.

My first reaction upon reading this was to think, “okay, some of these may or may not have been poorly operationalized, and that may have affected their results in problematic ways not captured in the sensitivity analyses in their appendix, or maybe not, because how the fuck would I know when there’s so much going on in their mortality model?”

And then I was like, wait.  They have a “mortality model.”  They’re only focusing on the coefficients for one variable, but it’s got a zillion variables in it.  It sounds like it could be the sort of model used by the people who are actually interested in predicting mortality as accurately as possible – say, insurance companies – as opposed to people who are just interested in making claims about alcohol.

But they aren’t telling me how good their model is.  I have no idea if it’s similar to the models the insurance company people use, or if the insurance company people would turn up their noses at it.  Their model was created on the spot to make some claims about alcohol, and even if I spent a day scratching my head and trying to understand it, the next day I might read a paper with another mortality model, and have to repeat the process.  There must be hundreds of models like this, invented on the spot for the purposes of statistical controls, and then discarded.

It feels like there should be someone in charge of maintaining our best models of things like mortality.  Questions about individual variables, like alcohol, could be investigated on a common footing.  Instead, we have hundreds of claims about how some Z affects some other Y, derived from different models, which might not all be true if stitched together into a single framework.

I still don’t really understand why Cambridge Analytica is supposed to be such a big deal.  Sure, I understand why people object to them obtaining personal data under false pretenses.  But I don’t understand the leap from “they have the data” to “they are puppetmasters controlling people’s voting behaviors.”

Targeted advertising based on internet user data is, of course, a hot area that is attracting a lot of investment these days.  But that does not mean it works very well yet, or that it constitutes some vast leap in effectiveness over traditional marketing.  I feel like every other day I hear someone half-facetiously lamenting how badly targeted their ads are.  (Sometimes it isn’t facetious at all – it would be nice, of course, to learn about products I actually want to buy.  And yet it almost never happens via online ads!)  Despite the best efforts of the many people working on this problem, and the legions of automated trackers stalking us online, most of us still become aware of ad targeting only when we notice something hilariously irrelevant popping up on every site we visit.


I am equally unimpressed by everything I know about what is going on under the hood.  Around a year ago, I made a post poking fun at the data Facebook (ostensibly) shows to advertisers about me – in which “Toxicity” was listed as one of my hobbies, and “Travel, places and events” included several seemingly random places which I’ve never been or wanted to go (”Slovakia”) as well as, mystifyingly, “Time” (illustrated with a picture of an hourglass).

Checking the same page now, the results seem a bit better (although perhaps only because there is no “Travel, places and events” section anymore), but they’re still hit and miss.  (Many of the successfully identified interests, like “Tumblr,” are listed because they’re “apps I’ve installed” [on my phone? how does Facebook know this?], but there are a lot of false positives even there – it also says I’ve installed Instagram, Zillow and Feedly [I haven’t, and haven’t even heard of the last one].  Under “Shopping and fashion,” amusingly, there is only one interest – “Hat” – although I’ve never been in the market for buying a hat online.  “You have this preference because you liked a Page related to Hat,” Facebook explains.)

What about those spooky analytics services that can use our Facebook likes (or whatever) to predict our personality, intelligence, sexual orientation, etc.?  I admit it’s noteworthy that this works as well as it does, but it still doesn’t work that well.  A while ago I installed the extension Data Selfie, which sends your Facebook data to a few of these analytics services and shows you the results.  As of this writing, it has ingested 272 hours of my Facebook use, including my likes, every word I type, and how long I spend looking at each item in my feed.  Here is what it concludes about my Big 5 personality traits:

[image: Big Five personality trait estimates from the two services (green and yellow markers)]

The green marks come from one service (I think it’s Apply Magic Sauce?), using the posts showing up in my feed; the yellow dots come from another service (I think Watson Personality Insights), using the text I’ve typed.  The two give very different answers – for instance, the yellow one thinks I have extremely high openness and the green one thinks I have lower than 50th percentile openness.

On “political orientation,” well, it seems to have figured out that I might be “liberal”:

[image: political-orientation prediction]

… on the basis of my news feed, which is chock-full of liberal and left-wing political posts.  (For “religious orientation” it gives me 57% “None,” which relative to base rates is actually pretty good, I guess.)

Out of curiosity, I also sent my data directly to Apply Magic Sauce (based on the PNAS paper about likes) – Data Selfie says it’s using AMS, but I wasn’t sure which of its results came from where.  This gave some amusing results, like “Your probability of being Female is 82%” (the API Data Selfie uses for gender, on the other hand, gives 72% probability of Male).  Under “Education” (glossed as “Probability of having a personal or professional interest in a given field”), a breakdown of fields gives me a whopping 32% for “Art,” 20% above the population average and far higher than any of the others.  (I score a measly 5% each on “IT” and “Engineering,” both below population average; I have a PhD in applied math and currently work in tech.)

AMS helpfully indicates which of my likes are especially influential on its decisions – the fact that I “like” Radiohead seems to be giving it, uh, mixed messages:

[images: Apply Magic Sauce screenshots citing the Radiohead like as evidence]

None of this should be surprising.  Data science is really hard!  Here’s a nice lecture I recently watched from someone at Booking.com, which employs over 100 (!) data scientists, working on tasks as seemingly innocuous as “figuring out whether someone cares about getting served breakfast at a hotel, and then using that to choose hotel recommendations for them”:

It turns out – and is not surprising, in retrospect – that even something as simple as this presents tough statistical difficulties, and requires a lot of hard thinking about how to get around sampling biases and correlation/causation disconnects.

Real data science looks less like mecha-Big-Brother and more like this talk: lots of hard work to figure out extremely simple things that any human could read instantly off the data.  There is a magic leap in the public conversation about things like Cambridge Analytica, where we go from the knowledge that some organization has detailed information on lots of people (in itself, kinda scary) to the idea that they must, of course, be able to use it.

It is easy to assume, if you don’t think too hard about it, that if having detailed information on one person is scary, then having it on ten million people must be far scarier.  There is a vague sense that anything creepy we could do with data on a single person can be done simultaneously for all ten million, or that even creepier things could be done from the aggregate, through vaguely imagined “data mining” techniques.

But, of course, if you have data on one person, you could hire one person to interpret it, while you cannot hire ten million people to do the same job in parallel.  So we try to get machines to do it.  And the machines are very bad at it.  The actual meaning of “data mining,” in this situation, is “trying to get a machine to even kinda sorta do a tiny piece of what a human might be able to do.”  In practice, the best one can do is usually to get the machine to notice very broad demographic information – stuff like advertising baby-related items to all women in a large age range, which is an advance over advertising them to everyone, but is worlds away from the “micro-targeted” manipulation of Jonathan Albright’s fever dreams.

When we try to get smarter, the results tend to get worse, especially on the levels of individuals rather than broad aggregates.  On average, you can predict a surprising amount about people from their Facebook likes.  But on an individual level, this looks like concluding that I, nostalgebraist, count “Toxicity” among my hobbies, am interested in taking vacations to Slovakia and to somewhere called “Time,” am either 98th percentile or 44th percentile on Openness to Experience, and am either 28% or 82% likely to be female.

And remember, marketing has always existed, and it has always been used, among other things, for political campaigns.  Perhaps marketing is a little better than it used to be, and perhaps that marginal improvement is making a marginal impact on the effectiveness of political campaigns.  Even that would fail to scare me – after all, we are talking about showing people the equivalent of leaflets that appeal to them, and if that is enough to sway an electorate, we were already screwed.  But even that is not certain.  Is marketing better than it used to be?  I honestly don’t know.

I’ve wanted to do this for a while, but I finally got around to it:

I’m debuting a new tag “#statpicking” (as in “stats + nitpicking”) for posts about questionable statistical methodology in specific empirical research papers.  I’ve been making a lot of these recently, and I’m relatively proud of them, but I didn’t have any tag that would collect them together.  (They had usually been tagged #mathpost, but that mixes them together with a bunch of other stuff.)

This tag won’t include posts about philosophy of statistics (i.e. all the Bayes posts) and probably won’t include posts about techniques I don’t like unless specific papers are criticized (i.e. most of the factor analysis / IQ posts).

Highlights so far:

My long post on full distributions vs. summary statistics is the closest thing to a single statement of the reasons I so often get frustrated with the stats methodology in papers I read.

Some strange things about the happiness measures on the General Social Survey.

Simon Baron-Cohen’s systemizing/empathizing research is astonishingly bad: part 1, part 2.

Confusing horizontal axis of the day

I know I complained about this exact thing before, but I keep reading papers where researchers try to measure the duration of a phenomenon by finding the earliest time they can’t statistically detect it with p < .05

No!  Nooooo!  This is so bad in so many ways!  It has multiple comparisons problems, it has within-vs-between-subjects problems (easily fixable by doing a paired test but they never do that), but those aren’t even the main problem, the main problem is that you’re making the duration a function of your sample size and it’ll get longer or shorter if you re-do the study with a different sample size
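Here’s a toy version (entirely invented decaying effect; scipy assumed): define the “duration” as the first time point where p ≥ .05, hold the underlying effect fixed, and watch the duration stretch as the sample grows.

```python
# The estimated "duration" of a smoothly decaying effect, defined as the first
# time point where a one-sample t-test fails to reach p < .05, grows with n.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
times = np.arange(20)                   # e.g. days
true_effect = np.exp(-times / 5.0)      # decays smoothly, never exactly zero

def estimated_duration(n_subjects):
    data = true_effect + rng.normal(size=(n_subjects, len(times)))
    pvals = [stats.ttest_1samp(data[:, t], 0.0).pvalue for t in range(len(times))]
    # "duration" = earliest time at which the effect is no longer detectable
    return next((t for t, p in zip(times, pvals) if p >= 0.05), times[-1])

for n in (10, 50, 200, 1000):
    print(n, estimated_duration(n))     # the "duration" keeps growing with n
```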

You guys successfully nerdsniped me with this trends-in-happiness stuff, and now I’m trying to back away from the rabbit hole before it pulls me in (I actually downloaded the General Social Survey and started playing with the data! so many variables!).  But here are the most salient things I’ve learned, for people curious about what this research means:


1. The paper that originally got me nerdsniped, “The Paradox of Declining Female Happiness” (Stevenson and Wolfers 2009), used data from the U.S. General Social Survey, so I’ve mostly looked at that.  There are other data sources (see e.g. this interesting response to S&W 2009) that don’t have some of the GSS’ flaws.  But I get the impression that the GSS is pretty popular with researchers.


2.  The most important thing you need to know about the happiness measures on the GSS is that they are extremely coarse-grained.  The survey item which produced the big “paradoxical” result about female happiness was the following question:

‘‘Taken all together, how would you say things are these days – would you say that you are (3) very happy, (2) pretty happy, or (1) not too happy?’’

Those are the only three options.  The GSS does also ask about satisfaction with some specific areas of life, like finances and work (with 4 possible responses), and also asks about whether you have a happy marriage (same exact 3 options as on the general happiness question).

The only observed trend here, then, is increases/decreases in the fraction of respondents occupying each of these three boxes.  Given that fact, I was really impressed with Stevenson and Wolfers 2008 (which I promo’d yesterday), in which the authors claim they can estimate, from just this information, the effects of time and demographic on the mean and variance of an underlying continuous distribution – without assuming the functional form of that distribution, and while simultaneously having to estimate the cutoffs that slice that continuum into the three boxes!  I still have a “sounds fake but okay” reaction to this – I’m surprised the model is identifiable at all, and am kinda concerned about the stability of the estimates.
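For anyone who shares my “sounds fake but okay” reaction, here is roughly the kind of model I understand them to be fitting – my own minimal sketch with scipy, not their actual estimator – which at least shows the thing is estimable in a toy case (group 0 normalized to mean 0 and SD 1, cutpoints shared across groups):

```python
# Recover latent group means/SDs and shared cutpoints from 3-category counts,
# by maximum likelihood on an ordered-probit-style model.  Sketch only.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)

# Simulate two groups with different latent means/SDs and common cutpoints
true_mu, true_sd, true_cuts = [0.0, -0.2], [1.0, 0.7], [-1.0, 0.8]
counts = []
for mu, sd in zip(true_mu, true_sd):
    latent = rng.normal(mu, sd, size=20000)
    counts.append(np.bincount(np.digitize(latent, true_cuts), minlength=3))
counts = np.array(counts)   # rows: groups; cols: "not too" / "pretty" / "very"

def neg_log_lik(params):
    c1, log_gap, mu1, log_sd1 = params
    cuts = np.array([c1, c1 + np.exp(log_gap)])    # keeps cutpoints ordered
    mus, sds = [0.0, mu1], [1.0, np.exp(log_sd1)]  # group 0 pinned to N(0, 1)
    ll = 0.0
    for g in range(2):
        cdf = stats.norm.cdf(cuts, loc=mus[g], scale=sds[g])
        probs = np.diff(np.concatenate(([0.0], cdf, [1.0])))
        ll += np.sum(counts[g] * np.log(probs))
    return -ll

fit = optimize.minimize(neg_log_lik, x0=[-0.5, 0.0, 0.0, 0.0], method="Nelder-Mead")
c1, log_gap, mu1, log_sd1 = fit.x
print("cutpoints:", c1, c1 + np.exp(log_gap))       # ≈ -1.0, 0.8
print("group 1 mean, SD:", mu1, np.exp(log_sd1))    # ≈ -0.2, 0.7
```

In this toy version the parameter count exactly matches the number of free category proportions, which maybe explains both why it works at all and why I stay a little nervous about stability.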

Technicalities aside, I was really excited about being able to get the variance as well as the mean, because given these 3 boxes, “happiness inequality” seems more morally salient to me than mean/median happiness trends.

Why?  Well, think about the categories.  I honestly am not sure what to make of people opting for “pretty happy” instead of “very happy,” or vice versa.  If I imagine the General Social Survey people knocking on my door at various times in my past, I can imagine myself answering one or the other of those two on the basis of, like, how the past week had gone.  I don’t see myself as aiming, in life, for a state of being that is consistently “very happy” as distinguished from “pretty happy.”  Indeed, part of me reflexively bristles at the (callous?) indifference to outward circumstances that I imagine such a state would require!

On the other hand, the times in my life when I would have answered “not too happy” (the lowest possible option) are sharply distinguished from the others, and encompass some states of misery which I would very much like to prevent in others.

So, insofar as any “overall trend” here would mix together these two distinctions, it’s hard to interpret.  But a decrease in variance, toward a mean that is at least somewhere in the middle, implies that we are raising people up from the “not too happy” box – which is all I care about.

Hence I was encouraged to hear that variance on this question has declined greatly, across and especially within groups, to the point of swamping the mean shift.


3.  That still isn’t the full story though.  Because remarkably few people use the lowest category.  Either people are far happier than I (and the conventional wisdom) would imagine, or they are putting on an artificially happy face for the researchers.

Here are the male and female trend lines for the “not too happy” response (from the online data explorer, check it out):

[image: GSS trend lines for the “not too happy” response, men and women]

You’ll note that they line up very closely, which is interesting.  But also, they’re consistently between 10% and 20%.  Apparently the remaining 80–90% of the U.S. population has been either “pretty happy” or “very happy” for the past four and a half decades!  A golden age!

I first noticed this when I was working with the data offline and drilling down into a specific category – I think it was “married women who report their marriages are ‘not too happy’” (n.b. this is from the marital happiness question, not the general one).  And I noticed that suddenly everything was really noisy, because my sample sizes were as small as 20-40 people per year.  (For marital happiness this phenomenon is even more extreme – it’s more like 5% of women who say “not too happy,” with a full 60-70% reporting “very happy.”)

We appear to be studying, and fretting over, the slight variations in bliss level of a mostly blissed-out populace.  Since this does not resemble the actual country I live in, something must have gone wrong with our measuring apparatus.

(Note: I think I made this post too long by going on too many digressions.  A short version capturing the main point would probably be better overall, although this at least gave me the chance to ride some entertaining hobby-horses.)


1.

I’ve noticed that several seemingly unrelated frustrations of mine can all be classified as “people should care more about entire probability (or frequency) distributions, rather than summary statistics like averages.”

This is frequently a problem in academic papers.  Many of the problems with that godawful marijuana paper I posted about earlier involved the authors doing complicated things to dredge individual numbers (p < .05, etc.) out of their small sample, when with only n=22, a set of histograms would have been much more informative.  With only 22 people, your statistical power probably isn’t very high, so it’s hard to tell what it means that you can or can’t get p < .05 for something.  But if there’s an effect on a particular metric, we should be able to see it just by plotting one histogram of heavy users on that metric and one of light users.

Indeed, those histograms would contain many other interesting facts.  They would tell you, for instance, whether a given “average” effect was the result of everyone experiencing roughly that effect, or the result of half of the people experiencing no effect while half experience one twice as big, or whatever.  It would let you see differences that wouldn’t show up in a t-test, where a distribution changes shape while still having a similar mean.  It would let you see the difference between statistical and practical significance – you could see when there’s a big effect that the study just doesn’t quite have the power to detect, and when there’s a statistically significant but tiny difference that’s swamped by individual variability.  (You can usually infer the latter from std. devs. if they’re supplied, but this isn’t always possible, and the former is usually invisible.)

You get all of these things for a simple reason: the distribution implicitly contains all of the other information.  Every derived number you see in a statistical paper came from a distribution (or collection of them), and if you knew the distribution, you could re-derive all the numbers.  But it doesn’t go the other way: you can’t re-derive the distribution from the numbers.

(Unless you know the distribution has a parametric form, but this is irrelevant to real data; “these frequency counts could have come from this Gaussian” does not let you reconstruct the original counts, and the original counts contain more information, e.g. information that might lead you to disagree with the assertion about Gaussianity.)

There’s kind of a tradeoff here, since I’m literally talking about exhibiting pictures and determining things by “seeing them in the picture,” which is uncomfortably subjective.  If you wanted to make it more objective, you’d have to come up with numerical proxies for the judgments you’d be making visually, which gets you back to … exactly what I was trying to get away from.

But it isn’t really that stark.  One thing that would vastly improve a lot of papers I read is just including more histograms, even if they also included all the same derived numbers.  Additionally, even if we’re deriving numbers, it’s possible to have an attitude that pays more “due respect” to the distribution.  This is a reason to prefer nonparametric tests, but parametric tests are fine too as long as you can justify using them.  For instance, a lot of standard intuitions (about the interpretation of the mean and tests that rely on it) break down for data that is not unimodal.  But a lot of data is unimodal, so this may not be a problem.  But it’s very rare for authors to just tell me the data is unimodal, even though it’d just take a few words to do so.  (A histogram would also help, and most of the cases where this information is included are ones where it’s included implicitly via histogram.)

I want to focus on this unimodality issue more, because it’s central to the problem.  We have a habit of simplifying a distribution down to a single number, usually a mean; if a second number is included, it’s some measure of spread around the mean.  A bimodal distribution can’t be captured in one number, and a mean-and-spread won’t capture it either.  So implicit in our whole way of talking about results in social and medical science is that everything is unimodal, or else nothing would make sense.  Indeed, this is a problem even for unimodal distributions that are skewed – since the mode, median, and mean are different, the mean (which is typically the number reported) is a poor guide to the “typical” case, either in the sense of “most common” (mode) or “50th percentile” (median).
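The cheapest possible demonstration (simulated data, matplotlib assumed): two samples with essentially the same mean and SD, one of which the mean-and-spread summary describes fine and one of which it describes terribly.

```python
# Two samples with nearly identical mean and SD but very different shapes;
# a summary-statistics table can't tell them apart, a histogram can.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
unimodal = rng.normal(0.0, 1.0, size=5000)
bimodal = np.concatenate([rng.normal(-1.0, 0.17, 2500), rng.normal(1.0, 0.17, 2500)])

for name, x in [("unimodal", unimodal), ("bimodal", bimodal)]:
    print(name, round(x.mean(), 2), round(x.std(), 2))   # both ~0 and ~1

fig, axes = plt.subplots(1, 2, sharex=True, sharey=True)
axes[0].hist(unimodal, bins=50); axes[0].set_title("unimodal")
axes[1].hist(bimodal, bins=50); axes[1].set_title("bimodal")
plt.show()
```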

Here’s an example from another marijuana paper (PDF).  This paper had a really intriguing result – previous studies had shown that cannabinoid receptor availability is suppressed in regular users but bounces back somewhat after 28 days of abstinence, and this study showed that most of the bounce-back (~75%) happens within just 2 days, with the remaining 26 days just adding some extra on top.

However, only means are reported, so that this time trajectory (“75% of the 28-day improvement in 2 days, the remaining 25% in 26 days”) may not have occurred in any experimental subject, like the proverbial “average American family” that has 2.5 kids even though literally no one has 2.5 kids.  It could be that some people bounce back completely in 2 days (or fewer) while others improve slowly and linearly; it could be that some people improve fast while others don’t improve at all; it could be that everyone improves exponentially with the same half-life, so people who started lower bounce back faster on a linear scale; it could be that you can only bounce back if you’re above a certain threshold, but if you are then it’s fast.  All of these possibilities have strong and different implications for individual users.
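Just to spell out how little the mean trajectory pins down, here is an invented cohort – half “fast recoverers,” half “slow recoverers” – whose average reproduces the “75% of the bounce-back in 2 days” shape even though it matches nobody:

```python
# Two individual recovery patterns whose average looks like "75% in 2 days,
# the rest over 26 days", even though no individual follows that trajectory.
import numpy as np

days = np.array([0, 2, 28])
fast = np.array([0.0, 1.0, 1.0])    # fully recovered by day 2
slow = np.array([0.0, 0.5, 1.0])    # halfway at day 2, done by day 28
cohort = np.vstack([fast] * 5 + [slow] * 5)
print(days, cohort.mean(axis=0))    # [0, 2, 28] -> [0.0, 0.75, 1.0]
```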

When I was thinking about that other study, I joked with myself that you’d get higher-quality information just from talking to a few stoners.  On reflection, I think this is less of a jokey exaggeration than I realized.  These studies have small samples (the one I just talked about had 11 dependent subjects and 19 controls).  These sample sizes are in kind of a transitional regime between case studies and proper statistical samples.  Because tests with a small sample will have low power, I feel wary of drawing any conclusions from the observed patterns of significance and non-significance (although these are often presented as the main results).  Since I don’t think you can do much with derived quantities (of the sort that usually get derived), I am correspondingly more interested in individual cases.

After all, if nothing else, we have a collection of individual cases here, and the “case study approach” can still be interesting with very few cases, while the statistical approach cannot.  If your sample size is three, and you give me detailed info about all three cases, I have at least learned about 3 things that can happen to human beings.  If the variance is high, all the better: now I know about 3 quite different things that can happen to human beings.  But if your sample size is 3 and you only report the results of statistical tests – which are all going to turn out non-significant, probably – I have learned nothing from you.

So, if you know (or know people who know) 11 stoners, you have 11 (colloquially presented) case studies.  This provides a lot of information, if not about overall trends, then about the sorts of things that can happen in individuals – which, after all, is what all of this (i.e. medical and public health research) is supposed to be about.  I can even start to get a sense of the relative frequencies of distinct subgroups: if 3 of 11 stoners experience Pattern X while the other 8 experience Pattern Y, well, I’d like more data, but that’s already suggestive.

But if you form an averaged time trajectory over the 11 and never give me more detail about the distribution beyond that coarse average, I don’t have any Pattern Xs or Pattern Ys, I just have a mean pattern that may not correspond to anyone’s story.  I have graphs like this (from the paper):

[image: figure from the paper showing the group-averaged trajectory during abstinence, error bars = SEM]

Here we have a picture of the Average Stoner During A Tolerance Break, who may not resemble any particular stoner at all, like the average family with 2.5 kids.  Those error bars aren’t quantiles, BTW, they’re SEM, so we don’t even have any skew information here.
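(For scale – this is just the standard SEM formula, nothing from the paper – the person-to-person spread is √n times the plotted bars:)

```python
# SD = SEM * sqrt(n): with the paper's n = 11 dependent users, the SDs are
# about 3.3 times as wide as the SEM error bars shown in the figure.
import math
print(math.sqrt(11))   # ≈ 3.32
```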

Rather than providing us with anything about individual trajectories, the authors concentrate instead on p-values.  Their reasoning – and I hope I am committing a misreading here! – is based on one of those errors they warn you about in Stats 101 classes: they are interpreting non-significance as conveying positive information about the world.  They present it as a big deal that while they got significance between stoners and non-stoners, the result is no longer significant after 2 days of marijuana abstinence:

Compared with HC subjects, [11C]OMAR volume of distribution was 15% lower in CD subjects (effect size Cohen’s d of 1.11) at baseline in almost all brain regions. However, these group differences in CB1R availability were no longer evident after just 2 days of monitored abstinence from cannabis. [my emphasis]

 Of course, the p-values would all slide downwards with a bigger sample, so if anyone does larger studies of this, they will predictably “find” that it takes longer than 2 days to lose significance.

Right next to that figure is another one, with a more appropriate vertical axis, which shows what the authors mean by a “no longer evident” difference: 

[image: the adjacent figure, with a more appropriate vertical axis]

That’s right: even after 28 days of abstinence, they’d closed less than half the gap between them and non-users.  But since the sample size is small, they could only get p=0.27 for this clearly-there difference.  (Remember, the bars here are SEMs, so the SDs will be a lot bigger.)
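To be explicit about the mechanics (rough simulation, invented numbers, scipy assumed): for a fixed true difference, the typical p-value is mostly a function of how many subjects you ran.

```python
# Median two-sample p-value for a fixed standardized difference d, as a
# function of group size: the same underlying gap goes from "not evident"
# to "significant" purely by adding subjects.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def typical_p(n_per_group, d=0.5, reps=2000):
    ps = []
    for _ in range(reps):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(d, 1.0, n_per_group)
        ps.append(stats.ttest_ind(a, b).pvalue)
    return np.median(ps)

for n in (10, 20, 50, 200):
    print(n, round(typical_p(n), 3))
```

So “no longer evident” is a statement about this study’s n at least as much as it is about cannabinoid receptors.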

This has been a bit of a digression, since I’m not sure these mistakes about null results have much to do with “respecting distributions.”  But I do think I can justifiably use this as another example of “this is your brain on summary statistics.”  Some of these mistakes are probably due to the emphasis on “significant = important” that is ingrained by publication criteria, but it also evinces a willingness to discard a lot of the information in your data.

To provide a Gallant to pair with the Goofus above, here’s yet another weed paper with a small sample.  They provide a lot of fine-grained, bimodality-tolerant, case-study-like detail:

The general trend was a decrease in blood pressure within the first 43 min after onset of smoking, but an initial increase in blood pressure was observed among some participants. Concerning individual mean arterial blood pressure, the largest decreases in mean arterial blood pressure were observed with the high THC dose with drops up to 41% below baseline (from 121 to 71 mmHg). Subjects 2, 23 and 12 showed the greatest decreases in mean arterial blood pressure whilst their THC serum concentration was 34, 213, and 137 μl/L, respectively, at 43, 17 and 7 min after onset of smoking. Subjects 10, 22 and 19 showed limited initial increase in mean arterial blood pressure (up to 37% above baseline, from 87 to 119 mmHg). Mean arterial blood pressure was still below baseline levels 8 h post-smoking for a majority of the participants.

And they even make plots where they just throw together every single participant’s time course:

[image: per-participant blood pressure time courses]

Admittedly these look ugly, and I’m sure there are much nicer ways of presenting the same information.  Still: this is the “ask some stoners” of graphs, and I mean that as high praise.  These graphs can answer many questions you might want to ask, even if the researchers didn’t ask them: what different types of trajectories are possible, the range at any time, where the distribution is peaked and how far away the unusually high/low trajectories are, etc.  Admittedly, you could be asking these questions more rigorously than by eyeballing a figure – but the authors probably aren’t going to answer every such question rigorously, so these pictures (like histograms) provide an indispensable supplement.


2.

Speaking of public health issues, I think I also see the downstream effects of these bad habits on the doctors who consume medical research.  (I would imagine there are similar effects on people who act upon social science research.)

I complained a while back about how I received different responses from different psychiatrists (and my GP) about benzodiazepines.  It seemed like each doctor had a single opinion about benzos, and didn’t adapt it much to the patient.  The “benzos are bad” doctors would be unmoved when I mentioned I’d been on the same dose for years, even though “people have to keep taking higher and higher doses” is one of the reasons the “benzos are bad” idea exists.  I think there was an element of “cover your ass” here, but it felt like the usual presumption of expertise in doctor-patient relationships was breaking down, as each doctor would refer to a supposed “standard opinion” which happened to concur with their own, clearly non-universal opinion.

Benzos are a case where lack of unimodality is important.  As I mentioned, one reason why doctors are wary of benzos is tolerance, specifically the need to ramp up the dose more and more over time.  It is true that some patients exhibit this pattern when prescribed benzos.  But then, there are those (like me) who don’t.  From this article “reappraising” benzos [below, “BDZs”]:

Although there are occasional reports of patients with anxiety disorders who increase the dose of BDZs to continue experiencing the initial anti-anxiety effect or who experience a loss of therapeutic benefit with the continuing treatment with BDZs, a body of evidence shows that the vast majority of patients with anxiety disorders do not have a tendency to increase the dose during long-term treatment with BDZs [30,69–72]. Therefore, tolerance to anxiolytic effects of BDZs usually does not occur in the course of long-term treatment. When patients increase the dose of BDZs, this usually appears in the context of other substance misuse.

This suggests bimodality, or a unimodal distribution not very well represented by its peak.  There is a population subtype that requests increasing doses, and people outside that subtype generally do not.

Likewise, a common concern is the withdrawal syndrome (or equivalently “dependence,” which is “characterized by the symptoms of withdrawal upon abrupt discontinuation and no tolerance”).  But again, this is not universal, and this may be another issue of “population subtypes”:

Withdrawal symptoms occurring after an abrupt cessation of long-term BDZ use are not inevitable; such problems were reported in approximately 40% of individuals taking BDZs regularly [80,81] and they were more likely in people with personality disorders, especially those with passive-dependent personality traits [82,83] (ibid.)

(I myself have abruptly stopped taking BDZs and then not taken them for periods of several weeks, and I’ve never experienced withdrawal symptoms.  I am the 60%.)

We have distinct desired and undesired patterns, so it would seem that the clinician’s task is to think about whether their patient is displaying (or is likely to display) the undesired pattern, and act accordingly.  Instead, what we have gotten is a one-size-fits-all concept which says that, “overall,” BDZs can cause worrying tolerance and dependence issues, and so one should treat them warily as a last resort.  This means that even when evidence about my own personal BDZ use over 3+ years is available, doctors prefer to consult assessments of BDZs and SSRIs “overall,” throwing away the distribution in favor of a single number.

(I don’t want to play up my own frustrations about this, which are very minor as frustrating medical care goes; I’m using myself as an example solely because it’s a case I’ve read a bit about.) 

(Some of the oddness I am trying to explain is no doubt the result of the pharma industry pushing heavily for the SSRIs, which are newer than BDZs.  For instance, it’s gradually been realized (by the profession – more quickly by patients, one would assume) that SSRIs can have a bitchin’ withdrawal syndrome themselves.)


3.

This problem also appears in many conversations that are not, on the surface, about statistics.  A familiar example is conversations about attractiveness.  A lot of people talk as though there’s just one scale of attractiveness, which would only be true if the assessors of attractiveness had unimodal (and strongly peaked) preferences.  In concrete terms: if you think that the way for any straight man to become more attractive is to improve him on a single “what women want” metric, you are assuming that straight female preferences are (if not all literally identical) strongly peaked around a single mode, so that boiling the distribution down to a single number is a reasonable approximation.

Everything in my relevant experiences suggests this is false.  There are people who like all sorts of things, and there are things far from the mean/median that are nonetheless interesting to many (perhaps there is another mode at these points).  This does not mean that scales of attractiveness do not exist, but that there are multiple such scales worth considering (at least one per mode), and that climbing the nearest scale is probably better than climbing the mean/median scale.

There is something analogous in political attitudes that assume sociological groups are homogeneous blocks.  This post is already too long, though.