Friday Polls – The Seven States
That WSJ seven-swing-states poll, and a margin of error revelation.
Hi, happy Friday.
Today I want to share some technical observations about the reporting of the WSJ’s swing state polling that got a bit of coverage this week. The topline report is available as a PDF here, and I’ll attach it as well so you can click through in case the article is paywalled.
My big takeaways will shock you: memories are short, the vibes are still off, and people are only beginning to coalesce around two candidates that really aren’t very popular. You can see why I don’t make my living as a public pollster.
Limited Methodological Reporting
You’ll notice in the PDF there is no methodology section. The online article describing the poll results tells us the MoEs ranged from +/-1.5pct to +/-5.6pct, with question base sizes running from 300 to 4,200 people.
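As a quick sanity check on those numbers: using the textbook margin-of-error formula for a simple random sample at 95% confidence (which may or may not match whatever adjustments the WSJ’s pollsters actually applied, since they don’t say), base sizes of 300 and 4,200 land right around that reported range.

```python
import math

def moe(n, p=0.5, z=1.96):
    """Textbook margin of error for a simple random sample at ~95% confidence."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (300, 4200):
    print(f"n = {n}: +/- {moe(n) * 100:.1f} pct")
# n = 300: +/- 5.7 pct
# n = 4200: +/- 1.5 pct
```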
They don’t report the sample method, don’t mention whether they used the voter file, and don’t report any weighting used in analysis. They do indicate that the study is multimodal – which means some participants answered the survey online and some by phone. We don’t know the proportion of online respondents to phone respondents (it’s pretty rare to see this split reported, which is kind of too bad, as there have been some documented effects on response based on mode).
If you’ve been listening to the show, you know that practitioners find this inconsistency in how pollsters report their methodology extremely frustrating. If everybody is doing it a little bit differently, it’s hard to tell whether a result is part of a trend or an outlier. Based on other polling I’ve seen in the states included in this poll, I don’t see massive outliers in the data, but I can’t know for sure.
Split Ballots (again)
Here’s another survey where not everyone answered the exact same set of questions – but that’s extremely common practice. One reason we do this in survey design is relevance – we want people to answer the questions they can answer. So we use what’s called “skip logic”: based on your answer to one question, we route you to another, bypassing the ones that don’t apply to you.
The other reason is simply that we don’t want you to bail out of a survey because it’s too long. Survey length is an issue for everyone who creates surveys. The ideal is to keep a survey below fifteen minutes to ensure the most completes.
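Here’s a toy example of what the “skip logic” piece looks like under the hood – the question names and routing below are invented for illustration, not taken from the WSJ’s questionnaire.

```python
# Toy skip logic: route a respondent based on a prior answer.
# The question names here are hypothetical, purely for illustration.
def next_question(answers: dict) -> str:
    if answers.get("owns_small_business") == "no":
        # The business-specific follow-ups don't apply, so skip past them.
        return "q10_household_finances"
    return "q5_business_outlook"

print(next_question({"owns_small_business": "no"}))   # q10_household_finances
print(next_question({"owns_small_business": "yes"}))  # q5_business_outlook
```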
The WSJ recruited 600 registered voters per state and then randomly assigned them to split-ballot experiments – 300 people in each state were asked about their state’s economy and 300 about the nation’s economy. They also split people between two lists of topics, so no respondent saw every topic. Here’s what Split A included:
- The economy
- Immigration and border security
- The war between Ukraine and Russia
- Abortion
Split B included:
- Inflation and rising costs
- The mental and physical fitness needed to be president
- The war between Israel and Hamas
- Protecting democracy
This does help keep the survey short and sweet – but it also means that pollsters have to be mindful of the design of the splits. In my opinion, the issues here are reasonably balanced if non-exhaustive. There’s one question in each where Biden performs pretty well; the other questions go in Trump’s favor – but neither “ballot” seems to “work better” for one candidate or the other (at least in the sense that they both work pretty well for Trump).
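For the mechanically inclined, the random assignment step is simple to sketch – the topic lists come from the WSJ’s write-up, but the code itself (names, seed, all of it) is just my illustration of how this kind of assignment is typically done.

```python
import random

# Topic lists per the WSJ write-up; everything else here is my own sketch.
SPLIT_A = ["economy", "immigration_border", "ukraine_russia", "abortion"]
SPLIT_B = ["inflation_costs", "presidential_fitness", "israel_hamas", "protecting_democracy"]

def assign_splits(respondent_ids, seed=None):
    """Randomly halve a state's respondents between the two topic splits."""
    rng = random.Random(seed)
    shuffled = list(respondent_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"A": {"topics": SPLIT_A, "respondents": shuffled[:half]},
            "B": {"topics": SPLIT_B, "respondents": shuffled[half:]}}

splits = assign_splits(range(600), seed=7)
print(len(splits["A"]["respondents"]), len(splits["B"]["respondents"]))  # 300 300
```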
Is 600 people enough for a whole state?
The question of sample size is a little tricky. It depends on the kind of analysis you want to do. The overall sample size of a survey is often driven by a couple of important considerations.
What kinds of cross tabs do you want?
To get a reliable read on Republicans v. Democrats v. Independents, you need to set a minimum base size for each group. In the WSJ sample, I’d guess they presumed about 15% of the sample should be Independents (as opposed to leaning Republican or Democrat) and set a minimum base size somewhere around 75-90 people. So if 15% of the sample is 90 people, the total sample size needs to be 600, with the remaining 510 people about evenly split between the parties.
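Put as arithmetic, you back into the total from the smallest group you need to read. (The 15% share and the 90-person minimum are my guesses, not published WSJ numbers.)

```python
import math

def required_total_sample(min_subgroup_base, subgroup_share):
    """Total completes needed so the smallest expected subgroup hits its minimum base."""
    return math.ceil(min_subgroup_base / subgroup_share)

# ~15% pure Independents and a minimum readable base of 90 per state (my assumptions):
print(required_total_sample(90, 0.15))  # 600
```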
At this sample size you could layer in another demographic factor, like gender, or something broad like “white/non-white” or “college/noncollege”. This would give you readable base sizes for a segment that is Republican/college or another segment that is Democratic/white. At this size, you can do quite a bit of “cutting” the data to look at subsegments, but you can’t get extremely fine grained (e.g., Republican, college educated, non-white women).
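Here’s a rough illustration of how quickly those cuts run out of room – the party and education shares below are invented for the example, not taken from the poll.

```python
# Illustrative shares only; these are not the WSJ's actual sample composition.
N = 600
party = {"Republican": 0.42, "Democrat": 0.43, "Independent": 0.15}
education = {"college": 0.40, "noncollege": 0.60}

for p, p_share in party.items():
    for e, e_share in education.items():
        print(f"{p} / {e}: ~{N * p_share * e_share:.0f} respondents")
# Two-way cells already range from roughly 36 to 155 people; add a third cut
# (race, gender, age) and many cells fall below a readable base size.
```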
Now, do 90 Independents in Pennsylvania represent all the Independents in that state? I can tell you that, as a qualitative researcher, if I individually interviewed 90 Independents across the state, it would probably be overkill on broad-brush topics like “how do you think the economy is doing” or “which presidential candidate do you prefer”.
Several years ago I did focus groups with independent voters in the Philadelphia suburbs, and in Pittsburgh, talking to 50 or so people. I remember many of these people well – they were of all ages, life stages, races, professions and levels of educational attainment. Some of them were extremely active in their local communities. A few were very well read on the goings-on in Harrisburg, the state’s capital. Many evinced a conspiratorial tendency – I remember there was talk of chemtrails in one group, and I started to laugh before realizing the person was quite serious. But when it came time to go down the list of messages and attributes of various candidates, or to identify their top issues in state politics, there was a clustering around a few core positions. They all had slightly different reasons for reaching those positions, but in the end, they coalesced.
So for the kinds of questions that WSJ asked, I suspect these base sizes are fine. We’d all like them to be bigger – but there are so many polls in the field, and they’re expensive to run, so anyone conducting surveys has to be mindful of doing what is most efficient and most effective.
Deep Dive: What’s a Margin of Error?
Margin of error is a statistical term broadly determined by the size of the sample. The larger the sample size, the lower the margin of error. That’s how we figure it, but what the heck is it?
The margin of error is the amount of variability we would expect to see (and accept) in the responses to survey questions if we ran the same survey multiple times with the same sample method and size at the same level of confidence (I’ll explain confidence levels and statistical significance next time, for my sins).
So if the preference for candidate A in a survey is 49% and the MoE is +/-3pct, we should expect that if we ran the exact same survey in the exact same way, sometimes this candidate could pull 52% and sometimes they could pull 46% – but we’d essentially never expect to see them pull 35% or 75%.
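In code, that expectation is just an interval around the reported number. This uses the textbook formula again, and the sample size of roughly 1,070 is simply whatever produces a +/-3pct MoE – an illustration, not a figure from any particular poll.

```python
import math

def confidence_interval(p_hat, n, z=1.96):
    """~95% interval around a reported share, under simple-random-sample assumptions."""
    m = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - m, p_hat + m

low, high = confidence_interval(0.49, 1070)
print(f"{low:.0%} to {high:.0%}")  # 46% to 52%
```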
Sample size isn’t the only factor that goes into the margin of error. There are a variety of assumptions baked into how we think about it. So here’s something most people don’t know: different pollsters calculate margin of error in different ways, based on their own assumptions about the sample, the survey design, and so on – and most of the time, they’re wrong.
Back in 2016, The Upshot published a piece by David Rothschild and Sharad Goel about research examining 16 years of political polls, spanning 1998 to 2014. In the end they examined 608 public polls, looking at their published margins of error versus the actual electoral outcomes. They wrote:
Comparing those polls’ results with actual electoral results, we find the historical margin of error is plus or minus six to seven percentage points. (Yes, that’s an error range of 12 to 14 points, not the typically reported 6 or 7.)
I have to tell you, this stopped me in my tracks when I first read it in G. Elliott Morris’ book Strength in Numbers: How Polls Work and Why We Need Them. He wrote in the introduction, “the real margin of error for a poll is more than twice the size of the one most people hear reported.” He went on:
If poll-watchers had better understood the “true” margin of error in polls, they would not have been shocked by medium-sized errors.
There is no dancing around the troubling facts of polling. But a fair reckoning of the performance of polls, which is badly needed, must place blame both on the pollsters creating political surveys and the people consuming them.
I do agree that media outlets covering polls should do a better job of caveating this for their audience, especially since voters do not usually read detailed research reports, even if they are available. My hope is that “the people consuming them” in Morris’ telling are in fact these journalists and pundits. Because I don’t know why the ordinary voter, or even the exceptional news and polling junkie, should just know that the published margin of error is that wrong. Hell, I’m a market researcher and while I knew that MoE is a little bit made up, I didn’t realize it was made up by a factor of 2.
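If you want to make that factor of 2 mechanical, it’s one line of arithmetic – the inflation factor comes from the Rothschild/Goel finding Morris cites, and the example numbers are made up.

```python
def effective_interval(p_hat, reported_moe, inflation=2.0):
    """Widen a reported MoE by the roughly-2x historical factor discussed above."""
    m = reported_moe * inflation
    return p_hat - m, p_hat + m

# A candidate reported at 49% with a published +/-3pct MoE:
low, high = effective_interval(0.49, 0.03)
print(f"{low:.0%} to {high:.0%}")  # 43% to 55%
```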
So, as you go forth and read about polls in the coming weeks and months, double the margin of error in your head – you might find that living with greater uncertainty is actually kind of liberating!