Good, Clean Data

by Brooke Williams

In our recent story for The New Republic, Ken Silverstein and I examined think tank scholars who simultaneously work as registered lobbyists. We knew of situations worth examining: a resident think tank fellow also representing Polish oil interests, and the director of a homeland security program lobbying for defense contractors, to name a couple. But we wanted to go beyond the anecdotal and gain context. Was this part of a larger system in which registered lobbyists have access to think tanks from the inside?

Limited and dirty data made this question a challenge to answer. Think tanks disclose only officers, directors, trustees, key staff, and the five highest-paid employees in annual filings to the Internal Revenue Service. And while most think tanks list scholars and staff online, the lists come in various formats. Some are behind search engines or spread across many separate web pages.

As a part of my long-term project, I am grabbing names of scholars and staff listed online, then cleaning, parsing, and importing them into a database, which I will be making freely available in a searchable, meaningful way. But for this story, I stuck with data from tax filings, when they were available, for the 25 top think tanks as James McGann ranked them in his report for the Think Tanks and Civil Societies Program at the University of Pennsylvania.

I downloaded the names of think tank people from, which has digitized the IRS Form 990s. However, since the original files were .pdfs, the data required cleaning, standardizing, parsing, and verifying before they could be linked to lobbyist records.

The name field contained unwanted spaces, characters and punctuation, as well as misspelled names. At least one person was missing. Once I trimmed the spaces and removed punctuation, titles, suffixes, and prefixes, I researched people whose names appeared to be misspelled to ensure the data were correct and consistent.
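For readers who want to try a similar cleanup, here is a minimal sketch in Python of the trimming and stripping step. It is not the code used for the story: the function name `clean_name` and the prefix and suffix lists are hypothetical, and it assumes names use plain ASCII letters, so accented characters would need extra handling.

```python
import re

# Hypothetical lists for illustration; a real project would build these from its own data.
PREFIXES = {"mr", "mrs", "ms", "dr", "hon", "prof"}
SUFFIXES = {"jr", "sr", "ii", "iii", "iv", "phd", "md", "esq"}

def clean_name(raw):
    """Return a lowercased name without stray punctuation, titles, or suffixes."""
    # Drop periods so that "Ph.D." collapses to "PhD" and "Q." to "Q".
    text = raw.replace(".", "")
    # Turn anything that is not a letter, apostrophe, space, or hyphen into a space.
    text = re.sub(r"[^A-Za-z' -]", " ", text)
    # split() with no arguments also trims and collapses repeated spaces.
    tokens = text.lower().split()
    # Strip titles only from the front and suffixes only from the end.
    while tokens and tokens[0] in PREFIXES:
        tokens.pop(0)
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)

print(clean_name("  Dr. John  Q. Smith, Jr. "))
print(clean_name("Mary O'Neil-Brown, Ph.D."))
```

A function like this only standardizes the field. Misspelled or missing names still have to be caught and researched by hand.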

It’s worth noting the IRS began releasing 990s in a digitized format this year, and other journalists have already made them easily searchable. But unfortunately, the digitized data don’t include names of directors, officers, trustees, and key employees.

Once the think tank names were ready, I downloaded registered lobbyist data and prepared them for a cross-check. This required parsing a name field and performing integrity checks on the results.
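Parsing a name field and checking the results might look something like the sketch below. The "Last, First Middle" layout is an assumption made for illustration, not a description of the actual lobbyist records, and the helper names `parse_name` and `integrity_report` are hypothetical.

```python
def parse_name(full):
    """Split 'Last, First Middle' into (first, middle, last); return None if malformed."""
    parts = [p.strip() for p in full.split(",")]
    # Expect exactly one comma with text on both sides of it.
    if len(parts) != 2 or not parts[0] or not parts[1]:
        return None
    given = parts[1].split()
    first = given[0]
    middle = " ".join(given[1:])
    return first, middle, parts[0]

def integrity_report(rows):
    """Separate rows that parsed cleanly from rows that need manual review."""
    parsed, rejected = [], []
    for row in rows:
        result = parse_name(row)
        if result is None:
            rejected.append(row)
        else:
            parsed.append(result)
    return parsed, rejected

parsed, rejected = integrity_report(["Smith, John Q", "Smith John", "Doe, Jane"])
print(parsed)
print(rejected)
```

Keeping the rejected rows in their own list, rather than silently dropping them, makes it possible to review each one before the cross-check.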

When the two data sets were ready, I linked the name fields by first and last name and began examining the results. I didn’t include the middle name because it wasn’t always in both data sets. One by one, I verified or eliminated.
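The linking step can be sketched as a simple join on first and last name. This is an illustration under assumptions, not the query used for the story: it supposes each record is a dictionary with hypothetical `first` and `last` keys, and the sample names are made up.

```python
from collections import defaultdict

def link_by_name(think_tank_rows, lobbyist_rows):
    """Pair records that share a (first, last) key, ignoring middle names."""
    # Index lobbyists by name so each think tank record needs only one lookup.
    index = defaultdict(list)
    for lob in lobbyist_rows:
        index[(lob["first"].lower(), lob["last"].lower())].append(lob)
    candidates = []
    for tt in think_tank_rows:
        key = (tt["first"].lower(), tt["last"].lower())
        for lob in index.get(key, []):
            candidates.append((tt, lob))
    return candidates

# Made-up sample records for illustration only.
think_tank = [{"first": "Jane", "last": "Doe"}, {"first": "Sam", "last": "Lee"}]
lobbyists = [{"first": "JANE", "last": "DOE"}, {"first": "Pat", "last": "Roe"}]
print(link_by_name(think_tank, lobbyists))
```

Every pair this produces is only a candidate. Two different people can share a first and last name, which is why each match still has to be verified or eliminated by hand.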

First, taking into account different filing periods, I queried for those people listed in both data sets during the same years—as we were only interested in those cases. I removed instances where the person was a registered lobbyist for the think tank itself.
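A rough sketch of that filter follows. It assumes both data sets have already been normalized to a common `year` field, which glosses over the real differences in filing periods, and it uses hypothetical `organization` and `client` keys and made-up sample records.

```python
def overlapping_outside_clients(candidates):
    """Keep pairs from the same year where the lobbying client is not the think tank itself."""
    kept = []
    for tt, lob in candidates:
        same_year = tt["year"] == lob["year"]
        # Compare names loosely; real data would need more careful matching.
        own_think_tank = (
            lob["client"].strip().lower() == tt["organization"].strip().lower()
        )
        if same_year and not own_think_tank:
            kept.append((tt, lob))
    return kept

# Made-up sample pairs for illustration only.
pairs = [
    ({"organization": "Example Institute", "year": 2012},
     {"client": "Acme Energy", "year": 2012}),
    ({"organization": "Example Institute", "year": 2012},
     {"client": "Example Institute", "year": 2012}),
    ({"organization": "Example Institute", "year": 2011},
     {"client": "Acme Energy", "year": 2012}),
]
print(overlapping_outside_clients(pairs))
```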

Next, I verified whether they were indeed the same person. (As it turns out, some think tank executives and registered lobbyists share seemingly uncommon names.) This required varying levels of research, from reading online biographies and making phone calls to scouring federal records showing prior government positions. Finally, I verified in the paper versions of the 990s and lobbying disclosure reports that each individual was, in fact, listed on the think tank’s rolls and registered to lobby simultaneously.

In the end, as our article described, the data showed at least 49 people have simultaneously worked as scholars, officers, trustees, and directors at think tanks while registered to lobby on behalf of outside clients. Especially given the limited scope of data for this analysis, the number suggests there will be plenty more potential conflicts of interest to examine once we cross-check the rest.

In the meantime, it seems even one example is significant. The Center for American Progress says it has implemented a “no lobbyists” policy in response to our New Republic story. Stay tuned for the next one.