Farewell. The Flying Pig Has Left The Building.

Steve Hynd, August 16, 2012

After four years on the Typepad site, eight years total blogging, Newshoggers is closing its doors today. We've been coasting the last year or so, with many of us moving on to bigger projects (Hey, Eric!) or simply running out of blogging enthusiasm, and it's time to give the old flying pig a rest.

We've done okay over those eight years, although never being quite PC enough to gain wider acceptance from the partisan "party right or wrong" crowds. We like to think we moved political conversations a little, on the ever-present wish to rush to war with Iran, on the need for a real Left that isn't licking corporatist Dem boots every cycle, on America's foreign misadventures in Afghanistan and Iraq. We like to think we made a small difference while writing under that flying pig banner. We did pretty good for a bunch with no ties to big-party apparatuses or think tanks.

Those eight years of blogging will still exist. Because we're ending this Typepad account, we've been archiving the Typepad blog here. And the original Blogger archive is still here. There will still be new content from the old 'hoggers crew too. Ron writes for The Moderate Voice, I post at The Agonist, and Eric Martin's lucid foreign policy thoughts can be read at Democracy Arsenal.

I'd like to thank all our regular commenters, readers and the other bloggers who regularly linked to our posts over the years to agree or disagree. You all made writing for 'hoggers an amazingly fun and stimulating experience.

Thank you very much.

Note: This is an archive copy of Newshoggers. Most of the pictures are gone but the words are all here. There may be some occasional new content: John may do some posts and Ron will cross-post some of his contributions to The Moderate Voice, so check back.


Tuesday, May 29, 2012

Names suck as matching references

By Dave Anderson:

This has made me retch as a professional data geek:

The state  [Florida] has been responsible for helping screen voters since 2006 when it launched a statewide voter registration database. The state database is supposed to check the names of registered voters against other databases, including ones that contain the names of people who have died and people who have been sent to prison.

Read more here: http://www.miamiherald.com/2012/05/18/2805445/fla-to-double-check-names-on-voter.html#storylink=cpy

I'm a data geek.  One of my occasional tasks is to integrate my company's data set with lists that outside parties provide to us.  A priori, I know that a very large proportion of the individuals should be on both lists.  I've blocked out most of tomorrow for this task as we just got a medium size list that needs to be crosswalked into our data set.  I'll be working with the data geek intern (yay, I have a .25 FTE minion) to show the intern the ropes on how to work this process.  We talked about the project for twenty minutes this afternoon and the intern was shocked that this is not an easy process; surely it is just a matter of comparing names, and names are easy.

Ahhh, to be understandably incompetent in the ways of data.  Names suck as unique identifiers.  Here are some common problems:

  • Junior versus Jr. versus JR versus II

  • Dave versus David

  • David M Anderson versus DM Anderson versus David Anderson versus D Anderson

  • Family groupings don't necessarily follow any coherent naming structure

  • Mary Louise Jones versus Mary Louise Smith Jones versus Mary Smith-Jones versus Mary L Smith Jones etc.
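The problems in the list above can be sketched in a few lines of Python. This is only an illustration, not anything Florida or my shop actually runs: the `normalize_name` function and its two lookup tables are hypothetical, and a real crosswalk would need far bigger tables. The point is that even after cleaning up suffixes and nicknames, initials and hyphenated surnames still refuse to line up.

```python
import re

# Hypothetical lookup tables, invented for this sketch.
# Whether "II" should be treated the same as "Jr" is a judgment call;
# here they are kept distinct.
SUFFIXES = {"jr": "JR", "junior": "JR", "ii": "II", "sr": "SR", "senior": "SR"}
NICKNAMES = {"dave": "david"}


def normalize_name(raw):
    """Return (first, last, suffix) in a crude canonical form."""
    # Drop punctuation except hyphens, lowercase, split on whitespace.
    tokens = re.sub(r"[^\w\s-]", "", raw).lower().split()
    if not tokens:
        return ("", "", "")
    suffix = ""
    if tokens[-1] in SUFFIXES:
        suffix = SUFFIXES[tokens.pop()]
    if not tokens:
        return ("", "", suffix)
    first = NICKNAMES.get(tokens[0], tokens[0])
    last = tokens[-1]
    return (first, last, suffix)


# Suffix and nickname variants collapse to the same key...
print(normalize_name("Dave Anderson Jr.") == normalize_name("David Anderson JR"))
# ...but initials and hyphenated surnames still do not.
print(normalize_name("DM Anderson") == normalize_name("David M Anderson"))
print(normalize_name("Mary Louise Jones") == normalize_name("Mary Smith-Jones"))
```

The first comparison comes out equal and the other two do not, which is the whole problem: every rule you add fixes one variant and leaves the rest.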

My name in particular is a pain in the ass because for my age cohort, it has a top-10 male name and a very common last name.  Googling "David Anderson" and restricting it to Pittsburgh produces numerous other individuals before you find anything about me that is not Newshoggers-related.  My wife is a bit easier for the data geek as she has an uncommon first name.  But the point is that names are a hideous identifier.

Names combined with other information can be better as unique identifiers.  However, there are strong limitations on using address data such as postal address, as there again are significant naming convention problems, and ZIP codes have no actual boundaries, only imputed ones.  ZIP codes can commonly cross multiple municipalities and counties.  Furthermore, center cities are often used as mailing addresses for multiple inner ring suburbs.  For instance, I live outside of the Pittsburgh city limits, but my ZIP code means my mailing address is "Pittsburgh, PA".  Birthday data is a bit better, assuming accurate data entry, but again, there are numerous David Andersons born on my birthday, and they live in multiple states and have jacked up my credit report more than once.

The intern's eyes were glazing over when I got to the point about propensity scoring (i.e., a match on first name, last name, DOB, and ZIP code but mismatch on middle initial and suffix is probably a valid match), wild ass guesses that need to be sent back to the outside vendor for confirmation, and unique identifiers such as Social Security number or UPIN or NPI or anything else. A match on EIN or TIN or SSN is a solid match.
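Here is a rough sketch of that kind of scoring in Python. Everything in it is an assumption made up for illustration: the field names, the weights, the two thresholds, and the `classify_match` function itself. Real weights would have to be tuned against records you already know are matches or non-matches.

```python
# Hypothetical weights and thresholds, chosen only for illustration.
WEIGHTS = {
    "first": 2,
    "last": 2,
    "dob": 3,
    "zip": 1,
    "middle_initial": 1,
    "suffix": 1,
}


def classify_match(a, b, accept=8, review=5):
    """Compare two record dicts and return a rough match category."""
    # A shared hard identifier (SSN here) settles the question on its own.
    if a.get("ssn") and a.get("ssn") == b.get("ssn"):
        return "match"
    # Otherwise add up the weights of the fields that are present and agree.
    score = sum(
        weight
        for field, weight in WEIGHTS.items()
        if a.get(field) and a.get(field) == b.get(field)
    )
    if score >= accept:
        return "probable match"
    if score >= review:
        return "send back to vendor"
    return "no match"


ours = {"first": "david", "last": "anderson", "dob": "1980-01-01",
        "zip": "15200", "middle_initial": "m", "suffix": ""}
theirs = {"first": "david", "last": "anderson", "dob": "1980-01-01",
          "zip": "15200", "middle_initial": "", "suffix": "JR"}
print(classify_match(ours, theirs))
```

With these made-up weights, a match on first name, last name, DOB, and ZIP code clears the accept threshold even though middle initial and suffix disagree, while a match on first and last name alone scores too low to count as anything.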

The intern's ignorance is understandable as this is his first exposure to intermediate data geekery.  However, Florida's decision to use name matching for anything other than a PSA mailing to remind people to brush their teeth is not defensible as understandable ignorance.  It is intentional and willful incompetence by someone, either the hiring entity or the contractor, and if it is the contractor, the state is guilty of neglect.

But that happens to be the entire point of this exercise: intentional neglect is useful to the Florida governing elite.


  1. OMG. Name matching?
    You're right. That's an amateur mistake.
    I have lots of experience matching records and names in the healthcare domain, and even the low level staff know they need a name and Date of Birth to have just a high probability of matching a caller to their medical records.
    The good news is that this type of challenge to Florida's list is objective and based on solid knowledge of data profiling.
    Thanks for calling this to my attention!

  2. matching buys and sells by stock symbols is easy! you can even use totals!