The hardest part about working with data, especially in large sets, is the people who input it in the first place. The reason they make it so difficult is because they’re inconsistent, not only in their day-to-day habits, but between one or more different people all entering info into the same database.
When you’re creating something solely for yourself, then by all means be as inconsistent or idiosyncratic as you want. But if it’s a group project creating information that someone like me is going to have to derive useful information from at some point in the future, inconsistency can make my job infinitely more difficult.
This is the reason why things like style guides were created — and they don’t just exist for the written word. Accounting and data management have their own style guides. So does computer programming, although that field has the advantage, because the program itself won’t let you get it wrong. Excel is the same way, although it won’t always tell you how to make it right.
Little things can cause problems and cost a business money. Sally may prefer to spell out words in addresses, like Avenue or Boulevard, while Steve likes to abbreviate with Ave or Blvd. Sam is also big on abbreviations, but always with periods. Seems innocuous, doesn’t it?
It does until the only way to make sure that a massive mailing doesn’t go to the same household at the same address twice is to compare the addresses to each other. That’s because, to a computer, 1234 Main Street, 1234 Main St, and 1234 Main St. are all completely different addresses. There’s no easy way to fix that because computers don’t have a “kinda sorta look the same” function.
Garbage in, garbage out
It’s also important that a database be designed properly. For example, names should always be entered as separate units — title/prefix, first name, middle name, last name, suffix. They can be combined later when necessary. A lot of good databases do this, but it’s completely worthless if somebody enters the first and middle names in the first name field or adds the suffix to the last name. You may have heard the expression “garbage in, garbage out,” and this is a prime example of that. All of the right fields were there, but if used improperly, it doesn’t matter.
Of course, the proper fields aren’t always included. One example I’ve had to wrestle with recently is a database showing the various insurance policies people have had with the agency. Now, that is useful and necessary information, as well as something that legally needs to be maintained. And it’s all right that a person gets one row of data for each policy that they’ve had. Some people will have one or two rows, others might have a dozen or more.
So what’s the problem? This: There are no data flags to indicate “this is the policy currently in effect.” This is doubly complicated since it’s Medicare related health insurance, so someone can have up to two active policies at a time, one covering prescription medications and the other a Medicare supplement. Or a policy may have expired after they decide to drop an MAPD and go back to “original” Medicare but the only way to know that is to look for an ending or termination date — if it was ever entered.
The secret power of “IF”
This is where one of my “stupid Excel tricks” came into it. You may or may not be familiar with some of the numeric functions dealing with columns or rows of numbers, but they basically operate on a whole range. They include functions like SUM, MAX, MIN, and AVG. The usual usage is to apply them to a defined range or series of cells and they have no operators, so you get things like:
=SUM([Range]) =MAX([Range]) =MIN([Cell1],[Cell2],[Cell3],...[Cellx])
Here’s the fun trick, though. If you add one or more “IF” statements within any of these functions, you can perform the operation on a sub-range of data defined by certain criteria. In the example I’m giving, it would look at all of the insurance effective dates for one person and determine the most recent one, which is usually a good indicator of which policy is in effect.
Generally, each item you’re evaluating is in the form of [DataRange]=[CellValue], or in actual Excel terminology, it might look like “$A$1:$A$470=A12” for the version entered in row 12. After the criteria ranges, you enter the range that you want to perform the operation on, close out the parenthesis, then enter.
So let’s say that we have last name in column B, first name in column D, and the dates we want to look at to find the latest are in column N. Our formula would look like this, assuming that the first row has the field headers and the data starts in row two:
If you’ve entered it right, the formula should be displaying the right number. In effect, you’ll have created a column down the far right side in which the value opposite any particular person’s name equals the maximum date value, meaning the latest. Then you can do an advanced filter (oh, google it!) to pull out just the unique name data and date, then use that to do an INDEX and MATCH to create a dataset of just the most recent plan and effective date. (I covered those two functions in a previous post.)
Or… the original database administrator could have just put those current plan flags in the data in first place, set them to automatically update whenever a newer plan of the same type was added, and voilà! Every step since I wrote “This is where one of my stupid Excel tricks came into it” 396 words ago would have been unnecessary. Time and money saved and problem solved because there was never a problem in the first place.
The art of improv in Excel
On the other hand… solving these ridiculous problems of making large but inconsistent datasets consistent with as little need as possible to look at every individual record just lets me show off my ninja skills with Excel.
It’s really no different than improv. Badly entered data keeps throwing surprises at me, and I have to keep coming up with new and improved ways to ferret out and fix that bad data. In improv, this is a good thing, and one of our mottos is, “Get yourself in trouble,” because that creates comedy gold as things in the scene either get irredeemably worse or are suddenly resolved.
Damn, I miss doing Improv, and long for the days when we can finally return to the stage. But I do digress…
Back to the point: In real life, not so much for easy resolution. It’s a pain in the ass to have to fix the curveballs tossed at us by other people’s laziness and lack of diligence — unless we approach it like a game and an interesting challenge. Then, real life becomes improv again in the best sense.
And I’ll find it forever amusing that the same rules can apply to both a spontaneous, unplanned, free-wheeling art form, and an un-wielding, rigid and unforgiving computer program. They both have their rules. Only the latter won’t allow them to be bent. Okay, some improv games have rules that are not supposed to be bent. But half the fun is in bending those rules, intentionally or not.
With Excel and data-bashing, all of the fun is in following Excel’s rules, but getting them to do things they were never intended to.
Image source: Author, sample output from a completely randomized database generator in Excel used to create completely artificial test data for practicing functions and formulae without compromising anyone’s privacy. Maybe I’ll write about this one some day, if there’s interest.