Wednesday Wonders: More fun with Excel

Every so often, an Excel formula challenge comes up that takes more than a little dissecting and pondering before the solution becomes clear. I had one of these recently, and when I finally did hit on the answer, I even surprised myself. I’ll get to that in a bit, but first a little backstory.

One of my job functions was to create a Policies and Procedures manual explaining all of the workings of the place from my perspective, and to keep it updated. It’s currently somewhere around 70 pages, and to make it easy for the user, it’s a Word Doc that is extensively cross-referenced with a full glossary of the many complicated terms in the Medicare insurance business. There are also a number of appendices, from the very inside baseball explanation of the various Medicare Supplement Plans to a useful but very specific guide to the nearest fast and fast casual dining establishments relative to the office which links out to Google Maps for each destination.

Another inclusion was a schedule of pay periods and pay dates which I originally included as a quick and simple table cut and pasted in from Excel starting with the pay period from when I originally created it. But it was static and although it covered a couple of years, would eventually go out of date.

So my challenge was this: How to create a dynamic table in Word by linking to an entire table in Excel that would always start during the particular pay period the document was opened in. For example, as I write this, our current pay period goes from September 28th to October 11th, with payday on the 15th. But if someone were to open the document in a week, then the first entry should show October 12th to 25th, and so on.

The three columns in the table show just that for each pay period: the start date, the end date, and payday. Each row below shows the next period, etc.

Normally, this would be a simple matter of doing an IF/THEN calculation. IF (today’s date) is greater than a period’s start date and less than its end date, then use the start date in that cell, otherwise increment to the new date.

Now, this would be great if I could create a table of all the start dates for however many years and then link to it. Unfortunately, that wasn’t an option.

The other usual method for incrementing periodic changes wouldn’t work here at all. Normally, you could look at the cell above and use something like “IF (old cell) is less than 10, (this cell) equals (old cell) + 1, otherwise (this cell) equals 0.” Then (this cell) becomes (old cell), repeat.

The problem is that you can quickly run afoul of the circular reference problem,

This happens when two or more cells have formulae set up so that each one provides input to the other at the same time. The simplest example would be something like inputting “=A1+B1” in cell A1. This is telling Excel that the value in cell A1 is equal to itself plus B1. The problem is that B1 changes the value of A1 no matter what it is, so A1 has no value until B1 does, but even if B1 has a value, it will keep constantly changing the value of A1. It all leads to an endless loop, and computers do not like endless loops. That’s why programs like Excel have a hard-coded braking system that will stop an endless loop before it happens. If you ever accidentally do enter one, you’ll notice that the program gives you an error message, draws little arrows to point out the offending cells, and evaluates the initial culprit to zero.

Which is the long way around of saying that I had to figure out a way to calculate the current pay period start date based on the current date and do it all in one cell without evaluating any outside expressions.

Excel nerds, if you’d like to go grab a nosh or latte and think about this now, please do and compare your answer later. Everyone else, here’s how I managed to figure the problem out — again, after a lot of thought and contemplation.

The problem, restated: Calculate a table of payroll period start, end, and pay dates dynamically based on the current date, and without relying on any kind of lookup function using a pre-determined database. In other words, this puppy had to do it dynamically, with only one input point.

When I finally found the solution, I damn near shit my pants in joy, because it was really so simple and elegant, but it took a lot of thought, and gets to the heart of how Excel handles dates.

The only way that a computer can work with dates is to count them as a certain number of days since a fixed day. The behind the scenes work converts a particular number into a particular date. The catch is that a system can only handle dates after their zero point, and not before. If you enter 08/01/1899 into any Windows or Apple program, it won’t know what to do with it. Start with any date on or after 01/01/1900, though, and you’re fine, because that’s the start point.

So… in Excel, any date is just the number of  days since that start point, and if we go with the first start date of my company’s payroll period in 2020, we get January 6 which, in the terms stated, is 43,836. If you don’t believe me, divide this number by 365.25, and you get 120.02, which is just a hair over the number of years it’s actually been.

But forget that. The important number is 43,836, because, to Excel, that means the same thing as Epiphany, which is January 6, 2020 — the start of a past pay period where I work.

The key insight I had here was this: The important bits were those days that were exactly increments of 14 days after the starting point — and having a starting point and increment meant that I suddenly had two constants to plug into the equation, and that made all the difference, because constants are the anchors that everything else could be hung on, and it could happen in a single source cell.

Two constants would be modified by a single variable — what is today’s date? And the answer to that question comes in the form of another great Excel function, NOW(),  which simply returns the number for the date at the point it’s invoked.

So… given an arbitrary pay period start date of January 6 and a pay period of 14 days, the first cell formula looks like this:

=43836+(INT(((NOW()-43836)/14))*14)

Where 43836 is the constant date value for January 6, 2020, 14 is the constant interval in days, and NOW() is the variable based on today’s date. Inside the brackets, we get the integer value of Today minus the start date divided by the interval, and then multiplied by the interval outside of the integration.

What all that fancy math does is this: decides whether the current date is evenly divisible by 14. If it’s not, then it uses the original start date. If not, it increments it by a number that just happens to work out to be the actual number of increments since that first day. If it’s not obvious, it works like this. The +(INT(((NOW()-43836)/14) part determines how many pay period past the original date that NOW is. The *14) part puts back the days to land on the right week.

Yes, it was a struggle to figure it out and it worked beautifully, and I felt that my Excel Jedi score was vastly boosted. But… oh. Did I mention that I had to make this whole thing work as a dynamic spreadsheet insert inside of a Word Doc with no external links to Excel?

Oops, I guess? I did it again.

Sunday Nibble #3

At various points in my professional life, I’ve been a heavy user of spreadsheets, Excel in particular, and as I’ve indicated in occasional pieces on things like using index and match together, or sneaking IF into unexpected places, that I like to stretch what Excel can do so that I can automate data processes as much as possible.

Currently, one of my job functions involves entering commission statements into spreadsheets, and when I got there, the method had been to print the things out and then manually enter the data, and that just would not do — especially not when some of the statements would run to more than 20 pages, with hundreds of entries.

I eventually figured out how to suck the data out of the PDFs, paste it into Excel, use consistent quirks of how that data came out in order to correct it en masse, pass it through Word in order to tab delimit it, then put it back into Excel for a few more tweaks until the data was ready to transfer, either via cut and paste or, as I’m working towards, using a formula to have the destination spreadsheet update directly from the data.

Of course, there’s one very important factor here, and it’s one I build into every single step. I am constantly verifying that the balance of the data I’m working with at the moment always matches the total on the original to the penny. If it suddenly goes off, I know that something didn’t go right, and I can catch and fix it immediately instead of having to go through the tedious balancing process at the end.

And trust me, I’m over that. It used to take me two or three days to go through the entire process of entering and balancing a huge statement. Now? Maybe an hour or two — time much better spent.

But the moral of the story is “Verify constantly.” One time, when that didn’t happen, it led to a very costly mistake. You may or may not have heard of or remember the London Whale. He was a trader who made a little boo boo in his spreadsheet formula, leading to a loss of six or seven billion dollars to JP Morgan, with additional fines in the hundreds of millions.

How did it happen? Forensic analysis showed that he used a series of spreadsheets, and data had to be manually copied and pasted from one to the other. It was a process that should have been automated but never was. Ultimately, a formula was messed up at some point, and it started dividing the difference between return rates by their sum instead of their average.

This figure led to the London Whale making increasingly aggressive trades, causing other traders to place put calls against them — i.e. betting that he was making the wrong decision. Guess which group was right?

And in case that division error seems trivial, here’s an example with real numbers. Let’s say that the old rate was 15% and the new one was 25%. The difference is 10%. But here’s the outcome when you divide by the sum and the average. Ten percent over the sum of 40 is .25. Ten percent over the average of 20 is .50.

In other words, one number is half of the other, and that is consistent for any combination of new and old rates you pick. The error is always a factor of 200%, Imagine that kind of error in construction — “Oh. Your doorways are all three feet wide, but we ordered eighteen-inch wide doors. Sorry!”

“The minimum safe height for skydiving out of this plane is 5,000 feet… Oh, crap. He jumped, and we’re at 2,500.”

As the old carpenter’s adage goes, “Measure twice and cut once.” When it comes to dealing with numbers, always be verifying.

More stupid Excel tricks: a secret power of IF

The hardest part about working with data, especially in large sets, is the people who input it in the first place. The reason they make it so difficult is because they’re inconsistent, not only in their day-to-day habits, but between one or more different people all entering info into the same database.

When you’re creating something solely for yourself, then by all means be as inconsistent or idiosyncratic as you want. But if it’s a group project creating information that someone like me is going to have to derive useful information from at some point in the future, inconsistency can make my job infinitely more difficult.

This is the reason why things like style guides were created — and they don’t just exist for the written word. Accounting and data management have their own style guides. So does computer programming, although that field has the advantage, because the program itself won’t let you get it wrong. Excel is the same way, although it won’t always tell you how to make it right.

Little things can cause problems and cost a business money. Sally may prefer to spell out words in addresses, like Avenue or Boulevard, while Steve likes to abbreviate with Ave or Blvd. Sam is also big on abbreviations, but always with periods. Seems innocuous, doesn’t it?

It does until the only way to make sure that a massive mailing doesn’t go to the same household at the same address twice is to compare the addresses to each other. That’s because, to a computer, 1234 Main Street, 1234 Main St, and 1234 Main St. are all completely different addresses. There’s no easy way to fix for that because computers don’t have a “kinda sorta look the same” function.

Garbage in, garbage out

It’s also important that a database be designed properly. For example, names should always be entered as separate units — title/prefix, first name, middle name, last name, suffix. They can be combined later when necessary. A lot of good databases do this, but it’s completely worthless if somebody enters the first and middle names in the first name field or adds the suffix to the last name. You may have heard the expression “garbage in, garbage out,” and this is a prime example of that. All of the right fields were there, but if used improperly, it doesn’t matter.

Of course, the proper fields aren’t always included. One example I’ve had to wrestle with recently is a database showing the various insurance policies people have had with the agency. Now, that is useful and necessary information, as well as something that legally needs to be maintained. And it’s all right that a person gets one row of data for each policy that they’ve had. Some people will have one or two rows, others might have a dozen or more.

So what’s the problem? This: There are no data flags to indicate “this is the policy currently in effect.” This is doubly complicated since it’s Medicare related health insurance, so someone can have up to two active policies at a time, one covering prescription medications and the other a Medicare supplement. Or a policy may have expired after they decide to drop an MAPD and go back to “original” Medicare but the only way to know that is to look for an ending or termination date — if it was ever entered.

The secret power of “IF”

This is where one of my “stupid Excel tricks” came into it. You may or may not be familiar with some of the numeric functions dealing with columns or rows of numbers, but they basically operate on a whole range. They include functions like SUM, MAX, MIN, and AVG. The usual usage is to apply them to a defined range or series of cells and they have no operators, so you get things like:

=SUM([Range])
=MAX([Range])
=MIN([Cell1],[Cell2],[Cell3],...[Cellx])

Here’s the fun trick, though. If you add one or more “IF” statements within any of these functions, you can perform the operation on a sub-range of data defined by certain criteria. In the example I’m giving, it would look at all of the insurance effective dates for one person and determine the most recent one, which is usually a good indicator of which policy is in effect.

Generally, each item you’re evaluating is in the form of [DataRange]=[CellValue], or in actual Excel terminology, it might look like “$A$1:$A$470=A12” for the version entered in row 12. After the criteria ranges, you enter the range that you want to perform the operation on, close out the parenthesis, then enter.

So let’s say that we have last name in column B, first name in column D, and the dates we want to look at to find the latest are in column N. Our formula would look like this, assuming that the first row has the field headers and the data starts in row two:

=MAX(IF($B$1:$B$525=B2,IF($D$1:$D:$D$525=D2,$N$1:$N$525))

If you’ve entered it right, the formula should be displaying the right number. In effect, you’ll have created a column down the far right side in which the value opposite any particular person’s name equals the maximum date value, meaning the latest. Then you can do an advanced filter (oh, google it!) to pull out just the unique name data and date, then use that to do an INDEX and MATCH to create a dataset of just the most recent plan and effective date. (I covered those two functions in a previous post.)

Or… the original database administrator could have just put those current plan flags in the data in first place, set them to automatically update whenever a newer plan of the same type was added, and voilà! Every step since I wrote “This is where one of my “stupid Excel tricks came into it” 396 words ago would have been unnecessary. Time and money saved and problem solved because there was never a problem in the first place.

The art of improv in Excel

On the other hand… solving these ridiculous problems of making large but inconsistent datasets consistent with as little need as possible to look at every individual record just lets me show off my ninja skills with Excel.

It’s really no different than improv. Badly entered data keeps throwing surprises at me, and I have to keep coming up with new and improved ways to ferret out and fix that bad data. In improv, this is a good thing, and one of our mottos is, “Get yourself in trouble,” because that creates comedy gold as things in the scene either get irredeemably worse or are suddenly resolved.

In real life, not so much. It’s a pain in the ass to have to fix the curveballs tossed at us by other people’s laziness and lack of diligence — unless we approach it like a game and an interesting challenge. Then, real life becomes improv again in the best sense.

And I’ll find it forever amusing that the same rules can apply to both a spontaneous, unplanned, free-wheeling art form, and an un-wielding, rigid and unforgiving computer program. They both have their rules. Only the latter won’t allow them to be bent. Okay, some improv games have rules that are not supposed to be bent. But half the fun is in bending those rules, intentionally or not.

With Excel and data-bashing, all of the fun is in following Excel’s rules, but getting them to do things they were never intended to.

Image source: Author, sample output from a completely randomized database generator in Excel used to create completely artificial test data for practicing functions and formulae without compromising anyone’s privacy. Maybe I’ll write about this one some day, if there’s interest.

Stupid Excel tricks #1: INDEX and MATCH

Enter the matrix… math

There is an entire class of functions in Excel that take things to a whole new level, and they are called matrices. Maybe you ran across this in math in school and have forgotten, maybe not, but the idea with a matrix is that it takes one grid of numbers of X x Y dimensions and uses operators to manipulate it using another grid of numbers that may or may not have the same dimensions.

The great part is that to use these functions in Excel, you don’t need to know any of that. Like I’ve mentioned before, it’s exactly like using a cookbook. Plug in the ingredients as specified, voila, the dish pops out the other end.

Maybe you’ve used the functions VLOOKUP and HLOOKUP, or maybe not, but they can be useful if you want to match exactly one criteria in a table and if the data you’re looking up is somewhere to the right of that criteria. So it’s perfect if you have something like a unique account number on the far left and want to use that to look up a name or phone number to the right of it:

=VLOOKUP(M2,$A1:$L556,6,FALSE)

This tells Excel to take the value in cell M2, compare it to all of the values in column A of the named range, then look up the value in the sixth column counting from the column defined in the second variable (in this case, F) where the first column is equal to M2. “FALSE” just means to use an exact match, whereas “TRUE” would mean to use an approximate match.

Again, this is great if you’re searching something with unique values in both places — there is only one account number, and only one data point associated with it.

Now what if you have multiple entries for the same person with different account numbers, or multiple sizes and colors of a product with differing prices, or you need to search on more than one data point in different columns, or your table was set up with the criteria you want to use somewhere to the right of the data points you’re searching?

Welcome, matrix functions! These are two nested commands that work miracles together. The first is INDEX, and what it basically does is point to a column with data that you’re going to pull stuff from, then follow that up with the criteria you’re going to use to do that. You can see the difference from the LOOKUP functions right off the bat, because those start with the single data point you’re going to use to search the data. The INDEX function starts with the place you’re going to get the answer from.

The MATCH function is the matrix math, and it allows you to specify multiple criteria matched to different columns in the source data. The nice part about it is that you can have as many different criteria as you need — first name, last name, account number; size, gender, color, style; title, author, binding, edition; and so on. And each of these can point to any particular bit of data you need — monthly cost, price, location, phone number, address, and so on. Any bit of data in the table can be found this way.

If you want to put a physical analogy on it, it’s this. LOOKUP functions are a librarian with a sliding ladder that moves horizontally or can be climbed vertically. But the way it works is that they first move it or climb it in the direction you specify until it hits the target word. Then, it slides or climbs the other direction however many rows or columns you specified, and has now targeted exactly one cell with the answer. Oh — and it can only move to the right or down from that initial search cell.

On the other hand, think of INDEX and MATCH as a whole bunch of librarians who have set out all over the same bookcases, but are simultaneously searching the rows and columns, and calling back and forth to each other to indicate what bits they’ve found that match.

If you work with any kind of inventory or any data sets where people’s info is broken down (as it should be) into separate first and last names and account identifiers, then you need to know these functions, because they will save you a ton of time. And the basic way they work is like this:

INDEX($E1:$E1405,MATCH(1,(W2=$C$1:$C$1405)*(X2=$D$1:$D$1405)*(AA2=$J1:$J1405),0))

(Note: All column and row designations here are arbitrary and made up, so they don’t matter.)

That might look complicated, but it’s not. Let’s break it down. The first part, referring to the E column is the “Where” of the formula. That is, this is the column you’re pulling your data from. For example, if you want to use size, color, and style to find price, then this would be whatever column has the price data in it.

Next, we nest the MATCH function, and this lets INDEX know that what comes next will be the instructions it needs. The “1,” inside the parenthesis is a flag for MATCH, telling it to return one value. After that, each nested thing — and you can have as many as you need — follows the form “Single cell to look at equals column to search.” So, as seen here, for example, in the search data, column W might be the first name, and cell W2 is the cell corresponding what we’re looking at. Meanwhile, column C in the target data might also include first names, so what we’re saying is “Look for the single value of W2 down the entire column of C1 to C1405. The dollar signs are there to lock it as a fixed range.

All of the other parentheticals here follow the same pattern. Maybe X is the column for last name in the source and D is where the last names are in the target; and AA is account number, as is J.

The two other interesting things to note in building matrix equations: The single cell and the column are joined by an equals sign, not a comma, and this is important because, without it, your formula will break. What this tells Excel is that whatever the matrix pulls out of single cell must equal what’s in the column at that point.

The other thing to notice is that between the searches within parentheses, there aren’t commas, but rather asterisks, *, which indicate multiplication, and this is the heart of Matrix math.

What this tells the formula is to take the results of the first thingie, apply those criteria and pass it along to the second. In other words, if the first evaluation turned up nothing, that is mathematically a zero, and so it would quash anything coming from the second and third functions. On the other hand, if it comes up as a one, then whatever the second formula turns up will stay if there’s a one, dump if not, and then pass on to the third, fourth, etc..

Lather, rinse, repeat, for as many steps down the process you’ve created. A false at any point in the matrix math will kill it and result in nil. Meanwhile, as long as the tests keep turning up positives, what will fall out of the ass end of it is the honest legit “This data is the true data.”

Funny how that works, isn’t it? The only other trick you need to remember is that after you’ve entered this formula, you need to close it out by hitting Ctrl-Shift-Enter to let Excel know it’s a matrix formula. Then, if you want to copy it, you can’t use the usual Ctrl-C, Ctrl-V. Instead, you have to highlight the column with the formula at the top, then hit Ctrl-D. Voila… the whole thing repeats down the column.

And there you have it — a way to search multiple criteria in a row in order to find a specific data point in a table. You’re welcome.

Don’t make it rocket science when it’s not

So many tools

It never ceases to boggle my mind when people don’t jump on the chance to learn and fully take advantage of the amazing modern tools we’ve been handed and which are ubiquitous. If you work in any kind of office environment at all, whether it’s some stodgy traditional business or a bleeding-edge industry like tech or gaming, at the very least you’re dealing with either Microsoft’s Word, Excel, Outlook, etc., or the Apple equivalents.

If you’re using the Open Office or Chrome/Cloud versions, then this piece probably isn’t directed at you because you definitely get it. But, otherwise… really, people? These are literally the things that you use every day, and yet I constantly see very few people ever progressing beyond the merest basic ability to use any of the programs.

That is: Open document, type shit with defaults, save or send as-is.

If I open a spreadsheet you’ve worked on in an older version of Excel and see three tabs at the bottom named Sheets 1, 2, and 3, I will know that you’re an amateur. Likewise if the font is set to that hideous Calibri. Same thing in Word minus the tabs, but same crappy font, ragged aligned left, with auto-spacing before paragraphs or lines.

Word to the wise, people. The first thing you should do in Word is go in and set your default formatting so that the autospace before lines or paragraphs is 0, and line spacing is single.

Why is paper still a thing?

But this is just an intro to some recent heinous, and it’s this. I’ve managed to stumble into a situation where a lot of coworkers prefer to do things on paper, and it makes me nuts. Simple question: Why? Physical files can only be in one place, usually aren’t in the place where they’re supposed to be, and there isn’t a magic search function that can find them other than somebody maybe remembering that they worked on it recently, and where they put it. There’s also no standardization of fonts, so if someone scribbles a note in that file, there’s no guarantee that someone else will be able to read it six months later.

Not to mention that it’s just wasteful. Especially wasteful when there are so many ways to avoid it and so many resources to make that easier.

Case in point: One of the things I do regularly is enter and reconcile commission statements from various vendors, but I’ve had to do it by printing the things, manually entering the data into a spreadsheet, and then doing a careful audit to fix the inevitable errors, since some of these run to hundreds of entries.

But then I figured out how to pull the data directly from the statement, slap it into Excel, format it, and then use a few formulae to pull the new info into the old spreadsheet. The great advantages are that it uses the original data directly, so there are no entry errors to deal with. Also, the second pass just involves pulling out a copy of the original statement data and the target input by formula data, putting them side-by-side, using a few more formulae to spot errors due to differences in how names were spelled, making a few tweaks, and reconciling the thing a lot faster than before.

Pre-paperless innovation, a big statement could take me a few days (interspersed among all the other office duties) to finally balance it to zero. New method? I made it through four statements in one day, each one entered and balanced in two steps instead of about six.

The thing is, this isn’t really all that difficult, and anybody could learn to do it. One of the big helps in this process were the Excel functions INDEX and MATCH (which I’ll explain in a future post), and it took all of a two minute Google search and then reading the first good link to figure out how they worked in order to figure out how to do what I needed to do. What I needed to do: Compare the client’s first and last names and insurance plan type in one table in order to pull out a specific number from another. And this is literally all you need to do to learn how to make your office tools work for you.

Try it. Google “change the default font in Word,” or “turn off auto-correct in Word,” or “alternatives to VLOOKUP in Excel,” or any one of a number of other topics, and you’ll find the answers. It really isn’t any more complicated than reading a cookbook and making food from a recipe. Really, it’s not.

Using computers made easy

There is too much of an aura of mystery put around computers, but trust me, they are more simple than you think — and I’ve been working with them since… well, since most of my life, because I was just born at the right time. All that they ultimately understand are “Off” and “On.” “Zero” and “One.” Those are the only two states a switch can be in, that is what digital computing is, and it only gets two digits.

Maybe someday I’ll write a bit about how the electrons inside do what they do and turn it into intelligible information for humans, but for now suffice it to say that they pretty much only do a few things — input, store, and retrieve data through various devices; allow you to manipulate that data with various software programs; then allow you to re-store and output that data, again through various devices.

The nice thing about graphical user interfaces (GUIs) like Windows, OS, Android, Linux, etc., is that they tend to standardize across programs written in them, so that every program tends to use the same convention for the basics: Open, Close, Save, Save As, Print. Programs of the same type will also follow the same conventions — Format, Spellcheck, View, Layout, etc., for text editors; Image, Layer, Select, Filter, Effect, etc., for graphic design programs; Inset, Formulas, Calculate, Data, Sort, etc., for spreadsheets.

Finally, almost every program will have a Help function, whether it’s invoked via the F1 key, or by some combination of a control/alt/Apple/shift-click plus H move. Help menus, when well-done are great and, guess what? They were basically the hyperlinked documents we’ve all come to know and love via the internet, except that they’ve been around since long before the internet. Most of the time, they’ll answer the question but, if they don’t, you can always google it, as I mentioned above.

How to create job security

You may be wondering, “Okay, if my job is just doing data entry, or writing emails, or accounting, or… etc., why do I need to know so much about the software when no one else does?”

Simple. As the economy moves more and more toward service, knowledge becomes value. If you’re the one in the office who gets a reputation as the computer expert, you will get noticed, and you will save a higher-up’s cookies more than once. You’ll also earn the attention and gratitude of your co-workers if you become the one they come to when “I did something and something happened and I don’t know how to fix it,” and you know immediately upon looking that they accidentally, say, set Word to Web Layout instead of Print Layout. It’s called creating job security by taking that extra simple step that too many people refuse to. Try it!

Image Source: NASA, Apollo 11.