Excel Problem Solver: Detect Fraud Using a Spreadsheet and Benford's Law

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
if you suspect fraud is occurring in your business you need to watch this video i'm going to show you a simple way to detect fraud in your transaction data using nothing but a spreadsheet and an obscure law of mathematical probability called benford's law forensic accountants use this technique frequently and so can you this video was made using microsoft excel 2019 [Music] benford's law is also known as the newcomb benford law the law of anomalous numbers or the first digit law in 1881 a canadian-american astronomer made a casual observation when he noticed that his books of logarithms tended to have pages more worn and coffee stained at the beginning than at the end and those beginning tables of logarithms contained logarithms that began with the digit one this casual observation led that astronomer named simon newcomb to the conclusion that when you have a set of truly random numbers you tend to get more numbers beginning with the digit 1 than with the digit 2 and more numbers beginning with the digit 2 than 3 and so on down the line with numbers beginning with the digit 9 being the least frequently observed of all later in 1938 physicist frank benford presented the law of anomalous numbers formally in a paper and benford was eventually given credit for this discovery this is benford's distribution presented graphically you can see the first digits across the horizontal axis at the bottom the digit 1 is expected to occur about 30.1 percent of the time and the first digit of two about 17.6 percent of the time decreasing all the way down to the first digit of nine which is only expected about 4.6 percent of the time it's important to understand that benford's distribution does not apply to all data for example if you took a list of 10 or 7 digit phone numbers in the united states and tried to apply benford's distribution it wouldn't work there are artificial constraints in play on the area codes and the prefixes of those phone numbers conversely if you took the length of every river in the world measured in any unit of measure and applied benford's distribution it would work quite well those numbers are completely random benford's distribution is known to work well with sales transaction data as well as purchasing transaction data and it works very well when you have combinations of numbers being multiplied or added together to form the final number benford's distribution tends to not work very well when numbers have a built-in minimum or maximum they're rounded off or they're subject to some sort of psychological threshold like everything ends in 99 cents so the first thing you need to do is to consider carefully whether any of those constraints are in place on your transaction data keep in mind you may have a policy in place such as all transactions over one thousand dollars need manager approval which will influence your employees but that's not a constraint on the actual transaction amount if everybody's doing their job and being honest i will say however that fraudsters are typically influenced by such rules and transaction data entered by fraudsters doesn't conform to benford's distribution normally unless it's a very clever fraudster we're going to use a four-step approach to attack this problem first we need transaction data and we need that transaction data broken down by employee if we suspect that one of our employees is committing fraud next we're going to chart benford's distribution and we'll be adding things to this chart in the next steps in step 3 we'll determine our desired confidence interval and add that to the chart finally we'll analyze our transaction data first we'll look at all employees grouped together then we'll analyze individual employees in this scenario we run a department store we have different shifts of employees working at the department store and on one of the shifts we've identified that we have some inventory shrinkage we have some unusual inventory variances that we've isolated to this one shift there are four cashiers who work on this shift we'll just call those cashiers cashier a b c and d there are thousands of items sold in our department store and any combination of them could show up at the cashier for checkout we sell food beverages clothing sporting goods and so forth the only constraint on our cashiers is that any transaction over one hundred dollars needs to be approved by a manager with benford's distribution the more data you have the better so we've downloaded a report of sales transactions processed by these clerks that numbers about six thousand transactions so this is our data this is a transaction by employee report and these are the cashiers a through d and their transaction amounts this particular report is sorted by amount so it starts out with very low transactions maybe somebody buying a candy bar and they go down to transactions that are quite large but they're all under a thousand dollars so that's our data all we need to do now is isolate the first digit of each transaction so we can do that with our left function go to one digit and i'm gonna auto fill that down that seems to be working just fine so i'm going to go ahead and copy this column and paste as values to cement those in there and i'm going to ignore the error so we've isolated all the first digits now keep in mind this is a very sanitized report this report may have many other data fields on it i cleaned it up it would have had the time of day and the date and which cash register number and many other things on it potentially all we really need is the cashier and the amount our next step was to chart benford's distribution so i put the digits into a table here these are just numbers in a cell and this is benford's distribution for each number so one should be 30.1 percent of the time and i took these out to five decimal places all of these distributions should add up to one or one hundred percent if you want to think of it that way and they do so let's quickly throw a chart in here that will show our benford curve if you will and that looks pretty good this first one is going to be for all cashiers collectively so i'm going to call that all cashiers now without getting too deep into the statistics this curve is not enough we know that in our sample size of nearly 6 000 transactions that's a pretty big sample but we're probably going to have some deviation from these exact amounts it's inevitable so how much of a deviation are we willing to tolerate that's where z scores and confidence interval comes into play so there's a formula to calculate the upper limit and the lower limit of our confidence interval and i'm going to choose a confidence interval of 95 percent think of the 95 confidence this way this list of transactions is a sample it's not the entire population so how much faith can we put in this sample data when we're trying to make a judgment about the whole population so what we can say is that if we have a deviation from the upper and lower bounds we set on this benford's distribution we can be 95 certain that the entire population deviates as well conversely if our sample data fits within our upper and lower bounds at every digit we can say with 95 probability that the total population conforms to benford's distribution before i put the formulas for the upper and lower bounds on here and plot those we're going to need three items and i've added those to the bottom of our gray area here one is the expected frequency or p sub e of each digit so that's right above in column f i'll just call that benford's distribution now the sample size in this sample over here is 5940 so that'll be our sample size for this one which is going to be all the cashiers we're going to look at the cashiers individually later on but for now that's 5 5946 and the z value for a 95 confidence interval is 1.96 that's a statistical constant if you want to change this to 90 or 99 there's a different constant you can use and i'll let you guys go look that up on the internet so let's compute the upper limit or upper bound and lower limit i'm going to move our chart slightly because i'm running out of room and paste in right here the formulas don't be intimidated these are not that hard so the upper limit will be right here let's read read across from here equals p sub e plus our constant of 1.96 and i'm going to fix that value times and i've got a square root item here so i'm going to put that in parens it's going to be p sub e times paren 1 minus p sub b again divided by n which i will fix closed paren take the square root you do that with a little carrot put one half in parens after that and then we've got to add back the 1 divided by 2n and i'll fix that in so we come up with an upper limit of three one two seven seven that seems like a reasonable answer so i'll show you a neat little trick here since these formulas are virtually identical except for the two minus signs and two plus signs i'm just gonna paste this formula over here for the lower limit and i need to change all these g2s to f2s because that's where my pcb is so i'm going to find the g2s and replace with f2s should get three replacements which we do so now i've got the same answers here and i need to just change these to minus signs saves us a little time so that seems like a reasonable answer there my lower limit is below my benford's distribution number and my upper limits above it so let's paste these down and it looks like all of them turned out that way so i'm gonna assume we did that right so let's uh pull down our chart and add those two series to our chart uh in fact let's get rid of this stuff since we don't need it anymore and we want to select data i'm going to first edit this series because i never gave it a name and i'm going to call that benford's distribution and i'm going to add another series and call it upper limit those values are right here and let's add one more call it lower limit so there you have it we've got our upper and lower limits there now i'm going to make this chart bigger so we can see but the basically the idea is we're going to plot our actual transaction data on here and i'll probably use a different kind of chart for that probably a bar chart so we'll have a combination bar and line chart when we're done so let's go ahead and see how it looks so next we want to get the distribution among all cashiers of the different digits and plot those on this chart so i'm going to create a couple more columns here in order to do that and expand my gray area here with my data i'm going to call this one the digit count and i'm going to call this one the digit frequency so what i want here is for the digit 1 i want to count all the ones in this column over here so i'll use a countif function and i want to count in the range of c 2 to c 5947 and my criteria is simply equal to 1 and you have to put that in quotes let's fix these references so i can paste down we still have a little work to do on these uh so let's change this one to two this one to three this one to four and so on so now i need the frequency let's sum these i'll put a line there that's the number we would expect with that many rows and the frequency distribution of this digit is going to equal this divided by this and i'm going to fix that reference take it out to five decimal places looks like i've got six there so there is my digit frequency distribution and you can already see we've got a little variance from benford's distribution here but it looks reasonably close let's see what that looks like on the chart we're going to add a series to our chart so i'm going to select data and let's add a series and we'll call it all cashiers my values will be right here and let's throw that on there so that's all lines we want to change our chart type so let's right mouse click in the chart area here and you get this chart change chart type option and if you go to all this all charts tab there's a combo chart down here now this is actually not what we have selected right now for the benford's distribution i'd be okay with a smooth line just a regular line for the upper limit i think i'm going to go with the line with markers as well as the lower limit and for the all cashiers i want a bar chart see how that looks now i'm not crazy about this color i'm gonna fill those in gray i'm going to resize this chart and we'll take a close look at it i went full screen mode for this part of the video and what you want to do is go down through each digit here's the digit one and it looks like we're very close to the middle of our upper and lower range there two looks good three four five is close to the bottom but not quite six is a little better seven's a little better eight is quite close to the upper limit and nine is the one digit that is actually over the upper limit and you can just confirm this by looking at our data this distribution for 9 on all cashiers exceeds the upper limit here and this eight is perilously close but it's within bounds that we set so what do we do in a case like this we do have reason to believe that there could be fraud and let me make it clear this deviation from benvert's distribution is not proof of fraud it is reason to suspect fraud however so let's proceed on and the way we're going to proceed on is to look at each cashier individually and see if any of them break out as individuals to analyze each cashier individually i've added a set of two columns for each cashier to my table of data so we need to count the one digits for cashier a only and we're going to use the countifs function there because we have two criteria we need to see an a here and a one here so my criteria range for my first criteria is going to be a3 to a 59 48 because i added a row and the criteria i'm looking for is in quotes equals a now i have to enter my criteria for range two which will be column c over here so that'll be c three to c 5948 and that criteria will be in quotes equals one i'm going to fix all of those references so i can paste this formula down i get 469 for that one and like before i have to change these and we'll get a sum here and my frequency distribution is going to be this well equals this divided by my sum which i will fix the reference for so there is my frequency distribution for cashier a i'm gonna go ahead and repeat this exercise for the other three cashiers so a really easy way to do this is to just simply copy cashier a's data and formulas over here and then ask yourself what's changing here we still want to look for ones in the same range the only thing that's really changing is this equals a needs to be changed to equals b so i'm going to use the find and replace feature and you got to be careful here uh we have um an equals c right here so when we get over to this column that's gonna be a problem so i'm gonna put equals a quote so i don't catch something unintended and i'm going to replace that with equals b quote now i should get nine replacements one for each cell so it looks like that worked okay now we've got a sum here that's fine i'm gonna ignore this error and we need to check this formula remember we we made this cell an absolute over here and we we don't want that so we're gonna have to change all these k12s to m12s so once again use your find and replace and replace dollar sign k dollar sign 12 to dollar sign m dollar sign 12. should get nine replacements as kind of a final check these these columns of frequency should always add to one so you want to make sure those are adding up to one i'm going to go ahead and repeat this exercise for columns c and d for for those cashiers and then check back so i've set up my counts and frequencies for all four cashiers but there's one little fly in the ointment that has cropped up remember these upper and lower limits and the formulas and i'm going to paste these formulas back in here to take another look at them one of the inputs for these formulas was n the sample size and we used 5946 because collectively the four cashiers had that many transactions because each cashier basically split those 5946 transactions roughly evenly we need to re-establish n now technically i should recompute an upper and lower bound for each one of these cashiers but they're all pretty close to 1500 i'm gonna take the average which is 1487 and i'm going to use that as my n and recompute some new upper and lower bounds for each cashier so i think the easiest way to do this and remain organized is going to be to insert a couple columns here and i'll bring these over and paste them in here and instead of all cashiers we'll call this individual for the individual bounds that we're going to use keep this relatively organized so for my upper limit uh what's changing now really i'm just replacing these ends in the formula which is this dollar sign h dollar sign one four with a number which we said was going to be 1487 the average so let's go ahead and do that in both sides here simultaneously using find and replace dollar sign h dollar sign one four and we'll replace that with just a simple number should get 36 replacements because i had that number appears two times in each formula and there's 18 formulas the other thing we need to change in these new upper and lower bounds is the reference to p sub e remember that was a relative reference in our formulas over here it references f3 the benford's distribution column so we need to change all of these js to f's let's see how that works out we should get three times 18 or 54 replacements 54 replacements good and now we have established new individual upper and lower limits and you can kind of tell by looking at these the they're a little wider because of the smaller sample size this goes from 0.277 to 0.324 whereas with the larger sample size it's it's a narrower range of values so my next order of business is to create a chart for each cashier and evaluate them as individuals i think the easiest way to do this since i already created an all cashiers chart is to simply copy that one and i'll go down here a little further and paste it and we'll rename that as cashier a and we'll just go in here with a right mouse click and change the data so the bendford's distribution line will remain the same but the upper limit is going to change to this this individual upper limit so let's go into that one and edit it the name is good but the data needs to be right here then we need to edit the lower limit the name is okay again but the data will be right here and i need to edit this one a little further instead of all cashiers we're going to call this a because that's what will appear on any legend that i decide to do for this chart and the values are going to be right here so let's take a look at that chart and i've got an undesirable color there i'm going to change that to a nice muted gray so cashier a how does this cashier stack up for the digit one it looks like the cashier is in limits two three it looks like this cashier is well within limits so i'm going to go ahead and repeat this exercise for cashier b c and d and then we'll take a look at all of them so i've completed my charts and i kind of standardized them a little bit i made the upper and lower bounds the same color and we already looked at cashier a cashier a looked like that person was in limits let's look at cashier b uh i see some minor deviations from benford's distribution but again given the sample size uh and the tolerance that we've chosen it looks like this cashier is in limits how about cashier c as you go down through the digits there's some deviation again but looks good cashier d this cashier has an unusual and unnatural disproportionate percentage of transactions that begin with eight or nine this is a little suspicious this is not proof of fraud but it is definitely something to look at with cashier d especially in light of the fact that we have a 100 managers approval threshold i wonder if this cashier has an accomplice who's coming through the line at her cashier every day and she or he is not ringing up all of the items and keeping it under one hundred dollars so that is definitely something to look at that's how to apply benford's distribution to some sales transaction data if you've enjoyed the content of this video and you'd like to see more like it please hit my subscribe button i will be releasing one new video a week minimum for the foreseeable future and you can learn about excel and expand your excel skills thanks for visiting spreadsheet geek [Music] you
Info
Channel: undefined
Views: 4,943
Rating: 4.9793816 out of 5
Keywords: excel 2019 tutorial for beginners, Excel tips, excel 2016 tutorial, leila gharani, excel formulas and functions, excel tutorial, excel for analysts, advanced excel tutorial, advanced excel training, benfords law, fraud detection, Forensic, Detect fraud, benford law, law of anomalous numbers, excel left function, detect crime, Statistics, Excel statistics, excel analysis tools, benford law excel, benford law fraud detection, benford law of large numbers, Election Fraud
Id: kWPb9a6vEZI
Channel Id: undefined
Length: 34min 13sec (2053 seconds)
Published: Sun Sep 20 2020
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.