Statistics with SQL

Abhijeet Kamble
3 min readDec 5, 2019

Data scientists often use statistics to better understand data sets. SQL includes functions for describing data sets, so let’s see how to use some of the most commonly-used statistic functions, to better understand our data. I’m going to start by selecting from the staff column, I will select star, from staff, and I’ll limit this to 10, cause I’m just trying to get a feel for the data. And we’ll notice that this table includes information about employees, which includes some string characters and some dates, and some integer values to work with. Now, one of the things I might like to know is, how much does the company pay in salary to all of its staff, across a given year? Well, the way to do that is to first of all, we’ll remove the limit clause, cause we want to look at everyone, and we want to use the sum function, which is one of those statistic functions, and it’s applied to a column, like salary, and I’ll execute the query, and notice that the result is approximately $97.3 million is paid per year, across all of the staff. Now next, I might like to understand how much does each department pay in salary? Well, to do that, I’d want to use the group by clause again. So I will group by department, and I’ll be sure to include the department name in the results set. I’ll execute that query, and I’ll notice that I have 22 rows returned, one for each department, and also, a value of the total amount spent in salary each year for that department, okay, that’s really useful information. Different departments may have different number of employees so it’s also helpful to understand what’s the average salary paid per employee in department. To find that, I can use the AVG or average function, and ask for the average salary, and since I’m grouping by department, the average function will calculate the average salary per department, and I’ll simply execute the statement, and now we’ll notice that I have a third column added. And I’m just going to spread it out here, to make it easier to read. And this is the average salary per department. There are a couple of other statistics that are sometimes used by statisticians and others who are familiar with statistics. These are called the variance, and standard deviation. We won’t go into details about what these mean, but I’ll roughly describe them as both good ways of measuring how spread out data is around an average. So, let’s look first at the variance, and how to calculate that. SQL has a function called var pop, short for variance of population, and that’s applied to a numeric column, like salary. And when we execute this statement, we’ll get a fourth column added to our results set, and this is known as the variance. I won’t go into details about it, but it’s a good way of measuring how spread out the data is. A more commonly used statistic for understanding the distribution of data is something called the standard deviation. In SQL, that function is stddev, underscore pop, which is short for standard deviation of population. And we want to know the standard deviation over the salary, so we’ll use that function, and execute, and again, another column is added to our result set, and as we can see, this number indicates the standard deviation over the salary, by department. Now you’ll notice that when we use functions like average, and standard deviation, we’re getting a lot of decimal places after the decimal point. Since we’re talking about salaries, we’re talking about monetary values, it’s probably helpful to round these to two decimal points. To do that, we simply use the round function, and apply it to the average, the variance, and the standard deviation. So I’ll do that right now, I’ll add round, and one of the things I want to do is tell the round function to round the average salary, and keep just two decimal points, and I’ll do the same thing with variance, I’ll ask to round it and keep just two decimal points, and finally I’ll do the same for standard deviation. And when I execute, you’ll notice that each column that has the round function applied to it, has only two decimal places, this makes the results a little easier to read. Now I will say, if you’re not familiar with statistics, and terms like variance and standard deviations are not things you’ve come across before, that’s nothing to worry about, it’s just important to know that if you do need them in the future, the functions are available to you in SQL.

--

--