Why is my SQL query taking so long?

This is one of the questions we, as Data Analysts or Data Scientists, almost always ask ourselves, mostly when we are dealing with large databases.

Also, how many times do we have the error ‘No more memory spool space for [USER]‘? I have had that error so many times that I cannot count them, most of the time this is caused because we are doing an inefficient query.

In this article, we are going to learn a bit about how to make an efficient query

Is using subqueries a good idea?

It really depends on how many rows will the subquery parse (And how many rows would not doing it would generate), let me give you two examples:

First, we are going a make a subquery to get the names in a table that has all the addresses, names, and surnames of the people who live in a country, it would look something like this:

SELECT a.* FROM (SELECT name FROM database.table_with_people) a 

However, It looks a bit gross, doesn’t it? Well, it is actually inefficient too.

When you define a subquery, your SQL database will always first do this subquery instance before doing the actual query, so in this case, you would be doing a query and then from that query getting the results to show, this is kind of bad when dealing with very big datasets, it would be more efficient to do this query in this case:

SELECT name FROM database.table_with_people

Now with a bit more complexity

For instance, let’s assume you not only have the addresses, names, and surnames of all the people living in a country, but you also have their whole credit card history (Not only with their Credit Card debts, but also with their credits, dates, and card ids) in another table and a unique id to identify, let’s try to get their last credit card transaction.

SELECT a.unique_id, b.credit_card_transaction AS credit_card_last_transaction FROM database.table_with_people a
JOIN database.credit_card_transactions b
on a.unique_id=b.unique_people_id
QUALIFY row_number() OVER (PARTITION BY a.unique_id ORDER BY b.date_transaction DESC)=1

However, this code as is, will probably get you an error, why? Let’s take a look at what we are doing here:

  • First, we are selecting all the people in database.table_with_people
  • Then, we are LEFT JOINING all the data in database.credit_card_transactions
  • Finally, we are selecting the last row and ordering them by date_transaction (row_number() would put a fictional number to the data and with QUALIFY = 1 we are getting the last value because values were ordered from last to first).

But doesn’t something of all this sounds bad to you? We are getting ALL the transactions of the people in our query FIRST and then FILTERING. You are doing this on a table that probably has thousands of millions of data rows.

The right way of doing a subquery

So what should we do? In fact, a good approach here would be using a subquery and filtering before the JOIN, which would look something like this:

SELECT a.unique_id, b.credit_card_transaction AS credit_card_last_transaction FROM database.table_with_people a
JOIN (SELECT unique_people_id, credit_card_transaction FROM database.credit_card_transactions
     QUALIFY row_number() OVER (PARTITION BY unique_people_id ORDER BY date_transaction DESC)=1) b
on a.unique_id=b.unique_people_id

In spite of we are querying the same data as before, now your query will probably work fine, as you filter the data BEFORE making the JOIN.

INNER JOIN VERSUS LEFT JOIN

Likewise, another common mistake is using LEFT JOIN when you just need an INNER JOIN, let’s take the last query that was working fine and find out:

SELECT a.unique_id, b.credit_card_transaction AS credit_card_last_transaction FROM database.table_with_people a
LEFT JOIN (SELECT unique_people_id, credit_card_transaction FROM database.credit_card_transactions
     QUALIFY row_number() OVER (PARTITION BY unique_people_id ORDER BY date_transaction DESC)=1) b
on a.unique_id=b.unique_people_id

For instance, we assume all the data was correct, right? Well, it probably isn’t. What would have happened if a person didn’t have transactions? Our query would have returned a null value in unique_people_id when we join the two tables, but also, let’s assume that all not transactions have an unique_people_id (Trust me, this happens A LOT when you work with very big large datasets), so that would have been a null value also… Here is a little graph of how a LEFT JOIN works (Not the typical graph that doesn’t explain anything, a real graph):

Left join example

As you may see in our super graph, Mary Condo has two relationships with table two, this means that she will appear two times in your SELECT results, one for each row in the join that matches. And John Bay doesn’t have an unique_id yet (Maybe because it isn’t assigned), so it will match with a credit_card_transaction that won’t be his. So, how do we mitigate these errors?

We are getting there

Well, one way, we saw it before, right? Even if it is another query, using qualify to get just one value per unique people id would work but the NULL error will still remain, so maybe you should do something like this:

SELECT a.unique_id, b.credit_card_transaction AS credit_card_last_transaction FROM database.table_with_people a
LEFT JOIN (SELECT unique_people_id, credit_card_transaction FROM database.credit_card_transactions
     QUALIFY row_number() OVER (PARTITION BY unique_people_id ORDER BY date_transaction DESC)=1) b
on a.unique_id=b.unique_people_id
WHERE a.unique_id is NOT NULL

This will give us the results of the people that do have unique_id only, but what about people that have unique_id AND doesn’t have any transactions? Well, they will come in the result too. This will mean that we are querying our second table for these people, and they really don’t have any result that matters to us, we were looking only for the last credit card transaction of the people that really have credit card transactions. An inefficient way to solve this would be:

SELECT a.unique_id, b.credit_card_transaction AS credit_card_last_transaction FROM database.table_with_people a
LEFT JOIN (SELECT unique_people_id, credit_card_transaction FROM database.credit_card_transactions
     QUALIFY row_number() OVER (PARTITION BY unique_people_id ORDER BY date_transaction DESC)=1) b
on a.unique_id=b.unique_people_id
WHERE a.unique_id is NOT NULL and b.unique_people_id is NOT NULL

And again, as in our first subqueries, we are filtering AFTER querying the data.

Making the right query for the result we want

Nevertheless, this last query probably won’t get an error, is really inefficient. However, a way to solve this for good would be using INNER JOIN, it will only query and display the data that matches, and you won’t have so many problems in your query performance, it would be something like this:

SELECT a.unique_id, b.credit_card_transaction AS credit_card_last_transaction FROM database.table_with_people a
INNER JOIN (SELECT unique_people_id, credit_card_transaction FROM database.credit_card_transactions
     QUALIFY row_number() OVER (PARTITION BY unique_people_id ORDER BY date_transaction DESC)=1) b
on a.unique_id=b.unique_people_id
WHERE a.unique_id is NOT NULL

Where should I put my WHERE clause?

Let’s assume now that you don’t need the credit card last transaction, instead you need the sum of all the transactions made by each person last month, one way to do this would be:

SELECT a.unique_id, SUM(b.credit_card_transaction) AS sum_of_last_month_transactions FROM database.table_with_people a
JOIN database.credit_card_transactions b
on a.unique_id=b.unique_people_id
WHERE a.unique_id is NOT NULL AND b.date_transaction BETWEEN date and date-31

At this point, you actually can see what I am talking about, right? This will absolutely exhaust your memory usage, again, you are filtering the result after the query is done.

A nice solution for this would be, again, a subquery and filtering the data BEFORE doing the join of the queries.

SELECT a.unique_id, SUM(b.credit_card_transaction) AS sum_of_last_month_transactions FROM database.table_with_people a
JOIN (SELECT unique_people_id, credit_card_transaction FROM database.credit_card_transactions WHERE date_transaction BETWEEN date and date-31) b
on a.unique_id=b.unique_people_id
WHERE a.unique_id is NOT NULL

Conclusion

Making a good and efficient query depends really on the data and tables you are working on, you should always try to minimize the impact in memory of your query and try to find out how would the query works more efficiently. You have seen some ways here how to achieve this with some examples that will help you in your professional life, you should really, REALLY think what is the information you are trying to find and what would be the best way to achieve this. Databases will always work better without subqueries but sometimes the data you need is so heavy without using them that you will most definitely find exhausted memory if you don’t use them.

You can learn more about subqueries and how to use them in Microsoft SQL Server here.

Leave a Reply