Available in versions: Dev (3.20) | Latest (3.19) | 3.18 | 3.17 | 3.16 | 3.15 | 3.14 | 3.13 | 3.12 | 3.11 | 3.10
SQL: SELECT DISTINCT
Applies to ✅ Open Source Edition ✅ Express Edition ✅ Professional Edition ✅ Enterprise Edition
The SELECT DISTINCT
syntax can occasionally be useful to remove duplicates in case your schema isn't normalised. For example, in a log table, you may want to query for distinct IP addresses:
SELECT DISTINCT ip FROM access_log
create.selectDistinct(ACCESS_LOG.IP).from(ACCESS_LOG).fetch();
This is especially useful for ad-hoc querying.
A lot of times, when removing such duplicates in application queries, you'll want to add aggregate functions (e.g. number of log entries per IP, number of distinct logins per IP, etc.), so you'll use GROUP BY, instead of the DISTINCT clause.
However, be wary of using DISTINCT
for removing duplicates that you're not entirely sure why they happen, i.e. typically in very large and complex queries, but also a lot of times when there are (INNER) JOIN expressions, when in fact there should be SEMI JOIN expressions. For example, the following query looks for all authors of books available in book stores starting with the letter A
:
SELECT DISTINCT author.first_name, author.last_name FROM author JOIN book ON author.id = book.author_id JOIN book_to_book_store ON book.id = book_to_book_store.book_id WHERE book_to_book_store.name LIKE 'A%'
create.selectDistinct( AUTHOR.FIRST_NAME, AUTHOR.LAST_NAME) .from(AUTHOR) .join(BOOK).on(AUTHOR.ID.eq(BOOK.AUTHOR_ID)) .join(BOOK_TO_BOOK_STORE) .on(BOOK.ID.eq(BOOK_TO_BOOK_STORE.BOOK_ID)) .where(BOOK_TO_BOOK_STORE.NAME.like("A%")).fetch();
Now, due to the nature of our schema:
- Authors have written many books
- Books are available from multiple book stores
Hence, it is very likely that the same author appears in the result multiple times, thus DISTINCT
. From a database performance perspective, this may be very bad, because the authors are duplicated into some internal data structure before they're removed again from the result. Besides, depending on the optimiser, we might have to scan all the books and all the book stores even if we already found an author. Optimisers are very different, so this may or may not be a problem of equal severity in each RDBMS, but certainly something to keep an eye out for.
A much better solution here is to use a SEMI JOIN, e.g. using the IN predicate or EXISTS predicate (they're semantically equivalent):
SELECT author.first_name, author.last_name FROM author WHERE author.id IN ( SELECT book.author_id FROM book JOIN book_to_book_store ON book.id = book_to_book_store.book_id WHERE book_to_book_store.name LIKE 'A%' )
create.select( AUTHOR.FIRST_NAME, AUTHOR.LAST_NAME) .from(AUTHOR) .where(AUTHOR.ID).in( select(BOOK.AUTHOR_ID) .from(BOOK) .join(BOOK_TO_BOOK_STORE) .on(BOOK.ID.eq(BOOK_TO_BOOK_STORE.BOOK_ID)) .where(BOOK_TO_BOOK_STORE.NAME.like("A%") ).fetch();
Now, we'll never get any duplicate authors.
If in doubt, check your execution plan and execution metrics about what is the best query here, but in a lot of cases, DISTINCT
is a hint that something unnecessary is being done, and then duplicates are removed again.
Feedback
Do you have any feedback about this page? We'd love to hear it!