SQL Query Optimization

IMPORTANT: This section refers to OrientDB v 2.2 only.

Some of these tips are also valid for previous 2.x versions.

V 3.0 has a completely new execution planner, so none of these tips can be considered valid on that version.

The SQL executor is a quite complex component and one of the oldest pieces in the architecture of OrientDB, writing efficient queries requires some knowledge of the internals and of the related components, like indexes and clusters. This said, following some basic guidelines allows to reach a good level of performance for most of the typical SQL queries. This section is intended to provide these guidelines and to help end users to write efficient SQL queries in OrientDB. We will concentrate on SQL SELECT statement, but most of these guidelines also apply to other statements.

Case A: Anatomy of an SQL query

Let's start from a basic but complete use case:

SELECT name, surname 
from Person 
WHERE age = 25
ORDER BY name ASC
SKIP 10 LIMIT 10

Let's start from the base situation:

  • 1.000.000 Person records in the DB
  • evenly distributed by age (0 - 99)
  • no indexes defined

For this simple statement, the query executor will perform the following actions:

  • fetch records from Person class
  • for each record, filter by age
  • calculate projectionis (name and surname)
  • order the resulting records based on name property
  • skip the first 10 records
  • return 10 records

NOTE: the ordering of the result is executed after the calculation of the projections, so the result can only be sorted by name or surname, ie. not by age

Some facts:

  • the SQL executor will scan the whole Person class to fetch all the records that match the condition...
  • ...even if only 10.000 of them will match the condition age = 25
  • the ORDER BY takes into consideration the SKIP/LIMIT, so it keeps in memory only 20 records with the "smallest" name (then the SKIP will discard the first 10)
  • the sorting is executed in HEAP memory

The most relevant operation here is the scan of 1.000.000 records (likely loading them from disk)

Case B: Index based filtering

Based on Case A, the first optimization we can do here is adding an index on age

CREATE INDEX Person.age on Person (age) NOTUNIQUE

With this simple optimization, the query execution plan becomes as follows:

  • fetch records from Person.age index, where key = 25
  • calculate projectionis (name and surname)
  • order the resulting records based on name property
  • skip the first 10 records
  • return 10 records

Some facts:

  • in this case the SQL executor will only fetch from disk the 10.000 records that match age = 25
  • the ORDER BY still only keeps in memory 20 records with the "smallest" name (then the SKIP will discard the first 10)

Compared to Case A, OrientDB only fetches from disk 1/100 of the records, so you can expect the original query to be 100 times faster approximately.

Case C: Index based sorting

Based on Case A (suppose no other indexes are available, ie. we never performed the optimization provided with Case B), another optimization we can do is adding an index on name property to speed up the sorting operation.

CREATE INDEX Person.name on Person (name) NOTUNIQUE

With this optimization, the query execution plan becomes as follows:

  • fetch records from Person.name index in ascending order
  • calculate projectionis (name and surname)
  • for each record, filter by age
  • collect the first 20 valid results
  • discard 10 records (SKIP)
  • return 10 records

Some facts:

  • There is no need for sorting here, as the index already provides records sorted by name
  • If you are lucky enough, the first 20 records will have age = 20, so the query will take ~1/50.000 of the original one...
  • ...but if you are particularly unlucky, to find 20 records that match age = 20 you will have to scan all the original dataset, so the performance will be the same as the Case A

IMPORTANT: Sorting based on indexes can be only performed on tree-based indexes (ie. UNIQUE and NOTUNIQUE indexes). All the other types of indexes (eg. NOTUNIQUE_HASH_INDEX, UNIQUE_HASH_INDEX, LUCENE) do not support sorting, so they will be ignored for ORDER BY operations.

Case D: Index based filtering + sorting

Let's try to mix Case B and C together and see if we can do better:

The naive approach of using both indexes together won't work:

//WRONG!

CREATE INDEX Person.age on Person (age) NOTUNIQUE
CREATE INDEX Person.name on Person (name) NOTUNIQUE

With these index definitions, OrientDB will be able to use only one index to optimize the query. In this case it will choose the index for filtering and will discard the other one.

IMPORTANT: THE EXECUTION PLAN CANNOT MIX INDEXES FOR SORTING AND FILTERING. IT WILL ALWAYS CHOOSE THE INDEX FOR FILTERING AND WILL IGNORE THE OTHER ONE.

OrientDB can actually exploit indexes for both filtering and sorting, but it has to be the same index:

//CORRECT

CREATE INDEX Person.age_name on Person (age, name) NOTUNIQUE

With this index, the query execution plan becomes much more efficient:

  • fetch records from Person.age_name index in ascending order, where age = 25
  • discard 10 records (SKIP)
  • return 10 records

This execution will ALWAYS fetch only 20 records from the storage, so the query performance is always 50.000x faster than Case A

Case E: Equality and Inequality conditions

Let's consider three different statements:

SELECT FROM Person WHERE age = 25

SELECT FROM Person WHERE age <> 25

SELECT FROM Person WHERE age > 25

The first statement has an equality expression; to execute it, OrientDB can use any type of index (apart from fulltext and spatial), ie. tree based and hash indexes.

The second statement has a "not equals" condition. OrientDB will never use indexes to optimize it. <> is equivalent to !=

The third statement has a "range" condition (range operators include >, <, >=, <=); OrientDB can only use three-based indexes (ie. UNIQUE and NOTUNIQUE) to optimize range queries. Hash indexes will be ignored.

Case F: Composite indexes - full match

A composite index is an index defined on multiple properties. Consider the following

CREATE CLASS Person
CREATE PROPERTY Person.name STRING
CREATE PROPERTY Person.surname STRING
CREATE PROPERTY Person.age INTEGER
CREATE PROPERTY Person.karma INTEGER

And an index defined as follows:

CREATE INDEX Person.name_surname_age_karma on Person (name, surname, age, karma) NOTUNIQUE

This index can of course be used for a full match, eg.

SELECT FROM Person WHERE name = 'foo' AND surname = 'bar' AND age = 25 AND karma = 100

Case G - Composite indexes - partial match

Consider the schema and the index defined in Case E. This index can also be used for partial queries, eg. the following queries can use that index to optimize the search

SELECT FROM Person WHERE name = 'foo' AND surname = 'bar' AND age = 25 

SELECT FROM Person WHERE name = 'foo' AND surname = 'bar'

SELECT FROM Person WHERE name = 'foo'

The partial match is allowed only on a prefix of the index definition. The following query won't be optimized by the above mentioned index:

//NOT INDEXED
SELECT FROM Person WHERE surname = 'bar' AND age = 25 AND karma = 100

//NOT INDEXED
SELECT FROM Person WHERE age = 25 

//NOT INDEXED
SELECT FROM Person WHERE karma = 100

IMPORTANT: Only tree-based indexes (ie UNIQUE, NOTUNIQUE) can be used for partial match. Hash indexes (eg. UNIQUE_HASH_INDEX, NOTUNIQUE_HASH_INDEX) will be ignored for partial match.

Case H - Composite indexes - range queries

Tree-based indexes can be used to optimize both equality and range queries. The same applies to composite indexes, with the only limitation that the range condition has to be on the last property that is used for index search. Let's make it clear with an example:

Given the schema and the index defined in Case E, consider the following query:

SELECT FROM Person WHERE name = 'foo' AND surname = 'bar' AND age = 25 AND karma > 100

This query will be executed using the full index (ie. on properties name, surname, age and karma).

Now consider the following:

SELECT FROM Person WHERE name = 'foo' AND surname = 'bar' AND age > 25 AND karma > 100

now the range condition is on age, that is the third property in the index definition. In this case, the query will be executed as follows:

  • fetch from the index, based on name, surname and age
  • filter the resulting records by karma

So if you have 1000 records with the same name, surname and age, but only one has karma > 100, the query will fetch all the 1000 records and filter them one by one, based on karma value.

This happens because now the first range condition is age > 25, this condition short-circuits the range query

The same would have happened if the condition on karma was an equality condition (ie. karma = 100); all the conditioins after the first range condition are ignored in partial index match.

IMPORTANT: range conditions short-circuit partial index usage

Case I - Composite indexes - partial match and sorting

As discussed in Case D, indexes can be used for filtering and sorting at the same time. This also applies to partial match. Consider the domain and the index defined in Case E and the following query:

SELECT FROM Person WHERE name = 'foo' AND surname = 'bar' ORDER BY age

In this case the query executor will use the index for both filtering (partial match on name and surname) and for sorting.

The conditions for this to happen are following:

  • the filtering has to be done based on equality conditions (ie. no range conditions)
  • the sorting has to be executed on a property that, in the index definition, is right next to the properties used for filtering

To make it clear, consider this scheme:

CREATE INDEX theIndex on TheClass (prop1, prop2... propN) NOTUNIQUE
SELECT FROM TheClass
WHERE
prop1 = ...
AND prop2 = ...
...
AND propX = ...
ORDER BY `propX+1`
          ALL EQUALITY CONDITIONS            NOTHING IN 
                    |                        THE MIDDLE
 +------------------+--------------------+     |
 |                                       |     |
equality    equality    equality    equality   |        
condition   condition   condition   condition  |  ORDER BY
    |           |           |           |      |    |       
    V           V           V           V      |    V       
  prop1       prop2       ...         propX    V  propX+1     ....   propN

IMPORTANT: both partial match and sorting are allowed only on tree-based indexes

Case J - IN condition

Consider a query like this:

SELECT FROM Person WHERE name in ['foo', 'bar']

This query can be optimized with an index that is defined on name property only (eg. not on an index that is defined on name and surname).

this is not a limitation in the index engine, but just a limitation in the implementation of the SQL executor, there is a chance that in next 2.2.x releases it will be addressed. This limitation does not exist in V 3.0.

The same applies to a query on a composite index, eg.

SELECT FROM Person WHERE name = 'foo' AND surname in ['xxx', 'yyy']

This query can be optimized using an index defined on name and surname.

As a general rule, IN conditions are optimized using indexes only when all these conditions apply:

  • the index can be used for a full match (not on partial match)
  • the IN condition is defined on the last property in the index definition

In all the other cases, the IN condition is not optimized using indexes

Case K - order of conditions in the WHERE clause

In v 2.2, OrientDB SQL executor tries to find the best index based on the conditions defined in the query, but in some cases if fails to find the right combination of conditions to consider for indexed execution.

An important rule to make OrientDB find the right index (and use it the right way) is to write the conditions in the same order as the properties appear in the index definition.

Consider the schema and index defined in Case E, a query like following

SELECT FROM Person WHERE name = 'foo' AND surname = 'bar' AND age = 25

Will correctly use the index on all the three properties. A query like following

//wrong order of properties
SELECT FROM Person WHERE surname = 'bar' AND age = 25 AND name = 'foo'

in some cases will only use the index to match the name and then will manually filter on surname and age

This is particularly relevant when using parentheses, eg. the following query

//wrong order of properties + parentheses
SELECT FROM Person WHERE (surname = 'bar' AND age = 25) AND (name = 'foo')

will likely fail to correctly use the index

IMPORTANT: always try to write your WHERE clause so that the order of the conditions matches the order of the fields in the index definition, this will make it easier for OrientDB to find the right index and use it correctly.

NOTE: this limitation is completely removed in v 3.0

Case L - AND vs OR conditions

Consider the following schema:

CREATE CLASS TheClass 
CREATE PROPERTY TheClass.prop1 STRING
CREATE PROPERTY TheClass.prop2 STRING

CREATE INDEX TheClass.prop1 on TheClass (prop1) NOTUNIQUE
CREATE INDEX TheClass.prop2 on TheClass (prop2) NOTUNIQUE

In such a scenario, you can write queries that involve both prop1 and prop2 and OrientDB will have to choose which index (or indexes) to use.

A simple case is following:

SELECT FROM TheClass WHERE prop1 = 'foo' OR prop2 = 'bar'

in this case, OrientDB will use both indexes and will merge the results.

OrientDB can use mutiple indexes when OR conditions are involved

Another possible case is this (please note that in this case an AND condition is involved):

SELECT FROM TheClass WHERE prop1 = 'foo' AND prop2 = 'bar'

in this case OrientDB will use only one index (eg. the one defined on prop1) and then will apply the rest of the conditions record by record.

The way OrientDB chooses the index depends on how many properties are involved in the indexing. Typically, OrientDB will try to use as many properties as possible for indexed query. When two indexes with the same number of properties are available, the choice is simply based on internals (eg. on the order the indexes appear in the schema).

As a general rule, you should not rely on this mechanism, because the general performance of the query is not completely predictable.

A better approach is to re-write the query as a chain of sub-queries, to make sure that he inner query uses the right index. Eg. if you want OrientDB to use the index on prop2, you can write the query as follows:

SELECT FROM (
   SELECT FROM TheClass WHERE prop2 = 'bar'
) WHERE prop1 = 'foo'

Case M - Optimization of count(*)

OrientDB can optimize a count(*) in some basic cases:

  • when it is performed on a class or on a cluster, without further conditions, eg.
SELECT count(*) FROM Person
  • when it's performed on an indexed query without further filtering:
--suppose an index is defined on "name"
SELECT count(*) FROM Person where name = 'a'

Understanding EXPLAIN command

EXPLAIN is a very useful tool to understand how a query is performing. To use it, just prefix the query with explain keyword, eg.

EXPLAIN SELECT FROM Person WHERE name = 'foo'

The result is a record containing statistics about the query execution, eg.

{
    "result": [
        {
            "@type": "d",
            "@version": 0,
            "documentReads": 2,
            "fullySortedByIndex": false,
            "documentAnalyzedCompatibleClass": 2,
            "recordReads": 2,
            "fetchingFromTargetElapsed": 0,
            "indexIsUsedInOrderBy": false,
            "compositeIndexUsed": 1,
            "current": "#74:0",
            "involvedIndexes": [
                "Person.name_surname_age_karma"
            ],
            "limit": -1,
            "evaluated": 2,
            "user": "#5:0",
            "elapsed": 0.655,
            "resultType": "collection",
            "resultSize": 1
        }
    ]
}

NOTE: In v 2.2, OrientDB will have to execute the query to calculate the explain, so if the query takes a lot of time/memory/resources to execute, the explain will take a lot as well.

The first information you can get from this is whether one or more indexes were used to execute the query. Just check involvedIndexes property in the result. If this property is not in the result, then the executor did not use any indexes.

Unfortunately, involvedIndexes does not give you any information about how the index was used, eg. you don't know if it was used for full match or for partial match.

To have some more information, you can check recordReads property. It reports how many records were actually fetched an analized.

Another useful information is provided by fullySortedByIndex: if it returs true it means that no ORDER BY operations were performed in memory, but all the sorting relies on indexes. Unfortunately you do not have the opposite information, ie. if the index was used only for sorting and not for filtering.

Sneak peek of V 3.0

Please note that all what was described above is going to change in v 3.0.

As a preview, here you can see the result of an EXPLAIN in OrientDB V 3.0

------------------------------
-- result of an EXPLAIN in V 3.0
------------------------------

explain select Name from Monuments 
where Id < 15 and Name LIKE 'Statua%'
ORDER BY Name
SKIP 1 LIMIT 3
+ FETCH FROM INDEX Monuments.Id
  Id < 15
+ EXTRACT VALUE FROM INDEX ENTRY
+ FILTER ITEMS WHERE 
  Name LIKE 'Statua%'
+ FILTER ITEMS BY CLASS 
  Monuments
+ CALCULATE PROJECTIONS
  Name
+ ORDER BY Name ASC
  (buffer size: 4)
+ SKIP ( SKIP 1)
+ LIMIT ( LIMIT 3)

In V 3.0 the execution planner will be much smarter (and already is, in the SNAPSHOT release) than in v 2.2:

  • the order of the conditions and parentheses does not matter anymore, OrientDB can figure out all the indexes that can be used for a query
  • the choice of the right index for a query is based on statistics collected during query execution
  • the execution planner is much more explicit, so you can exactly know how the query is being executed
  • the EXPLAIN is calculated without executing the query, so it returns immediately, also for very expensive queries

results matching ""

    No results matching ""