6 November 2015
6 November 2015,
 Off

Author: Luigi Dell’Aquila  – @ldellaquila

Do you know SQL? If you’re a software developer/DBA/DevOps, I’m pretty certain that I know your answer to that question.

This is precisely why OrientDB adopted SQL as its main query language.

SQL is elegant, clear, well-defined, and powerful (for what concerns RDBMS), but it was created during a period when tables were the only allowed data structure (you know, relational databases, tables, primary keys, foreign keys…)

Now, with NoSQL, things have changed. Data structures are now much more complex, so  SQL requires some enrichments to fit this new model.

Let’s see how OrientDB SQL is similar to the standard SQL and how it differs.

The past: Join, Filtering, Projection, Aggregation

This is how standard SQL works in relational databases.

Let’s take a look with an example: calculate the population of all European countries (that is the sum of all the populations of its cities).

City:

name countryId population
Rome 32 2.500.000

 

Country

id name
32 Italy

 

SELECT country.name, sum(city.population)
FROM city, country 
WHERE city.countryId = country.id
AND country.continent = ‘Europe’
GROUP BY country.name

 

You can think at the execution of this query in four steps:

  1. Join: match city and country tables based on a primary key (country.id) and a foreign key (city.countryId)
  2. Filtering: filter results based on some attributes (the continent name in this case)
  3. Projection: choose a subset of attributes to be returned (country name and city population in this case)
  4. Aggregation: aggregate data, calculate the sum of populations grouping by country name


This is a typical pattern in SQL, but it has some problems:

Problem 1: the Join operation is EXPENSIVE, it relies on a convention (the same value on country.id and city.countryId) and it has to be re-calculated at every query execution.

The situation is even more critical when you have MULTIPLE joins. Let’s make it more complicated:

SELECT country.name, sum(city.population)
FROM city, region, state, country, continent
WHERE city.regionId = regiony.id
AND region.stateId = state.id
AND state.countryId = country.id
AND country.continentId = continent.id
AND continent.name = ‘Europe’
GROUP BY country.name

 

How much time will it take to execute?

Problem 2: We are assuming that we know how deep the hierarchy is, but in some cases, we simply don’t know. Suppose you are dealing with an organization structure, with departments, managers and so on, how can you write a query that returns all the sub-trees under a manager?

Problem 3: Until now, we just assumed that our data is made of simple attributes, eg. the country name is a string, the city population is an integer. But this does not represent reality very well.

An example could be having a list of names for the same city.

names countryId population
[Rome, Roma] 32 2.500.000

Or even better, a list of names with languages 

names countryId population
[  {name: Rome, language: EN},   {name: Roma, language: IT} ] 32 2.500.000

It is not possible in standard RDBMS (but it is in NoSQL), so regular SQL lacks the commands to manage this kind of data. 

Problem 4: The schema is well-defined here, so we know exactly which properties a single table has and SQL was designed to work on this assumption. It’s not true in NoSQL, so we will need some extensions to manage cases where we don’t know the schema (or where there is no defined schema at all).

From Join to links

A Join operation is expensive, so OrientDB replaced it with something better: physical links

City:
{
   @rid: #11:50,
   name: “Rome”,
   population: 2.500.000,
   country: #12:32
}

Country:
{
   @rid: #12:32,
   name: “Italy”
}

 

The big difference here is that @rid values are not a convention, they just represent a physical position of a record, so calculating a relationship is a very fast operation, O(1) while the cost of a single lookup for a primary key in an RDBMS is O(logN).

Efficiency is a big advantage, of course, but SQL has to be enhanced to leverage this new element.

In OrientDB SQL, a join is not allowed, so relationships are calculated using dot notation:

Relational OrientDB (Document)
SELECT country.name, sum(city.population) 
FROM city, country 
WHERE city.countryId = country.id 
AND country.continent = ‘Europe’
GROUP BY country.name
SELECT country.name, sum(population)
FROM city
WHERE country.continent = ‘Europe’
GROUP BY country.name
SELECT country.name, sum(city.population) 
FROM city, region, state, country, continent
WHERE city.regionId = regiony.id
AND region.stateId = state.id
AND state.countryId = country.id
AND country.continentId = continent.id
AND continent.name = ‘Europe’
GROUP BY country.name
SELECT region.state.country.name as contryName, sum(population)
FROM city 
WHERE region.state.country.continent.name = ‘Europe’
GROUP BY countryName

 

From Links to Edges

OrientDB is also a Graph Database, so you can also represent your data elements as Vertices and relationships as Edges.

The big difference is that Edges are bidirectional, which means you can traverse them in both directions with the same efficiency, using out() and in() operators for outgoing and incoming edges.

Eg.

sql-img1

you can retrieve the continent of a city with:

SELECT out(‘inRegion).out(‘inState’).out(‘inCountry).out(‘inContinent’)
FROM city 
WHERE name = ‘Rome’

 

but you can also retrieve all the cities in a continent:

SELECT in(‘inContinent’).in(‘inContry’).in(‘inState’).in(‘inRegion’)
FROM continent 
WHERE name = ‘Europe’

 

Deep traversal

Sometimes you know how deep you have to traverse (eg. city -> region -> state -> country), but in other cases, you just don’t know. Consider having an organization structure:

sql-img2

If you want to retrieve all of the subtrees of a given department, but you don’t know how deep the structure is, there is no way to write a single standard SQL statement to perform the job.

OrientDB SQL has an extension called TRAVERSE, specifically designed for this use case

The following statement retrieves all the sub-departments of Dept 1

TRAVERSE in(‘Parent’) from (
    SELECT * FROM Department WHERE name = ‘Dept 1’
)

 

The following one retrieves the first three levels of sub-departments of Dept 1:

TRAVERSE in(‘Parent’) from (
    SELECT * FROM Department WHERE name = ‘Dept 1’
) WHILE $depth <= 3

 

or just, using the new MAXDEPTH operator:

TRAVERSE in(‘Parent’) from (
    SELECT * FROM Department WHERE name = ‘Dept 1’
) MAXDEPTH 3

 

You can also filter the result of this traversal to retrieve only the third level:

SELECT * FROM (
    TRAVERSE in(‘Parent’) from (
       SELECT * FROM Department WHERE name = ‘Dept 1’
   ) WHILE $depth <= 3
) WHERE $depth = 3

 

The WHILE condition is used by the query to decide if the traversal has to continue on the current traversal branch. It can contain any boolean expression that could be evaluated in a WHERE condition.

Dealing with complex types

Standard SQL was designed to manage simple data types (eg. strings and numbers). OrientDB data model is quite complex compared to RDBMS: records can have simple attributes (strings, numbers, dates), link attributes (see dot notation), list and set attributes, embedded objects. As a consequence, OrientDB SQL was enriched to manage this kind of data.

Managing collections and embedded objects

Records can be created with complex attributes, eg.

INSERT INTO Conference SET
name = ‘OrientDB Next World Conference’, 
tags = [‘NoSQL’, ‘Graph’, ‘Document’], 
location = {address: ‘abc Street’, city: {name: ‘London’, country: {name: ‘UK’}} }

 

sql-img3

The result will be the following:

{
   @class: Conference,
   @rid: #12:10, //self generated and managed by OrientDB
   name:  ‘OrientDB Next World Conference’,
   tags = [‘NoSQL’, ‘Graph’, ‘Document’], 
   location = {
      address: ‘abc Street’, 
      city: {
         name: ‘London’, 
         country: {
            name: ’UK’
         }
      } 
   }
}

 

As you can see, complex attributes are not managed as strings or BLOBs: they are first-class, structured objects that can be queried and manipulated via SQL.

To add a new tag to the conference, we can just use the UPDATE ADD syntax:

UPDATE Conference ADD tags = ‘Multi-Model’ 
WHERE name = ‘OrientDB Next World Conference’

 

To remove a tag, you can just use UPDATE REMOVE syntax:

UPDATE Conference REMOVE tags = ‘NoSQL’ 
WHERE name = ‘OrientDB Next World Conference’

 

You can also use DOT notation to update embedded objects content, eg.

UPDATE Conference SET location.city.country.name = ‘United Kingdom’ 
WHERE name = ‘OrientDB Next World Conference’

 

Querying the Conference class will result in a single record:

sql-img4

 

UNWINDing collections

In some cases, you will need to unwind collection attributes to have a single row per collection item.

OrientDB 2.1 provides a new UNWIND operator, specifically designed for this need:

SELECT FROM Conference UNWIND tags

sql-img5

UNWIND operator can also be used to unwind multiple collections in the same query.

Querying collections and embedded objects

Of course, you will be able to compare collections with other collections using a WHERE condition.

SELECT FROM Conference WHERE tags = [‘Graph’, ‘Document’, ‘Multi-Model’]

 

but in some cases, you will have to query for collections that contain a particular value:

SELECT FROM Conference WHERE tags contains ‘Graph’

 

or for embedded objects that contain a particular value as a key or as a value:

SELECT FROM Conference WHERE location containsKey 'address'

 

SELECT FROM Conference WHERE location containsValue 'abc Street'

Working with Schema(less)

OrientDB allows you to work with a dynamic schema or in a completely schemaless mode, which means that two records in the same class can have completely different attribute sets and types.

OrientDB is very tolerant when dealing with non-existing attributes, eg. queries will never result in an error if attributes in the WHERE condition are not defined for a particular record.

Sometimes you just want to query a set of records based on their content, without having all the complete information about their schema. Consider querying the Conference class for any records that contain the word ‘OrientDB’ in any of their attributes, you can use any() operator for this:

SELECT FROM Conference WHERE any() like ‘%OrientDB%’

 

Of course, if needed, you can define a strict schema for your classes, so you can define attribute names, types, and constraints (not null, min, max etc.)

sql-img6

When you have a defined schema, you can query it using SQL:

SELECT expand(classes) from metadata:schema

sql-img7

Conclusion

SQL is a fantastic language because it’s clear and effective, but to harness all the power of a Multi-Model database, it needs some small improvements. OrientDB SQL is very similar to the standard, but includes the addition of operators to operate graph traversal, manage and query complex attributes, and to manage schema and schema-less documents.

If you already know SQL, you are already at 70% of the learning curve and the remaining 30% will be very smooth. All you need to do is get used to a richer, more powerful, versatile and more efficient data model: the Multi-Model database.

Read the next post about SQL and Pattern Matching with OrientDB.

 

Comments are closed.