Why Today’s Data Governance Requires a NoSQL Database

More than 2.5 quintillion bytes of data—as much as 250,000 times the printed material in the U.S. Library of Congress—come into existence every day. What this data means for the average enterprise is opportunity: the opportunity to improve fraud protection, compliance and personalization of services and products.

But first, you need to make sure you are working with the right data and that your data is consistent and clean.

While data governance itself is not a new concept, the need for significantly better data governance has grown with the volume, variety and velocity of data. With this need for better data governance has come a need for better databases. Before we get into that, let’s make sure we’re clear on what data governance is and how it’s used.

The Three Pillars of Data Governance

Data governance is the establishment of processes around data availability, usability, consistency, integrity and security, all of which fall into the three pillars of data governance.

Pillar 1: Data Stewardship

In an age when data silos run rampant and “bad data” is blamed for nearly every major strategic oversight at an enterprise, it’s critical to have someone or something at the ready to ensure business users have high-quality, consistent and easily accessible data.

Enter data stewardship and the “data steward.” A data steward ensures common, meaningful data across applications and systems. This is much easier said than done, of course, and quite often the problems with data stewardship arise from a lack of clarity or specificity around the data steward’s function, as there are many ways to approach it (e.g., by subject area, business function or business process).

Nevertheless, properly stewarding data has become a key ability for today’s enterprises and is a key aspect of proper data governance at any organization.

Pillar 2: Data Quality

Where data governance itself is the policies and procedures around the overall management of usability, availability, integrity and security of data, data quality is the degree to which information consistently meets the expectations and requirements of the people using it to perform their jobs.

The two are, of course, deeply intertwined, though data quality is best seen as a natural result of good data governance, and one of the most important results it achieves.

How accurate is the data? How complete? How consistent? How compliant? These are all questions of data quality, and they are often addressed via the third pillar of data governance: master data management.

Pillar 3: Master Data Management

Master data management, or MDM, is often seen as a first step toward making data usable and shareable across an organization. Enterprises are increasingly seeking to consolidate environments, applications and data, and MDM is a powerful method for achieving that consolidation via the creation of a single point of reference for all data.

NoSQL Graph Database for Data Governance

Considering the recent Facebook fiasco with personal data, and with big regulations like the General Data Protection Regulation (GDPR) now in effect, it’s impossible to overstate the importance of data governance.

NoSQL databases were designed with modern IT architectures in mind. They use a more flexible approach that enables increased agility for development teams, which can evolve the data models on the fly to account for shifting application requirements. NoSQL databases are also easily scalable and can handle large volumes of structured, semi-structured and unstructured data.

Graph databases can be implemented as native graphs, while non-native graph databases, which are slower, store data in relational databases or other NoSQL databases (such as Cassandra) and use graph processing engines for data access. Graph databases are well-suited for applications traversing paths between entities or where the relationship between entities and their properties needs to be queried.
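As a rough sketch of what such a traversal looks like in practice, here is a relationship walk using OrientDB’s Blueprints-compatible Java API; the database URL, the Customer class, the Owns edge and the property names are all hypothetical:

```java
import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraph;

public class OwnershipTraversal {
    public static void main(String[] args) {
        // Assumes an existing database with Customer vertices and Owns edges
        OrientGraph graph = new OrientGraph("remote:localhost/governance", "admin", "admin");
        try {
            for (Vertex customer : graph.getVerticesOfClass("Customer")) {
                // Follow outgoing Owns edges to every account linked to this customer
                for (Vertex account : customer.getVertices(Direction.OUT, "Owns")) {
                    System.out.println(customer.getProperty("name") + " -> " + account.getProperty("iban"));
                }
            }
        } finally {
            graph.shutdown();
        }
    }
}
```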

This relationship-analysis capability makes them ideal for empowering solid data governance at organizations of all types and sizes. From fraud protection to compliance to getting a complete view of the customer, a NoSQL graph database makes data governance much easier and much less costly.

To learn more about how to use a NoSQL graph database for data governance, click here.

Josh Ledbetter
Senior Account Executive, OrientDB, an SAP Company

Today, banks, credit unions and other financial services firms must cross many data chasms to keep customer information secure, accessible and functional for modern service application development and risk detection. Yet, the intertwined nature of aging data architectures and tangled data lineages make rapid application development difficult to accomplish—and buries organizations in technical debt.

What is Technical Debt?

Technical debt refers to the financial, time and risk costs of running services and applications on outdated system designs or IT infrastructure.

Within financial services firms, IT architecture is built upon multiple databases and software systems in varying stages of their lifecycles. DevOps teams at financial firms are likely running multiple instances of Windows and Linux and numerous desktops connected across physical hardware and networks. Customer data is spread across Oracle, SQL Server, Hadoop and Hive databases and accessed via insecure mobile devices, mobile applications, online banking systems, lending platforms, banking centers and more. As individual platforms reach their usage limits, the code on which they were built can become outdated, cause latency issues and expose areas of risk in older systems.

On top of the costs of running, maintaining and deploying those systems, many institutions are stuck servicing old HR, CRM or other databases with monthly retainers to software vendors just to host and access data held within redundant systems. Often, there isn’t an easy way to offload this data into a usable format without tremendous cost and effort.

The result? Financial institutions want to upgrade and innovate but are stuck paying thousands in service and maintenance fees for data infrastructures that are too inflexible, outdated and risk-prone to modify.

This technical debt builds and builds until a company is stuck with an inert, messy platform that makes decommissioning legacy applications extremely complex; implementing new applications, like mobile banking, within the current data topology expensive; and detecting fraud and performing risk assessment like finding a needle in a haystack.

How Multi-Model and Graph Databases Cut These Costs and Cut Through the Data

Instead of forcing teams to mine disconnected data within limited relational databases, multi-model graph databases reduce technical debt by correlating relationships across many different data types.

Deploying a multi-model graph database enables DevOps teams to start running different functions within the same platform, not multiple platforms, service layers and data centers that all need to be maintained.

On-demand visualization models show how application changes will affect the entire IT ecosystem, providing a better snapshot of the financial impact of application development and deployment for new tech-based customer services.

This structure increases the “pluggability” of new SaaS, cloud solutions and integrated customer applications. With multi-model graph solutions automatically connecting new nodes and data formats, financial services firms can quickly innovate on top of their existing network and service layers without outdated code and data getting in the way.

In short, multi-model graph databases can turn DevOps from a cost center into an innovation center. The cost efficiency of having accessible data can be used to build and drive even more cost savings.

Once technical debt has been transformed into a surplus of accessible data, data lineage becomes much more manageable and a strong strategic asset for financial services DevOps teams serving business units that want to modernize their operations.

Gerard (Jerry) Grassi, P.E.
Senior Vice President, OrientDB, an SAP Company

The power of graph database solutions means that creating data and finding relationships between data sets is not constrained by defined data classifications, formats, storage locations or original data structure.

By storing and monitoring relationship data, graph solutions enable organizations to act on changes to their data model without the limitations of a fixed database structure, known as a schema. A database that isn’t limited to one schema type also simplifies data modeling and the querying of connected data.

Although graph solutions support schema-less use, you might still want to use a schema to enforce some structure within your data model and use.

The data structure you choose depends on the business insights you’re after, how you plan to scale and the performance your applications need.

Here’s how schema types can enhance your graph database use and what to consider when aligning data schemas with graph database deployment.

Schema Options with Graph Databases

In traditional relational database use, schemas include tables, views, indexes, foreign keys and check constraints. A graph database such as OrientDB still includes a basic schema, built from vertex objects (the data entities) and edge objects, where each edge stores information about a particular data relationship.

The degree to which you define the classes of these edges and vertices depends on your graph database needs. Graph solutions simply provide more flexibility in how you define your data model.

OrientDB’s graph solutions support three schema options:

  1. Schema-full, in which every property is declared up front and constraints are enforced on all records.
  2. Schema-less, in which classes are created without properties and records may carry arbitrary fields.
  3. Schema-hybrid, in which some properties are declared while records remain free to add their own custom fields.

With graph solutions, you can define each schema type as you create the structure of your graph database, as the sketch below illustrates.
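Here’s a minimal sketch of how those three modes might be declared with OrientDB’s Java schema API; the Account, Customer and Interaction classes and their properties are hypothetical examples, not part of any real deployment:

```java
import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.metadata.schema.OClass;
import com.orientechnologies.orient.core.metadata.schema.OType;

public class SchemaModes {
    public static void main(String[] args) {
        ODatabaseDocumentTx db = new ODatabaseDocumentTx("memory:schemademo").create();
        try {
            // Schema-full: declare every property and enforce constraints on it
            OClass account = db.getMetadata().getSchema().createClass("Account");
            account.createProperty("iban", OType.STRING).setMandatory(true).setNotNull(true);

            // Schema-hybrid: declare a few core properties; records may still add ad hoc fields
            OClass customer = db.getMetadata().getSchema().createClass("Customer");
            customer.createProperty("name", OType.STRING);

            // Schema-less: declare no properties at all; records carry arbitrary fields
            db.getMetadata().getSchema().createClass("Interaction");
        } finally {
            db.drop();
        }
    }
}
```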

Selecting a Schema to Fit Your Business Insights

The type of schema you choose to power your graph solution ultimately depends on the kinds of questions you want your graph solution to answer from your data relationships. The essential difference between the schema types is how specifically you define the constraints on the types of nodes and the allowed relationships between the node types.

There’s no one-size-fits-all approach, but the schema flexibility of graph solutions allows you to think about why you are querying data instead of what data when building your graph database solution.

For example, if you’re building an application that recommends services to existing customers, such as upselling financial packages to banking members, you can likely use a schema-hybrid model to define the nodes and edges with specific data types.

If you’re trying to uncover unforeseen relationships between data sets, such as in fraud-detection applications, a schema-less model enables you to adjust relationship guidelines as the database generates real-time visualizations.

Selecting a Schema that Will Scale

The schema model you choose also depends on how you’d like to scale the database for use with new or changing data inputs, systems and use cases.

Graph databases shine here because relationships and vertex types are created as new data comes into the system, allowing your database to “expand” as business use changes.

As you scale your graph solution to different areas of your business, the schema you have in place will impact how you build traversals between data. Perhaps you started with a customer targeting application using a schema-hybrid. As that grows, you might want to move to a schema-less model to extract even more data around the results of that targeting and use it to infer relationships between customer use and product innovation. Discovering or creating new relationships between data types and applications works best with flexible system design, which a schema-less model can provide. In this instance, using a dynamic language can help modify or eliminate data classes for a less rigid design.

Likewise, if you started with a schema-less strategy when building your graph database, you might find you want to enforce certain data quality standards or governance rules as more applications or inputs connect to your database. Or, perhaps, you want to bring in legacy schema indexes and represent those structures within your graph solution. In that case, it might make sense to switch to a schema-hybrid model or schema-full strategy with more defined global relationship types and rules within your database. Graphical query tools can enable developers to start building more structure into their existing database.
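For instance, tightening a previously schema-less class might look like the following sketch, which continues the earlier example (the Interaction class, its properties and the `db` handle are hypothetical):

```java
// `db` is an open ODatabaseDocumentTx; "Interaction" was created schema-less earlier
OClass interaction = db.getMetadata().getSchema().getClass("Interaction");

// Tighten governance: new records must now carry these fields
interaction.createProperty("customerId", OType.STRING).setMandatory(true).setNotNull(true);
interaction.createProperty("timestamp", OType.DATETIME).setMandatory(true);

// An index speeds lookups on the newly enforced property
interaction.createIndex("Interaction.customerId", OClass.INDEX_TYPE.NOTUNIQUE, "customerId");
```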

Selecting a Schema for Optimal Performance

Graph solutions already have a major performance advantage over schema-enforced databases. With OrientDB, users can store up to 120,000 records per second in nodes and process transactions about ten times faster than other databases with defined schemas.

Still, if you’re enforcing more defined types of rules, such as mandatory, unique or null constraints, within your schema model, it’s important to test how that structure and the applications you’re using will impact transactional output.

The schema model you choose is highly dependent on the applications you want to build and how your organization wants to leverage graph databases. No matter which model you choose, the graph database will still enforce its defined nodes and classes to maintain data integrity. However, a schema is hardly set in stone. One of the major advantages of the graph model is that it supports multiple schema types side by side and enables schema constraints to be reconfigured as needs change.

Luigi Dell’Aquila
Director of Consulting, OrientDB, an SAP Company


Developers have always had to do more with less. What the OrientDB team has loved about working with developers over the last eight years is learning all the ways in which they’ve innovated around complex data challenges even as data types, formats and application usage have changed.

When OrientDB founder Luca Garulli created our database management system, he wanted to empower developers’ unsung innovation by freeing them from the chains of monolithic data formats and use. His mission was to create a high-performance transactional graph database that gives developers exactly that freedom.

OrientDB’s goal has always been to offer a solution that gives developers all the tools they need, in one place, to build innovative applications that meet their unique business challenges. It goes beyond providing an open-source product; at OrientDB, we aim for an open innovation strategy that makes not just the code, but the business transformation steps, accessible to developers from all types of industries.

We’re happy to announce that we’ve launched the next phase of this mission: OrientDB.org, a free, one-stop resource for downloading, using, optimizing and deploying the OrientDB graph database solution. Built just for developers, the site includes product documentation, help files, case studies, training materials and release notes to help developers in every step of graph database use, from download to deployment.

Evolving Graph Options for Evolving Data Use

Luca built OrientDB in response to the challenges developers face as new business applications come into play across the enterprise. When database technology was invented 40 years ago, developers didn’t have to contend with capturing and managing unstructured data from social networks, mobile applications and big-data analytics.

Now, application developers are faced with the task of capturing, connecting and managing all of this new unstructured data alongside their traditional workloads.

Over the years, we’ve baked features into the OrientDB platform that enable developers to solve these challenges, including our Teleporter migration tools, auditing capabilities, offline monitoring, database backups without delays, and dynamic-distribution configuration and clustering.

The work of an application developer is always a moving target, though, as new business needs, data inputs and business goals shift.

Supporting Developers in Delivering Groundbreaking Applications

Even as data volumes and formats have grown, developers have continued to create cutting-edge applications using OrientDB. They’ve not only found a way to absorb and work with uncharted data types, but they’ve spun them up into next-generation business applications, from responsive, geospatial network management for telecommunications to real-time data governance reporting.

In 2019, our goal isn’t just to provide developers with the tools they need to use graph solutions; we want to empower them to build the most powerful cloud business applications in the world.

When you’re building next-generation applications from scratch in your enterprise, there’s not often a blueprint for how to do it. With OrientDB.org, we’ve centralized instructions on the many use cases and applications our customers have built using the OrientDB platform. The site includes deep insight into how they’ve built and deployed those solutions, on both a technical and transformational level.

Take advantage of the institutional knowledge from a multitude of developers across the world by checking out the OrientDB.org site. It’s our heartfelt Valentine’s Day gift to the innovative application developers working hard to move their projects forward every day!

The OrientDB Team
OrientDB, an SAP Company

What if average citizens were able to quickly experiment with public government spending data to determine whether any officials were misusing taxpayer funds?

That’s the question Gabriel Mesquita, a software developer and computer scientist from Brazil, recently set out to answer.

In a post on Hacker Noon, Mesquita explored whether any Brazilian government officials were using their monthly allowances illegally by buying products and services from companies owned by people they know.  

In his experimental attempt to detect fraudulent patterns in spending, he turned to OrientDB, the world’s fastest NoSQL database.

A Data Model That Uncovers Anomalies

After obtaining the public data, Mesquita built a data model that leveraged graph database technology. It’s in Portuguese, but here’s what the model looks like:

[Figure: Mesquita’s data model, built with graph database technology; labels in Portuguese]

Mesquita’s model detects which deputies performed multiple transactions with specific companies, whether those companies donated to the specific deputy’s campaign and whether the deputy has any connections, directly or indirectly, to each company in question.
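A hedged sketch of what such a check might look like using OrientDB’s MATCH syntax from Java; the Deputy and Company classes, the SpentWith and DonatedTo edges and the `db` handle are hypothetical stand-ins for Mesquita’s actual model:

```java
import com.orientechnologies.orient.core.record.impl.ODocument;
import com.orientechnologies.orient.core.sql.query.OSQLSynchQuery;
import java.util.List;

// Find deputies who spent allowance money with companies that also donated to their campaigns
String match =
    "MATCH {class: Deputy, as: d}-SpentWith->{class: Company, as: c}, " +
    "      {as: c}-DonatedTo->{as: d} " +
    "RETURN d.name AS deputy, c.name AS company";
List<ODocument> flagged = db.query(new OSQLSynchQuery<ODocument>(match));
for (ODocument row : flagged)
    System.out.println(row.field("deputy") + " spent with donor " + row.field("company"));
```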

The results? Seven deputies spent their monthly allowances with companies that supported their campaigns in 2014. Another deputy received a donation from a company and then used taxpayer money three different times to support that company.

None of this behavior is illegal, Mesquita suggests.

But, in support of transparency and to serve as another check and balance on politicians, it’s important that taxpayers know about it.

The Takeaways

Mesquita identified two major takeaways from his research:

  1. The Brazilian Democratic Movement (Partido do Movimento Democrático Brasileiro) spends more money than any other political party.
  2. Politicians spend the most money on travel.

Since data pertaining to friends and relatives of politicians isn’t available in Brazil, Mesquita used “fake data” to flesh out his model.

“To validate the model and the whole process I inserted fake data to simulate the fraud scenarios,” Mesquita writes. “Hopefully, if the Chamber of Deputies has this kind of data, they could use the same process to inspect the deputies’ expenses.”

Because he couldn’t access all of the real-world data needed to truly test his thesis, Mesquita’s exercise was experimental in nature.

Still, he found the right tool in OrientDB.

“OrientDB is a great multi-model DBMS,” Mesquita writes. “Graphs are great [at] exposing relationships,” and OrientDB “is a viable solution to find patterns with open data and to provide transparency for our population.”

For more details on Mesquita’s project, read the full piece on Hacker Noon.

To learn more about the world’s leading multi-model graph database and NoSQL solution, visit https://orientdb.com.


Venus Picart
Senior Marketing Director, OrientDB, an SAP Company

In the world of master data management, silos are a tremendous challenge.

When enterprises try to process information from disparate systems, they too often end up with sub-optimal applications and initiatives laden with errors and misinformation, not to mention blown timelines and budgets. But master data management (MDM) is actually more than just the breaking down of data silos. It’s about efficiency and service, innovation and security, clarity and perspective. It’s about getting the most out of your most valuable resource: your data.

Here are the five things you need to know about MDM:

The Challenge of Multiple Data Systems

For existing enterprises, one of the largest hurdles to developing an MDM system is the multiplicity of databases and applications usually involved. What’s more, Enterprise Resource Planning and Customer Relationship Management systems rely on structured data, whereas the proliferation of IoT devices has created exponential growth in unstructured data.

Take the example of Enel, one of the largest power utilities in Europe. Enel was struggling to provide analytics and reporting across all of its power generation plants and equipment. Data flowed in from multiple systems, including IoT devices on power generation equipment, plant maintenance systems, scheduling applications and other sources, each with its own data types. Enel was exporting data to CSV files and manually aligning it to generate reports and analytics.

Other companies in similar scenarios might invest in expensive integration bus systems to support a polyglot persistent environment.

Enel found a solution in a native multi-model database, which allowed it to bring all data into a single database: no more worrying about different data types or keeping the different systems in sync. The result was real-time data analysis across all sites and multiple data systems, with no more month-long manual reporting processes.

Master Data Management Really is for Everyone

All companies are now digital enterprises. Since all systems rely on data, MDM is a discipline all organizations need in order to remain competitive. Master data powers everything from financial reporting to real estate transactions to fraud protection. The ultimate results are faster and better decisions, improved customer satisfaction, enhanced operational efficiency, and a better bottom line.

Redundancy Elimination is Only Part of It

Most people who’ve heard of MDM immediately link it to one of its primary objectives: the elimination of redundant data. Yes—having a central repository of data will eliminate data redundancies, as long as it’s done correctly. But the benefits of MDM extend beyond redundancy elimination to data consistency, data flexibility, data usability, and data security (via role-based access).

Mergers and Acquisitions Don’t Have to Mean a Master Data Management Nightmare

Mergers and acquisitions can be rough on data consistency. Reconciling several master data systems brings headaches from differing data system taxonomies and structures. This usually results in two systems remaining separate and linked only through a special reconciliation process.

As more acquisitions and mergers occur, the problem compounds into a labyrinth of siloed systems and data. This brings you back to the problem that spurred you to invest in MDM in the first place.

The answer lies in the database management system and vendor you choose for your MDM system. Make sure to choose a vendor that offers a flexible, multi-model database that allows you to easily develop a single data taxonomy.

The Database that Backs Your Master Data Management System is Key

The most powerful and effective MDM systems run on databases that fit the business model in question.

As an example, Xima Software works with networks that are naturally graph-like. For a telco, then, an MDM system built on a multi-model graph database is the most effective MDM strategy: the database makes the network easy to visualize because it uses the same graph model.

Master Data Management is Evolving

The fifth thing you need to know about MDM is that it’s rapidly evolving to meet the needs of today’s enterprises and their customers. Retailers are using it to improve time-to-market and address their customers’ growing expectations for a true omnichannel experience. The consumer packaged goods industry is using it to ensure the accuracy of nutritional information and comply with local disclosure regulations. And every industry is using it to break down data silos.

Gerard (Jerry) Grassi, P.E.
Senior Vice President – OrientDB
SAP

OrientDB Community Awards


We value and appreciate the hard work put in by the world-wide OrientDB community. That’s why, as a small token of appreciation, we’ve started sending out some gadgets and rewards to our community members.  

Code Contributors

Saeed Tabrizi (Stabrizi)

A special thank-you to Saeed for his dedication to OrientDB. Among his numerous valuable contributions, noteworthy examples include pull requests on the OrientJS repository in which, among several improvements, he implemented the IF NOT EXIST clause when creating classes and properties and the IF EXIST clause when dropping them.

Michael Pollmeier (mpollmeier)

Michael is the original author of the Apache TinkerPop 3 graph structure implementation for OrientDB, which will be officially supported in upcoming major OrientDB releases!

Community Contributor

Scott Molinari (smolinari)

Not only has Scott provided detailed bug reports and documentation, he’s also helped countless community members, shedding light on new features and assisting others experiencing issues.

Thank You for Your Contributions

Thank you to Saeed, Michael and Scott, who, as a gesture of appreciation, will each be receiving a Raspberry Pi 3® Starter Kit along with some OrientDB merchandise (T-shirt, stickers and that kind of stuff)**.


Next time – Bloggers and Writers

We’d also like to send out a special thank-you to all the community members writing about OrientDB in their blogs, articles and papers. That’s why next time around we’ll be sending out some more gadgets to our top community bloggers.

So if you’re currently writing about @OrientDB, remember to use the #OrientDB and #Multimodel tags in your posts and head back to this page regularly. You might find your name on our Top Contributors list!

*All trademarks are the property of their respective owners.
**All OrientDB Community Award winners will be contacted individually in order to receive their prize.


London, April 4, 2016

The OrientDB Team has just released OrientDB v2.1.15, resolving 8 issues from v2.1.14. This is the latest stable release, so please upgrade your production environments to v2.1.15. For more information, take a look at the Change Log.

Download OrientDB v2.1.15 now: https://orientdb.com/download

A big thank you goes out to the OrientDB team and all the contributors who worked hard on this release, providing pull requests, tests, issues and comments.

Best regards,

Luigi Dell’Aquila
Director of Consulting
OrientDB LTD

OrientDB launches its Open Source NoSQL Graph-Document Database through CenturyLink’s Cloud Marketplace


LONDON, UK – March 17, 2016 – OrientDB, the pioneer behind the world’s first open source, NoSQL distributed graph-document database, today announced its certification under the CenturyLink Cloud Marketplace Provider Program. Through this partnership, CenturyLink Cloud users are now able to deploy and manage OrientDB’s Community or Enterprise Edition databases via CenturyLink’s Blueprints library.

OrientDB is a second-generation distributed graph database with the flexibility of documents and an open source Apache 2 license. By treating every vertex and edge as a JSON (JavaScript Object Notation) document, OrientDB enables the creation of multi-directional property graphs, allowing large volumes of data to be traversed with ease. This multi-model approach, with a polyglot engine, eliminates the need for multiple systems, ensures data consistency and optimizes the formation of complex relationships. Even though it is a document-based database, relationships are managed as in graph databases, with direct connections among records. Its versatility and rapid integration make OrientDB a perfect candidate for use cases ranging from recommendation engines and fraud detection to real-time analytics and content management. Fortune 500 companies, government entities and startups all use the technology to build large-scale innovative applications.
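To illustrate that document-graph duality, here is a minimal sketch using OrientDB’s Java document API; the Person class and its fields are invented, and the JSON shown in the comment is indicative only:

```java
import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;

public class JsonDemo {
    public static void main(String[] args) {
        ODatabaseDocumentTx db = new ODatabaseDocumentTx("memory:jsondemo").create();
        try {
            db.getMetadata().getSchema().createClass("Person");
            // Every record, including graph vertices and edges, round-trips as JSON
            ODocument person = new ODocument("Person");
            person.field("name", "Ada");
            person.field("city", "London");
            person.save();
            System.out.println(person.toJSON()); // e.g. {"@class":"Person","name":"Ada","city":"London"}
        } finally {
            db.drop();
        }
    }
}
```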

CenturyLink Cloud customers can now benefit from OrientDB’s full feature set.

OrientDB Community Edition is free for any purpose, including commercial use. OrientDB Enterprise Edition extends the Community Edition with features enterprises need, such as the Query Profiler, distributed clustering configuration, metrics recording and a Live Monitor with configurable alerts.

The CenturyLink Cloud Marketplace Provider Program allows participating technology companies, like OrientDB, to integrate with the CenturyLink Cloud platform. These additional business-ready solutions are available to CenturyLink’s cloud, hosting and network customers.

“Companies hoping to leverage big data are getting tired of dealing with multiple systems and increasing infrastructural costs,” said Luca Garulli, CEO of OrientDB. “Customers choose OrientDB for its innovative Multi-model database capabilities and affordable nature. Expanding our capabilities to the cloud through CenturyLink provides the perfect accessible solution without the need for multiple database systems or costly servers.”

“The foundation of the big data revolution on our platform has been software innovation around unstructured data management,” said David Shacochis, vice president of platform enablement at CenturyLink. “OrientDB is a great example of this trend, allowing our customers to manage their unstructured data relations in a scalable model that drives insight out of their business workloads.”

To start using OrientDB on CenturyLink Cloud today, refer to the “Getting Started” guide on the CenturyLink Cloud Knowledge Base.

About OrientDB

OrientDB is an open source, second-generation distributed graph database with the flexibility of documents and a familiar SQL dialect. With downloads exceeding 70,000 per month, more than 100 community contributors and thousands of production users, OrientDB is experiencing tremendous growth in both community and enterprise adoption. First-generation graph databases lack the features that big data demands: multi-master replication, sharding and more flexibility for modern, complex use cases. See for yourself: download OrientDB and give it a try.

Editorial Contacts:
Paolo Puccini
OrientDB Ltd
+44 203 3971 609
info@orientdb.com

February 8, 2016

By Andrey Lomakin, Lead Research & Development Engineer at OrientDB

In OrientDB v2.2 we’ve added tools that enable storage performance metrics to be gathered both for the whole system and for a specific command executed at a given moment. This feature will be of interest not only to database support teams, but also to users who want to understand why a database is fast or slow for their use case and what lies behind the numbers attained in a benchmark.

But before we consider characteristics gathered during storage profiling, let’s take a look at OrientDB’s architecture.

All high-level OrientDB components exposed to the user, such as clusters and indexes, are implemented inside the storage engine as “durable components” and extend the ODurableComponent class. This class is part of a framework created to make component and data structure operations atomic and isolated in terms of ACID properties. Each durable component holds its data in direct memory, not in the Java heap. But whereas in Java we operate on variables to store and read application data, durable components operate on pages.

A page is a contiguous snippet of memory which always has the same fixed size and is mapped to a file on disk. When data is written to a page it is eventually written to the file, but not instantly: some time may pass between the moment data is written to the page and the moment it is written to the file.

We separate write operations on pages from file system operations because file system operations are slow, so we try to decouple the two. When we change a page, it is not written to disk instantly, as mentioned above, but is placed in the write cache. The write cache aggregates all changed pages and stores them to disk in a background thread, in order of their file position. So, if we have changed pages at positions 3, 2, 8, 4, they will be stored in the order 2, 3, 4, 8.

Pages are sorted by their file positions because it does not matter whether you use DDR, SSD or HDD to store your data; sequential IO operations are always faster than random IO operations. Because pages are stored in a separate background thread, disk write operation speed will be decoupled from data modification operation speed.
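As a toy illustration of that ordering (not OrientDB’s actual internals), flushing dirty pages sorted by file position turns scattered writes into mostly sequential IO:

```java
import java.util.Arrays;

public class FlushOrder {
    public static void main(String[] args) {
        long[] dirtyPagePositions = {3, 2, 8, 4};
        Arrays.sort(dirtyPagePositions); // background thread flushes in order: 2, 3, 4, 8
        for (long position : dirtyPagePositions) {
            System.out.println("writing page at file position " + position);
        }
    }
}
```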

In the case of write operations we may delay a data write and try to convert it into a sequential IO operation, but if we need to read data we need it instantly and cannot delay the read from the file. So in this case we use the well-known technique of caching frequently used pages in a read cache.

So, taking all of the above into account, you can see that OrientDB uses two caches:

  1. A write cache, which aggregates changed pages and flushes them to disk in a background thread, in file-position order.
  2. A read cache, which keeps frequently used pages in memory for instant reads.

When we read a page from a file, we first look for it in the read cache; if it is not there, we check the write cache, which may hold a newer version, and, failing that, load the page from the file and place it in the read cache.

When we modify a page’s content, the page is automatically placed in the write cache.

There is one big problem with all these caches: such a system is not durable. If the application crashes, any data that has not yet been written to the disk will be lost.

To avoid this kind of problem we use a database journal, also known as a WAL (write-ahead log). This makes the whole process of writing data a bit more complex. When we modify a page, we do not put the page in the write cache right away. Instead, we record the changes in a map whose keys identify the file and the index of the changed page, and whose values contain the diff between the original and the changed page.

When an operation on a cluster or index completes without exceptions, we extract all changes from the map, log them inside the database journal, and only after that do we apply those changes to the file pages and put them in the write cache. The database journal can be treated as an append-only log, so all writes to it are sequential and, as a result, fast. The process of writing changes to the database journal and then applying them to the “real” file pages is called the “atomic operation commit”.
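Here is a conceptual sketch of that commit sequence; PageKey, PageDiff, Journal and WriteCache are illustrative stand-ins, not OrientDB’s real classes:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AtomicOperation {
    private final Map<PageKey, PageDiff> changes = new LinkedHashMap<PageKey, PageDiff>();

    void recordChange(PageKey key, PageDiff diff) {
        changes.put(key, diff); // pages themselves are untouched; only diffs are collected
    }

    void commit(Journal journal, WriteCache cache) {
        // 1. Log every diff to the journal first: sequential, append-only writes
        for (Map.Entry<PageKey, PageDiff> e : changes.entrySet())
            journal.append(e.getKey(), e.getValue());
        // 2. Only then apply the diffs to the real file pages via the write cache
        for (Map.Entry<PageKey, PageDiff> e : changes.entrySet())
            cache.apply(e.getKey(), e.getValue());
        changes.clear();
    }

    interface Journal { void append(PageKey key, PageDiff diff); }
    interface WriteCache { void apply(PageKey key, PageDiff diff); }

    static class PageKey {
        final long fileId, pageIndex;
        PageKey(long fileId, long pageIndex) { this.fileId = fileId; this.pageIndex = pageIndex; }
        @Override public boolean equals(Object o) {
            return o instanceof PageKey && ((PageKey) o).fileId == fileId && ((PageKey) o).pageIndex == pageIndex;
        }
        @Override public int hashCode() { return (int) (31 * fileId + pageIndex); }
    }
    static class PageDiff {
        final byte[] bytes;
        PageDiff(byte[] bytes) { this.bytes = bytes; }
    }
}
```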

What value does the database journal give us?

  1. If the application crashes before an operation’s changes have been completely written to the journal, those changes are rolled back on restart, as if the operation never happened.
  2. If the application crashes after the journal entry is complete but before the changed pages have been flushed to the data files, the changes are replayed from the journal on restart.

In both cases data consistency will not be compromised.

Taking all of the above into account, you have probably already concluded that the main performance characteristics of OrientDB’s storage engine (and not only OrientDB’s) are:

  1. The disk cache hit rate.
  2. The number of pages read or written during a single component operation.
  3. The speed of page reads and writes, and how sequential they are.

All these numbers show us the direction in which the project must evolve. For example, if we have a good disk cache hit rate and very few pages are read for a single component operation, we need to improve the speed of the disk cache as a whole. However, if many pages are read for a single component operation and page read speeds are low, we need to minimize the number of pages accessed per operation and convert data structures to ones that use more sequential, rather than random, IO operations.

Readers may ask: “Well, all of this is very good, but how is it related to us?”

The answer is: when you report performance issues, please provide a benchmark (we all have different hardware and sometimes cannot simply reproduce your issue), but also provide the performance numbers gathered as the result of storage profiling.

Readers might also ask: “How is that done?”

It can be done in two ways: by using JMX or by using SQL commands.

The JMX console provides numbers gathered from the execution of all operations in storage, while the SQL commands provide data gathered for a selected set of commands.

To gather performance for a selected set of commands you can execute a script such as the one shown below:

At the end of the script you will see the following result:

As you can see, the output includes numbers for the performance of storage as a whole and numbers for the performance of each component.

Data from the atomic operation commit phase is presented as data from the component named “atomic operation”.

If you work with an embedded database you can start and stop storage profiling by calling the following methods:

OAbstractPaginatedStorage#startGatheringPerformanceStatisticForCurrentThread() to start storage profiling and OAbstractPaginatedStorage#completeGatheringPerformanceStatisticForCurrentThread() to stop it.
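A minimal sketch of wiring those calls up, assuming the storage can be obtained by casting db.getStorage() (our assumption; check your version’s API):

```java
import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage;

public class ProfilingDemo {
    public static void main(String[] args) {
        ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/tmp/demo").open("admin", "admin");
        OAbstractPaginatedStorage storage = (OAbstractPaginatedStorage) db.getStorage();

        storage.startGatheringPerformanceStatisticForCurrentThread();
        try {
            // ... run the operations you want to profile on this thread ...
        } finally {
            storage.completeGatheringPerformanceStatisticForCurrentThread();
            db.close();
        }
    }
}
```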

You may also connect to the JMX server and read the current performance numbers from the MBean with the following name:

com.orientechnologies.orient.core.storage.impl.local.statistic:type=OStoragePerformanceStatisticMXBean,name=,id=

We hope this overview of OrientDB’s architecture and of the performance characteristics that matter to us has been interesting. Please do not forget to send the results of profiling together with your performance reports.

If you have any questions about this blog entry or about any of OrientDB’s features, please post your question on Stack Overflow and we will answer it.

Start using the world’s leading multi-model database today