Introduction to SAP HANA
SAP HANA : Background
Relational databases were designed in the late 80s/early 90s to provide a more structured, relational way to manage data using SQL.
At the time, hardware was very different from today. DRAM, for example, was much more expensive and much smaller than it is now, and CPUs were single-core.
By 2003, it was clear to SAP that a completely new database paradigm was possible, one built on the new hardware capabilities: multi-core processors, massively larger and cheaper main memory, and columnar data structures.
Accessing data in memory is dramatically faster than accessing it on disk (more than 10,000×).
This was the beginning of the path towards HANA (a name invented by Dr. Vishal Sikka in August 2006 in Palo Alto; initially it was a reference to "Hasso's New Architecture").
The paper on the new in-memory columnar database was presented by Hasso Plattner in New England in 2009. It was extremely well received, and development of the HANA project started in October 2009. On December 1st, 2010, HANA was launched and reached RTC (release to customer).
In June 2011, HANA became generally available!
HANA technology : Parallelism
The combination of multi-core parallelism, data locality in memory, columnar structures and the fact that everything has been rethought from scratch is the secret behind the creation of HANA.
HANA derives its power from running massively parallel: every operator in HANA executes in parallel.
Modern servers can have 80 CPU cores, 2 TB of DRAM and 5 TB of SSD. HANA was designed to take maximum advantage of this computing power.
One of the most important statistics to remember is that HANA does 3.5 billion scans per second per core. It also does 12.5 to 15 million aggregations per second per core. This means everything can be aggregated on the fly.
All the major operators (calculations, joins and scans) use parallelism. In fact they use "intra-operator parallelism": not only can a job be distributed across processors, but parts of the work inside a single operator can also run in parallel.
Intra-operator parallelism runs about 6.5× faster than inter-operator parallelism alone. This is where some of HANA's tremendous advantages come from.
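The idea of intra-operator parallelism can be sketched in a few lines of Python: a single SUM operator is split into independent partial scans whose results are then combined. This is a toy illustration of the decomposition, not HANA's implementation; a real engine would run each chunk on a dedicated core over a compressed column.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # each worker scans only its own slice of the column
    return sum(chunk)

def parallel_sum(column, workers=4):
    # intra-operator parallelism: one aggregation operator is split
    # into independent partial scans, then the partials are combined
    size = max(1, len(column) // workers)
    chunks = [column[i:i + size] for i in range(0, len(column), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(partial_sum, chunks)
    return sum(partials)

print(parallel_sum(list(range(1_000_000))))  # equals sum(range(1_000_000))
```

The decomposition works for any aggregate that can be combined from partials (SUM, COUNT, MIN, MAX); medians and other holistic aggregates need more elaborate merging.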
HANA technology : Row and Column stores
The in-memory row store benefits from inventions like OLFIT (Optimistic Latch-Free Index Traversal), a technique that lets transactions traverse and update an in-memory index without locking the entire index or system. This suits the wide data structures needed by financial applications, which have massive tables with around 320 fields.
The benefit of the row store is that transactions are very fast.
Traditionally, column stores were known to be slow for transactions.
SAP worked very hard over the years to make sure this is not the case, by combining several techniques:
Main column store: read-optimized and compressed.
Delta column store: transactions go very quickly into this write-optimized store and are merged into the main store later.
When a query arrives, the results from the main and the delta store are combined.
Another addition to this design is a row-store buffer (the L1 delta) which sits in front of the delta store and can absorb transactions extremely fast.
The benefit of the column store is that analytics run dramatically faster (3.5 billion scans per second per core).
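The main/delta/L1 design described above can be sketched as a toy Python class. The store names and the tiny merge threshold are illustrative only; in the real system the main store is compressed, the delta merge runs as a background job, and queries combine all stores transparently.

```python
class ColumnTable:
    """Toy sketch of a main / delta / L1-delta store layout."""

    def __init__(self):
        self.main = []   # read-optimized store (compressed in a real system)
        self.delta = []  # write-optimized column store
        self.l1 = []     # row-store buffer, absorbs inserts fastest

    def insert(self, row):
        self.l1.append(row)             # cheapest possible write path
        if len(self.l1) >= 4:           # tiny threshold, just for the demo
            self.delta.extend(self.l1)  # drain the L1 buffer into the delta
            self.l1.clear()

    def merge(self):
        # the "delta merge": fold the delta store into the main store
        self.main.extend(self.delta)
        self.delta.clear()

    def scan(self):
        # a query reads the union of all three stores
        return self.main + self.delta + self.l1
```

Because `scan` always unions the three stores, readers see every committed row regardless of whether the background merge has run yet.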
HANA technology : Projections, Dynamic aggregation, Integrated compression
Get only what you need (minimal projection): read only the columns the query needs out of the columnar store.
Calculations can likewise be done on the fly, because HANA aggregates very quickly (12.5 to 15 million aggregations per second per core).
Aggregates do not have to be materialized in extra data structures, tables or intermediate results. This is hugely important in analytics.
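Minimal projection plus on-the-fly aggregation can be illustrated with a small Python sketch over a columnar layout: a GROUP BY touches only the two columns it needs, and no pre-built aggregate table exists anywhere. The table and column names are made up for the example.

```python
from collections import defaultdict

def aggregate(columns, group_col, measure_col):
    """Toy on-the-fly GROUP BY over a column-oriented table:
    only the grouping and measure columns are ever read."""
    totals = defaultdict(float)
    for key, value in zip(columns[group_col], columns[measure_col]):
        totals[key] += value
    return dict(totals)

# columnar layout: one list per column, not one record per row
sales = {
    "region":  ["EU", "US", "EU"],
    "amount":  [10.0, 20.0, 5.0],
    "comment": ["a", "b", "c"],   # never touched by the query below
}
print(aggregate(sales, "region", "amount"))  # {'EU': 15.0, 'US': 20.0}
```

Note that the `comment` column is simply never loaded; in a row store, every field of every row would have to be read past.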
When you organize data by columns, you do not have to store every value for every row: HANA builds a dictionary per column and stores only small value IDs in the column itself.
Integrated compression lets HANA make tremendous savings in space: customers running analytical data warehouses routinely get 10×, 20× and even 30× compression.
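Dictionary encoding, the technique behind much of that compression, is easy to sketch: each distinct value is stored once, and the column itself becomes a vector of small integer IDs. This is a minimal illustration; a real column store would additionally bit-pack and sort the IDs.

```python
def dictionary_encode(column):
    """Toy dictionary encoding: distinct values are stored once,
    the column becomes a list of integer IDs into that dictionary."""
    dictionary, ids, index = [], [], {}
    for value in column:
        if value not in index:
            index[value] = len(dictionary)  # assign the next free ID
            dictionary.append(value)
        ids.append(index[value])
    return dictionary, ids

country = ["DE", "US", "DE", "DE", "FR", "US"]
print(dictionary_encode(country))
# (['DE', 'US', 'FR'], [0, 1, 0, 0, 2, 1])
```

Low-cardinality columns (country, status, currency) compress best, because a long column of strings collapses into a tiny dictionary plus a vector of small integers.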
HANA technology : Insert only, Partitioning and Scale-Out, Active and Passive Storage
In a column store it is very advantageous to only insert new records, even for updates: instead of updating an existing record, you simply create a new entry and, in a separate process, invalidate the previous one.
Depending on the invalidation strategy you can recreate histories, which makes audit trails possible. This is an invaluable capability that comes natively with HANA.
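The insert-only pattern can be sketched as a toy table where an "update" appends a new version and flips a validity flag on the old one. The class and field names are invented for the example; HANA's own versioning machinery is far more involved.

```python
import itertools

class InsertOnlyTable:
    """Toy insert-only table: updates never overwrite; they append a
    new version and invalidate the old one, keeping history queryable."""

    def __init__(self):
        self.rows = []                   # each row: version, key, value, valid
        self._version = itertools.count(1)

    def upsert(self, key, value):
        for row in self.rows:
            if row["key"] == key and row["valid"]:
                row["valid"] = False     # invalidate, never delete
        self.rows.append({"version": next(self._version),
                          "key": key, "value": value, "valid": True})

    def current(self, key):
        # the single valid row is the current state
        return next(r["value"] for r in self.rows
                    if r["key"] == key and r["valid"])

    def history(self, key):
        # audit trail: every version ever written for this key
        return [(r["version"], r["value"])
                for r in self.rows if r["key"] == key]
```

Because nothing is ever destroyed, `history` reconstructs the full audit trail that an in-place UPDATE would have erased.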
Equally important is the ability to partition data and to scale out across machines.
You can dynamically partition data across machines and then use the full computational power of every machine. Partitioning can be done by row or by column.
Typically, a 16-node cluster shows a performance improvement of about 11×.
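A minimal sketch of hash partitioning shows the idea: each row is routed to a "node" by its key, so a later scan can run on all partitions in parallel and a coordinator only combines the partial results. The node count and key names are illustrative, and real placement strategies (range, round-robin, multi-level) are richer than this.

```python
def hash_partition(rows, key, nodes):
    """Toy hash partitioning: route each row to one of `nodes`
    partitions by hashing its key."""
    partitions = [[] for _ in range(nodes)]
    for row in rows:
        partitions[hash(row[key]) % nodes].append(row)
    return partitions

def distributed_count(partitions):
    # each node counts its own partition; the coordinator
    # just sums the small partial results
    return sum(len(part) for part in partitions)

rows = [{"id": i} for i in range(100)]
parts = hash_partition(rows, "id", 4)
print(distributed_count(parts))  # 100
```

The key property is that the expensive work (scanning each partition) is independent per node, while the data shipped to the coordinator is tiny.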
You don't need to hold all data in hot/active memory: for financial data, for example, you need only about 14 months, and perhaps 28 months if you also want to do year-over-year comparisons.
The remaining data can be stored on flash and SSD. Organizing data into hot and cold tiers this way yields even more compression and better performance.
HANA technology : SQL, Libraries and Summary
SQL is the language used to access HANA, for both the row store and the column store.
There is also support for MDX, text functions, functional enhancements for business functions, geographical syntax, map-reduce operations (NoSQL), SQLScript and L (an LLVM-based language).
HANA itself is written in C++, and it includes many C++ libraries: GIS data, text data, BFL (Business Function Library), PAL (Predictive Analysis Library), R integration (statistics), IMSL (International Mathematics and Statistics Library), and AFL, the framework for integrating user libraries inside HANA in a safe way.
HANA performance Benchmarks
Five dimensions of performance:
Data size: the larger the data, the slower the system gets.
Query complexity: how complex are the questions (ranging from simple scans and joins to highly complex statistical analysis, finding medians, clustering analysis, etc.)?
Rate of change: how quickly does the system absorb new information?
Data preparation: is the data prepared or is it raw?
Response time: how quickly can questions be answered? Ideally in less than 3 seconds.
The more of these five dimensions a workload involves, the more HANA's performance stands out.
HANA roadmap and Re-thinking Software Development
So what is SAP doing with this technology?
The roadmap is very simple: every single product already runs, or will eventually run, on HANA:
The Business Suite now runs on HANA: ERP, SCM, CRM, ...
Cloud applications already running on HANA: Ariba, SuccessFactors, ...
Beyond that, among the technology products, BW was delivered on HANA in November 2011. This dramatically accelerated everything running on BW:
Many BW reports run 500×+ faster.
Data loads into BW in parallel; DSO and PSA activations (the staging areas and cubes built inside BW) run 10 to 20× faster.
All other technical products are built and optimized to run on HANA, including the application platforms ABAP 7.4 and Java.
HANA in practice and summary
Beyond rethinking the existing portfolio, SAP is working on solutions in totally new areas, for problems that were beyond reach before the advent of HANA.
Amazing applications can be developed on top of SAP HANA in many industries, and many companies are today building new solutions that take advantage of the massive computational power HANA provides.
With HANA, the only limitation is our imagination!
For the full videos of this introduction course to HANA delivered by Dr. Vishal Sikka, please visit open.sap.com.