Random notes based on the seminal book "NoSQL Distilled by Pramod J. Sadalage and Martin Fowler", aimed at enabling faster comprehension
Business needs of the Modern Enterprise
Real-time capture and analysis of big data – coming from multiple sources and formats and spread across multiple locations
Better customer engagements through personalization, content management and 360 degree views in a Smartphone era
Ability and agility in proactively responding to new markets and channels
Constraints of the RDBMS environment
Frequent database design & schema revisions in response to fast-changing data needs have serious application-wide ramifications as RDBMS is the point of business integration
Growing data storage needs call for more computing resources but RDBMS ‘scale up’ is prohibitively expensive
Clustering is an effective solution but cluster-aware Relational DBs can’t escape the ‘single point of failure’ trap in making all writes to an abundantly-available shared disk.
Sharding in RDBMS puts unsustainable loads on applications
NoSQL in perspective
Over time, enterprise with complex and concurrent data needs created tailored non-relational solutions specific to their respective business environments.
They are a natural fit for the clustering environment and fulfill the two principal needs of the modern enterprise, viz,
Cost-effective data storage ensuring fit-for-purpose resilience and several options for data consistency and distribution
Optimal and efficient database-application interactions
It would be appropriate to name this ever-expanding universe as NoSQL, which contrary to what the name implies, is ‘non-relational’ rather that ‘non-SQL’ since many RDBMS systems come with custom extensions. (NewSQL hybrid databases are likely to open new doors of possibilities)
Each data model of the NoSQL universe has a value prop that needs to be considered in the light of the given business case including the required querying type and data access patters. There’s nothing prescriptive about their adoption. And they are not a replacement for SQL, only smart alternatives.
NoSQL data models
A closer look at two common features:
Concept of Aggregates
Group all related data into ‘aggregates’ or collection of discrete data values (think rows in a RDBMS table)
Operations updating multiple fields within each aggregate are atomic, operations across aggregates generally don’t provide the same level of consistency
In column-oriented models, unit of aggregation is column-family, so updates to column-families for the same row may not be atomic
Graph-oriented models use aggregates differently – writes on a single node or edge are generally atomic, while some graph DBs support ACID transactions across nodes and edges
To enable data combinations and summarization, NoSQL DBs offer pre-computed and cached queries, which is their version of RDBMS materialized views for read-intensive data which can afford to be stale for a while. This can be done in two ways:
Update materialized views when you update base data: so each entry will update history aggregates as well
Recommended when materialized view reads are more frequent than their writes, and hence views need to be as fresh as possible
This is best handled at application-end as it’s easier to ensure the dual updates – of base data and materialized views.
For updates with incremental map-reduce, providing the computation to the database works best which then executes it based on configured parameters.
Update materialized views in batches of regular intervals depending on how ‘stale’ your business can afford them to be
Domain-specific compromises on consistency to achieve:
a. High availability through Replication: Master-slave & peer-to-peer clusters
b. Scalability of performance through Sharding
In each case, the domain realities would matter than developmental possibilities –what level and form of compromise is acceptable in the given business case would help arrive at a fit for purpose solution.
Many NoSQL models offer a blended solution to ensure both High Availability and High Scalability - where sharding is replicated using either Master-slave or peer-to-peer methods.
Works best for read-intensive operations
Database copies are maintained on each server.
One server is appointed Master: all applications send write requests to Master which updates local copy. Only the requesting application is conveyed of the change which, at some point, is broadcast to slave servers by the Master.
At all times, all servers – master or slaves - respond to read requests to ensure high availability. Consistency is compromised as it is ‘eventually consistent’. Which means an application may see older version of data if the change has not be updated at its end at the time of the read.
Fail scenarios in Master-slave cluster and their possible mitigation:
Master fails: promote a slave as the new master. On recovery, original Master updates needful changes that the new Master conveys.
Slave fails: Read requests can be routed to any operational slave. On recovery, slave is updated with needful changes if any.
Network connecting Master and (one or more) Slaves fails: affected slaves are isolated and live with stale data till connectivity is restored. In the interim, applications accessing isolated slaves will see outdated versions of data.
Works best for write-intensive operations.
All servers support read and write operations.
Write request can be made to any peer which saves changes locally and intimates them to requesting application. Other peers are subsequently updated.
This approach evenly spreads the load, but if two concurrent applications change the same data simultaneously at different servers, conflicts occur which have to be resolved through Quorums. If there’s a thin chance of two applications updating the same data at almost same times, a quorum rule can state that data values be returned as long as two servers in the cluster agree on it.
Evenly partition data on separate databases, store each database on a separate server. If and when workload increases, add more servers and repartition data across new servers.
To make the most of sharding, data accessed together is ideally kept in the same shard. It’s hence recommended to proactively define aggregates and their relationships in a manner that enables effective sharding.
In case of global enterprises of widely-dispersed user locations, the choice of sites for hosting shards should be based on user proximity apart from most accessed data. Here again, aggregates should be designed in a manner that supports such geography-led partitioning.
Sharding largely comes in two flavors:
Non-sharing Shards that function like an autonomous databases and sharding logic is implemented at application-end.
Auto Shards where sharding logic is implemented at database-end.
Sharding doesn’t work well for graph-oriented data models due to the intricately connected nodes and edges which make partitioning a huge challenge.
Ways to improve ‘eventual consistency’ : Quorums Versioning
Read and Write Quorums
Quorums help consistency by establishing read and write quorums amongst servers in a cluster. In case of reads, data values stored by the read quorum are returned. In case of writes, it is approved by a write quorum of servers in the cluster.
Applications read and write data with no knowledge of quorum arrangements which happen in the background.
The number of servers in a quorum – read or write – have a direct bearing on database performance and application latency. More the number of servers, more the time for read and write quorum approvals.
Consistency problems can arise in Relational and Non-relational despite ACIDity or quorum rules. A case in point is a lost updates from concurrent access of the same data where one modification overwrites the changes made by other. In business cases which can’t afford pessimistic locking, version stamps are a way out:
An application reading a data item also retrieves version information. While updating, it re-reads version info, if it’s unchanged , it saves modified data to the database with the new version info. If not, it retrieves the latest value probably changed by another application and proceeds to re-read version stamp before modifying data.
In the time between re-reading the version info and changing values, an update can still be lost from a change made by another application. To prevent this, data can be locked in the given time frame in the hope that it will be miniscule.
Few NoSQL models like column-oriented DBS enable storing of multiple versions of the same data in an aggregate along with version timestamp. As and when needed, an application can consult the history to determine the latest modification.
When synchronicity between servers in a cluster is in question due to network constraints, vector clocks are seen as a way out. Each server in the cluster maintain a count of updates enforced by it, which other servers can refer to thereby avoiding conflicts.
What ‘schema-less’ actually means?
Onus shifts to Application-side
In NoSQL databases, data structures and aggregates are created by applications. If the application is unable to parse data from the database, a schema-mismatch is certain. Only that it would be encountered at application-end.
So contrary to popular perception, the schema of the data needs to be considered for refactoring applications.
That applications have the freedom to modify data structures does not condone the need for a disciplined approach. Any unscrupulous changes in structure can invite undesirable situations: they can complicate data access logic and even end up with a lot of non-uniform data in the database.