Saturday, November 30, 2024

Failover cluster

(Thanks to StarWind) 

Always On Failover Cluster Instances (FCI) have a goal very similar to that of Always On availability groups (AG): delivering High Availability for your SQL Server. The main difference is that FCI works at the server-instance level. An FCI is a single instance of SQL Server that is deployed across Failover Cluster nodes. If a node has hardware or software issues, the instance fails over to another node.

Just as with AG, an FCI runs in a Failover Cluster and stays online only while quorum is maintained. The difference here is storage. Whereas AG does not need any shared storage, FCI requires some form of it. This can be cluster disks on an iSCSI or Fibre Channel SAN, Storage Spaces Direct, StarWind Virtual SAN or SMB file shares.

 

In this case, shared storage allows the FCI to move among the nodes in the cluster. The move can be done manually, whenever maintenance on a node is needed, or automatically (in case of an actual failover event). Of course, only a single node owns the SQL Server instance at a time. Correspondingly, there are no replication settings as there are with AG: the failover is handled by WSFC, and because every node works against the same shared storage, committed data is not lost.
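The single-owner behavior described above can be pictured with a small toy model. It is only a sketch: the class and method names are hypothetical, shared storage is assumed to be reachable from every node, and real WSFC applies preferred-owner and health-check policies that are not modeled here.

```python
from typing import List, Optional


class ToyFailoverClusterInstance:
    """Toy model of an FCI: one instance, many possible owner nodes."""

    def __init__(self, nodes: List[str]):
        if len(nodes) < 2:
            raise ValueError("a failover cluster needs at least two nodes")
        self.nodes = list(nodes)
        self.healthy = set(nodes)
        # Exactly one node owns the instance at any time.
        self.owner: Optional[str] = nodes[0]

    def move_to(self, node: str) -> None:
        """Planned (manual) move, e.g. before maintenance on the current owner."""
        if node not in self.healthy:
            raise ValueError(f"{node} is not a healthy cluster node")
        self.owner = node

    def node_failed(self, node: str) -> None:
        """Automatic failover: if the owner fails, pick the next healthy node."""
        self.healthy.discard(node)
        if node == self.owner:
            candidates = [n for n in self.nodes if n in self.healthy]
            # With no healthy node left, the instance is offline (owner is None).
            self.owner = candidates[0] if candidates else None


if __name__ == "__main__":
    fci = ToyFailoverClusterInstance(["node1", "node2"])
    fci.node_failed("node1")
    print("owner after failure of node1:", fci.owner)
```

Note that nothing is copied during the move: only the ownership changes, which is why the model needs no replication state at all.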

Benefits of Failover Cluster Instances

So, what are the benefits you get with FCI?

·         High Availability at the SQL Server instance level;

·         Both automatic and planned failover are available and managed from a Failover Cluster;

·         Flexible shared storage options like iSCSI or FC SAN, SMB shares etc.;

·         No need to reconfigure applications and clients associated with SQL Server instance during failovers.

Restrictions of Failover Cluster Instances

Despite providing a lot of benefits for SQL Server High Availability, Always On Failover Cluster Instances have certain downsides.

·         No option to read from secondary databases as with AG since there is a single instance running, hence, no load balancing;

·         Relying on a single SAN as shared storage creates a single point of failure;

·         No Disaster Recovery options unless combined with AG.




Always On availability group

(Thanks to StarWind) 


Always On availability groups is a mechanism that creates a replicated environment (High Availability or Disaster Recovery depending on the availability mode) for a set of databases you specify. These databases are called Availability Databases. Because they belong to one group, these databases fail over together.

AG operates at the database level, and each set of availability databases is hosted by two types of availability replicas: primary and secondary. The primary replica provides read and write access to the databases, and a set of up to eight secondary replicas is available to become primary in case of a failover. We’ve mentioned availability mode above; it determines whether you’re building HA or DR for your databases, and it also determines the failover options.

There are two types of availability mode: asynchronous-commit mode and synchronous-commit mode. To put it simply, with asynchronous-commit mode some data loss is possible, while with synchronous-commit mode no data is lost, but at the cost of higher transaction latency. Which one to choose depends on how much data loss you can tolerate (your RPO, in other words).
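To make the trade-off between the two modes concrete, here is a minimal toy model with one primary and one secondary. It is a sketch under simplifying assumptions (a "log" is just a Python list, and the class and method names are invented for illustration); it is not how SQL Server implements log shipping internally.

```python
from typing import List


class ToyAvailabilityGroup:
    """Toy model: one primary and one secondary, each holding log records."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary_log: List[str] = []
        self.secondary_log: List[str] = []
        self._in_flight: List[str] = []  # sent, but not yet hardened on the secondary

    def commit(self, record: str) -> None:
        self.primary_log.append(record)
        if self.synchronous:
            # Synchronous commit: the commit completes only after the
            # secondary has hardened the record (this wait is the latency cost).
            self.secondary_log.append(record)
        else:
            # Asynchronous commit: acknowledge immediately, deliver later.
            self._in_flight.append(record)

    def ship_pending(self) -> None:
        """Deliver queued records to the secondary (asynchronous catch-up)."""
        self.secondary_log.extend(self._in_flight)
        self._in_flight.clear()

    def records_lost_on_failover(self) -> List[str]:
        """Committed records that the secondary has not received yet."""
        return self.primary_log[len(self.secondary_log):]


if __name__ == "__main__":
    ag = ToyAvailabilityGroup(synchronous=False)
    ag.commit("t1")
    ag.ship_pending()
    ag.commit("t2")  # the primary fails before t2 is shipped
    print("lost on forced failover:", ag.records_lost_on_failover())
```

In the synchronous case the list of lost records is always empty, which is the "no data loss" property; the asynchronous case can lose whatever was still in flight.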

 

AG requires Windows Server Failover Cluster (WSFC). Each availability replica is hosted on a separate node in WSFC, and a separate cluster role is created for each AG. There is no witness role in Always On availability groups as there was with Database Mirroring. The quorum now depends on the number of nodes in WSFC, and all nodes take part, whether or not they host replicas.
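As a rough illustration of how a node-count quorum behaves, the sketch below assumes the simplest possible model: every node has exactly one vote, and cluster-level witnesses and dynamic quorum adjustment (both of which WSFC supports) are ignored. The function names are hypothetical.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    if total_votes < 1:
        raise ValueError("a cluster needs at least one voting node")
    return total_votes // 2 + 1


def has_quorum(total_votes: int, votes_online: int) -> bool:
    """True while the online voters still form a strict majority."""
    return votes_online >= votes_needed(total_votes)


if __name__ == "__main__":
    for nodes in (2, 3, 4, 5):
        tolerated = nodes - votes_needed(nodes)
        print(f"{nodes} nodes: need {votes_needed(nodes)} votes, "
              f"can lose {tolerated} node(s)")
```

Under this simplified model an even number of nodes tolerates no more failures than the next lower odd number, which is one reason odd vote counts are usually preferred.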

 

Benefits of Always On Availability Groups

The benefits of using Always On availability groups are probably obvious, but still, let’s summarize them:

·         AG supports up to 9 availability replicas (one primary and up to eight secondary);

·         You are flexible with failover: planned manual failover or automatic failover (if we’re talking about synchronous-commit mode with no data loss) and forced failover in case of asynchronous-commit mode where data loss is possible;

·         To make more use of your secondary replicas instead of letting them sit idle waiting for a failover event, you can configure them as active secondaries, for example to take backups from them or to allow read-only access that distributes the workload;

·         Supports encryption and compression;

·         Provides Always On Dashboard for monitoring Always On availability groups, availability replicas, and availability databases.

Restrictions of Always On Availability Groups

Of course, there are certain considerations or, better to say, restrictions to be taken into account when working with AG.

·         Availability replicas must be hosted by different nodes within the same WSFC;

·         All replicas (one primary and up to eight secondary) can use asynchronous-commit mode, but only up to five replicas (including the primary) can use synchronous-commit mode;

·         Failover Cluster Manager should not be used to move availability groups to other nodes or for failover;

·         SQL Server logins, linked servers, agent jobs etc. are not synchronized to the secondary replicas.
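The numeric limits in this list can be expressed as a small pre-deployment check. This is a hedged sketch only: the `Replica` type and `validate_group` function are invented for illustration, the limits are the ones quoted above (they vary between SQL Server versions), and no real SQL Server API is called.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

MAX_REPLICAS = 9        # one primary plus up to eight secondaries
MAX_SYNC_REPLICAS = 5   # including the primary, as quoted in the list above


@dataclass(frozen=True)
class Replica:
    node: str           # WSFC node that hosts the replica
    synchronous: bool   # True means synchronous-commit mode


def validate_group(replicas: List[Replica]) -> List[str]:
    """Return the rule violations found; an empty list means the layout passes."""
    problems = []
    if len(replicas) > MAX_REPLICAS:
        problems.append(f"too many replicas: {len(replicas)} > {MAX_REPLICAS}")
    sync_count = sum(1 for r in replicas if r.synchronous)
    if sync_count > MAX_SYNC_REPLICAS:
        problems.append(
            f"too many synchronous-commit replicas: {sync_count} > {MAX_SYNC_REPLICAS}"
        )
    # Each replica must live on its own node within the same WSFC.
    shared = [node for node, n in Counter(r.node for r in replicas).items() if n > 1]
    if shared:
        problems.append(f"nodes hosting more than one replica: {sorted(shared)}")
    return problems


if __name__ == "__main__":
    layout = [Replica("node1", True), Replica("node2", True), Replica("node3", False)]
    print(validate_group(layout) or "layout passes the checks")
```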




 

 

Differences between always on and fail-over cluster - in brief

The main difference between Always On and failover cluster is the level at which they provide high availability and disaster recovery: 

Always On

Provides high availability and disaster recovery at the SQL database level. Always On availability groups (AAG) do not require shared storage. 

Failover cluster

Provides high availability for applications and services by using a group of independent servers. If one server fails, another server in the cluster can take over its workload. Failover clusters require some form of shared storage. 

Here are some other differences between Always On and failover cluster: 

 

How they work

Always On failover cluster instances (FCIs) work at the server-instance level, while Always On availability groups (AAGs) work at the database level. 

 

How they use shared storage

FCIs require shared storage, while AAGs do not. 

 

How they handle failover

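FCIs rely on Windows Server Failover Clustering (WSFC) to handle failover and have no replication settings. AAGs also run on WSFC, but their failover behavior is controlled by replication settings such as the availability mode of each replica. 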

 

How they provide remote disaster recovery

FCIs can use AAGs to provide remote disaster recovery at the database level.


Benefits of an Always On SQL Server

The benefits of having an Always On SQL Server in your IT environment cannot be overstated. An Always On SQL Server is a modern database system that provides continuous access to data, allowing your business to remain agile and responsive.

Enhanced data availability

Having an Always On SQL Server means that your data remains available when an individual server or component fails.

This helps your business stay operational and responsive even in the event of unforeseen circumstances such as power outages, hardware failures, or network outages.

Improved performance

An Always On SQL Server can improve the performance of your IT environment. With availability groups, for example, read-only queries and backups can be moved to secondary replicas, which leaves the primary replica more capacity for read-write work.

 

Increased security

An Always On SQL Server can also enhance your security by providing a secure data environment. With an Always On SQL Server, your business can trust that your data is safe and secure. This is important for protecting sensitive customer and business information from hackers and other cyber threats.

 

Scalability

An Always On SQL Server is scalable, meaning it can grow with your organization. This makes it suitable for businesses that need to scale up or down quickly to meet changing demands.

Cost Savings

An Always On SQL Server can help you save money. Depending on the design, it can lower hardware, maintenance, and energy costs, which helps to reduce your overall IT costs.

Enhanced disaster recovery

An Always On SQL Server can help you recover from disasters more quickly by automatically failing over to a secondary instance or replica in the event of a failure.

 

Increased Reliability

The “Always On” mode of SQL Server also provides increased reliability. This is because the server is constantly running and monitoring the system for any potential issues. This helps to prevent data loss and other downtime-related issues.

Simplified Management

An Always On SQL Server simplifies the management and maintenance of your data. Organizations can manage their data in a centralized location by utilizing an Always On SQL Server, allowing for easier access and control. This simplifies the process of managing and maintaining your data.

 

Easier Maintenance

Having an Always On SQL Server can help to make maintenance easier for businesses. This is because the system remains operational while an individual server is taken offline, which makes it easier to perform necessary maintenance tasks, such as running backups, without interrupting the service.

 

These are just a few of the many benefits of having an Always On SQL Server. With an Always On SQL Server, your business can remain agile and responsive, have access to its data at any time, and have increased security. If your business wants to improve its data availability, performance, and security, then an Always On SQL Server is the perfect solution.

 

Always on availability group

 

What is an always on SQL Server availability group?

An Always On SQL Server Availability Group is a feature of SQL Server that allows you to create a highly available and resilient environment for your databases. It does this by creating one or more copies (replicas) of your databases on different servers and automatically failing over to a replica if the primary database becomes unavailable.

 1. Availability Databases

Availability databases are the databases that are included in an availability group. An availability group can contain one or more availability databases.

 2. Availability Replicas

Availability replicas are copies of the availability databases hosted on different servers. There are two types of availability replicas: Primary replicas and secondary replicas. The primary replica is the main copy of the database and handles all read-write workloads. The secondary replicas are copies of the primary replica and are used for failover and offloading read-only workloads.

 3. Availability Modes

The availability mode determines whether the primary replica waits for a secondary replica to confirm a transaction before committing it. There are two availability modes: asynchronous-commit mode and synchronous-commit mode.

 4. Asynchronous-commit Mode

In asynchronous-commit mode, the secondary replicas do not need to be synchronized with the primary replica in real-time. This means that the primary replica can commit transactions to the database even if one or more secondary replicas are unavailable. This mode provides lower levels of data protection but higher levels of performance.

 5. Synchronous-commit Mode

In synchronous-commit mode, the primary replica waits for the synchronous secondary replicas to confirm that they have written the transaction log before it commits a transaction. This mode provides higher levels of data protection, because no committed data is lost on failover, but at the cost of higher transaction latency.

Always on Fail-over cluster

High availability and redundancy are key requirements in today’s IT infrastructure, and Always On Failover Cluster Instances (FCIs) are one of the best options to deliver them. Always On FCI is a type of failover clustering that provides continuous availability of data and applications for mission-critical enterprise workloads.

It’s used by businesses to ensure that their data and applications stay up and running so that business operations aren’t disrupted by unexpected downtime. It does this by running a single SQL Server instance that any node in the cluster can host, with its databases kept on shared storage, and by automatically failing the instance over to another node if the current node becomes unavailable. FCIs are implemented using Windows Server Failover Clustering (WSFC), a feature of the Windows Server operating system that lets you create a cluster of servers providing high availability for applications and services.

When you create an FCI, you install SQL Server on each node in the cluster. At any given time, one node owns the instance and runs the SQL Server service. The other nodes are passive for that instance and stand ready to take it over.

If the owner node fails, WSFC automatically fails the instance over to one of the other nodes, which becomes the new owner and starts SQL Server against the same shared storage. Clients connect through the instance’s virtual network name, so they do not have to be reconfigured, although connections that were open at the time of the failover are dropped and must be re-established. By failing over automatically in this way, FCIs provide a highly available and resilient environment for your SQL Server instances, which can help improve the reliability and uptime of your database environment.

 

How do Always On Failover Cluster Instances work?

Always On FCI works by setting up two or more servers that can host the same SQL Server instance and access the same shared data. If the server running the instance encounters an issue or goes down, another server automatically takes over and provides the same services and data. This keeps the instance available, even during an outage of one server.

This setup is highly reliable because WSFC monitors and manages the servers as a single system. If the active server fails, another server is automatically triggered to take over without any manual intervention. This means that the data and applications remain available, so your business operations continue with minimal interruption.


CSLs & KMs

Critical Service Levels (CSLs) measure the performance of functions that are most important to the business at the time the contract is signed (e.g., Response Time, Resolution Time, Schedule Adherence). 

Key Measurements (KMs), while not as critical to the customer’s business, still represent performance regarding functions that are important to the customer’s business (e.g., year-on-year service improvements, CMM level, Estimation Accuracy, customer satisfaction score). 

The supplier is expected to provide a solution that is designed to meet all of the service levels (CSLs and KMs). These service levels are typically designed to provide an objective measure of how well the services are being performed. 

 

Typically, each metric has both an Expected Performance Level and a Minimum Performance Level. The supplier’s solution should be designed to deliver at the Expected Performance Level most of the time.   

 

The only substantive difference between CSLs and KMs is that failure to meet a CSL can result in the supplier incurring a Service Level Credit. Typically, if the supplier (a) fails to meet the Expected Performance Level for a CSL 3 months in a row or (b) fails to meet the Minimum Performance Level for a CSL in any month, and the failure is not caused by events outside of the supplier’s control (e.g., failure of equipment or software supported by the customer, unexpected spikes in ticket volumes above an agreed threshold, etc.), then the supplier incurs a Service Level Credit. If a supplier fails to meet the Minimum Performance Level for a CSL in a second consecutive month, the supplier incurs an escalated Service Level Credit. For KMs, if performance does not meet expectations, the client retains the right to promote them to CSLs in the future if and when deemed necessary to get the attention and focus of the provider. 
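The trigger rules above can be sketched as a small function. Several points are assumptions made only for illustration: higher scores are better (for example, a percentage of tickets resolved on time), months excused because the cause was outside the supplier’s control have already been filtered out, and a credit keeps being incurred in every month in which the run of misses against the Expected level is three or longer. Actual contracts define these details, and the function name is hypothetical.

```python
from typing import List


def assess_csl(monthly_scores: List[float], expected: float,
               minimum: float, streak: int = 3) -> List[str]:
    """Classify each month as 'no credit', 'credit' or 'escalated credit'."""
    outcomes = []
    below_expected_run = 0         # consecutive months below the Expected level
    previous_below_minimum = False
    for score in monthly_scores:
        below_minimum = score < minimum
        below_expected_run = below_expected_run + 1 if score < expected else 0
        if below_minimum and previous_below_minimum:
            # Second consecutive month below the Minimum level.
            outcomes.append("escalated credit")
        elif below_minimum or below_expected_run >= streak:
            # Rule (b): below Minimum in any month, or
            # rule (a): below Expected for `streak` months in a row.
            outcomes.append("credit")
        else:
            outcomes.append("no credit")
        previous_below_minimum = below_minimum
    return outcomes


if __name__ == "__main__":
    scores = [99, 97, 97, 97, 94, 94, 99]
    for month, outcome in enumerate(assess_csl(scores, expected=98, minimum=95), 1):
        print(f"month {month}: {outcome}")
```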

 

In this model, a Service Level Credit is calculated based on an Amount at Risk that is a percentage of the supplier’s monthly revenue for steady state services (i.e., typically projects or other spend with significant monthly variability is not included) and a weighting factor that is assigned to each CSL. 

 

Amount at Risk: The Amount at Risk is typically between 5% and 12% of the monthly revenue, and represents a cap on the amount of a supplier’s monthly revenue that is at risk for Service Level Credits. The lower end of the range is generally applied to services that are commodities, where the supplier’s margins are lower, and the upper end of the range is generally applied to services where the supplier has some specialization and, thus, higher margins. 

 

Weighting Factor: If there are many SLAs, however, spreading the 5-12% among all of them can result in SL Credits that are not meaningful. The weighting factor is designed to address that issue. The weighting factor (known as the Pool Percentage) provides a multiplier that increases the credit associated with each SLA, because it is unlikely that a supplier will fail all (or even many) SLAs in a given month. The weighting factor is typically between 100 and 250 for deals of the current size and can go up to 400 for large deals. Allocation of weighting points to individual CSLs is usually subject to a cap ranging from 10 to 35 points, depending on the number of SLAs. 
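A short worked calculation may help show how the Amount at Risk and the weighting points interact. The sketch assumes a common reading of this model, which the text above does not spell out: each CSL’s weighting points are treated as a percentage of the Amount at Risk, and the total credit in a month is capped at the Amount at Risk. All figures and CSL names are made up for illustration.

```python
from typing import Dict, Iterable


def service_level_credit(monthly_revenue: float, at_risk_pct: float,
                         allocations: Dict[str, float],
                         failed: Iterable[str]) -> float:
    """Credit for one month, capped at the Amount at Risk.

    allocations maps each CSL name to its weighting points; the points of all
    CSLs together add up to the Pool Percentage (for example 250).
    """
    amount_at_risk = monthly_revenue * at_risk_pct / 100
    uncapped = sum(amount_at_risk * allocations[name] / 100 for name in failed)
    return min(uncapped, amount_at_risk)


if __name__ == "__main__":
    # Hypothetical deal: 1,000,000 monthly revenue, 10% at risk.
    points = {"response_time": 30, "resolution_time": 35,
              "schedule_adherence": 25, "backlog": 35}
    one_miss = service_level_credit(1_000_000, 10, points, ["resolution_time"])
    all_miss = service_level_credit(1_000_000, 10, points, list(points))
    print("one CSL missed: ", one_miss)
    print("all CSLs missed:", all_miss)
```

With these numbers a single missed CSL costs 35% of the Amount at Risk rather than a thin equal share of it, which is the effect the weighting factor is meant to achieve, while the cap keeps the worst case at the Amount at Risk.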

 

Measurement Period: Typically, the measurement period for all service levels is monthly. The reason for this is that the calculation model is based on risking a portion of monthly revenue. Where a measurement is performed quarterly or even annually, allocating a portion of the monthly amount at risk to quarterly or annual measurements becomes mathematically more complex. 

 

Earn Back: The standard industry methodology also includes an opportunity for the supplier to earn back SL Credits through “good performance”. If a supplier performs below the Minimum Performance Level for a Critical SLA in one month but then meets or exceeds the Expected Performance Level for a subsequent period that is usually between 3 and 6 months, the SL Credit is “earned back.” 
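The earn-back rule can be sketched as a simple check. The sketch assumes that the earn-back period starts in the month right after the failure, that the months must be consecutive, and that higher scores are better; the function name and the default of three months are illustrative, not taken from any specific contract.

```python
from typing import Sequence


def earned_back(scores_after_failure: Sequence[float], expected: float,
                required_months: int = 3) -> bool:
    """True if the supplier met or exceeded the Expected level in each of the
    first `required_months` months following the failed month."""
    window = scores_after_failure[:required_months]
    return len(window) == required_months and all(s >= expected for s in window)


if __name__ == "__main__":
    print(earned_back([98.0, 99.0, 98.5], expected=98.0))  # all three months meet Expected
    print(earned_back([98.0, 97.0, 99.0], expected=98.0))  # second month misses Expected
```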

 

“Burn-in” and Baselining: Some service levels can be effective on the Commencement Date, e.g., Attrition of Key Personnel, because they are not dependent on the customer’s environment. Other, more operationally and environmentally sensitive service levels can be subject to a baselining period and a “burn in” period, depending on the maturity of the customer’s environment. Where a customer can demonstrate that the customer has been meeting a specified SLA for the previous 6-12 months (depending on the SLA, seasonality of the customer’s environment, etc.), the supplier will typically have a 3-6 month “burn in” period beginning on the Commencement Date. Where the customer cannot demonstrate that the customer has been meeting a specified SLA during such a period, the parties will typically agree to a 6-12 month baselining period beginning on the Commencement Date, which is sometimes followed by a 3-6 month “burn in” period once the appropriate level is agreed. 

 

Service level measurements, when designed and implemented correctly, protect the investment made by the client, hold the vendor accountable in a measurable way, and provide the means to penalize or incentivize the delivery of the services.