Failover Plugin v2

The AWS Advanced JDBC Wrapper uses the Failover Plugin v2 to provide minimal downtime in the event of a DB instance failure. The plugin is the next version (v2) of the Failover Plugin and unless explicitly stated otherwise, most of the information and suggestions for the Failover Plugin are applicable to the Failover Plugin v2.

Plugin Availability

The plugin is available since version 2.4.0.

Differences between the Failover Plugin and the Failover Plugin v2

The Failover Plugin performs a failover process for each DB connection. Each failover process is triggered independently and is unrelated to failover processes in other connections. While such independence between failover processes has some benefits, it also consumes additional resources, such as extra threads. If dozens of DB connections fail over at the same time, this may place significant load on the client environment.

Picture 1. Each connection triggers its own failover process to detect a new writer.

If a connection needs to get the latest topology, it calls RdsHostListProvider. It should be noted that RdsHostListProvider runs in the same thread as a connection failover process. As shown in Picture 1 above, different connections start and end their failover processes independently.

The Failover Plugin v2 uses an optimized approach where the process of detecting and confirming a cluster topology is delegated to a central topology monitoring component that runs in a separate thread. When the topology is confirmed and a new writer is detected, each waiting connection can resume and reconnect to a required node. This design helps minimize resources required for failover processing and scales better compared to the Failover Plugin.

Picture 2. Connections call MonitoringRdsHostListProvider, which is responsible for detecting the new writer. While waiting for MonitoringRdsHostListProvider, connection threads suspend.

If two connections encounter communication issues with their internal (physical) DB connections, each connection may send a request to the topology monitoring component (MonitoringRdsHostListProvider in Picture 2) for updated topology information reflecting the new writer. Both connections are notified as soon as the latest topology is available. Connection threads can resume, continue with their suspended workflows, and reconnect to a reader or a writer node as needed.
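
The shared-wait behaviour described above can be illustrated with a minimal, self-contained sketch. It does not use the wrapper's classes: `TopologyMonitor`, `requestNewWriter`, the 200 ms delay, and the `instance-2` result are all hypothetical stand-ins, and the sketch handles a single failover event only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

public class SharedTopologyMonitorSketch {

    /** Hypothetical stand-in for the central topology monitoring component. */
    static class TopologyMonitor {
        private final ExecutorService monitorThread = Executors.newSingleThreadExecutor();
        private final AtomicReference<CompletableFuture<String>> pending = new AtomicReference<>();
        final AtomicInteger detectionRuns = new AtomicInteger();

        /** Callers arriving during the same failover share one pending result. */
        CompletableFuture<String> requestNewWriter() {
            CompletableFuture<String> fresh = new CompletableFuture<>();
            if (pending.compareAndSet(null, fresh)) {
                // Only the first caller starts detection, on the monitor's own thread.
                monitorThread.submit(() -> {
                    detectionRuns.incrementAndGet();
                    sleepQuietly(200);            // simulated time to confirm the topology
                    fresh.complete("instance-2"); // simulated new writer
                });
                return fresh;
            }
            // Later callers reuse the result that is already being computed.
            return pending.get();
        }

        void shutdown() {
            monitorThread.shutdown();
        }
    }

    static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Two simulated connections wait on the same monitor; returns what each one saw. */
    static List<String> runDemo(TopologyMonitor monitor) {
        CompletableFuture<String> conn1 =
            CompletableFuture.supplyAsync(() -> monitor.requestNewWriter().join());
        CompletableFuture<String> conn2 =
            CompletableFuture.supplyAsync(() -> monitor.requestNewWriter().join());
        return Arrays.asList(conn1.join(), conn2.join());
    }

    public static void main(String[] args) {
        TopologyMonitor monitor = new TopologyMonitor();
        System.out.println("Connections resumed with writer: " + runDemo(monitor));
        System.out.println("Detection runs: " + monitor.detectionRuns.get());
        monitor.shutdown();
    }
}
```

Both simulated connections block until the single detection run completes, which is the resource saving the design aims for: the number of detection runs does not grow with the number of waiting connections.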

The topology monitoring component mentioned above (MonitoringRdsHostListProvider) updates the topology periodically. It usually fetches the cluster topology over a connection to the writer node, which provides first-hand topology and avoids the risk of stale data that comes with fetching it from a reader. In some exceptional cases the monitoring component may temporarily use a reader connection to fetch the topology; however, it switches back to a writer node as soon as possible.

Picture 3. MonitoringRdsHostListProvider detects a new writer by establishing connections to nodes in separate threads.

When the cluster topology needs to be confirmed, the monitoring component opens new threads, one for each node (see Picture 3). Each of these threads tries to connect to a node and checks if the node is a writer. When Aurora failover occurs, the new writer node is the first node to reflect the true topology of the cluster. Other nodes connect to the new writer shortly after and update their local copies of the topology. Topology information acquired from a reader node may be outdated/inaccurate for a short period after failover. You can see a typical example of stale topology in the diagram above: thread instance-3, box Topology, to the right. The stale topology incorrectly shows that instance-3 is still a writer.
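
The per-node probing can be sketched as follows. This is an illustration under stated assumptions, not the plugin's implementation: `RoleProbe` and `findWriter` are hypothetical names, and the probe is simulated instead of opening a real connection. Each task asks a node about its own role, so the answer does not depend on a possibly stale topology held by another node.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WriterProbeSketch {

    /** Hypothetical probe: a real one would connect to the host and ask for its role. */
    interface RoleProbe {
        boolean isWriter(String host);
    }

    /** Probes every host in its own thread and returns a host that reports itself as writer, or null. */
    static String findWriter(List<String> hosts, RoleProbe probe) {
        if (hosts.isEmpty()) {
            return null;
        }
        ExecutorService pool = Executors.newFixedThreadPool(hosts.size());
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String host : hosts) {
                tasks.add(() -> {
                    if (probe.isWriter(host)) {
                        return host;
                    }
                    // A failed task is skipped by invokeAny, so readers simply drop out.
                    throw new IllegalStateException(host + " is not a writer");
                });
            }
            // Returns the result of a task that completed successfully.
            return pool.invokeAny(tasks);
        } catch (ExecutionException e) {
            return null; // no node reported itself as writer
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } finally {
            pool.shutdownNow(); // stop the remaining probe threads
        }
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList("instance-1", "instance-2", "instance-3");
        // Simulated cluster state after failover: instance-2 is the new writer.
        System.out.println(findWriter(hosts, host -> host.equals("instance-2")));
    }
}
```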

The threads monitoring the topology stop when a new writer is detected. For 30 seconds after a new writer is detected (and after all waiting connections have been notified), topology continues to be updated at an increased rate. This allows time for all readers to appear in the topology, since 30 seconds is usually enough time for cluster failover to complete and cluster topology to stabilize.
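
The two refresh rates can be expressed as a small decision function. This is only a sketch of the rule described above: the method name and the negative "no failover yet" marker are hypothetical, and the constants are the default values of clusterTopologyRefreshRateMs and clusterTopologyHighRefreshRateMs listed in the configuration parameters.

```java
public class RefreshRateSketch {

    static final long REGULAR_RATE_MS = 30_000;     // clusterTopologyRefreshRateMs default
    static final long HIGH_RATE_MS = 100;           // clusterTopologyHighRefreshRateMs default
    static final long HIGH_RATE_WINDOW_MS = 30_000; // 30 seconds after a new writer is detected

    /**
     * Returns the delay before the next topology refresh.
     * A negative writerDetectedAtMs means no new writer has been detected yet (assumption).
     */
    static long nextRefreshDelayMs(long nowMs, long writerDetectedAtMs) {
        boolean inHighRateWindow =
            writerDetectedAtMs >= 0 && nowMs - writerDetectedAtMs < HIGH_RATE_WINDOW_MS;
        return inHighRateWindow ? HIGH_RATE_MS : REGULAR_RATE_MS;
    }

    public static void main(String[] args) {
        System.out.println(nextRefreshDelayMs(10_000, 0));  // 10 s after detection: high rate
        System.out.println(nextRefreshDelayMs(40_000, 0));  // 40 s after detection: regular rate
        System.out.println(nextRefreshDelayMs(40_000, -1)); // no failover: regular rate
    }
}
```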

All improvements mentioned above help the Failover Plugin v2 to operate with improved performance and less demand for resources.

A summary of the key differences between the failover and failover2 plugins is outlined below.

With the failover plugin:

  • Each connection performs its own failover process.
  • Each connection fetches topology by calling the RdsHostListProvider in the same thread.
  • Topology may be fetched from a reader node and it may be stale.

With the failover2 plugin:

  • Each connection delegates detection of the new writer to the MonitoringRdsHostListProvider (which runs in its own thread) and suspends until the new writer is confirmed.
  • The MonitoringRdsHostListProvider tries to connect to every cluster node in parallel.
  • The MonitoringRdsHostListProvider uses an "Am I a writer?" approach to avoid reliance on stale topology.
  • The MonitoringRdsHostListProvider continues topology monitoring at an increased rate to ensure all cluster nodes appear in the topology.

Using the Failover Plugin v2

The Failover Plugin v2 will be enabled by default if the wrapperPlugins value is not specified. If you would like to override the default plugins, you can explicitly include the failover plugin v2 in your list of plugins by adding the plugin code failover2 to the wrapperPlugins value, or by adding it to the current driver profile. After you load the plugin, the failover feature will be enabled.
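
A minimal sketch of setting the plugin code explicitly through connection properties is shown below. The class and method names and the credentials are placeholders; the only names taken from this document are the wrapperPlugins property and the failover2 plugin code. The sketch builds the properties only and does not open a connection.

```java
import java.util.Properties;

public class Failover2PropertiesSketch {

    /** Builds connection properties that explicitly request the failover2 plugin. */
    static Properties failover2Properties(String user, String password) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        // Overrides the default plugin list with the Failover Plugin v2 only.
        props.setProperty("wrapperPlugins", "failover2");
        return props;
    }

    public static void main(String[] args) {
        Properties props = failover2Properties("example_user", "example_password");
        // With the wrapper on the classpath, these properties would be passed to
        // DriverManager.getConnection(url, props) together with a jdbc:aws-wrapper: URL.
        System.out.println(props.getProperty("wrapperPlugins")); // prints "failover2"
    }
}
```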

Please refer to the failover configuration guide for tips to keep in mind when using the failover plugin.

Warning

Do not use the gdbFailover, failover and/or failover2 plugins (or their combination) at the same time for the same connection!

Verify plugin compatibility within your driver configuration using the compatibility guide.

Failover Plugin v2 Configuration Parameters

In addition to the parameters that you can configure for the underlying driver, you can pass the following parameters for the AWS Advanced JDBC Wrapper through the connection URL to specify additional failover behavior.

| Parameter | Value | Required | Description | Default Value |
| --- | --- | --- | --- | --- |
| failoverMode | String | No | Defines a mode for failover process. Failover process may prioritize nodes with different roles and connect to them. Possible values: strict-writer (failover process follows writer node and connects to a new writer when it changes); reader-or-writer (during failover, the driver tries to connect to any available/accessible reader node. If no reader is available, the driver will connect to a writer node. This logic mimics the logic of the Aurora read-only cluster endpoint); strict-reader (during failover, the driver tries to connect to any available reader node. If no reader is available, the driver raises an error. Reader failover to a writer node will only be allowed for single-node clusters. This logic mimics the logic of the Aurora read-only cluster endpoint). | Default value depends on the connection URL. For Aurora read-only cluster endpoint, it's set to reader-or-writer. Otherwise, it's strict-writer. |
| clusterInstanceHostPattern | String | If connecting using an IP address or custom domain URL: Yes. Otherwise: No | This parameter is not required unless connecting to an AWS RDS cluster via an IP address or custom domain URL. In those cases, this parameter specifies the cluster instance DNS pattern that will be used to build a complete instance endpoint. A "?" character in this pattern should be used as a placeholder for the DB instance identifiers of the instances in the cluster. See here for more information. Example: ?.my-domain.com, any-subdomain.?.my-domain.com:9999. Use case example: if your cluster instance endpoints follow this pattern: instanceIdentifier1.customHost, instanceIdentifier2.customHost, etc. and you want your initial connection to be to customHost:1234, then your connection string should look like this: jdbc:aws-wrapper:mysql://customHost:1234/test?clusterInstanceHostPattern=?.customHost | If the provided connection string is not an IP address or custom domain, the JDBC Driver will automatically acquire the cluster instance host pattern from the customer-provided connection string. |
| globalClusterInstanceHostPatterns | String | For Global Databases: Yes. Otherwise: No | This parameter is similar to the clusterInstanceHostPattern parameter but it provides a comma-separated list of instance host patterns. This parameter is required for Aurora Global Databases. The list should contain host patterns for each region of the global database. Each host pattern can be based on an RDS instance endpoint or a custom user domain name. If a custom domain name is used, the instance template pattern should be prefixed with the AWS region name in square brackets ([<aws-region-name>]). The parameter is ignored for other types of databases (Aurora Clusters, RDS Clusters, plain RDS databases, etc.). Example: for an Aurora Global Database with two AWS regions us-east-2 and us-west-2, the parameter value should be set to ?.XYZ1.us-east-2.rds.amazonaws.com,?.XYZ2.us-west-2.rds.amazonaws.com. Please note that user identifiers are different for different AWS regions (XYZ1 and XYZ2 in the example above). Example: if using custom domain names, the parameter value should be similar to [us-east-2]?.customHost,[us-west-2]?.anotherCustomHost. The port can also be provided: [us-east-2]?.customHost:8888,[us-west-2]?.anotherCustomHost:9999. For complete Aurora Global Database configuration, see Aurora Global Databases. | |
| clusterTopologyRefreshRateMs | Integer | No | Cluster topology refresh rate in milliseconds when a cluster is not in failover. It refers to the regular, slow monitoring rate explained above. | 30000 |
| failoverTimeoutMs | Integer | No | Maximum allowed time in milliseconds to attempt reconnecting to a new writer or reader instance after a cluster failover is initiated. | 300000 |
| clusterTopologyHighRefreshRateMs | Integer | No | Interval of time in milliseconds to wait between attempts to update cluster topology after the writer has come back online following a failover event. It corresponds to the increased monitoring rate described earlier. Usually, the topology monitoring component uses this increased monitoring rate for 30s after a new writer was detected. | 100 |
| failoverReaderHostSelectorStrategy | String | No | Strategy used to select a reader node during failover. For more information on the available reader selection strategies, see this table. | random |
| clusterId | String | If connecting to multiple database clusters within a single application: Yes. Otherwise: No | A unique identifier for the cluster. Connections with the same cluster id share a cluster topology cache. This parameter is optional and defaults to 1. When supporting multiple database clusters, this parameter becomes mandatory. Each connection string must include the clusterId parameter with a value that can be any number or string. However, all connection strings associated with the same database cluster must use identical clusterId values, while connection strings belonging to different database clusters must specify distinct values. Examples of value: 1, 2, 1234, abc-1, abc-2. Please see clusterId documentation for more information. | 1 |
| telemetryFailoverAdditionalTopTrace | Boolean | No | Allows the driver to produce an additional telemetry span associated with failover. Such span helps to facilitate telemetry analysis in AWS CloudWatch. | false |
| skipFailoverOnInterruptedThread | Boolean | No | Enable to skip failover if the current thread is interrupted. This may leave the Connection in an invalid state so the Connection should be disposed. | false |
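
As a rough illustration of passing these parameters through the connection URL, the sketch below assembles a URL string from a parameter map. The helper `buildUrl` and the clusterId value `cluster-a` are hypothetical; the host, port, database, and clusterInstanceHostPattern value come from the use case example in the table. Values are appended without URL encoding, mirroring that example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class Failover2UrlSketch {

    /** Joins the parameters into a query string and appends it to a wrapper URL. */
    static String buildUrl(String hostAndPort, String database, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            query.add(entry.getKey() + "=" + entry.getValue());
        }
        return "jdbc:aws-wrapper:mysql://" + hostAndPort + "/" + database + "?" + query;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("wrapperPlugins", "failover2");
        params.put("failoverMode", "strict-writer");
        params.put("failoverTimeoutMs", "300000");
        params.put("clusterInstanceHostPattern", "?.customHost");
        params.put("clusterId", "cluster-a"); // hypothetical id; must differ per database cluster
        System.out.println(buildUrl("customHost:1234", "test", params));
    }
}
```

When an application connects to more than one cluster, each cluster's connection strings would carry a distinct clusterId value, while all connection strings for the same cluster share one value.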

Please refer to the original Failover Plugin for more details about error codes, configuration, connection pooling, and sample code.

Sample Code

PostgreSQL Failover Sample Code

This sample code uses the original failover plugin, but it can also be used with the failover2 plugin. Configuration parameters should be adjusted in accordance with the table above.