Skip to content

PT-2534 - fixed connection leak due to confusion on parent flag#1123

Open
marcelohpf wants to merge 2 commits into
percona:3.xfrom
marcelohpf:connection-leak-on-get-replicas
Open

PT-2534 - fixed connection leak due to confusion on parent flag#1123
marcelohpf wants to merge 2 commits into
percona:3.xfrom
marcelohpf:connection-leak-on-get-replicas

Conversation

@marcelohpf
Copy link
Copy Markdown
Contributor

Introduces a new argument to Cxn to flag that connection should not be closed automatically with InactiveDestroy, while keep the parent argument for its current usage of topology parent (or connection source parent)

The code had two different purposes to the same argument "parent" - topology parent and process parent in fork scenarios leading to a leak when get_replicas was called multiple times (e.g. when waiting for replication delays).

None of the current percona toolkit bin/ scripts seems to use forking other than deadlock detection. None of the "parent => X" cases were properly used in a multiprocess scenario, even when daemonizing the tool, connections were created after the fork;

You can reproduce it by:

  • create large enough table;
  • start dev environment mysqls
  • trigger OSC:
./bin/pt-online-schema-change \
  'D=mysqlslap,t=t1' --alter "ADD COLUMN mycolumn varchar(255) NOT NULL DEFAULT ''" \
  --execute --no-version-check --statistics --no-check-alter --recurse 2 --max-lag 10 \
  --chunk-size 10 --progress time,30  \
  --recursion-method dsn=D=percona,t=dsns \ # important
  --no-check-replication-filters \
  --database mysqlslap --host 127.0.0.1 \
  --user msandbox --password msandbox --port 12345

Then connect to one of the read-replicas and watch process list grow indefinitely:

while true; do 
  mysql -h 127.0.0.1 -P 12346 -u msandbox -pmsandbox -e 'show processlist;'
  sleep 1;
done

The root cause of the issue is that InactiveDestroy prevent dbh closing connection as part of the DESTROY workflow that is triggered when object looses reference count.

# other code
my $dbh;
{
  $dbh = DBI::connect();
   $dbh -> {InactiveDestroy} = 1;
  fork();
  if (is children) {
     $dbh->ping();  # ok
     exit;
  }
}
$dbh->ping() # should not fail because InactiveDestroy = 1; else it would
  • The contributed code is licensed under GPL v2.0
  • Contributor Licence Agreement (CLA) is signed
  • util/update-modules has been ran
  • Documentation updated
  • Test suite update

@marcelohpf marcelohpf requested a review from svetasmirnova as a code owner May 8, 2026 07:23
@it-percona-cla
Copy link
Copy Markdown

it-percona-cla commented May 8, 2026

CLA assistant check
All committers have signed the CLA.

@sleto-it sleto-it changed the title PT-XXXX- fixed connection leak due to confusion on parent flag PT-2534- fixed connection leak due to confusion on parent flag May 11, 2026
@sleto-it sleto-it changed the title PT-2534- fixed connection leak due to confusion on parent flag PT-2534 - fixed connection leak due to confusion on parent flag May 11, 2026
Copy link
Copy Markdown
Contributor

@EvgeniyPatlan EvgeniyPatlan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest some changes. They are for test scenario but still it is netter to follow perl recommendations

Comment thread t/lib/Cxn.t Outdated
Comment thread t/lib/Cxn.t Outdated
Comment thread t/lib/Cxn.t Outdated
Comment thread t/lib/Cxn.t Outdated
Comment thread t/lib/Cxn.t Outdated
@marcelohpf marcelohpf force-pushed the connection-leak-on-get-replicas branch from 34e5001 to d9a31b9 Compare May 12, 2026 08:19
Copy link
Copy Markdown
Contributor

@EvgeniyPatlan EvgeniyPatlan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Introduces a new argument to Cxn to flag that connection should not be closed automatically with InactiveDestroy. The code had two different purposes to the same argument - replication topology parent and processes parent in fork scenarios leading to a leak.
@marcelohpf marcelohpf force-pushed the connection-leak-on-get-replicas branch from d9a31b9 to 5a9d429 Compare May 12, 2026 09:55
@marcelohpf
Copy link
Copy Markdown
Contributor Author

marcelohpf commented May 12, 2026

I had to update the commit author due to the CLA agreement, sorry for the force push;

@EvgeniyPatlan
Copy link
Copy Markdown
Contributor

thanks you @marcelohpf for the contribution and for the quick fixes!

@charles-001
Copy link
Copy Markdown

charles-001 commented May 20, 2026

I'm pretty sure this is the issue I'm currently running into with the latest version (3.7.1). While pt-osc is running, it's creating a ton of connections on the replica to check its lag (like 1000+). Eventually, it hits max_user_connections and fails. Seems like a regression happened in 3.7.1

Cannot connect to MySQL: DBI connect(';mysql_read_default_file=~/.my.cnf;host=host;mysql_read_default_group=client','',...) failed: User my_user already has more than 'max_user_connections' active connections at /usr/bin/pt-online-schema-change line 2351.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants