Loadbalancing of PostgreSQL databases using pgpool-II and repmgr
Loadbalancing of PostgreSQL databases using pgpool-II and repmgr
Original post from linux.xvx.cz
I had to solve the PostgreSQL HA and Redundancy a few weeks ago. A lot has been written about this topic, but I was not able to find a guide describing pgpool-II and repmgr. After reading some documents I built the solution which I’m going to describe.
In short it contains the Master/Slave DB Streaming replication and pgpool load distribution and HA. The replication “part” is managed by repmgr.
Here is the network diagram:
Master PostgreSQL database installation -
cz01-psql01:1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146
#PostgreSQL installation yum localinstall -y http://yum.postgresql.org/9.3/redhat/rhel-6-x86_64/pgdg-redhat93-9.3-1.noarch.rpm yum install -y postgresql93-server repmgr yum install -y --enablerepo=centos-base postgresql93-contrib service postgresql-9.3 initdb chkconfig postgresql-9.3 on sed -i.orig \ -e "s/^#listen_addresses = 'localhost'/listen_addresses = '*'/" \ -e "s/^#shared_preload_libraries = ''/shared_preload_libraries = 'repmgr_funcs'/" \ -e "s/^#wal_level = minimal/wal_level = hot_standby/" \ -e "s/^#archive_mode = off/archive_mode = on/" \ -e "s@^#archive_command = ''@archive_command = 'cd .'@" \ -e "s/^#max_wal_senders = 0/max_wal_senders = 1/" \ -e "s/^#wal_keep_segments = 0/wal_keep_segments = 5000/" \ -e "s/^#\(wal_sender_timeout =.*\)/\1/" \ -e "s/^#hot_standby = off/hot_standby = on/" \ -e "s/^#log_min_duration_statement = -1/log_min_duration_statement = 0/" \ -e "s/^log_line_prefix = '< %m >'/log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d '/" \ -e "s/^#log_checkpoints =.*/log_checkpoints = on/" \ -e "s/^#log_connections =.*/log_connections = on/" \ -e "s/^#log_disconnections =.*/log_disconnections = on/" \ -e "s/^#log_lock_waits = off/log_lock_waits = on/" \ -e "s/^#log_statement = 'none'/log_statement = 'all'/" \ -e "s/^#log_temp_files = -1/log_temp_files = 0/" \ /var/lib/pgsql/9.3/data/postgresql.conf cat >> /var/lib/pgsql/9.3/data/pg_hba.conf << EOF host all admin 0.0.0.0/0 md5 host all all 10.32.243.0/24 md5 # cz01-psql01 host repmgr repmgr 10.32.243.147/32 trust host replication repmgr 10.32.243.147/32 trust # cz01-psql02 host repmgr repmgr 10.32.243.148/32 trust host replication repmgr 10.32.243.148/32 trust EOF for SERVER in cz01-psql01 cz01-psql02 cz01-pgpool-ha cz01-pgpool01 cz01-pgpool02; do echo "$SERVER.example.com:5432:postgres:admin:password123" >> ~/.pgpass echo "$SERVER.example.com:5432:repmgr:repmgr:repmgr_password" >> ~/.pgpass done chmod 0600 ~/.pgpass cp ~/.pgpass /var/lib/pgsql/ #Configure repmgr mkdir /var/lib/pgsql/repmgr cat > /var/lib/pgsql/repmgr/repmgr.conf << EOF cluster=pgsql_cluster node=1 node_name=cz01-psql01 conninfo='host=cz01-psql01.example.com user=repmgr dbname=repmgr' pg_bindir=/usr/pgsql-9.3/bin/ master_response_timeout=5 reconnect_attempts=2 reconnect_interval=2 failover=manual promote_command='/usr/pgsql-9.3/bin/repmgr standby promote -f /var/lib/pgsql/repmgr/repmgr.conf' follow_command='/usr/pgsql-9.3/bin/repmgr standby follow -f /var/lib/pgsql/repmgr/repmgr.conf' EOF cp -r /root/.ssh /var/lib/pgsql/ chown -R postgres:postgres /var/lib/pgsql/.ssh /var/lib/pgsql/.pgpass /var/lib/pgsql/repmgr echo "PATH=/usr/pgsql-9.3/bin:$PATH" >> /var/lib/pgsql/.bash_profile service postgresql-9.3 start #Add users sudo -u postgres psql -c "CREATE ROLE admin SUPERUSER CREATEDB CREATEROLE INHERIT REPLICATION LOGIN ENCRYPTED PASSWORD 'password123';" sudo -u postgres psql -c "CREATE USER repmgr SUPERUSER LOGIN ENCRYPTED PASSWORD 'repmgr_password';" sudo -u postgres psql -c "CREATE DATABASE repmgr OWNER repmgr;" #Register DB instance as master su - postgres -c "repmgr -f /var/lib/pgsql/repmgr/repmgr.conf --verbose master register" #Configure SSL Layer for PostgreSQL # shellcheck disable=SC2016 sed -i.orig \ -e 's@\$dir/cacert.pem@\$dir/example.com-ca.crt @' \ -e 's@\$dir/crl.pem@\$dir/example.com-ca.crl @' \ -e 's@\$dir/private/cakey.pem@\$dir/private/example.com-ca.key @' \ -e 's/^\(crlnumber\)/#\1/' \ -e 's/= XX/= CZ/' \ -e 's/^#\(stateOrProvinceName_default.*\) Default Province/\1 Czech Republic/' \ -e 's/= Default City/= Brno/' \ -e 's/= Default Company Ltd/= Example, Inc\./' \ -e 's/= policy_match/= policy_anything/' \ -e 's/^#\(unique_subject\)/\1/' /etc/pki/tls/openssl.cnf touch /etc/pki/CA/index.txt echo 01 > /etc/pki/CA/serial cd /etc/pki/CA || exit # Private key for CA ( umask 077 openssl genrsa -passout pass:password123 -out private/example.com-ca.key 1024 openssl pkey -text -passout pass:password123 -in private/example.com-ca.key > private/example.com-ca.key.info ) SUBJ=" C=CZ ST=Czech Republic O=Example, Inc. localityName=Brno commonName=example.com Certificate Authority " openssl req -passin pass:password123 -subj "$(echo -n "$SUBJ" | tr "\n" "/")" -new -x509 -key private/example.com-ca.key -days 3650 -out example.com-ca.crt openssl x509 -noout -text -in example.com-ca.crt > example.com-ca.crt.info # cz01-psql01 Certificate openssl genrsa -passout pass:password123 -des3 -out cz01-psql01.example.com_priv_encrypted.key 2048 openssl rsa -passin pass:password123 -in cz01-psql01.example.com_priv_encrypted.key -out cz01-psql01.example.com_priv.key SUBJ=" C=CZ ST=Czech Republic O=Example OU=Deployment L=Brno CN=cz01-psql01.example.com emailAddress=root@example.com " openssl req -passin pass:password123 -new -subj "$(echo -n "$SUBJ" | tr "\n" "/")" -days 3650 -key cz01-psql01.example.com_priv_encrypted.key -out cz01-psql01.example.com.csr openssl ca -passin pass:password123 -batch -in cz01-psql01.example.com.csr -out cz01-psql01.example.com.crt openssl x509 -noout -text -in cz01-psql01.example.com.crt > cz01-psql01.example.com.crt.info cp /etc/pki/CA/cz01-psql01.example.com.crt /var/lib/pgsql/9.3/server.crt cp /etc/pki/CA/cz01-psql01.example.com_priv.key /var/lib/pgsql/9.3/server.key chown postgres:postgres /var/lib/pgsql/9.3/server.* chmod 0600 /var/lib/pgsql/9.3/server.key sed -i \ -e "s/#ssl = off/ssl = on/" \ -e "s@#ssl_cert_file = 'server.crt'@ssl_cert_file = '../server.crt'@" \ -e "s@#ssl_key_file = 'server.key'@ssl_key_file = '../server.key'@" \ /var/lib/pgsql/9.3/data/postgresql.conf service postgresql-9.3 restart # Quick Test export PGSSLMODE=require psql --host cz01-psql01.example.com --username=fuzeme --dbname=fuzers -w -l
Slave PostgreSQL database installation -
cz01-psql02:1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
#PostgreSQL installation yum localinstall -y http://yum.postgresql.org/9.3/redhat/rhel-6-x86_64/pgdg-redhat93-9.3-1.noarch.rpm yum install -y postgresql93-server repmgr yum install -y --enablerepo=centos-base postgresql93-contrib chkconfig postgresql-9.3 on echo "PATH=/usr/pgsql-9.3/bin:$PATH" >> /var/lib/pgsql/.bash_profile scp -r cz01-psql01.example.com:/root/{.pgpass,.ssh} /root/ cp -r /root/{.pgpass,.ssh} /var/lib/pgsql/ chown -R postgres:postgres /var/lib/pgsql/.pgpass /var/lib/pgsql/.ssh #Check the connection to primary node su - postgres -c "psql --username=repmgr --dbname=repmgr --host cz01-psql01.example.com -w -l" #Replicate the DB from the master mode su - postgres -c "repmgr -D /var/lib/pgsql/9.3/data -d repmgr -p 5432 -U repmgr -R postgres --verbose standby clone cz01-psql01.example.com" #Configure the repmgr mkdir /var/lib/pgsql/repmgr cat > /var/lib/pgsql/repmgr/repmgr.conf << EOF cluster=pgsql_cluster node=2 node_name=cz01-psql02 conninfo='host=cz01-psql02.example.com user=repmgr dbname=repmgr' pg_bindir=/usr/pgsql-9.3/bin/ master_response_timeout=5 reconnect_attempts=2 reconnect_interval=2 failover=manual promote_command='/usr/pgsql-9.3/bin/repmgr standby promote -f /var/lib/pgsql/repmgr/repmgr.conf' follow_command='/usr/pgsql-9.3/bin/repmgr standby follow -f /var/lib/pgsql/repmgr/repmgr.conf' EOF chown -R postgres:postgres /var/lib/pgsql/repmgr # cz01-psql02 Certificate cd /etc/pki/CA || exit openssl genrsa -passout pass:password123 -des3 -out cz01-psql02.example.com_priv_encrypted.key 2048 openssl rsa -passin pass:password123 -in cz01-psql02.example.com_priv_encrypted.key -out cz01-psql02.example.com_priv.key SUBJ=" C=CZ ST=Czech Republic O=Example OU=Deployment L=Brno CN=cz01-psql02.example.com emailAddress=root@example.com " openssl req -passin pass:password123 -new -subj "$(echo -n "$SUBJ" | tr "\n" "/")" -days 3650 -key cz01-psql02.example.com_priv_encrypted.key -out cz01-psql02.example.com.csr scp /etc/pki/CA/cz01-psql02.example.com.csr root@cz01-psql01.example.com:/etc/pki/CA/ ssh root@cz01-psql01.example.com << EOF cd /etc/pki/CA openssl ca -passin pass:password123 -batch -in cz01-psql02.example.com.csr -out cz01-psql02.example.com.crt openssl x509 -noout -text -in cz01-psql02.example.com.crt > cz01-psql02.example.com.crt.info EOF scp root@cz01-psql01.example.com:/etc/pki/CA/cz01-psql02.example.com.crt /etc/pki/CA/ cp /etc/pki/CA/cz01-psql02.example.com.crt /var/lib/pgsql/9.3/server.crt cp /etc/pki/CA/cz01-psql02.example.com_priv.key /var/lib/pgsql/9.3/server.key chown postgres:postgres /var/lib/pgsql/9.3/server.* chmod 0600 /var/lib/pgsql/9.3/server.key service postgresql-9.3 start #Register the DB instance as slave su - postgres -c "repmgr -f /var/lib/pgsql/repmgr/repmgr.conf --verbose standby register"
pgpool server installation (common for primary/secondary node) -
cz01-pgpool0{1,2}:1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
#pgpool installation yum localinstall -y http://yum.postgresql.org/9.3/redhat/rhel-6-x86_64/pgdg-redhat93-9.3-1.noarch.rpm yum install -y pgpool-II-93 postgresql93 scp -r cz01-psql01.example.com:/root/{.ssh,.pgpass} /root/ scp cz01-psql01.example.com:/root/.pgpass /root/ cp /etc/pgpool-II-93/pcp.conf.sample /etc/pgpool-II-93/pcp.conf echo "admin:$(pg_md5 password123)" >> /etc/pgpool-II-93/pcp.conf sed \ -e "s/^listen_addresses = .localhost./listen_addresses = '*'/" \ -e "s/^log_destination = .stderr./log_destination = 'syslog'/" \ -e "s/^port = .*/port = 5432/" \ -e "s/^backend_hostname0 =.*/backend_hostname0 = 'cz01-psql01.example.com'/" \ -e "s/^#backend_flag0/backend_flag0/" \ -e "s/^#backend_hostname1 =.*/backend_hostname1 = 'cz01-psql02.example.com'/" \ -e "s/^#backend_port1 = 5433/backend_port1 = 5432/" \ -e "s/^#backend_weight1/backend_weight1/" \ -e "s/^#backend_data_directory1 =.*/backend_data_directory1 = '\/var\/lib\/pgsql\/9.3\/data'/" \ -e "s/^#backend_flag1/backend_flag1/" \ -e "s/^log_hostname =.*/log_hostname = on/" \ -e "s/^syslog_facility =.*/syslog_facility = 'daemon.info'/" \ -e "s/^sr_check_user =.*/sr_check_user = 'admin'/" \ -e "s/^sr_check_password =.*/sr_check_password = 'password123'/" \ -e "s/^health_check_period =.*/health_check_period = 10/" \ -e "s/^health_check_user =.*/health_check_user = 'admin'/" \ -e "s/^health_check_password =.*/health_check_password = 'password123'/" \ -e "s/^use_watchdog =.*/use_watchdog = on/" \ -e "s/^delegate_IP =.*/delegate_IP = '10.32.243.250'/" \ -e "s/^netmask 255.255.255.0/netmask 255.255.255.128/" \ -e "s/^heartbeat_device0 =.*/heartbeat_device0 = 'eth0'/" \ -e "s/^#other_pgpool_port0 =.*/other_pgpool_port0 = 5432/" \ -e "s/^#other_wd_port0 = 9000/other_wd_port0 = 9000/" \ -e "s/^load_balance_mode = off/load_balance_mode = on/" \ -e "s/^master_slave_mode = off/master_slave_mode = on/" \ -e "s/^master_slave_sub_mode =.*/master_slave_sub_mode = 'stream'/" \ -e "s@^failover_command = ''@failover_command = '/etc/pgpool-II-93/failover_stream.sh %d %H'@" \ -e "s/^recovery_user = 'nobody'/recovery_user = 'admin'/" \ -e "s/^recovery_password = ''/recovery_password = 'password123'/" \ -e "s/^recovery_1st_stage_command = ''/recovery_1st_stage_command = 'basebackup.sh'/" \ -e "s/^sr_check_period = 0/sr_check_period = 10/" \ -e "s/^delay_threshold = 0/delay_threshold = 10000000/" \ -e "s/^log_connections = off/log_connections = on/" \ -e "s/^log_statement = off/log_statement = on/" \ -e "s/^log_per_node_statement = off/log_per_node_statement = on/" \ -e "s/^log_standby_delay = 'none'/log_standby_delay = 'always'/" \ -e "s/^enable_pool_hba = off/enable_pool_hba = on/" \ /etc/pgpool-II-93/pgpool.conf.sample > /etc/pgpool-II-93/pgpool.conf cat > /etc/pgpool-II-93/failover_stream.sh << \EOF #!/bin/sh # Failover command for streaming replication. # # Arguments: $1: failed node id. $2: new master hostname. failed_node=$1 new_master=$2 ( date echo "Failed node: $failed_node" set -x # Promote standby/slave to be a new master (old master failed) /usr/bin/ssh -T -l postgres $new_master "/usr/pgsql-9.3/bin/repmgr -f /var/lib/pgsql/repmgr/repmgr.conf standby promote 2>/dev/null 1>/dev/null <&-" exit 0; ) 2>&1 | tee -a /tmp/failover_stream.sh.log EOF chmod 755 /etc/pgpool-II-93/failover_stream.sh cp /etc/pgpool-II-93/pool_hba.conf.sample /etc/pgpool-II-93/pool_hba.conf echo "host all all 0.0.0.0/0 md5" >> /etc/pgpool-II-93/pool_hba.conf mkdir -p /var/lib/pgsql/9.3/data groupadd -g 26 -o -r postgres useradd -M -n -g postgres -o -r -d /var/lib/pgsql -s /bin/bash -c "PostgreSQL Server" -u 26 postgres cp -R /root/.ssh /var/lib/pgsql/ sed -i '/^User /d' /var/lib/pgsql/.ssh/config pg_md5 -m -u admin password123 chown -R postgres:postgres /var/lib/pgsql /etc/pgpool-II-93/pool_passwd chmod 6755 /sbin/ifconfig chmod 6755 /sbin/arping chkconfig pgpool-II-93 on
Primary pgpool server installation -
cz01-pgpool01:1 2 3 4 5 6 7
sed \ -e "s/^wd_hostname =.*/wd_hostname = 'cz01-pgpool01.example.com'/" \ -e "s/^heartbeat_destination0 =.*/heartbeat_destination0 = 'cz01-pgpool02.example.com'/" \ -e "s/^#other_pgpool_hostname0 =.*/other_pgpool_hostname0 = 'cz01-pgpool02.example.com'/" \ -i /etc/pgpool-II-93/pgpool.conf service pgpool-II-93 start
Secondary pgpool server installation -
cz01-pgpool02:1 2 3 4 5 6 7
sed \ -e "s/^wd_hostname =.*/wd_hostname = 'cz01-pgpool02.example.com'/" \ -e "s/^heartbeat_destination0 =.*/heartbeat_destination0 = 'cz01-pgpool01.example.com'/" \ -e "s/^#other_pgpool_hostname0 =.*/other_pgpool_hostname0 = 'cz01-pgpool01.example.com'/" \ -i /etc/pgpool-II-93/pgpool.conf service pgpool-II-93 start
Now the all 4 server should be configured according the picture mentioned above. To be sure everything is working properly I decided to do various tests by stopping/starting the databases to see how all components are ready for outages. In the next pare there will be a lot of outputs of logs+commands which can be handy for troubleshooting in the future and which will test the proper configuration.
All the files modified and used for the configuration above can be found in the postgresql_pgpool_repmgr repository.
Testing
Check the cluster status
1
2
3
4
5
6
7
8
9
10
11
12
13
14
cz01-pgpool02 ~ # ssh postgres@cz01-psql02.example.com "/usr/pgsql-9.3/bin/repmgr --verbose -f /var/lib/pgsql/repmgr/repmgr.conf cluster show"
Warning: Permanently added 'cz01-psql02.example.com,10.32.243.148' (RSA) to the list of known hosts.
[2014-10-23 14:39:40] [INFO] repmgr connecting to database
Opening configuration file: /var/lib/pgsql/repmgr/repmgr.conf
Role | Connection String
* master | host=cz01-psql01.example.com user=repmgr dbname=repmgr
standby | host=cz01-psql02.example.com user=repmgr dbname=repmgr
cz01-pgpool02 ~ # pcp_node_count 1 localhost 9898 admin password123
2
cz01-pgpool02 ~ # pcp_node_info 1 localhost 9898 admin password123 0
cz01-psql01.example.com 5432 1 0.500000
cz01-pgpool02 ~ # pcp_node_info 1 localhost 9898 admin password123 1
cz01-psql02.example.com 5432 1 0.500000
Check if the replication is working
1
2
3
4
5
6
7
8
cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-pgpool-ha.example.com -w -c "create database mydb"
CREATE DATABASE
cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-pgpool-ha.example.com -w -l | grep mydb
mydb | admin | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-psql01.example.com -w -l | grep mydb
mydb | admin | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-psql02.example.com -w -l | grep mydb
mydb | admin | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
Check what is the primary pgpool (it has 2 IPs)
1
2
3
4
5
6
7
8
cz01-pgpool02 ~ # ssh -q cz01-pgpool01 "ip a s"
1: lo: mtu 16436 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
2: eth0: mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 00:50:56:91:0a:86 brd ff:ff:ff:ff:ff:ff
inet 10.32.243.157/25 brd 10.32.243.255 scope global eth0
inet 10.32.243.250/24 brd 10.32.243.255 scope global eth0:0
Stop the Master DB
1
2
3
4
5
cz01-pgpool02 ~ # date && ssh root@cz01-psql01.example.com "service postgresql-9.3 stop"
Thu Oct 23 14:43:09 CEST 2014
Warning: Permanently added 'cz01-psql01.example.com,10.32.243.147' (RSA) to the list of known hosts.
Stopping postgresql-9.3 service: [ OK ]
The pgpool is monitoring both master and slave databases if they are responding. If one of them is not responding the pgpool executes the failover_stream.sh file. This script is responsible for promoting the slave to be a new master. The result is that the read-only slave will become read/write master. In the diagram below I used the red colour to see the changes which were done when slave was promoted to master.
pgpool01 logs right after the master was stopped
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
cz01-pgpool01 ~ # cat /var/log/local0
...
2014-10-23T14:43:11.547651+02:00 cz01-pgpool01 pgpool[23301]: connect_inet_domain_socket: getsockopt() detected error: Connection refused
2014-10-23T14:43:11.547678+02:00 cz01-pgpool01 pgpool[23301]: make_persistent_db_connection: connection to cz01-psql01.example.com(5432) failed
2014-10-23T14:43:11.562642+02:00 cz01-pgpool01 pgpool[23301]: check_replication_time_lag: could not connect to DB node 0, check sr_check_user and sr_check_password
2014-10-23T14:43:12.327080+02:00 cz01-pgpool01 pgpool[23257]: connect_inet_domain_socket: getsockopt() detected error: Connection refused
2014-10-23T14:43:12.327127+02:00 cz01-pgpool01 pgpool[23257]: make_persistent_db_connection: connection to cz01-psql01.example.com(5432) failed
2014-10-23T14:43:12.332564+02:00 cz01-pgpool01 pgpool[23257]: connect_inet_domain_socket: getsockopt() detected error: Connection refused
2014-10-23T14:43:12.332661+02:00 cz01-pgpool01 pgpool[23257]: make_persistent_db_connection: connection to cz01-psql01.example.com(5432) failed
2014-10-23T14:43:12.332740+02:00 cz01-pgpool01 pgpool[23257]: health check failed. 0 th host cz01-psql01.example.com at port 5432 is down
2014-10-23T14:43:12.332822+02:00 cz01-pgpool01 pgpool[23257]: set 0 th backend down status
2014-10-23T14:43:12.332920+02:00 cz01-pgpool01 pgpool[23257]: wd_start_interlock: start interlocking
2014-10-23T14:43:12.348543+02:00 cz01-pgpool01 pgpool[23257]: wd_assume_lock_holder: become a new lock holder
2014-10-23T14:43:12.372682+02:00 cz01-pgpool01 pgpool[23264]: wd_send_response: WD_STAND_FOR_LOCK_HOLDER received but lock holder exists already
2014-10-23T14:43:13.369470+02:00 cz01-pgpool01 pgpool[23257]: starting degeneration. shutdown host cz01-psql01.example.com(5432)
2014-10-23T14:43:13.369521+02:00 cz01-pgpool01 pgpool[23257]: Restart all children
2014-10-23T14:43:13.369544+02:00 cz01-pgpool01 pgpool[23257]: execute command: /etc/pgpool-II-93/failover_stream.sh 0 cz01-psql02.example.com
2014-10-23T14:43:15.916452+02:00 cz01-pgpool01 pgpool[23257]: find_primary_node_repeatedly: waiting for finding a primary node
2014-10-23T14:43:15.933033+02:00 cz01-pgpool01 pgpool[23257]: find_primary_node: primary node id is 1
2014-10-23T14:43:15.937969+02:00 cz01-pgpool01 pgpool[23257]: wd_end_interlock: end interlocking
2014-10-23T14:43:16.963481+02:00 cz01-pgpool01 pgpool[23257]: failover: set new primary node: 1
2014-10-23T14:43:16.963534+02:00 cz01-pgpool01 pgpool[23257]: failover: set new master node: 1
2014-10-23T14:43:17.051441+02:00 cz01-pgpool01 pgpool[23301]: worker process received restart request
2014-10-23T14:43:17.055720+02:00 cz01-pgpool01 pgpool[23257]: failover done. shutdown host cz01-psql01.example.com(5432)
2014-10-23T14:43:18.059487+02:00 cz01-pgpool01 pgpool[23300]: pcp child process received restart request
2014-10-23T14:43:18.064463+02:00 cz01-pgpool01 pgpool[23257]: PCP child 23300 exits with status 256 in failover()
2014-10-23T14:43:18.064493+02:00 cz01-pgpool01 pgpool[23257]: fork a new PCP child pid 26164 in failover()
2014-10-23T14:43:18.064499+02:00 cz01-pgpool01 pgpool[23257]: worker child 23301 exits with status 256
2014-10-23T14:43:18.065907+02:00 cz01-pgpool01 pgpool[23257]: fork a new worker child pid 26165
psql01 (masted db) logs right after the master was stopped
1
2
3
4
5
6
7
8
9
10
11
12
13
14
cz01-psql01 / # cat /var/lib/pgsql/9.3/data/pg_log/postgresql-Thu.log
...
2014-10-23 14:43:10 CEST [18254]: [6-1] user=,db= LOG: received fast shutdown request
2014-10-23 14:43:10 CEST [18254]: [7-1] user=,db= LOG: aborting any active transactions
2014-10-23 14:43:10 CEST [24392]: [13-1] user=admin,db=postgres FATAL: terminating connection due to administrator command
2014-10-23 14:43:10 CEST [24392]: [14-1] user=admin,db=postgres LOG: disconnection: session time: 0:03:01.366 user=admin database=postgres host=10.32.243.157 port=53814
2014-10-23 14:43:10 CEST [24386]: [13-1] user=admin,db=postgres FATAL: terminating connection due to administrator command
2014-10-23 14:43:10 CEST [24386]: [14-1] user=admin,db=postgres LOG: disconnection: session time: 0:03:08.732 user=admin database=postgres host=10.32.243.157 port=53812
2014-10-23 14:43:10 CEST [18266]: [2-1] user=,db= LOG: autovacuum launcher shutting down
2014-10-23 14:43:10 CEST [18263]: [27-1] user=,db= LOG: shutting down
2014-10-23 14:43:10 CEST [18263]: [28-1] user=,db= LOG: checkpoint starting: shutdown immediate
2014-10-23 14:43:10 CEST [18263]: [29-1] user=,db= LOG: checkpoint complete: wrote 1 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.002 s, total=0.479 s; sync files=1, longest=0.002 s, average=0.002 s
2014-10-23 14:43:10 CEST [18263]: [30-1] user=,db= LOG: database system is shut down
2014-10-23 14:43:11 CEST [18438]: [3-1] user=repmgr,db=[unknown] LOG: disconnection: session time: 0:48:37.268 user=repmgr database= host=10.32.243.148 port=50909
psql02 (slave db) logs right after the master was stopped
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
cz01-psql02 / # cat /var/lib/pgsql/9.3/data/pg_log/postgresql-Thu.log
...
2014-10-23 14:43:11 CEST [18031]: [2-1] user=,db= LOG: replication terminated by primary server
2014-10-23 14:43:11 CEST [18031]: [3-1] user=,db= DETAIL: End of WAL reached on timeline 1 at 0/61000090.
2014-10-23 14:43:11 CEST [18031]: [4-1] user=,db= FATAL: could not send end-of-streaming message to primary: no COPY in progress
2014-10-23 14:43:11 CEST [18030]: [5-1] user=,db= LOG: record with zero length at 0/61000090
2014-10-23 14:43:11 CEST [24120]: [1-1] user=[unknown],db=[unknown] LOG: connection received: host=10.32.243.157 port=52991
2014-10-23 14:43:11 CEST [24120]: [2-1] user=admin,db=postgres LOG: connection authorized: user=admin database=postgres
2014-10-23 14:43:11 CEST [24120]: [3-1] user=admin,db=postgres LOG: disconnection: session time: 0:00:00.010 user=admin database=postgres host=10.32.243.157 port=52991
2014-10-23 14:43:13 CEST [24121]: [1-1] user=[unknown],db=[unknown] LOG: connection received: host=10.32.243.158 port=53841
2014-10-23 14:43:13 CEST [24121]: [2-1] user=admin,db=postgres LOG: connection authorized: user=admin database=postgres
2014-10-23 14:43:13 CEST [24121]: [3-1] user=admin,db=postgres LOG: disconnection: session time: 0:00:00.009 user=admin database=postgres host=10.32.243.158 port=53841
2014-10-23 14:43:13 CEST [23435]: [9-1] user=admin,db=postgres LOG: disconnection: session time: 0:03:11.742 user=admin database=postgres host=10.32.243.157 port=52915
2014-10-23 14:43:13 CEST [23438]: [5-1] user=admin,db=postgres LOG: disconnection: session time: 0:03:04.381 user=admin database=postgres host=10.32.243.157 port=52917
2014-10-23 14:43:13 CEST [24129]: [1-1] user=[unknown],db=[unknown] LOG: connection received: host=10.32.243.148 port=54332
2014-10-23 14:43:13 CEST [24129]: [2-1] user=repmgr,db=repmgr LOG: connection authorized: user=repmgr database=repmgr
2014-10-23 14:43:13 CEST [24129]: [3-1] user=repmgr,db=repmgr LOG: statement: WITH pg_version(ver) AS (SELECT split_part(version(), ' ', 2)) SELECT split_part(ver, '.', 1), split_part(ver, '.', 2) FROM pg_version
2014-10-23 14:43:13 CEST [24129]: [4-1] user=repmgr,db=repmgr LOG: duration: 2.850 ms
2014-10-23 14:43:13 CEST [24129]: [5-1] user=repmgr,db=repmgr LOG: statement: SELECT pg_is_in_recovery()
2014-10-23 14:43:13 CEST [24129]: [6-1] user=repmgr,db=repmgr LOG: duration: 0.339 ms
2014-10-23 14:43:13 CEST [24129]: [7-1] user=repmgr,db=repmgr LOG: statement: SELECT id, conninfo FROM "repmgr_pgsql_cluster".repl_nodes WHERE cluster = 'pgsql_cluster' and not witness
2014-10-23 14:43:13 CEST [24129]: [8-1] user=repmgr,db=repmgr LOG: duration: 2.634 ms
2014-10-23 14:43:13 CEST [24130]: [1-1] user=[unknown],db=[unknown] LOG: connection received: host=10.32.243.148 port=54334
2014-10-23 14:43:13 CEST [24130]: [2-1] user=repmgr,db=repmgr LOG: connection authorized: user=repmgr database=repmgr
2014-10-23 14:43:13 CEST [24130]: [3-1] user=repmgr,db=repmgr LOG: statement: SELECT pg_is_in_recovery()
2014-10-23 14:43:13 CEST [24130]: [4-1] user=repmgr,db=repmgr LOG: duration: 1.347 ms
2014-10-23 14:43:13 CEST [24129]: [9-1] user=repmgr,db=repmgr LOG: statement: SELECT setting FROM pg_settings WHERE name = 'data_directory'
2014-10-23 14:43:13 CEST [24130]: [5-1] user=repmgr,db=repmgr LOG: disconnection: session time: 0:00:00.024 user=repmgr database=repmgr host=10.32.243.148 port=54334
2014-10-23 14:43:13 CEST [24129]: [10-1] user=repmgr,db=repmgr LOG: duration: 4.954 ms
2014-10-23 14:43:13 CEST [24129]: [11-1] user=repmgr,db=repmgr LOG: disconnection: session time: 0:00:00.067 user=repmgr database=repmgr host=10.32.243.148 port=54332
2014-10-23 14:43:13 CEST [18021]: [6-1] user=,db= LOG: received fast shutdown request
2014-10-23 14:43:13 CEST [18021]: [7-1] user=,db= LOG: aborting any active transactions
2014-10-23 14:43:13 CEST [18032]: [28-1] user=,db= LOG: shutting down
2014-10-23 14:43:13 CEST [18032]: [29-1] user=,db= LOG: restartpoint starting: shutdown immediate
2014-10-23 14:43:13 CEST [18032]: [30-1] user=,db= LOG: restartpoint complete: wrote 6 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.002 s, total=0.008 s; sync files=6, longest=0.000 s, average=0.000 s
2014-10-23 14:43:13 CEST [18032]: [31-1] user=,db= LOG: recovery restart point at 0/61000028
2014-10-23 14:43:13 CEST [18032]: [32-1] user=,db= DETAIL: last completed transaction was at log time 2014-10-23 14:40:08.829957+02
2014-10-23 14:43:13 CEST [18032]: [33-1] user=,db= LOG: database system is shut down
2014-10-23 14:43:14 CEST [24141]: [1-1] user=,db= LOG: database system was shut down in recovery at 2014-10-23 14:43:13 CEST
2014-10-23 14:43:14 CEST [24141]: [2-1] user=,db= LOG: database system was not properly shut down; automatic recovery in progress
2014-10-23 14:43:14 CEST [24141]: [3-1] user=,db= LOG: consistent recovery state reached at 0/61000090
2014-10-23 14:43:14 CEST [24141]: [4-1] user=,db= LOG: record with zero length at 0/61000090
2014-10-23 14:43:14 CEST [24141]: [5-1] user=,db= LOG: redo is not required
2014-10-23 14:43:14 CEST [24141]: [6-1] user=,db= LOG: checkpoint starting: end-of-recovery immediate
2014-10-23 14:43:14 CEST [24141]: [7-1] user=,db= LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.000 s, total=0.005 s; sync files=0, longest=0.000 s, average=0.000 s
2014-10-23 14:43:14 CEST [24145]: [1-1] user=,db= LOG: autovacuum launcher started
2014-10-23 14:43:14 CEST [24133]: [5-1] user=,db= LOG: database system is ready to accept connections
2014-10-23 14:43:15 CEST [24148]: [1-1] user=[unknown],db=[unknown] LOG: connection received: host=[local]
2014-10-23 14:43:15 CEST [24148]: [2-1] user=postgres,db=postgres LOG: connection authorized: user=postgres database=postgres
2014-10-23 14:43:15 CEST [24148]: [3-1] user=postgres,db=postgres LOG: disconnection: session time: 0:00:00.019 user=postgres database=postgres host=[local]
2014-10-23 14:43:15 CEST [24149]: [1-1] user=[unknown],db=[unknown] LOG: connection received: host=10.32.243.148 port=54335
2014-10-23 14:43:15 CEST [24149]: [2-1] user=repmgr,db=repmgr LOG: connection authorized: user=repmgr database=repmgr
2014-10-23 14:43:15 CEST [24149]: [3-1] user=repmgr,db=repmgr LOG: statement: SELECT pg_is_in_recovery()
2014-10-23 14:43:15 CEST [24149]: [4-1] user=repmgr,db=repmgr LOG: duration: 2.252 ms
2014-10-23 14:43:15 CEST [24149]: [5-1] user=repmgr,db=repmgr LOG: disconnection: session time: 0:00:00.030 user=repmgr database=repmgr host=10.32.243.148 port=54335
failover_stream.sh output log from primary pgpool
1
2
3
4
5
6
7
cz01-pgpool01 / # cat /tmp/failover_stream.sh.log
Thu Oct 23 14:43:13 CEST 2014
Failed node: 0
+ /usr/bin/ssh -T -l postgres cz01-psql02.example.com '/usr/pgsql-9.3/bin/repmgr -f /var/lib/pgsql/repmgr/repmgr.conf standby promote 2>/dev/null 1>/dev/null
Warning: Permanently added 'cz01-psql02.example.com,10.32.243.148' (RSA) to the list of known hosts.
+ exit 0
Check the cluster status after master failure
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
cz01-pgpool02 ~ # ssh postgres@cz01-psql02.example.com "/usr/pgsql-9.3/bin/repmgr --verbose -f /var/lib/pgsql/repmgr/repmgr.conf cluster show"
Warning: Permanently added 'cz01-psql02.example.com,10.32.243.148' (RSA) to the list of known hosts.
[2014-10-23 14:49:07] [INFO] repmgr connecting to database
[2014-10-23 14:49:07] [ERROR] Connection to database failed: could not connect to server: Connection refused
Is the server running on host "cz01-psql01.example.com" (10.32.243.147) and accepting
TCP/IP connections on port 5432?
Opening configuration file: /var/lib/pgsql/repmgr/repmgr.conf
Role | Connection String
FAILED | host=cz01-psql01.example.com user=repmgr dbname=repmgr
* master | host=cz01-psql02.example.com user=repmgr dbname=repmgr
cz01-pgpool02 ~ # pcp_node_info 1 localhost 9898 admin password123 0
cz01-psql01.example.com 5432 3 0.500000
cz01-pgpool02 ~ # pcp_node_info 1 localhost 9898 admin password123 1
cz01-psql02.example.com 5432 1 0.500000
Check if everything is working
1
2
3
4
cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-pgpool-ha.example.com -w -c "drop database mydb"
DROP DATABASE
cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-pgpool-ha.example.com -w -l | grep mydb
cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-psql02.example.com -w -l | grep mydb
Make the “old master” to be a “new slave” cz01-psql01
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
cz01-pgpool02 ~ # ssh cz01-psql01.example.com 'service postgresql-9.3 stop; su - postgres -c "repmgr -D /var/lib/pgsql/9.3/data -d repmgr -p 5432 -U repmgr -R postgres --verbose --force standby clone cz01-psql02.example.com"; service postgresql-9.3 start;'
Warning: Permanently added 'cz01-psql01.example.com,10.32.243.147' (RSA) to the list of known hosts.
Stopping postgresql-9.3 service: [ OK ]
[2014-10-23 14:52:08] [ERROR] Did not find the configuration file './repmgr.conf', continuing
[2014-10-23 14:52:08] [NOTICE] repmgr Destination directory /var/lib/pgsql/9.3/data provided, try to clone everything in it.
[2014-10-23 14:52:08] [INFO] repmgr connecting to master database
[2014-10-23 14:52:08] [INFO] repmgr connected to master, checking its state
[2014-10-23 14:52:09] [INFO] Successfully connected to primary. Current installation size is 188 MB
Warning: Permanently added 'cz01-psql02.example.com,10.32.243.148' (RSA) to the list of known hosts.
[2014-10-23 14:52:09] [NOTICE] Starting backup...
[2014-10-23 14:52:09] [WARNING] directory "/var/lib/pgsql/9.3/data" exists but is not empty
[2014-10-23 14:52:09] [INFO] standby clone: master control file '/var/lib/pgsql/9.3/data/global/pg_control'
[2014-10-23 14:52:09] [INFO] standby clone: master control file '/var/lib/pgsql/9.3/data/global/pg_control'
[2014-10-23 14:52:09] [INFO] rsync command line: 'rsync --archive --checksum --compress --progress --rsh=ssh --delete postgres@cz01-psql02.example.com:/var/lib/pgsql/9.3/data/global/pg_control /var/lib/pgsql/9.3/data/global'
Warning: Permanently added 'cz01-psql02.example.com,10.32.243.148' (RSA) to the list of known hosts.
receiving incremental file list
pg_control
8192 100% 7.81MB/s 0:00:00 (xfer#1, to-check=0/1)
sent 102 bytes received 236 bytes 676.00 bytes/sec
total size is 8192 speedup is 24.24
[2014-10-23 14:52:09] [INFO] standby clone: master data directory '/var/lib/pgsql/9.3/data'
[2014-10-23 14:52:09] [INFO] rsync command line: 'rsync --archive --checksum --compress --progress --rsh=ssh --delete --exclude=pg_xlog* --exclude=pg_log* --exclude=pg_control --exclude=*.pid postgres@cz01-psql02.example.com:/var/lib/pgsql/9.3/data/* /var/lib/pgsql/9.3/data'
Warning: Permanently added 'cz01-psql02.example.com,10.32.243.148' (RSA) to the list of known hosts.
receiving incremental file list
deleting base/16528/pg_filenode.map
deleting base/16528/PG_VERSION
deleting base/16528/12890
deleting base/16528/12888
...
...
...
pg_stat_tmp/db_16413.stat
6089 100% 849.47kB/s 0:00:00 (xfer#16, to-check=2/1542)
pg_stat_tmp/global.stat
1026 100% 125.24kB/s 0:00:00 (xfer#17, to-check=1/1542)
sent 3810 bytes received 94096 bytes 39162.40 bytes/sec
total size is 198017139 speedup is 2022.52
[2014-10-23 14:52:11] [INFO] standby clone: master config file '/var/lib/pgsql/9.3/data/postgresql.conf'
[2014-10-23 14:52:11] [INFO] rsync command line: 'rsync --archive --checksum --compress --progress --rsh=ssh --delete postgres@cz01-psql02.example.com:/var/lib/pgsql/9.3/data/postgresql.conf /var/lib/pgsql/9.3/data'
Warning: Permanently added 'cz01-psql02.example.com,10.32.243.148' (RSA) to the list of known hosts.
receiving incremental file list
sent 11 bytes received 80 bytes 60.67 bytes/sec
total size is 20561 speedup is 225.95
[2014-10-23 14:52:12] [INFO] standby clone: master hba file '/var/lib/pgsql/9.3/data/pg_hba.conf'
[2014-10-23 14:52:12] [INFO] rsync command line: 'rsync --archive --checksum --compress --progress --rsh=ssh --delete postgres@cz01-psql02.example.com:/var/lib/pgsql/9.3/data/pg_hba.conf /var/lib/pgsql/9.3/data'
Warning: Permanently added 'cz01-psql02.example.com,10.32.243.148' (RSA) to the list of known hosts.
receiving incremental file list
sent 11 bytes received 76 bytes 174.00 bytes/sec
total size is 4812 speedup is 55.31
[2014-10-23 14:52:12] [INFO] standby clone: master ident file '/var/lib/pgsql/9.3/data/pg_ident.conf'
[2014-10-23 14:52:12] [INFO] rsync command line: 'rsync --archive --checksum --compress --progress --rsh=ssh --delete postgres@cz01-psql02.example.com:/var/lib/pgsql/9.3/data/pg_ident.conf /var/lib/pgsql/9.3/data'
Warning: Permanently added 'cz01-psql02.example.com,10.32.243.148' (RSA) to the list of known hosts.
receiving incremental file list
sent 11 bytes received 78 bytes 178.00 bytes/sec
total size is 1636 speedup is 18.38
[2014-10-23 14:52:12] [NOTICE] Finishing backup...
NOTICE: pg_stop_backup complete, all required WAL segments have been archived
[2014-10-23 14:52:13] [INFO] repmgr requires primary to keep WAL files 000000010000000000000062 until at least 000000010000000000000062
[2014-10-23 14:52:13] [NOTICE] repmgr standby clone complete
[2014-10-23 14:52:13] [NOTICE] HINT: You can now start your postgresql server
[2014-10-23 14:52:13] [NOTICE] for example : pg_ctl -D /var/lib/pgsql/9.3/data start
Opening configuration file: ./repmgr.conf
Starting postgresql-9.3 service: [ OK ]
Once the slave has been promoted to master, the DB administrator should check exactly what happened to the failed master database. Once the problem has been analyzed, the administrator should not simply restart the “failed” database but instead reconfigure it as a slave. There is currently a master DB; therefore, every other database needs to be a slave. The command above sets up a new slave with data replication from the running master. Pgpool also needs to be notified that the “new” slave is up and running and ready for read-only queries (the commands will follow). The diagram below shows the current status, and the red color indicates the additional changes.
Logs right after the new slave cz01-psql01 was configured+started
1
2
3
4
5
6
7
8
cz01-psql01 / # cat /var/lib/pgsql/9.3/data/pg_log/postgresql-Thu.log
...
2014-10-23 14:52:14 CEST [26707]: [1-1] user=,db= LOG: database system was interrupted; last known up at 2014-10-23 14:52:09 CEST
2014-10-23 14:52:14 CEST [26707]: [2-1] user=,db= LOG: entering standby mode
2014-10-23 14:52:14 CEST [26708]: [1-1] user=,db= LOG: started streaming WAL from primary at 0/62000000 on timeline 1
2014-10-23 14:52:14 CEST [26707]: [3-1] user=,db= LOG: redo starts at 0/62000028
2014-10-23 14:52:14 CEST [26707]: [4-1] user=,db= LOG: consistent recovery state reached at 0/620000F0
2014-10-23 14:52:14 CEST [26698]: [5-1] user=,db= LOG: database system is ready to accept read only connections
Logs from cz01-psql02 after the new slave was configured
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
cz01-psql02 / # cat /var/lib/pgsql/9.3/data/pg_log/postgresql-Thu.log
...
2014-10-23 14:52:08 CEST [25481]: [1-1] user=[unknown],db=[unknown] LOG: connection received: host=10.32.243.147 port=52477
2014-10-23 14:52:08 CEST [25481]: [2-1] user=repmgr,db=repmgr LOG: connection authorized: user=repmgr database=repmgr
2014-10-23 14:52:08 CEST [25481]: [3-1] user=repmgr,db=repmgr LOG: statement: WITH pg_version(ver) AS (SELECT split_part(version(), ' ', 2)) SELECT split_part(ver, '.', 1), split_part(ver, '.', 2) FROM pg_version
2014-10-23 14:52:08 CEST [25481]: [4-1] user=repmgr,db=repmgr LOG: duration: 2.790 ms
2014-10-23 14:52:08 CEST [25481]: [5-1] user=repmgr,db=repmgr LOG: statement: SELECT pg_is_in_recovery()
2014-10-23 14:52:08 CEST [25481]: [6-1] user=repmgr,db=repmgr LOG: duration: 0.451 ms
2014-10-23 14:52:08 CEST [25481]: [7-1] user=repmgr,db=repmgr LOG: statement: SELECT true FROM pg_settings WHERE name = 'wal_level' AND setting = 'hot_standby'
2014-10-23 14:52:09 CEST [25481]: [8-1] user=repmgr,db=repmgr LOG: duration: 7.175 ms
2014-10-23 14:52:09 CEST [25481]: [9-1] user=repmgr,db=repmgr LOG: statement: SELECT true FROM pg_settings WHERE name = 'wal_keep_segments' AND setting::integer >= '5000'::integer
2014-10-23 14:52:09 CEST [25481]: [10-1] user=repmgr,db=repmgr LOG: duration: 3.971 ms
2014-10-23 14:52:09 CEST [25481]: [11-1] user=repmgr,db=repmgr LOG: statement: SELECT true FROM pg_settings WHERE name = 'archive_mode' AND setting = 'on'
2014-10-23 14:52:09 CEST [25481]: [12-1] user=repmgr,db=repmgr LOG: duration: 3.191 ms
2014-10-23 14:52:09 CEST [25481]: [13-1] user=repmgr,db=repmgr LOG: statement: SELECT true FROM pg_settings WHERE name = 'hot_standby' AND setting = 'on'
2014-10-23 14:52:09 CEST [25481]: [14-1] user=repmgr,db=repmgr LOG: duration: 3.147 ms
2014-10-23 14:52:09 CEST [25481]: [15-1] user=repmgr,db=repmgr LOG: statement: SELECT pg_tablespace_location(oid) spclocation FROM pg_tablespace WHERE spcname NOT IN ('pg_default', 'pg_global')
2014-10-23 14:52:09 CEST [25481]: [16-1] user=repmgr,db=repmgr LOG: duration: 2.541 ms
2014-10-23 14:52:09 CEST [25481]: [17-1] user=repmgr,db=repmgr LOG: statement: SELECT name, setting FROM pg_settings WHERE name IN ('data_directory', 'config_file', 'hba_file', 'ident_file', 'stats_temp_directory')
2014-10-23 14:52:09 CEST [25481]: [18-1] user=repmgr,db=repmgr LOG: duration: 3.842 ms
2014-10-23 14:52:09 CEST [25481]: [19-1] user=repmgr,db=repmgr LOG: statement: SELECT pg_size_pretty(SUM(pg_database_size(oid))::bigint) FROM pg_database
2014-10-23 14:52:09 CEST [25481]: [20-1] user=repmgr,db=repmgr LOG: duration: 16.913 ms
2014-10-23 14:52:09 CEST [25481]: [21-1] user=repmgr,db=repmgr LOG: statement: SET synchronous_commit TO OFF
2014-10-23 14:52:09 CEST [25481]: [22-1] user=repmgr,db=repmgr LOG: duration: 0.282 ms
2014-10-23 14:52:09 CEST [25481]: [23-1] user=repmgr,db=repmgr LOG: statement: SELECT pg_xlogfile_name(pg_start_backup('repmgr_standby_clone_1414068729'))
2014-10-23 14:52:09 CEST [24142]: [3-1] user=,db= LOG: checkpoint starting: force wait
2014-10-23 14:52:09 CEST [25489]: [1-1] user=[unknown],db=[unknown] LOG: connection received: host=10.32.243.158 port=53963
2014-10-23 14:52:09 CEST [25489]: [2-1] user=admin,db=postgres LOG: connection authorized: user=admin database=postgres
2014-10-23 14:52:09 CEST [25489]: [3-1] user=admin,db=postgres LOG: disconnection: session time: 0:00:00.008 user=admin database=postgres host=10.32.243.158 port=53963
2014-10-23 14:52:09 CEST [25490]: [1-1] user=[unknown],db=[unknown] LOG: connection received: host=10.32.243.157 port=53120
2014-10-23 14:52:09 CEST [25490]: [2-1] user=admin,db=template1 LOG: connection authorized: user=admin database=template1
2014-10-23 14:52:09 CEST [25490]: [3-1] user=admin,db=template1 LOG: disconnection: session time: 0:00:00.008 user=admin database=template1 host=10.32.243.157 port=53120
2014-10-23 14:52:09 CEST [24142]: [4-1] user=,db= LOG: checkpoint complete: wrote 1 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.000 s, total=0.391 s; sync files=1, longest=0.000 s, average=0.000 s
2014-10-23 14:52:09 CEST [25481]: [24-1] user=repmgr,db=repmgr LOG: duration: 403.777 ms
2014-10-23 14:52:11 CEST [25481]: [25-1] user=repmgr,db=repmgr LOG: statement: SELECT pg_tablespace_location(oid) spclocation FROM pg_tablespace WHERE spcname NOT IN ('pg_default', 'pg_global')
2014-10-23 14:52:11 CEST [25481]: [26-1] user=repmgr,db=repmgr LOG: duration: 0.514 ms
2014-10-23 14:52:12 CEST [25481]: [27-1] user=repmgr,db=repmgr LOG: statement: SELECT pg_xlogfile_name(pg_stop_backup())
2014-10-23 14:52:13 CEST [25481]: [28-1] user=repmgr,db=repmgr LOG: duration: 1005.802 ms
2014-10-23 14:52:13 CEST [25481]: [29-1] user=repmgr,db=repmgr LOG: disconnection: session time: 0:00:04.406 user=repmgr database=repmgr host=10.32.243.147 port=52477
2014-10-23 14:52:14 CEST [25524]: [1-1] user=[unknown],db=[unknown] LOG: connection received: host=10.32.243.147 port=52484
2014-10-23 14:52:14 CEST [25524]: [2-1] user=repmgr,db=[unknown] LOG: replication connection authorized: user=repmgr
No pgpool changes after the new slave cz01-psql01 was configured+started
1
cz01-pgpool01 / # tail -f /var/log/local0
Check the cluster status after failover
1
2
3
4
5
6
7
8
9
10
11
12
cz01-pgpool02 ~ # ssh postgres@cz01-psql02.example.com "/usr/pgsql-9.3/bin/repmgr --verbose -f /var/lib/pgsql/repmgr/repmgr.conf cluster show"
Warning: Permanently added 'cz01-psql02.example.com,10.32.243.148' (RSA) to the list of known hosts.
[2014-10-23 14:58:23] [INFO] repmgr connecting to database
Opening configuration file: /var/lib/pgsql/repmgr/repmgr.conf
Role | Connection String
standby | host=cz01-psql01.example.com user=repmgr dbname=repmgr
* master | host=cz01-psql02.example.com user=repmgr dbname=repmgr
cz01-pgpool02 ~ # pcp_node_info 1 localhost 9898 admin password123 0
cz01-psql01.example.com 5432 3 0.500000
cz01-pgpool02 ~ # pcp_node_info 1 localhost 9898 admin password123 1
cz01-psql02.example.com 5432 1 0.500000
Check if everything is working after reinitialization
1
2
3
4
5
6
7
8
cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-pgpool-ha.example.com -w -c "create database mydb"
CREATE DATABASE
cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-pgpool-ha.example.com -w -l | grep mydb
mydb | admin | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-psql01.example.com -w -l | grep mydb
mydb | admin | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-psql02.example.com -w -l | grep mydb
mydb | admin | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
Reinitialize the slave in pgpool to be ready for read only queries
1
2
cz01-pgpool02 ~ # pcp_detach_node 0 localhost 9898 admin password123 0
cz01-pgpool02 ~ # pcp_attach_node 0 localhost 9898 admin password123 0
Logfile right after the slave was enabled for RO queries
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
cz01-pgpool01 / # cat /var/log/local0
...
2014-10-23T15:55:38.440946+02:00 cz01-pgpool01 pgpool[23264]: send_failback_request: fail back 0 th node request from pid 23264
2014-10-23T15:55:38.442034+02:00 cz01-pgpool01 pgpool[23257]: wd_start_interlock: start interlocking
2014-10-23T15:55:38.455732+02:00 cz01-pgpool01 pgpool[23257]: wd_assume_lock_holder: become a new lock holder
2014-10-23T15:55:38.459908+02:00 cz01-pgpool01 pgpool[23264]: wd_send_response: WD_STAND_FOR_LOCK_HOLDER received but lock holder exists already
2014-10-23T15:55:38.963952+02:00 cz01-pgpool01 pgpool[23257]: starting fail back. reconnect host cz01-psql01.example.com(5432)
2014-10-23T15:55:38.970549+02:00 cz01-pgpool01 pgpool[23257]: Do not restart children because we are failbacking node id 0 hostcz01-psql01.example.com port:5432 and we are in streaming replication mode
2014-10-23T15:55:38.974760+02:00 cz01-pgpool01 pgpool[23257]: find_primary_node_repeatedly: waiting for finding a primary node
2014-10-23T15:55:39.018890+02:00 cz01-pgpool01 pgpool[23257]: find_primary_node: primary node id is 1
2014-10-23T15:55:39.024830+02:00 cz01-pgpool01 pgpool[23257]: wd_end_interlock: end interlocking
2014-10-23T15:55:40.048487+02:00 cz01-pgpool01 pgpool[23257]: failover: set new primary node: 1
2014-10-23T15:55:40.048514+02:00 cz01-pgpool01 pgpool[23257]: failover: set new master node: 0
2014-10-23T15:55:40.048520+02:00 cz01-pgpool01 pgpool[23257]: failback done. reconnect host cz01-psql01.example.com(5432)
2014-10-23T15:55:40.050908+02:00 cz01-pgpool01 pgpool[26165]: worker process received restart request
2014-10-23T15:55:41.051525+02:00 cz01-pgpool01 pgpool[26164]: pcp child process received restart request
2014-10-23T15:55:41.056496+02:00 cz01-pgpool01 pgpool[23257]: PCP child 26164 exits with status 256 in failover()
2014-10-23T15:55:41.056543+02:00 cz01-pgpool01 pgpool[23257]: fork a new PCP child pid 1392 in failover()
2014-10-23T15:55:41.056565+02:00 cz01-pgpool01 pgpool[23257]: worker child 26165 exits with status 256
2014-10-23T15:55:41.057839+02:00 cz01-pgpool01 pgpool[23257]: fork a new worker child pid 1393
Check pgpool status the slave should have a good value now
1
2
3
4
cz01-pgpool02 ~ # pcp_node_info 1 localhost 9898 admin password123 0
cz01-psql01.example.com 5432 1 0.500000
cz01-pgpool02 ~ # pcp_node_info 1 localhost 9898 admin password123 1
cz01-psql02.example.com 5432 1 0.500000
Check if everything is working fine
1
2
3
4
5
cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-pgpool-ha.example.com -w -c "drop database mydb"
DROP DATABASE
cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-pgpool-ha.example.com -w -l | grep mydb
cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-psql01.example.com -w -l | grep mydb
cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-psql02.example.com -w -l | grep mydb
Stop the master DB (original slave) - new master should be promoted
1
2
3
4
cz01-pgpool02 ~ # ssh cz01-psql02.example.com 'service postgresql-9.3 stop'
Warning: Permanently added 'cz01-psql02.example.com,10.32.243.148' (RSA) to the list of known hosts.
Stopping postgresql-9.3 service: [ OK ]
Here is another example what may happen if original slave server, promoted to new master fails. Again the slave is automatically promoted by pgpool to the new master and “original master” (later slave) is master again. Changes are using blue color in the diagram below.
Logs right after the master (original slave) was stopped
1
2
3
4
5
6
7
8
9
10
11
12
13
14
cz01-psql02 / # cat /var/lib/pgsql/9.3/data/pg_log/postgresql-Thu.log
...
2014-10-23 16:01:30 CEST [24133]: [6-1] user=,db= LOG: received fast shutdown request
2014-10-23 16:01:30 CEST [24133]: [7-1] user=,db= LOG: aborting any active transactions
2014-10-23 16:01:30 CEST [1991]: [7-1] user=admin,db=postgres FATAL: terminating connection due to administrator command
2014-10-23 16:01:30 CEST [1991]: [8-1] user=admin,db=postgres LOG: disconnection: session time: 0:02:13.805 user=admin database=postgres host=10.32.243.157 port=53988
2014-10-23 16:01:30 CEST [1986]: [7-1] user=admin,db=postgres FATAL: terminating connection due to administrator command
2014-10-23 16:01:30 CEST [1986]: [8-1] user=admin,db=postgres LOG: disconnection: session time: 0:02:19.271 user=admin database=postgres host=10.32.243.157 port=53982
2014-10-23 16:01:30 CEST [24145]: [2-1] user=,db= LOG: autovacuum launcher shutting down
2014-10-23 16:01:30 CEST [24142]: [37-1] user=,db= LOG: shutting down
2014-10-23 16:01:30 CEST [24142]: [38-1] user=,db= LOG: checkpoint starting: shutdown immediate
2014-10-23 16:01:30 CEST [24142]: [39-1] user=,db= LOG: checkpoint complete: wrote 1 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.008 s, total=0.449 s; sync files=1, longest=0.008 s, average=0.008 s
2014-10-23 16:01:30 CEST [24142]: [40-1] user=,db= LOG: database system is shut down
2014-10-23 16:01:31 CEST [25524]: [3-1] user=repmgr,db=[unknown] LOG: disconnection: session time: 1:09:17.350 user=repmgr database= host=10.32.243.147 port=52484
Logs from pgpool01 after the master was stopped
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
cz01-pgpool01 / # cat /var/log/local0
...
2014-10-23T16:01:33.199817+02:00 cz01-pgpool01 pgpool[23257]: connect_inet_domain_socket: getsockopt() detected error: Connection refused
2014-10-23T16:01:33.200803+02:00 cz01-pgpool01 pgpool[23257]: make_persistent_db_connection: connection to cz01-psql02.example.com(5432) failed
2014-10-23T16:01:33.200989+02:00 cz01-pgpool01 pgpool[23257]: health check failed. 1 th host cz01-psql02.example.com at port 5432 is down
2014-10-23T16:01:33.201148+02:00 cz01-pgpool01 pgpool[23257]: set 1 th backend down status
2014-10-23T16:01:33.201280+02:00 cz01-pgpool01 pgpool[23257]: wd_start_interlock: start interlocking
2014-10-23T16:01:33.207742+02:00 cz01-pgpool01 pgpool[23264]: wd_send_response: WD_STAND_FOR_LOCK_HOLDER received it
2014-10-23T16:01:33.213943+02:00 cz01-pgpool01 pgpool[23264]: wd_send_response: failover request from other pgpool is canceled because it's while switching
2014-10-23T16:01:33.229107+02:00 cz01-pgpool01 pgpool[1393]: connect_inet_domain_socket: getsockopt() detected error: Connection refused
2014-10-23T16:01:33.229133+02:00 cz01-pgpool01 pgpool[1393]: make_persistent_db_connection: connection to cz01-psql02.example.com(5432) failed
2014-10-23T16:01:33.233566+02:00 cz01-pgpool01 pgpool[1393]: check_replication_time_lag: could not connect to DB node 1, check sr_check_user and sr_check_password
2014-10-23T16:01:33.716568+02:00 cz01-pgpool01 pgpool[23257]: starting degeneration. shutdown host cz01-psql02.example.com(5432)
2014-10-23T16:01:33.716597+02:00 cz01-pgpool01 pgpool[23257]: Restart all children
2014-10-23T16:01:36.720365+02:00 cz01-pgpool01 pgpool[23257]: find_primary_node_repeatedly: waiting for finding a primary node
2014-10-23T16:01:36.735887+02:00 cz01-pgpool01 pgpool[23257]: find_primary_node: primary node id is 0
2014-10-23T16:01:36.736038+02:00 cz01-pgpool01 pgpool[23257]: wd_end_interlock: end interlocking
2014-10-23T16:01:37.243499+02:00 cz01-pgpool01 pgpool[23257]: failover: set new primary node: 0
2014-10-23T16:01:37.243536+02:00 cz01-pgpool01 pgpool[23257]: failover: set new master node: 0
2014-10-23T16:01:37.332464+02:00 cz01-pgpool01 pgpool[1393]: worker process received restart request
2014-10-23T16:01:37.335636+02:00 cz01-pgpool01 pgpool[23257]: failover done. shutdown host cz01-psql02.example.com(5432)
2014-10-23T16:01:38.338656+02:00 cz01-pgpool01 pgpool[1392]: pcp child process received restart request
2014-10-23T16:01:38.345478+02:00 cz01-pgpool01 pgpool[23257]: PCP child 1392 exits with status 256 in failover()
2014-10-23T16:01:38.345510+02:00 cz01-pgpool01 pgpool[23257]: fork a new PCP child pid 2517 in failover()
2014-10-23T16:01:38.345516+02:00 cz01-pgpool01 pgpool[23257]: worker child 1393 exits with status 256
2014-10-23T16:01:38.345899+02:00 cz01-pgpool01 pgpool[23257]: fork a new worker child pid 2518
Logs from psql01 after the master was stopped
1
2
3
4
5
6
cz01-psql01 / # cat /var/lib/pgsql/9.3/data/pg_log/postgresql-Thu.log
...
2014-10-23 16:01:31 CEST [26708]: [2-1] user=,db= LOG: replication terminated by primary server
2014-10-23 16:01:31 CEST [26708]: [3-1] user=,db= DETAIL: End of WAL reached on timeline 1 at 0/64000090.
2014-10-23 16:01:31 CEST [26708]: [4-1] user=,db= FATAL: could not send end-of-streaming message to primary: no COPY in progress
2014-10-23 16:01:31 CEST [26707]: [5-1] user=,db= LOG: record with zero length at 0/64000090
Check status
1
2
3
4
5
6
7
8
9
cz01-pgpool02 ~ # ssh -q postgres@cz01-psql01.example.com "/usr/pgsql-9.3/bin/repmgr --verbose -f /var/lib/pgsql/repmgr/repmgr.conf cluster show"
[2014-10-23 16:05:16] [INFO] repmgr connecting to database
Opening configuration file: /var/lib/pgsql/repmgr/repmgr.conf
Role | Connection String
* master | host=cz01-psql01.example.com user=repmgr dbname=repmgr
FAILED | host=cz01-psql02.example.com user=repmgr dbname=repmgr
[2014-10-23 16:05:16] [ERROR] Connection to database failed: could not connect to server: Connection refused
Is the server running on host "cz01-psql02.example.com" (10.32.243.148) and accepting
TCP/IP connections on port 5432?
Check the functionality
1
2
3
4
cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-pgpool-ha.example.com -w -c "drop database mydb"
DROP DATABASE
cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-pgpool-ha.example.com -w -l | grep mydb
cz01-pgpool02 ~ # psql --username=admin --dbname=postgres --host cz01-psql01.example.com -w -l | grep mydb
Reinitialize the stopped master to be slave again
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
cz01-pgpool02 ~ # ssh cz01-psql02.example.com 'service postgresql-9.3 stop; su - postgres -c "repmgr -D /var/lib/pgsql/9.3/data -d repmgr -p 5432 -U repmgr -R postgres --verbose --force standby clone cz01-psql01.example.com"; service postgresql-9.3 start;'
Warning: Permanently added 'cz01-psql02.example.com,10.32.243.148' (RSA) to the list of known hosts.
Stopping postgresql-9.3 service: [ OK ]
[2014-10-23 16:07:11] [ERROR] Did not find the configuration file './repmgr.conf', continuing
[2014-10-23 16:07:11] [NOTICE] repmgr Destination directory /var/lib/pgsql/9.3/data provided, try to clone everything in it.
[2014-10-23 16:07:11] [INFO] repmgr connecting to master database
[2014-10-23 16:07:11] [INFO] repmgr connected to master, checking its state
[2014-10-23 16:07:11] [INFO] Successfully connected to primary. Current installation size is 188 MB
Warning: Permanently added 'cz01-psql01.example.com,10.32.243.147' (RSA) to the list of known hosts.
[2014-10-23 16:07:11] [NOTICE] Starting backup...
[2014-10-23 16:07:11] [WARNING] directory "/var/lib/pgsql/9.3/data" exists but is not empty
[2014-10-23 16:07:11] [INFO] standby clone: master control file '/var/lib/pgsql/9.3/data/global/pg_control'
[2014-10-23 16:07:11] [INFO] standby clone: master control file '/var/lib/pgsql/9.3/data/global/pg_control'
[2014-10-23 16:07:11] [INFO] rsync command line: 'rsync --archive --checksum --compress --progress --rsh=ssh --delete postgres@cz01-psql01.example.com:/var/lib/pgsql/9.3/data/global/pg_control /var/lib/pgsql/9.3/data/global'
Warning: Permanently added 'cz01-psql01.example.com,10.32.243.147' (RSA) to the list of known hosts.
receiving incremental file list
pg_control
8192 100% 7.81MB/s 0:00:00 (xfer#1, to-check=0/1)
sent 102 bytes received 235 bytes 674.00 bytes/sec
total size is 8192 speedup is 24.31
[2014-10-23 16:07:12] [INFO] standby clone: master data directory '/var/lib/pgsql/9.3/data'
[2014-10-23 16:07:12] [INFO] rsync command line: 'rsync --archive --checksum --compress --progress --rsh=ssh --delete --exclude=pg_xlog* --exclude=pg_log* --exclude=pg_control --exclude=*.pid postgres@cz01-psql01.example.com:/var/lib/pgsql/9.3/data/* /var/lib/pgsql/9.3/data'
Warning: Permanently added 'cz01-psql01.example.com,10.32.243.147' (RSA) to the list of known hosts.
receiving incremental file list
backup_label
222 100% 216.80kB/s 0:00:00 (xfer#1, to-check=1239/1241)
backup_label.old
222 100% 216.80kB/s 0:00:00 (xfer#2, to-check=1238/1241)
recovery.done
102 100% 99.61kB/s 0:00:00 (xfer#3, to-check=1232/1241)
base/
base/1/
base/1/pg_internal.init
116404 100% 155.93kB/s 0:00:00 (xfer#4, to-check=1225/1490)
base/12896/
base/12896/pg_internal.init
116404 100% 153.41kB/s 0:00:00 (xfer#5, to-check=804/1542)
base/16386/
base/16386/pg_internal.init
116404 100% 151.97kB/s 0:00:00 (xfer#6, to-check=559/1542)
base/16413/
base/16413/pg_internal.init
116404 100% 150.56kB/s 0:00:00 (xfer#7, to-check=301/1542)
deleting pg_stat/global.stat
deleting pg_stat/db_16413.stat
deleting pg_stat/db_16386.stat
deleting pg_stat/db_12896.stat
deleting pg_stat/db_1.stat
deleting pg_stat/db_0.stat
global/
global/12789
8192 100% 7.81MB/s 0:00:00 (xfer#8, to-check=30/1542)
global/12791
16384 100% 15.62MB/s 0:00:00 (xfer#9, to-check=27/1542)
global/12792
16384 100% 15.62MB/s 0:00:00 (xfer#10, to-check=26/1542)
global/12892
8192 100% 7.81MB/s 0:00:00 (xfer#11, to-check=17/1542)
global/12894
16384 100% 15.62MB/s 0:00:00 (xfer#12, to-check=16/1542)
global/12895
16384 100% 5.21MB/s 0:00:00 (xfer#13, to-check=15/1542)
global/pg_internal.init
12784 100% 2.44MB/s 0:00:00 (xfer#14, to-check=13/1542)
pg_clog/0000
8192 100% 1.56MB/s 0:00:00 (xfer#15, to-check=12/1542)
pg_notify/
pg_stat/
pg_stat_tmp/
pg_stat_tmp/db_0.stat
2540 100% 310.06kB/s 0:00:00 (xfer#16, to-check=6/1542)
pg_stat_tmp/db_1.stat
1864 100% 227.54kB/s 0:00:00 (xfer#17, to-check=5/1542)
pg_stat_tmp/db_12896.stat
3047 100% 371.95kB/s 0:00:00 (xfer#18, to-check=4/1542)
pg_stat_tmp/db_16386.stat
4230 100% 458.98kB/s 0:00:00 (xfer#19, to-check=3/1542)
pg_stat_tmp/db_16413.stat
6089 100% 660.70kB/s 0:00:00 (xfer#20, to-check=2/1542)
pg_stat_tmp/global.stat
1026 100% 100.20kB/s 0:00:00 (xfer#21, to-check=1/1542)
sent 5452 bytes received 93077 bytes 65686.00 bytes/sec
total size is 198017139 speedup is 2009.73
[2014-10-23 16:07:13] [INFO] standby clone: master config file '/var/lib/pgsql/9.3/data/postgresql.conf'
[2014-10-23 16:07:13] [INFO] rsync command line: 'rsync --archive --checksum --compress --progress --rsh=ssh --delete postgres@cz01-psql01.example.com:/var/lib/pgsql/9.3/data/postgresql.conf /var/lib/pgsql/9.3/data'
Warning: Permanently added 'cz01-psql01.example.com,10.32.243.147' (RSA) to the list of known hosts.
receiving incremental file list
sent 11 bytes received 80 bytes 60.67 bytes/sec
total size is 20561 speedup is 225.95
[2014-10-23 16:07:14] [INFO] standby clone: master hba file '/var/lib/pgsql/9.3/data/pg_hba.conf'
[2014-10-23 16:07:14] [INFO] rsync command line: 'rsync --archive --checksum --compress --progress --rsh=ssh --delete postgres@cz01-psql01.example.com:/var/lib/pgsql/9.3/data/pg_hba.conf /var/lib/pgsql/9.3/data'
Warning: Permanently added 'cz01-psql01.example.com,10.32.243.147' (RSA) to the list of known hosts.
receiving incremental file list
sent 11 bytes received 76 bytes 174.00 bytes/sec
total size is 4812 speedup is 55.31
[2014-10-23 16:07:14] [INFO] standby clone: master ident file '/var/lib/pgsql/9.3/data/pg_ident.conf'
[2014-10-23 16:07:14] [INFO] rsync command line: 'rsync --archive --checksum --compress --progress --rsh=ssh --delete postgres@cz01-psql01.example.com:/var/lib/pgsql/9.3/data/pg_ident.conf /var/lib/pgsql/9.3/data'
Warning: Permanently added 'cz01-psql01.example.com,10.32.243.147' (RSA) to the list of known hosts.
receiving incremental file list
sent 11 bytes received 78 bytes 178.00 bytes/sec
total size is 1636 speedup is 18.38
[2014-10-23 16:07:14] [NOTICE] Finishing backup...
NOTICE: pg_stop_backup complete, all required WAL segments have been archived
[2014-10-23 16:07:15] [INFO] repmgr requires primary to keep WAL files 000000010000000000000065 until at least 000000010000000000000065
[2014-10-23 16:07:15] [NOTICE] repmgr standby clone complete
[2014-10-23 16:07:15] [NOTICE] HINT: You can now start your postgresql server
[2014-10-23 16:07:15] [NOTICE] for example : pg_ctl -D /var/lib/pgsql/9.3/data start
Opening configuration file: ./repmgr.conf
Starting postgresql-9.3 service: [ OK ]
Again - after the db administrator find out the cause why master went down he needs to initialize the failed master as a slave (by running command above). Then everything is like before the testing - original master/slave state. The green color is used for showing up the changes in the diagram.
Check cluster status after recovery
1
2
3
4
5
6
7
8
9
10
cz01-pgpool02 ~ # ssh -q postgres@cz01-psql02.example.com "/usr/pgsql-9.3/bin/repmgr --verbose -f /var/lib/pgsql/repmgr/repmgr.conf cluster show"
[2014-10-23 16:08:23] [INFO] repmgr connecting to database
Opening configuration file: /var/lib/pgsql/repmgr/repmgr.conf
Role | Connection String
* master | host=cz01-psql01.example.com user=repmgr dbname=repmgr
standby | host=cz01-psql02.example.com user=repmgr dbname=repmgr
cz01-pgpool02 ~ # pcp_node_info 1 localhost 9898 admin password123 0
cz01-psql01.example.com 5432 1 0.500000
cz01-pgpool02 ~ # pcp_node_info 1 localhost 9898 admin password123 1
cz01-psql02.example.com 5432 3 0.500000
Re-enable the slave to be able to receive read-only requests
1
2
cz01-pgpool02 ~ # pcp_detach_node 0 localhost 9898 admin password123 1
cz01-pgpool02 ~ # pcp_attach_node 0 localhost 9898 admin password123 1
Logs right after reenabling the slave again
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
cz01-pgpool01 / # cat /var/log/local0
...
2014-10-23T16:09:06.675035+02:00 cz01-pgpool01 pgpool[23264]: send_failback_request: fail back 1 th node request from pid 23264
2014-10-23T16:09:06.675868+02:00 cz01-pgpool01 pgpool[23257]: wd_start_interlock: start interlocking
2014-10-23T16:09:06.691133+02:00 cz01-pgpool01 pgpool[23257]: wd_assume_lock_holder: become a new lock holder
2014-10-23T16:09:06.698706+02:00 cz01-pgpool01 pgpool[23264]: wd_send_response: WD_STAND_FOR_LOCK_HOLDER received but lock holder exists already
2014-10-23T16:09:07.200639+02:00 cz01-pgpool01 pgpool[23257]: starting fail back. reconnect host cz01-psql02.example.com(5432)
2014-10-23T16:09:07.205234+02:00 cz01-pgpool01 pgpool[23257]: Do not restart children because we are failbacking node id 1 hostcz01-psql02.example.com port:5432 and we are in streaming replication mode
2014-10-23T16:09:07.209510+02:00 cz01-pgpool01 pgpool[23257]: find_primary_node_repeatedly: waiting for finding a primary node
2014-10-23T16:09:07.225224+02:00 cz01-pgpool01 pgpool[23257]: find_primary_node: primary node id is 0
2014-10-23T16:09:07.228223+02:00 cz01-pgpool01 pgpool[23257]: wd_end_interlock: end interlocking
2014-10-23T16:09:08.244875+02:00 cz01-pgpool01 pgpool[23257]: failover: set new primary node: 0
2014-10-23T16:09:08.244902+02:00 cz01-pgpool01 pgpool[23257]: failover: set new master node: 0
2014-10-23T16:09:08.244908+02:00 cz01-pgpool01 pgpool[23257]: failback done. reconnect host cz01-psql02.example.com(5432)
2014-10-23T16:09:08.247112+02:00 cz01-pgpool01 pgpool[2518]: worker process received restart request
2014-10-23T16:09:09.248473+02:00 cz01-pgpool01 pgpool[2517]: pcp child process received restart request
2014-10-23T16:09:09.252461+02:00 cz01-pgpool01 pgpool[23257]: PCP child 2517 exits with status 256 in failover()
2014-10-23T16:09:09.252492+02:00 cz01-pgpool01 pgpool[23257]: fork a new PCP child pid 3080 in failover()
2014-10-23T16:09:09.252499+02:00 cz01-pgpool01 pgpool[23257]: worker child 2518 exits with status 256
2014-10-23T16:09:09.253362+02:00 cz01-pgpool01 pgpool[23257]: fork a new worker child pid 3081
Final cluster status verification
1
2
3
4
5
6
7
8
9
10
cz01-pgpool02 ~ # ssh -q postgres@cz01-psql02.example.com "/usr/pgsql-9.3/bin/repmgr --verbose -f /var/lib/pgsql/repmgr/repmgr.conf cluster show"
[2014-10-23 16:10:33] [INFO] repmgr connecting to database
Opening configuration file: /var/lib/pgsql/repmgr/repmgr.conf
Role | Connection String
* master | host=cz01-psql01.example.com user=repmgr dbname=repmgr
standby | host=cz01-psql02.example.com user=repmgr dbname=repmgr
cz01-pgpool02 ~ # pcp_node_info 1 localhost 9898 admin password123 0
cz01-psql01.example.com 5432 1 0.500000
cz01-pgpool02 ~ # pcp_node_info 1 localhost 9898 admin password123 1
cz01-psql02.example.com 5432 1 0.500000
Huuh… It was a lot of copy & paste work. Anyway it’s here if somebody needs it ;-)