The concept of backup and restoration of data is a vital one to any database administrator. No system is immune from hard drive crashes, careless users, or any number of potential catastrophes that can endanger data stored within PostgreSQL.
This section covers two general methods for backing up your data. The first approach uses the pg_dump application distributed with PostgreSQL to create a set of SQL instructions with which a database can be fully restored. The second method consists of backing up the filesystem itself.
The pg_dump (which is short for "PostgreSQL dump") application is run from a command line, and creates a list of SQL commands. These commands, executed in the order provided, re-create the database from scratch.
Here is the syntax for pg_dump:
pg_dump [ options ] dbname
In this syntax, dbname is the name of the database that you want to "dump" SQL instructions for. The available options are similar in format to those of the other database management utilities included with PostgreSQL, such as createdb. The most common flag specified in the options to pg_dump is the -f flag, which specifies the file to store the dumped SQL statements within.
Note: If the -f flag is not specified to pg_dump, the dumped SQL will be written to stdout rather than stored in a file.
The complete list of pg_dump options follow:
Forces only COPY or INSERT SQL statements to be dumped (depending on whether or not the -d flag is used). This results in a backup of data, and not database objects (or schema). If the -d flag is not passed along with this flag, the dumped COPY commands are used to copy all data from stdin (i.e., the rows are stored literally within the dumped file as COPY commands from stdin). Otherwise, each row is represented as sequential INSERT statements.
Causes any large objects to be dumped as well as normal data. This option also requires that the -F flag be provided with either the t or c format. By default, large object data is not dumped.
Specifies that SQL statements to drop all existing objects will precede the SQL statements to create those objects. This option is useful in re-initializing an existing database, rather than dropping it and recreating it from scratch.
Specifies that the SQL statement to create the database (CREATE DATABASE) should be included in the dump.
Causes INSERT statements to be dumped for each row of data, rather than the default COPY statements. This can be safer, as a single corrupted row will cause a COPY statement to fail, though it is a much slower process to add a single row at a time during restoration.
Like the -d flag, causes INSERT statements to be dumped; however, with this flag, each INSERT statement is created with an explicit column target list in parentheses, immediately preceding the VALUES.
Directs the output of pg_dump to a file named FILENAME, rather than to stdout. The user executing pg_dump must have system permissions to write to this file.
Determines the file format of the output:
A format of c creates a gzip-compressed tar file (i.e., a .tar.gz file).
A value of t creates a tar file (i.e., a .tar file).
The default value of p causes plain text output.
Note that pg_restore is typically used to handle files created with the c or t (gzip-compressed or tar) formats.
Specifies that HOSTNAME should be connected to, rather than the localhost. Use this when the target database is on another server.
Overrides the check between the version of pg_dump and the version of PostgreSQL running. This option is not recommended unless there is no other alternative, as it most likely will produce errors due to changes in the system catalogs between versions. Normally, you should use the version of pg_dump matching the database that you are backing up.
Suppresses any double-quotes surrounding identifiers unless there are normally illegal characters in the identifier (e.g., spaces, or uppercase characters).
Explicitly indicates that double-quotes should be used around all identifiers. This has been the default behavior of pg_dump since PostgreSQL 6.4.
Causes OIDs (object identifiers) for each row of data to be dumped as well. This can be vital if any of your applications based around the data in PostgreSQL use OIDs in any kind of meaningful or associative way.
Causes ownership to not be taken into account in the dump. A restore with suppressed ownership will cause all re-created objects to belong to the user performing the restore.
Specifies that the database connection should be made on port PORT, rather than the default port (usually 5432, though it may have been configured differently when PostgreSQL was compiled, by the - -with-pgport flag).
Suppresses any \connect statements, which are usually used to enforce currently reflected ownerships when a backup is restored. This is similar in practice to the -O flag, but also precludes the ability to use the -C flag, as a reconnect is required after creation of a new database.
Causes only the schema-related (database objects such as tables, sequences, indices and views) SQL statements to be dumped, ignoring re-creation of the data. This can be useful in moving a general database structure from a development machine to a production machine.
Causes only TABLE to be dumped from the specified database, rather than all tables.
Provides a prompt for a username and password. As of PostgreSQL 7.1.x, this is the only method to provide an alternate username. If the user's password is unset (NULL), you may simply press enter when prompted for it.
Causes verbose output from the pg_dump functions being performed to be displayed to stderr (not stdout).
Suppresses any GRANT or REVOKE statements, which are usually used to preserve the rights set at the time of the dump. Use this flag if you do not wish to enforce any existing rights or restrictions when re-creating a database from this dump.
Sets the degree of compression (0 for the least compression, 9 for the most) when used with the -F c argument.
Any system user may run pg_dump by default, but the user with which you connect to PostgreSQL must have SELECT rights for every object in the database being dumped. Example 9-20 demonstrates the use of pg_dump on the booktown database, with the manager PostgreSQL user. The -C flag passed causes the CREATE DATABASE command to be included in the SQL dump as well. This command is not always included, as you may prefer to create the database beforehand with non-standard options.
Example 9-20. Using pg_dump
[jworsley@booktown ~]$ pg_dump -u -C -f booktown.sql booktown Username: manager Password: [jworsley@booktown ~]$ ls -l booktown.sql -rw-rw-r- - 1 jworsley jworsley 46542 Sep 13 16:42 booktown.sql
Note that since pg_dump provides the standard connection options for specifying a host connection (e.g., the -h, -u and -p flags), it can be used to perform remote backups from any host allowed to make a remote connection (according to the pg_hba.conf file: see Chapter 8 for more on this subject). Example 9-21 shows a user on a remote server specifying a connection to the booktown.commandprompt.com server, in order to create a backup file in compressed format (with the -F c flag) called booktown.sql.tar.gz.
Example 9-21. Using pg_dump remotely
[jworsley@cmd ~]$ pg_dump -u -h booktown.commandprompt.com \ > -F c -f booktown.sql.tar.gz booktown Username: manager Password: [jworsley@cmd ~]$ ls -l booktown.sql.tar.gz -rw-rw-r- - 1 jworsley jworsley 45909 Sep 13 17:12 booktown.sql.tar.gz
If you wish to use large objects in your dumped file, it is necessary that you use either the tar (t) or gzip-compressed (c) format, since the plain-text format cannot include large objects. Otherwise, normal plain-text formatted dumps are suitable for most users.
Creating a dump in tar format can often result in a backup file more than twice the size of its plain-text counterpart, even without large objects. This is because the tar format involves a hierarchy of TOC (table of contents) .dat files which represent information on how to be used by the corresponding pg_restore command; thus, more disk space is taken up by the extra instructions. Since tar is not a compression utility, the gzip-compressed (c) format exists as well to compress the tarred file into gzipped format automatically.
PostgreSQL supplies a supplementary wrapper command to the pg_dump application called pg_dumpall. The primary use of this application is to allow the entire cluster of PostgreSQL databases on a system to be dumped at once, rather than having to perform a pg_dump for each database, one at a time.
Here is the syntax for pg_dumpall, displayed from the - -help flag:
pg_dumpall [ -c ] [ -h HOSTNAME ] [ -p PORT ] [ -g ]
The pg_dumpall command accepts the same connection parameters available to pg_dump. The following are the pg_dumpall-specific parameters:
Specifies that SQL statements to drop existing global objects will precede the SQL statements to create those objects.
Specifies that HOSTNAME should be connected to, rather than the localhost, or the host defined by the PGHOST environment variable. Use this when the target database is on another server.
Specifies that the database connection should be made on port PORT, rather than the default port (usually 5432).
Specifies that only global objects will be dumped. This is primarily useful for recreating just users and groups, or for duplicating them on another machine (by taking the dump to another machine, and executing it). The -g flag implicitly causes all users to be deleted from the pg_shadow table prior to the CREATE statements. Exercise caution with the output from this command!
Do not pass the -? flag to pg_dumpall, as it will result in passing that flag to pg_dump for each database, which may result in quite a few more help screens than you would expect. If you wish to view pg_dumpall's help, use the - -help flag.
Note that as of PostgreSQL 7.1.x, the use of the pg_dumpall script does have some practical limitations over the normal use of pg_dump. For example, the -u flag cannot be used to provide a different username and password, and the -F flag may not be used to specify a format other than plain text (your backups will be stored as plain text, regardless of chosen format). This means that the -b flag cannot be used with pg_dumpall either, as it requires a format other than plain-text.
While the -f flag can be used to pass a filename through to pg_dump, doing so does not create a complete dump, as pg_dumpall's global data is still sent to stdout. To solve the problem of not having the -f flag available to pg_dumpall, the shell re-direction operator (>) can be used to redirect the output of pg_dumpall to a file.
A simple workaround to the lack of the -u flag is to set the PGUSER environment variable. You can also set the PGPASSWORD environment variable in this fashion if you are connecting to a system which requires password authentication, and you do not wish to provide a password for each database that it connects to and dumps.
Example 9-22 demonstrates a simple bash-shell shorthand for a temporary environment variable (PGUSER) when invoking any PostgreSQL client. While not usually necessary, it can be a handy trick to know for exceptions such as the pg_dumpall script. Example 9-22 uses this technique to create one dump file for all databases.
The first part of the shell statement demonstrated in Example 9-22 sets a temporary environment variable named PGUSER to the value of postgres. This variable is set for the duration of the pg_dumpall command, and expires when the command has finished.
Note: The pg_dumpall command generally requires that the user executing the script be a PostgreSQL superuser. This is because the pg_dumpall command requires access to the PostgreSQL system catalogs, as it dumps global objects as well as database objects.
You can also run the pg_dumpall command remotely, though be sure to set any environment variables as needed. These will depend greatly on the remote host configuration in the pg_hba.conf file.
You should use the pg_dumpall command if it is especially inconvenient to backup all your existing databases individually, or if you have any kind of complex user and group system in place. If you are inhibited by the limitations of pg_dumpall for data output (particularly if your database makes use of large objects), the simplest solution is to use pg_dumpall with the -g flag to keep a backup of all user and group data, and to subsequently use pg_dump for each database which needs to be backed up, individually.
Data from a SQL dump can be restored to a database in one of two ways. If the dump created by pg_dump as a simple, plain-text file, it may be passed through to psql directly as an input file. Alternatively, if another output format was chosen (e.g., the tar or compressed tar format), the pg_restore command must be used.
A database may either be restored from scratch, to an empty database, or to a non-existent database; how you go about restoring a database depends largely on how it was dumped (e.g., if only data were dumped, or if the commands to create the database were included).
A plain text output file from pg_dump may be passed through to psql as an input file. This executes sequentially each of the dumped SQL instructions. Depending on how the dump was created, and for what purpose, there are a variety of ways to invoke psql with the dump.
If your dump was created with the -C flag, the SQL statement to create the database is included in the dump file itself. This means that, most likely, the database was either dropped, or has not yet been created on the system on which it is being restored. If the database already exists, it may need to be dropped, but only do this if you are sure the dump is up to date.
On the other hand, if the -C flag was not used, you may need to first create the database before connecting to it and restoring its attributes and data. Remember also that you need to specify the usual connection parameters to psql in order to operate as a user with the rights to create a database.
Example 9-23 demonstrates recreating the booktown database with the booktown.sql file created in Example 9-20, in the Section called Using pg_dump" earlier in this chapter. Since the -C flag was used in this example, there is no need to create the database first; it can be created by bootstrapping through the template1 database.
Example 9-23. Recreating the booktown database
[jworsley@booktown ~]$ psql -U manager -f booktown.sql template1 CREATE DATABASE You are now connected to database booktown as user postgres. COMMENT CREATE CREATE CHANGE [...]
As each dumped command in the booktown.sql file is processed by PostgreSQL, the resulting server messages (e.g., CREATE, CHANGE) will be displayed to stderr.
Note: Since psql can be used remotely, this same technique may be used across a network, provided the correct connection parameters are used from an authorized host.
For files created by pg_dump with a file format other than plain text, the pg_restore command exists to seamlessly restore the dumped database from the tar, or compressed tar file.
Here is the syntax for the pg_restore command:
pg_restore [ options ] [ file ]
In this syntax, if file is not specified, pg_restore will wait for data from stdin. This means that you may effectively use the < shell redirection character with the same results. Notice especially in the options the -d flag. If left unspecified, pg_restore will simply display the database restoration statements to stdout (and thus, to the screen) rather than actually restoring a database.
If you are using the -C flag to create a database from scratch, you must still supply the -d flag (with the name of an existing database to connect to first, e.g., template1) from which to initialize a connection and create the new database. In such a case, it is not important to which database you initially connect, as it is only a temporary connection until the new database is created.
Many of the options for pg_restore directly mirror those available in pg_dump. In some cases, the same option must be supplied in both pg_dump and pg_restore in order for the desired functionality to be achieved. For example, this is the case with the -C flag. If used with pg_dump, but not pg_restore, the CREATE DATABASE command will be ignored by pg_restore, even though the command is present in the dump file.
The following are more detailed explanations of each option:
Causes any reference to creation of database schema objects to be ignored, restoring only data records (those with COPY or INSERT statements).
Causes any DROP SQL statements to be executed before creating database objects. Without the -c flag, these statements are ignored, even if they are present in the dump file.
Causes the CREATE DATABASE SQL statement (if found in the dump file) to be executed. Without the -C flag, the statement is ignored.
Specifies the database called NAME to be connected to for the restoration. If the -C flag is used to create a new database, the -d flag should be pointed to template1. If this parameter is not specified, the commands to restore the database will instead be displayed to stdout rather than sent to PostgreSQL.
Indicates that FILENAME is the target for the database restoration SQL commands, rather than a the postmaster backend (with the -d flag), or stdout (the default).
Specifies the format of the input file, FILENAME. A value of c indicates that it is a compressed and tarred dump, while t indicates that it is only tarred. Note that this option is usually not necessary, as pg_restore can judge what kind of file it is dealing with through its header data.
Specifies that you should connect to HOSTNAME, rather than the localhost.
Specifies that only indices be recreated. Note that due to a bug, the -i flag may not work, though the - -index flag should (as of PostgreSQL 7.1.x).
Specifies that only the table of contents of database objects should be output in a comma-delimited, PostgreSQL specific TOC (table of contents) format. This output can be redirected to a file (either with shell redirection via the < character, or with the -f flag) and later used with the -L flag to control what database objects are restored.
Indicates use of the PostgreSQL TOC file FILENAME to determine which objects should be restored by pg_restore. This file is generated using the -l flag. After generating the file, delete lines for objects you do now wish to restore, or preface those lines with a semicolon (;). When -L is used, pg_restore only restores those objects listed in the specified TOC file.
Causes the restore to occur in the same order that the pg_dump originally dumped the objects in (through the use of the extra TOC information in a tar or gzip-compressed format). This is not the same as the literal order in which the statements are placed in the dump file itself, which is the default restore order. This option excludes the use of the -o or -r options.
If, during a restoration, database objects are created in an incorrect order (e.g., an object which relies on another existing object is created before the object it relies on), you can re-initialize a database and try this flag to override the order which was originally chosen by pg_dump.
Causes the restore to occur strictly in the order of OIDs, ascending; this option excludes the use of the -N or -r options.
Forces pg_restore to ignore any \connect statements which would be used to enforce ownership.
Specifies that the database connection should be made on port PORT, rather than the default port (usually 5432, though it may have been configured differently when PostgreSQL was compiled, by the - -with-pgport flag).
Specifies that only functions are to be recreated. Like the -i flag, due to a bug, the -P flag may not work, though the - -function flag should, as of PostgreSQL 7.1.x.
Causes the restore to occur in the order chosen by pg_dump at the time of the dump file's creation. Most of the objects are created in OID order with this option, though statements creating rules and indices are placed towards the end of the file. This option is the default.
Forces pg_restore to ignore all \connect statements (not just those intended to enforce ownership). This cannot be used with the -C flag, which requires at least one reconnection after the creation of a new database.
Causes only the creation of database schema objects, such as tables, views, sequences, and indices. No rows will be copied or inserted into the tables, and sequences will initialize to their default values. This can be used, for example, to create an empty production database that matches the structure of a development database.
Specifies the superuser with username NAME to be used in disabling triggers (if necessary, to recreate a trigger), as well as to set ownership of schema elements.
Causes only the table NAME to be restored, rather than all database objects. Specifying just - -table causes only tables to be restored.
Causes only the trigger NAME to be restored, rather than all database objects. Specifying just - -trigger causes only triggers to be restored.
Causes pg_restore to provide a prompt for a username and password.
Causes verbose output of each action as it is performed. This output is sent to stderr, rather than stdout.
Suppresses any SQL GRANT or REVOKE statement in the dump being restored.
Example 9-24 demonstrates a restoration of the booktown database on a separate machine from which the original was made. It uses the booktown.sql.tar file created in Example 9-21, in the Section called Using pg_dump," earlier in this chapter, as the source for the restoration.
Example 9-24. Restore with pg_restore
[jworsley@cmd ~]$ pg_restore -v -C -O -d template1 booktown.sql.tar Connecting to database for restore Creating DATABASE booktown Connecting to new DB 'booktown' as postgres Connecting to booktown as postgres Creating COMMENT DATABASE "booktown" Creating TABLE inventory [...]
You can see upon examining the pg_restore command in Example 9-24 that it uses the -v flag for verbose output as it operates, the -C flag to create the database (as this is a new database on this machine), and the -O flag to ignore ownership from the original database (as the users on another machine are not guaranteed to exist locally). Notice also the -d flag is used to connect to the template1 database before creating, and connecting to, the booktown database.
Note that the use of the -O flag can be dangerous if ownership is an important part of the recreation of a database. It can play a helpful role in moving from a development environment to a production environment (e.g., if test or development account names were associated with various database objects). However, if a database is being restored on an existing machine (e.g., from a nightly backup), it is not recommended that the -O flag be used.
Remember that pg_restore exists only for files that are output in either tar format (t), or compressed tar format (c). Plain text SQL dumps may be processed with psql, as documented in the Section called Using psql for plain text dumps," earlier in this chapter.
An important consideration to the use of the pg_dump, pg_dumpall, and pg_restore commands is when to use them, and when not to. Fortunately, in respect to each of these procedures, PostgreSQL is quite accommodating.
With regards to backing up data with either pg_dump or pg_dumpall, there are few considerations necessary for when they may be performed. PostgreSQL has supported hot backup procedures since Version 6.5—these allow you to request data without blocking the normal activity of other concurrent users. It is called a hot backup because it is performed while the system is running, uninterrupted.
Therefore, the only potential considerations for backup PostgreSQL center around performance. An exceptionally large database may take a while to dump all of its contents. The use of large objects may also be a factor if you intend to back up large object data with pg_dump's -b flag (thus, adding to the amount of data needing to be output).
If you have a large database that takes a substantial amount of time to complete a dump, it is recommended that you schedule the pg_dump execution for a time when the database is not heavily used. Even though a dump does not block users from requesting and completing transactions, it can still slow down the general performance of such a system during heavy, or even medium usage.
With respect to restoration there are several more considerations to be taken into account than when merely backing up data. Specifically, these apply to how "deep" a restoration must go; restoring just the data is a very different operation from totally recreating the database from scratch, and the restrictions involved scale with the depth of the operation.
The least restrictive kind of restoration is one which restores data only. This can be executed while users are actively connected to the database. It may even be executed while connected users are in the middle of transactions. This is possible through PostgreSQL's multiversion control. Such a restore can be performed on the fly, without having to restart the database system. Once modifications are synchronized with the database, the changes are immediately available to connected users.
A restoration involving dropping and recreating database schema (e.g., tables, views, etc.) may also be performed while the system is running. This method is not as seamless as a data-only restoration, because database objects will briefly be removed from the system, which may cause temporary problems to applications relying on certain objects to exist. The exact nature of such a restriction is dependent on the nature of the application accessing the database.
The most restrictive kind of restoration is one which involves dropping the database itself. If such a restoration is scheduled, it must be done at a time when no other user is connected to that database. The DROP DATABASE command will fail if any user is actively connected at the time it is executed.
It may in fact be necessary to shut down and restart PostgreSQL with TCP/IP connections disabled if a highly-used database is intended to be dropped and recreated from scratch; this will prevent any external machine from connecting to the database server until the work is completed.
While PostgreSQL abstracts the literal data files from its users, all of the data within PostgreSQL databases can be found in normal system files. These files are constantly in a state of flux when the database is running, as not all changes made in PostgreSQL are immediately written to disk. These files are stored within the PostgreSQL directory (e.g., /usr/local/pgsql/data, or whatever path is specified by the PGDATA environment variable).
Rather than creating a set of SQL instructions to re-create a database, you may instead opt to stop the PostgreSQL server (in order to ensure that all changes are synchronized with the hard disk) and create a backup of that area of the filesystem. Typically this is done with the tar utility, and optionally compressed with a compression utility such as gzip.
Example 9-25 demonstrates the backing up of a filesystem in which PostgreSQL keeps its data, as performed by the user who owns the data files (which is the same user that runs PostgreSQL's backend). In this case, the path is /usr/local/pgsql/data, and the system user is postgres.
Example 9-25. Backing up the PostgreSQL filesystem
[postgres@booktown ~]$ cd /usr/local/pgsql [postgres@booktown pgsql]$ pg_ctl stop Smart Shutdown request at Fri Sep 14 14:54:15 2001 DEBUG: shutting down waiting for postmaster to shut down......DEBUG: database system is shut down done postmaster successfully shut down [postgres@booktown pgsql]$ tar czf pgsql.bak.tar.gz data/ [postgres@booktown pgsql]$ ls -l *.tar.gz -rw-rw-r- - 1 postgres postgres 66124795 Sep 14 14:36 pgsql.bak.tar.gz
Notice that the pg_ctl command is called before the tar process, to stop the PostgreSQL backend (if installed, the SysV script may be invoked with the service command to the same end). As stated, this is to ensure that the most recent changes to the database have been synchronized with the hard disk, as well as to verify that none of the data files are modified while backed up.
The primary advantage to backing up your data in this fashion is that you have a literal backup of PostgreSQL's data files. In order to restore a crashed database from this kind of file, it needs to be decompressed in the appropriate location, and for the backend to be re-started. There is no need for negotiation of options, ownership, or potential conflicts between the pg_dump output and restoring it to a live PostgreSQL server through sequential SQL statements.
However, while this method is easier to implement, it presents several limitations. First, the database must be shut down completely to backup or restore the data, eliminating the primary advantage of a hot backup-capable DBMS, which is limited downtime. Further, it is not possible to backup only specific databases, or tables. The entire data directory must be backed up for a complete restoration of the filesystem. This is because there are many files associated with a particular database, and it is not obvious which files correlate to which databases.
Finally, because more than abstract information is represented on disk by a live database, a much greater amount of disk space is required to backup even a compressed copy of the entire data directory.