SQL Server Collations and PostgreSQL Encoding - SQL Server to Aurora PostgreSQL Migration Playbook
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

SQL Server Collations and PostgreSQL Encoding

Feature compatibility Amazon SCT / Amazon DMS automation level Amazon SCT action code index Key differences


                              Three star feature compatibility


                              No automation

Collations

UTF16, NCHAR, and NVARCHAR data types aren’t supported.

SQL Server Usage

SQL Server collations define the rules for string management and storage in terms of sorting, case sensitivity, accent sensitivity, and code page mapping. SQL Server supports both ASCII and UCS-2 UNICODE data.

UCS-2 UNICODE data uses a dedicated set of UNICODE data types denoted by the prefix N: Nchar and Nvarchar. Their ASCII counterparts are CHAR and VARCHAR.

Choosing a collation and a character set has significant implications on data storage, logical predicate evaluations, query results, and query performance.

To view all collations supported by SQL Server, use the fn_helpcollations function:

SELECT * FROM sys.fn_helpcollations()

Collations define the actual bitwise binary representation of all string characters and the associated sorting rules. SQL Server supports multiple collations down to the column level. A table may have multiple string columns that use different collations. Collations for non-UNICODE character sets determine the code page number representing the string characters.

UNICODE and non-UNICODE data types in SQL Server aren’t compatible. A predicate or data modification that introduces a type conflict is resolved using predefined collation precedence rules. For more information, see Collation Precedence.

Collations define sorting and matching sensitivity for the following string characteristics:

  • Case

  • Accent

  • Kana

  • Width

  • Variation selector

SQL Server uses a suffix naming convention that appends the option name to the collation name. For example, the collation Azeri_Cyrillic_100_CS_AS_KS_WS_SC, is an Azeri-Cyrillic-100 collation that is case-sensitive, accent-sensitive, kana type-sensitive, width-sensitive, and has supplementary characters.

SQL Server supports three types of collation sets:

  • Windows collations use the rules defined for collations by the operating system locale where UNICODE and non-UNICODE data use the same comparison algorithms.

  • Binary collations use the binary bit-wise code for comparison. Therefore, the locale doesn’t affect sorting.

  • SQL Server collations provide backward compatibility with previous SQL Server versions. They aren’t compatible with the windows collation rules for non-UNICODE data.

You can define collations at various levels:

  • Server-level collations determine the collations used for all system databases and is the default for future user databases. While the system databases collation can’t be changed, you can specify an alternative collation as part of the CREATE DATABASE statement.

  • Database-level collations inherit the server default unless the CREATE DATABASE statement explicitly sets a different collation. This collation is used as a default for all CREATE TABLE and ALTER TABLE statements.

  • Column-level collations can be specified as part of the CREATE TABLE or ALTER TABLE statements to override the default collation setting of your database.

  • Expression-level collations can be set for individual string expressions using the COLLATE function. For example, SELECT * FROM MyTable ORDER BY StringColumn COLLATE Latin1_General_CS_AS.

SQL Server supports UCS-2 UNICODE only.

SQL Server 2019 adds support for UTF-8 for import and export encoding, and as database-level or column-level collation for string data. Support includes PolyBase external tables, and Always Encrypted when not used with Enclaves. For more information, see Collation and Unicode Support.

Syntax

CREATE DATABASE <Database Name>
[ ON <File Specifications> ]
COLLATE <Collation>
[ WITH <Database Option List> ];
CREATE TABLE <Table Name>
(
<Column Name> <String Data Type>
COLLATE <Collation> [ <Column Constraints> ]...
);

Examples

The following example creates a database with a default Bengali_100_CS_AI collation.

CREATE DATABASE MyBengaliDatabase
ON
( NAME = MyBengaliDatabase_Datafile,
  FILENAME = 'C:\Program Files\Microsoft SQL Server-\MSSQL13.MSSQLSERVER\MSSQL\DATA\MyBengaliDatabase.mdf', SIZE = 100)
LOG ON
( NAME = MyBengaliDatabase_Logfile,
FILENAME = 'C:\Program Files\Microsoft SQL Server-\MSSQL13.MSSQLSERVER\MSSQL\DATA\MyBengaliDblog.ldf', SIZE = 25)
COLLATE Bengali_100_CS_AI;

The following example creates a table with two different collations.

CREATE TABLE MyTable
(
Col1 CHAR(10) COLLATE Hungarian_100_CI_AI_SC NOT NULL PRIMARY KEY,
COL2 VARCHAR(100) COLLATE Sami_Sweden_Finland_100_CS_AS_KS NOT NULL
);

For more information, see Collation and Unicode support in the SQL Server documentation.

PostgreSQL Usage

PostgreSQL supports a variety of different character sets, also known as encoding, including support for both single-byte and multi-byte languages. The default character set is specified when initializing a PostgreSQL database cluster with initdb. Each individual database created on the PostgreSQL cluster supports individual character sets defined as part of database creation.

Note

For Amazon Relational Database Service (Amazon RDS), starting with PostgreSQL 13, the Windows version now supports obtaining version information for collations or ordering rules from the operating system.

When you query the collation in PostgreSQL running on Windows, prior to version 13 there wasn’t any value to reflect the OS collation version. For example, for PostgreSQL version 11 running on Windows, the result is shown following:

CREATE COLLATION german (provider = libc, locale = 'de_DE');

CREATE COLLATION

select oid,collname,collversion from pg_collation
where collprovider='c' and collname='german';

oid    collname  collversion
16394  german
(1 row)

select pg_collation_actual_version (16394);

pg_collation_actual_version
(1 row)

For PostgreSQL version 13 running on Windows, the result is shown following:

CREATE COLLATION german (provider = libc, locale = 'de_DE');

CREATE COLLATION

select oid,collname,collversion from pg_collation
where collprovider='c' and collname='german';

oid    collname  collversion
32769  german    1539.5,1539.5
(1 row)

select pg_collation_actual_version (32769);

pg_collation_actual_version
1539.5,1539.5
(1 row)

Clients can use all supported character sets. However, some client-side only characters aren’t supported for use within the server.

Unlike SQL Server, PostgreSQL doesn’t natively support an NVARHCHAR data type and doesn’t provide support for UTF-16.

Type Function Implementation level

Encoding

Defines the basic rules on how alphanumeric characters are represented in binary format. For example, Unicode encoding.

Database

Locale

A superset that includes LC_COLLATE and LC_CTYPE among others. For example, LC_COLLATE defines how strings are sorted and must be a subset supported by the database encoding.

Table-Column

Examples

The following example creates a database named test01 which uses the Korean EUC_KR Encoding the and the ko_KR locale.

CREATE DATABASE test01 WITH ENCODING 'EUC_KR' LC_COLLATE='ko_KR.euckr' LC_CTYPE='ko_KR.euckr' TEMPLATE=template0;

The following example shows how to view the character sets configured for each database by querying the system catalog.

select datname, datcollate, datctype from pg_database;

Changing Character Sets or Encoding

In-place modification of the database encoding isn’t recommended nor supported. Instead, export all data, create a new database with the new encoding, and import the data.

Export the data using the pg_dump utility.

pg_dump mydb1 > mydb1_export.sql

Rename or delete a database.

ALTER DATABASE mydb1 TO mydb1_backup;

Create a new database using the modified encoding.

CREATE DATABASE mydb1_new_encoding WITH ENCODING 'UNICODE' TEMPLATE=template0;

Import data using the pg_dump file previously created. Verify that you set your client encoding to the encoding of your old database.

PGCLIENTENCODING=OLD_DB_ENCODING psql -f mydb1_export.sql mydb1_new_encoding

The client_encoding parameter overrides the use of PGCLIENTENCODING.

Client-Server Character Set Conversions

PostgreSQL supports conversion of character sets between servers and clients for specific character set combinations as described in the pg_conversion system catalog.

PostgreSQL includes predefined conversions. For more information, see Available Character Set Conversions in the PostgreSQL documentation.

You can create a new conversion using the SQL command CREATE CONVERSION.

Examples

The following example creates a conversion from UTF8 to LATIN1 using the custom myfunc1 function.

CREATE CONVERSION myconv FOR 'UTF8' TO 'LATIN1' FROM myfunc1;

The following example configures the PostgreSQL client character set.

Method 1
========
psql \encoding SJIS

Method 2
========
SET CLIENT_ENCODING TO 'value';

View the client character set and reset it back to the default value.

SHOW client_encoding;

RESET client_encoding;

Table Level Collation

PostgreSQL supports specifying the sort order and character classification behavior on a per-column level.

Example

Specify specific collations for individual table columns.

CREATE TABLE test1 (col1 text COLLATE "de_DE", col2 text COLLATE "es_ES");

Summary

Feature SQL Server Aurora PostgreSQL

View database character set

SELECT collation_name FROM sys.databases;

select datname, pg_encoding_to_char(encoding), datcollate, datctype from pg_database;

Modify the database character set

RECRATE the database

  • Export the database.

  • Drop or rename the database.

  • Re-create the database with the desired new character set.

  • Import database data from the exported file into the new database.

Character set granularity

Database

Database

UTF8

Supported

Supported

UTF16

Supported

Not Supported

NCHAR or NVARCHAR data types

Supported

Not Supported

For more information, see Character Set Support in the PostgreSQL documentation.