Portable File Format

These days, most computers use the same internal data formats for integer and floating-point data, if one ignores minor differences like big- versus little-endian byte ordering. This has not always been true, particularly in the 1960s and 1970s, when the portable file format originated as a way to exchange data between systems with incompatible data formats.

At the time, even bytes being 8 bits each was not a given. For that reason, the portable file format is a text format, because text files could be exchanged among systems somewhat more freely than binary ones. On the other hand, character encoding was not standardized, so exchanging data in portable file format required recoding it from the origin system's character encoding to the destination's.

Some contemporary systems represented text files as sequences of fixed-length (typically 80-byte) records, without new-line sequences. These operating systems padded shorter lines with spaces and truncated longer lines. To tolerate files copied from such systems, which might drop spaces at the ends of lines, the portable file format treats lines less than 80 bytes long as padded with spaces to that length.

The portable file format self-identifies the character encoding on the system that produced it at the very beginning, in the header. Since portable files are normally recoded when they are transported from one system to another, this identification can be wrong on its face: a file that was started in EBCDIC, and is then recoded to ASCII, will still say EBCDIC SPSS PORT FILE at the beginning, just in ASCII instead of EBCDIC.

The portable file header also contains a table of all of the characters that it supports. Readers use this to translate each byte of the file into its local encoding. Like the rest of the portable file, the character table is recoded when the file is moved to a system with a different character set so that it remains correct, or at least consistent with the rest of the file.

The portable file format is mostly obsolete. System files are a better alternative.

Sources

The information in this chapter is drawn from documentation and source code, including:

  • pff.tar.Z, a Fortran program from the 1980s that reads and writes portable files. This program contains translation tables from the portable character set to EBCDIC and to ASCII.

  • A document, now lost, that describes portable file syntax.

It is further informed by a corpus of about 1,400 portable files. The plausible creation dates in the corpus range from 1986 to 2025, in addition to 131 files with alleged creation dates between 1900 and 1907 and 21 files with an invalid creation date.

Portable File Characters

Portable files are arranged as a series of lines of 80 characters each. Each line is terminated by a carriage-return, line-feed sequence ("new-lines"). New-lines are only used to avoid line length limits imposed by some OSes; they are not meaningful.

Most lines in portable files are exactly 80 characters long. The only exception is a line that ends in one or more spaces, in which the spaces may optionally be omitted. Thus, a portable file reader must act as though a line shorter than 80 characters is padded to that length with spaces.

The file must be terminated with a Z character. In addition, if the final line in the file does not have exactly 80 characters, then it is padded on the right with Z characters. (The file contents may be in any character set; the file contains a description of its own character set, as explained in the next section. Therefore, the Z character is not necessarily an ASCII Z.)

For the rest of the description of the portable file format, new-lines and the trailing Zs will be ignored, as if they did not exist, because they are not an important part of understanding the file contents.
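As a rough illustration of the line handling described above, the following Python sketch flattens a portable file into one continuous stream. The function name `join_portable_lines` is hypothetical, and the sketch assumes an ASCII-compatible encoding (CR and LF as new-line bytes, 0x20 as space), which holds for the files in the corpus but is not guaranteed by the format itself.

```python
def join_portable_lines(raw: bytes, width: int = 80) -> bytes:
    """Return the file content with new-lines removed.

    Lines shorter than `width` are padded on the right with spaces,
    because a writer may have dropped trailing spaces.
    """
    out = bytearray()
    # bytes.splitlines() accepts both CRLF and LF-only line endings.
    for line in raw.splitlines():
        out += line.ljust(width, b" ")
    return bytes(out)
```

The trailing Z padding is deliberately left in place here; a reader can simply stop when it reaches the end-of-file marker.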

Portable File Structure

Every portable file consists of the following records, in sequence:

  • Splash strings.

  • Version and date info.

  • Product identification.

  • Author identification (optional).

  • Subproduct identification (optional).

  • Variable count.

  • Case weight variable (optional).

  • Variables. Each variable record may optionally be followed by a missing value record and a variable label record.

  • Value labels (optional).

  • Documents (optional).

  • Data.

Most records are identified by a single-character tag code. The file header and version info record do not have a tag.

Other than these single-character codes, there are three types of fields in a portable file: floating-point, integer, and string. Floating-point fields have the following format:

  • Zero or more leading spaces.

  • Optional asterisk (*), which indicates a missing value. The asterisk must be followed by a single character, generally a period (.), but it appears that other characters may also be possible. This completes the specification of a missing value.

  • Optional minus sign (-) to indicate a negative number.

  • A whole number, consisting of one or more base-30 digits: 0 through 9 plus capital letters A through T.

  • Optional fraction, consisting of a radix point (.) followed by one or more base-30 digits.

  • Optional exponent, consisting of a plus or minus sign (+ or -) followed by one or more base-30 digits.

  • A forward slash (/).

Integer fields take a form identical to floating-point fields, but they may not contain a fraction.
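The field syntax above can be sketched as a small parser. This is only an illustration: the name `parse_number` is hypothetical, the input is assumed to be already decoded to a Python string with new-lines removed, and the exponent is assumed to scale the value by a power of 30 (the text above gives the exponent's syntax but not its meaning).

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRST"  # base-30 digits, 0 through T


def parse_number(text, pos=0):
    """Parse one numeric field starting at text[pos].

    Returns (value, next_pos); value is None for a missing value.
    """

    def digits(p):
        # Accumulate one or more base-30 digits; return (value, end, count).
        start, value = p, 0
        while p < len(text) and text[p] in DIGITS:
            value = value * 30 + DIGITS.index(text[p])
            p += 1
        if p == start:
            raise ValueError(f"expected base-30 digit at offset {p}")
        return value, p, p - start

    while text.startswith(" ", pos):  # zero or more leading spaces
        pos += 1
    if text.startswith("*", pos):
        return None, pos + 2  # asterisk plus the single character after it
    negative = text.startswith("-", pos)
    if negative:
        pos += 1
    whole, pos, _ = digits(pos)
    value = float(whole)
    if text.startswith(".", pos):
        frac, pos, count = digits(pos + 1)
        value += frac / 30 ** count
    if text.startswith(("+", "-"), pos):
        sign = -1 if text[pos] == "-" else 1
        exponent, pos, _ = digits(pos + 1)
        value *= 30.0 ** (sign * exponent)  # assumption: power of 30
    if not text.startswith("/", pos):
        raise ValueError(f"expected '/' at offset {pos}")
    return (-value if negative else value), pos + 1
```

For instance, `parse_number("1.F/")` yields 1.5, since F is the base-30 digit 15 and 15/30 is one half. An integer field can be read the same way, rejecting any result that carries a fraction.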

String fields take the form of an integer field having value N, followed by exactly N characters, which are the string content.

Strings longer than 255 bytes exist in the corpus.
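A string field can be read and written along the following lines. The helper names (`format_int`, `format_string`, `parse_string`) are hypothetical, and the sketch assumes the content has already been decoded to a Python string, so that N counts characters.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRST"  # base-30 digits, 0 through T


def format_int(n):
    """Write a non-negative integer as a base-30 integer field."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = ""
    while True:
        n, digit = divmod(n, 30)
        out = DIGITS[digit] + out
        if n == 0:
            return out + "/"


def format_string(s):
    """Write a string field: the length as an integer field, then the content."""
    return format_int(len(s)) + s


def parse_string(text, pos=0):
    """Parse one string field at text[pos]; return (content, next_pos)."""
    slash = text.index("/", pos)  # end of the length field
    length = 0
    for ch in text[pos:slash].lstrip(" "):
        length = length * 30 + DIGITS.index(ch)
    start = slash + 1
    end = start + length
    if end > len(text):
        raise ValueError("string field runs past the end of the input")
    return text[start:end], end
```

Because the length is known before the content is read, the content may itself contain a slash without confusing the parser.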

Splash Strings

Every portable file begins with 200 bytes of splash strings that serve to identify the file's type and its original character set. The 200 bytes are divided into five 40-byte sections, each of which is supposed to represent the string <CHARSET> SPSS PORT FILE in a different character set encoding [1], where <CHARSET> is the name of the character set used in the file, e.g. ASCII or EBCDIC. Each string is padded on the right with spaces in its respective character set.

It appears that these strings exist only to inform those who might view the file on a screen, letting them know what character set the file is in regardless of how they are viewing it, and that they are not parsed by SPSS products. Thus, they can be safely ignored. It is reasonable to simply write out ASCII SPSS PORT FILE five times, each time padded to 40 bytes.
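A writer following that suggestion could produce the splash strings as in this minimal sketch. The function name `splash_strings` is hypothetical, and the output is plain ASCII.

```python
def splash_strings(charset: str = "ASCII") -> bytes:
    """Return 200 bytes: the same 40-byte splash string repeated five times."""
    one = f"{charset} SPSS PORT FILE".ljust(40).encode("ascii")
    if len(one) != 40:
        raise ValueError("character set name is too long for a 40-byte section")
    return one * 5
```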

Translation Table

The splash strings are followed by a 256-byte character set translation table. This segment describes a mapping from the character set used in the portable file to a "portable character set" that does not correspond to any known single-byte character set or code page. Each byte in the table reports the byte value that corresponds to the character represented by its position. The following section lists the character at each position.

For example, position 0x4a (decimal 74) in the portable character set is uppercase letter A (as shown in the table in the following section), so the 75th byte in the table is the value that represents A in the file.

A real character set does not necessarily include all of the characters in the portable character set. In the translation table, omitted characters are written as digit 0 [2].

For example, in practice, all of the control character positions are always written as 0.
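One way a reader might turn the translation table into a decoder is sketched below. The names `PORTABLE_CHARS` and `build_decoder` are hypothetical. The character list covers positions 0x40 through 0xBC using the Unicode suggestions given in the following section, with the superscripts at A8 through AA taken as U+00B9, U+00B2, and U+00B3; the control and reserved positions are ignored.

```python
# Portable character set, positions 0x40 through 0xBC, 16 per row.
PORTABLE_CHARS = (
    "0123456789ABCDEF"
    "GHIJKLMNOPQRSTUV"
    "WXYZabcdefghijkl"
    "mnopqrstuvwxyz ."
    "<(+|&[]!$*);^-/\u00a6"
    ",%_>?`:#@'=\"\u2264\u25a1\u00b1\u25a0"
    "\u00b0\u2020~\u2013\u2514\u250c\u2265\u2070"
    "\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078"
    "\u2079\u2518\u2510\u2260\u2014\u207d\u207e\u207a{}\\\u00a2\u00b7"
)
FIRST_POS = 0x40


def build_decoder(table: bytes) -> dict[int, str]:
    """Map each byte value used in the file to a Unicode character."""
    if len(table) != 256:
        raise ValueError("translation table must be 256 bytes")
    unmapped = table[FIRST_POS]  # the file's digit 0 also marks omitted characters
    decoder: dict[int, str] = {}
    for offset, char in enumerate(PORTABLE_CHARS):
        file_byte = table[FIRST_POS + offset]
        if file_byte == unmapped and char != "0":
            continue  # character not available in the file's character set
        # If two positions share a byte value, the earlier position wins.
        decoder.setdefault(file_byte, char)
    return decoder
```

Decoding is then a matter of looking up each byte of the file in the returned dictionary; bytes with no entry fall outside the portable character set as declared by that file.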

The following section describes how the translation table is supposed to act based on looking at the sources, and then the section after that describes what it actually contains in practice.

Theory

The table below shows the portable character set. The columns in the table are:

  • "Pos", a position within the portable character set, in hex, from 00 to FF.

  • "EBCDIC", the translation for the given position to EBCDIC, as written in pff.tar.Z.

  • "ASCII", the translation for the given position to ASCII, as written in pff.tar.Z.

  • "Unicode", a suggestion for the best translation from this position to Unicode.

  • "Notes", which links to additional information for some characters.

In addition to the sources previously cited, some of the information below is drawn from RFC 183, from 1971. This RFC shows many of the "EBCDIC" hex codes in pff.tar.Z as corresponding to the descriptions in the document, even though no known EBCDIC codepage contains those characters with those codes.

Pos  EBCDIC  ASCII  Unicode                                   Notes
00   00                                                       [3]
01   01                                                       [3]
02   02                                                       [3]
03   03                                                       [3]
04   04                                                       [3]
05   05             U+0009 CHARACTER TABULATION               [3]
06   06                                                       [3]
07   07                                                       [3]
08   08                                                       [3]
09   09                                                       [3]
0A   0A                                                       [3]
0B   0B             U+000B LINE TABULATION                    [3]
0C   0C             U+000C FORM FEED                          [3]
0D   0D             U+000D CARRIAGE RETURN                    [3]
0E   0E                                                       [3]
0F   0F                                                       [3]
10   10                                                       [3]
11   11                                                       [3]
12   12                                                       [3]
13   13                                                       [3]
14   3C                                                       [3]
15   15             U+000A LINE FEED                          [3]
16   16             U+0008 BACKSPACE                          [3]
17   17                                                       [3]
18   18                                                       [3]
19   19                                                       [3]
1A   1A                                                       [3]
1B   1B                                                       [3]
1C   1C                                                       [3]
1D   1D                                                       [3]
1E   1E                                                       [3]
1F   2A                                                       [3]
20   20                                                       [3]
21   21                                                       [3]
22   22                                                       [3]
23   23                                                       [3]
24   2B                                                       [3]
25   25             U+000A LINE FEED                          [3]
26   26                                                       [3]
27   27                                                       [3]
28   1F                                                       [3]
29   24                                                       [3]
2A   14                                                       [3]
2B   2D                                                       [3]
2C   2E                                                       [3]
2D   2F             U+0007 BELL                               [3]
2E   32                                                       [3]
2F   33                                                       [3]
30   34                                                       [3]
31   35                                                       [3]
32   36                                                       [3]
33   37                                                       [3]
34   38                                                       [3]
35   39                                                       [3]
36   3A                                                       [3]
37   3B                                                       [3]
38   3D                                                       [3]
39   3F                                                       [3]
3A   28                                                       [3]
3B   29                                                       [3]
3C   2C                                                       [3]
3D                                                            [4]
3E                                                            [4]
3F                                                            [4]
40   F0      30     U+0030 DIGIT ZERO                         0
...
49   F9      39     U+0039 DIGIT NINE                         9
4A   C1      41     U+0041 LATIN CAPITAL LETTER A             A
...
52   C9      49     U+0049 LATIN CAPITAL LETTER I             I
53   D1      4A     U+004A LATIN CAPITAL LETTER J             J
...
5B   D9      52     U+0052 LATIN CAPITAL LETTER R             R
5C   E2      53     U+0053 LATIN CAPITAL LETTER S             S
...
63   E9      5A     U+005A LATIN CAPITAL LETTER Z             Z
64   81      61     U+0061 LATIN SMALL LETTER A               a
...
6C   89      69     U+0069 LATIN SMALL LETTER I               i
6D   91      6A     U+006A LATIN SMALL LETTER J               j
...
75   99      72     U+0072 LATIN SMALL LETTER R               r
76   A2      73     U+0073 LATIN SMALL LETTER S               s
...
7D   A9      7A     U+007A LATIN SMALL LETTER Z               z
7E   40      20     U+0020 SPACE
7F   4B      2E     U+002E FULL STOP                          .
80   4C      3C     U+003C LESS-THAN SIGN                     <
81   4D      28     U+0028 LEFT PARENTHESIS                   (
82   4E      2B     U+002B PLUS SIGN                          +
83   59             U+007C VERTICAL LINE                      | [5]
84   50      26     U+0026 AMPERSAND                          &
85   AD      5B     U+005B LEFT SQUARE BRACKET                [
86   BD      5D     U+005D RIGHT SQUARE BRACKET               ]
87   5A      21     U+0021 EXCLAMATION MARK                   !
88   5B      24     U+0024 DOLLAR SIGN                        $
89   5C      2A     U+002A ASTERISK                           *
8A   5D      29     U+0029 RIGHT PARENTHESIS                  )
8B   5E      3B     U+003B SEMICOLON                          ;
8C   5F      5E     U+005E CIRCUMFLEX ACCENT                  ^
8D   60      2D     U+002D HYPHEN-MINUS                       -
8E   61      2F     U+002F SOLIDUS                            /
8F   6A      7C     U+00A6 BROKEN BAR                         ¦ [5]
90   6B      2C     U+002C COMMA                              ,
91   6C      25     U+0025 PERCENT SIGN                       %
92   6D      5F     U+005F LOW LINE                           _
93   6E      3E     U+003E GREATER-THAN SIGN                  >
94   6F      3F     U+003F QUESTION MARK                      ?
95   79      60     U+0060 GRAVE ACCENT                       `
96   7A      3A     U+003A COLON                              :
97   7B      23     U+0023 NUMBER SIGN                        #
98   7C      40     U+0040 COMMERCIAL AT                      @
99   7D      27     U+0027 APOSTROPHE                         '
9A   7E      3D     U+003D EQUALS SIGN                        =
9B   7F      22     U+0022 QUOTATION MARK                     "
9C   8C             U+2264 LESS-THAN OR EQUAL TO              ≤
9D   9C             U+25A1 WHITE SQUARE                       □ [6]
9E   9E             U+00B1 PLUS-MINUS SIGN                    ±
9F   9F             U+25A0 BLACK SQUARE                       ■ [7]
A0                  U+00B0 DEGREE SIGN                        °
A1   8F             U+2020 DAGGER                             †
A2   A1      7E     U+007E TILDE                              ~
A3   A0             U+2013 EN DASH                            –
A4   AB             U+2514 BOX DRAWINGS LIGHT UP AND RIGHT    └ [8]
A5   AC             U+250C BOX DRAWINGS LIGHT DOWN AND RIGHT  ┌ [8]
A6   AE             U+2265 GREATER-THAN OR EQUAL TO           ≥
A7   B0             U+2070 SUPERSCRIPT ZERO                   ⁰ [8]
...
B0   B9             U+2079 SUPERSCRIPT NINE                   ⁹ [8]
B1   BB             U+2518 BOX DRAWINGS LIGHT UP AND LEFT     ┘ [8]
B2   BC             U+2510 BOX DRAWINGS LIGHT DOWN AND LEFT   ┐ [8]
B3   BE             U+2260 NOT EQUAL TO                       ≠
B4   BF             U+2014 EM DASH                            —
B5   8D             U+207D SUPERSCRIPT LEFT PARENTHESIS       ⁽
B6   9D             U+207E SUPERSCRIPT RIGHT PARENTHESIS      ⁾
B7   BE             U+207A SUPERSCRIPT PLUS SIGN              ⁺ [9]
B8   C0      7B     U+007B LEFT CURLY BRACKET                 {
B9   D0      7D     U+007D RIGHT CURLY BRACKET                }
BA   E0      5C     U+005C REVERSE SOLIDUS                    \
BB   4A             U+00A2 CENT SIGN                          ¢
BC   AF             U+00B7 MIDDLE DOT                         · [10]
BD                                                            [4]
...
FF                                                            [4]

Summary:

Range     Characters
40...4F   0123456789ABCDEF
50...5F   GHIJKLMNOPQRSTUV
60...6F   WXYZabcdefghijkl
70...7F   mnopqrstuvwxyz .
80...8F   <(+|&[]!$*);^-/¦
90...9F   ,%_>?`:#@'="≤□±■
A0...AF   °†~–└┌≥⁰¹²³⁴⁵⁶⁷⁸
B0...BC   ⁹┘┐≠—⁽⁾⁺{}\¢·

Practice: Character Set

The previous section described the translation table in theory. This section describes what it contains in the corpus.

Every file in the corpus is encoded in (extended) ASCII, although 31 of them indicate in their splash strings that they were recoded from EBCDIC. This also means that ASCII 0 indicates an unmapped character, that is, one not in the character set represented by the table.

The files are encoded in different ASCII extensions. Some appear to be encoded in windows-1252, others in code page 437, others in unidentified character sets. The particular code page in use does not matter to a reader that uses the table for mapping.

  • There are some invariants across the translation tables for every file in the corpus:

    • All control codes (in the range 0 to 63) are unmapped.

      One consequence is that strings in the corpus can never contain new-lines. New-lines encoded literally would be problematic anyhow because readers must ignore them.

    • Digits 0 to 9 and letters A to Z and a to z are correctly mapped.

    • The space character and the punctuation characters (+&$*);-/,%_?`:@'=\ are correctly mapped.

  • Characters <!^>\"~{} are mapped correctly in almost every file in the corpus, with a few outliers.

  • Characters [] are mostly correct with a few problems.

  • Position 97 is correctly # in most files, and wrongly $ in some.

  • The characters at positions 83 | and 8F ¦ have lots of issues, stemming from the history described on Wikipedia. In particular, EBCDIC and Unicode have separate characters for | and ¦, but ASCII does not.

    Most of the corpus leaves 83 | unmapped. Most of the rest map it correctly to |. The remainder map it to !.

    Most of the corpus maps 8F ¦ to |. Only a few map it correctly to ¦ in windows-1252 or (creatively) to in code page 437.

  • Characters at the following positions are almost always wrong. The table shows:

    • "Character", the character and its position in the portable character set.

    • "Unmapped", the number of files in the corpus that leave the character unmapped (that is, set to 0).

    • "windows-1252", the number of files that map the character correctly in windows-1252. If there is more than one plausible mapping, or if the mapping doesn't exactly match the preferred Unicode, the entry shows the mapped character.

    • "cp437", the number of files that map the character correctly in code page 437.

      In a few cases, a plausible mapping in the "windows-1252" column is an ASCII character. Those aren't separately counted in the "cp437" column, even though ASCII maps the same way in both encodings.

    • "Wrong", the number of files that map the character to nothing that makes sense in a known encoding.

    Character     Unmapped  windows-1252        cp437    Wrong
    9C ≤          1366      0                   10       28
    A6 ≥          1373      0                   10       21
    9F ■          1373      0                   10       21
    9E ±          1353      15                  15       23
    A3 (en dash)  1302      as -: 65            as : 5   32
    B4 (em dash)  1308      as -: 65            as : 10  21
    A4 └          1367      0                   15       22
    A5 ┌          1367      0                   15       22
    B1 ┘          1367      0                   15       22
    B2 ┐          1367      0                   15       22
    A8 ¹          1286      as ¹: 15; as 1: 65  0        38
    A9 ²          1286      as ²: 15; as 2: 65  15       23
    AA ³          1286      as ³: 15; as 3: 65  0        38
    AB ⁴          1308      as 4: 65            0        31
    ...           ...       ...                 ...      ...
    B0 ⁹          1308      as 9: 65            0        31
    B3 ≠          1373      0                   as : 10  21
    B6 ⁾          1308      0                   0        96
    B7 ⁺          1373      0                   0        31
    BB ¢          1351      16                  10       27
    BC ·          1357      as ·: 16; as ×: 1   as : 10  20
    A0 °          1382      as °: 15; as º: 1   5        6

  • Characters at the following positions are always unmapped or wrong:

    Character     Unmapped  windows-1252        cp437    Wrong
    9D □          1373      0                   as : 10  21
    A1 †          1364      0                   as : 10  30
    A7 ⁰          1373      as Ø: 1             0        30
    B7 ⁺          1373      0                   0        31

  • Sometimes the reserved characters are mapped (not in any obviously useful way).

Practice: Characters in Use

The previous section reported on the character sets defined in the translation table in the corpus. This section reports on the characters actually found in the corpus.

In practice, characters in the corpus are in ISO-8859-1, with very few exceptions. The exceptions are a handful of files that either use reserved characters from the portable character set, for unclear reasons, or declare surprising encodings for bytes in the normal ASCII range. These exceptions might be file corruption; they do not appear to be useful.

As a result, a portable file reader could reasonably ignore the translation table and simply interpret all portable files as ISO-8859-1 or windows-1252.

There is no visible distinction in practice between portable files in "communication" versus "tape" format. Neither kind contains control characters.

Files in the corpus have a mix of CRLF and LF-only line ends.

Tag String

The translation table is followed by an 8-byte tag string that consists of the exact characters SPSSPORT in the portable file's character set. This can be used to verify that the file is indeed a portable file.

Since every file in the corpus is encoded in (extended) ASCII, this string always appears in ASCII too.
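The fixed-size header can be split and the tag string verified roughly as follows. This sketch assumes new-lines have already been removed, and the names (`split_header`, `POS_A`) are hypothetical. Rather than assuming ASCII, it looks up the expected bytes in the translation table, relying on uppercase letters occupying consecutive positions starting at 4A.

```python
SPLASH_LEN = 200
TABLE_LEN = 256
TAG = "SPSSPORT"
POS_A = 0x4A  # portable position of uppercase A; B through Z follow in order


def split_header(content: bytes):
    """Split new-line-free file content into (splash, table, rest)."""
    table_end = SPLASH_LEN + TABLE_LEN
    tag_end = table_end + len(TAG)
    splash = content[:SPLASH_LEN]
    table = content[SPLASH_LEN:table_end]
    if len(table) != TABLE_LEN:
        raise ValueError("file is too short to hold a translation table")
    # The tag is written in the file's own character set, so translate it.
    expected = bytes(table[POS_A + ord(c) - ord("A")] for c in TAG)
    if content[table_end:tag_end] != expected:
        raise ValueError("not a portable file: tag string mismatch")
    return splash, table, content[tag_end:]
```

The returned remainder starts at the version and date info record.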

Version and Date Info Record

This record does not have a tag code. It has the following structure:

  • A single character identifying the file format version. It is always A.

  • An 8-character string field giving the file creation date in the format YYYYMMDD.

  • A 6-character string field giving the file creation time in the format HHMMSS.

In the corpus, there is some variation for file creation dates and times by product:

  • STAT/TRANSFER often writes dates that are invalid (e.g. 20040931) or obviously wrong (e.g. 19040823, 19000607).

  • STAT/TRANSFER often writes the time as all spaces.

  • IBM SPSS Statistics 19.0 (and probably other versions) writes HH as H for single-digit hours.

  • SPSS 6.1 for the Power Macintosh writes invalid dates such as 19:11010.
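Given the variations listed above, a reader may prefer to treat the creation date and time as advisory. The sketch below takes the two string contents (already extracted from their string fields) and returns `None` for anything it cannot interpret. The function name `parse_creation` is hypothetical, and reading a 5-digit time as a single-digit hour is an assumption based on the IBM SPSS Statistics behavior noted above.

```python
from datetime import datetime


def parse_creation(date_text: str, time_text: str):
    """Return (date, time); each is None if the field cannot be interpreted."""
    try:
        date = datetime.strptime(date_text, "%Y%m%d").date()
    except ValueError:
        date = None  # e.g. 20040931 or 19:11010

    time_text = time_text.strip()
    if len(time_text) == 5 and time_text.isdigit():
        time_text = "0" + time_text  # assumed HMMSS with a single-digit hour
    try:
        time = datetime.strptime(time_text, "%H%M%S").time()
    except ValueError:
        time = None  # e.g. all spaces
    return date, time
```

Note that a date such as 19040823 parses successfully even though it is obviously wrong; rejecting implausible years is a separate policy decision.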

Identification Records

The product identification record has tag code 1. It consists of a single string field giving the name of the product that wrote the portable file.

The author identification record has tag code 2. It is optional and usually omitted. If present, it consists of a single string field giving the name of the person who caused the portable file to be written.

The corpus contains a few different kinds of authors:

  • Organizational names, such as the names of companies or universities or their departments.

  • Product names, such as SPSS for HP-UX.

  • Internet host names, such as icpsr.umich.edu.

The subproduct identification record has tag code 3. It is optional and usually omitted. If present, it consists of a single string field giving additional information on the product that wrote the portable file.

The corpus contains a few different kinds of subproduct:

  • x86_64-w64-mingw32 or another target triple (written by PSPP).

  • A file name for a .sav file.

  • SPSS/PC+ Studentware+ written by SPSS for MS WINDOWS Release 7.0 in 1996.

  • FILE BUILT VIA IMPORT written by SPSS RELEASE 4.1 FOR VAX/VMS in 1998.

  • SPSS/PC+ written by SPSS for MS WINDOWS Release 7.0 in 1996.

  • Multiple instances of SPSS/PC+ written by SPSS/PC+ on IBM PC, but with several spaces padding out both product and subproduct fields.

  • PFF TEST FILE written by SPSS-X RELEASE 2.1 FOR IBM VM/CMS in 1986.

Variable Count Record

The variable count record has tag code 4. It consists of a single integer field giving the number of variables in the file dictionary.

Precision Record

The precision record has tag code 5. It consists of a single integer field specifying the maximum number of base-30 digits used in data in the file.

Case Weight Variable Record

The case weight variable record is optional. If it is present, it indicates the variable used for weighting cases; if it is absent, cases are unweighted. It has tag code 6. It consists of a single string field that names the weighting variable.

Variable Records

Each variable record represents a single variable. Variable records have tag code 7. They have the following structure:

  • Width (integer). This is 0 for a numeric variable. For portability to old versions of SPSS, it should be between 1 and 255 for a string variable.

    Portable files in the corpus contain strings as wide as 32000 bytes. None of these was written by SPSS itself, but by a variety of third-party products: STAT/TRANSFER, inquery export tool (c) inworks GmbH, QDATA Data Entry System for the IBM PC. The creation dates in the files range from 2016 to 2024.

  • Name (string). 1-8 characters long. Must be in all capitals.

    A few portable files that contain duplicate variable names have been spotted in the wild. PSPP handles these by renaming the duplicates with numeric extensions: VAR001, VAR002, and so on.

  • Print format. This is a set of three integer fields:

    • Format type encoded the same as in system files.

    • Format width. 1-40.

    • Number of decimal places. 1-40.

    A few portable files with invalid format types or formats that are not of the appropriate width or decimals for their variables have been spotted in the wild. PSPP assigns a default F or A format to a variable with an invalid format.

  • Write format. Same structure as the print format described above.

Each variable record can optionally be followed by a missing value record, which has tag code 8. A missing value record has one field, the missing value itself (a floating-point or string, as appropriate). Up to three of these missing value records can be used.

There are also records for missing value ranges:

  • Tag code B for X THRU Y ranges. It is followed by two floating-point values representing X and Y.

  • Tag code 9 for LO THRU Y ranges, followed by a floating-point number representing Y.

  • Tag code A for X THRU HI ranges, followed by a floating-point number representing X.

If a missing value range is present, it may be followed by a single missing value record.

In addition, each variable record can optionally be followed by a variable label record, which has tag code C. A variable label record has one field, the variable label itself (string).

Value Label Records

Value label records have tag code D. They have the following format:

  • Variable count (integer).

  • List of variables (strings). The variable count specifies the number in the list. Variables are specified by their names. All variables must be of the same type (numeric or string), but string variables do not necessarily have the same width.

  • Label count (integer).

  • List of (value, label) tuples. The label count specifies the number of tuples. Each tuple consists of a value, which is numeric or string as appropriate to the variables, followed by a label (string).

The corpus contains a few portable files that specify duplicate value labels, that is, two different labels for a single value of a single variable. PSPP uses the last value label specified in these cases.

Document Record

One document record may optionally follow the value label record. The document record consists of tag code E, followed by the number of document lines as an integer, followed by that number of strings, each of which represents one document line. Document lines must be 80 bytes long or shorter.

Portable File Data

The data record has tag code F. There is only one tag for all the data; thus, all the data must follow the dictionary. The data is terminated by the end-of-file marker Z, which is not valid as the beginning of a data element.

Data elements are output in the same order as the variable records describing them. String variables are output as string fields, and numeric variables are output as floating-point fields.
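The layout of the data can be sketched as a loop over cases. In this illustration, `read_cases` is a hypothetical name, `text` is the decoded content immediately after tag code F, `widths` holds each variable's width from its variable record (0 for numeric), and `parse_number` and `parse_string` are caller-supplied field parsers that take `(text, pos)` and return `(value, next_pos)`.

```python
def read_cases(text, widths, parse_number, parse_string):
    """Read cases until the end-of-file marker Z; return a list of rows."""
    pos = 0
    cases = []
    # Z cannot begin a data element, so it reliably marks the end of the data.
    while pos < len(text) and text[pos] != "Z":
        case = []
        for width in widths:
            parser = parse_number if width == 0 else parse_string
            value, pos = parser(text, pos)
            case.append(value)
        cases.append(case)
    return cases
```

Passing the parsers in as arguments keeps the sketch independent of any particular field-parsing code; a real reader would also have to decide how to report a file that ends in the middle of a case.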


  1. The strings are supposed to be in EBCDIC, 7-bit ASCII, CDC 6-bit ASCII, 6-bit ASCII, and Honeywell 6-bit ASCII. (It is somewhat astonishing that anyone considered the possibility of 6-bit "ASCII", or that there were at least three incompatible versions of it.)

  2. Character 0, not NUL or byte zero.

  3. From the EBCDIC translation table in pff.tar.Z. The ASCII translation table leaves all of them undefined. Code points are only listed for common control characters with some modern relevance.

  4. Reserved.

  5. The document describes 83 as "a solid vertical pipe" and 8F as "a broken vertical pipe". Even though the ASCII translation table in pff.tar.Z leaves position 83 undefined and translates 8F to U+007C VERTICAL LINE, using U+007C VERTICAL LINE and U+00A6 BROKEN BAR, respectively, seems more accurate in a Unicode environment.

  6. Unicode inferred from document description as "empty box".

  7. Unicode inferred from document description as "filled box".

  8. These characters are as described in the document. Some of these don't appear in any known EBCDIC code page, but the EBCDIC translations given in pff.tar.Z match the graphics shown in RFC 183 with those hex codes.

  9. Described in document as "horizontal dagger", which doesn't appear in Unicode or any known code page. This interpretation from RFC 183 seems more likely.

  10. Unicode inferred from document description as "centered dot, or bullet".