SPV Light Member Formats (PSPP)

D.2.8 Formats

Formats =>
    int32[n-widths] int32*[n-widths]
    string[locale]
    int32[current-layer]
    bool[x7] bool[x8] bool[x9]
    Y0
    CustomCurrency
    count(
      v1(X0?)
      v3(count(X1 count(X2)) count(X3)))
Y0 => int32[epoch] byte[decimal] byte[grouping]
CustomCurrency => int32[n-ccs] string*[n-ccs]

If n-widths is nonzero, then the accompanying integers are column widths as manually adjusted by the user.

locale is a locale including an encoding, such as en_US.windows-1252 or it_IT.windows-1252. (locale is often duplicated in Y1, described below.)

epoch is the year that starts the epoch. A 2-digit year is interpreted as belonging to the 100 years beginning at the epoch. The default epoch year is 69 years prior to the current year; thus, in 2017 this field by default contains 1948. In the corpus, epoch ranges from 1943 to 1948, and some members contain -1.
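The 100-year window rule can be illustrated with a short Python sketch. The function names are hypothetical, and because the meaning of an epoch of -1 is not stated here, the sketch simply rejects negative epochs.

```python
def default_epoch(current_year: int) -> int:
    """Default epoch: 69 years prior to the current year."""
    return current_year - 69


def expand_two_digit_year(yy: int, epoch: int) -> int:
    """Map a 2-digit year into the 100 years beginning at epoch."""
    if not 0 <= yy <= 99:
        raise ValueError("expected a 2-digit year")
    if epoch < 0:
        # Assumption: -1 has some special meaning that is not documented here.
        raise ValueError("negative epoch is not handled by this sketch")
    year = epoch - epoch % 100 + yy   # same century as the epoch
    if year < epoch:                  # fell before the window: use the next century
        year += 100
    return year
```

With an epoch of 1948, the 2-digit year 48 maps to 1948 while 47 maps to 2047, since 1947 would fall before the start of the window.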

decimal is the decimal point character. The observed values are ‘.’ and ‘,’.

grouping is the grouping character. Usually, it is ‘,’ if decimal is ‘.’, and vice versa. Other observed values are ‘'’ (apostrophe), ‘ ’ (space), and zero (presumably indicating that digits should not be grouped).
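A reader for Y0 might look like the following Python sketch. It assumes that integers are little-endian and that int32 is signed (which would be consistent with the observed epoch of -1); neither is stated in this section. The Y0 class and parse_y0 function are hypothetical names.

```python
import struct
from typing import NamedTuple, Tuple


class Y0(NamedTuple):
    epoch: int
    decimal: str
    grouping: str  # "" when the grouping byte is zero


def parse_y0(buf: bytes, offset: int = 0) -> Tuple[Y0, int]:
    """Parse int32[epoch] byte[decimal] byte[grouping]; return (Y0, next offset)."""
    # "<iBB": little-endian signed int32 followed by two unsigned bytes (6 bytes).
    epoch, decimal, grouping = struct.unpack_from("<iBB", buf, offset)
    return Y0(epoch, chr(decimal), chr(grouping) if grouping else ""), offset + 6
```

A zero grouping byte is returned as the empty string, following the presumption above that zero means digits are not grouped.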

n-ccs is observed as either 0 or 5. When it is 5, the following strings are CCA through CCE format strings. See Custom Currency Formats in PSPP. Most commonly these are all ‘-,,,’ but other strings occur.
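CustomCurrency can be read with a loop such as the one sketched below in Python. The sketch assumes that a string is a little-endian int32 byte count followed by that many bytes; that layout is defined outside this section. It returns raw bytes because the character encoding is a separate question. The function names are hypothetical.

```python
import struct
from typing import List, Tuple


def read_string(buf: bytes, offset: int) -> Tuple[bytes, int]:
    """Read one length-prefixed string (assumed layout: int32 length, then bytes)."""
    (n,) = struct.unpack_from("<i", buf, offset)
    start, end = offset + 4, offset + 4 + n
    if n < 0 or end > len(buf):
        raise ValueError("string length out of range")
    return buf[start:end], end


def parse_custom_currency(buf: bytes, offset: int = 0) -> Tuple[List[bytes], int]:
    """Parse int32[n-ccs] string*[n-ccs]; return (strings, next offset)."""
    (n_ccs,) = struct.unpack_from("<i", buf, offset)
    if n_ccs < 0:
        raise ValueError("negative n-ccs")
    offset += 4
    strings = []
    for _ in range(n_ccs):  # observed as 0 or 5 (CCA through CCE)
        s, offset = read_string(buf, offset)
        strings.append(s)
    return strings, offset
```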

A writer may safely use false for x7, x8, and x9.

X0

X0 only appears, optionally, in version 1 members.

X0 => byte*14 Y1 Y2
Y1 =>
    string[command] string[command-local]
    string[language] string[charset] string[locale]
    bool bool bool bool
    Y0
Y2 => CustomCurrency byte[missing] bool[x17]

command describes the statistical procedure that generated the output, in English. It is not necessarily the literal syntax name of the procedure: for example, NPAR TESTS becomes “Nonparametric Tests.” command-local is the procedure’s name, translated into the output language; it is often empty and, when it is not, sometimes the same as command.

missing is the character used to indicate that a cell contains a missing value. It is always observed as ‘.’.

A writer may safely use false for x17.

X1

X1 only appears in version 3 members.

X1 =>
    bool[x14]
    byte[show-title]
    bool[x16]
    byte[lang]
    byte[show-variables]
    byte[show-values]
    int32[x18] int32[x19]
    00*17
    bool[x20]
    bool[show-caption]

lang may indicate the language in use. Some values seem to be 0: en, 1: de, 2: es, 3: it, 5: ko, 6: pl, 8: zh-tw, 10: pt_BR, 11: fr.

show-variables determines how variables are displayed by default. A value of 1 means to display variable names, 2 to display variable labels when available, 3 to display both (name followed by label, separated by a space). The most common value is 0, which probably means to use a global default.

show-values is a similar setting for values. A value of 1 means to display the value, 2 to display the value label when available, 3 to display both. Again, the most common value is 0, which probably means to use a global default.
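As an illustration of how a reader might apply show-variables, here is a Python sketch. The global_default parameter is a hypothetical stand-in for whatever setting a value of 0 defers to, and falling back to the bare name when no label is available in mode 3 is an assumption. The same logic presumably applies to show-values, although the separator for that case is not stated.

```python
def display_variable(name: str, label: str, show: int, global_default: int = 2) -> str:
    """Render a variable according to a show-variables value."""
    # 0 probably means "use a global default"; global_default is hypothetical.
    mode = show if show != 0 else global_default
    if mode not in (1, 2, 3):
        raise ValueError(f"unexpected show-variables value {show}")
    if mode == 1 or not label:
        return name                  # names only, or no label available
    if mode == 2:
        return label                 # label when available
    return f"{name} {label}"         # both, separated by a space
```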

show-title is 1 to show the title, 10 to hide it.

show-caption is true to show the caption, false to hide it.

A writer may safely use false for x14, false for x16, 0 for lang, -1 for x18 and x19, and false for x20.
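A writer following this advice could emit X1 as in the Python sketch below. It assumes little-endian integers and that a bool is a single byte holding 0 or 1; both are defined outside this section. build_x1 and its parameters are hypothetical names.

```python
import struct


def build_x1(show_title: bool = True, show_caption: bool = True,
             show_variables: int = 0, show_values: int = 0) -> bytes:
    """Serialize X1 using the safe defaults for the unknown fields."""
    return struct.pack(
        "<?B?BBBii17x??",             # 17x emits the seventeen zero bytes
        False,                        # x14
        1 if show_title else 10,      # show-title: 1 shows, 10 hides
        False,                        # x16
        0,                            # lang
        show_variables,
        show_values,
        -1, -1,                       # x18, x19
        False,                        # x20
        show_caption,
    )
```

The result is always 33 bytes long: six single-byte fields, two int32 values, 17 zero bytes, and two trailing bools.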

X2

X2 only appears in version 3 members.

X2 =>
    int32[n-row-heights] int32*[n-row-heights]
    int32[n-style-map] StyleMap*[n-style-map]
    int32[n-styles] StylePair*[n-styles]
    count((i0 i0)?)
StyleMap => int64[cell-index] int16[style-index]

If present, n-row-heights and the accompanying integers are row heights as manually adjusted by the user.

The rest of X2 specifies styles for data cells. At first glance this is odd, because each data cell can have its own style embedded as part of the data, but in practice X2 specifies a style for a cell only if that cell is empty (and thus does not appear in the data at all). Each StyleMap specifies the index of a blank cell, calculated the same way as in Cells (see Cells), along with a 0-based index into the accompanying StylePair array.
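The StyleMap array lends itself to a lookup table from blank-cell index to style index. The Python sketch below reads int32[n-style-map] followed by the StyleMap entries, assuming little-endian signed integers; parse_style_maps is a hypothetical name, and parsing StylePair is out of scope here.

```python
import struct
from typing import Dict, Tuple


def parse_style_maps(buf: bytes, offset: int = 0) -> Tuple[Dict[int, int], int]:
    """Parse int32[n-style-map] StyleMap*[n-style-map]; return (map, next offset)."""
    (n,) = struct.unpack_from("<i", buf, offset)
    if n < 0:
        raise ValueError("negative n-style-map")
    offset += 4
    style_for_cell = {}
    for _ in range(n):
        # StyleMap => int64[cell-index] int16[style-index], 10 bytes per entry.
        cell_index, style_index = struct.unpack_from("<qh", buf, offset)
        style_for_cell[cell_index] = style_index  # 0-based index into StylePair array
        offset += 10
    return style_for_cell, offset
```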

A writer may safely omit the optional i0 i0 inside the count(…).

X3

X3 only appears in version 3 members.

X3 =>
    01 00 byte[x21] 00 00 00
    Y1
    double[small] 01
    (string[dataset] string[datafile] i0 int32[date] i0)?
    Y2
    (int32[x22] i0)?

small is a small real number. In the corpus, it overwhelmingly takes the value 0.0001, with zero occasionally seen. Nonzero numbers with format 40 (see Value) whose magnitudes are smaller than small are displayed in scientific notation. (Thus, a small of zero prevents scientific notation from being chosen.)
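Read as a threshold test, the role of small can be sketched in Python as follows. The function name is hypothetical, and the check applies only to values with format 40.

```python
def uses_scientific_notation(value: float, small: float = 0.0001) -> bool:
    """True if a nonzero format-40 value is below the small threshold."""
    return value != 0 and abs(value) < small
```

No magnitude is smaller than zero, so passing small=0.0 always yields False, which matches the parenthetical remark above.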

dataset is the name of the dataset analyzed to produce the output, e.g. DataSet1, and datafile the name of the file it was read from, e.g. C:\Users\foo\bar.sav. The latter is sometimes the empty string.

date is a date, as seconds since the epoch, i.e. since January 1, 1970. Pivot tables within an SPV file often have dates a few minutes apart, so this is probably a creation date for the table rather than for the file.
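Converting date to a calendar timestamp is straightforward. The Python sketch below interprets the value as UTC, which is an assumption: the time zone is not stated here. table_date is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

_UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def table_date(seconds: int) -> datetime:
    """Convert int32[date] (seconds since January 1, 1970) to a datetime."""
    # Assumption: the count is relative to midnight UTC.
    return _UNIX_EPOCH + timedelta(seconds=seconds)
```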

Sometimes dataset, datafile, and date are present and other times they are absent. The reader can distinguish by assuming that they are present and then checking whether the presumptive dataset contains a null byte (a valid string never will).
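The null-byte heuristic might be implemented as in the Python sketch below. It assumes a string is a little-endian int32 length followed by that many bytes, and it adds a bounds check that is not part of the rule described above. dataset_fields_present is a hypothetical name.

```python
import struct


def dataset_fields_present(buf: bytes, offset: int) -> bool:
    """Guess whether string[dataset] starts at offset."""
    if offset + 4 > len(buf):
        return False
    (n,) = struct.unpack_from("<i", buf, offset)
    end = offset + 4 + n
    if n < 0 or end > len(buf):
        return False  # extra sanity check, not part of the stated rule
    # A valid string never contains a null byte.
    return b"\0" not in buf[offset + 4:end]
```

When the optional fields are absent, the bytes at this position belong to Y2 instead. If Y2 begins with an n-ccs of 5, the presumptive string body overlaps the length prefix of the first custom currency string, which normally contains null bytes, so the test fails as intended.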

x22 is usually 0 or 2000000.

A writer may safely use 4 for x21 and omit x22 and the other optional bytes at the end.

Encoding

Formats contains several indications of character encoding:

locale in Formats itself.
locale in Y1 (in version 1, Y1 is optionally nested inside X0; in version 3, Y1 is nested inside X3).
charset in version 3, in Y1.
lang in X1, in version 3.

charset, if present, is a good indication of character encoding, and in its absence the encoding suffix on locale in Formats will work.
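This preference order can be written down as a small Python helper. The function name is hypothetical, and taking everything after the first period in locale as the encoding suffix is an assumption based on the examples en_US.windows-1252 and it_IT.windows-1252.

```python
from typing import Optional


def declared_encoding(formats_locale: str, charset: Optional[str] = None) -> Optional[str]:
    """Pick the declared encoding: charset if present, else the locale suffix."""
    if charset:
        return charset
    # Assumption: the encoding is whatever follows the first "." in locale.
    _, sep, suffix = formats_locale.partition(".")
    return suffix if sep and suffix else None
```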

locale in Y1 can be disregarded: it is normally the same as locale in Formats, and it is only present if charset is also.

lang is not helpful and should be ignored for character encoding purposes.

However, the corpus contains many examples of light members whose strings are encoded in UTF-8 despite declaring some other character set. Furthermore, the corpus contains several examples of light members in which some strings are encoded in UTF-8 (and contain multibyte characters) and other strings are encoded in another character set (and contain non-ASCII characters). PSPP treats any valid UTF-8 string as UTF-8 and only falls back to the declared encoding for strings that are not valid UTF-8.
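The per-string fallback just described can be sketched in Python as follows. This illustrates the strategy and is not PSPP's actual code; the function name is hypothetical, and substituting replacement characters for bytes the declared encoding cannot map is a choice made for this sketch.

```python
def decode_light_string(raw: bytes, declared: str) -> str:
    """Decode as UTF-8 when valid, otherwise fall back to the declared encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" is this sketch's choice for unmappable bytes.
        return raw.decode(declared, errors="replace")
```

For example, with a declared encoding of windows-1252, the two bytes C3 A9 decode as UTF-8 to "é", while the single byte E9 is not valid UTF-8 and is decoded as windows-1252, also yielding "é".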

The pspp-output program’s strings command can help analyze the encoding in an SPV light member. Use pspp-output --help-dev to see its usage.