Population modeling data set description

Data set structure

The data set contains, for each subject, the measurements, dose regimen, covariates, etc., i.e. all the information collected during the trial. This information is organized by line (each line contains one piece of information), and each column must be associated with a column-type for the software to read the data set (there are fifteen different column-types, which are described in the other articles). The structure is very similar to, and compatible with, the one used by the Nonmem software (the differences are listed here). One of the first things the software does is determine the type of each line. A line can be:

  • dose-line: a line that contains information about the dose regimen,
  • response-line: a line that contains a measurement,
  • regression-line: a line that contains regression value(s) (since it is possible to have several regression variables),
  • covariate-line: a line that contains covariate value(s) (since it is possible to have several covariates),
  • comment-line: any line containing character ‘#’,
  • header or title-line: it is the first line of the data set which can be used to define column-names.

Combinations are possible: a line can be both a dose-line and a regression-line (in other words, a dose regimen and regression values can be defined on the same line). However, a line cannot be both a dose-line and a response-line, so two lines are necessary to define a dose regimen and a measurement at the same time stamp.
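The combination rules above can be illustrated with a small sketch. This is not the software's actual parser: the function name classify, the comma separator, the use of "." for an empty cell, and the idea that a line's type is decided only by which AMT, Y, X, COV or CAT cells are filled are all simplifying assumptions made for illustration.

```python
MISSING = "."  # assumption: "." marks an empty cell


def classify(raw_line, column_types, is_first_line=False, sep=","):
    """Return the set of line types for one line (hypothetical helper)."""
    if is_first_line:
        return {"header"}
    if "#" in raw_line:
        return {"comment"}

    cells = raw_line.strip().split(sep)
    # Column-types whose cell on this line holds an actual value.
    filled = {
        ctype
        for ctype, cell in zip(column_types, cells)
        if cell.strip() not in ("", MISSING)
    }

    kinds = set()
    if "AMT" in filled:
        kinds.add("dose")
    if "Y" in filled:
        kinds.add("response")
    if "X" in filled:
        kinds.add("regression")
    if filled & {"COV", "CAT"}:
        kinds.add("covariate")

    # A line cannot be both a dose-line and a response-line.
    if {"dose", "response"} <= kinds:
        raise ValueError("dose and response must be on separate lines")
    return kinds


types = ["ID", "TIME", "AMT", "Y", "X"]
print(sorted(classify("1,0,100,.,2.5", types)))  # dose and regression together
print(sorted(classify("1,0,.,0.8,2.5", types)))  # response and regression together
```

A dose and a measurement recorded at the same time stamp would therefore be written as two consecutive lines with the same TIME value, one with AMT filled and one with Y filled.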

Description of column-types

The title-line is the first line of the data set. Its content is free and can be used to specify column-names. It is important to understand the difference between column-names and column-types: as already stated, column-names are entirely free, whereas column-types must belong to a list of pre-defined keywords that identify each column’s role. For instance, in the previous example, the fourth column of the sample data set contains measurement information and therefore has column-type Y. A name (CONC) has been entered to indicate that the measurement corresponds to a concentration. The column-types can be grouped by functionality:

  • Subject identification headers: column-types ID and OCC are used to identify subjects.
  • Time headers: column-types TIME and DATE (or DAT1, DAT2, DAT3) are used to time stamp data.
  • Dose headers: column-types AMT (alias: DOSE, for dose amounts), ADM (administration type identifiers), RATE (administration rates), TINF (administration durations), SS (mark doses as steady-state), II (alias: TAU, for inter-dose intervals) and ADDL (additional doses) are used to define doses.
  • Response headers: column-types Y (alias: DV and CONC, for response values), YTYPE (response type identifiers), CENS (mark responses as censored), LIMIT (response limit) are used to define responses.
  • Covariate headers: column-types COV (continuous covariate) and CAT (categorical covariate) are used to define continuous and categorical covariates.
  • Regression headers: column-type X (alias: REG, XX) is used to define regression variables.
  • Control and event headers: column-types MDV (to control data by tagging lines as dose-lines, response-lines or regression-lines) and EVID (to mark unusual events).
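As a rough illustration of the distinction between free column-names and pre-defined column-types, the grouping above can be written down as a lookup table. The sketch below only restates the keywords and aliases listed above; the names GROUPS, ALIASES and group_of, as well as the example column-name WEIGHT, are hypothetical and not part of the software.

```python
# Column-types grouped by functionality, as listed above.
GROUPS = {
    "subject identification": {"ID", "OCC"},
    "time": {"TIME", "DATE", "DAT1", "DAT2", "DAT3"},
    "dose": {"AMT", "ADM", "RATE", "TINF", "SS", "II", "ADDL"},
    "response": {"Y", "YTYPE", "CENS", "LIMIT"},
    "covariate": {"COV", "CAT"},
    "regression": {"X"},
    "control and event": {"MDV", "EVID"},
}

# Aliases mentioned above, mapped to the column-type they stand for.
ALIASES = {
    "DOSE": "AMT",
    "TAU": "II",
    "DV": "Y",
    "CONC": "Y",
    "REG": "X",
    "XX": "X",
}


def group_of(column_type):
    """Return the functional group of a column-type keyword or alias."""
    canonical = ALIASES.get(column_type, column_type)
    for group, members in GROUPS.items():
        if canonical in members:
            return group
    raise ValueError(f"unknown column-type: {column_type!r}")


# Hypothetical data set: free column-names mapped to their column-types.
columns = {"ID": "ID", "TIME": "TIME", "AMT": "AMT", "CONC": "Y", "WEIGHT": "COV"}
for name, ctype in columns.items():
    print(f"{name:8s} -> {ctype:5s} ({group_of(ctype)})")
```

Here the column named CONC has column-type Y, while WEIGHT is an arbitrary name attached to column-type COV.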

Character definition for data set elements naming

Only alphanumeric characters and the underscore “_” character are allowed in the strings of your data set (headers, categorical covariate names, etc.). Special characters such as spaces “ ”, stars “*”, parentheses “(”, brackets “[”, dashes “-”, dots “.”, quotes “"” and slashes “/” are not supported.

These character restrictions apply to:

  • The headers.
  • The strings that can be used in ID, YTYPE, and CAT columns.

If your data set includes unsupported characters, loading your data may work the first time, but you will encounter a parsing error when loading a saved project.
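Because the problem may only surface when a saved project is reloaded, it can be worth checking the strings before loading the data set. A minimal sketch, assuming that "alphanumeric" means the ASCII letters A-Z, a-z and the digits 0-9; the function name unsupported_strings and the sample headers are hypothetical:

```python
import re

# Letters, digits and underscore only (ASCII is an assumption here).
ALLOWED = re.compile(r"[A-Za-z0-9_]+")


def unsupported_strings(values):
    """Return the values that contain at least one unsupported character."""
    return [v for v in values if not ALLOWED.fullmatch(v)]


headers = ["ID", "TIME", "CONC", "Body Weight", "dose(mg)", "SEX_M"]
print(unsupported_strings(headers))
# prints ['Body Weight', 'dose(mg)']
```

The same check can be applied to the values of the ID, YTYPE and CAT columns.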