
Version 2016R1

This documentation describes the data set format for the MonolixSuite 2016R1. It covers the typical data set used for population modeling applications, in particular with Datxplore and Monolix. It is also valid for previous Monolix versions.

Data set for population modeling

The data set is a key element for parameter estimation: it summarizes the experimental data in a file. The purpose of these pages is to present the general structure of a data set, detail each column type, and provide examples of real data sets (continuous, discrete, time-to-event, censored, several outputs, …).

The data sets considered here are dedicated to population modeling applications. A data set contains, for each subject, measurements, dose regimen, covariates, etc., i.e. all the information collected during the trial. This information is organized by line (i.e. each line contains a piece of information) and each column shall be associated to a column type (there are fifteen different column types, which are described in the other articles) for the software to read the data set. The format should be .txt or .csv and a header line is needed. It is very similar to, and compatible with, the structure used by the Nonmem software.

Columns of the data file can contain (in any order)

• The ID of the subjects (can be any string or number, not necessarily ordered) and the occasions of this ID.
• The observations of the individual with this ID. Notice that these observations can be continuous measurements, counts, or events.
• The time of the observations and of the administrations.
• The covariates (continuous or categorical).
• Additional information (censoring, rate, …).

Thus, a data set contains at least IDs, time and some observations.

Data set structure

The data set contains, for each subject, measurements, dose regimen, covariates, etc., i.e. all the information collected during the trial. This information is organized by line (i.e. each line contains a piece of information) and each column shall be associated to a column-type (there are fifteen different column-types, which are described in the other articles) for the software to read the data set. It is very similar to, and compatible with, the structure used by the Nonmem software (the differences are listed here). One of the first things the software does is to define the line type. A line can be:

• dose-line: a line that contains information about the dose’s regimen,
• response-line: a line that contains a measure,
• regression-line: a line that contains regression value(s) (since it is possible to have several regression variables),
• covariate-line: a line that contains covariate value(s) (since it is possible to have several covariates),
• comment-line: any line containing character ‘#’,
• header or title-line: it is the first line of the data set which can be used to define column-names.

Combinations are possible and a line can be both a dose-line and a regression-line (in other words it is possible to define in a same line a dose regimen and the regression values). However, a line cannot be both a dose-line and a response-line. In other words, two lines will be necessary to define a dose-regimen and a measure at the same time-stamp.

Description of column-types

The title-line is the first line of the data set. It is free and can be used to specify column-names. It is important to understand the difference between column-names and column-types: as already stated, the column-names are totally free, but the column-types shall belong to a list of pre-defined keywords. They are used to identify the column’s role. For instance, in the previous example, the fourth column of the sample data set contains measurement information and will then have column-type Y. A name (CONC) has been entered to indicate that the measurement corresponds to a concentration. It is possible to group the column-types based on their functionality:

• Subject identification headers: column-types ID and OCC are used to identify subjects.
• Time headers: column-types TIME and DATE (or DAT1, DAT2, DAT3) are used to time stamp data.
• Response headers: column-types Y (alias: DV and CONC, for response values), YTYPE (response type identifiers), CENS (mark responses as censored) and LIMIT (response limit) are used to define responses.
• Covariate headers: column-types COV (continuous covariate) and CAT (categorical covariate) are used to define continuous and categorical covariates.
• Regression headers: column-type X (alias: REG, XX) is used to define regression variables.
• Control and event headers: column-types MDV (to control data by tagging lines as dose-lines, response-lines or regression-lines) and EVID (to mark unusual events).

Character definition for data set elements naming

Only alphanumeric characters and the underscore “_” character are allowed in the strings of your data set (headers, categorical covariate names, etc). Special characters such as spaces ” “, stars “*”, parentheses “(“, brackets “[“, dashes “-“, dots “.”, quotes ” and slashes “/” are not supported.

These character restrictions apply to:

• The strings that can be used in ID, YTYPE, and CAT columns.

ID: subject identifier

The column is used to identify the different subjects and its content is totally free: numbers / strings… This column is mandatory. Notice that string ‘.’ will not be interpreted as a repetition of the previous line. As a consequence a data set of the form

ID * *
John * *
John * *
Mike * *
. * *


contains 3 different subjects: ‘John’, ‘Mike’ and ‘.’. It does not generate another occasion for Mike. Even if numbers and strings are allowed, we encourage the user to define IDs using integers for readability and ease of use.
Contrary to NONMEM, the lines corresponding to the same subject do not need to be next to each other. Thus, the following file contains 2 subjects with IDs “1” and “2”.

ID * *
1 * *
1 * *
2 * *
2 * *
1 * *


The IDs are not sorted in lexicographical order but by order of appearance in the data set.
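As a minimal illustration of "order of appearance", the sketch below collects distinct IDs without sorting them. The helper name `subjects_in_order` is hypothetical and is not part of the MonolixSuite; it only mimics the behavior described above.

```python
def subjects_in_order(ids):
    """Return the distinct IDs in order of first appearance (no sorting)."""
    # A dict keeps insertion order, so duplicates collapse onto the
    # position where each ID was first seen.
    return list(dict.fromkeys(ids))


# Lines of one subject need not be contiguous: still 2 subjects.
print(subjects_in_order(["1", "1", "2", "2", "1"]))
# "." is an ordinary ID, not a repetition of the previous line.
print(subjects_in_order(["John", "John", "Mike", "."]))
```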

Format restrictions (an exception will be thrown otherwise):

• A data set shall contain one and only one column ID.

OCC: occasion identifiers

It is possible to have, in a data set, one or several columns with the column-type OCC. An occasion corresponds to the same subject (the ID should remain the same) under different circumstances. For example, if the same subject has two successive different treatments, it should be considered as the same subject with two occasions. The OCC columns can contain only integers. For instance:

ID * OCC
John * 1
John * 2
John * 3

How can occasions appear when no OCC column is defined?

Occasions can be generated even if no OCC column is defined in the data set. In that case, a button appears in the Monolix interface, making it possible to add inter-occasion variability to the model. This can happen in two cases.

• Firstly, if there is an EVID column with a value 4, then Monolix defines a washout and creates an occasion. Thus, if EVID equals 4 several times for a subject, it will create the same number of occasions. Notice that if EVID equals 4 only once, at the beginning, only one occasion will be defined and no inter-occasion variability will be possible. Thus, the following data sets are equivalent.
ID TIME Y OCC
1 0 0 1
1 1 2 1
1 2 2 1
1 0 0 2
1 4 1 2
1 5 2 2

ID TIME Y EVID
1 0 0 0
1 1 2 0
1 2 2 0
1 0 0 4
1 4 1 0
1 5 2 0

• Secondly, if there is an SS column, each steady state creates an occasion. Thus, if two steady states are defined for one subject, two occasions are generated. This option will be obsolete and will not be used in future versions of the MonolixSuite.
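The EVID rule in the first case can be sketched as follows. This is only an illustration under stated assumptions: the function name `occasions_from_evid` is hypothetical, occasions are numbered from 1, and an EVID=4 on the very first line of a subject is assumed not to open an extra occasion (following the remark that a single washout at the beginning defines only one occasion).

```python
def occasions_from_evid(ids, evids):
    """Derive an occasion index per line from EVID=4 washouts (illustrative)."""
    current = {}      # subject ID -> occasion index currently open
    occasions = []
    for subject, evid in zip(ids, evids):
        if subject not in current:
            # First line of a subject: occasion 1, even if EVID is 4.
            current[subject] = 1
        elif evid == 4:
            # A later washout resets the system and opens a new occasion.
            current[subject] += 1
        occasions.append(current[subject])
    return occasions


# EVID column of the second data set above; the result is its OCC column
# in the first data set.
print(occasions_from_evid([1] * 6, [0, 0, 0, 4, 0, 0]))  # [1, 1, 1, 2, 2, 2]
```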
What kind of occasions can be defined?

There are three kinds of occasions

• Cross-over study: In that case, data are collected for each patient during two independent treatment periods, and there is an overlap in the time definition of the periods. A column OCC can be used to identify the period. See here for an example.
• Occasions with washout: In that case, data are collected for each patient during one period and there is no overlap between the periods. The time is increasing but the dynamical system (i.e. the compartments) is reset when the second period starts. In particular, EVID=4 indicates that the system is reset (washout), for example when a new dose is administered. See here for an example.
• Occasions without washout: In that case, data are collected for each patient during one period and there is no overlap between the periods. The time is increasing and we want to differentiate periods in terms of occasions without any reset of the dynamical system. For example, in the example defined here, multiple doses are administered to each patient. Each period of time between successive doses is defined as a statistical occasion. A column OCC is therefore necessary in the data file to define it.
Frequently asked questions on occasions in the data set
• Do all the individuals need to share the same sequence of occasions? No, the number of occasions and the times defining the occasions can differ from one individual to another.
• Is there any limit in terms of number of occasions? No.
• Is it possible to have several levels of occasions? Yes, occasions can be extended to several levels, see an example here.

Format restrictions (an exception will be thrown otherwise):

• The OCC columns should contain only integers.

TIME: data time stamp

Time can be defined as:

• A double.
• Using hours, minutes format: hh:mm.

Notice that when a subject has times in the hh:mm format, all the times are converted into relative hours, as in the following example:

TIME Reconstructed time
10:00 10
10:30 10.5
14:00 14
08:59 8.983333
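The conversion shown in the table can be reproduced with a short sketch. The function name `hhmm_to_hours` is hypothetical; the code assumes a well-formed hh:mm string and simply returns hours plus minutes divided by 60, which matches the values listed above.

```python
def hhmm_to_hours(value):
    """Convert an 'hh:mm' string into a number of hours (illustrative)."""
    hours, minutes = value.split(":")
    return int(hours) + int(minutes) / 60.0


for stamp in ["10:00", "10:30", "14:00", "08:59"]:
    print(stamp, hhmm_to_hours(stamp))
```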

When there is no column-type TIME, the column-type DATE is used to time-stamp data.

Format restrictions (an exception will be thrown otherwise):

• A data set shall not contain more than one column with the column-type TIME.
• String “.” will not be interpreted as a repetition of the previous line and is then non-compliant with formats listed here-above.

DATE/DAT1/DAT2/DAT3: date information

The DATE, DAT1, DAT2 and DAT3 column-types differ only in the date format they expect, as summarized in the following table.

Format and associated date column name (day, month and year)
DATE: mm/dd/yy or mm/dd/yyyy
DAT1: dd/mm/yy or dd/mm/yyyy
DAT2: yy/mm/dd or yyyy/mm/dd
DAT3: yy/dd/mm or yyyy/dd/mm

Several points have to be noticed. First, the day month year separator should be the character “/”. Secondly, by default, when the year is coded with two digits, it is then interpreted as 20xx. For instance, using format DAT2, 41/12/07 is interpreted as December the 7th 2041.
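The four formats and the two-digit-year rule can be illustrated with the sketch below. The names `FIELD_ORDER` and `parse_date` are hypothetical; the code only encodes the field orders and the 20xx convention stated above, and is not the parser used by the software.

```python
from datetime import date

# Field order for each date column-type, as listed above.
FIELD_ORDER = {
    "DATE": ("month", "day", "year"),
    "DAT1": ("day", "month", "year"),
    "DAT2": ("year", "month", "day"),
    "DAT3": ("year", "day", "month"),
}


def parse_date(text, column_type):
    """Parse a '/'-separated date according to its column-type (illustrative)."""
    parts = text.split("/")
    if len(parts) != 3:
        raise ValueError("expected three fields separated by '/'")
    fields = dict(zip(FIELD_ORDER[column_type], parts))
    year = int(fields["year"])
    if len(fields["year"]) == 2:
        year += 2000  # a two-digit year is read as 20xx
    return date(year, int(fields["month"]), int(fields["day"]))


print(parse_date("41/12/07", "DAT2"))  # 2041-12-07
```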

If both a TIME column-type and a DATE column-type are present, the DATE column is considered to represent the day and the TIME column the hour within this day.

Format restrictions (an exception will be thrown otherwise):

• A data set shall not contain more than one column-type DATE / DAT1 / DAT2 / DAT3.
• Year, day, and month shall be integers.
• The separator must be “/”.
• Character “.” will not be interpreted as a repetition of the previous line but will throw an exception as any non-compliance with formats listed here-above.
• All the lines with valid subjects (non-empty ID, OCC) should be filled correctly with the same delimiter, according to the specified date format: i.e., no empty year, no empty month, no empty day, no mix of delimiters.

Timestamp summary

As can be seen, there are several ways to define the time stamp of the data, depending on whether a TIME column is present and whether a DATE column is present.

• TIME column present, DATE column present: the DATE column is considered to represent the day and the TIME column the hour within this day.
• TIME column not present, DATE column present: the DATE column is considered to represent the time.
• TIME column present, DATE column not present: the TIME column is considered to represent the time.
• TIME column not present, DATE column not present: the first regression-column will be used to time-stamp data.

What happens if neither TIME nor DATE is defined?

We strongly encourage the user to be careful with the TIME definition. However, if there is neither a TIME nor a DATE column-type, the first regression-column (i.e. the first column with column-type X) will be used to time-stamp data. Moreover, if there is no TIME, DATE or regression column-type at all, an arbitrary time is computed.

Y: response

The Y column-type can be used for continuous, categorical, count or time-to-event data.
When there is no EVID or MDV column (see hereunder), a line is considered as a response-line if it contains a value and there is no dose-column (i.e. the content of the column with column-type AMT) or if the dose-column contains either the string ‘.’ or a 0. As a consequence, when there are null values in both the dose-column and the response-column, the line is considered as a response-line. The following table sums up the different situations

Notice that when both a non-null amount and a measurement are defined on the same line, the choice was made to favor the measurement. To solve this without any EVID column, the user should provide two distinct lines: a dose-line and a response-line. For instance, in the following data set

TIME ID AMT Y
12.1 John 1.1 12.6


the line is considered as a response-line: a measurement is set at 12.6 at time 12.1 and no dose is added. Of course, it is possible to specify a response and a dose at the same time, but the lines shall be duplicated as in the following data set

TIME ID AMT Y
12.1 Tom . 12.6
12.1 Tom 1.1 .


In that case, the first line is again considered as a response-line, a measurement is set at 12.6 at time 12.1. But the second line is considered as a dose amount at time 12.1 with an amount 1.1.
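The rules above, for data sets without an EVID or MDV column, can be summarized in a small sketch. The function name `classify_line` and the returned labels are hypothetical. The case where both cells contain "." is not covered by the text, so the label "other" used for it here is only a placeholder.

```python
def classify_line(amt, y):
    """Classify a line from its AMT and Y cells, given as strings (illustrative).

    Applies only when the data set has no EVID or MDV column.
    """
    if y != ".":
        # Any measurement wins, even next to a non-null dose amount.
        return "response"
    if amt != ".":
        # No measurement and a numeric amount (0 included): dose-line.
        return "dose"
    # Both cells are ".": not specified in the text above.
    return "other"


print(classify_line("1.1", "12.6"))  # response (the dose is ignored)
print(classify_line(".", "12.6"))    # response
print(classify_line("1.1", "."))     # dose
```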

For continuous data:

For continuous data, the time and value of each observation for each subject is given, as in the following example:

ID TIME AMT Y
1 0 50 .
1 0.5 . 1.1
1 1 . 10.2
1 1.5 . 8.5
1 2 . 6.3
1 2.5 . 5.5

See the theophylline data set, the warfarin data set, and the HIV data set for more practical examples of data sets with continuous outputs.

For categorical data:

In case of categorical data, the observations at each time point can only take values in a fixed and finite set of nominal categories. In the data set, the output categories must be coded as integers, as in the following example:

ID TIME Y
1 0.5 3
1 1 0
1 1.5 2
1 2 2
1 2.5 3

See the warfarin data set for a practical example of a joint continuous and categorical data set.

For count data:

Count data can take only non-negative integer values that come from counting something, e.g., the number of trials required for completing a given task. The task can for instance be repeated several times and the individual’s performance followed. In the following data set:

ID TIME Y
1 0 10
1 24 6
1 48 5
1 72 2

10 trials are necessary the first day (t=0), 6 the second day (t=24), etc.

Count data can also represent the number of events happening in regularly spaced intervals, e.g. the number of seizures every week. If the time intervals are not regular, the data may be considered as repeated time-to-event interval-censored data, or the interval length can be given as a regressor used to define the probability distribution in the model.

For (repeated) time-to-event data:

In this case, the observations are the “times at which events occur”. An event may be one-off (e.g., death) or repeated (e.g., epileptic seizures, mechanical incidents, strikes). In addition, an event can be exactly observed, interval censored or right censored.

For single events exactly observed:

One must indicate the start time of the observation period with Y=0, and the time of the event (Y=1) or the time of the end of the observation period if no event has occurred (Y=0). In the following example:

ID TIME Y
1 0 0
1 34 1
2 0 0
2 80 0

the observation period lasts from the starting time t=0 to the final time t=80. For individual 1, the event is observed at t=34, and for individual 2, no event is observed during the period. Thus the line at the final time (t=80) records that no event occurred. See the PBC data set and the Oropharynx data set for practical examples of time-to-event data sets.

For repeated events exactly observed:

One must indicate the start time of the observation period (Y=0), the end time (Y=0) and the time of each event (Y=1). In the following example:

ID TIME Y
1 0 0
1 34 1
1 76 1
1 80 0
2 0 0
2 80 0

Again, the observation period lasts from the starting time t=0 to the final time t=80. For individual 1, two events are observed at t=34 and t=76, and for individual 2, no event is observed during the period.

For single events interval censored:

When the exact time of the event is not known, but only an interval can be given, the start time of this interval is given with Y=0, and the end time with Y=1. As before, the start time of the observation period must be given with Y=0. In the following example:

ID TIME Y
1 0 0
1 32 0
1 35 1

we only know that the event has happened between t=32 and t=35.

For repeated events interval censored:

In this case, we do not know the exact event times, but only the number of events that occurred for each individual in each interval of time. The column-type Y can now take values greater than 1, if several events occurred during an interval. In the following example:

ID TIME Y
1 0 0
1 32 0
1 35 1
1 50 1
1 56 0
1 78 2
1 80 1

No event occurred between t=0 and t=32, 1 event occurred between t=32 and t=35, 1 between t=35 and t=50, none between t=50 and t=56, 2 between t=56 and t=78 and finally 1 between t=78 and t=80.
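The reading of the table given above can be sketched as follows: each line after the first closes an interval that started at the previous time, and its Y value is the number of events in that interval. The function name `event_intervals` is hypothetical.

```python
def event_intervals(times, counts):
    """Return (start, end, number of events) for each interval (illustrative)."""
    # The first line only marks the start of the observation period.
    return [
        (times[i - 1], times[i], counts[i])
        for i in range(1, len(times))
    ]


times = [0, 32, 35, 50, 56, 78, 80]
counts = [0, 0, 1, 1, 0, 2, 1]
for start, end, n in event_intervals(times, counts):
    print(f"{n} event(s) between t={start} and t={end}")
```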

Format restrictions (an exception will be thrown otherwise):

• A data set shall not contain more than one column with column-type Y.
• Response-column shall contain double value or string “.”.
• If there is a non null double value in dose-column, there must be a non null double value in the response-column.

Warning

• If a subject or a subject/occasion has no observation, a warning message is displayed, listing which subjects or subject/occasions have no measurement.

YTYPE: response type

If observations are recorded on several quantities (several concentrations, effects, etc.), the column-type YTYPE makes it possible to assign names to the observations of the column-type Y, for mapping with the quantities output by the model. Notice that in the case of a dose-line, the value in the YTYPE column will not be read, thus the user can set any value (‘.’, the same as a concentration, …).
Entries in the column-type YTYPE can be strings or integers; however, we strongly recommend using only alphanumeric characters. The underscore “_” character is allowed in the strings of your data set. The mapping of the YTYPE to the model output (in the OUTPUT block of the Mlxtran model file) is done following alphabetical order (and not name matching). In the following data set:

TIME DOSE Y Y_TYPE
0 . 12 conc
5 . 6 conc
10 . 4 effect
15 . 3 effect
20 . 2.1 conc
25 . 2 conc

with the following OUTPUT block in the Mlxtran model file:

OUTPUT:
output = {E, Cc}

the observations tagged with “conc” will be mapped to the first output “E”, and those tagged with “effect” will be mapped to the second output “Cc”, because in alphabetical order “conc” comes before “effect”. To avoid confusion, we recommend using integers in the YTYPE column-type, with “1” corresponding to the first output, “2” to the second, etc. If you have 10 or more types of observations, notice that in alphabetical order “10” comes before “2”.
If you use strings, note that “.” is not considered as a repetition of the previous line but as the name of a response. For instance, the following data set creates three different types of responses: “type1”, “.”, and “type2”:
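The alphabetical mapping can be illustrated with a short sketch. The function name `map_ytype_to_outputs` is hypothetical, and Python's default string ordering is used as a stand-in for the alphabetical order applied by the software, which is an assumption (case handling, for example, may differ).

```python
def map_ytype_to_outputs(ytypes, outputs):
    """Pair the sorted distinct YTYPE labels with the model outputs (illustrative)."""
    labels = sorted(set(ytypes))  # sorted as strings, not matched by name
    return dict(zip(labels, outputs))


ytype_column = ["conc", "conc", "effect", "effect", "conc", "conc"]
print(map_ytype_to_outputs(ytype_column, ["E", "Cc"]))

# With integer labels stored as strings, "10" sorts before "2".
print(sorted(str(i) for i in range(1, 11)))
```

With the OUTPUT block shown above, the sketch pairs "conc" with E and "effect" with Cc, which is the mismatch the text warns about.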

TIME DOSE Y Y_TYPE
0 . 12 type1
5 . 6 type1
10 . 4 .
15 . 3 .
20 . 2.1 type2
25 . 2 type2


Format restrictions (an exception will be thrown otherwise):

• A data set shall not contain more than one column with column-type YTYPE.

CENS: censored response

• CENS = 1 means that the value in response-column ($y_{obs}$, the content of the column with column-type Y) is an upper limit, true observation y verifies $y<y_{obs}$.
• CENS = 0 means the value in response-column corresponds to a valid observation (no interval associated).
• CENS = -1 means that the value in response-column ($y_{obs}$) is a lower bound, true observation y verifies $y>y_{obs}$.

Format restrictions (an exception will be thrown otherwise):

• A data set shall not contain more than one column with column-type CENS.
• There are only three possible values : -1, 0, and 1.
• String “.” is interpreted as 0.

LIMIT: limit for censored values

When the LIMIT column contains a value and CENS is different from 0, the value in the LIMIT column is interpreted as the second bound of the observation interval. Thus, for CENS = 1, it implies that $y\in [y_{limit}, y_{obs}]$.

Format restrictions (an exception will be thrown otherwise):

• A data set shall not contain more than one column with column-type LIMIT.
• A data set shall not contain any column with column-type LIMIT if no column with column-type CENS is present.
• Column LIMIT shall contain either a string that can be converted to a double or “.”.

Example of censored data definition

The proposed example illustrates the case of upper and lower bounds on a classical data set for a classical PK model (first-order absorption and linear elimination). From the measurement point of view:

• There is a lower bound at 0.5, as the sensor is not able to measure lower concentrations; this corresponds to the CENS=1 case. Moreover, the concentration cannot be lower than 0, thus LIMIT=0.
• There is an upper bound at 5, as the sensor is not able to measure higher concentrations; this corresponds to the CENS=-1 case. Moreover, from the experimental/modeler point of view, the concentration cannot be higher than 6, thus LIMIT=6.

The measurement is represented in the following figure

The measurement corresponds to the blue stars, the real values when censoring arises are in red and green. The corresponding data set is

ID Time Y CENS LIMIT
1  0  0.5 1 0
1  1  0.5 1 0
1  2  4.7 0 0
1  3  5.0 -1 6
1  4  5.0 -1 6
1  5  4.5 0 0
1  6  3.8 0 0
* * * * *
1 15  0.6 0 0
1 16  0.5 0 0
1 17  0.5 1 0
1 18  0.5 1 0
* *   *   * *


The mathematical handling of censored data is described here.
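As an illustration of how CENS and LIMIT combine, the sketch below returns the interval in which the true observation lies. The function name `observation_interval` is hypothetical. Treating a missing LIMIT as an unbounded interval, and reading LIMIT as the upper bound when CENS = -1 (as in the example above, where LIMIT=6), are assumptions drawn from the descriptions of CENS and LIMIT.

```python
import math


def observation_interval(y_obs, cens, limit=None):
    """Return (lower, upper) bounds for the true observation (illustrative)."""
    if cens == 0:
        return (y_obs, y_obs)       # valid observation, no interval
    if cens == 1:
        # y_obs is an upper limit; LIMIT, if given, is the lower bound.
        lower = -math.inf if limit is None else limit
        return (lower, y_obs)
    if cens == -1:
        # y_obs is a lower bound; LIMIT, if given, is the upper bound.
        upper = math.inf if limit is None else limit
        return (y_obs, upper)
    raise ValueError("CENS must be -1, 0 or 1")


print(observation_interval(0.5, 1, 0))    # (0, 0.5)
print(observation_interval(5.0, -1, 6))   # (5.0, 6)
```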

AMT: dose amount

The content of column AMT will be called the dose-column. It shall contain either a double value or the string “.”. When there is no EVID or MDV column, a line whose dose-column contains a double value different from 0 is considered as a dose-line (i.e. a line containing dose information). If the value of the dose is 0, the line is interpreted as a dose-line if the response-column (i.e. the content of the column with column-type Y) contains the string “.”. When a line contains both dose and response information, the dose information is not taken into account and the line is considered as a response-line. The following table sums up the different situations

Notice that when both a non-null amount and a measurement are defined on the same line, the choice was made to favor the measurement. To solve this without any EVID column, the user should provide two distinct lines: a dose-line and a response-line. For instance, in the following data set

TIME ID AMT Y
12.1 John 1.1 12.6


the line is considered as a response-line: a measurement is set at 12.6 at time 12.1 and no dose is added. Of course, it is possible to specify a response and a dose at the same time, but the lines shall be duplicated as in the following data set

TIME ID AMT Y
12.1 Tom . 12.6
12.1 Tom 1.1 .


In that case, the first line is again considered as a response-line, a measurement is set at 12.6 at time 12.1. But the second line is considered as a dose amount at time 12.1 with an amount 1.1.

Format restrictions (an exception will be thrown otherwise):

• A data set shall not contain more than one column-type AMT.
• AMT column shall either contain a double value or string “.”.

The goal of the ADM column is to define several types of administration (e.g. oral administration, intravenous, …). The integer in the ADM column works like a flag, which can be used in the model file to link the dose information of the data set to a specific administration route in the model. For instance, with the following data set:

ID TIME AMT ADM Y
John 0 10 1 .
Eric 0 20 2 .

and the following PK block in the Mlxtran model file:

PK:
iv(type=1)
oral(type=2, ka)

the subject John will receive a dose of 10 via a bolus iv, while subject Eric will receive a dose of 20 orally with first-order rate constant ka. The identifier in the ADM column should match the “type=” field of the macro. We recommend using ADM to define the type of dose only, and setting ADM=”.” for response-lines (in this case, the string “.” will not be interpreted as a repetition of the previous line).

Moreover, it is possible to combine the information of the type of response (as YTYPE) in case of response-lines. Thus, if there are several outputs and several administration routes it is possible to set all the information in the ADM column. The several possibilities using YTYPE and ADM are summarized below

• YTYPE off / ADM off: response lines have only one output; dose lines have only one administration route (type = 1).
• YTYPE on / ADM off: response lines are defined using YTYPE; dose lines have only one administration route (type = 1).
• YTYPE off / ADM on: response lines are defined using ADM; dose lines are defined using ADM.
• YTYPE on / ADM on: response lines are defined using YTYPE; dose lines are defined using ADM.

Notice that, for readability and better understanding, we strongly recommend to

• use ADM to define the type of dose only, and set ADM=”.” for response-lines
• use YTYPE to define the type of output, and set YTYPE = “.” or the first value for dose lines

Format restrictions (an exception will be thrown otherwise):

• For dose-lines, the column shall contain only positive integers. For response-lines strings or integers are allowed.
• A data set shall not contain more than one ADM column-type.

RATE, TINF: rate and infusion duration

These columns define the rate (RATE column-type) or duration (TINF column-type) of doses administered as infusions. The column content is meaningful only for dose-lines. The rate and duration information is transferred to the model via the iv macro. If a RATE is defined, the duration of the infusion will be AMT/RATE. If a TINF is defined, the rate will be AMT/TINF.
We strongly recommend keeping duration values small (less than 10) so that they can be handled efficiently with analytical solutions. Indeed, if the duration is too long, the calculation of the exponential may produce NaN. Two workarounds:
– Either rescale your time so that durations are relevant w.r.t. your time scale.
– If this is not possible, you may use ODEs instead of analytical solutions.

Format restrictions (an exception will be thrown otherwise):

• A data set shall not contain more than one column with column-type RATE or TINF.
• “.” or 0 means a bolus dose, without any infusion rate or time.
• Values can be any double value.
• If a negative value is used in combination with the iv macro, the administration will be a bolus.
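The relations between AMT, RATE and TINF can be written as a short sketch. The function names are hypothetical; `None` stands for the string "." in the data set, and a null or negative value is treated as a bolus, following the restrictions listed above.

```python
def infusion_from_rate(amt, rate):
    """Return rate and duration from a RATE value, or None for a bolus."""
    if rate is None or rate <= 0:
        return None  # ".", 0 or a negative value: bolus dose
    return {"rate": rate, "duration": amt / rate}


def infusion_from_tinf(amt, tinf):
    """Return rate and duration from a TINF value, or None for a bolus."""
    if tinf is None or tinf <= 0:
        return None  # ".", 0 or a negative value: bolus dose
    return {"rate": amt / tinf, "duration": tinf}


print(infusion_from_rate(10, 5))   # duration = AMT/RATE = 2.0
print(infusion_from_tinf(10, 4))   # rate = AMT/TINF = 2.5
```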

SS, II: steady-state and inter-dose interval

Steady-state is used to specify that any transitory effect is over and that the system response is now a periodic function of doses. To do this, a fixed number of doses (by default 4) is added before the dose entered with the SS flag set to true (so 5 doses in total, by default). The period between doses is set to the interdose-interval II.

The number of added doses can be changed in the preferences.xmlx file, located in <home>/lixoft/monolix/config in the user folder. The number of doses is defined in the line <dosesToAddForSteadyState value="5"/>, and can for instance be changed to <dosesToAddForSteadyState value="20"/>.

In the following example:

ID TIME AMT SS II EVID Y
Tom 0 10 1 2 1 .

5 doses are applied, at times  -8, -6, -4, -2 and 0. The above data set is thus equivalent to:

ID TIME AMT SS II EVID Y
Tom  -8 10 0 0 4 .
Tom  -6 10 0 0 1 .
Tom  -4 10 0 0 1 .
Tom  -2 10 0 0 1 .
Tom   0 10 0 0 1 .
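The times of the doses added to reach steady state follow directly from the dose time and the inter-dose interval. In the sketch below, the function name `steady_state_dose_times` and its `added_doses` parameter are hypothetical; the default of 4 added doses is the default stated above.

```python
def steady_state_dose_times(time, ii, added_doses=4):
    """Return the times of the added doses followed by the SS dose itself."""
    # Doses are placed every `ii` time units before the steady-state dose.
    return [time - k * ii for k in range(added_doses, -1, -1)]


print(steady_state_dose_times(0, 2))    # [-8, -6, -4, -2, 0]
print(steady_state_dose_times(10, 3))   # [-2, 1, 4, 7, 10]
```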


The first added dose will have a wash-out, thus for clarity an EVID column has been included in the previous example. But of course it is possible to specify a steady-state even if there is no EVID column in the data set. However an II column is mandatory to specify the period between the five added doses to reach steady-state. The absence of this column will throw an exception (see here under for the complete list of exceptions).

It is possible to find in a data set a mix of steady-state and non steady-state doses. To prevent doses from colliding, if a normal dose is present before a steady-state dose, a new occasion will be created for the steady-state dose. The following data set, with a normal dose at  t=0 and a steady-state dose at t=10 with an interdose-interval of 3:

ID TIME Y AMT SS  II
1 0 . 10 0 0
1 0 10 . . .
1 1 6 . . .
1 2 3.5 . . .
1 10 . 10 1 3
1 11 9 . . .
1 12 6 . . .
1 13 3 . . .
1 14 2 . . .

leads to the following simulation, with 2 occasions, such that the normal dose at t=0 does not collide with the doses at t=-2, 1, 4, 7, 10, added to be at steady-state at t=10:

Format restrictions (an exception will be thrown otherwise):

• A data set shall not contain more than one column with column-type SS.
• A data set shall not contain more than one column with column-type II.
• When a data set contains a column with column-type SS, there must be a column with column-type II.
• When a data set contains a column with column-type II and no column with column-type SS or ADDL, then an SS column is created with:
– SS = 1 when the inter-dose interval is strictly positive.
– SS = 0 otherwise.
• When a data set contains a column with column-type II and no column with column-type SS but a column with column-type ADDL, then an SS column is created with:
– SS = 1 when the inter-dose interval is strictly positive and ADDL = 0.
– SS = 0 otherwise.
• The column is meaningful only for dose-lines. Its format shall be (for all lines, including response-lines for which SS information is not applicable):
– SS shall be either 0 or 1 (‘.’ will be replaced by 0).
– II shall contain a double value and it shall be positive (or null): when SS = 0, the value shall be null; when SS = 1, the value shall be strictly positive.

The ADDL column (additional dose lines) is a useful shortcut to specify dose regimens with repetitive treatments. ADDL is the number of times the dose shall be repeated and the II column contains the dose repetition interval. For instance, to specify a dose of 10 every 12 hours for 3 days, it is possible to write:

ID TIME AMT
Tom 0 10
Tom 12 10
Tom 24 10
Tom 36 10
Tom 48 10
Tom 60 10
Tom 72 10


but ADDL and II (interdose-interval) can also be used to specify the same information in a single line

ID TIME AMT ADDL II
Tom 0 10 6 12
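The equivalence between the two data sets above can be checked with a minimal sketch that expands one ADDL line into explicit dose times. The function name `expand_addl` is hypothetical.

```python
def expand_addl(time, amt, addl, ii):
    """Expand one dose-line with ADDL and II into (time, amount) pairs."""
    # The original dose plus `addl` repetitions, spaced by `ii`.
    return [(time + k * ii, amt) for k in range(addl + 1)]


for t, amt in expand_addl(0, 10, 6, 12):
    print("Tom", t, amt)
```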


Notice that in the proposed example, ADDL is set to 6 to obtain 6 additional administrations (7 doses in total). This is very useful for periodic treatments. Two important remarks concerning regression values:

• If there is a regression-column (i.e. a column with column-type X), its value will also be repeated for added doses even though this value has not been specified but obtained via interpolation.
• When regression values are defined after the first added dose, warnings are generated. Indeed, these values will not be repeated and can possibly interfere with the regression values automatically added at dose times. The warning is generated so that users can confirm that their data make sense.

Format restrictions (an exception will be thrown otherwise):

• ADDL shall only contain positive (or null) integers or “.” (which will be replaced by 0).
• When there is an ADDL column there must be an II (interdose interval) column to indicate the inter dose timing.
• For dose-lines with ADDL strictly positive, the II value must be strictly positive.
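
The ADDL/II shortcut described above can be sketched as an expansion into explicit dose-lines. This is a minimal sketch, not the software's implementation: the function name `expand_addl` and the dictionary representation of a line are assumptions made for the illustration.

```python
def expand_addl(dose_line):
    """Expand one dose-line carrying ADDL and II into explicit dose-lines.

    dose_line: dict with keys ID, TIME, AMT, ADDL and II (strings, as read
    from the file). "." in ADDL or II is read as 0.
    """
    addl = 0 if dose_line["ADDL"] == "." else int(dose_line["ADDL"])
    ii = 0.0 if dose_line["II"] == "." else float(dose_line["II"])
    if addl < 0:
        raise ValueError("ADDL shall be a positive (or null) integer")
    if addl > 0 and ii <= 0:
        raise ValueError("II must be strictly positive when ADDL > 0")
    start = float(dose_line["TIME"])
    # The first dose plus ADDL additional doses, spaced by II
    return [
        {"ID": dose_line["ID"], "TIME": start + k * ii, "AMT": dose_line["AMT"]}
        for k in range(addl + 1)
    ]


doses = expand_addl({"ID": "Tom", "TIME": "0", "AMT": "10", "ADDL": "6", "II": "12"})
dose_times = [dose["TIME"] for dose in doses]
```

With the single line of the example (ADDL = 6, II = 12), `dose_times` contains the seven administration times of the explicit data set, from 0 to 72.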

COV: continuous covariate

A data set can contain one or several columns with column-type COV. One covariate value must be defined per subject-occasion. The string “.” is interpreted as an absence of definition, so it can be used to avoid defining the covariate several times for the same subject-occasion. For readability reasons, we encourage the user to define the covariate either on each line or, more simply, on the first line of a subject (even though the covariate does not necessarily have to be defined at the first occurrence of the subject-occasion in the data set).

Format restrictions (an exception will be thrown otherwise):

• Continuous covariate columns shall contain either strings that can be converted to double or “.”.
• The covariate must be defined at least once per subject-occasion.
• The covariate must remain the same for all the lines within the same subject-occasion.
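
The restrictions above can be checked with a few lines of code. The sketch below is only an illustration under stated assumptions: the function name `resolve_covariate`, the dictionary representation of a line and the column name `WT` are hypothetical.

```python
def resolve_covariate(lines, cov_name):
    """Return one covariate value per (ID, OCC), or raise on a format error.

    lines: list of dicts (one per line of the data set, values as strings).
    If there is no OCC column, the subject alone identifies the group.
    """
    seen = {}
    for line in lines:
        key = (line["ID"], line.get("OCC"))
        values = seen.setdefault(key, set())
        if line[cov_name] != ".":  # "." means "not defined on this line"
            values.add(float(line[cov_name]))
    resolved = {}
    for key, values in seen.items():
        if not values:
            raise ValueError(f"{cov_name} is never defined for {key}")
        if len(values) > 1:
            raise ValueError(f"{cov_name} is not constant for {key}")
        resolved[key] = next(iter(values))
    return resolved


weights = resolve_covariate(
    [
        {"ID": "Tom", "OCC": "1", "WT": "70"},
        {"ID": "Tom", "OCC": "1", "WT": "."},
    ],
    "WT",
)
```

Here the second line uses “.”, so the value defined on the first line applies to the whole subject-occasion.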

CAT: categorical covariate

A data set can contain one or several columns with column-type CAT. Any string can be entered in a CAT column, and “.” has no special meaning. We strongly recommend using only alphanumeric characters and the underscore “_” character in the strings of the CAT columns. In the MonolixSuite 2016R1, special characters such as spaces “ ”, stars “*”, parentheses “(”, brackets “[”, dashes “-”, dots “.” and slashes “/” are not supported (this feature will be back in the next release).

Moreover, contrary to continuous covariates, the following data set will generate an error, because “.” is read as a regular string that differs from “M”:

ID OCC CAT
Tom 1 M
Tom 1 .


Format restrictions (an exception will be thrown otherwise):

• The categorical covariate must be the same for all the lines with the same subject-occasion.

X: regression value

A data set can contain one or several columns with column-type X. Within a given subject-occasion, the string “.” is interpolated (nearest neighbor interpolation is used) on dose-lines only (N.B.: if there is an EVID column, dose-lines correspond to EVID = 1 or EVID = 4). On measurement lines, no interpolation is performed: if no regressor is defined on such a line, it is replaced by NaN. Consider the following data set example:

ID TIME X AMT Y EVID
Tom 0 . 1 . 1
Tom 5 1 . 12 0
Tom 10 . . 10 0
Tom 15 12 1.5 . 1
Tom 20 -6 . 8 0
Tom 25 . 0.2 . 4
Tom 30 . . 0.1 0


The evolution of X with respect to time is defined by the following figure.

Thus, X is set to

• X(0) = 1 (it is a dose-line, so an interpolation is performed. With nearest neighbor interpolation, the nearest defined sample is the response-line at time 5).
• X(5) = 1 (read directly from the input file, even though the line does not correspond to a dose).
• X(10) = NaN (the regressor is undefined in the input file and, since it is not a dose-line, no interpolation is performed).
• X(15) = 12 (read directly from the input file).
• X(20) = -6 (read directly from the input file, even though the line does not correspond to a dose).
• X(25) = -6 (it is a dose-line, so an interpolation is performed. With nearest neighbor interpolation, the nearest defined sample is the response-line at time 20).
• X(30) = NaN (the regressor is undefined in the input file and, since it is not a dose-line, no interpolation is performed).

To provide valid information between times 10 and 15, for example X = 1.5, the line at time 10 should contain a regressor value along with the measurement value:

 Tom 10 1.5 . 10 0


Notice that if a line has MDV = 1, its regressor value is not taken into account.

Format restrictions (an exception will be thrown otherwise):

• The regression-columns (i.e. columns with column-type X) shall contain either strings that can be converted to double or “.”.
• Each subject-occasion must contain at least one non “.” value (otherwise it is impossible to interpolate values).
• When there are several lines with the same time, the value of the regressor column must be the same.
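
The treatment of “.” in a regression-column can be sketched as follows, using the lines of the example data set above as they are written. This is an illustration only: the function name `resolve_regressor` is hypothetical, and the choice made when two defined samples are equally distant (the first one listed is kept) is an assumption, since the rule for ties is not described here.

```python
import math

EXAMPLE = """\
ID TIME X AMT Y EVID
Tom 0 . 1 . 1
Tom 5 1 . 12 0
Tom 10 . . 10 0
Tom 15 12 1.5 . 1
Tom 20 -6 . 8 0
Tom 25 . 0.2 . 4
Tom 30 . . 0.1 0
"""


def resolve_regressor(lines):
    """Fill the X column of one subject-occasion; return (time, X) pairs."""
    known = [(float(l["TIME"]), float(l["X"])) for l in lines if l["X"] != "."]
    if not known:
        raise ValueError("at least one non '.' regressor value is required")
    resolved = []
    for line in lines:
        t = float(line["TIME"])
        if line["X"] != ".":
            x = float(line["X"])  # direct reading of the input file
        elif line["EVID"] in ("1", "4"):
            # dose-line: nearest neighbor among the defined samples
            # (assumption: on a tie, the first listed sample is kept)
            x = min(known, key=lambda sample: abs(sample[0] - t))[1]
        else:
            x = math.nan  # response-line: no interpolation
        resolved.append((t, x))
    return resolved


header, *rows = [row.split() for row in EXAMPLE.strip().splitlines()]
lines = [dict(zip(header, row)) for row in rows]
resolved = resolve_regressor(lines)
```

Only the dose-lines at times 0 and 25 receive an interpolated value; the response-lines at times 10 and 30 keep NaN.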

EVID: event identification data item.

EVID corresponds to the identification of an event. It is an integer between 0 and 4. It helps to define the type of line.

• EVID = 0: observation event, the line is a response-line.
• EVID = 1: dose event, the line is a dose-line.
• EVID = 2: other event. UNUSED (exception thrown). To define times for model predictions without corresponding observations, use MDV=2.
• EVID = 3: reset event. UNUSED (exception thrown).
• EVID = 4: reset + dose event, indicates a wash-out (i.e. reset to initial values) immediately followed by a dose.

Format restrictions (an exception will be thrown otherwise):

• A data set shall not contain more than one column with column-type EVID.
• EVID shall contain an integer in [0, 4].
• when a line is tagged (EVID = 0), the observation contained in column Y shall be convertible to a double value.
• when a line is tagged (EVID = 1, EVID = 4), the value in dose-column (i.e. content of the column with column-type AMT) shall be convertible to a double.
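
The mapping between EVID values and line types can be summarized by the following sketch. The function name `line_type_from_evid` and the returned labels are assumptions made for the illustration.

```python
def line_type_from_evid(evid):
    """Return the line type associated with an EVID value (given as a string)."""
    value = int(evid)
    if value == 0:
        return "response-line"
    if value == 1:
        return "dose-line"
    if value == 4:
        return "dose-line with reset"
    if value in (2, 3):
        # EVID = 2 and EVID = 3 are unused: an exception is thrown
        raise ValueError(f"EVID = {value} is not used")
    raise ValueError("EVID shall contain an integer in [0, 4]")
```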

MDV: missing dependent variable.

The MDV column-type makes it possible to tag lines for which the information in the Y column is missing. Most of the time, this column is not necessary.

• MDV=0: when a line is tagged MDV = 0 AND if it contains a string convertible to a double value in response-column (the column with column-type Y), then the value in the Y column is taken into account. Values in dose-column (the column with column-type AMT) will not be taken into account.
• MDV=1: when a line is tagged MDV = 1 then the value in column Y will not be taken into account. The value in dose-column, if present, will be taken into account.
• MDV=2: when a line is tagged MDV = 2 then the value in the response-column is not taken into account. The value in dose-column, if present, will be taken into account. The time, covariates, regressors, etc will be taken into account to output a prediction at that time point.

If there are both an MDV and an EVID column, the EVID column takes priority.
The MDV column is useful to ignore specific response-lines, for instance if the observation is obviously wrong. If an MDV column is added to the data set, the response-lines to ignore should have MDV=1, and the dose-lines should also have MDV=1 (otherwise the dose will be ignored). MDV=2 makes it possible to define times at which model predictions should be output, even if there is no corresponding observation.

When there are multiple MDV columns, a synthetic MDV value is computed as:

• if MDV = 0 in all columns, then resulting synthetic MDV equals 0.
• if MDV = 1 in at least one column and the others equal 0, then the resulting synthetic MDV equals 1.
• if MDV = 2 in at least one column and the others equal 0, then the resulting synthetic MDV equals 2.
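
These rules can be sketched as a small function. The name `synthetic_mdv` is hypothetical, and raising an error when MDV = 1 and MDV = 2 appear on the same line is an assumption, since this combination is not covered by the rules above.

```python
def synthetic_mdv(values):
    """Combine the MDV values of one line (one string per MDV column)."""
    flags = set()
    for value in values:
        flag = int(value)
        if flag not in (0, 1, 2):
            raise ValueError("MDV shall contain only integers in [0, 2]")
        flags.add(flag)
    if not flags:
        raise ValueError("at least one MDV value is required")
    if flags == {0}:
        return 0
    if flags <= {0, 1}:
        return 1
    if flags <= {0, 2}:
        return 2
    # Assumption: a mix of 1 and 2 is not described, so it is rejected here
    raise ValueError("MDV = 1 together with MDV = 2 is not covered by the rules")
```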

Format restrictions (an exception will be thrown otherwise):

• MDV shall contain only integers belonging to interval [0, 2].
• When MDV=0, the value in the Y column should be convertible to a double value, otherwise an exception will be thrown.

Character definition

We recommend to use only alphanumeric characters and the underscore “_” character in the strings of your data set.

Unfortunately, in the MonolixSuite 2016R1, special characters such as spaces “ ”, stars “*”, parentheses “(”, brackets “[”, dashes “-”, dots “.” and slashes “/” are not supported in:

• The strings in CAT column.

This feature will be back in the next release.

Please be careful: if your data set includes unsupported characters, the error will only be detected and displayed when loading a saved project (and not when creating and saving the project).

On the use of “.”

The “.” can be used in almost all the lines of the data set but has several meanings depending on the context. The following table summarizes its use.

Type of column | Not allowed | Considered as a regular string | Considered as | Not considered
ID | - | X | - | -
OCC | X | - | - | -
TIME | X | - | - | -
DATE/DAT1/DAT2/DAT3 | X | - | - | -
Y | On a response line | - | - | On a dose line
YTYPE | On a response line | - | - | On a dose line (not read)
CENS | - | - | 0 | -
LIMIT | - | - | -Inf if CENS = 1, +Inf if CENS = -1 | -
AMT | On a dose line | - | - | On a response line (not read)
ADM | On a dose line | - | - | On a response line
SS | - | - | 0 | -
ADDL | - | - | 0 | -
II | - | - | 0 | -
COV | - | - | Previously defined value of the COV (in the ID/OCC) | -
CAT | - | X | - | -
X (regressor) | - | - | Interpolation on a dose line, NaN on a response line | -
EVID | X | - | - | -
MDV | X | - | - | -

Data set examples

This section presents several concrete data sets and shows how to integrate censored data, covariates, …

Data sets with continuous outputs

• Theophylline data set: continuous outputs are taken into account along with categorical and continuous covariates (sex and weight respectively). Moreover, censored data are also managed.
• Tobramycin data set: continuous PK outputs are taken into account, along with categorical and continuous covariates.
• HIV data set: two continuous censored outputs are considered. No dose is used in the data set, and the treatment type is considered as a categorical covariate.
• Veralipride data set: continuous output with an interesting double peak phenomenon, for which absorption variability is by far the most probable physiological explanation.
• Remifentanil data set: Remifentanil is an opioid analgesic drug with a rapid onset and rapid recovery time. Remifentanil concentrations measured in 65 healthy adults are provided.

Data sets with discrete count outputs

• Epilepsy attacks data set: count outputs are taken into account along with categorical and continuous covariates. The data arose from a clinical trial of 59 epileptics who were randomized to receive either the anti-epileptic drug progabide or a placebo, as an adjuvant to standard chemotherapy. Patients attended four successive post-randomisation clinic visits, where the number of seizures that occurred over the previous 2 weeks was reported.
• Crohn’s Disease Adverse Events data set: data set from a study of the adverse events of a drug on 117 patients affected by Crohn’s disease (a chronic inflammatory disease of the intestines). In addition to the response variable (the number of adverse events), 7 explanatory variables were recorded for each patient.

Data sets with discrete categorical outputs

• Respiratory status data set: the respiratory status of 111 patients under placebo or treatment is categorized as “poor” or “good” once per month during 5 months.
• Inpatient multidimensional psychiatric data set: categorical output with a categorical covariate (treatment) during 6 weeks. These data are from the National Institute of Mental Health Schizophrenia Collaborative Study and are available here. Patients were randomized to receive one of four medications, either placebo or one of three different anti-psychotic drugs. The primary outcome is item 79 on the Inpatient Multidimensional Psychiatric Scale.
• Zylkene data set: The putative effects of a tryptic bovine αs1-casein hydrolysate on anxious disorders in cats were investigated using this data set on 24 cats. The score is a global score of emotional state.

Data sets with time-to-event outputs

• PBC data set: PBC is a rare but fatal chronic liver disease of unknown cause, with a prevalence of about 50-cases-per-million population. Between January, 1974 and May, 1984, the Mayo Clinic conducted a double-blinded randomized trial in primary biliary cirrhosis of the liver (PBC), comparing the drug D-penicillamine (DPCA) with a placebo.
• Oropharynx data set: The following data set provides the data for a part of a large clinical trial carried out by the Radiation Therapy Oncology Group in the United States. One objective of the study was to compare the two treatment policies with respect to patient survival.
• Veterans’ Administration Lung Cancer data set: In this study conducted by the US Veterans Administration, time to death was recorded for 137 male patients with advanced inoperable lung cancer, who were given either a standard therapy or a test chemotherapy.
• NCCTG lung cancer data set: The North Central Cancer Treatment Group (NCCTG) data set records the survival (time-to-event output) of 228 patients with advanced lung cancer, together with assessments of the patients’ performance status measured both by the physician and by the patients themselves.
• Cardiovascular data set: A subset of the fields was selected to model the differential length of stay for patients entering the hospital to receive one of two standard cardiovascular procedures: CABG and PTCA. The data set contains 3589 individuals.

Joint data sets

• Warfarin data set: Warfarin is an anticoagulant normally used in the prevention of thrombosis and thromboembolism.  Plasma warfarin concentrations and Prothrombin Complex Response in thirty normal subjects after a single loading dose are measured. Both measurements are continuous.
• Remifentanil data set: Remifentanil is an opioid analgesic drug with a rapid onset and rapid recovery time. Both remifentanil concentration and EEG measurements are provided for 65 healthy adults. Both measurements are continuous.
• PSA and survival data set: PSA kinetics and survival data for 400 men with metastatic Castration-Resistant Prostate Cancer (mCRPC) treated with docetaxel and prednisone, the first-line reference chemotherapy, which constituted the control arm of a phase 3 clinical trial. In this context of advanced disease, the incidence of death is high and the PSA kinetics is closely monitored after treatment initiation to rapidly detect a breakthrough in PSA and propose rescue strategies.

Theophylline data set

The data considered here are courtesy of Dr. Robert A. Upton of the University of California, San Francisco. Theophylline is a methylxanthine drug used in therapy for respiratory diseases such as chronic obstructive pulmonary disease (COPD) and asthma under a variety of brand names. Theophylline was administered orally to 12 subjects whose serum concentrations were measured at 11 times over the next 25 hours. This is an example of a laboratory pharmacokinetic study characterized by many observations on a moderate number of individuals. The data set can be seen here, and the corresponding Datxplore project here (notice that both files should be in the same folder to be correctly linked). A representation of the concentration over time for each subject is presented on the following figure (notice that this figure was generated using Datxplore).

The purpose of this page is to show the construction, the definition and the use of such a data set in Datxplore and Monolix. For the sake of simplicity, we look at only one subject (corresponding to ID 1).

Simplified data set

The data set for subject one reads as follows:

ID  AMT   TIME   CONC   WEIGHT  SEX
1   4.02  0      .      79.6    M
1   .     0.25   2.84   79.6    M
1   .     0.57   6.57   79.6    M
1   .     1.12   10.5   79.6    M
1   .     2.02   9.66   79.6    M
1   .     3.82   8.58   79.6    M
1   .     5.1    8.36   79.6    M
1   .     7.03   7.47   79.6    M
1   .     9.05   6.89   79.6    M
1   .     12.12  5.94   79.6    M
1   .     24.37  3.28   79.6    M

Interpretation

One can see the following columns

Several points can be noticed.

1. The first line corresponds to a dose, while the other ones are measurements. This explains the dot in the CONC column for the first line and the dots in the AMT column for the other ones.
2. The covariate columns (the continuous WEIGHT and the categorical SEX) are constant within the individual. Even though it is not necessary, we encourage the user to fill the columns for readability and usage reasons.
3. Finally, notice that no initial washout is needed at the beginning as by default, the null initial condition is used for parameter estimation.
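
The first point can be illustrated by reading the first lines of the simplified data set and separating dose-lines from measurements according to the dots. This is a minimal sketch: the function name `split_lines` is hypothetical, and using the dots alone to decide the line type is a simplification valid for this data set, which has no EVID or MDV column.

```python
SAMPLE = """\
ID AMT TIME CONC WEIGHT SEX
1 4.02 0 . 79.6 M
1 . 0.25 2.84 79.6 M
1 . 0.57 6.57 79.6 M
"""


def split_lines(text):
    """Return (dose_lines, response_lines) from a whitespace-separated text."""
    header, *rows = [line.split() for line in text.strip().splitlines()]
    records = [dict(zip(header, row)) for row in rows]
    # A dot in AMT means "no dose", a dot in CONC means "no measurement"
    dose_lines = [r for r in records if r["AMT"] != "."]
    response_lines = [r for r in records if r["CONC"] != "."]
    return dose_lines, response_lines


dose_lines, response_lines = split_lines(SAMPLE)
```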

Warfarin data set

This data set has been originally published in:

O’Reilly (1968). Studies on coumarin anticoagulant drugs. Initiation of warfarin therapy without a loading dose. Circulation 1968, 38:169-177.

Warfarin is an anticoagulant normally used in the prevention of thrombosis and thromboembolism, the formation of blood clots in the blood vessels and their migration elsewhere in the body, respectively. The data set provides a set of plasma warfarin concentrations and Prothrombin Complex Response in thirty normal subjects after a single loading dose. A single large loading dose of warfarin sodium, 1.5 mg/kg of body weight, was administered orally to all subjects. Measurements were made every 12 or 24 h.
On the two following figures, one can see the concentration and the effect with respect to time for all subjects.

The data set for subject one can be defined as follows

id time    amt dv  dvid    wt  age sex
1   0   100 .   1   66.7    50  1
1   0   .   100 2   66.7    50  1
1   24  .   9.2 1   66.7    50  1
1   24  .   49  2   66.7    50  1
1   36  .   8.5 1   66.7    50  1
1   36  .   32  2   66.7    50  1
1   48  .   6.4 1   66.7    50  1
1   48  .   26  2   66.7    50  1
1   72  .   4.8 1   66.7    50  1
1   72  .   22  2   66.7    50  1
1   96  .   3.1 1   66.7    50  1
1   96  .   28  2   66.7    50  1
1   120 .   2.5 1   66.7    50  1
1   120 .   33  2   66.7    50  1

Interpretation

One can see the following columns

Several points can be noticed.

1. The first line corresponds to a dose, while the other ones are measurements. This explains the dot in the dv column for the first line and the dots in the amt column for the other ones.
2. The covariate columns (the continuous wt and age, and the categorical sex) are filled with the same values on every line. Even though it is not necessary, we encourage the user to fill the columns for readability and usage reasons.
3. In the presented case, both PK and PD measurements are made at the same times. This is not required for data exploration using Datxplore, nor for parameter estimation using Monolix.
4. Finally, notice that no initial washout is needed at the beginning as by default, the null initial condition is used for parameter estimation.
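
As an illustration of how the dvid column distinguishes the two outputs, the sketch below groups the measurements of the first lines of the data set by dvid value. The function name `observations_by_type` is hypothetical, and skipping lines on the dot in the dv column is a simplification valid for this data set.

```python
from collections import defaultdict

SAMPLE = """\
id time amt dv dvid wt age sex
1 0 100 . 1 66.7 50 1
1 0 . 100 2 66.7 50 1
1 24 . 9.2 1 66.7 50 1
1 24 . 49 2 66.7 50 1
"""


def observations_by_type(text):
    """Return {dvid: [(time, value), ...]} for the lines that hold a measurement."""
    header, *rows = [line.split() for line in text.strip().splitlines()]
    grouped = defaultdict(list)
    for row in rows:
        record = dict(zip(header, row))
        if record["dv"] != ".":  # the dose-line has no measurement
            grouped[record["dvid"]].append((float(record["time"]), float(record["dv"])))
    return dict(grouped)


observations = observations_by_type(SAMPLE)
```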

Interestingly, one can display the Effect with respect to the Concentration in order to have an idea on how to model the interaction between the PD and the PK part.

The response does not seem to be direct. Notice that, as the observation times are not the same for the PK and the PD, an interpolation is made to propose this kind of plot. One can also focus on one individual in particular, as on the following figure.

Notice that we also propose a red arrow to describe the evolution of time.

Tobramycin data set

This data set has been originally published in:

Aarons, L., Vozeh, S., Wenk, M., Weiss, P. H., & Follath, F. (1989). Population pharmacokinetics of tobramycin. British journal of clinical pharmacology, 28(3), 305-314.

Tobramycin is an antimicrobial agent of the aminoglycoside family, which is used, among others, against severe gram-negative infections. Because tobramycin does not pass the gastro-intestinal tract, it is usually administered intravenously as intermittent bolus doses or short infusions. Tobramycin is a drug with a narrow therapeutic index.
Tobramycin bolus doses ranging from 20 to 140 mg were administered every 8 hours in 97 patients (45 females, 52 males) during 1 to 21 days (for most patients, during ~6 days). Age, weight (kg), sex and creatinine clearance (mL/min) were available as covariates. The tobramycin concentration (mg/L) was measured 1 to 9 times per patient (322 measurements in total), most of the time between 2 and 6 h post-dose. This sparse data set is presented on the figure below.

Below is an extract of the data set:

The columns have the following meaning:

Several points can be noticed:

1. The first four lines correspond to doses, while the other ones are measurements, as indicated by the EVID column. The MDV column is not necessary. The zeros of the DOSE and CP columns could have been replaced by dots “.”.
2. The covariates columns (WT, SEX and CLCR) are filled with the same value for each individual. Covariates must be constant within subjects (or subject-occasions when occasions are defined).

HIV data set

In the COPHAR II-ANRS 134 trial, an open prospective non-randomized interventional study, 115 HIV-infected adult patients started an antiviral therapy. 48 patients were treated with indinavir (and ritonavir as a booster) (treatment A), 38 with lopinavir (and ritonavir as a booster) (treatment B), and 35 with nelfinavir (treatment C). Patients were followed for one year after treatment initiation.

Viral load and CD4 cell count were measured at screening, at inclusion and at weeks 2 (or 4), 8, 16, 24, 36, and 48. Plasma HIV-1 RNA was measured by Roche Monitor with a limit of quantification of 50 copies/ml. The results of this trial are reported in Duval et al. (2009). The data set can be seen here, and the corresponding Datxplore project here (notice that both files should be in the same folder to be correctly linked).

On the two following figures, one can see the two outputs with respect to time for all subjects, split by treatment. The red circles correspond to censored data.

Notice, that these figures were generated using Datxplore.

Simplified HIV data set

The data set for subject 2 can be defined as follows

ID TIME    Y_NCENS Y   CENS    YTYPE   TREATMENT
2   -2.43   4.9443  4.9443  0   1    A
2   -2.43   249 249 0   2    A
2   0   4.5245  4.5245  0   1    A
2   2   2.3546  2.3546  0   1    A
2   2   266 266 0   2    A
2   4.29    268 268 0   2    A
2   8   2.5585  2.5585  0   1    A
2   8   34  34  0   2    A
2   16  352 352 0   2    A
2   24  1.7981  2   1   1    A
2   24  385 385 0   2    A
2   32  348 348 0   2    A
2   43  415 415 0   2    A

Interpretation

One can see the following columns

Several points can be noticed.

1. There is no dose in the data set.
2. There is only one categorical covariate, which defines the treatment.
3. In the presented case, the two measurements are not necessarily made at the same times. Indeed, this is not required for data exploration using Datxplore, nor for parameter estimation using Monolix. Moreover, measurements at negative times are possible.

Epilepsy attacks data set

This data set has been originally published in:

Leppik, IE. et al. (1985) A double-blind crossover evaluation of progabide in partial seizures. Neurology 35, 285.

The data arose from a clinical trial of 59 epileptics who were randomized to receive either the anti-epileptic drug progabide or a placebo, as an adjuvant to standard chemotherapy. The hope was that progabide would help to reduce the number of seizures experienced by patients. Patients attended four successive post-randomisation clinic visits, where the number of seizures that occurred over the previous 2 weeks was reported. At baseline, information on the age of the patient and the 8-week pre-randomisation seizure count was recorded.

Below is an extract of the data set:

The columns have the following meaning:

Several points can be noticed:

1. There are several seizure counts for each individual, thus the time indicates to which period each count is related.
2. The ID and TIME columns are mandatory. Thus, if there is only one count measurement per individual, an additional TIME column should be added (filled with 0 for example).
3. The covariates columns (treatment, base and age) are filled with the same value for each individual. Covariates must be constant within subjects (or subject-occasions when occasions are defined).

Moreover, we can split by the covariate treatment and thus see the impact of the treatment

It seems that the subjects with the treatment have a lower seizure rate. We can also display it grouped and not in a spaghetti display, as in the following figure.

Using that, we have a better understanding of the seizure_rate, and it seems that the treatment is effective.

PBC data set

PBC is a rare but fatal chronic liver disease of unknown cause, with a prevalence of about 50-cases-per-million population. The primary pathologic event appears to be the destruction of interlobular bile ducts, which may be mediated by immunologic mechanisms.
Between January, 1974 and May, 1984, the Mayo Clinic conducted a double-blinded randomized trial in primary biliary cirrhosis of the liver (PBC), comparing the drug D-penicillamine (DPCA) with a placebo. There were 424 patients who met the eligibility criteria seen at the Clinic while the trial was open for patient registration. Both the treating physician and the patient agreed to participate in the randomized trial in 312 of the 424 cases. The date of randomization and a large number of clinical, biochemical, serologic, and histologic parameters were recorded for each of the 312 clinical trial patients. The data from the trial were analyzed in 1986 for presentation in the clinical literature. For that analysis, disease and survival status as of July, 1986, were recorded for as many patients as possible. By that date, 125 of the 312 patients had died, with only 11 not attributable to PBC. Eight patients had been lost to follow up, and 19 had undergone liver transplantation.

The considered data set comes from Counting Processes and Survival Analysis by T. Fleming & D. Harrington, (1991), published by John Wiley & Sons. The data set can be seen here, and the corresponding Datxplore project here (notice that both file should be in the same folder to be correctly linked).

On the following figure, one could see the survival curve and the mean number of events with respect to time. Notice, that this figure was generated using Datxplore.

In this data set, a lot of covariates are available:

id       = case number
futime   = number of days between registration and the earlier of death,
transplantation, or study analysis time in July, 1986
status   = 0=alive, 1=liver transplant, 2=dead
drug     = 1= D-penicillamine, 2=placebo
age      = age in days
sex      = 0=male, 1=female
ascites  = presence of ascites: 0=no 1=yes
hepato   = presence of hepatomegaly 0=no 1=yes
spiders  = presence of spiders 0=no 1=yes
edema    = presence of edema 0=no edema and no diuretic therapy for edema;
.5 = edema present without diuretics, or edema resolved by diuretics;
1 = edema despite diuretic therapy
bili     = serum bilirubin in mg/dl
chol     = serum cholesterol in mg/dl
albumin  = albumin in gm/dl
copper   = urine copper in ug/day
alk_phos = alkaline phosphatase in U/liter
sgot     = SGOT in U/ml
trig     = triglycerides in mg/dl
platelet = platelets per cubic ml/1000
protime  = prothrombin time in seconds
stage    = histologic stage of disease

On the two following figures, one can see the survival curve and the mean number of events with respect to time for two groups: the first group contains the subjects younger than 52.3 years and the second group contains the other subjects. Notice that these figures were generated using Datxplore.

Simplified PBC data set

The data set for subjects 1 and 2 can be defined as follows

ID;TIME;Y;TRT;AGE;SEX;
1;0;0;1;58.7652;1;
1;400;1;1;58.7652;1;
2;0;0;1;56.4463;1;
2;4500;0;1;56.4463;1;


One must indicate the start time of the observation period with Y=0 (at lines 1 and 3 for subjects 1 and 2 respectively), and the time of the event (Y=1) or the time of the end of the observation period if no event has occurred (Y=0). In this simplified data set, subject 1 had an event at time 400, leading to a line in the data set where Y=1. On the contrary, no event occurred for subject 2. Thus, at the end of the observation period (TIME=4500), Y is set to 0.
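
The construction of these two lines per subject can be sketched as follows. The function name `tte_lines` and its arguments are hypothetical; the sketch only covers the single-event case described above, with the observation period starting at time 0 by default.

```python
def tte_lines(subject_id, end_time, event_observed, start_time=0):
    """Return the two lines describing one subject of a time-to-event data set.

    end_time       : time of the event, or end of the observation period
    event_observed : True if the event occurred at end_time, False otherwise
    """
    return [
        # start of the observation period, always with Y = 0
        {"ID": subject_id, "TIME": start_time, "Y": 0},
        # event (Y = 1) or end of the observation period without event (Y = 0)
        {"ID": subject_id, "TIME": end_time, "Y": 1 if event_observed else 0},
    ]


subject_1 = tte_lines(1, 400, event_observed=True)
subject_2 = tte_lines(2, 4500, event_observed=False)
```

Applied to subjects 1 and 2, the sketch reproduces the ID, TIME and Y columns of the simplified data set.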

Oropharynx data set

The following data set provides the data for a part of a large clinical trial carried out by the Radiation Therapy Oncology Group in the United States. The full study included patients with squamous carcinoma of 15 sites in the mouth and throat, with 16 participating institutions, though only data on three sites in the oropharynx reported by the six largest institutions are considered here. Patients entering the study were randomly assigned to one of two treatment groups, radiation therapy alone or radiation therapy together with a chemotherapeutic agent. One objective of the study was to  compare the two treatment policies with respect to patient survival. Approximately 30% of the survival times are censored owing primarily to patients surviving to the time of analysis. Some patients were lost to follow-up because the patient moved or transferred to an institution not participating in the study, though these cases were relatively rare.

The considered data set comes from The Statistical Analysis of Failure Time Data, by JD Kalbfleisch & RL Prentice, (1980), published by John Wiley & Sons. The data set can be seen here, and the corresponding Datxplore project here (notice that both files should be in the same folder to be correctly linked).

On the following figure, one could see the survival curve and the mean number of events with respect to time. Notice, that this figure was generated using Datxplore.

This study included measurements of many covariates which would be expected to relate to survival experience. Six such variables are given in the data (sex, T staging, N staging, age, general condition, and grade). The site of the primary tumor and possible differences between participating institutions require consideration as well.

CASE          Case Number
INST          Participating Institution
SEX           1=male, 2=female
TX            Treatment: 1=standard, 2=test
GRADE         1=well differentiated, 2=moderately differentiated,
3=poorly differentiated,  9=missing
AGE           In years at time of diagnosis
COND          Condition: 1=no disability, 2=restricted work, 3=requires assistance
with self care, 4=bed confined,  9=missing
SITE          1=faucial arch, 2=tonsillar fossa, 3=posterior pillar,
4=pharyngeal tongue, 5=posterior wall
T_STAGE       1=primary tumor measuring 2 cm or less in largest diameter,
2=primary tumor measuring 2 cm to 4 cm in largest diameter with
minimal infiltration in depth, 3=primary tumor measuring more
than 4 cm, 4=massive invasive tumor
N_STAGE       0=no clinical evidence of node metastases, 1=single positive
node 3 cm or less in diameter, not fixed, 2=single positive
node more than 3 cm in diameter, not fixed, 3=multiple
positive nodes or fixed positive nodes
ENTRY_DT      Date of study entry: Day of year and year, dddyy
TIME          Survival time in days from day of diagnosis

On the two following figures, one can see the survival curve and the mean number of events with respect to time for two groups: the first group contains the subjects younger than 55 years and the second group contains the other subjects. Notice that these figures were generated using Datxplore.

Simplified Oropharynx data set

The data set for subjects 47 and 48 can be defined as follows

ID;INST;SEX;TRT;GRADE;AGE;COND;SITE;T_STAGE;N_STAGE;ENTRY_DT;Y;Time
47;4;1;2;2;49;3;1;4;3;5669;0;0
47;4;1;2;2;49;3;1;4;3;5669;1;74
48;3;1;1;1;44;1;1;3;1;2769;0;0
48;3;1;1;1;44;1;1;3;1;2769;0;1609


One must indicate the start time of the observation period with Y=0 (at lines 1 and 3 for subjects 47 and 48 respectively), and the time of the event (Y=1) or the time of the end of the observation period if no event has occurred (Y=0). In this simplified data set, subject 47 had an event at time 74, leading to a line in the data set where Y=1. On the contrary, no event occurred for subject 48. Thus, at the end of the observation period (TIME=1609), Y is set to 0.

FAQ

• Which file formats are supported? Text and comma-separated values files are allowed. The file extension should preferably be .txt or .csv.
• Should I have a header line? Yes, having a header line is mandatory.
• Are there restrictions on header names? No, there is no restriction on the names or on the number of characters. However, some characters are not allowed, as in the rest of the data file (see here).
• Which column types are mandatory? The ID, TIME and Y column-types are mandatory. All others are optional.
• Which column-types are possible? The complete list of supported column-types can be found here.
• Which separators are allowed? The supported separators are comma (“,”), semicolon (“;”), space (” “), and tab (“\t”).
• Which characters are allowed in strings? The list of allowed characters can be found here.
• What does “.” mean? The “.” can be used in almost all the lines of the data set, but its meaning depends on the context. A summary can be found here.
• How can I ignore certain response-lines of my data set? Use MDV=1 for that.
• Can I specify time in hours or in days? Yes, all the possible formats are defined here.
• Can the data be split into several files (for instance one file for dosing and one for observations)? No. All the data must be grouped into a single file.

Questions about format differences with NONMEM

• What are the differences between Monolix and Nonmem in terms of data set? The few differences are listed here.
• What is the equivalent of the NONMEM CMT column? Depending on the usage of the CMT column, it can correspond to the YTYPE column-type, the ADM column-type or to both. All differences between NONMEM and the MonolixSuite are listed here.

• Must all lines corresponding to the same individual be grouped? No, this is not necessary. All lines with the same ID will be assigned to the same individual, whatever their order or grouping.
• How can I define occasions? For that, you can use the OCC column-type as explained here.

• Must the times be in ascending order? For a given individual, the times do not need to be in order. The sorting will be done automatically.
• Can time have negative values? Yes.
• For time-to-event data, do I have to indicate the start time? Yes, it must be explicitly stated, for instance with TIME=0 and Y=0. Guidelines for data set formatting for time-to-event data are given here.

• Are non-continuous data types (such as count, time-to-event and categorical data) supported? Yes. Examples of data sets for non-continuous data types are presented here.
• Which value should I enter in the Y column-type for BLQ values? In the Y column-type, give the limit of quantification (LOQ). To mark the observations as being BLQ, use the CENS column-type. To indicate a censoring interval, use the LIMIT column-type (in addition to the CENS and Y columns).
• Can my data set contain different types of observations? Yes, use the YTYPE column to define to which type of data the line corresponds. An example data set with different types of observations is presented here.
• What happens if I define both a dose and a response in the same line? Depending on the values, the line can be interpreted as a dose or as a response. All the configurations are listed here.

• For dose-lines, should I specify the compartment into which the dose is introduced? No. In the MonolixSuite, the matching between the data (dose and observation lines) and the model (administrations and predictions) is done using identifiers, not based on compartment numbers. To assign a dose to a specific administration of the model (oral or iv macros for classical PK models, depot macro for more complex ODEs), the column ADM is used. The identifier in the ADM column should match the “type=” field of the macro.
• If I have several outputs, should I duplicate the dosing information? No.

• Can I have time-varying covariates? Continuous (COV) and categorical (CAT) covariates must be constant within a subject-occasion. If a continuous covariate varies with respect to time, the first value declared will be used for the entire subject-occasion. To use values that change over time, the column can instead be tagged as a regressor (column-type X).
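The "first value declared" rule can be pictured with the following sketch. It assumes that "first" means first in file order and that rows are reduced to (ID, OCC, covariate value) triples; the function name and the sample values are made up for the illustration.

```python
def effective_covariates(rows):
    """Map each (ID, OCC) pair to the first covariate value declared for it.

    Assumption: "first declared" is taken in file order. Later values for
    the same subject-occasion are ignored, as described in the answer above.
    """
    first = {}
    for subject, occasion, value in rows:
        first.setdefault((subject, occasion), value)
    return first


# Hypothetical weight column that changes within subject 1, occasion 1
rows = [(1, 1, 70.0), (1, 1, 72.5), (1, 2, 73.0), (2, 1, 65.0)]
covariates = effective_covariates(rows)
print(covariates)
```

Here the value 72.5 never reaches the model, which is why a truly time-varying quantity is better handled as a regressor.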

• Are the MDV and EVID columns necessary? These columns are not mandatory and most of the time not necessary.
• Which values are allowed for EVID? EVID can take the values EVID=0 (observation), EVID=1 (dose) and EVID=4 (reset followed by a dose). EVID=2 and EVID=3 are not supported.
• How can I define a time at which I wish to output the predictions, even if I have no observation? Use MDV=2 for this purpose.

4.2.Translating your dataset from NONMEM format to the Monolix Suite format

The required format for the data set in NONMEM and in the Monolix Suite is very similar. Usually only a few changes (if any) are required to go from one format to the other.

General formatting

• Column names: in the Monolix Suite, column names are not restricted in length and are not restricted to uppercase. Yet, only alphanumeric characters and the underscore “_” are allowed. Special characters such as spaces “ ”, stars “*”, parentheses or brackets “(”, dashes “-”, and slashes “/” are not supported.
• Header line: there is no need to start the header line with the “#” character in the Monolix Suite; the column header line is recognized automatically.
• Number of columns: there is no limitation on the number of columns in the Monolix Suite.
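As an illustration of the column-name rule, the sketch below rewrites NONMEM-style headers so that they contain only alphanumeric characters and underscores. The function name `clean_header`, the choice of "_" as replacement character, and the restriction to ASCII letters and digits are assumptions made for this example.

```python
import re


def clean_header(name):
    """Return a header name made only of ASCII letters, digits and "_".

    Hypothetical helper: a leading "#" (NONMEM-style header marker) and
    surrounding blanks are dropped, every other unsupported character is
    replaced by an underscore.
    """
    stripped = name.strip().lstrip("#").strip()
    return re.sub(r"[^A-Za-z0-9_]", "_", stripped)


headers = ["#ID", "TIME", "Dose (mg)", "body-weight"]
cleaned = [clean_header(h) for h in headers]
print(cleaned)
```

After such a clean-up it is worth checking that two different headers have not collapsed to the same name.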

Dose column-types

• SS column: SS=2 and SS=3 are not supported in the Monolix Suite. When a data set contains a column with column-type SS, there must be a column with column-type II. If SS=1, then the value in the II column must be strictly positive. For steady-state doses, steady-state formulas are not used. Instead, additional doses (5 by default) are added before the SS dose to reach steady-state.
• RATE and TINF: in case of an infusion, in the Monolix Suite, it is possible to define either the rate (RATE column-type) or the duration (TINF column-type). The rate and the duration are related to each other via the amount: TINF=AMT/RATE. Negative values in the RATE column-type result in a bolus when used in combination with the iv macro (and models from the library with iv). When used in combination with the oral macro (and models from the library with oral0 or oral1), the RATE column is ignored if the value is negative and an error is triggered if the value is positive. If the infusion duration is defined in the model (as a parameter or a fixed value), the RATE column is not necessary (unlike in NONMEM, where RATE=-1 and RATE=-2 are used).
• CMT column: in NONMEM, for observation-lines, CMT specifies the compartment from which the predicted value of the observation is obtained. For dose-lines, CMT specifies the compartment into which the dose is introduced. In the MonolixSuite, the matching between the data (dose and observation lines) and the model (administrations and predictions) is done using identifiers, not based on compartment numbers. To assign a dose to a specific administration of the model (oral or iv macros for classical PK models, depot macro for more complex ODEs), the column ADM is used. The identifier in the ADM column should match the “type=” field of the macro. To assign an observation to a prediction, the column YTYPE is used. Observation lines with YTYPE=1 will be assigned to the first output (output = {…} statement in the model file), lines with YTYPE=2 to the second output, etc. Note that the default value for the administration type in administration macros or in the pkmodel macro is type=1. Similarly, in case of a single output, YTYPE=1 by default (while in NONMEM, the central compartment may have number 1 or 2). In the ADM column-type, negative values are not allowed. Turning off compartments should instead be defined in the model file.
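For the RATE and TINF item, the conversion between the two columns can be sketched as follows, taking the infusion duration as the amount divided by the rate. The function name is hypothetical, and rejecting non-positive rates is a choice made for this example, since those values do not describe an infusion of finite duration.

```python
def infusion_duration(amt, rate):
    """Return the infusion duration (TINF) for a dose AMT given at RATE.

    Sketch only: duration = amount / rate. Non-positive rates are rejected
    here because they do not define a finite infusion duration.
    """
    if rate <= 0:
        raise ValueError("RATE must be strictly positive to derive TINF")
    return amt / rate


# 100 units infused at 50 units per time unit last 2 time units
print(infusion_duration(100, 50))
```

The result is expressed in the time unit implied by RATE, so AMT, RATE and TIME must use consistent units.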

Control and event column-types

• EVID column-type: in the Monolix Suite, the EVID column is not mandatory, since dose-events (EVID=1) and response-events (EVID=0) are automatically recognized. Note that the Monolix Suite does not recognize EVID=2 (“Other event”) and EVID=3 (“Reset event”), but recognizes EVID=4 (which corresponds to a reset to initial values immediately followed by a dose). EVID=4 creates a new occasion for the individual. In NONMEM, EVID=2 is sometimes used to define a time point at which one would like to predict a concentration, without having an observation. In the Monolix Suite, this is done using MDV=2 (see below).
• MDV column-type: the MDV column is not mandatory in the Monolix Suite. Dose-lines and observation-lines will be recognized automatically. Yet the MDV column can be useful to force a response-line to be ignored (MDV=1). Several MDV columns are allowed; in this case a synthetic MDV value is computed. In the Monolix Suite, MDV can in addition take the value MDV=2, which makes it possible to define a time point (and possibly a regressor value) at which a prediction is output without having a corresponding observation. In Monolix, the time points tagged MDV=2 will for instance appear in the table “fulltimes.txt”, which is written when selecting “All times” in the “Outputs to save” window.
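Before translating a NONMEM file, it can help to locate the lines whose EVID value is not recognized. The sketch below only reports them; the function name is an assumption, and how each reported line should be rewritten (for instance with MDV=2 for prediction-only time points) remains a decision for the user.

```python
SUPPORTED_EVID = {0, 1, 4}  # observation, dose, reset followed by a dose


def unsupported_evid_lines(evid_values):
    """Return (line_index, evid) pairs for EVID values that are not supported.

    Hypothetical helper: line_index is the 0-based position in the EVID
    column, without counting the header line.
    """
    return [
        (index, value)
        for index, value in enumerate(evid_values)
        if value not in SUPPORTED_EVID
    ]


evid_column = [0, 1, 2, 4, 3]
print(unsupported_evid_lines(evid_column))
```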

Response column-types

• Censored data: in the Monolix Suite, censored data should be tagged in the data set using additional columns with the CENS (mark as censored observation) and, if necessary, LIMIT (give the other interval boundary) column-types. The LOQ value is indicated in the Y column. Censored data are then automatically taken into account in the likelihood in a rigorous statistical way. If only the CENS column is used, the method in the MonolixSuite is equivalent to the so-called M3 method. When both CENS and LIMIT are used, the method is equivalent to M4.
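The encoding of BLQ values can be sketched as below. The example assumes the convention CENS=1 for an observation below the LOQ and CENS=0 otherwise, and it assumes that a missing measurement (None) stands for a BLQ result; the function name is hypothetical and the LIMIT column is not covered.

```python
def encode_blq(value, loq):
    """Return the (Y, CENS) pair for one measurement.

    Assumptions for this sketch: value is None or below loq when the
    measurement is BLQ, CENS=1 marks a censored observation, and the LOQ
    is written in the Y column for censored lines.
    """
    if value is None or value < loq:
        return (loq, 1)
    return (value, 0)


print(encode_blq(0.4, 1.0))   # censored: Y holds the LOQ
print(encode_blq(2.3, 1.0))   # regular observation
```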

Subject identification column-types

• ID column-type: in NONMEM, all lines related to a single individual must be in one block, which is not the case in the Monolix Suite. If the ID column contains the following IDs: [1,1,1,2,2,1,1], NONMEM will consider that the dataset comprises three individuals with IDs 1 (with 3 observations), 2 (with 2 observations) and 1_1 (with 2 observations). In the Monolix Suite, two individuals are considered, with IDs 1 (with 5 observations) and 2 (with 2 observations).
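The difference between the two grouping rules can be reproduced with a short Python sketch on the ID column of the example above. This only mimics the counting described in the text; it is not how either software is implemented.

```python
from collections import Counter
from itertools import groupby

ids = [1, 1, 1, 2, 2, 1, 1]

# NONMEM-like rule: every contiguous block of identical IDs is one individual
contiguous_blocks = [(key, len(list(group))) for key, group in groupby(ids)]

# Monolix-like rule: all lines sharing an ID belong to the same individual
merged_individuals = dict(Counter(ids))

print(contiguous_blocks)
print(merged_individuals)
```

The first result contains three blocks (the last one being the second block of ID 1), the second one contains two individuals.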

Time column-types

• TIME column-type: values in the time column can be negative in the Monolix Suite.

Covariates and regression column-types

• Covariates: in the Monolix Suite, columns corresponding to continuous covariates must be set to the COV column-type, and columns corresponding to categorical covariates to the CAT column-type.
• Regression variables: in the Monolix Suite, regression variables must be set to the X column-type. If several regression variables are used, their order must be the same in the dataset and in the “input” field of the model file.

Unsupported column-types

• The PCMT, CONT, CALL, MRG_, RAW_, RPT_, L1, and L2 column-types are not supported in the MonolixSuite.