TREMIN DATA - An analyst's user manual
Quick links
TREMIN public web site -- where you get the data and much documentation
Variable Browser -- an online browser of all variables in the TREMIN database.
TREMIN Discussion -- online discussion group for announcements, questions, etc.
Table of Contents
- TREMIN DATA - An analyst's user manual
- Quick links
- Resources
- Final report
- TREMIN web site
- TREMIN discussion list-serve
- Codebooks
- Overview Tool: Variable Browser
- Variable Browser introduction
- Description of variable
- Data categories
- Analysis of variable
- Missing data
- Variable List
- Data checking of variable
- Common tasks and some tips
- Access to non-public variables
- Creating a relational database
- Creating a composite event file
Resources
Final report
The Final Report about the TREMIN conversion project is an overview of
the work performed to make the TREMIN data available to
researchers, the general structure of the database, and the types of
resources available. In addition, that report lists the primary sources
of information about TREMIN. Read that MS Word document here.
TREMIN web site
The TREMIN website is the main
web location to download data and many pieces of available
documentation. It requires a password from the TREMIN
director.
Getting the data from this site takes four steps: 1) navigate
to the download area, 2) click on the file you want to download, 3)
select the variables you want in the downloaded file (or leave the
selection options as they are to get all variables, which is recommended), and 4)
click on the 'Download as Spreadsheet' button. The server will
then download a flat file to your computer in comma-delimited
format. This format is a common, all-text format that can be read
by virtually any database, spreadsheet or statistical package.
Note that some of these files are VERY large and may take a while to
process and download. In particular, the events files are
huge: calevts.csv is 76 Mb and pre80calevt.csv is 226 Mb.
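Because the events files are so large, it can help to read them one row at a time rather than loading the whole file into a spreadsheet. The following is a minimal Python sketch, not part of the TREMIN tools. It assumes the downloaded file starts with a header row of variable names; the column names and sample values shown are made up for illustration.

```python
import csv
import io

def iter_rows(handle, wanted=None):
    """Yield one dict per CSV row, keeping only the columns in `wanted`."""
    reader = csv.DictReader(handle)
    for row in reader:
        if wanted is None:
            yield row
        else:
            yield {name: row[name] for name in wanted}

# Tiny stand-in for a downloaded file. With a real download, use
# open("calevts.csv", newline="") in place of io.StringIO(...).
sample = io.StringIO("pid,major,minor\n101,1,0\n102,94,0\n")
rows = list(iter_rows(sample, wanted=["pid", "major"]))
```

All values arrive as text, so convert them to numbers only after missing-data codes have been handled.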
TREMIN discussion list-serve
Starting in February 2005, a web-based discussion tool was made available to
all researchers using the TREMIN data. It is by invitation only
and not publicly readable. This forum can be used to ask questions
of other researchers, post announcements, propose research ideas to
interested people, etc. The forum is reachable at TreminDiscussion.
Codebooks
Codebooks in the TREMIN
collection of data and documentation are typically either PDF files of
the actual survey forms with handwritten codes or Word documents with
summaries of the codes for a particular data file. For many
variables, codes are also shown directly in the Variable Browser. Codebooks of the surveys can be found at the TREMIN web site in the 'Surveys' section.
Overview Tool: Variable Browser
Variable Browser introduction
The
Variable Browser is an extensive web-based browser of all variables of
the TREMIN
database. It includes basic statistics, descriptions, and
limitations of these variables. The three frames of the browser
are: 1) the upper left, where you select the set of variables of
interest (usually one of the data files); 2) the lower left,
where you select which of the variables in that set you want to examine;
and 3) the right frame, which shows the summary of that variable.
The size of the frames can be adjusted by clicking and dragging the
bars that separate them.
Description of variable
For many variables we have
tried to supply a short description. These will often be enough
for you to determine if the variable is useful for your research
question. However, if you decide that a variable may be useful,
especially in the Health Report Surveys (HRF), it is important to
carefully examine the full question that was actually asked of the
participants -- the Browser only supplies a paraphrase of the survey
question. You can do that by referring to the PDF versions of
the original surveys. You may find that questions (or how answers are
coded) vary slightly from year to year, and this may have
important implications for your interpretation of the data. For
the Mid-life Women Health Survey (MLWHS), it was not feasible to provide
a text description of all 1200 questions. To identify the
questions of interest in that survey, the best strategy is to start
with the PDF of the original survey, find the questions of interest,
note
their codenames, and then look them up in the Variable Browser -- the
MLW variables are indexed both by the variable names in the file and
the codebook names in the codebook (click on the 'MLW Codebook' link in
the upper left panel).
Another note about the HRF surveys: the year in the file name has
different meanings. For some years (1980-1997) it labels the year(s)
that the participant is answering questions about; for other years
(2000-2002) it indicates the year in which the survey was completed
and returned. For the year(s) that the survey actually covers,
look at the 'Survey Year' field in the Variable Browser.
Data categories
All variables in the TREMIN
database have been assigned a data type that best fits their
characteristics. In most cases, these 'types' are
uncomplicated. For example, for the type called 'zeroone', we
expect data to be a '1' (typically meaning 'yes'), a '0' (typically
meaning 'no') or some code for missing data. 'Counting_number' is
a type that can contain integers greater than or equal to 0.
(Cells in a variable different from what we expect are called
'non-conforming' cells; more on this below.) But some types are
more complicated with dozens of possible codes. For example, the
'evt_major' code has close to 100 code possibilities, each with a
unique meaning. The Browser displays the actual number of data cells
that use each of the unique possibilities and, when feasible, a short
description of what the possibility means.
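If you want to reproduce this kind of per-code tally on your own extract, a frequency count is enough. The sketch below is only an illustration of the idea, not the Browser's actual code; the sample column is invented.

```python
from collections import Counter

def code_frequencies(values):
    """Count how many cells hold each distinct value, most common first."""
    return Counter(values).most_common()

# Hypothetical 'zeroone' column with one completely blank cell.
cells = ["1", "0", "1", "", "1"]
freqs = code_frequencies(cells)
```

Comparing such a tally against the codes listed in the Browser is a quick way to spot unexpected values in a downloaded file.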
Analysis of variable
Analysis
of most numeric variables was of two types: first, the frequency of
each unique data category within the variable type (mentioned above)
and second, a graphical analysis of the distribution of the data in the
variable.
Missing data
Most variables in the TREMIN dataset have some
missing cells, that is, cells that do not have valid data. There are various reasons why data may be
missing for a given cell: a participant did not reply to a question, a data
value is not appropriate for a given record, a data value was illegible or
otherwise unrecoverable, or other such reasons.
Some missing codes were added during data entry, others during data
cleanup. Throughout the history of the TREMIN
project, different missing codes have been used in different places for missing
data. Therefore, for a given variable,
there may be several potential missing value designators. For example, sometimes ‘9999’ is used for
missing data in the MLWHS dataset, ‘0’ is used throughout the HRF data and
blank space (or completely empty cells) and ‘.’ (a single dot) are common
missing designators throughout all of TREMIN.
Because of the tremendous heterogeneity of data types in TREMIN, we have
not tried to force a consistent missing type across the database. Rather, we have kept the diverse missing
types and given analysts the tools to determine which are used for a given
variable.
To determine which missing data designators are used for
your variables of interest, use the Variable Browser: the browser reports which missing types are found in a variable and the overall count of
missing designators in that variable. Note
that completely blank cells are always considered
missing values, but are not included in the count of missing designators or in
the list of designators. Therefore, the
count of designators may be zero even though there is a non-zero
count of missing values for a variable.
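Once you know which designators apply to a variable, you can map them to a single missing value in your own analysis copy. The Python sketch below is one possible approach, not an official TREMIN routine. The set of designators passed in is an assumption that you should replace with whatever the Variable Browser reports for the variable in question.

```python
def to_missing(value, designators):
    """Return None for blank cells and listed designators, else the value."""
    stripped = value.strip()
    if stripped == "" or stripped in designators:
        return None
    return stripped

# Assumed designators for a hypothetical MLWHS variable.
mlw_codes = {"9999", "."}
cleaned = [to_missing(v, mlw_codes) for v in ["12", "9999", " ", ".", "0"]]
```

Here '0' is kept as data because it was not listed as a designator for this variable; for an HRF variable where '0' means missing, add it to the set.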
Variable List
A single file with a list of
all variables is also available to make scanning for particular
variables more convenient. An html version is found here, and a comma-delimited version (that you can sort, for example) is available here.
Data checking of variable
In the process of compiling the Variable Browser, all data are checked
against their expected data types and ranges. Those results are
displayed in the browser for each variable, but they are also summarized in a single file, here.
These results include a summary of all non-conforming cells within a
variable. Note that most of these non-conforming cells are small
deviations, probably created during the data entry process. Most,
if not all, are beyond our ability to correct, and we leave it up to
the analyst to decide how to handle them. In most cases, we suspect
that you will want to convert these non-conforming cells to 'missing'.
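Converting non-conforming cells to missing can be sketched as follows. This is an illustrative Python example covering only the two data types described earlier in this manual ('zeroone' and 'counting_number'); the rules used by the Variable Browser itself may differ, and cells already set to missing are represented here as None.

```python
def conforms(value, data_type):
    """Check a non-missing cell against two of the types named in this manual."""
    if data_type == "zeroone":
        return value in ("0", "1")
    if data_type == "counting_number":
        # Integers greater than or equal to 0, written with ASCII digits.
        return value.isascii() and value.isdigit()
    raise ValueError(f"type not handled in this sketch: {data_type}")

def non_conforming_to_missing(values, data_type):
    """Replace non-conforming cells with None and report how many changed."""
    cleaned = [v if v is None or conforms(v, data_type) else None for v in values]
    changed = sum(
        1 for old, new in zip(values, cleaned) if old is not None and new is None
    )
    return cleaned, changed

# The letter 'l' stands in for a typical data entry slip.
cleaned, changed = non_conforming_to_missing(["0", "1", "2", None, "l"], "zeroone")
```

Keeping the count of changed cells lets you compare your result with the non-conforming counts reported in the summary file.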
Common tasks and some tips
Access to non-public variables
The TREMIN database
contains much information of a sensitive nature. To protect the
privacy of participants, several precautions have been taken.
- First, access to the
data is limited to researchers given permission by the TREMIN
administrators.
- Second,
for most research needs, the dataset made
available to researchers (via the TREMIN website) has several variables
removed that could be used to identify the participants -- for example,
date of birth. (However, the more coarse measure of age at a
particular time in the survey is still available.) The code used
to identify participants within the
TREMIN program ('cardnumber') is also not included and is replaced with
a Public Identification Number ('pid') so that researchers can
still track individuals throughout the dataset. For research questions
that require access to more specific participant information, such as
date of birth, datasets with additional variables can be requested from
the TREMIN administrators. Note, however, that access to such
data is contingent on certification in NIH's Human Subjects training.
- Finally, 'cardnumber' information, that is, the
participant's actual name, etc., is available only to the TREMIN
administrators.
Creating a relational database
Although the data are served from the TREMIN web site as flat files,
loading them into a relational database (such as Microsoft Access or
MySQL) is quite simple. Each downloaded file can be
treated as a table in a relational database, and the key for joining
tables is simply the Public Identification Number (pid). The pid
is attached to all records throughout the database. Note,
however, that this will create a VERY large database file.
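As an illustration, the sketch below loads two small stand-in files into SQLite (used here only because it ships with Python; Access or MySQL would work along the same lines) and joins them on pid. The table names follow the file names mentioned in this manual, but the 'smoker' column and all sample values are invented, and every column is stored as text for simplicity.

```python
import csv
import io
import sqlite3

def load_table(conn, table, handle):
    """Create `table` from a CSV handle, storing every column as text."""
    reader = csv.reader(handle)
    header = next(reader)
    cols = ", ".join(f'"{name}" TEXT' for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    # An index on pid speeds up joins; this assumes the file has a pid column.
    conn.execute(f'CREATE INDEX "idx_{table}_pid" ON "{table}" (pid)')

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
load_table(conn, "hrf1980", io.StringIO("pid,smoker\n101,1\n102,0\n"))
load_table(conn, "calevt", io.StringIO("pid,major\n101,1\n101,2\n102,1\n"))
joined = conn.execute(
    "SELECT e.pid, e.major, h.smoker FROM calevt AS e "
    "JOIN hrf1980 AS h ON h.pid = e.pid ORDER BY e.pid, e.major"
).fetchall()
```

Note that a pid can appear on many rows of an events table, so the join returns one row per event.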
Creating a composite event file
As you probably have
discovered by now, there are two events files: pre1980 and
post1980. We have not merged these two into a single events file
because the data sources were different enough that in many cases we
needed to treat the files differently. However, the formats of
the two files are the same and for some analysis purposes, it makes
sense to create a composite file.
To create a single event file, you must first do one of two things: recode the records with major = 93
and 94 in calevt, OR remove or recode the records in pre80calevt
that carry the one conflicting code (major = 94). This is necessary because pre80calevt has
single 'end of record' events (major = 94) without matching 'start of
record' events. Also, be aware that some event codes in the pre80 file
could not be recoded to the post80 code scheme: in particular, major
codes 5-8 are bleeds that do not have an equivalent meaning in
the post80 context (look for the 'old UTAH code' label in the Variable Browser when examining the 'major' variable).
Furthermore, watch for other duplicate events.
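The second option (removing or recoding the conflicting major = 94 records in pre80calevt before combining) might be sketched like this in Python. This is an assumption-laden illustration rather than a prescribed procedure: the 'source_file' column is an invented label so that you can still tell the two origins apart, rows are represented as dictionaries of text values, and whether to drop or recode is left to you.

```python
def build_composite(pre80_rows, post80_rows, pre80_recode):
    """Concatenate event rows, recoding conflicting pre-1980 major codes.

    `pre80_recode` maps an old major code to its replacement; a value of
    None drops the record instead of recoding it.
    """
    combined = []
    for row in pre80_rows:
        major = row["major"]
        if major in pre80_recode:
            if pre80_recode[major] is None:
                continue  # remove the conflicting record
            row = {**row, "major": pre80_recode[major]}
        combined.append({**row, "source_file": "pre80calevt"})
    for row in post80_rows:
        combined.append({**row, "source_file": "calevt"})
    return combined

# Invented sample rows; real files carry many more columns.
pre80 = [{"pid": "101", "major": "1"}, {"pid": "101", "major": "94"}]
post80 = [{"pid": "101", "major": "93"}, {"pid": "101", "major": "94"}]
composite = build_composite(pre80, post80, {"94": None})
```

To recode rather than drop, pass a replacement code that is unused in both files, for example `{"94": "X"}` where "X" is a placeholder for a code of your choosing.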
In some years, there are a large number of
duplicate events. This is mostly an issue for the participants in the
MLW survey because of the way that they recorded data on the
calendar cards (for some months they recorded events on two separate
cards). These duplicates have not been purged from the dataset
because 1) we wanted the data files to accurately represent what the
participants recorded and 2) the events are often not identical.
For this latter issue, 'duplicate' events may have the same odate, cdate, and major codes but differ in minor
codes. We did not want to make the decision for the analyst of
which of the two events was 'correct.' In general, however,
we feel that keeping the data recorded on the earlier calendar
card and discarding the later data is the best strategy for filtering
these duplicates (look in the sourceminor variable for the year of the calendar card).
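One way to apply that strategy is sketched below in Python. It treats events as duplicates when pid, odate, cdate and major all match, and keeps the one from the earliest calendar card. How the card year is stored is an assumption here: the example pretends that sourceminor holds a four-digit year as text, and the date strings are invented, so check the Variable Browser for the actual coding before using anything like this.

```python
def drop_duplicate_events(rows, card_year):
    """Keep one event per (pid, odate, cdate, major), preferring the earliest card.

    `card_year` is a function that returns the calendar-card year for a row.
    """
    kept = {}
    for row in rows:
        key = (row["pid"], row["odate"], row["cdate"], row["major"])
        if key not in kept or card_year(row) < card_year(kept[key]):
            kept[key] = row
    return list(kept.values())

# Invented sample: the first two rows differ only in minor and card year.
events = [
    {"pid": "101", "odate": "1985-03-01", "cdate": "1985-03-05",
     "major": "1", "minor": "2", "sourceminor": "1986"},
    {"pid": "101", "odate": "1985-03-01", "cdate": "1985-03-05",
     "major": "1", "minor": "0", "sourceminor": "1985"},
    {"pid": "101", "odate": "1985-04-01", "cdate": "1985-04-04",
     "major": "1", "minor": "0", "sourceminor": "1985"},
]
unique = drop_duplicate_events(events, lambda r: int(r["sourceminor"]))
```

Because the discarded rows may carry different minor codes, it is worth saving them to a separate file rather than deleting them outright.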
It is also very important to recognize that, beyond the bleed data, the
two files contain different levels of 'health' events. All of the
health data that could be gleaned from the HRF surveys before 1980 have
been incorporated in the pre80calevt file. But the post 1980
health data from the HRF surveys are contained in the separate HRF data
files (hrf1980 through hrf2002) and NOT directly in the calevt
file. If you need to use these HRF data, it may be convenient to
keep the events files separate.