TREMIN DATA - An analyst's user manual

Quick links

TREMIN public web site -- where you get the data and much documentation
Variable Browser -- an online browser of all variables in the TREMIN database.
TREMIN Discussion -- online discussion group for announcements, questions, etc.

Table of Contents
  1. TREMIN DATA - An analyst's user manual
    1. Quick links
    2. Resources
      1. Final report
      2. TREMIN web site
      3. TREMIN discussion list-serve
      4. Codebooks
    3. Overview Tool: Variable Browser
      1. Variable Browser introduction
      2. Description of variable
      3. Data categories
      4. Analysis of variable
      5. Missing data
      6. Variable List
      7. Data checking of variable
    4. Common tasks and some tips
      1. Access to non-public variables
      2. Creating a relational database
      3. Creating a composite event file

Resources

Final report

The Final Report about the TREMIN conversion project is an overview of the work performed to make the TREMIN data available to researchers, the general structure of the database and the types of resources available.  In addition, that report lists the primary sources of information about TREMIN.  Read that MS Word document here.

TREMIN web site

The TREMIN website is the main web location to download data and many pieces of available documentation.  It requires a password from the TREMIN director. 
Getting the data from this site takes four steps: 1) navigate to the download area, 2) click on the file you want to download, 3) select the variables you want in the downloaded file (or leave the select options as they are to get all variables, which is recommended), and 4) click on the 'Download as Spreadsheet' button.  The server will then download a flat file to your computer in comma-delimited format.  This is a common, all-text format that can be read by virtually any database, spreadsheet or statistical package.  Note that some of these files are VERY large and may take a while to process and download.  In particular, the events files are huge:  calevts.csv is 76 MB and pre80calevt.csv is 226 MB.
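Because the events files are so large, it can help to read them one record at a time instead of loading the whole file into memory.  The following is a minimal Python sketch of that idea using only the standard library.  It assumes that the first row of the downloaded file holds the variable names; the sample data and the function name count_rows_by_value are made up for illustration.

```python
import csv
import io


def count_rows_by_value(csv_file, column):
    """Stream a comma-delimited file and count records per value of one column."""
    counts = {}
    # Assumption: the first row of the download holds the variable names.
    reader = csv.DictReader(csv_file)
    for row in reader:  # only one record is held in memory at a time
        value = row[column]
        counts[value] = counts.get(value, 0) + 1
    return counts


# Tiny made-up sample standing in for a downloaded file such as calevts.csv.
sample = io.StringIO("pid,major\n101,1\n101,2\n102,1\n")
print(count_rows_by_value(sample, "major"))  # {'1': 2, '2': 1}
```

For a real download, open the file with open("calevts.csv", newline="") and pass the file object to the function in place of the sample.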

TREMIN discussion list-serve

Starting in February 2005, a web-based discussion tool was made available to all researchers using the TREMIN data.  It is by invitation only and not publicly readable.  This forum can be used to ask questions of other researchers, post announcements, propose research ideas to interested people, etc.  The forum is reachable at TreminDiscussion.


Codebooks

Codebooks in the TREMIN collection of data and documentation are typically either PDF files of the actual survey forms with handwritten codes or Word documents with summaries of the codes for a particular data file.  For many variables, codes are also shown directly in the Variable Browser.  Codebooks of the surveys can be found at the TREMIN web site in the 'Surveys' section.


Overview Tool: Variable Browser

Variable Browser introduction

The Variable Browser is an extensive web-based browser of all variables of the TREMIN database.  It includes basic statistics, descriptions, and limitations of these variables.  The browser has three frames: 1) the upper left, where you select the set of variables of interest (usually one of the data files); 2) the lower left, where you select which of the variables in that set you want to examine; and 3) the right frame, which shows the summary of that variable.  The size of the frames can be adjusted by clicking and dragging the bars that separate them.

Description of variable

For many variables we have tried to supply a short description.  These will often be enough for you to determine if the variable is useful for your research question.  However, if you decide that a variable may be useful, especially in the Health Report Surveys (HRF), it is important to carefully examine the full question that was actually asked of the participants -- the Browser only supplies a paraphrase of the survey question.  You can do that by referring to the PDF versions of the original surveys.  You may find that questions (or how answers are coded) vary slightly from year to year, and this may have important implications for your interpretation of the data.  For the Mid-life Women Health Survey (MLWHS), it was not feasible to provide text descriptions of all 1200 questions.  To identify the questions of interest in that survey, the best strategy is to start with the PDF of the original survey, find the questions of interest, note their codenames, and then look them up in the Variable Browser -- the MLW variables are indexed both by the variable names in the file and by the codebook names in the codebook (click on the 'MLW Codebook' link in the upper left panel).

Another note about the HRF surveys: the year in the file name has different meanings.  For some years (1980-1997) it labels the year(s) that the participant is answering questions about, and for other years (2000-2002) it indicates the year in which the survey was completed and returned.  For the year(s) that the survey actually covers, look at the 'Survey Year' field in the Variable Browser.

Data categories

All variables in the TREMIN database have been assigned a data type that best fits their characteristics.  In most cases, these 'types' are uncomplicated.  For example, for the type called 'zeroone', we expect data to be a '1' (typically meaning 'yes'), a '0' (typically meaning 'no') or some code for missing data.  'Counting_number' is a type that can contain integers greater than or equal to 0.  (Cells in a variable that differ from what we expect are called 'non-conforming' cells; more on this below.)  But some types are more complicated, with dozens of possible codes.  For example, the 'evt_major' type has close to 100 possible codes, each with a unique meaning.  The Browser displays the actual number of data cells that use each of the unique possibilities and, when feasible, a short description of what the possibility means.
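To make the idea of conforming and non-conforming cells concrete, here is a small Python sketch for the two simple types named above.  It is an illustration only, not the code used to build the Browser: the function name classify_cell, the CHECKS table and the default missing code are assumptions.

```python
# One check per data type; each takes the cell text and returns True or False.
CHECKS = {
    # 'zeroone': only the codes '0' and '1' are expected.
    "zeroone": lambda cell: cell in ("0", "1"),
    # 'counting_number': integers greater than or equal to 0.
    "counting_number": lambda cell: cell.isascii() and cell.isdigit(),
}


def classify_cell(cell, type_name, missing_codes=(".",)):
    """Label one cell as 'missing', 'conforming' or 'non-conforming'."""
    text = cell.strip()
    # Missing codes are checked first; which codes apply to a given
    # variable is something the analyst has to supply.
    if text == "" or text in missing_codes:
        return "missing"
    if CHECKS[type_name](text):
        return "conforming"
    return "non-conforming"


print(classify_cell("1", "zeroone"))           # conforming
print(classify_cell("2", "zeroone"))           # non-conforming
print(classify_cell("-3", "counting_number"))  # non-conforming
print(classify_cell(".", "counting_number"))   # missing
```

Types with many codes, such as 'evt_major', would need a check against the full list of codes shown in the Variable Browser, which is not reproduced here.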

Analysis of variable

Analysis of most numeric variables was of two types: first, the frequency of each unique data category within the variable type (mentioned above) and second, a graphical analysis of the distribution of the data in the variable.

Missing data

Most variables in the TREMIN dataset have some missing cells, that is, cells that do not have valid data.  There are various reasons why data may be missing for a given cell: a participant did not reply to a question, a data value is not appropriate for a given record, a data value was illegible or otherwise unrecoverable, or other such reasons.  Some missing codes were added during data entry, others during data cleanup.  Throughout the history of the TREMIN project, different missing codes have been used in different places.  Therefore, for a given variable, there may be several potential missing value designators.  For example, ‘9999’ is sometimes used for missing data in the MLWHS dataset, ‘0’ is used throughout the HRF data, and blank space (or completely empty cells) and ‘.’ (a single dot) are common missing designators throughout all of TREMIN.  Because of the tremendous heterogeneity of data types in TREMIN, we have not tried to force a consistent missing type across the database.  Rather, we have kept the diverse missing types and given analysts the tools to determine which are used for a given variable.

To determine which missing data designators are used for your variables of interest, use the Variable Browser:  the browser reports which missing types are found in a variable and the overall count of missing designators in that variable.  Note that completely blank cells are always considered missing values, but are not included in the count of missing designators or in the list of designators.  Therefore, the count of designators may be zero even though there is a non-zero count of missing values for a variable.
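Once you know which designators a variable uses, you will usually want to map them all to a single missing value in your own software.  The Python sketch below shows one way to do that, and also counts blanks and designators separately in the way described above.  The function names and the sample cells are made up; the set of designators must come from the Variable Browser for each variable.

```python
def to_missing(cell, designators):
    """Return None for blank cells and missing designators, else the cell text."""
    text = cell.strip()
    if text == "" or text in designators:
        return None
    return text


def summarize_missing(cells, designators):
    """Count blank cells and designator cells separately."""
    blanks = sum(1 for c in cells if c.strip() == "")
    coded = sum(1 for c in cells if c.strip() != "" and c.strip() in designators)
    return {"blank": blanks, "designator": coded, "missing_total": blanks + coded}


# Made-up cells; '.' and '9999' are the designators assumed for this variable.
cells = ["34", "", ".", "9999", "  ", "51"]
codes = {".", "9999"}
print([to_missing(c, codes) for c in cells])  # ['34', None, None, None, None, '51']
print(summarize_missing(cells, codes))  # {'blank': 2, 'designator': 2, 'missing_total': 4}
```

Take care with designators that are also valid values elsewhere: ‘0’ is a missing code in the HRF data, so it should only be passed as a designator for variables where the Browser lists it as one.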

Variable List

A single file with a list of all variables is also available to make scanning for particular variables more convenient.  An HTML version is found here, and a comma-delimited version (that you can sort, for example) is available here.

Data checking of variable

In the process of compiling the Variable Browser, all data are checked against their expected data types and ranges.  Those results are displayed in the browser for each variable, but they are also summarized in a single file, here.  These results include a summary of all non-conforming cells within a variable.  Note that most of these non-conforming cells are small deviations, probably created during the data entry process.  Most, if not all, are beyond our ability to correct, so we leave it up to the analyst to decide how to handle them.  In most cases, we suspect that you will want to convert these non-conforming cells to 'missing'.

Common tasks and some tips

Access to non-public variables

The TREMIN database contains much information of a sensitive nature.  To protect the privacy of participants, several precautions have been taken. 

Creating a relational database

Although the data are served from the TREMIN web site as flat files, loading them into a relational database (such as Microsoft Access or MySQL) is quite simple.  All downloaded files can be treated as tables in a relational database, and the primary key for each table is simply the Public Identification Number (pid).  The pid is attached to all records throughout the database.  Note, however, that this will create a VERY large database file.
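As an illustration, the sketch below loads comma-delimited files into SQLite (used here only because it ships with Python; Access or MySQL would work along the same lines) and joins two tables on pid.  The table name people, the column cohort and the sample rows are made up.  The sketch indexes pid rather than enforcing uniqueness, on the assumption that some files (the events files, for instance) hold several records per participant.

```python
import csv
import io
import sqlite3


def load_table(conn, table, csv_file):
    """Create a table from a comma-delimited file whose first row holds column names."""
    reader = csv.reader(csv_file)
    header = next(reader)
    # Every column is stored as text; convert types later as needed.
    cols = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    marks = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    # Index pid so that joins between tables stay fast (requires a pid column).
    conn.execute(f'CREATE INDEX "idx_{table}_pid" ON "{table}" (pid)')
    conn.commit()


# Made-up miniature tables standing in for two downloaded files.
conn = sqlite3.connect(":memory:")
load_table(conn, "calevt", io.StringIO("pid,major\n101,1\n101,2\n102,1\n"))
load_table(conn, "people", io.StringIO("pid,cohort\n101,A\n102,B\n"))

rows = conn.execute(
    "SELECT p.pid, p.cohort, COUNT(*) FROM people p "
    "JOIN calevt e ON e.pid = p.pid "
    "GROUP BY p.pid, p.cohort ORDER BY p.pid"
).fetchall()
print(rows)  # [('101', 'A', 2), ('102', 'B', 1)]
```

To keep the database on disk, pass a file name to sqlite3.connect instead of ":memory:".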

Creating a composite event file

As you probably have discovered by now, there are two events files: pre1980 and post1980.  We have not merged these two into a single events file because the data sources were different enough that in many cases we needed to treat the files differently.  However, the formats of the two files are the same and for some analysis purposes, it makes sense to create a composite file.

To create a single event file, you must either first recode the records with major = 93 and 94 in calevt, OR remove or recode the records with the ONE conflicting code (major = 94) in pre80calevt.  This is necessary because pre80calevt contains unpaired 'end of record' events (major = 94) that have no matching 'start of record' events.  Also, be aware that some event codes in the pre80 file could not be recoded to the post80 code scheme: in particular, major codes 5-8 are bleeds which do not have an equivalent meaning in the post80 context (look for the 'old UTAH code' label in the Variable Browser when examining the 'major' variable).
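The recode-then-combine step can be sketched in Python as follows.  This is an illustration under stated assumptions, not a prescribed procedure: the manual does not say which replacement code to use, so the sketch takes a mapping supplied by the analyst, where a value of None removes the record.  The function name combine_events, the added source_file column and the sample rows are all made up.

```python
def combine_events(pre80_rows, post80_rows, pre80_recode):
    """Recode or drop conflicting pre-1980 major codes, then stack both files.

    pre80_recode maps an old major code to a new one; a value of None
    removes records with that code.
    """
    combined = []
    for row in pre80_rows:
        major = row["major"]
        if major in pre80_recode:
            new_major = pre80_recode[major]
            if new_major is None:
                continue  # remove the conflicting record
            row = dict(row, major=new_major)
        # Hypothetical extra column that records where each event came from.
        combined.append(dict(row, source_file="pre80calevt"))
    for row in post80_rows:
        combined.append(dict(row, source_file="calevt"))
    return combined


# Made-up records; real rows carry more variables (odate, cdate, minor, ...).
pre80 = [{"pid": "101", "major": "94"}, {"pid": "101", "major": "1"}]
post80 = [{"pid": "101", "major": "94"}]
events = combine_events(pre80, post80, {"94": None})
print(len(events))  # 2
```

Keeping a column such as source_file also makes it possible to tell the old pre-1980 bleed codes (5-8) apart from post-1980 codes after the files are combined.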

Furthermore, watch for other duplicate events.  In some years, there are a large number of duplicate events.  This is mostly an issue for the participants in the MLW survey because of the way that they recorded data on the calendar cards (for some months they recorded events on two separate cards).  These duplicates have not been purged from the dataset because 1) we wanted the data files to accurately represent what the participants recorded and 2) the events are often not identical.  On this latter point, 'duplicate' events may have the same odate, cdate, and major codes but differ in minor codes.  We did not want to decide for the analyst which of the two events was 'correct.'  In general, however, we feel that keeping the data that was recorded on the earlier calendar card and discarding the later data is the best strategy for filtering these duplicates (look in the sourceminor variable for the year of the calendar card).
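A minimal Python sketch of that filtering strategy follows.  It treats events as duplicates when pid, odate, cdate and major all match, and keeps the one from the earlier calendar card.  It assumes that sourceminor holds the card year in a form that can be read as an integer; check the Variable Browser before relying on that.  The function name and the sample rows are made up.

```python
def drop_duplicate_events(rows):
    """Keep one event per (pid, odate, cdate, major): the earliest calendar card."""
    best = {}
    for row in rows:
        key = (row["pid"], row["odate"], row["cdate"], row["major"])
        kept = best.get(key)
        # Assumption: sourceminor is the calendar card year, readable as an integer.
        if kept is None or int(row["sourceminor"]) < int(kept["sourceminor"]):
            best[key] = row
    return list(best.values())


# Made-up records: the first two differ only in minor code and card year.
rows = [
    {"pid": "101", "odate": "d1", "cdate": "d2", "major": "1", "minor": "3", "sourceminor": "1985"},
    {"pid": "101", "odate": "d1", "cdate": "d2", "major": "1", "minor": "4", "sourceminor": "1984"},
    {"pid": "102", "odate": "d1", "cdate": "d2", "major": "1", "minor": "3", "sourceminor": "1985"},
]
for row in drop_duplicate_events(rows):
    print(row["pid"], row["minor"], row["sourceminor"])
```

When two candidate duplicates come from the same card year, the sketch keeps whichever appears first in the file.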

It is also very important to recognize that, beyond the bleed data, the two files contain different levels of 'health' events.  All of the health data that could be gleaned from the HRF surveys before 1980 have been incorporated in the pre80calevt file.  But the post-1980 health data from the HRF surveys are contained in the separate HRF data files (hrf1980 through hrf2002) and NOT directly in the calevt file.  If you need to use these HRF data, it may be convenient to keep the events files separate.