FINAL REPORT
“Expanding a Web-Accessible Women’s Health Data Archive”
a. Specific Aims
The stated specific aim, to merge two parts of the TREMIN Research Program on Women’s Health and create a single, web-accessible electronic data archive,
remains unchanged and has now been achieved. Prior to the inception of this project, TREMIN had collected prospectively recorded menstrual diaries and annual
health report surveys from 1934 to present, but because of difficulties gaining access to and linking together much of the TREMIN data, this rich resource had been
under-utilized. To make the data accessible to as many researchers as possible, the data had to be cleaned, recoded and relabeled, and the variables linked across
the years of study. In addition, it was necessary to provide documentation about the data, and the process for gaining access to the data had to be simplified. At the
completion of this project, the data collected between 1934 and 1980 have now been incorporated with the post-1980 data to create an archive that can track the
menstrual cycle patterns and reproductive health of over 5000 women who have been TREMIN participants, many of them from their late teens through their
post-menopausal years. By accomplishing our aim, we have created one of the largest, if not the largest, data set of its kind in the world and have made it
accessible to qualified researchers.
b. Results
Overview
The TREMIN data are a collection of more than 3.5 million data records over many decades. The data consist of reports of menstrual bleeding recorded
daily by participants on a menstrual calendar card (MCC) and responses to questions on an annual health report form (HRF) primarily regarding reproductive health
status (e.g., births, birth control use, exogenous hormone use) and general health status (e.g., occurrence of diseases). Some of these records are single events, such
as a bleed event on a single day for a participant. Other records document the detailed response of a participant to a particular question on the HRF. While
connecting these disparate sources of data has huge promise for a number of research questions, until this project, bringing together these separate data sources was
too arduous for anything but small, manually created datasets. Indeed, the work to transform the TREMIN data into a coherent, easy-to-analyze dataset had to
overcome a large set of obstacles.
The long history of the TREMIN project includes periods when data collection and archiving were quite primitive by today’s standards. Through the
decades, as researchers learned more, the focus of data collection shifted with the most important questions of the era receiving the most attention in the data.
Furthermore, data have been passed among 3 institutions, a handful of directors, and hordes of actual data handlers. Additionally, the format of the data collection
has changed considerably over time and therefore one of the major goals of this effort has been to transform these different formats into a single comprehensive
format without losing any of the original content. Most researchers involved during many of these periods are long gone, so any questions we have had about the
data have had to be resolved by indirect means.
Despite these obstacles, and although the full dataset contains a heterogeneous mix of data types and, for some variables, uneven data integrity, the TREMIN dataset is
an unprecedented and unique resource for researchers of women’s menstrual health and promises to be a goldmine of valuable information for decades to come.
To minimize uncertainty and errors in the dataset, researchers on this project used several techniques. The primary work was done with a combination of
manual inspection of records and a suite of internal data checking scripts. Errors were corrected, often by returning to original data sheets (sometimes dating to the
1930s), and the data were made as internally consistent as possible. Furthermore, because 1) these data are available on a public web site (although through
password protection) and 2) some variables could be used by an analyst to identify a participant (variables such as date of birth), we have made the
decision to convert certain variables (for instance, birth date to age) and to withhold other variables, which are available upon special request.
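The birth-date-to-age conversion mentioned above can be illustrated with a minimal Python sketch. The function name and the use of whole completed years are assumptions made for this example; the report does not state how age was actually derived.

```python
from datetime import date

def age_in_years(birth_date: date, event_date: date) -> int:
    """Return the number of completed years between birth_date and event_date."""
    years = event_date.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in the event year.
    if (event_date.month, event_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years
```

Releasing an age at each event instead of the birth date keeps the timing information analysts need while removing a directly identifying variable.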
However, we recognized that error reduction and internal consistency were not the only requirements for a useable database. Our goal was to make the
TREMIN database a vital resource to a broad range of researchers and analysts. Therefore, in addition to this primary database creation, an analyst performed a
number of data manipulations and checks on the dataset – much like our target audience will do. This work was external to the primary database cleanup efforts
with the aim to uncover inconsistencies in data format, further errors in data and even inconvenient database structures. Furthermore, another target of this external
work was to document the plethora of variables in the dataset in a manner most helpful to analysts.
Creating the data archive
Data sources
The data for building the database come from 4 distinct physical sources. As a result, each file had unique issues to be dealt with.
The sources are
University of Minnesota (“Menstrual and Reproductive History Program”, years 1934-1984) and University of Utah (“Tremin Trust”, years
1984-1998): Data collected during these years are configured so that event data are records of when each event started and when each event stopped
occurring. Event records of vaginal bleeds generally consist of 2 records. The data from health surveys can also be represented as records in the data file,
where they can be represented by 1 or 2 records.
Pennsylvania State University (“TREMIN”, years 1998-present): Event records of vaginal bleeds are records for each day a bleed occurs; in the record
the bleed quantity, the date of the bleed, and a classification of the type of vaginal bleeding (e.g., menstrual bleeding, post-surgical bleeding) are given.
Other events are a single record containing a starting date and an ending date in the record. Generally the event record format is confined to the events
found on the annual Menstrual Calendar Card (MCC) used by the women to prospectively record every day of vaginal bleeding during the year and any
other events that may affect their menstrual cycles (e.g., surgery, extreme stress). Data from the Health Report Forms (HRF) form a separate file whose
structure is a record for each TREMIN woman with a variable for each response.
Pennsylvania State University, Midlife Women’s Health Survey (“MWHS”, years 1990-1998): In this special annual survey of midlife TREMIN
women’s health and menopause experiences and concerns, along with daily reports of menstrual bleeding on a calendar card, event records of vaginal
bleeds are records for each bleed cycle, up to a maximum of 13 days per bleed cycle. Each record contains the bleed quantity for a menstrual day (a code
from 1-6 on a scale developed by the author and colleagues), the date it occurs, as well as the total duration of the bleed cycle and the interval between
bleed cycles. Other noted events on the MCC were not recorded in the events file. The MWHS annual survey questionnaire is structured as one record for
each woman with variable(s) that correspond to the response(s).
Data Files
The preliminary data organization consisted of:
A master data file of the participant’s base information, such as date of birth, date of menarche, date of death (when applicable), childbirth dates, date of
menopause, and a list of years with the types of data the woman provided that year;
An events file - a longitudinal repository of the bleed events from all physical sources in the general day record arrangement for the bleed events and any
associated events already represented in their original format;
The HRFs in their original annual form;
A longitudinal data file of the most relevant health and menstrual-related variables from the HRFs from 1934 to present;
A longitudinal data file of the most relevant health and menstrual-related variables from the MWHS questionnaire from 1990 to 1998.
Data from each source were initially reviewed. It was clear that some of the original data structures would have to be modified. The data referring to the
bleed events were categorized into 2 types. Data collected at the Universities of Minnesota and Utah were for the period of 1934 to 1998. However, only the data
up until 1980 were entered at the time they were collected. The remaining data (beginning with the 1980 data) were entered at the Pennsylvania State University
when the TREMIN Program moved there in 1998 (with the exception that all the Midlife Surveys were collected and entered by Pennsylvania State University, from 1990-1998).
Thus, for this report, the data that had been entered prior to 1980 will be referred to as “Pre-1980 or Pre80”, while the Pennsylvania State University-
entered data will be referred to as “Post-1980 or Post80”. The distinction is important because different data entry procedures were followed for the two sets of data
and part of the current project was to reformat the data into a data set with the same structure.
Master data file
The first step was to assemble a master data file of all persons who have ever participated in the TREMIN Program. Besides the data from the various
parts of the project, the Master File contains a unique identification number for each participant, information on her date of birth, date of menarche, date of death if
applicable, race, her cohort (Cohort One was recruited in the 1930s, Cohort Two in the 1960s), family group number (if other relatives were participants),
indicators of the years she participated and the types of information provided. This data file helps identify erroneously entered records in the subsequent steps.
Generally a woman was added to the master data file if a minimum amount of data was provided and/or her vital information could be verified. This process was
time-intensive and required searching through paper files and microfiche.
Event data files
Pre-1980 event data file
The most problematic data files were the early data that had been stored on magnetic reel data tapes at the University of Utah; many of those tapes had
deteriorated to the point of not being readable by conventional methods. The largest data file was readable and this file contained the bulk of the bleed event
information, mostly from the MCCs but occasionally from the HRFs. The remaining data tapes provided extracts or insufficient data to be incorporated for this
particular era of data collection.
The data file revealed that there were 63 participants enrolled prior to the official start of the Program in 1934; it is thought that these were the years that
Alan Treloar, the Program’s first director, was formulating his study and testing his initial hypotheses in regard to collecting and analyzing women’s menstrual cycles.
The first significant recruitment (Cohort One) began in 1934-5 and the pre-1980 data file begins with 1935 data.
The pre-1980 data were cleaned in this order:
Isolate and pair off the on/off bleed event records
Identify non bleed events
Identify non-paired bleed event records
Identify all possible locational records
Identify the breaks in data history records
Identify the locational bleed events
Correct the breaks in data history associated with the locational records
Investigate non-paired bleed events
Re-examine non-bleed events for reporting errors and recode when possible
Investigate and correct locational records
Combine all the isolated parts back into one file in date order
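The first and third steps in the list above (pairing on/off bleed records and identifying non-paired records) can be sketched in Python as follows. The record layout, a (date, kind) tuple with kind equal to 'on' or 'off', is a hypothetical simplification for illustration and is not the actual pre-1980 file format.

```python
from datetime import date

def pair_bleed_records(records):
    """Pair each 'on' record with the next 'off' record for one participant.

    records: list of (date, kind) tuples, kind being 'on' or 'off'
             (hypothetical layout, assumed for this sketch).
    Returns (pairs, unpaired): pairs is a list of (start, end) dates,
    unpaired is a list of (date, kind) records with no partner.
    """
    pairs, unpaired = [], []
    open_start = None
    # Sort by date; on the same date, process 'on' before 'off' (one-day bleeds).
    ordered = sorted(records, key=lambda r: (r[0], 0 if r[1] == "on" else 1))
    for day, kind in ordered:
        if kind == "on":
            if open_start is not None:
                unpaired.append((open_start, "on"))  # earlier 'on' never closed
            open_start = day
        elif kind == "off":
            if open_start is None:
                unpaired.append((day, "off"))  # 'off' with no preceding 'on'
            else:
                pairs.append((open_start, day))
                open_start = None
    if open_start is not None:
        unpaired.append((open_start, "on"))
    return pairs, unpaired
```

Records returned in the unpaired list correspond to the cases that, in the project, required investigation against the paper and microfiche files.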
In some cases, the codes assigned to events in the pre-1980 file were not identical to codes assigned in the post-1980 data. In most cases, recoding to the
post-1980 code format was possible, but there were a few events that were unique and these were left with their original code assignment. Only one code conflict
arose with this decision, old code 94 – menopause confounded by hormone treatment versus new code 94 – the end of a calendar card. Also, the pre-1980 data
contained some records that were coded incompletely and required a great deal of searching through paper and microfiche files; correcting this problem proved to
be one of the most time-consuming aspects of preparing the data archive.
Bleeding information was coded differently for the pre-1980 and post-1980 data. Pre-1980 data contained more limited information: bleeds were not
described according to quality or quantity but only as a record of the dates a bleed event started and ended. Post-1980 bleed events were entered as a daily
record into the data file to provide more information about the quantity/quality of the bleed event. These 2 records had to be reconciled to
the post-1980 structure for continuity between the two event files.
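One way to picture the conversion from the pre-1980 start/end structure to the post-1980 daily structure is the following Python sketch. It is an illustration under stated assumptions, not the project's actual program: the function name is hypothetical, and because pre-1980 records carry no quantity or quality information, only the dates are expanded.

```python
from datetime import date, timedelta

def expand_to_daily(start: date, end: date):
    """Return one date per bleeding day, from start through end inclusive."""
    if end < start:
        raise ValueError("end date precedes start date")
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]
```

In a daily file built this way, any quantity field for pre-1980 days would have to be left missing, since that information was never recorded.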
Post-1980 Event data file
The post-1980 Events data file was easier to create; the post-1980 data were entered over a period of five years with NIH funding, a well-trained staff
and a uniform method of data entry. The computerized data entry greatly increased the accuracy of the post-1980 events data file.
MWHS Event data file
The Midlife Women’s Health Survey, begun in 1990, recruited midlife TREMIN participants and also a new sample of midlife women to keep a daily
menstrual diary and complete an annual survey focusing on midlife and menopause. The MWHS employed an expanded version of the TREMIN calendar card; for
each bleeding day, participants recorded an estimated quantity of menstrual blood loss. Initially, MWHS participants began recording their bleeding when they
received their calendar in the spring, rather than on January 1, the traditional beginning date for using the TREMIN calendar. This method resulted in some initial
confusion as data entry persons were sometimes unable to determine when entries began and ended in that year. The data were eventually restructured to conform
to the post-1980 event data file so that the files could be merged together. Although the return of a blank MCC card (indicating no bleeding for a year) was not
coded in the original file, provisions were made to capture the information so it could be merged into the MWHS event data file.
Common criteria applied to all the events files include:
Bleed events (menstrual periods) cannot overlap;
Bleed events cannot occur after a hysterectomy or pan-hysterectomy or death;
Menstrual bleeding during pregnancy is questionable and requires investigation;
Bad identification number records should be identified and, if possible, corrected;
Breaks in history and bleed events cannot overlap.
These criteria were applied to the TREMIN pre-1980, TREMIN post-1980, and Midlife Women’s Health Survey bleed event files. Conflicts were
investigated and corrections were made.
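Two of the criteria above (no overlapping bleed events, and no bleed events after a hysterectomy or death) lend themselves to a short Python sketch. The representation of a bleed event as a (start, end) tuple and the function names are assumptions for illustration; treating two events that share a day as overlapping is also an assumption.

```python
from datetime import date

def find_overlaps(episodes):
    """Return (earlier, later) pairs of bleed events whose date ranges overlap.

    episodes: list of (start, end) date tuples for one participant.
    Each event is compared with the event having the latest end date so far.
    """
    flagged = []
    latest = None  # event with the latest end date seen so far
    for ep in sorted(episodes):
        if latest is not None and ep[0] <= latest[1]:
            flagged.append((latest, ep))
        if latest is None or ep[1] > latest[1]:
            latest = ep
    return flagged

def bleeds_after(episodes, cutoff):
    """Return bleed events that start after a cutoff date (e.g., hysterectomy or death)."""
    return [ep for ep in episodes if ep[0] > cutoff]
```

Flagged events are candidates for investigation only; as described above, each conflict was investigated before any correction was made.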
HRF data files
TREMIN Health Report Form data file
To some extent, pre-1980 HRF data exist as part of the pre-1980 events data file. However, these data were not consistently available, and trying to
identify the source for an event record proved in many cases to be impossible or extremely time-consuming. On the other hand, the post-1980 HRF data were
entered accurately with data entry programs that mimic the appearance of the forms themselves and as a result the errors generated were few. Most of the errors
found were simple typing errors of dates and numbers, and these were quickly corrected. The HRF data file is structured as one record per participant for each
survey year response. Variables generally correspond to the question number with a few exceptions. Programs were written to check certain variables for extreme
values, to check for valid date entries and to check the skip patterns of the question responses. Records were identified by these programs for further investigation
of the data. Quite often the skip pattern flagged responses made by participants when they should have skipped a question, and corrections were made as warranted.
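The three kinds of checks described above (extreme values, valid dates, and skip patterns) might look like the following Python sketch. All field names, the 0 to 20 range, and the skip rule are hypothetical examples; the report does not list the actual variables or thresholds used.

```python
from datetime import datetime

def is_valid_date(text):
    """True if text parses as a real calendar date in MM/DD/YYYY form."""
    try:
        datetime.strptime(text, "%m/%d/%Y")
        return True
    except ValueError:
        return False

def check_record(record):
    """Return a list of problems found in one survey record (a dict)."""
    problems = []
    # Extreme-value check; the 0-20 bound is an illustrative assumption.
    if not 0 <= record["num_pregnancies"] <= 20:
        problems.append("num_pregnancies out of range")
    # Date check; rejects impossible dates such as 02/30/1985.
    if not is_valid_date(record["form_date"]):
        problems.append("form_date invalid")
    # Skip pattern: the follow-up must be blank unless the filter answer is "yes".
    if record["uses_hormones"] != "yes" and record["hormone_name"] != "":
        problems.append("hormone_name answered but should be skipped")
    return problems
```

A record that returns a non-empty list would be set aside for manual review, mirroring the flag-then-investigate procedure described above.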
Midlife Women’s Health Survey data file
The MWHS data were received in a tabular format. It did not appear that the data were entered through a data entry program. The errors that occurred
were typing errors in numeric variables and date variables. Programs were written to check certain variables for extreme values and to check for valid date entries
and the skip pattern responses. The flagged records from these programs were investigated and corrected as needed.
Longitudinal HRF & MWHS data files
The longitudinal data file for the HRF and MWHS is composed of the variables most tied to a woman’s menstrual bleeding characteristics. Besides status
information such as menstrual status, symptoms experienced, and pregnancy status, information on hormone use, birth control use, chronic illness, radiation
exposures, and stress indices, to name a few, are included from the years of HRF and MWHS data files.
Data preparations to post data to the web site
Initially all the data were prepared with variable labels which would offer considerable information as to what the label represents and, in the case of the
HRF data, the source/question number on the survey form. However, once the data were downloaded, it became apparent that most statistical programs would have
trouble interpreting the labels or would generate errors upon input. The labels were therefore removed and variable naming conventions were checked to ensure they
provided some basic information about the variable source.
The event data files and the longitudinal files are quite large and require a different program link to process the volume of data for downloading; however the
individual HRFs can be downloaded adequately with the original download program.
The process of checking each data file against the documentation for accurate descriptions and representation is entirely dependent upon visual inspection
and comparisons between data file content outputs and the documentation text. Corrections have been made as they are caught.
Finally, the data files will change as more data are added from each survey year processed, so the web site and database will continue to grow
and evolve as performance is improved to accommodate the growing repository of information.
There are currently two web sites that contain the TREMIN data. The first is a public informational web site maintained by the Population Research
Institute (PRI) at Pennsylvania State University. PRI houses a staff of data specialists, analysts, and other support personnel and is the site for numerous large,
multidisciplinary research projects on campus. The PI is a research associate in PRI. This web site explains the history of the TREMIN project and its design. It
also contains sample files: www.pop.psu.edu/tremin/index.htm. The second web site is password protected; it contains the TREMIN database and its
documentation and is the repository for the original HRF survey forms, codebooks and more. The URL for this web site, location of the electronic archive, is:
sodapop.pop.psu.edu/data-collections/tremin.
Once access is granted to the investigator, the password-protected TREMIN web site would be the gateway to the data files and information about
the TREMIN project through the years. An investigator would have the ability to
Download the data files, with some selection queries available;
Access the data file codebooks, which would describe each variable’s source, data format, the frequency of the possible values for the file and notes about
the variables;
Access the original survey forms for information on how and what types of data were collected;
Access a glossary of terminology used for this particular project;
Note changes in procedure, focus, etc., and any background information that might be important to an investigator;
Report data errors and view correction information disseminated through a link on the web site.
External Data Checking – from an analyst’s perspective
Creating external data structures
The target of this project was to make the TREMIN data easily available over the web. To do that, data files are served in a standard ‘flat-file’ format and
users can choose which variables (or all) to download. Once the data are downloaded, analysts will probably want to create a more accessible data structure that
works best with their research questions and analytical tools. We tested the TREMIN dataset with some common and custom type data structures. This range of
tests makes us confident that analysts will be able to transform the data to fulfill a variety of needs with minimal effort.
Relational and analytical databases
A Microsoft Access relational database was created easily by using the simple keys available in the TREMIN dataset. Importing these data into other
relational databases should be similarly uncomplicated. Furthermore, creating specific data structures in the common analytical tools SAS and S-Plus/R was a
simple task.
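To illustrate how simple keys allow the flat files to be loaded into a relational database, the following Python sketch uses the standard-library sqlite3 module as a stand-in for Microsoft Access. The table names, column names, and sample rows are invented for the example and do not reflect the actual TREMIN variable names.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE master (id INTEGER PRIMARY KEY, cohort INTEGER)")
conn.execute("CREATE TABLE events (id INTEGER, event_date TEXT, code INTEGER)")

# Invented sample rows standing in for the downloaded flat files.
conn.executemany("INSERT INTO master VALUES (?, ?)", [(1, 1), (2, 2)])
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "1940-01-03", 10), (1, "1940-02-01", 10), (2, "1965-06-12", 10)],
)

# Join the two tables on the participant identification number
# and count event records per cohort.
rows = conn.execute(
    "SELECT m.cohort, COUNT(*) FROM master m "
    "JOIN events e ON e.id = m.id "
    "GROUP BY m.cohort ORDER BY m.cohort"
).fetchall()
```

The same join on the participant identification number should carry over to other relational systems with little change.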
Custom structures
As part of the PI’s research program on sex and menopause, one analyst created a set of programs to transform the flat files into a custom designed
‘object-oriented’ data structure. The purpose of this structure was to easily collect and query the data from the perspective of the participants and to integrate more
completely with other custom tools the analyst had designed. The upshot of this work for this project was that the structure of the data available on the web was
quite amenable to such transformations and the few minor issues that the effort uncovered were fixed in the database.
Producing synthetic datasets to address specific questions
A handful of datasets were created from the master TREMIN dataset both to test specific research questions and to test the usefulness and integrity of
the database structure itself. These tests uncovered several aspects of the master database not easily found directly, and these were
subsequently corrected.
Graphical checks
Although manual and automatic checks of datasets of this nature are essential to ensure data integrity, visual inspection often uncovers additional issues
because graphs usually integrate several database components that are not easy to test directly. One integrated graphical check that was particularly useful was a
plot of the entire menstrual history of each participant, including bleeding events, births, and data drop-outs. This crosscutting snapshot was useful as a way to
quickly spot database problems, and incidentally was also very illustrative of the huge variation in menstrual patterns that exist.
Analytical summaries
Similar to graphical checks, analytical summaries pull together numerous database components and therefore indirectly test database integrity. Because
these summaries are numerically intensive to compute on such a large dataset, they are not something that most analysts will construct unless directly related to
targeted research questions. We performed them to build our confidence in the integrity of the database. For example, one useful test we performed was to
calculate and graph the World Health Organization’s recommended 90-day summaries of bleeding events.
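A simplified version of a 90-day summary is sketched below in Python. It only counts bleeding days within consecutive 90-day periods measured from a chosen start date; the report does not state which summary measures were actually computed, and the function name and parameters are assumptions for illustration.

```python
from collections import Counter
from datetime import date

def bleeding_days_per_period(bleed_days, start, period_length=90):
    """Count bleeding days in consecutive fixed-length periods beginning at start.

    bleed_days: iterable of dates on which bleeding was recorded.
    Returns a dict mapping period index (0 = first period) to a day count.
    Days before start are ignored.
    """
    counts = Counter()
    for day in bleed_days:
        offset = (day - start).days
        if offset >= 0:
            counts[offset // period_length] += 1
    return dict(counts)
```

For example, with a start date of January 1, 1950, a bleed on March 31 falls in the first period (index 0) and a bleed on April 1 falls in the second (index 1).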
Automated checks
Finally, automated tests have been used to test the data at several stages of the manual cleaning and transformation process. For every variable in the data
set (over 3000), the program verifies that every record is within expected range and of an appropriate format type and flags all inconsistencies. The results of these
checks are available to analysts in the Variable Browser of the online documentation.
Documentation
Introduction to the Variable Browser
A central piece of the documentation for TREMIN is an online variable browser. In addition to a quick description of all variables within the database, the
browser provides a data characterization of each variable. This added information has been generated in the recognition that much of an analyst’s initial effort with
databases is in understanding the scope, format, breadth and distribution of every variable that may be potentially useful to a research question. Without the
Variable Browser, such an effort is daunting for the TREMIN dataset because of the extremely large number of variables and often heterogeneous nature of the
data. By using the Variable Browser, a researcher can often quickly determine if a variable will be useful to a question without even downloading the dataset.
Web structure
The variable browser has a simple hierarchical, point-and-click structure on the web (we hope to make a downloadable version available that could be
installed on a user’s own computer). The user is presented with three frames. The user chooses which data file to examine in the first frame. This displays in the
second frame a list of all variables in that file. By clicking on any of those variables, a full analysis and documentation of the variable appears in the third frame.
This frame structure allows the user to more easily browse the huge number of variables in the dataset.
Data types
In an effort to perform detailed range and format checking, we defined a large set of ‘data types.’ A variable may actually contain more than one data type
(for example, a date variable will probably have some values of the ‘MM/DD/YYYY’ format, but also some missing value formats as well). Implicit in each data
type is a data range (for example, an ‘age’ data type is an integer that ranges from 0 to 98). For every variable, the browser displays which data types are used as
well as the counts of records within each type. For some central variables, such as the event classification variables, additional information is included such as the
range of years that the variable was actually used. This gives the analyst a quick overview of the usefulness of a particular variable.
Distributions
In addition to checking for ranges within each data type, the browser provides the graphical display of the numeric distribution for the dominant data type.
This gives the analyst a quick snapshot of the statistical characteristics of a variable and helps the analyst determine what sorts of transformations must be performed
on a variable.
Non-conforming data cells
As with most databases that span many decades, there are many cases where we cannot correct inconsistencies; for example, hard copy data records may be
illegible, original data entry personnel are long gone, and participants are either dead or not likely to remember what happened many decades ago. In many
cases, we have not removed these ambiguous data simply because we recognize that some research questions may find even these data valuable. Instead, our
procedure has been to make analysts aware that such ‘non-conforming’ data conditions exist in the dataset. A non-conforming datum is simply a piece of data that
does not conform to the data types that have been assigned to it. Many analysts will want to filter non-conforming data, so the variable browser lists all
non-conforming values for a particular variable. These can be used by an analyst as a filter in their data manipulation phase of analysis. All variables with non-conforming
data cells are flagged in the second frame of the browser for quick identification.
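The data-type counts and the list of non-conforming values described in this section can be pictured with the following Python sketch for a single date variable. The data type labels, the missing-value codes, and the function names are hypothetical, and the pattern checks format only (it does not verify that the date exists on the calendar).

```python
import re
from collections import Counter

def classify(value):
    """Assign one cell of a date variable to a data type (labels are hypothetical)."""
    if value in ("", "."):  # assumed missing-value codes
        return "missing"
    if re.fullmatch(r"\d{2}/\d{2}/\d{4}", value):  # format check only
        return "date MM/DD/YYYY"
    return "non-conforming"

def type_counts(values):
    """Count how many cells of a variable fall into each data type."""
    return dict(Counter(classify(v) for v in values))

def non_conforming(values):
    """Return the distinct non-conforming values, suitable for use as a filter."""
    return sorted({v for v in values if classify(v) == "non-conforming"})
```

An analyst could pass the result of non_conforming to a filtering step to exclude or separately examine those cells before analysis.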
Publicizing the web-accessible TREMIN data archive
Several researchers have contacted TREMIN about using the data for their research. Applications have been approved from Brandeis University, Boston
University and Hershey School of Medicine of Penn State University.
We have publicized this ongoing project at several national and college-wide conferences, including the Society for Menstrual Cycle Research biennial meeting in
Pittsburgh PA in June 2003 and a TREMIN conference at Penn State University in October, 2003. News of the completed data archive will be presented at a
TREMIN symposium at the upcoming meeting of the Society for Menstrual Cycle Research in Boulder, CO in June, 2005.
We are preparing two grant proposals to NIH; both propose to use the archival data set.
c. Publications
(Although these papers were written before the entire data archive was completed, we were able to extract certain data for the analyses as needed.)
Barsom, S, Mansfield, PK, Koch, PB, Gierach, G, & West, S (2004). Association between psychological stress and menstrual cycle characteristics in
perimenopausal women. Women’s Health Issues, 14(6), 235-242.
Koch, PB, Mansfield, PK, & Thureau, D (in press). ‘Feeling frumpy’: The relationship between body image and sexual response in midlife women. Journal of Sex Research.
Mansfield, PK, Carey, M, Anderson, A, Barsom, SH, & Koch, PB (2004). Staging the menopausal transition: Data from the TREMIN Research Program on Women’s Health.
Women’s Health Issues, 14(6), 220-226.
Mansfield, PK, Voda, A & Allison, G (2004). Validating a pencil-and-paper measure of perimenopausal menstrual blood loss. Women’s Health Issues, 14(6), 242-247.
Mansfield, PK & Bracken, S (2003). The TREMIN Program: Sixty-eight years of research on menstruation and women’s health. Women’s Studies Quarterly, 31(1,2), 25-41.
Mansfield, PK & Bracken, S (2003). TREMIN: A history of the world’s oldest ongoing study of menstruation and women’s health. Lemont, PA: East Rim Publishers.
Mansfield, PK, Koch, PB, & Gierach, G. (2003). Husbands’ support of their perimenopausal wives. Women and Health, 38(3), 97-112.
Matchock, R, Susman, EJ, & Brown, F. (2004). Seasonal rhythms of menarche in the US: Correlates to menarcheal age, birth age, and birth month. Women’s Health Issues, 14(6), 184-192.
d. Significance
Although the field of women's health has expanded dramatically in the last decade or two, there are still key gaps in knowledge and unresolved questions.
Some of the areas of research that investigators will be able to address now that the data are cleaned, coded, and readily available as an electronic archive include
menstrual cycle patterning, how perimenopausal menstrual patterns relate to prior experience and later-life health problems and experiences, and women's health and
aging. TREMIN data, made available by the electronic data archive, will play a significant role in resolving these issues. Specific questions to be addressed include:
What is the range of normal menstrual cycle variation during the menopausal transition? Are the menstrual change patterns related to particular health problems in
later life?
What are the underlying causal mechanisms for the observed relationship between menstrual cycle length and increased risk of breast cancer and coronary heart
disease? Do the hormonal characteristics of extremely long or extremely short cycles influence cancer risk?
When does the perimenopause begin and what are its underlying hormonal correlates?
What role do genes play in age at menarche, age at menopause, or other characteristics of the menstrual cycle and the menopausal transition?
What are the secular changes in characteristics of the menstrual cycle and menopausal transition?