Understanding the Hormonal Regulation of Mouse Lactogenesis by Transcriptomics and Literature Analysis

by Ling, Maurice Han Tong, BS

Bibliographical Information:

Advisor: Kevin R. Nicholas, Christophe Lefevre

School: University of Melbourne

School Location: Australia

Source Type: Doctoral Dissertation

Keywords: biomedical literature analysis, microarray, lactation, transcriptomics, hormonal activity

ISBN:

Date of Publication: 02/01/2010

Document Text (Pages 1-10)

Understanding the Hormonal Regulation of
Mouse Lactogenesis by
Transcriptomics and Literature Analysis

Maurice Han Tong LING, BSc(Hons)

Submitted in total fulfilment of the requirements
of the degree of Doctor of Philosophy

January 2010

Department of Zoology

Faculty of Science
The University of Melbourne
Australia

Abstract

The mammary explant culture model has been a major experimental tool for studying
hormonal requirements for milk protein gene expression as markers of secretory
differentiation. Experiments with mammary explants from pregnant animals from many
species have established that insulin, prolactin, and glucocorticoid are the minimal set of
hormones required for the induction of maximal milk protein gene expression.
However, the extent to which mammary explants mimic the response of the mammary
gland in vivo is not clear. Recent studies have used microarray technology to study the
transcriptome of the mouse lactation cycle. It was demonstrated that each phase of
mouse lactation has a distinct transcriptional profile, but making sense of microarray
results requires analysis of large amounts of biological information, which is
increasingly difficult to access as the amount of literature increases.

The first objective is to examine the possibility of combining literature and genomic
analysis to elucidate potentially novel hypotheses for further research into lactation
biology. The second objective is to evaluate the strengths and limitations of the murine
mammary explant culture for the study and understanding of murine lactogenesis. The
underlying question to this objective is whether the mouse mammary explant culture is
a good model or representation to study mouse lactogenesis.

The exponential increase in the publication rate of new articles is limiting researchers'
access to relevant literature. This has prompted the use of text mining tools to
extract key biological information. Previous studies have reported extensive
modification of existing generic text processors to process biological text. However, this
requirement for modification has not been examined. We have constructed Muscorian
using MontyLingua, a generic text processor. It uses a previously proposed two-layered
generalization-specialization paradigm, in which text is generically processed to a
suitable intermediate format before domain-specific data extraction techniques are
applied at the specialization layer. Evaluation using a corpus and experts indicated
86-90% precision and approximately 30% recall in extracting protein-protein interactions,
which was comparable to previous studies using either specialized biological text
processing tools or modified existing tools. This study also demonstrated the flexibility
of the two-layered generalization-specialization paradigm by using the same
generalization layer for two specialized information extraction tasks.
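
The two-layered paradigm described above can be illustrated with a minimal sketch. It is not Muscorian's implementation and does not use the MontyLingua API: the regular expression, the function names `generalize` and `extract_relations`, the example sentences and the entity list are all assumptions made for illustration. The generalization layer reduces text to subject-verb-object tuples, and two different specialization tasks then reuse the same tuples.

```python
import re

# Generalization layer (toy stand-in, hypothetical): turn each sentence into a
# (subject, verb, object) tuple. This regex only handles "X <verb> Y." sentences;
# a real system would use a full generic text processor at this step.
SVO_PATTERN = re.compile(
    r"^(?P<subj>.+?)\s+(?P<verb>binds|activates|inhibits)\s+(?P<obj>.+?)\.?$"
)


def generalize(text):
    """Split text into sentences and return generic lower-cased SVO tuples."""
    svos = []
    for sentence in re.split(r"(?<=\.)\s+", text.strip()):
        match = SVO_PATTERN.match(sentence)
        if match:
            svos.append(
                (match["subj"].lower(), match["verb"], match["obj"].lower())
            )
    return svos


# Specialization layer: domain-specific extraction over the same tuples.
def extract_relations(svos, verbs, entities):
    """Keep tuples whose verb is of interest and whose ends are known entities."""
    return [
        (s, v, o)
        for s, v, o in svos
        if v in verbs and s in entities and o in entities
    ]


# Hypothetical input text and entity list, for illustration only.
text = "Insulin activates AKT1. Prolactin binds PRLR. The explant inhibits nothing."
entities = {"insulin", "akt1", "prolactin", "prlr"}

svos = generalize(text)                                    # processed once
binding = extract_relations(svos, {"binds"}, entities)     # task 1
activation = extract_relations(svos, {"activates"}, entities)  # task 2
print(binding)
print(activation)
```

Because the intermediate tuples are task-independent, adding a new extraction task in this sketch only requires a new call at the specialization layer, not a change to the text processing step.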

The performance of Muscorian was unexpected, since potential errors from a series of
text analysis processes are likely to adversely affect the outcome of the entire process.
Most biomedical entity relationship extraction tools have used biomedical-specific
part-of-speech (POS) taggers, as errors in POS tagging are likely to affect
subsequent semantic analysis of the text, such as shallow parsing. A comparative study
between MontyTagger, a generic POS tagger, and MedPost, a tagger trained on
biomedical text, was carried out. Our results demonstrated that MontyTagger,
Muscorian's POS tagger, has a POS tagging accuracy of 83.1% when tested on
biomedical text. Replacing MontyTagger with MedPost did not result in a significant
improvement in entity relationship extraction from text: precision was 55.6% with
MontyTagger versus 56.8% with MedPost on directional relationships, and 86.1% with
MontyTagger compared to 81.8% with MedPost on un-directional relationships. This is
unexpected, as poor POS tagging by MontyTagger would be expected to affect the
outcome of the information extraction. An analysis of POS tagging errors demonstrated
that 78.5% of tagging errors were compensated for by shallow parsing. Thus, despite
83.1% tagging accuracy, MontyTagger has a functional tagging accuracy of 94.6%. This
suggests that POS tagging errors do not adversely affect the information extraction task
if the errors are resolved in shallow parsing through alternative POS tag use.
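
The distinction between tagging accuracy and functional tagging accuracy can be sketched as follows. The definition used here is an assumption, since this section does not state how the thesis computes it: a token counts as functionally correct if its POS tag is right, or if the tag is wrong but the shallow parser still assigns the token the same chunk label. The tags, chunk labels and function names are toy values and do not reproduce the reported figures.

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted POS tag equals the gold tag."""
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


def functional_accuracy(gold_tags, predicted_tags, gold_chunks, predicted_chunks):
    """Fraction of tokens that are tagged correctly OR whose tagging error is
    compensated downstream (assumed here to mean: same chunk label)."""
    ok = sum(
        g == p or gc == pc
        for g, p, gc, pc in zip(
            gold_tags, predicted_tags, gold_chunks, predicted_chunks
        )
    )
    return ok / len(gold_tags)


# Hypothetical four-token example: two tagging errors, one of which still
# lands in the correct noun-phrase chunk.
gold_tags = ["NN", "VBZ", "NN", "NN"]
predicted_tags = ["NN", "VBZ", "NNP", "VBG"]
gold_chunks = ["NP", "VP", "NP", "NP"]
predicted_chunks = ["NP", "VP", "NP", "VP"]

print(tagging_accuracy(gold_tags, predicted_tags))
print(functional_accuracy(gold_tags, predicted_tags, gold_chunks, predicted_chunks))
```

In this toy case the NN/NNP confusion does not change the chunk, so it is counted as compensated, while the NN/VBG confusion alters the chunk and remains an error.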

Microarrays have been used to examine the transcriptome of mouse lactation, and a
simple method for microarray analysis is correlation studies, in which functionally related
genes exhibit similar expression profiles. However, there has been no study to date
using text mining to sieve microarray analysis to generate new hypotheses for further
research in the field of lactational biology. Our results demonstrated that a previously
reported protein name co-occurrence method (5-mention PubGene), which was not
based on a hypothesis testing framework, is generally more stringent than the 99th
percentile of Poisson distribution-based method of calculating co-occurrence. It agrees
with previous methods using natural language processing to extract protein-protein
interactions from text, as more than 96% of the interactions found by natural language
processing methods coincide with the results from the 5-mention PubGene method.
However, less than 2% of the gene co-expressions analyzed by microarray were found
from direct co-occurrence or interaction information extraction from the literature. At
the same time, by combining microarray and literature analyses, we derived a novel set of 7
potential functional protein-protein interactions that had not been previously described
in the literature. We conclude that the 5-mention PubGene method is more stringent
than the 99th percentile of Poisson distribution method for extracting protein-protein
interactions by co-occurrence of entity names, and that literature analysis may be a potential
filter for microarray analysis to isolate potentially novel hypotheses for further research.
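
The two co-occurrence criteria compared above can be sketched under stated assumptions. The null model used here is an assumption, as this section does not give the thesis's exact formulation: if two names appear independently in `n_a` and `n_b` of `n_total` abstracts, the number of abstracts mentioning both is treated as Poisson with mean `n_a * n_b / n_total`. The 5-mention criterion is taken to mean at least five co-mentioning abstracts. All counts and function names are hypothetical.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))


def poisson_percentile_threshold(lam, percentile=0.99):
    """Smallest count k such that P(X <= k) >= percentile."""
    k = 0
    while poisson_cdf(k, lam) < percentile:
        k += 1
    return k


def passes_poisson(observed, n_a, n_b, n_total, percentile=0.99):
    """True if the observed co-occurrence count exceeds the chosen percentile
    of the Poisson distribution expected under independence (assumed model)."""
    lam = n_a * n_b / n_total
    return observed > poisson_percentile_threshold(lam, percentile)


def passes_five_mention(observed, minimum=5):
    """True if the pair co-occurs in at least `minimum` abstracts."""
    return observed >= minimum


# Hypothetical counts: two names found in 200 and 300 of 100,000 abstracts,
# co-occurring in 4 of them.
observed, n_a, n_b, n_total = 4, 200, 300, 100_000
print(passes_poisson(observed, n_a, n_b, n_total))
print(passes_five_mention(observed))
```

With these toy counts the expected co-occurrence is 0.6 abstracts, so four co-mentions clear the Poisson criterion but not the fixed five-mention cut-off, which is one way a fixed threshold can be the more stringent of the two for rarely mentioned names.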

The availability of transcriptomics data from time-course experiments on mouse
mammary glands examined during the lactation cycle and hormone-induced
lactogenesis in mammary explants has permitted an assessment of similarity of gene
expression at the transcriptional level. Global transcriptome analysis using the exact
Wilcoxon signed-rank test with continuity correction and hierarchical clustering of
Spearman coefficients demonstrated that hormone-induced mammary explants behave
differently from mammary glands at secretory differentiation. Our results demonstrated
that the mammary explant culture model mimics in vivo glands in immediate responses,
such as hormone-responsive gene transcription, but generally does not mimic responses
to prolonged hormonal stimulus, such as the extensive development of secretory
pathways and immune responses normally associated with lactating mammary tissue.
Hence, although the explant model is useful to study the immediate effects of
stimulating secretory differentiation in mammary glands, it is unlikely to be suitable for
the study of secretory activation.
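
As an illustration of the Spearman coefficient mentioned above, the sketch below computes rank correlation between expression profiles in plain Python. It is not the thesis's analysis pipeline: the profile values and the names `gland`, `explant_early` and `explant_late` are hypothetical. A common choice, assumed here, is to use `1 - rho` as the distance fed to hierarchical clustering.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical expression values for five genes in three samples.
gland = [1.0, 2.5, 4.0, 8.0, 16.0]
explant_early = [0.9, 2.0, 3.5, 9.0, 20.0]   # same ordering as gland
explant_late = [16.0, 8.0, 4.0, 2.5, 1.0]    # reversed ordering

for name, profile in [("early", explant_early), ("late", explant_late)]:
    rho = spearman(gland, profile)
    print(name, round(rho, 3), "distance:", round(1 - rho, 3))
```

Because the coefficient depends only on ranks, samples with different absolute expression levels but the same gene ordering are treated as identical, which is why rank-based measures are often preferred when comparing data from different experimental systems.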

Declaration

This is to certify that
(i) the thesis comprises only my original work towards the PhD except where indicated in the Preface,
(ii) due acknowledgement has been made in the text to all other material used,
(iii) the thesis is less than 100,000 words in length, exclusive of tables, maps, bibliographies and appendices.

________________________________
Maurice Han Tong LING, BSc(Hons)
Department of Zoology
The University of Melbourne
Australia
SEN: 139520

Preface

The background literature related to this thesis has been reviewed up to March 2009.

Data chapters in this thesis have been written as a series of five papers (Chapters 2 - 6),
which has resulted in repetition in the introductions and discussions of chapters on
similar topics. At the time of printing, Chapter 2 has been published in Lecture Notes in
Bioinformatics, volume 4774; Chapters 3 and 4 have been published in The Python
Papers. These papers were jointly authored with Kevin Nicholas and Christophe Lefevre, but
in each case, I am the first author. All work described in this thesis was performed by
me, except the planning and conduct of the microarray experiments on mouse
mammary explants described in Chapter 5.

This research was funded by The University of Melbourne's Melbourne International
Fee Remission Scholarship (MIFRS) and Science Faculty Scholarship, and by the Cooperative
Research Centre for Innovative Dairy Products' PhD studentship. Technical and
computing resources were generously contributed in-kind by the Victorian Bioinformatics
Consortium, Monash University, Australia; the High Performance Computing Unit, The
University of Melbourne, Australia; the Institute of Biomedical Informatics, National Yang
Ming University, Taiwan; and the Bioinformatics Research Centre, Nanyang Technological
University, Singapore. Conference travel to present results from this thesis was
generously funded by The University of Melbourne's Postgraduate Overseas Research
Experience Scholarship (PORES), Melbourne Abroad Travelling Scholarship (MATS),
F.H. Drummond Travel Award, the Department of Zoology at The University of Melbourne,
and the Cooperative Research Centre for Innovative Dairy Products.

Acknowledgements

Paraphrasing an old saying: “It takes a village to raise a child; it takes a community
to write a thesis”. There is no doubt in my mind that I am indebted to many people.
First and foremost, I would like to express my deep gratitude to my PhD supervisors, Dr.
Kevin Nicholas, Dr. Christophe Lefevre and Dr. Andrew Lonie, of The University of
Melbourne, and Associate Professor Lin Feng, of Nanyang Technological University. I
cannot imagine a better mentor and friend for my PhD than Kevin, for your dedication
and interest in my project, and certainly for all your support and encouragement during
my difficult times. For Christophe, your valuable suggestions and challenges have
encouraged me to grow intellectually while providing firm support. I extend many
thanks to Andrew and Feng for your constructive criticisms and valuable suggestions, which
pointed me in the right direction. This thesis would never have been completed without your
encouragement, guidance and admirable patience throughout the years.

I would like to acknowledge the computing support provided by the Victorian
Bioinformatics Consortium, Monash University; Institute of Biomedical Informatics,
National Yang Ming University, Taiwan; and High-Performance Computing Unit, The
University of Melbourne. This work would not have been possible without these in-kind
gestures.

I wish to thank the CRC for Innovative Dairy Products and The University of
Melbourne for providing financial support for this work. The generous scholarships and
training grants have given me the chance to present my work at multiple conferences and
the chance to grow professionally.

Numerous constructive criticisms and suggestions from many people have gone into the
preparation of this thesis: Associate Professor Peter Thompson and Professor Frank
Nicholas of University of Sydney, for your insights into statistical modelling and
microarray analysis; Professor Thomas Rindflesch of NIH and Professor Jonathan Wren
of University of Oklahoma, for taking the time and effort to give an informal review and
critique of my text analysis system; Dr. Hugo Liu of MIT, for your guidance on using
MontyLingua; Professor Hsu Chunnan and Professor Hsu Wenlian of Academia Sinica
in Taiwan, for your insights into text mining; Dr. Jeffrey Chang of Duke University, for
your assistance in abbreviation recognition and your kind gesture in allowing me to
replicate your system locally. Thank you very much.

The Department of Zoology and the Nicholas group have provided me with a marvelous
working environment in which to complete my PhD. It is a privilege to work alongside
many of the staff and students. I wish everyone the best in their future endeavours. My
sincerest thanks are extended to Phil Au, Joly Kwek and Edwin Wong, for your endless
support and friendship, which have given me the energy to push on when I doubted myself;
Sonia Mailer, Dr. Mary Familari, Josie D'Alessandro, for your encouragement along
every step that I take; Professor David MacMillian, for letting me know that you are a
fan of my work – that meant a lot to me.

I am deeply thankful for the constant love, support, patience and, at times, blind faith
that my friends and my team of Resident Advisers in University College, especially
Genevieve Leach, have showered upon me to get me through all the trials and
tribulations that I have had to face. You have given me a home away from home. Your love
and support told me, silently but insistently, that I am never alone on this track.

I am deeply grateful to my parents and my family, who have supported me silently but
truly and concretely over these years overseas. None of this would have been
possible without your support. To my brother, Melvin, affectionately known as my 'cat',
thank you for tolerating my irritations, before and ever. Sorry for letting you go through
your teenage years on your own; maybe that is for your best.

Perhaps my only regret is not being able to finish this thesis fast enough for my grandmother
to witness my graduation. She passed on with dignity by removing her own oxygen
supply on the afternoon of 22nd June 2008, 8 months after being diagnosed with terminal
breast cancer caused by MAP kinase mutation in the insulin signalling pathway – a
subject that I know intimately from this work. This thesis is for you.

Table of Contents

Abstract
Declaration
Preface
Acknowledgements

Chapter One: Introduction and Literature Review
1.1. Mammary Gland Development and Lactation
1.2. Mammary Explant Culture Studies to Determine the Role of Insulin, Prolactin and Glucocorticoid in Lactogenesis
1.3. Microarray Analyses of Mammary Gland Development and Lactation Cycle
1.4. Analysis of Gene Expression Microarray Results
1.5. Biomedical Literature Analysis
1.5.1. Brief History of Biomedical Literature Analysis
1.5.2. Current Areas of Research
1.5.3. Information Retrieval: Finding the Papers
1.5.3.1. Brief Descriptions of Information Retrieval Systems
1.5.4. Abbreviation Recognition: Aliasing the Names
1.5.5. Named Entity Recognition: Identifying the Players
1.5.6. Information Extraction: Getting the Facts
1.5.6.1. Co-occurrence
1.5.6.1.1. Brief Descriptions of Co-Occurrence Systems
1.5.6.2. Natural Language Processing / Template Matching
1.5.6.2.1. Brief Descriptions of Natural Language Processing / Template Matching Systems
1.5.6.3. Applications of Biomedical Information Extraction
1.5.7. Text Mining: Finding Hypotheses
1.5.8. Related Areas of Importance
1.5.8.1. Corpora
1.5.8.2. Databases
1.5.8.3. Evaluation Strategies
1.5.9. Challenges of Biomedical Literature Analysis
1.6. Microarray Analyses Assisted by Biomedical Literature Analyses
1.7. Tools for Visualizing Large Graphs
1.8. Objectives and Organization of this Thesis

Chapter Two: Reconstruction of Protein-Protein Interaction Pathways by Mining Subject-Verb-Objects Intermediates
2.1 Introduction
2.2 System Description
2.2.1 Entity Normalization
2.2.2 Text Analysis
2.2.3 Protein-Protein Binding Finding
2.3 Experimental Results
2.3.1 Benchmarking Muscorian Performance
2.3.2 Verifying Protein-Protein Binding Interactions