BCHS3201: Microarray Paper (Spring 2021)

Daniel Kevins

Background

You will be working with data generated using Affymetrix Arabidopsis thaliana (ATH1) full genome chips. Please watch the microarray lecture posted in Blackboard for information on how the chips are constructed and how they are used. Step-by-step instructions are provided here for managing the data. While I have provided details here, keep in mind that in a real research lab, you would have to decide for yourself how to organize the data and make sense of it.

Arabidopsis thaliana

Arabidopsis thaliana is a small flowering plant found all over the world. It is commonly considered a weed in the United States and can be found in the Midwest (Texas is too hot; the plant likes temperatures around 68°F). Arabidopsis serves as a model plant because it has a number of characteristics that make it amenable to study. The plant is small, reaching only 30 cm in height when fully grown. It grows well in both soil and nutrient media, making it easier to develop carefully controlled studies (Meyerowitz, 1989). It is easily grown indoors in a laboratory, whereas crop plants require much larger facilities and land to study. The life cycle of Arabidopsis is only 6 weeks from seed to seed-producing plant. This allows a much faster pace for experiments than most crop plants, where only one generation of plants can be grown in a calendar year (unless your university is fortunate enough to have land in two hemispheres so you can get two growing seasons in). Arabidopsis plants produce thousands of seeds per plant, and these seeds are tiny, making them easy to store in microcentrifuge tubes in the freezer (Meyerowitz, 1989).

Arabidopsis has a haploid genome of 5 chromosomes consisting of approximately 125 megabases (The Arabidopsis Genome Initiative, 2000). This is a very small genome compared to those of crop species. The maize genome, for example, is around 2,500 megabases in size (Adam, 2000). Most genes in Arabidopsis exist at a single locus in the genome. Crop plant genomes are large in part because they contain large duplicated sections, which makes creating a complete knock-out of a particular gene difficult. Arabidopsis is amenable to genetic manipulation either through traditional cross-breeding techniques or more modern genetic modification techniques (mutation through T-DNA inserts, chemical agents, or CRISPR-Cas9). Studies conducted in Arabidopsis are often directly transferable to crop species because many of the genes have homologues in crop plants. Studying them first in Arabidopsis is easier, cheaper, and faster.

Sugar and Phytohormone Signaling Pathways

Sugars have a role in basic plant metabolism as a carbon source and also play a role as signaling molecules, contributing to the regulation of a number of pathways in plants. The expression of genes involved in mobilization of starch and lipid reserves is usually repressed by the presence of high sugar levels in the plant while genes involved in storage of carbohydrates are upregulated (Jang & Sheen, 1997; Yu, 1999). Soluble sugar levels in plants also play a role in a number of developmental processes including time to flowering (Bernier et al., 1993), shoot to root ratios (Wilson, 1988), and senescence (cells stop dividing and normal biological processes begin to deteriorate) (Dai et al., 1999). The DNA chip data you will be analyzing for class is part of a larger study to elucidate the full impact of sugar signaling in Arabidopsis and to identify potential components of signaling pathways for future study.

Phytohormones are involved in a wide array of plant responses. The plant phytohormones ethylene and abscisic acid are also intertwined with the sugar response signaling pathways.

Ethylene plays a role in a plant’s development as well as its response to environmental conditions. Ethylene has a role in shoot and root elongation, sex determination, petal senescence, and fruit ripening. It also is involved in the plant’s response to flooding and pathogens.

Abscisic acid is involved in preventing premature germination of seeds, root elongation, and stomatal closure. Stomata are pores in the leaf epidermis which control the rate of gas exchange. The pore is surrounded by two bean-shaped guard cells that regulate the size of the pore opening. Abscisic acid plays a critical role in the closure of the guard cells. Plants with mutations in the abscisic acid biosynthesis pathway have a “wilty” phenotype because they are unable to close their stomata during the day when loss of water to evaporative processes is high. The mutant aba2 has been found to be allelic to the glucose insensitive 1 (gin1) mutant (meaning the mutations for both aba2 and gin1 lie in the same gene).

Signaling pathways often work together to fine-tune plant development and responses. Seed germination, for example, is finely controlled by antagonistic interactions between sugar and abscisic acid, which inhibit germination, and gibberellin and ethylene, which promote germination (figure 1).

Figure 1. Seed germination is controlled by a combination of signals from sugar levels, abscisic acid, gibberellin, and ethylene.

The sugar-insensitive 6 (sis6) mutant is slightly resistant to the inhibitory effects of abscisic acid on germination (Pattison, 2004). When seeds are grown in a petri plate with nutrient medium supplemented with abscisic acid, germination is delayed in wild-type plants. The sugar-insensitive 3 (sis3) mutant is slightly resistant to the effect of abscisic acid in comparison to wild-type (Columbia ecotype) seeds. The abscisic acid insensitive 4-1 (abi4-1) mutant displays precocious seed germination in the presence of abscisic acid, germinating despite the presence of exogenous ABA which should significantly delay germination (figure 2).

Figure 2. The sis6 mutant is insensitive to the inhibitory effects of ABA on germination. Seeds were sown on the indicated media and grown in continuous white fluorescent light. Germination was scored every 12 hours for four days and then every 24 hours thereafter. Error bars represent the mean ± standard deviation (n=3). This experiment was conducted three times with similar results. From Pattison, 2004.

How the Data was Collected for this set of Experiments

In order to conduct a chip experiment, RNA must be collected from the samples. In our experiments, wild-type Arabidopsis seeds (ecotype Columbia) were surface sterilized, cold treated at 4° C in the dark for three days, and then plated on Nytex mesh screens placed in petri dishes containing minimal nutrient media. After 20 hours under continuous light at 21° C, the Nytex meshes were transferred to plates containing minimal media supplemented with 100 mM sorbitol (control) or 100 mM glucose. Seeds were grown on the new media for 12.5 hours and then frozen in liquid nitrogen. RNA was extracted using a phenol/chloroform extraction (Verwoerd et al., 1989). RNA samples were sent to the Molecular Genomics Core Facility at the University of Texas Medical Branch in Galveston for processing.

Control versus Experimental Conditions

Minimal media is a basic growth media. For experiments utilizing glucose, the sugar itself creates osmotic stress on the plant. To differentiate between the impact of glucose versus the impact of osmotic stress, sorbitol is used as the control. Sorbitol is a sugar-alcohol which is not metabolized by the plant. It should mimic the osmotic stress created by the glucose but not impact sugar-regulated metabolism or signaling to any great extent.

Part 1. Identifying differences in gene regulation between control and experimental conditions.

1. Download the spreadsheet corresponding to your selected control and experimental conditions to your computer.

2. Take a few minutes to familiarize yourself with the spreadsheet layout.

Column A: AGI#. AGI stands for Arabidopsis Genome Initiative. Every gene in the Arabidopsis genome was assigned a unique identifier during the genome sequencing project. The Affymetrix DNA chip contains over 22,000 genes representing nearly every known gene in the genome of Arabidopsis.

Column B: Affy Probe Index #. The Affymetrix probe index # refers to the probe array that corresponds to each gene. Each probe array contains 11 pairs of probes for the same gene. One probe in each pair is a perfect match to the gene and the other contains a mismatch in the center of the probe. The software uses the data from the perfect match sets and the mismatch sets to subtract out signal that may have arisen from near (but not quite perfect) matches. The names of the probe sets are based on what was known about the gene sequence at the time the chip was created.

Names ending in	means
_at	all probes match one known transcript
_a	all probes match alternate transcripts from the same gene
_s	all probes match transcripts from different genes
_x	some probes match transcripts from different genes

Notice that rows 2 through 65 do not have AGI#’s and the Probe Index #’s all begin with AFFX. These are the quality control probe arrays for the chip. They are included so that researchers know that there were not technical issues with the chip or samples. A mix of probes that will result in positive and absent calls are included. There are also some cells in the AGI#’s column that are listed as a “0” instead of an Arabidopsis Genome Initiative number. We will not be utilizing these rows.

Signal Columns: Each experiment in this data set was conducted 5 times. The columns that contain the word “Signal” in the header represent the value for the signal reads.

Detection Columns: The column to the right of each signal column is the Detection Column.

P= present

A=absent

M=marginal

Present means the gene was expressed in the sample, resulting in a measurable signal above a minimal detection threshold. Absent means the gene was not expressed under the experimental conditions. Marginal means the expression was very near the detection threshold. Marginal calls require further investigation and experimentation to confirm.

Converted Detection Columns: The column to the right of each Detection Column is the Converted Detection Column. The PMA calls are converted to a numeric value which allows the researcher to average the detection calls and decide whether or not to include a particular gene in the data set.

P=2

A=0

M=1

Descriptions: what was known about the gene’s identity or function at the time the chip was created.
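The P/M/A conversion described above, and the averaging of converted calls used later in step 7, can be sketched in Python. This is only a minimal illustration and not part of the Excel workflow; the function names and the sample detection calls are made up for the example.

```python
# Numeric values for the detection calls, as listed above (P=2, M=1, A=0).
PMA_VALUES = {"P": 2, "M": 1, "A": 0}


def convert_call(call):
    """Convert a single P/M/A detection call to its numeric value."""
    return PMA_VALUES[call.strip().upper()]


def average_pma(calls):
    """Average the converted detection calls for one gene across replicates."""
    converted = [convert_call(c) for c in calls]
    return sum(converted) / len(converted)


# Hypothetical detection calls for one gene in four replicates.
example_calls = ["P", "P", "M", "A"]
print(average_pma(example_calls))  # (2 + 2 + 1 + 0) / 4 = 1.25
```

An average of 0 means the gene was called absent in every replicate; any value above 0 means at least one replicate gave a marginal or present call.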

3. Open the WT on sorbitol_germinating seeds (control) and the WT on glucose_germinating seeds (experimental) Excel files found on Blackboard. For both experimental and control conditions, delete the rows containing the controls. These will be the rows that lack an AGI# or have a “0” in the AGI# column. You can highlight the entire spreadsheet and use the custom sort feature to sort on column A from smallest to largest. This will group your “0”’s and your blank cells to make it easy to delete them as a block. The 0’s will end up at the top and the blank cells at the bottom.

4. Open a new Excel file and name it as follows: Lastname_firstname_microarray.

5. Change the name of Sheet 1 to “control” by right clicking on the tab and selecting “rename” from the pop up menu. Copy and paste Row 1 to capture the headers and all the rows assigned for your group (see list below) from your control sheet (WT seeds on sorbitol) into the “control tab”.

Group 1: Rows 2-3765

Group 2: Rows 3766-7531

Group 3: Rows 7532-11296

Group 4: Rows 11297-15062

Group 5: Rows 15063-18828

Group 6: Rows 18829-22592

6. Click the “+” sign to add another tab at the bottom of the Excel sheet. Rename the new sheet “experimental”. Copy and paste Row 1 to capture the headers and all the rows assigned for your group (see list above) from your experimental sheet (WT seeds on glucose) into the “experimental” tab.

7. Scroll to the right. Skip a column after the “Descriptions” column. Label the next column to the right “AVG control PMA” or “AVG experimental PMA”. Calculate the average PMA call for each gene using the converted detection column values for each condition. For example, if converted PMA detection calls are located in cells E2, I2, M2, and Q2, the formula you enter into the cell would be “=(E2+I2+M2+Q2)/4”. Do this for both your control and experimental sheets. Enter the formula and copy/paste it down the column. The row numbers will change automatically.

8. Click the “+” sign to add another tab to the bottom of the Excel sheet. Rename the new sheet “combined”.

9. Copy the following columns into the “combined” data sheet. You will need to paste “values” for any columns containing formulas. It’s under paste options.

a. AGI#

b. Affy Probe Index #

c. Description

d. Signal columns for the control

e. Leave a blank column

f. Signal columns for the experimental

g. Leave a blank column

h. AVG control PMA column

i. AVG experimental PMA column

10. In the combined data sheet, add another column to the right of your AVG control PMA and AVG experimental PMA columns. Label this one “final PMA call”. Type in the formula “=MAX(XX2:XY2)” where XX is the column labeled “AVG control PMA” and XY is the column labeled “AVG experimental PMA” (substitute your actual column letters for XX and XY). This formula will transfer the maximum value for the two columns to the new “final PMA call” column. The point of doing this is to preserve genes in the data set where there was signal in one of the two conditions. For example, you would not want to delete a gene from the data set because it had an absent call in the control but was upregulated 15-fold in the experimental conditions. By looking at the results using the final column, we can eliminate only those genes where the signal was not detected in either condition.

11. In the combined spreadsheet, highlight your entire data set. Make sure you pick up all the cells with data. Click “Sort & Filter” in the toolbar. Click custom sort. Check the box on the right in pop-up box that says “My data has headers”. Sort by the “final PMA call” column from smallest to largest. Delete all rows that have a value of zero for final PMA call. This will eliminate all genes that were not expressed in either the control or experimental condition from the data set.
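The logic of steps 10 and 11 (take the larger of the two average PMA values, then drop genes where that value is zero) can be sketched as follows. This is a hedged illustration only; the dictionary keys and the gene labels are hypothetical and do not come from the data set.

```python
def final_pma_call(avg_control_pma, avg_experimental_pma):
    """Mirror the MAX formula from step 10: keep the larger average PMA value."""
    return max(avg_control_pma, avg_experimental_pma)


def drop_undetected(genes):
    """Keep only genes whose final PMA call is above zero (step 11)."""
    return [
        g for g in genes
        if final_pma_call(g["avg_control_pma"], g["avg_experimental_pma"]) > 0
    ]


# Hypothetical rows: gene_2 is absent in both conditions and is removed.
genes = [
    {"agi": "gene_1", "avg_control_pma": 0.0, "avg_experimental_pma": 2.0},
    {"agi": "gene_2", "avg_control_pma": 0.0, "avg_experimental_pma": 0.0},
    {"agi": "gene_3", "avg_control_pma": 1.5, "avg_experimental_pma": 0.5},
]
kept = drop_undetected(genes)
print([g["agi"] for g in kept])  # ['gene_1', 'gene_3']
```

Note that gene_1 survives even though it was absent in the control, which is exactly the case the maximum is meant to protect.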

12. Add a column to the right of the “Final PMA call” column labeled “AVG control signal” in your combined spreadsheet. Average the values for the signal columns in your control data set. Use the formula =AVERAGE(X2:Y2) where X is the first column with the control signal data and Y is the last column of control signal data. Copy and paste the formula from row 2 all the way down the column. The row numbers will automatically change in the formula.

13. Add a column to the right of the “AVG control signal” column labeled “AVG experimental signal” in your combined spreadsheet. Average the values for the signal columns in your experimental data set. Use the formula =AVERAGE(X2:Y2) where X is the first column with the experimental signal data and Y is the last column of experimental signal data. Copy and paste the formula from row 2 all the way down the column. The row numbers will automatically change in the formula.

14. Add a column to the right of the “AVG experimental signal” column labeled “AVG control/AVG experimental”. You will divide the average control signal value by the average experimental signal value using the formula “=XX2/XY2” [where XX is your AVG control signal column (row 2) and XY is your AVG experimental signal column (row 2)]. Copy the formula down the column.

15. Add a column to the right of the “AVG control/AVG experimental” column labeled “T-test”. You will calculate whether there is a statistically significant difference between the two conditions. The syntax for this formula is T.TEST(array1, array2, tails, type). Array 1 will be the cells containing the signal values for the control. Array 2 will be the cells containing the signal values for the experimental samples. These are NOT the averaged signals but the original values on the left-hand side of your spreadsheet. We will use a 2-tailed T-test. The type will be a two-sample equal variance test, which Excel designates as “2”. For example, if the control signal columns were B, C, and D and the experimental signal columns were E, F, and G, then the formula to set up in row 2 for the T-test would be “=T.TEST(B2:D2,E2:G2,2,2)”. Copy the formula down the column to calculate the T-test p-value for each gene.
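For readers curious about what the two-tailed, equal-variance T-test in step 15 computes, here is a rough standard-library Python sketch. It pools the two sample variances, forms the t statistic, and obtains the p-value by numerically integrating the Student's t density with Simpson's rule. The numerical integration is a choice made for this illustration so that no extra packages are needed; it is not how Excel works internally, and the signal values below are made up.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t_stat, df, steps=2000):
    """Two-tailed p-value: 1 minus twice the area under the density from 0 to |t|."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return max(0.0, 1 - 2 * area)


def equal_variance_t_test(control, experimental):
    """Return (t statistic, two-tailed p-value) using a pooled variance.

    Raises ZeroDivisionError if every value in both samples is identical.
    """
    n1, n2 = len(control), len(experimental)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(control) + (n2 - 1) * variance(experimental)) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t_stat = (mean(control) - mean(experimental)) / std_err
    return t_stat, two_tailed_p(t_stat, df)


# Made-up signal values for one gene, three replicates per condition.
t_stat, p_value = equal_variance_t_test([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(round(t_stat, 3), round(p_value, 4))  # roughly -3.674 and 0.0213
```

With only a few replicates per condition the degrees of freedom are small, so a gene needs a fairly large and consistent difference between conditions to fall below the 0.05 cutoff.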

16. Click the “+” sign to add another tab to the bottom of the Excel sheet. Rename the new sheet “final”. Copy all the data from the “combined” spreadsheet into your “final” spreadsheet using the copy and paste value option. This will allow you to go back to the combined sheet to relax the stringency of your data selection if you find you end up with no genes at all in your data set when you complete the following steps.

17. Highlight your entire spreadsheet. Click “Sort & Filter” in the toolbar. Click custom sort. Click the “my data has headers” box on the right of the pop-up box. Sort by T-test value from largest to smallest. Delete all genes that have a p-value greater than 0.05. The expression of these genes is not significantly different between the control and experimental conditions, so they can be eliminated from the data set.

18. Highlight your entire spreadsheet again. Click “Sort & Filter” in the toolbar. Click custom sort. Click the “my data has headers” box on the right in the pop-up box. Sort by AVG control/AVG experimental from smallest to largest. Delete all genes that have an AVG control/AVG experimental value between 0.5 and 2 (that is, from just above 0.5 through 1.99999). What you are looking for are genes where the expression is at least two-fold above or below the level for the control condition. You want to keep genes in the data set where the AVG control/AVG experimental value is 0.5 or lower. These are genes that are UPREGULATED in the experimental condition compared to the control. The larger number is in the denominator, so the value is less than 1. You also want to keep genes in the data set where the AVG control/AVG experimental value is 2 or higher. In this case, the genes are DOWNREGULATED in the experimental condition compared to the control condition. Since the larger number is in the numerator, the value is greater than 1. If you do not have any genes with at least a two-fold difference in expression between control and experimental, relax your conditions: keep genes with values of 1.5 or higher or 0.66 or lower, and delete those in between.

19. Change the font color for all of the down-regulated genes to red [AVG control/AVG experimental values above 2 (or 1.5 if you relaxed the conditions)].

20. Change the font color for all of the up-regulated genes to green [AVG control/AVG experimental values below 0.5 (or 0.66 if you relaxed the conditions)].

21. Add a column to the right of the AVG control/AVG experimental column. Label it Fold Change. Multiply the value for all of your downregulated genes by -1 so that your down-regulated genes are clearly negative and down-regulated. Use the formula “=XY*-1” where X is the column letter and Y is the row number. For upregulated genes, you will take the inverse of the value located in the AVG control/AVG experimental column. Use the formula “=1/XY”. For example, if the value was 0.03, then the value in the Fold change column will be 33.33.

22. Determine how many genes were up-regulated and how many were down-regulated.
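The fold-change logic of steps 18 through 22 can be summarized in a short Python sketch. It is a minimal illustration under the cutoffs stated above (a ratio of 2 or higher, or 0.5 or lower); the function name and the example signal values are hypothetical.

```python
def signed_fold_change(avg_control, avg_experimental, threshold=2.0):
    """Classify one gene from its AVG control/AVG experimental ratio.

    Returns a negative fold change for down-regulated genes, a positive
    fold change for up-regulated genes, and None for genes that would be
    deleted because the ratio falls between 1/threshold and threshold.
    """
    ratio = avg_control / avg_experimental
    if ratio >= threshold:
        return -ratio      # down-regulated: multiply the ratio by -1 (step 21)
    if ratio <= 1 / threshold:
        return 1 / ratio   # up-regulated: take the inverse of the ratio (step 21)
    return None


# Hypothetical (AVG control signal, AVG experimental signal) pairs.
pairs = [(400.0, 100.0), (100.0, 400.0), (100.0, 120.0)]
fold_changes = [signed_fold_change(c, e) for c, e in pairs]
print(fold_changes)  # [-4.0, 4.0, None]

# Step 22: count the genes in each direction.
up = sum(1 for f in fold_changes if f is not None and f > 0)
down = sum(1 for f in fold_changes if f is not None and f < 0)
print(up, down)  # 1 1
```

Passing `threshold=1.5` corresponds to the relaxed condition, where the lower cutoff becomes 1/1.5, about 0.67 (the steps above round this to 0.66).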

Part 2. Gene Ontology (GO) Biological Process

1. Copy the first column with the AGI#’s into a new Excel sheet. Do not copy the column header. Save the file as a comma delimited file (CSV).

2. Go to https://www.arabidopsis.org/ . Click Search and select Gene Ontology annotations from the drop down menu.

3. Click Choose file. Select your CSV file. Click Functional Categorization.

4. Click Draw next to “Annotation Pie Chart”. This will generate 3 pie charts: GO Cellular Component, GO Biological Process, and GO Molecular Function. You will include the GO Biological Process chart in your paper. Copy and paste that into your Word file for your paper. When you write your paper, you should discuss anything that stands out to you as particularly interesting given your chosen topic. You do not need to discuss every single category of information appearing in these charts. You may include the other two charts in your paper if there is something in particular that you wish to highlight or tie into your discussion section of the paper but you are not required to do so.
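The header-free, one-identifier-per-line CSV from step 1 of this part could also be produced with a few lines of Python instead of Excel's Save As. This is a sketch only: the function name is hypothetical, the three AGI numbers are simply the ones shown in the example results table of this handout, and the upload page may have formatting expectations that this example does not check.

```python
import csv
import io


def write_agi_csv(agi_ids, handle):
    """Write one AGI identifier per row, with no header row."""
    writer = csv.writer(handle)
    for agi in agi_ids:
        writer.writerow([agi])


# Write to an in-memory buffer here; for a real file, open it with
# open("agi_list.csv", "w", newline="") and pass that handle instead.
buffer = io.StringIO()
write_agi_csv(["At1g20340", "At1g79040", "At1g32900"], buffer)
print(buffer.getvalue())
```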

Part 3. Selecting a gene of interest for detailed study.

Information is continuously being added to our knowledge base. Many genes have been identified since the data in this particular data set were first collected. If you want to see if more information is available for a gene that has a particularly striking fold change, you can check TAIR, The Arabidopsis Information Resource, at https://www.arabidopsis.org/.

Click Search:

Click Microarray Element from the dropdown box. Enter your locus identifier in the box (example: At5g01810). Make sure Affymetrix ATH1 is selected (this is the type of chip our data set is from) and click “Get Microarray Elements”.

To get detailed information on a gene of interest.

In this example, information about the gene can be found under the annotation.

You will want to select a gene from your dataset that is strongly up- or down-regulated (a fold change of 3 is preferred, but you may go as low as 1.5-fold if necessary for the purpose of this assignment). You need to select a gene that has been studied in the past. Skip genes that are listed as having unknown function both in our data set and when you look them up in the search above.

Next, click the search box in the top left corner again and this time select “Genes”. Enter your locus ID (example At5g01810) in the “starts with” box under the Search by Name or Phenotype section. Scroll to the bottom of the page and hit “Submit Query”. Select your locus from the list by clicking on the blue locus identifier.

If the gene has been previously studied, a wealth of information will be available on the next page. Information to include in your paper:

1. Gene locus

2. Other names for the gene:

3. Biological Processes in which the gene plays a role (GO Biological Process)

4. The cellular component in which the protein product is expressed (GO Cellular Component)

5. Growth and developmental stages when the gene is expressed

6. The plant structures where the protein product of the gene is expressed

Take a look at the BAR eFP (The Bio-Analytic Resource for Plant Biology electronic fluorescent pictograph) data. This is a browser engine that “paints” data from genomic data sets, such as microarrays, onto pictographs that represent the experimental samples used to generate the data set. The purpose of the tool is to help researchers develop testable hypotheses based on the enormous amount of data generated by genomics projects. If you click the Data source, you have options you can select that will provide you with information on experimental work others have conducted to study this gene. The information will be in a nicely illustrated summary form. The original reference will be included on the page as well.

Another example for the same gene:

This is a great place to look for information on your gene to use in your narrative. You should cite the original papers if you use the information in this section. You may need to go back to the original paper for details or clarity. You may not copy and paste the graphics from this website into your paper. You can only use work published by others with permission from both the original authors and the publishing company. Instead, synthesize the information presented there in your own text.

Under the Protein Data section, you will find the following information to include in your paper:

1. Protein Length

2. Molecular weight

3. Isoelectric point

4. List of InterPro domains: Create a table of the domains and their function (if the function is known). Click on the links. This will take you out to the InterPro site where you will find info on the domain. The information in the description might provide some useful information to include in your manuscript. In the table, you should indicate a very BRIEF description of whatever you think is most relevant about this particular domain (think about what your microarray experiment was to help you decide what might be the most useful information to include in the table) and the biological process, molecular function or cellular component that is applicable to the domain (see under GO terms). If no information is available, record “none” in your table. Example:

Domain	Brief Description	Biological Process	Molecular Function	Cellular Component
NAF/FISL_domain: IPR018451	Serine-threonine protein kinase that interacts with calcineurin B-like calcium sensor proteins	Signal transduction	None	None

Table 1. Domain ontology from http://www.ebi.ac.uk/interpro/entry/InterPro/IPR018451/.

All the way at the bottom of the TAIR page, you will find a list of publications related to the gene. Use these publications as references for your paper.

Part 4. Write your microarray paper.

Your microarray paper should contain the following components:

1. Title: The title should contain the species name of the organism (Arabidopsis thaliana), your topic of experimentation, and a statement about what you were looking for or what data you were generating.

2. Introduction: Be sure to state the purpose of the study, explain why the experiment was conducted, and review previous work of others in the field (integrated seamlessly, not one reference after another). An explanation of how a microarray works is not needed here. Assume your reader is familiar with this now long-standing, commonplace technique. Focus on your topic (sugar signaling or the interplay between sugar and phytohormone signaling).

3. Results:

a. Report the # of genes upregulated and downregulated by 2-fold or higher.

b. Include a table of the ten most highly up-regulated genes and the ten most highly down-regulated genes in your experimental condition compared to the control (use your combined spreadsheet). Also include any genes that you want to discuss in your discussion section that are clearly implicated in the literature as being involved in sugar metabolism, phytohormone biosynthesis, or sugar or phytohormone signaling. You may highlight genes in the discussion that show a change in regulation in your experiment but didn’t make the top 10. Example:

AGI #	Affymetrix Probe #	Description	Fold change	p-value in Student’s T-Test
At1g20340	255886_at	Plastocyanin, putative	-4.11044	3.33E-03
At1g79040	264092_at	Photosystem II polypeptide, putative	-4.02826	6.66E-04
At1g32900	261191_at	Starch synthase, putative	+3.324657	1.19E-05

c. Gene ontology data

d. All data collected from Part 3 about your selected gene for deeper study.

e. All figures should be labeled and be accompanied by figure legends. Each figure should be referenced in the text; example: (see figure 1).

f. Text (in addition to the figure legends) should be present to inform the reader what you did and to summarize the results collected. No interpretation of the data is included here. Save that for the discussion.

4. Discussion:

a. Recap your results. Take a look at the descriptions for the genes that are up- or down-regulated. Now look at the review of literature you selected for homework. Are there genes on the list that you would expect to see based on the literature? Looking at the descriptions, are there genes that make sense to see? If you are looking at sugar, are there genes that are obviously part of sugar metabolic pathways or involved in photosynthesis? If you are looking at phytohormones, do the receptors for your chosen phytohormone appear on the list? You might want to pull up journal articles on some of the genes appearing on the list to explain why they might be appearing there. Include a few suggestions for future experiments that could be conducted to expand our understanding of your topic based on your results.

5. References: You should use no fewer than 6 journal articles (literature review or primary literature) as references. Use APA format. See the “Practical Guide to Scientific Writing”.

6. Appendix: You will upload your Excel spreadsheet separately to the Google Drive. Be sure to drop it in the folder for your TA.

General Information:

Your paper should be in Times Roman or Calibri font, size 12. Paper margins should be 1 inch. Please double-space the paper. The paper should not contain figures or images from any published work. In order to include previously published images, not only must you cite the source, you must also seek permission from both the original authors and the publisher. Unless you are prepared to submit the documentation for these permissions, do not include figures or images that you did not generate using the TAIR page or create yourself.
The grading rubric is in Blackboard.

References:

Adam, D. (2000). Now for the hard ones. Nature 408, 792-793.

The Arabidopsis Genome Initiative (2000). Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796-815.

Bernier, G., Havelange, A., Houssa, C., Petitjean, A., and Lejeune, P. (1993). Physiological signals that induce flowering. Plant Cell 5, 1147-1155.

Dai, N., Schaffer, A., Petreikov, M., Shahak, Y., Giller, Y., Ratner, K., Levine, A., and Granot, D. (1999). Overexpression of Arabidopsis hexokinase in tomato plants inhibits growth, reduces photosynthesis, and induces rapid senescence. Plant Cell 11, 1253-1266.

Jang, J.-C., and Sheen, J. (1997). Sugar sensing in higher plants. Trends Plant Sci. 2, 208-214.

Meyerowitz, E.M. (1989). Arabidopsis, a useful weed. Cell 56, 263-269.

Pattison, D. (2004). Characterization of sugar-insensitive mutants and analysis of sugar-regulated gene expression in Arabidopsis thaliana. [Doctoral dissertation, Rice University]. Rice University Graduate Electronic Theses and Dissertations. https://scholarship.rice.edu/handle/1911/18679

Verwoerd, T.C., Dekker, B.M. M., and Hoekema, A. (1989). A small-scale procedure for the rapid isolation of plant RNAs. Nucleic Acids Res. 17, 2362.

Wilson, J. B. (1988). A review of evidence on the control of shoot:root ratio, in relation to models. Annals of Botany 61(4), 433-449.

Yu, S.-M. (1999). Cellular and genetic responses of plants to sugar starvation. Plant Physiol. 121, 687-693.
