High-throughput technologies are rapidly establishing themselves as indispensable tools for the study of biological systems, from changes in gene expression levels and protein concentrations to protein modifications and interactions in complex diseases and in systems ranging from cells to complete organisms. Analysis of this type of data is not well served by the standard biostatistical techniques traditionally applied to biomedical experiments and clinical data sets. While a conventional clinical or basic science research study might involve tens to thousands of subject observations described by tens to hundreds of variables, high-throughput data typically consist of a small number of samples and a large number of attributes. The characterization of such data must therefore differ from that applied to the traditional data a biostatistician would encounter, and the analysis required for optimal knowledge discovery also differs and involves more questions than traditional statistical inference can handle. In array data, for example, most of the variables other than the treatment class descriptors are numeric, and the class variable usually has only a few (two or three) possible values (class labels, treatments, or observed conditions).
Knowledge, the primary goal of our data analysis and exploration, is most often discovered by generating information (structure) from data and then abstracting nontrivial patterns (rules or associations, for example) from that information. The discovery process can draw on numerous means that share the same goal: statistical analysis, visualization, data mining, neural networks, mathematical modeling and simulation, and pathway analysis, to name a few. Visualization differs from the rest, however, in that it is also the mechanism by which the analyses and their results are presented to the user, harnessing the perceptual and cognitive capabilities of the user, who is still one of the most powerful pattern recognizers and inference engines. Visualization of data sets is best known for statistical representations of data, that is, statistical computations plotted in a variety of ways: histograms, time series, scatter plots, and so on. Modern approaches (such as the VizSOM suite of algorithms developed by Drs. Trutschl and Cvek [2]) extend the scatter plot concept, and such generalizations enable us to display larger, higher-dimensional data sets in a meaningful, more descriptive manner. Tools for data analysis are generally used during hypothesis generation and the exploratory analysis stage (finding patterns, outliers, and other meaningful domain-specific insights), as well as in the confirmatory stages. Many of these tools rely on information visualization techniques such as scatter plots and histograms, but most of those techniques can represent only two or three variables at a time. High dimensionality is one of the challenging traits of the data. In many cases, projection transformation functions are used to map high-dimensional data to a low-dimensional display space. While useful, such techniques usually result in perceptual ambiguities, because records that differ in the original space can be drawn at the same display location.
We complement the current publicly available tools (Spotfire, R-project, MATLAB, etc.) with proprietary tools that we developed to address some of the shortcomings of the commercial and open-source packages. A set of proprietary algorithms that we named VizSOM enhances traditional visualizations (scatter plots, parallel coordinates, Radviz, and visualizations of a polygonal nature such as maps, to name a few). These algorithms take advantage of neural networks and dimensionally rich data and present the data in a meaningful format [2; see appendix]. Figure 2 (also see appendix) is a sample visualization of a high-dimensional microarray polysomal data set shown as a traditional scatter plot (a) and as a rotated 3D VizSOM scatter plot (b). Selected records that map to the same spot in the 2D scatter plot are shown to be different in the high-dimensional space (c).
Figure 2. (a) Scatter plot of a microarray data set representing polysomal behavior in C. elegans. (b) 3D visualization of the same data set after processing with our neural-network-enhanced visualization. Records that overlapped and appeared identical in (a) are now placed at different locations along the z-axis according to all dimensions in the data set (or a subset, if so desired). (c) Line plot showing the values in all dimensions, which explains why these records cannot and should not be co-located in the scatter plot. Larger images are available in the appendix.
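The idea behind panels (a) through (c) can be illustrated with a minimal Python sketch. It is not the VizSOM implementation, whose details are not given here; it assumes a generic one-dimensional self-organizing map (SOM) whose best-matching node index serves as the z-coordinate. The function names (`train_som_1d`, `best_matching_unit`, `distance`), the parameter values, and the five hypothetical records are all illustrative assumptions.

```python
import math
import random


def best_matching_unit(nodes, record):
    """Return the index of the node whose weight vector is closest to the record."""
    return min(
        range(len(nodes)),
        key=lambda i: sum((w - x) ** 2 for w, x in zip(nodes[i], record)),
    )


def train_som_1d(records, n_nodes=8, epochs=100, seed=0):
    """Train a small 1-D self-organizing map with online updates (generic SOM, not VizSOM)."""
    rng = random.Random(seed)
    dim = len(records[0])
    nodes = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs              # shrinks from 1 toward 0
        learning_rate = 0.5 * decay
        radius = max(1.0, (n_nodes / 2) * decay)  # neighborhood width, never below 1
        for record in records:
            bmu = best_matching_unit(nodes, record)
            for i, node in enumerate(nodes):
                # Gaussian neighborhood: nodes near the best-matching unit move the most.
                influence = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    node[d] += learning_rate * influence * (record[d] - node[d])
    return nodes


def distance(a, b):
    """Euclidean distance using every dimension of the two records."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# Hypothetical records scaled to [0, 1]. The first two columns are the
# scatter-plot axes; the remaining columns are hidden by a 2-D projection.
records = [
    [0.20, 0.40, 0.05, 0.10, 0.00],
    [0.20, 0.40, 0.95, 0.90, 1.00],  # same (x, y) as the first record
    [0.70, 0.10, 0.50, 0.40, 0.60],
    [0.90, 0.80, 0.10, 0.20, 0.10],
    [0.10, 0.90, 0.80, 0.90, 0.70],
]

nodes = train_som_1d(records)

# Keep x and y as in the ordinary scatter plot; use the SOM node index as z.
placed = [(r[0], r[1], best_matching_unit(nodes, r)) for r in records]

for x, y, z in placed:
    print(f"x={x:.2f} y={y:.2f} z={z}")
print("full-space distance between records 0 and 1:",
      round(distance(records[0], records[1]), 3))
```

In this toy data set the first two records share the same (x, y) position yet are far apart in the full five-dimensional space. A trained map will usually assign such records to different nodes, so they separate along z, although this is not guaranteed for every initialization or parameter choice.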
Figure 3. VizSOM Radviz with a sample microRNA data set and a 50-node stack of SOM output nodes: (a) 3D side view, with the Radviz projection base at the bottom; (b) bottom projection; (c) top projection. All views use a color scale based on a selected variable. Larger images are available in the appendix.
Figure 3 shows a sample 27-dimensional microRNA data set generated by miRANDA and post-processed using the neural-network-augmented VizSOM Radviz algorithm developed by Drs. Trutschl and Cvek. This technology goes beyond the currently available representation of results from computational approaches such as miRANDA, which is limited to lengthy tables of scores and annotations.
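For readers unfamiliar with Radviz, the classical projection that forms the base of such a display can be sketched as follows. This shows only the standard Radviz mapping, in which each dimension is assigned an anchor on the unit circle and a record is drawn at the value-weighted average of the anchors; it does not reproduce the SOM node stack of VizSOM Radviz. The function names and the four-dimensional toy records are assumptions made for illustration.

```python
import math


def circle_anchors(n_dims):
    """Place one anchor per dimension, evenly spaced on the unit circle."""
    return [
        (math.cos(2 * math.pi * j / n_dims), math.sin(2 * math.pi * j / n_dims))
        for j in range(n_dims)
    ]


def radviz_project(record, anchors):
    """Map a record of non-negative values (scaled to [0, 1]) to a 2-D point.

    The point is the average of the anchor positions weighted by the
    record's values; an all-zero record is placed at the center.
    """
    total = sum(record)
    if total == 0:
        return (0.0, 0.0)
    x = sum(v * ax for v, (ax, _) in zip(record, anchors)) / total
    y = sum(v * ay for v, (_, ay) in zip(record, anchors)) / total
    return (x, y)


anchors = circle_anchors(4)

# Hypothetical records: the first three differ from one another but are
# all pulled equally in opposite directions, so they meet at the center.
toy_records = [
    [0.5, 0.5, 0.5, 0.5],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],  # dominated by one dimension, so it sits on that anchor
]

points = [radviz_project(r, anchors) for r in toy_records]
for record, (x, y) in zip(toy_records, points):
    print(record, "->", (round(x, 6), round(y, 6)))
```

Records with very different values can land on the same Radviz point, which is the kind of ambiguity that an additional axis is meant to resolve.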
Research-driven development of algorithms and other computationally oriented techniques is an important component of the interdisciplinary collaboration between LSUS and LSUHSC-Shreveport.
The project described was supported by NIH Grant Number P20RR018724 from the National Center for Research Resources.
