MATH/CAAM/STAT 498/698
Friday 3-3:50 in KH101. 1-3 credits
Instructors:
Steve Cox (CAAM); email: cox@rice.edu; homepage: http://www.caam.rice.edu/~cox/
Robert Hardt (MATH); email: hardt@rice.edu; homepage: http://math.rice.edu/~hardt/
David Scott (STAT); email: scottdw@rice.edu; homepage: http://www.stat.rice.edu/~scottdw/
Description:
Because of the rapid improvement of instrumentation (image detection, measurement) and the great increase in computer processing and storage capabilities, numerous research institutions have experienced explosive growth in available data. What can one do with all this raw information? One seeks, through statistical analysis, theoretical modeling, and numerical simulation, new ways to process, correlate, understand, and use the data. This course will present some samples of such analyses.
MATH/CAAM/STAT 498/698 will begin with a short course on "data mining", an important, recently developed tool of statistics. Here is a simple example of a typical use (or misuse) of data. Suppose two separate tests, with positive scores x and y, are performed on a group of patients, and one finds that a patient with x > 1 or y > 1 is definitely healthy. The cautious doctor may insist on treatment for all patients with x ≤ 1 and y ≤ 1. However, further analysis of the same data may reveal that patients are actually healthy whenever x^2 + y^2 > 1. In this case the healthy patients "in the corners" (those with x ≤ 1 and y ≤ 1 but x^2 + y^2 > 1) are exposed to the risks of an unnecessary treatment. For a similar problem with a large number N of tests and a similar spherical relation, the cautious doctor's strategy becomes even more damaging. This is based on the mathematical fact that, as N → ∞, "most" of the points of an N-dimensional cube lie outside the inscribed ball, "in the corners".
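The cube-versus-ball fact can be checked numerically. The following is a minimal sketch, not course material: it uses the standard formula for the volume of the unit N-ball, pi^(N/2) / Gamma(N/2 + 1), and divides by the volume 2^N of the cube [-1, 1]^N. The function name ball_fraction is only illustrative. By symmetry, the same ratio gives the share of the "treated" region (all N scores at most 1) in which the sum of squares is at most 1.

```python
import math


def ball_fraction(n: int) -> float:
    """Fraction of the cube [-1, 1]^n occupied by the inscribed unit ball."""
    # Unit n-ball volume: pi^(n/2) / Gamma(n/2 + 1); cube volume: 2^n.
    return math.pi ** (n / 2) / (math.gamma(n / 2 + 1) * 2 ** n)


if __name__ == "__main__":
    # The fraction inside the ball shrinks rapidly as the dimension grows,
    # so the fraction "in the corners" approaches 1.
    for n in (2, 3, 5, 10, 20):
        inside = ball_fraction(n)
        print(f"N = {n:2d}: inside ball = {inside:.3e}, in corners = {1 - inside:.6f}")
```

For N = 2 the ratio is pi/4, so roughly a fifth of the square lies in the corners; by N = 10 the ball occupies well under one percent of the cube.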
Another related area of interest that will be discussed is "statistical forensics", such as lead bullet and fingerprint identification. Some mention of streaming data may be included, time permitting.
A second part of the course will involve some results on "random graphs" and their applications to neuroscience. Here a random graph comes from a large number N of vertices (corresponding roughly to neurons of a brain), each pair of which has a fixed probability p of being connected by an edge (corresponding to a synapse). One is interested in properties of the random graph as N grows. For example, is the random graph connected? Does it contain a triangle? For a fixed integer k, does it contain a "k-core", i.e., a subgraph in which each vertex is incident to at least k edges of the subgraph? The latter question is relevant for neuroscience where, for an appropriate k, every vertex in a k-core may "fire" and allow the whole k-core to become an "associated memory". More precisely, for a fixed k, we will calculate the probability of finding a k-core in a graph of size N. The answer, namely that k-cores are rare when N is below a (computable) threshold and extremely likely when N is above it, is representative of many of the results in this beautiful field.
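The k-core question can be explored by simulation. The sketch below is an illustration under stated assumptions, not the analysis carried out in the course: it samples a graph in which each pair of vertices is joined independently with probability p, then finds the largest k-core by repeatedly deleting vertices with fewer than k remaining neighbors. The names random_graph, k_core, and core_frequency are hypothetical, as are the parameter values in the usage loop.

```python
import random


def random_graph(n, p, seed=None):
    """Sample a graph on n vertices; each pair is an edge with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def k_core(adj, k):
    """Return the vertex set of the largest k-core (empty if there is none)."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    removed = set()
    stack = [v for v, d in degree.items() if d < k]
    while stack:
        v = stack.pop()
        if v in removed:
            continue  # vertex was queued more than once
        removed.add(v)
        # Deleting v lowers the degree of each surviving neighbor.
        for w in adj[v]:
            if w not in removed:
                degree[w] -= 1
                if degree[w] < k:
                    stack.append(w)
    return set(adj) - removed


def core_frequency(n, p, k, trials=20, seed=0):
    """Fraction of sampled graphs that contain a nonempty k-core."""
    hits = sum(bool(k_core(random_graph(n, p, seed + t), k)) for t in range(trials))
    return hits / trials


if __name__ == "__main__":
    # Illustrative parameters only: fixed p and k, growing N.
    for n in (10, 20, 40, 80):
        print(n, core_frequency(n, p=0.1, k=3))
```

Running the loop for a fixed p and k while N grows gives an empirical look at how the frequency of k-cores changes with N; the sample sizes here are far too small to locate a threshold precisely.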
Prerequisite: MATH 102.
References:
Béla Bollobás, Random Graphs, Cambridge University Press.
Nisbet, Elder, and Miner, Handbook of Statistical Analysis and Data Mining Applications, Academic Press.
Finally, we also anticipate having some guest speakers who will provide lectures on other aspects of the analysis of large data sets.
Credit: Students who choose to work on a project may opt for more than 1 credit hour. Projects are not required. Discussion is welcome.