Data Science

CS 627 Data Science Programming

This course is designed to give you awareness and initial working knowledge of some of the most fundamental computational tools for performing a wide variety of academic research. As such, it will focus on breadth rather than depth: for each topic we will cover the motivation, key concepts, and concrete usage scenarios, but not the mathematical background or proofs, which can be acquired in more specialized classes. In this class we will become familiar with the UNIX/Linux environment, learn how to version control files with git, write programs in Python, perform numeric tasks using NumPy and SciPy, analyze data using pandas, apply machine learning algorithms using scikit-learn, visualize data using matplotlib and pyqtgraph, use Qt/PySide to build graphical user interfaces, and finally address performance issues via compilation, profiling, and parallelization tools.

CS 631 Data Visualization

This course will give students a foundation in the principles of data visualization, particularly as applied to scientific and technical data, and will provide hands-on experience using modern software tools for developing visualizations. Lecture topics will include, among others, an overview of visual perception, color theory and practice, different types of graphs and their purposes, visualizations for specialized forms of data such as time-series and geospatial data sets, and strategies for working with multidimensional data. There will also be lecture content on ethical issues surrounding data visualization. Weekly lab sessions will introduce students to popular data visualization tools, including R's ggplot and Shiny, and Tableau.

CS/EE 679 Problem Solving with Large Clusters

Many real-world computational problems involve data sets that are too large to process on a single computer, or that have other requirements (fault tolerance, for example) that call for multiple computers working together. Examples include analysis of high-throughput genomic or proteomic data, data analytics over very large data sets, large-scale social network analysis, and training machine learning models on "web scale" data sets. In Problem Solving with Large Clusters, we will explore a variety of approaches to solving these kinds of problems through a mixture of lectures and student-led discussions of the research literature in the field. We will also hear from several guest lecturers with practical experience applying cluster computing algorithms in both academia and industry. In addition to reading and discussing articles, students will learn through class assignments how to program in the Hadoop MapReduce environment as well as in several other such systems. There will also be a final project on a subject of the student's choice involving cluster computing.

Prerequisites: A graduate-level course on machine learning or on probability and statistics. Students should be comfortable coding in at least one programming language and be familiar with the UNIX command-line environment.

MATH 530/630 Probability & Statistical Inference for Scientists and Engineers

This course will introduce fundamental concepts underlying statistical data display, analysis, inference, and decision making. Topics include presentation and description of data, basic concepts of probability, Bayes' theorem, discrete and continuous probability distributions, estimation, sampling distributions, classical tests of hypotheses on means, variances, and proportions, maximum likelihood estimation, Bayesian inference and estimation, linear models, examples of nonlinear models, and an introduction to simple experimental designs. One of the key notions underlying this course is the role of mathematical modeling in science and engineering, with a particular focus on the need to understand variability and uncertainty. Examples are chosen from a wide range of engineering, clinical, and social domains.