The Capstone experience covers two semesters of course work. The two semesters mirror the structure of the data science pipeline. There are many examples of this pipeline, but a simple version is the following:

  • Q: Frame the Question
  • D: Establish the Data
  • A: Perform the Analytics
  • P: Produce the Product

General Remarks

  • Each phase of the pipeline covers a wide range of activities; these are broad groupings intended to organize the capstone experience, not to provide a detailed description of data science work.
  • The order is not strictly linear; there is always some back and forth between contiguous phases, and even some circling back.
  • The speed at which capstones pass through these phases varies. It is not advisable to keep groups in lockstep. However, it is important that teams act with some urgency in establishing their data in the first semester, and performing the analytics in the second.
  • These phases are also milestones: each has a closure point, and the closure points are tied to the semesters. Specifically, Q and D need to be completed by the end of semester 1, and A and P by the end of semester 2.

Q: Frame the Question

  • Students should define the value proposition or scientific question that motivates the project, as well as ethical issues that may be anticipated. They should clarify this with an "aim and scope" statement that is shared with and understood by the client.
  • A key part of this phase is to begin operationalizing the question: translating the terms of the verbal formulation into variables and quantitative approaches.
  • Operationalizing often requires that students become familiar with existing literature and prior art on the subject. This is crucial but sometimes dismissed by students, who often want to dive in with their favorite method or tool.

D: Establish the Data

  • Students need to obtain the data and be able to access it. This may mean bringing it into a local site (e.g. Rivanna), working onsite (if the organization requires that), or moving it into Ivy (if HIPAA constrained).
  • Students should perform any wrangling or cleaning, ETL, and EDA to understand the data and make it ready for analysis.
  • Students should be able to describe their data's model, shape(s), feature space, biases, gaps, etc.
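
As a rough illustration of what describing the data's shape and gaps can look like in practice, here is a minimal sketch in plain Python. The `profile` helper and the sample records are hypothetical and invented for this example; a real capstone would more likely use a dataframe library and a much richer set of checks.

```python
def profile(rows):
    """Summarize a list of dict records: row count, columns, missing values per column."""
    # Collect every key seen in any record, so ragged records are handled too.
    columns = sorted({key for row in rows for key in row})
    # Treat None, the empty string, and absent keys as missing (an assumption).
    missing = {
        col: sum(1 for row in rows if row.get(col) in (None, ""))
        for col in columns
    }
    return {"n_rows": len(rows), "columns": columns, "missing": missing}


# Hypothetical records, invented for illustration only.
records = [
    {"age": 34, "income": 52000, "label": 1},
    {"age": None, "income": 48000, "label": 0},
    {"age": 29, "income": "", "label": 1},
]

summary = profile(records)
print(summary)
```

A summary like this is only a starting point; questions about bias and representativeness still require knowing how the data was collected.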

A: Perform the Analytics

  • This is where the students apply specific models and approaches to their data. Given the nature of the program, this is often the application of classifiers to the data, ranging from classical tools like logistic regression or support vector machines to deep neural networks. But this phase may also include building out a dashboard or exploring patterns using unsupervised methods.
  • Students may engage in analytics in semester 1, depending on their pace. Frequently, students will work with methods that have not been covered in their courses. This is fine. They often gain a practical familiarity with the tools required to deploy a method first, and only later a theoretical understanding of what they did.
  • Analytics also involves performance evaluation, either intrinsic (e.g. ROC/AUC, perplexity) or extrinsic (application to a real use case, if the product is software).
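
To make the intrinsic evaluation mentioned above concrete, the following is a small sketch of how AUC can be computed directly from its pairwise definition: the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counted as half. The function name `roc_auc` and the toy labels and scores are assumptions made for illustration; in practice students would normally rely on an established library implementation.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and classifier scores, invented for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(labels, scores)
print(auc)
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch but not for large evaluation sets, where a rank-based computation is the usual choice.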

P: Produce the Product

  • "Product" is a broad term. It may include an MVP, a working model, and of course the final paper. It also includes all communicative events and artifacts, e.g. presentations, posters, papers.
  • In addition to the paper and core work of the project, students often give talks at local conferences or develop an idea that could not be included in the final paper.
  • This part of the project also includes reconnecting with the client and presenting results to them.