SP2024:
I will make sure all questions are posted via Piazza and set the visibility so that the entire class can see relevant questions and responses.
Update (April 30th): All relevant Q&A from this semester is now listed on Piazza - click the "final_project" tab to see the full list.
Here are relevant questions from previous years:
FAQ 2023:
Q1: For part f in which we are asked to produce a confusion matrix, should we include just one confusion matrix including all of the impurity fraction information or a confusion matrix for each class (3 total)?
→ One confusion matrix covering all classes. You can't calculate the matrix for a single class, because it also contains information about the other classes being misclassified. With only one class, you can't be very 'confused'.
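As a rough illustration of why one matrix covers all classes, here is a minimal sketch that builds a 3x3 confusion matrix with plain NumPy. The labels below are made up for the example and are not taken from the project data; the class names and the row/column convention (rows are true class, columns are predicted class) are assumptions.

```python
import numpy as np

# Hypothetical true and predicted labels, for illustration only.
classes = ["STAR", "GALAXY", "QSO"]
y_true = np.array(["STAR", "STAR", "GALAXY", "GALAXY", "GALAXY", "QSO", "QSO", "STAR"])
y_pred = np.array(["STAR", "GALAXY", "GALAXY", "GALAXY", "QSO", "QSO", "STAR", "STAR"])

# Rows are the true class, columns are the predicted class.
cm = np.zeros((len(classes), len(classes)), dtype=int)
for i, true_name in enumerate(classes):
    for j, pred_name in enumerate(classes):
        cm[i, j] = np.sum((y_true == true_name) & (y_pred == pred_name))

print(cm)
```

Every off-diagonal entry involves two classes at once, which is why the matrix cannot be split into one matrix per class.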
Q2: I am still a bit confused as to what is being asked for in the narrative section for parts that don't have any explicit questions (like parts b, d, and g, for example). Could you elaborate a bit more on what we should be discussing in here?
→ I would just describe your thoughts and methods in a sentence or two if there's not a specific question. This just gives us an understanding of what you are thinking so that we can award partial credit if your code is not quite correct, but you were thinking the right thing. But don't worry about everything sounding perfect or being too long; this isn't a lab report, just a step up from comments.
Q3: I am having a difficult time conceptually understanding part f of the project and would like some clarification.
Part 2 asks: "What would be the impurity fraction or percentage from each of the other two samples (i.e. what fraction or percentage of incorrect classifications from each of the other two classes would you select with this simple classification)?".
Does this mean to calculate the number of values incorrectly classified outside of one standard deviation for each class and divide by the sum of them? Say star has 0.2, galaxy has 0.5, and qso has 0.3 of the observations incorrectly classified outside of 1 standard deviation, then the star impurity fraction would be: 0.2 / (0.2 + 0.5 + 0.3)?
→ I don't think so. I think it means:
(N_star_misclassified as Galaxy) / (Total N_star)
And all the possible combinations...
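A small sketch of that ratio, assuming you already have counts of how many objects of each true class ended up in each selection. The counts below are invented for illustration, and the variable names are hypothetical.

```python
import numpy as np

classes = ["STAR", "GALAXY", "QSO"]

# Hypothetical counts: rows are the true class, columns are the class it was selected as.
counts = np.array([
    [80, 15, 5],   # true STAR
    [10, 85, 5],   # true GALAXY
    [20, 10, 70],  # true QSO
])

# Divide each row by the total number of objects of that true class.
fractions = counts / counts.sum(axis=1, keepdims=True)

# (N_star misclassified as Galaxy) / (Total N_star)
star_as_galaxy = fractions[classes.index("STAR"), classes.index("GALAXY")]
print(star_as_galaxy)
```

The other combinations are the remaining off-diagonal entries of `fractions`.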
Q4: Quick question about the dataset. What are the units for the features (r, u, i, etc.)? Is it wavelength or intensity?
→ I think it is a measure of photon flux of the given wavelength (basically light intensity). It is fine to just label your figures with "r-band" or "flux(r-band)", or something like that.
Q5: The wording of b) makes me think I need a singular histogram, as it says "a histogram", but c) says "your histograms". Currently I have 1 histogram with 3 classes for part b).
→ It is a histogram (the r-band) for each of the three classes. So, 3 histograms in total. I asked you to draw the three histograms on the same axis – one figure with three histograms.
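For reference, one figure with three histograms could look roughly like the sketch below. The data are synthetic stand-ins generated with NumPy, not the project dataset, and the bin count and step-style drawing are just one possible choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs as a script; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-ins for the r-band values of each class (made-up numbers).
samples = {
    "STAR": rng.normal(17.0, 1.5, 400),
    "GALAXY": rng.normal(16.5, 1.0, 500),
    "QSO": rng.normal(18.5, 0.8, 100),
}

# Shared bin edges so the three histograms are directly comparable.
all_values = np.concatenate(list(samples.values()))
bins = np.linspace(all_values.min(), all_values.max(), 41)

fig, ax = plt.subplots()
counts = {}
for name, values in samples.items():
    # Each call draws on the same axis, giving one figure with three histograms.
    counts[name], _, _ = ax.hist(values, bins=bins, histtype="step", label=name)

ax.set_xlabel("r-band")
ax.set_ylabel("Entries per bin")
ax.legend()
```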
From last year:
Q1: The notebook says to fit the data to a Gaussian function and extract the mean and sigma, as well as the uncertainties from the fits. Are you looking for the actual mean and sigma for the three classes in r band or are you looking for the mean and sigma from the fit?
→ When I say sigma, I mean Gaussian sigma. I would say 'standard deviation' for the variation of the data itself around its mean.
Q2: For part c uncertainty, should we do the chi-square value as uncertainty or get the covariance on the fit parameters?
→ You are to extract the statistical uncertainty on the extracted best-fit parameters.
Q3: In looking at question c, it says to fit a Gaussian to the distributions, but in part d, we then plot the fit results *with* the Gaussian fit. So for c, do we need to plot anything? Or should we just be extracting the numbers of mean, sigma, and uncertainty?
→ That is correct: no plot is required in part c. But you are welcome to draw one if it helps you understand what is going on. You will use these extracted parameters in part d to draw a Gaussian with the best-fit parameters.
Q4: "Could not convert string to float "STAR" error"
In part i, I am receiving this error when trying to test and train the model. I'm not sure what's going on.
→ I probably shouldn't provide assistance, but this *might* be an issue with how you constructed your feature matrix and/or target vector.
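Without giving a solution, here is a tiny, generic reproduction of that kind of error using NumPy only. The rows and column layout are hypothetical; the point is just that string labels cannot be converted to floats, so they cannot sit alongside the numeric features.

```python
import numpy as np

# Hypothetical rows: two numeric features plus the class label as a string.
rows = np.array([
    ["17.2", "18.9", "STAR"],
    ["16.5", "19.4", "GALAXY"],
    ["18.1", "18.3", "QSO"],
])

# Converting everything, label column included, raises a ValueError.
try:
    rows.astype(float)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Keeping the label out of the numeric array avoids the conversion problem.
X = rows[:, :2].astype(float)  # numeric features only
y = rows[:, 2]                 # string labels kept separately
```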
Q5: Do we need to make the frequency of the histogram in part 2 the same as its actual frequency or should we set density=1?
→ I think you are actually referring to the normalization. Some classes occur more than others, so they have different totals. Either one is fine - explain what you did and why either with a comment, or in the narrative for that part.
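To make the difference between the two normalizations concrete, here is a short sketch with `numpy.histogram` on synthetic values (not the project data). Matplotlib's `hist` accepts the same `density` argument.

```python
import numpy as np

rng = np.random.default_rng(2)
values = rng.normal(17.0, 1.2, 1000)  # synthetic stand-in for one class

# Raw counts: the bin contents add up to the number of objects.
raw_counts, edges = np.histogram(values, bins=30)

# density=True: the bin contents times the bin widths add up to 1.
density, _ = np.histogram(values, bins=30, density=True)
widths = np.diff(edges)

total_entries = raw_counts.sum()
area = np.sum(density * widths)
print(total_entries, area)
```

With raw counts, a class with more objects gives a taller histogram; with density normalization, every class has unit area, so only the shapes are compared.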
Q6: I was wondering what was meant specifically by "fit the three distributions to a Gaussian function and extract the mean and sigma, as well as the uncertainty from the fits." Does this mean that we should be extrapolating error from our data?
I recall doing this in HW when we used the covariance matrix to make an error fit, but this doesn't seem quite right for this situation.
→ When you do a fit, you are finding the best-fit parameter values. There is an uncertainty on these values due to limited statistics. What is that uncertainty? I'm just asking for the uncertainty on the best-fit parameter values.
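One common way to get such uncertainties, sketched here as an assumption rather than as the required method, is to take the square root of the diagonal of the covariance matrix returned by `scipy.optimize.curve_fit`. The data below are synthetic, the `gaussian` helper and bin count are hypothetical choices, and the fit is unweighted for simplicity.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, amplitude, mean, sigma):
    """Unnormalized Gaussian used as the fit model."""
    return amplitude * np.exp(-0.5 * ((x - mean) / sigma) ** 2)

rng = np.random.default_rng(0)
r_band = rng.normal(loc=17.0, scale=1.2, size=5000)  # synthetic, not project data

# Histogram the values and fit the bin contents at the bin centers.
counts, edges = np.histogram(r_band, bins=40)
centers = 0.5 * (edges[:-1] + edges[1:])

# Starting guesses taken from the data themselves.
p0 = [counts.max(), r_band.mean(), r_band.std()]
popt, pcov = curve_fit(gaussian, centers, counts, p0=p0)

# Assumption: 1-sigma parameter uncertainties from the covariance diagonal.
perr = np.sqrt(np.diag(pcov))

for name, value, error in zip(["amplitude", "mean", "sigma"], popt, perr):
    print(f"{name} = {value:.3f} +/- {error:.3f}")
```

The fitted `mean` and `sigma` are the Gaussian parameters; `perr` holds the statistical uncertainty on each of them, which is separate from the standard deviation of the data.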
Q7: The assignment says to "provide a Narrative that includes your response to each question." Not all parts include questions to answer; is the expectation for those to not have any narrative, even though underneath there are places to put a narrative? If not, what is expected there?
→ Good point. If there are no questions then putting something there is optional. If you think it went fine on that section, then no text is required.
You may want to explain the results, why you did it the way you did, or if something went wrong you might discuss the issue you ran into.
Q8: part b) asks us to "Make and plot a histogram for the "r band" feature. On the same axis, include the distributions for the 3 different classes," does that mean plot all of the "r band" and then also plot the r values for each of the classes?
→ I just meant to draw the histogram for the "r band" for all three classes (star, galaxy, quasar) on the same axis (same figure). So it will be one plot with three histograms drawn on it.