Project 01 - 20 Points
Date Assigned: Tuesday, Sep. 16, 2025
Due Date: Tuesday, Oct. 7, 2025, 5 pm CST.
Individual Assignment: Every student should work independently and submit their own project. You are allowed to talk to other students about the project, but please do not copy any code for the notebook or text for the report.
Use of AI: You are allowed to use AI tools like ChatGPT to help you with the coding portion of this project, but you must include a document called “Use of AI” with your project. Each time you use output from an AI tool, add an entry to your “Use of AI” document with the following fields:
Tool: The tool used, such as ChatGPT.
Prompt: The prompt you provided to the tool.
Output: The output from the tool that you used.
Number each entry in the document like [1], [2], [3], etc. For example:
Use of AI
---------
[1]. Tool: ChatGPT
Prompt: Write python code to read a csv file into a pandas dataframe
Output:
# Replace 'your_file.csv' with the path to your CSV file
df = pd.read_csv("your_file.csv")
[2]. *Additional entries here*...
Then, within the code (i.e., the Jupyter notebook), add a comment referencing the corresponding entry in the “Use of AI” document. For example:
# The code below was generated by AI; see [1].
df = pd.read_csv("cars.csv")
Please do not use AI to generate the report for part 3. Learning to communicate effectively is an important life skill.
Late Policy: Late projects will be accepted at a penalty of 1 point per day late, up to five days late. After the fifth late day, we will no longer be able to accept late submissions. In extreme cases (e.g., severe illness, death in the family, etc.), special accommodations can be made. Please notify us as soon as possible if you have such a situation.
Project Description: For this project, you will use a dataset about sheltered animals in Austin, available from the class git repository. It can be downloaded here: Project 1 Dataset. This dataset has the following 12 variables:
AnimalID: ID given to the animal
Date of Birth: Date of birth of animal
Name: Name of Animal
DateTime: date and time of admission to the shelter
MonthYear: month and year of admission to the shelter
OutcomeType: One of two classes, Transfer or Adoption. Transfer means the animal was moved to another organization or shelter; Adoption means it was adopted by someone
Outcome Subtype: Provides a more specific explanation of the Outcome Type
Animal Type: Type of Animal
Sex Upon Outcome: Animal’s biological sex and reproductive status at the time of outcome
Age Upon Outcome: Animal’s age upon outcome
Breed: Breed of the animal
Color: Color of the animal
Part 1 (8 points): Your objective is to perform exploratory data analysis on the dataset. Complete the following:
Identify the shape and size of the raw data (1 point)
Get information about datatypes. Comment if any of the variables need datatype conversion. Check for duplicate rows and treat them. (1 point)
Identify missing data and/or invalid values and treat them with a suitable method (mean, median, mode, or other) (1 point)
Visualize the dataset using several univariate analyses and comment on your observations (2 points)
Drop duplicate rows and irrelevant columns. (1 point)
Convert all data to numeric and/or categorical data types. Hint: Make the Age Upon Outcome column a float by converting all values to a single unit, for example days. (1 point)
Perform one-hot encoding on categorical variables (1 point)
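The cleaning steps listed in Part 1 might be sketched as follows on a few made-up rows. This is only an illustration, not a reference solution: the toy values, the `AgeDays` column name, the `age_to_days` helper, the approximate conversion factors (30 days per month, 365 per year), the median fill, and the assumption that ages are written like "2 years" or "3 weeks" are all hypothetical and should be checked against the real data.

```python
import pandas as pd

# Toy rows standing in for the real dataset; all values are made up.
df = pd.DataFrame({
    "Animal Type": ["Dog", "Cat", "Dog", "Cat", "Dog"],
    "Age Upon Outcome": ["2 years", "3 weeks", "2 years", None, "5 months"],
    "OutcomeType": ["Adoption", "Transfer", "Adoption", "Transfer", "Adoption"],
})

print(df.shape)               # (rows, columns)
print(df.dtypes)              # text columns typically need conversion
print(df.duplicated().sum())  # number of fully duplicated rows
print(df.isna().sum())        # missing values per column

# Drop exact duplicate rows and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)

# Assumed conversion factors; months and years are approximations.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}

def age_to_days(text):
    """Convert a string such as '2 years' to a float number of days."""
    if pd.isna(text):
        return float("nan")
    number, unit = text.split()
    return float(number) * DAYS_PER_UNIT[unit.rstrip("s")]

df["AgeDays"] = df["Age Upon Outcome"].apply(age_to_days)

# One possible treatment for missing ages: fill with the median.
df["AgeDays"] = df["AgeDays"].fillna(df["AgeDays"].median())

# One-hot encode a categorical predictor, leaving the target column as is.
encoded = pd.get_dummies(
    df.drop(columns=["Age Upon Outcome"]), columns=["Animal Type"]
)
print(encoded.head())
```

Whether the median is the right fill value, and which columns count as irrelevant, depends on what the exploratory analysis of the actual data shows.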
Part 2 (7 points): Fit classification models on the data to predict the outcome type (OutcomeType):
First, drop the Breed column, as it will complicate the analysis here in Part 2.
Split the data into training and test datasets. Make sure your split is reproducible and that it maintains roughly the proportion of each class of dependent variable. (1 point)
Perform classification to predict OutcomeType (4 points)
K-Nearest Neighbor Classifier (1 point)
K-Nearest Neighbor Classifier using Grid search CV (2 points)
Linear classification (1 point)
Print a report showing accuracy, recall, precision, and f1-score for each classification model. Which metric is most important for this problem? (You will explain your answer in the report in Part 3.) (2 points)
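The Part 2 workflow could look roughly like the sketch below, which uses scikit-learn on synthetic data rather than the shelter dataset. Everything specific here is an assumption made for illustration: the synthetic features from `make_classification`, the 75/25 split, the `random_state` values, the `n_neighbors` grid, the use of accuracy as the grid search score, and the choice of `LogisticRegression` as the linear classifier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded features and the binary target.
X, y = make_classification(
    n_samples=400, n_features=6, weights=[0.7, 0.3], random_state=0
)

# random_state makes the split reproducible; stratify=y keeps the
# class proportions roughly equal in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Plain K-nearest neighbors with a fixed number of neighbors.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# K-nearest neighbors tuned with 5-fold cross-validated grid search.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7, 9, 11]},
    cv=5,
    scoring="accuracy",  # placeholder; pick the metric that fits the problem
)
grid.fit(X_train, y_train)

# One example of a linear classifier.
linear = LogisticRegression(max_iter=1000).fit(X_train, y_train)

models = {
    "KNN": knn,
    "KNN (grid search)": grid.best_estimator_,
    "Logistic regression": linear,
}
for name, model in models.items():
    print(name)
    # Reports precision, recall, f1-score, and accuracy on the test set.
    print(classification_report(y_test, model.predict(X_test)))
```

Because K-nearest neighbors relies on distances, features on very different scales (such as age in days next to 0/1 indicator columns) may need scaling before fitting; that step is omitted here for brevity.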
Part 3 (5 points): Submit a 2-page report with the following:
What did you do to prepare the data?
What insights did you get from your data preparation?
What procedure did you use to train the model?
How well does the model predict the class?
How confident are you in the model?
Submission Guidelines: Part 1 and Part 2 should be submitted as one notebook file. Part 3 should be submitted as a PDF file. Both files should be committed to a personal GitHub repo.
To submit your project, send an email with the following information:
Subject: COE 379L Project 1 Submission
To: jstubbs@tacc.utexas.edu, ajamthe@tacc.utexas.edu, shukai.cai@utexas.edu
Body: Please include the following:
1) GitHub Repo Link
2) Any other details needed to access the repository (e.g., file locations)
Please make sure the repository is either public or shared with the following GitHub accounts:
Joe Stubbs, GitHub account: joestubbs
Anagha Jamthe, GitHub account: ajamthetacc
Shukai Cai, GitHub account:
Projects will be considered late if an email is not received by the due date. We will reply with an acknowledgement that we received the email and were able to pull the GitHub repo. I recommend that everyone create the git repository, either share it with us or make it public, and then send us the email above ASAP.
Evaluation: We will git pull all repos on the due date at or after 5 pm. This is the version of your submission that we will evaluate unless we receive a message that you would like an extension (with a 1 point per day penalty).