Data Incubator Challenge

This repo was created for the Data Incubator Challenge problems.

Introduction

The Office of Defects Investigation (ODI) of the Department of Transportation (DOT) maintains a dataset of vehicle defect investigation recall records going back to 1972. The dataset contains variables such as Vehicle Make, Model, Model Year, Defect Component, Manufacturer, Case Open Date, and Close Date. The goal of this project is to study the recall events by each variable to see which ones have the greatest effect on whether a vehicle ends up in a defect investigation recall. Building on this knowledge of historical recall records, the project will be finalized as an interactive application that gives vehicle buyers recommendations regarding safety considerations.

Preliminary Analysis

Two plots were made to get a first look at the data. Both were generated with the R package "plotly".

Plot 1. Number of Defected Models by Make (Click the graph for interactive view)

This plot shows the number of defective Models for the top 10 Makes. The interactive feature of Plotly allows the viewer to select a Make of interest by clicking the corresponding legend entry. Once a Make is chosen, the plot shows data for only that Make, in descending order. Because some Model names are shared by multiple Makes, the plot looks messy when all Makes are shown together. In future steps, the Models should be grouped into broader classes, e.g. sedan, SUV, etc., to remove this confusion.
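The aggregation behind a plot like this can be sketched in a few lines. The original plots were made with R and plotly; the snippet below is only an illustrative stand-in in plain Python, and it reflects one possible reading of the plot (record counts per Model within each of the top Makes). The field names `make` and `model`, the function `model_counts_for_top_makes`, and the sample rows are all hypothetical and not taken from the ODI data. Counting on the (Make, Model) pair rather than on the Model name alone keeps shared Model names from being merged across Makes.

```python
from collections import Counter

# Hypothetical sample rows; the real ODI export and its column names may differ.
records = [
    {"make": "MAKE_A", "model": "MODEL_X"},
    {"make": "MAKE_A", "model": "MODEL_X"},
    {"make": "MAKE_A", "model": "MODEL_Y"},
    {"make": "MAKE_B", "model": "MODEL_X"},  # same Model name under another Make
    {"make": "MAKE_B", "model": "MODEL_Z"},
    {"make": "MAKE_B", "model": "MODEL_Z"},
    {"make": "MAKE_B", "model": "MODEL_Z"},
    {"make": "MAKE_C", "model": "MODEL_W"},
]


def model_counts_for_top_makes(rows, n_makes=10):
    """Count records per (Make, Model) for the n_makes Makes with the most records."""
    make_totals = Counter(row["make"] for row in rows)
    top_makes = [make for make, _ in make_totals.most_common(n_makes)]

    # Key on the (make, model) pair so shared Model names stay separate.
    pair_counts = Counter((row["make"], row["model"]) for row in rows)

    result = {}
    for make in top_makes:
        per_model = [
            (model, count)
            for (pair_make, model), count in pair_counts.items()
            if pair_make == make
        ]
        # Descending by count, then by Model name to break ties predictably.
        per_model.sort(key=lambda item: (-item[1], item[0]))
        result[make] = per_model
    return result


counts = model_counts_for_top_makes(records, n_makes=2)
for make, per_model in counts.items():
    print(make, per_model)
```

Each entry in `counts` corresponds to what the viewer would see after isolating a single Make in the legend.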

Plot 2. Number of Defected Components by Manufacturer (Click the graph for interactive view)

This plot shows the number of defective Components for the top 10 Manufacturers. The interactive feature of Plotly allows the viewer to select a Manufacturer of interest by clicking the corresponding legend entry. Once a Manufacturer is chosen, the plot shows data for only that Manufacturer, in descending order. Because Component names are shared across Manufacturers, the plot looks messy when all Manufacturers are shown together. More plots will be made to compare different Manufacturers side by side.
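One way to prepare data for such a side-by-side comparison is to pivot the records into a Component-by-Manufacturer table of counts. This is a minimal sketch in plain Python, not the R code used for the plots; the field names `manufacturer` and `component`, the function `component_table`, and the sample rows are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical sample rows; real field names and values may differ.
component_records = [
    {"manufacturer": "MFR_A", "component": "AIR BAGS"},
    {"manufacturer": "MFR_A", "component": "AIR BAGS"},
    {"manufacturer": "MFR_A", "component": "BRAKES"},
    {"manufacturer": "MFR_B", "component": "BRAKES"},
    {"manufacturer": "MFR_B", "component": "STEERING"},
]


def component_table(rows, manufacturers):
    """Map each Component to a list of counts, one per Manufacturer, in the given order."""
    pair_counts = Counter((row["component"], row["manufacturer"]) for row in rows)
    components = sorted({row["component"] for row in rows})
    # A Counter returns 0 for pairs that never occur, so gaps show up as zeros.
    return {
        component: [pair_counts[(component, mfr)] for mfr in manufacturers]
        for component in components
    }


table = component_table(component_records, ["MFR_A", "MFR_B"])
for component, row in table.items():
    print(component, row)
```

Each row lines up the same Component across Manufacturers, which is the shape a grouped bar chart or a set of small side-by-side plots would need.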

Future Steps

Similar to the two plots shown above, other plots could be generated between different variables to gain a better understanding of the data. Some examples are:

Given a specific defective Component, what is the average case management time for each Manufacturer?
Given a specific Year, what are the top 10 Makes, Models, and Manufacturers?

More relevant datasets are also worth digging into.
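The two example questions above could be answered along the following lines. This is a hedged sketch in plain Python rather than the project's R code. It assumes that "case management time" means the number of days between Case Open Date and Close Date, and that "Year" refers to the year a case was opened; both are interpretations, not definitions from the dataset. The field names, the functions `average_case_days` and `top_makes_in_year`, and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical sample rows; real field names and values may differ.
cases = [
    {"manufacturer": "MFR_A", "make": "MAKE_A", "component": "BRAKES",
     "opened": date(2010, 1, 1), "closed": date(2010, 1, 31)},
    {"manufacturer": "MFR_A", "make": "MAKE_A", "component": "BRAKES",
     "opened": date(2010, 3, 1), "closed": date(2010, 3, 11)},
    {"manufacturer": "MFR_B", "make": "MAKE_B", "component": "BRAKES",
     "opened": date(2010, 6, 1), "closed": date(2010, 6, 21)},
    {"manufacturer": "MFR_B", "make": "MAKE_B", "component": "STEERING",
     "opened": date(2011, 2, 1), "closed": None},  # still open, no duration
]


def average_case_days(rows, component):
    """Average open-to-close duration in days per Manufacturer for one Component."""
    durations = defaultdict(list)
    for row in rows:
        # Skip other Components and cases that have no close date yet.
        if row["component"] == component and row["closed"] is not None:
            days = (row["closed"] - row["opened"]).days
            durations[row["manufacturer"]].append(days)
    return {mfr: sum(days) / len(days) for mfr, days in durations.items()}


def top_makes_in_year(rows, year, n=10):
    """The n Makes with the most cases opened in the given year."""
    makes = Counter(row["make"] for row in rows if row["opened"].year == year)
    return makes.most_common(n)


brake_averages = average_case_days(cases, "BRAKES")
makes_2010 = top_makes_in_year(cases, 2010)
print(brake_averages)
print(makes_2010)
```

The same counting pattern in `top_makes_in_year` would apply to Models or Manufacturers by swapping the field being counted.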