Do more people ride the subway when it's raining outside than when it isn't? The answer isn't as obvious as you might think. This project involved wrangling real NYC Subway data, then analyzing it using statistical methods and data visualization.
You can read my analysis here, or look at the code used to create it. This project was completed as part of the Udacity Data Analyst Nanodegree.
Skills used: Python, NumPy, Pandas, PandasSQL, SQL, ggplot, linear regression, gradient descent, MapReduce.
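To give a flavor of the techniques listed above, here is a minimal sketch of fitting a linear regression by batch gradient descent with NumPy. The data is a made-up toy example, not the subway dataset, and the function is illustrative rather than the project's actual code:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.5, num_iters=2000):
    """Fit linear-regression weights by batch gradient descent."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        predictions = X @ theta          # current model output
        errors = predictions - y
        theta -= alpha / m * (X.T @ errors)  # step along the negative gradient
    return theta

# Toy example: recover y = 2 + 3x from noiseless data.
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])  # prepend an intercept column
y = 2 + 3 * x
theta = gradient_descent(X, y)
```

With enough iterations, `theta` converges to roughly `[2, 3]`, recovering the intercept and slope.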
Data analysis is great, but it usually takes some work to get real data into a format that can actually be analyzed. This project involved applying data munging (or wrangling/cleaning) techniques to OpenStreetMap data for an area of my choice. These techniques included assessing the quality of the data for validity, accuracy, completeness, consistency, and uniformity, and then correcting the issues identified. Finally, all of the cleaned data was imported into MongoDB and queried to gain insights about the area. If you've always wondered what the most popular fast food restaurant in San Diego is, read on!
You can read my analysis here, or look at the code used to create it. This project was completed as part of the Udacity Data Analyst Nanodegree.
Skills used: Python, MongoDB, XML, ElementTree, JSON.
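The general shape of this kind of cleaning pipeline can be sketched as follows: parse OSM XML with ElementTree, correct abbreviated street names, and shape each node into a JSON-ready document for MongoDB. The street-name mapping, document structure, and inline sample are all hypothetical stand-ins, not the project's actual code or data:

```python
import xml.etree.ElementTree as ET

# Hypothetical street-name corrections of the kind found during auditing.
STREET_MAPPING = {"St": "Street", "Ave": "Avenue", "Rd": "Road"}

def clean_street_name(name):
    """Expand a trailing abbreviated street type, e.g. 'Main St' -> 'Main Street'."""
    parts = name.split()
    if parts and parts[-1] in STREET_MAPPING:
        parts[-1] = STREET_MAPPING[parts[-1]]
    return " ".join(parts)

def shape_element(elem):
    """Convert an OSM <node> element into a JSON-ready dict."""
    if elem.tag != "node":
        return None
    doc = {"id": elem.get("id"), "type": "node",
           "pos": [float(elem.get("lat")), float(elem.get("lon"))]}
    for tag in elem.iter("tag"):
        key, value = tag.get("k"), tag.get("v")
        if key == "addr:street":
            doc.setdefault("address", {})["street"] = clean_street_name(value)
        elif ":" not in key:
            doc[key] = value
    return doc

# Tiny inline sample standing in for the real OSM extract.
sample = """<osm>
  <node id="1" lat="32.7" lon="-117.16">
    <tag k="addr:street" v="Main St"/>
    <tag k="amenity" v="fast_food"/>
  </node>
</osm>"""

docs = [d for d in (shape_element(e) for e in ET.fromstring(sample)) if d]
```

The resulting list of dicts can be written out as JSON and bulk-imported into MongoDB (e.g. with `mongoimport`).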
Enron was the source of one of the largest corporate fraud cases in history. For this project, machine learning algorithms were used to exploit relationships and patterns in email and financial data from the Enron fraud case to identify persons of interest (POIs). This involved determining the best information (or features) to use in the financial and email data, testing/training four different machine learning algorithms, optimizing their performance, and selecting the best algorithm.
To find out which algorithm performed the best, you can read my analysis here, or look at the code used to create it. This project was completed as part of the Udacity Data Analyst Nanodegree.
Skills used: Python, Machine Learning, Naive Bayes, Decision Tree, Support Vector Machine (SVM), Scikit-Learn, Feature Selection, Cross-Validation, Algorithm Evaluation/Selection, Text-Learning, NLTK.
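A stripped-down version of the algorithm-comparison step might look like the sketch below: score several scikit-learn classifiers with cross-validation and compare. The data here is synthetic (`make_classification`), not the Enron financial and email features, and the real project evaluated precision and recall for POI identification rather than plain accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Synthetic stand-in for per-person features (the real project used
# financial and email-derived features for each Enron employee).
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           random_state=42)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM": SVC(kernel="rbf"),
}

# Mean 5-fold cross-validated accuracy for each candidate algorithm.
scores = {name: cross_val_score(clf, X, y, cv=5).mean()
          for name, clf in classifiers.items()}
```

The same loop extends naturally to parameter tuning (e.g. `GridSearchCV`) and to scoring metrics better suited to imbalanced classes.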
Roughly 1600 different Vinho Verde red wines from Portugal were measured for their chemical properties (such as alcohol content, residual sugar, density, etc.), along with a blind "quality" rating from three wine experts. For this project, R was used to explore the various relationships in the dataset and determine which chemical properties had the biggest influence on wine quality. All of the analysis was documented stream-of-consciousness style, in order to explain the thought process used along the way.
To find out which chemical properties had the biggest influence on quality, you can read my analysis here, or look at the code used to create it. This project was completed as part of the Udacity Data Analyst Nanodegree.
Skills used: R, Data Analysis, Data Visualization, Multivariate Analysis, Statistics.
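The project itself was done in R, but the core question — which chemical property correlates most strongly with quality — can be sketched in Python with pandas. The data below is made up so that quality tracks alcohol; it is not the real wine measurements:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Made-up stand-in for the wine dataset; quality is tied to alcohol by construction.
alcohol = rng.normal(10.5, 1.0, n)
residual_sugar = rng.normal(2.5, 0.5, n)
quality = 3 + 0.4 * alcohol + rng.normal(0, 0.5, n)

wines = pd.DataFrame({"alcohol": alcohol,
                      "residual_sugar": residual_sugar,
                      "quality": quality})

# Rank chemical properties by strength of correlation with quality.
corrs = wines.corr()["quality"].drop("quality").abs().sort_values(ascending=False)
```

Ranking absolute correlations like this is a quick first pass; the full analysis also used bivariate and multivariate plots to check whether the relationships hold up visually.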
Since 2003, airline carriers that generate at least 1% of total domestic revenue have been required by the US Department of Transportation (DOT) to report on-time data for domestic flights.
To find out which airlines performed the best in 2014, you can check out my data visualization here, or read some background about the design rationale used and how the design iterated based on user feedback. This project was completed as part of the Udacity Data Analyst Nanodegree.
Skills used: Data Visualization, Data Analysis, D3, dimple.js.
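The visualization itself was built with D3/dimple.js, but the underlying aggregation can be sketched in Python with pandas. The sample records below are invented; the real dataset covers 2014 domestic flights. The DOT convention is that a flight arriving less than 15 minutes late counts as on time:

```python
import pandas as pd

# Made-up sample of DOT-style on-time records (arrival delay in minutes).
flights = pd.DataFrame({
    "carrier": ["AA", "AA", "DL", "DL", "WN", "WN"],
    "arr_delay": [5, 40, -3, 10, 0, 22],
})

# DOT counts a flight as on time if it arrives less than 15 minutes late.
flights["on_time"] = flights["arr_delay"] < 15
on_time_rate = flights.groupby("carrier")["on_time"].mean()
```

A per-carrier table like `on_time_rate` is exactly the kind of summary that feeds a bar chart in dimple.js.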