...
Project Overview This project proposes data analysis of flight data for the year 2007. This analysis encompasses critical parameters such as flight delays, cancellations and the factors influencing these events. By examining this data, we can identify trends, patterns and anomalies. The 2007 dataset is particularly valuable as it captures a unique historical perspective. Utilizing statistical techniques and data visualization tools, this analysis aims to enhance our understanding of flight dynamics and contribute to better decision-making in the industry. Commercial Airlines Outlook 2007 In 2007, commercial airlines in the USA faced numerous challenges, significantly impacting their operations and profitability. Rising fuel costs were a major concern, squeezing profit margins and leading to increased ticket prices. Additionally, heightened security measures post-9/11 continued to affect passenger convenience and operational efficiency. The industry also grappled with fluctuating demand, partly due to economic uncertainty and competition from low-cost carriers. Weather-related disruptions and airport congestion further contributed to delays and cancellations, straining customer satisfaction. These challenges necessitated strategic adaptations, including cost-cutting measures, improved fuel efficiency and enhanced customer service to navigate the turbulent landscape of the aviation industry. Code and Resources Used Dataset Information Flight Data 2007 Table Airport Attributes Table Carrier Attributes Table Plane Attributes Table Data Import To Hadoop File And Hive Database The 2007.csv contains huge dataset that consumed approximately over 686.5MB. Thus exceeding the maximum file handling for Putty and Hortonworks Sandbox HDP 2.6.5.0. However, we can use Command Prompt to upload the 2007.csv into Putty by using SCP client for command line secure file copy pscp.exe which is downloadable from https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html. The dataset can be downloaded by putting together the command below: Command Prompt PuTTY Commands Similarly we can repeat the steps above to load datasets for airports.csv, carriers.csv and plane-data.csv into Hadoop. Now, that the dataset is loaded in HDP, we can import the datasets into Hive Database using Query Editor as below: Image 1: Screenshoot to locate Hadoop file directory. Once the HDP is ready, we can locate the file directory from the File Preview. This directory will be use to upload into Hive Database as shown in the figure below: Image 2: Screenshot of upload table in Hive HDP. Similarly we can repeat the steps above to load datasets into Hive Database for airports.csv, carriers.csv and plane-data.csv. Now, that the dataset is loaded in Hive, we can overview and explore the datasets using Query Editor. First, we combine the datasets by identifying the most relevant variables. This will give more insights of the data. We can join the selected variables using the command below: Image 3: Query to join all four tables for better insights. The query above will generate huge dataset exceeding 1.3GB and data this big will not ideal for local data processing. Thus we will carefully select the suitable variables that could contribute to flight delays and cancellations. Now, we will execute query as below: Image 4: Query to extract delays in days of month. Image 5: Query for to extract delays during the weeks. Image 6: Query to extract delays over the months. Image 7: Query to extract primary factors for flight delays and cancellations. Image 8: Query to extract factors for flight cancellations. Image 9: Query to extract most frequent delays and flight cancellations. 🔶 Insights: From the figure above, we can observe that the least delayed flights during the day were in the time intervals of 0000-0259 hour and 0300-0559 hour. This early morning flights have the lowest delays. 🔶 Insights: From the Heatmap of Departure Delays by Time Interval and Day of Week, we can observe that the least delayed flights during the week were on Wednesday in the time intervals 0000-0259. 🔶Insights: From the Heatmap of Departure Delays by Time Interval and Month, we can observe that the least delayed flights during the year was on the month of February 2007 in the time intervals 0000-0259. The interactive map above gives the view of the first 100,000 delays among the operated airports. The blue dots explains lesser delays while yellow dots have more delays of approximately closer to 60,000 delays in a year. From the table below, we can observe that the most significant factor, with a total of 22,091,176 minutes of delay is the caused from Late Aircraft. Delays caused by a late incoming aircraft often cascade to affect the departure times of subsequent flights operated by the same aircraft. Meanwhile Carrier delays, which total 16,590,482 minutes, typically include factors such as maintenance or ground problems. These are directly controlled by the air carrier. Additionally, National Airspace System (NAS) includes delays of 16,535,622 minutes that are due to non-extreme weather conditions, airport operations, heavy traffic volume, and air traffic control. Weather-related delays accounted for 3,408,203 minutes. These delays are due to significant meteorological conditions (such as storms or fog) that affect flight operations and safety. Lastly, security delays are the least common cause, with 100,274 minutes of delay. This includes delays due to evacuations of terminals, re-boarding of aircraft due to security breaches, inoperative screening equipment and long lines at screening areas. 🔶Insights: The table above shows that late aircraft delays were the highest cause of delay, totalling 22,091,176 minutes. This was followed by delays caused by the carriers, totalling 16,590,482 minutes. 🔶Upon reviewing the table above, it is apparent that the top 5 highest flight cancellations are not attributed to weather conditions, late aircraft delays, NAS delays, or flight diversions. Another potential factor contributing to flight cancellations could be low passenger bookings, which may lead airlines to cancel flights. Additionally, technical issues with the flights, not included in the dataset, might also contribute to these cancellations. Performing Multiple Linear Regression We can perform multiple linear regression to identify coefficients among independent variables in the table above. Ideally, this regression will determine the response of variables to the dependent variable as the MLR model below: From there, we are able to develop a multiple linear regression model to determine response of dependent variable to to the independent variables as follow: 🔶Insights: Based on the data provided in the table, it is evident that Expressjet Airlines flight number 2996 experienced the highest frequency of delays, accumulating a total of 20,048 minutes of delay. Following closely behind is Alaska Airlines with flight number 64, which accumulated 19,063 minutes of delay. Additionally, other flights operated by Expressjet Airlines also ranked among the top 5 most frequently delayed flights, including flight number 2393 with 18,001 minutes of delay and flight number 2782 with 17,997 minutes of delay. In a separate analysis, it was observed that Alaska Airlines flight number 64 was cancelled 31 times over the course of the year, making it the flight with the highest number of cancellations. Choosing Airliners When selecting an airline, consider to review on-time performance, cancellation policies for each the choosen routes. Based on the analysis above, we can use the findings to choose wisely for a better journey. Bargraph below shows that number of flights offered by each airliners. Route Selections Choosing the right flight route is crucial for enhancing your travel experience. Opt for ideal flights to avoid potential delays or possibly cancellations. Seasonal weather patterns should also influence your decision, as certain routes might be prone to weather-related disruptions. When planning your travel, it's essential to compare airlines based on their service records, which include their punctuality and cancellations policy. Additionally, you might need to consider the flights frequency for the airlines as this will ensure you will get on the next flight whenever cancellations occured. Inconclusion, research different airlines for the best service records and convenience on your chosen route.Flight Delays And Cancellations In 2007¶
Attribute
Type
Attribute
Type
year
string
crselapsedtime: Estimated Elapsed time, in minutes
int
month: 1-12
string
actualelapsedtime: Actual Elapsed time, in minutes
int
dayofmonth: 1-31
string
origin: Origin IATA airport code
string
dayofweek: 1 (Monday) - 7 (Sunday)
string
dest: Destination IATA airport code
string
crsdeptime: scheduled departure time (local, hhmm)
string
distance: Distance between airports in miles
int
deptime: actual departure time (local, hhmm)
string
taxiin: Wheels down and arrival at the destination airport gate, in minutes
int
crsarrtime: scheduled arrival time (local, hhmm)
string
taxiout: The time elapsed between departure from the origin airport gate and wheels off, in minutes
int
arrtime: actual arrivaltime (local, hhmm)
string
cancelled: was the flight cancelled?
bool
uniquecarrier: Carrier Code
string
cancellationcode: reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
string
flightnum: Flight Number
string
diverted: 1 = yes, 0 = no
bool
tailnum: Plane Tail Number
string
carrierdelay: Delay (in minutes) caused by Carrier
int
airtime: flight time in minutes
int
weatherdelay: Delay caused by weather, in minutes
int
arrdelay: difference in minutes between scheduled and actual arrival time. Early arrivals show negative numbers, in minutes
int
nasdelay: National Air System Delay, in minutes
int
depdelay: Difference in minutes between scheduled and actual departure time. Early departures show negative numbers, in minutes
int
securitydelay: Security Delay, in Minutes
int
lateaircraftdelay: Late Aircraft Delay, in Minutes
int
Attribute
Type
Attribute
Type
airport: Airport Name
string
lat: Airport Latitude
string
city: Airport City
string
long: Airport Longitude
string
country: Airport Country
string
state: Airport State
string
iata: Airport Code
string
Attribute
Type
Attribute
Type
code: Carrier Code
string
description: Carrier name
string
Attribute
Type
Attribute
Type
tailnum: Aircraft Tail Number
string
status: Status
string
type: Usage Type
string
aircraft_type: Aircraft Type
string
manufacturer: Manufacturer Company
string
engine_type: Engine Type
string
issue_date: Date Start Operating
string
year: Manufacture Year
int
model: Aircraft Model
string
Microsoft Windows [Version 10.0.22631.3593]
(c) Microsoft Corporation. All rights reserved.
C:\Users\Name>cd C:\ProgramData\Microsoft\Windows\Start Menu\Programs\PuTTY (64-bit)
C:\ProgramData\Microsoft\Windows\Start Menu\Programs\PuTTY (64-bit)>pscp -P 2222 "C:\Users\Name\Desktop\Data Management\2007.csv" maria_dev@127.0.0.1:2007.csv
hadoop fs -mkdir flight_data_2007
hadoop fs -put 2007.csv flight_data_2007
from IPython.display import display, Image
display(Image(filename=r'C:\Users\KEMAS\Desktop\Msc DS Analytics\Data Management\Assignment 2\locatedirectory.png'))
Image(filename=r"C:\Users\KEMAS\Desktop\Msc DS Analytics\Data Management\Assignment 2\uploadhivedatabase.png")
Image(filename=r"C:\Users\KEMAS\Desktop\Msc DS Analytics\Data Management\Assignment 2\JoinTables.png")
1. Optimal Times In Day, Week And Year For Minimizing Flight Delays¶
Image(filename=r"C:\Users\KEMAS\Desktop\Msc DS Analytics\Data Management\Assignment 2\Query1-DayofMonth.png")
Image(filename=r"C:\Users\KEMAS\Desktop\Msc DS Analytics\Data Management\Assignment 2\Query1-DayofWeek.png")
Image(filename=r"C:\Users\KEMAS\Desktop\Msc DS Analytics\Data Management\Assignment 2\Query1-Month.png")
Image(filename=r"C:\Users\KEMAS\Desktop\Msc DS Analytics\Data Management\Assignment 2\Query2-PrimaryFactorsContributingDelays.png")
Image(filename=r"C:\Users\KEMAS\Desktop\Msc DS Analytics\Data Management\Assignment 2\Query3-FactorsCancellations.png")
Image(filename=r"C:\Users\KEMAS\Desktop\Msc DS Analytics\Data Management\Assignment 2\Query4-MostDelays&CancelledFlights.png")
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
day_of_month = pd.read_csv(r"C:\Users\KEMAS\Desktop\Msc DS Analytics\Data Management\Assignment 2\Query1-DayofMonth.csv")
day_of_week = pd.read_csv(r"C:\Users\KEMAS\Desktop\Msc DS Analytics\Data Management\Assignment 2\Query1-DayofWeek.csv")
month_year = pd.read_csv(r"C:\Users\KEMAS\Desktop\Msc DS Analytics\Data Management\Assignment 2\Query1-Month.csv")
primary_delays = pd.read_csv(r"C:\Users\KEMAS\Desktop\Msc DS Analytics\Data Management\Assignment 2\Query2-PrimaryFactorsContributingDelays.csv")
factor_cancelled = pd.read_csv(r"C:\Users\KEMAS\Desktop\Msc DS Analytics\Data Management\Assignment 2\Query3-FactorsforCancellations.csv")
most_delays = pd.read_csv(r"C:\Users\KEMAS\Desktop\Msc DS Analytics\Data Management\Assignment 2\Query4-FlightsWithMostDelays&Cancellations.csv")
The Heatmap Day of Month¶
heatmap_data = day_of_month.pivot_table(
index='crsdeptime_interval',
columns='dayofmonth',
values='count_depdelay_status',
aggfunc='count'
)
f, ax = plt.subplots(figsize=(20, 16))
sns.heatmap(heatmap_data, annot=True, fmt='d', cmap='coolwarm', linewidths=.75, ax=ax)
plt.title('Heatmap of Departure Delays by Time Interval and Day of Month')
plt.xlabel('Day of Month')
plt.ylabel('CRS Departure Time Interval')
plt.show()
The Heatmap Day of Week¶
heatmap_data = day_of_week.pivot_table(
index='crsdeptime_interval',
columns='dayofweek',
values='count_depdelay_status',
aggfunc='count'
)
f, ax = plt.subplots(figsize=(20, 16))
sns.heatmap(heatmap_data, annot=True, fmt='d', cmap='coolwarm', linewidths=.75, ax=ax)
plt.title('Heatmap of Departure Delays by Time Interval and Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('CRS Departure Time Interval')
plt.show()
The Heatmap of Month¶
heatmap_data = month_year.pivot_table(
index='crsdeptime_interval',
columns='month',
values='count_depdelay_status',
aggfunc='count'
)
f, ax = plt.subplots(figsize=(20, 16))
sns.heatmap(heatmap_data, annot=True, fmt='d', cmap='coolwarm', linewidths=.75, ax=ax)
plt.title('Heatmap of Departure Delays by Time Interval and Month')
plt.xlabel('Month')
plt.ylabel('CRS Departure Time Interval')
plt.show()
Interactive Map of Airports Delays¶
import plotly.express as px
fig = px.scatter_mapbox(month_year.head(100000), lat='originlat', lon='originlong', color='count_depdelay_status', hover_data= {'originairport': True, 'count_depdelay_status': True},
color_continuous_scale=px.colors.sequential.Viridis, mapbox_style='carto-positron', size_max=15, zoom=10)
fig.update_layout(autosize=True, mapbox_style='open-street-map', mapbox=dict(center=dict(lat=38, lon=-97), pitch=0, zoom=3))
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
2. Primary Factors Contributing Flight Delays¶
ax = sns.barplot(primary_delays, x="reason", y="totaltime", estimator="sum", errorbar=None)
ax.bar_label(ax.containers[0], fontsize=10)
plt.title('Bargraph of Reasons for Flight Delays');
primary_delays
reason
totaltime
0
Late Aircraft
22091176
1
Carrier
16590482
2
NAS
16535622
3
Weather
3408203
4
Security
100274
3. Factors Leading To Flight Cancellations¶
factor_cancelled.head()
month
flightnum
carrierdescription
destairport
originairport
diverted
weatherdelay
nasdelay
securitydelay
lateaircraftdelay
cancelled
total_cancellations
0
1
33
Alaska Airlines Inc.
Ted Stevens Anchorage International
Kodiak
0
0
0
0
0
1
13
1
1
32
Alaska Airlines Inc.
Kodiak
Ted Stevens Anchorage International
0
0
0
0
0
1
13
2
1
7098
Mesa Airlines Inc.
Denver Intl
Aspen-Pitkin Co/Sardy
0
0
0
0
0
1
11
3
1
4825
Atlantic Southeast Airlines
William B Hartsfield-Atlanta Intl
Erie Intl
0
0
0
0
0
1
10
4
1
7115
Mesa Airlines Inc.
Chicago O'Hare International
Austin Straubel International
0
0
0
0
0
1
10
from sklearn.linear_model import LinearRegression
X = factor_cancelled[['diverted', 'weatherdelay', 'nasdelay', 'securitydelay', 'lateaircraftdelay', 'flightnum', 'carrierdescription', 'originairport']].astype(bool)
y = factor_cancelled['total_cancellations']
model = LinearRegression()
model.fit(X, y)
print('Coefficients:', model.coef_)
print('Intercept:', model.intercept_)
Coefficients: [-0.06378787 -0.02004822 -0.04668883 -0.02334459 -0.04029519 0.
0. 0. ]
Intercept: 0.0637878689406704
4. Most Frequent Delays And Cancellations¶
most_delays.head()
flightnumber
carrier
totaldelayminutes
totalcancellations
0
2996
Expressjet Airlines Inc.
20048.0
0
1
64
Alaska Airlines Inc.
19063.0
31
2
2393
Expressjet Airlines Inc.
18001.0
0
3
2782
Expressjet Airlines Inc.
17997.0
0
4
44
Southwest Airlines Co.
17584.0
0
Recommendations¶
data=pd.read_csv(r"C:\Users\KEMAS\Desktop\Msc DS Analytics\Data Management\Assignment 2\delays&cancelled2007.csv")
fig, ax = plt.subplots(figsize=(10, 10))
airlines_ordered = (data["carrierdescription"].value_counts(ascending=True) / 1000).plot(
kind="barh", ax=ax, width=1, edgecolor="black"
)
ax.set_title("Number of Flights")
ax.set_xlabel("Number of Flights x 1,000");
Conclusion¶