Summary. In this project a messy dataset of airline flights is cleaned with the Python pandas library so that it can be used for further analysis.
Skills used:
The data analysed is from Kaggle [1]. Loading the data with pandas [2], we can see that each row is an individual flight and the 31 columns are variables describing that flight.
There are a total of 1,191,805 entries. There are quite a number of things to notice straight away:
# Get an idea of how much missing data there is
columns = data.columns
percentage_missing = data.isnull().sum() * 100 / len(data)
table_percentage_missing = pd.DataFrame({'column_name': columns,
'percentage_missing': percentage_missing})
table_percentage_missing
CreateTableHTML('./figures/',table_percentage_missing,'table2')
We can see there is quite a lot of missing data across a range of columns. Some of it may be possible to fill in after further investigation, but we will start by addressing a few of the issues already identified.
def getTime(t):
    # Normalise a raw HHMM string (e.g. '930' or '1455') to 'HH:MM:SS'.
    t = t.zfill(4)      # pad short values such as '930' -> '0930'
    if t == "2400":     # 2400 is used for midnight; map it to 0000
        t = "0000"
    t = t + '00'        # append seconds
    return t[:2] + ':' + t[2:4] + ':' + t[4:]
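As a quick sanity check, the normalisation rules can be exercised standalone (the helper is restated here, under a different name, so the snippet runs on its own):

```python
def get_time(t):
    # Pad to four digits, map the 2400 midnight convention to 0000,
    # and format as HH:MM:SS.
    t = t.zfill(4)
    if t == "2400":
        t = "0000"
    return t[:2] + ':' + t[2:] + ':00'

print(get_time("2400"))  # 00:00:00
print(get_time("930"))   # 09:30:00
print(get_time("1455"))  # 14:55:00
```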
# Convert the raw HHMM time columns to full datetimes by combining them
# with FLIGHTDATE. The same four steps apply to each column.
time_cols = ['CRSARRTIME', 'ARRTIME', 'DEPTIME', 'CRSDEPTIME', 'WHEELSOFF', 'WHEELSON']
for col in time_cols:
    data[col] = data[col].apply(lambda x: str(x) if not pd.isnull(x) else x)
    data[col] = data[col].str.replace(r'\.0$', '', regex=True)  # strip float artefacts like '930.0'
    data[col] = data[col].map(getTime, na_action='ignore')
    data[col] = pd.to_datetime(data['FLIGHTDATE'].astype(str) + data[col],
                               format='%Y%m%d%H:%M:%S')
data[['CRSARRTIME', 'ARRTIME', 'DEPTIME']].head()
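The end-to-end effect of these steps can be illustrated on a tiny frame of hypothetical values (the inline lambda plays the role of getTime, minus the 2400 special case):

```python
import pandas as pd

# Hypothetical flight-date and HHMM departure-time values
demo = pd.DataFrame({'FLIGHTDATE': [20120115, 20120115, 20120116],
                     'DEPTIME': ['930.0', '1455.0', None]})
# Strip float artefacts, normalise to HH:MM:SS, combine with the date
demo['DEPTIME'] = demo['DEPTIME'].str.replace(r'\.0$', '', regex=True)
demo['DEPTIME'] = demo['DEPTIME'].map(
    lambda t: t.zfill(4)[:2] + ':' + t.zfill(4)[2:] + ':00',
    na_action='ignore')
demo['DEPTIME'] = pd.to_datetime(demo['FLIGHTDATE'].astype(str) + demo['DEPTIME'],
                                 format='%Y%m%d%H:%M:%S')
```

Missing values survive each step as NaN and end up as NaT in the final datetime column.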
data['DISTANCE'] = data['DISTANCE'].str.replace('miles', '').str.strip()
data['DISTANCE'] = pd.to_numeric(data['DISTANCE'], errors='coerce')
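The same coercion pattern can be seen on a small hypothetical series: the unit suffix is stripped, and anything non-numeric becomes NaN rather than raising an error.

```python
import pandas as pd

# Hypothetical raw distance strings, not taken from the dataset
raw = pd.Series(['328 miles', '1024 miles', 'unknown'])
distance = pd.to_numeric(raw.str.replace('miles', '').str.strip(),
                         errors='coerce')
```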
# Standardise the boolean-like strings ('True'/'T'/'False'/'F') to '1'/'0'
bool_map = {'True': '1', 'T': '1', 'False': '0', 'F': '0'}
data['DIVERTED'] = data['DIVERTED'].replace(bool_map)
data['CANCELLED'] = data['CANCELLED'].replace(bool_map)
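On a small hypothetical series, normalising such flags with a single dictionary-based `Series.replace` looks like this (values already in '0'/'1' form pass through unchanged):

```python
import pandas as pd

# Hypothetical mix of boolean-like strings
flags = pd.Series(['True', 'T', 'False', 'F', '1', '0'])
bool_map = {'True': '1', 'T': '1', 'False': '0', 'F': '0'}
flags = flags.replace(bool_map)
```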
data['FLIGHTDATE'] = pd.to_datetime(data['FLIGHTDATE'], format='%Y%m%d')
df3 = data.head()
CreateTableHTML('./figures/',df3,'table3')
To begin with it looked like this.
The most straightforward options for dealing with missing data in this instance are:
The third option is definitely useful in this instance. For example, we can see below that there are many rows in which ORIGINSTATENAME is not defined.
However, ORIGINSTATENAME is determined by ORIGINAIRPORTCODE, since each airport lies in exactly one state. We can first look at the unique combinations of ORIGINAIRPORTCODE and ORIGINSTATENAME where both are known to confirm this mapping, then take the unique airport codes among the rows where ORIGINSTATENAME is missing, and finally use a manually built dictionary mapping each of those codes to its state to fill in the missing ORIGINSTATENAME values.
mydf_originstatename = data.groupby(["ORIGINAIRPORTCODE", "ORIGINSTATENAME","ORIGINSTATE"]).size().reset_index().rename(columns={0:'count'})
mydf_missing_originstatename = data.loc[data['ORIGINSTATENAME'].isnull(), 'ORIGINAIRPORTCODE']
mydf_missing_originstatename.unique()
OriginStateNameDict = {
'LAW' : 'Oklahoma',
'MHK' : 'Kansas',
'HYS' : 'Kansas',
'OKC' : 'Oklahoma',
'FOE' : 'Kansas',
'TUL' : 'Oklahoma',
'ICT' : 'Kansas',
'GCK' : 'Kansas'
}
data['ORIGINSTATENAME'] = data['ORIGINSTATENAME'].fillna(data['ORIGINAIRPORTCODE'].map(OriginStateNameDict))
#check all missing values filled
df7 = data[data['ORIGINSTATENAME'].isnull()]
CreateTableHTML('./figures/',df7,'table7')
The same procedure can be used for ORIGINSTATE, except that a dictionary mapping each airport to its two-letter state code is needed.
df8 = data[data['ORIGINSTATE'].isnull()].head()
CreateTableHTML('./figures/',df8,'table8')
# Clean origin state
mydf_missing_originstate = data.loc[data['ORIGINSTATE'].isnull(), 'ORIGINAIRPORTCODE']
mydf_missing_originstate.unique()
OriginStateDict = {
'LAW' : 'OK',
'MHK' : 'KS',
'HYS' : 'KS',
'OKC' : 'OK',
'FOE' : 'KS',
'TUL' : 'OK',
'ICT' : 'KS',
'GCK' : 'KS'
}
data['ORIGINSTATE'] = data['ORIGINSTATE'].fillna(data['ORIGINAIRPORTCODE'].map(OriginStateDict))
#check
df9 = data[data['ORIGINSTATE'].isnull()]
CreateTableHTML('./figures/',df9,'table9')
A similar procedure can be applied to fill missing values in the DESTSTATENAME column, except that, because we are now dealing with the destination, DESTAIRPORTCODE is used to look up the destination states.
# Clean DestStateName
mydf_deststatename = data.groupby(["DESTAIRPORTCODE", "DESTSTATENAME","DESTSTATE"]).size().reset_index().rename(columns={0:'count'})
df10 = mydf_deststatename.head()
CreateTableHTML('./figures/',df10,'table10')
mydf_missing_deststatename = data.loc[data['DESTSTATENAME'].isnull(), 'DESTAIRPORTCODE']
mydf_missing_deststatename.unique()
DestStateNameDict_deststatename_orig = {
'ATL' : "Georgia",
"ORD" : "Illinois",
'BWI' : "Maryland",
'DFW' : "Texas",
"LAS" : "Nevada",
'LIT' : "Arkansas",
'LAX' : "California",
'MCI' : "Missouri",
'IAH' : "Texas",
'HOU' : "Texas",
'MEM' : "Tennessee",
'EWR' : "New Jersey",
'OKC' : "Oklahoma",
'MCO' : "Florida",
'BNA' : "Tennessee",
'MSP' : "Minnesota",
'PHX' : "Arizona",
'MIA' : "Florida",
'PIT' : "Pennsylvania",
'SLC' : "Utah",
'SFO' : "California",
'STL' : "Missouri",
'SEA' : "Washington",
'TUL' : "Oklahoma",
'IAD' : "Virginia",
'ICT' : "Kansas",
'DAL' : "Texas",
'DTW' : "Michigan",
'MDW' : "Illinois",
'CVG' : "Kentucky",
'CLE' : "Ohio",
'DEN' : "Colorado"
}
DestStateNameDict = {
'OKC' : "Oklahoma",
'TUL' : "Oklahoma",
'ICT' : "Kansas",
'LAW' : "Oklahoma",
'MHK' : "Kansas",
'FOE' : "Kansas",
'GCK' : "Kansas",
'HYS' : "Kansas"
}
df11 = data[data['DESTSTATENAME'].isnull()].head()
CreateTableHTML('./figures/',df11,'table11')
data['DESTSTATENAME'] = data['DESTSTATENAME'].fillna(data['DESTAIRPORTCODE'].map(DestStateNameDict))
df12 = data[data['DESTSTATENAME'].isnull()]
CreateTableHTML('./figures/',df12,'table12')
# Clean DEST STATE
mydf_missing_deststate = data.loc[data['DESTSTATE'].isnull(), 'DESTAIRPORTCODE']
mydf_missing_deststate.unique()
DestStateNameDict_deststate_orig = {
'ATL' : "GA",
"ORD" : "IL",
'BWI' : "MD",
'DFW' : "TX",
"LAS" : "NV",
'LIT' : "AR",
'LAX' : "CA",
'MCI' : "MO",
'IAH' : "TX",
'HOU' : "TX",
'MEM' : "TN",
'EWR' : "NJ",
'OKC' : "OK",
'MCO' : "FL",
'BNA' : "TN",
'MSP' : "MN",
'PHX' : "AZ",
'MIA' : "FL",
'PIT' : "PA",
'SLC' : "UT",
'SFO' : "CA",
'STL' : "MO",
'SEA' : "WA",
'TUL' : "OK",
'IAD' : "VA",
'ICT' : "KS",
'DAL' : "TX",
'DTW' : "MI",
'MDW' : "IL",
'CVG' : "KY",
'CLE' : "OH",
'DEN' : "CO"
}
DestStateDict = {
'OKC' : "OK",
'TUL' : "OK",
'ICT' : "KS",
'LAW' : "OK",
'MHK' : "KS",
'FOE' : "KS",
'GCK' : "KS",
'HYS' : "KS"
}
df13 = data[data['DESTSTATE'].isnull()].head()
CreateTableHTML('./figures/',df13,'table13')
data['DESTSTATE'] = data['DESTSTATE'].fillna(data['DESTAIRPORTCODE'].map(DestStateDict))
df14 = data[data['DESTSTATE'].isnull()]
CreateTableHTML('./figures/',df14,'table14')
# Get an idea of how much missing data there is
columns_afterclean = data.columns
percentage_missing_afterclean = data.isnull().sum() * 100 / len(data)
table_percentage_missing_afterclean = pd.DataFrame({'column_name': columns_afterclean,
'percentage_missing': percentage_missing_afterclean})
table_percentage_missing_afterclean
CreateTableHTML('./figures/',table_percentage_missing_afterclean,'table15')
Clearly a lot of missing data remains, but it is now concentrated in the datetime columns. Given that there is a CANCELLED column, and it doesn't make sense for a cancelled flight to have a departure time, let's check whether the data shows this pattern.
#look at time stats for those flights that were cancelled
mydf_cancelled = data[data['CANCELLED'] == "1"]
columns_cancelled = mydf_cancelled.columns
number_missing_cancelled = mydf_cancelled.isnull().sum() #* 100 / len(mydf_cancelled)
table_number_missing_cancelled = pd.DataFrame({'column_name': columns_cancelled,
'Number_missing': number_missing_cancelled})
table_number_missing_cancelled
CreateTableHTML('./figures/',table_number_missing_cancelled,'table16')
So out of 8,160 cancelled flights, a very high proportion are missing values for DEPTIME, DEPDELAY, TAXIOUT, WHEELSOFF, WHEELSON and TAXIIN, and all of them are missing ARRTIME, ARRDELAY and ACTUALELAPSEDTIME.
mydf_diverted = data[data['DIVERTED'] == "1"]
columns_diverted = mydf_diverted.columns
number_missing_diverted = mydf_diverted.isnull().sum() #* 100 / len(mydf_cancelled)
table_number_missing_diverted = pd.DataFrame({'column_name': columns_diverted,
'Number_missing': number_missing_diverted})
table_number_missing_diverted
CreateTableHTML('./figures/',table_number_missing_diverted,'table17')
For the diverted flights (799 in total), every entry is missing ARRDELAY and ACTUALELAPSEDTIME.
not_cancelled = data['CANCELLED'] == "0"
not_diverted = data['DIVERTED'] == "0"
not_diverted_or_cancelled = not_cancelled & not_diverted
mydf_not_diverted_or_cancelled = data[not_diverted_or_cancelled]
columns_leftok = mydf_not_diverted_or_cancelled.columns
number_missing_leftok = mydf_not_diverted_or_cancelled.isnull().sum()
table_number_missing_leftok = pd.DataFrame({'column_name': columns_leftok,
'Number_missing': number_missing_leftok})
table_number_missing_leftok
CreateTableHTML('./figures/',table_number_missing_leftok,'table18')
It is likely that a strategy similar to the one used for the state names will help to reduce the amount of missing data in these columns.
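The fill-from-a-related-column pattern used above generalises to any pair of columns with a known dependency; a minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical data illustrating the fill-from-a-related-column pattern
df = pd.DataFrame({'code':  ['OKC', 'TUL', 'ICT'],
                   'state': ['OK', None, 'KS']})
lookup = {'OKC': 'OK', 'TUL': 'OK', 'ICT': 'KS'}   # assumed mapping
df['state'] = df['state'].fillna(df['code'].map(lookup))
```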