Summary. In this project a messy dataset of airline flights is cleaned with the Python pandas library so that it can be used for further analysis.
Skills used:
The data analysed is from Kaggle [1]. Loading the data with pandas [2], we can see that each row is an individual flight and the 31 columns are variables describing that flight.
There are a total of 1,191,805 entries. There are quite a number of things to notice straight away:
# Get an idea of how much missing data there is
columns = data.columns
percentage_missing = data.isnull().sum() * 100 / len(data)
table_percentage_missing = pd.DataFrame({'column_name': columns,
'percentage_missing': percentage_missing})
table_percentage_missing
CreateTableHTML('./figures/',table_percentage_missing,'table2')
We can see there is quite a lot of missing data across a range of columns. Some of it may be possible to fill in after further investigation, but we will start by addressing a few of the issues already identified.
def getTime(t):
    # Normalise a raw HHMM string (e.g. '930' or '1455') to 'HH:MM:SS'.
    t = t.zfill(4)      # pad short values such as '930' -> '0930'
    if t == "2400":     # 2400 is used for midnight; map it to 0000
        t = "0000"
    t = t + '00'        # append seconds
    return t[:2] + ':' + t[2:4] + ':' + t[4:]
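As a quick sanity check, the normalisation rules can be exercised standalone (the helper is restated here, under a different name, so the snippet runs on its own):

```python
def get_time(t):
    # Pad to four digits, map the 2400 midnight convention to 0000,
    # and format as HH:MM:SS.
    t = t.zfill(4)
    if t == "2400":
        t = "0000"
    return t[:2] + ':' + t[2:] + ':00'

print(get_time("2400"))  # 00:00:00
print(get_time("930"))   # 09:30:00
print(get_time("1455"))  # 14:55:00
```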
# Convert the raw HHMM time columns to full datetimes by combining them
# with FLIGHTDATE. The same four steps apply to each column.
time_cols = ['CRSARRTIME', 'ARRTIME', 'DEPTIME', 'CRSDEPTIME', 'WHEELSOFF', 'WHEELSON']
for col in time_cols:
    data[col] = data[col].apply(lambda x: str(x) if not pd.isnull(x) else x)
    data[col] = data[col].str.replace(r'\.0$', '', regex=True)  # strip float artefacts like '930.0'
    data[col] = data[col].map(getTime, na_action='ignore')
    data[col] = pd.to_datetime(data['FLIGHTDATE'].astype(str) + data[col],
                               format='%Y%m%d%H:%M:%S')
data[['CRSARRTIME', 'ARRTIME', 'DEPTIME']].head()
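The end-to-end effect of these steps can be illustrated on a tiny frame of hypothetical values (the inline lambda plays the role of getTime, minus the 2400 special case):

```python
import pandas as pd

# Hypothetical flight-date and HHMM departure-time values
demo = pd.DataFrame({'FLIGHTDATE': [20120115, 20120115, 20120116],
                     'DEPTIME': ['930.0', '1455.0', None]})
# Strip float artefacts, normalise to HH:MM:SS, combine with the date
demo['DEPTIME'] = demo['DEPTIME'].str.replace(r'\.0$', '', regex=True)
demo['DEPTIME'] = demo['DEPTIME'].map(
    lambda t: t.zfill(4)[:2] + ':' + t.zfill(4)[2:] + ':00',
    na_action='ignore')
demo['DEPTIME'] = pd.to_datetime(demo['FLIGHTDATE'].astype(str) + demo['DEPTIME'],
                                 format='%Y%m%d%H:%M:%S')
```

Missing values survive each step as NaN and end up as NaT in the final datetime column.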
data['DISTANCE'] = data['DISTANCE'].str.replace('miles', '').str.strip()
data['DISTANCE'] = pd.to_numeric(data['DISTANCE'], errors='coerce')
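The same coercion pattern can be seen on a small hypothetical series: the unit suffix is stripped, and anything non-numeric becomes NaN rather than raising an error.

```python
import pandas as pd

# Hypothetical raw distance strings, not taken from the dataset
raw = pd.Series(['328 miles', '1024 miles', 'unknown'])
distance = pd.to_numeric(raw.str.replace('miles', '').str.strip(),
                         errors='coerce')
```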
# Standardise the boolean-like strings ('True'/'T'/'False'/'F') to '1'/'0'
bool_map = {'True': '1', 'T': '1', 'False': '0', 'F': '0'}
data['DIVERTED'] = data['DIVERTED'].replace(bool_map)
data['CANCELLED'] = data['CANCELLED'].replace(bool_map)
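On a small hypothetical series, normalising such flags with a single dictionary-based `Series.replace` looks like this (values already in '0'/'1' form pass through unchanged):

```python
import pandas as pd

# Hypothetical mix of boolean-like strings
flags = pd.Series(['True', 'T', 'False', 'F', '1', '0'])
bool_map = {'True': '1', 'T': '1', 'False': '0', 'F': '0'}
flags = flags.replace(bool_map)
```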
data['FLIGHTDATE'] = pd.to_datetime(data['FLIGHTDATE'], format='%Y%m%d')
df3 = data.head()
CreateTableHTML('./figures/',df3,'table3')
To begin with it looked like this.
The most straightforward options for dealing with missing data in this instance are:
The third option is definitely useful in this instance. For example, we can see below that there are many rows in which ORIGINSTATENAME is not defined.
However, ORIGINSTATENAME is determined by ORIGINAIRPORTCODE, since each airport lies in exactly one state. We can first look at the unique combinations of ORIGINAIRPORTCODE and ORIGINSTATENAME where both are known to confirm this mapping, then take the unique airport codes among the rows where ORIGINSTATENAME is missing, and finally use a manually built dictionary mapping each of those codes to its state to fill in the missing ORIGINSTATENAME values.
mydf_originstatename = data.groupby(["ORIGINAIRPORTCODE", "ORIGINSTATENAME","ORIGINSTATE"]).size().reset_index().rename(columns={0:'count'})
mydf_missing_originstatename = data.loc[data['ORIGINSTATENAME'].isnull(), 'ORIGINAIRPORTCODE']
mydf_missing_originstatename.unique()
OriginStateNameDict = {
'LAW' : 'Oklahoma',
'MHK' : 'Kansas',
'HYS' : 'Kansas',
'OKC' : 'Oklahoma',
'FOE' : 'Kansas',
'TUL' : 'Oklahoma',
'ICT' : 'Kansas',
'GCK' : 'Kansas'
}
data['ORIGINSTATENAME'] = data['ORIGINSTATENAME'].fillna(data['ORIGINAIRPORTCODE'].map(OriginStateNameDict))
#check all missing values filled
df7 = data[data['ORIGINSTATENAME'].isnull()]
CreateTableHTML('./figures/',df7,'table7')
The same procedure can be used for ORIGINSTATE, except that a dictionary mapping each airport to its two-letter state code is needed.
df8 = data[data['ORIGINSTATE'].isnull()].head()
CreateTableHTML('./figures/',df8,'table8')
# Clean origin state
mydf_missing_originstate = data.loc[data['ORIGINSTATE'].isnull(), 'ORIGINAIRPORTCODE']
mydf_missing_originstate.unique()
OriginStateDict = {
'LAW' : 'OK',
'MHK' : 'KS',
'HYS' : 'KS',
'OKC' : 'OK',
'FOE' : 'KS',
'TUL' : 'OK',
'ICT' : 'KS',
'GCK' : 'KS'
}
data['ORIGINSTATE'] = data['ORIGINSTATE'].fillna(data['ORIGINAIRPORTCODE'].map(OriginStateDict))
#check
df9 = data[data['ORIGINSTATE'].isnull()]
CreateTableHTML('./figures/',df9,'table9')
A similar procedure can be applied to fill missing values in the DESTSTATENAME column, except that, because we are now dealing with the destination, DESTAIRPORTCODE is used to look up the destination states.
# Clean DestStateName
mydf_deststatename = data.groupby(["DESTAIRPORTCODE", "DESTSTATENAME","DESTSTATE"]).size().reset_index().rename(columns={0:'count'})
df10 = mydf_deststatename.head()
CreateTableHTML('./figures/',df10,'table10')
mydf_missing_deststatename = data.loc[data['DESTSTATENAME'].isnull(), 'DESTAIRPORTCODE']
mydf_missing_deststatename.unique()
DestStateNameDict_deststatename_orig = {
'ATL' : "Georgia",
"ORD" : "Illinois",
'BWI' : "Maryland",
'DFW' : "Texas",
"LAS" : "Nevada",
'LIT' : "Arkansas",
'LAX' : "California",
'MCI' : "Missouri",
'IAH' : "Texas",
'HOU' : "Texas",
'MEM' : "Tennessee",
'EWR' : "New Jersey",
'OKC' : "Oklahoma",
'MCO' : "Florida",
'BNA' : "Tennessee",
'MSP' : "Minnesota",
'PHX' : "Arizona",
'MIA' : "Florida",
'PIT' : "Pennsylvania",
'SLC' : "Utah",
'SFO' : "California",
'STL' : "Missouri",
'SEA' : "Washington",
'TUL' : "Oklahoma",
'IAD' : "Virginia",
'ICT' : "Kansas",
'DAL' : "Texas",
'DTW' : "Michigan",
'MDW' : "Illinois",
'CVG' : "Kentucky",
'CLE' : "Ohio",
'DEN' : "Colorado"
}
DestStateNameDict = {
'OKC' : "Oklahoma",
'TUL' : "Oklahoma",
'ICT' : "Kansas",
'LAW' : "Oklahoma",
'MHK' : "Kansas",
'FOE' : "Kansas",
'GCK' : "Kansas",
'HYS' : "Kansas"
}
df11 = data[data['DESTSTATENAME'].isnull()].head()
CreateTableHTML('./figures/',df11,'table11')
data['DESTSTATENAME'] = data['DESTSTATENAME'].fillna(data['DESTAIRPORTCODE'].map(DestStateNameDict))
df12 = data[data['DESTSTATENAME'].isnull()]
CreateTableHTML('./figures/',df12,'table12')
# Clean DEST STATE
mydf_missing_deststate = data.loc[data['DESTSTATE'].isnull(), 'DESTAIRPORTCODE']
mydf_missing_deststate.unique()
DestStateNameDict_deststate_orig = {
'ATL' : "GA",
"ORD" : "IL",
'BWI' : "MD",
'DFW' : "TX",
"LAS" : "NV",
'LIT' : "AR",
'LAX' : "CA",
'MCI' : "MO",
'IAH' : "TX",
'HOU' : "TX",
'MEM' : "TN",
'EWR' : "NJ",
'OKC' : "OK",
'MCO' : "FL",
'BNA' : "TN",
'MSP' : "MN",
'PHX' : "AZ",
'MIA' : "FL",
'PIT' : "PA",
'SLC' : "UT",
'SFO' : "CA",
'STL' : "MO",
'SEA' : "WA",
'TUL' : "OK",
'IAD' : "VA",
'ICT' : "KS",
'DAL' : "TX",
'DTW' : "MI",
'MDW' : "IL",
'CVG' : "KY",
'CLE' : "OH",
'DEN' : "CO"
}
DestStateDict = {
'OKC' : "OK",
'TUL' : "OK",
'ICT' : "KS",
'LAW' : "OK",
'MHK' : "KS",
'FOE' : "KS",
'GCK' : "KS",
'HYS' : "KS"
}
df13 = data[data['DESTSTATE'].isnull()].head()
CreateTableHTML('./figures/',df13,'table13')
data['DESTSTATE'] = data['DESTSTATE'].fillna(data['DESTAIRPORTCODE'].map(DestStateDict))
df14 = data[data['DESTSTATE'].isnull()]
CreateTableHTML('./figures/',df14,'table14')
# Get an idea of how much missing data there is
columns_afterclean = data.columns
percentage_missing_afterclean = data.isnull().sum() * 100 / len(data)
table_percentage_missing_afterclean = pd.DataFrame({'column_name': columns_afterclean,
'percentage_missing': percentage_missing_afterclean})
table_percentage_missing_afterclean
CreateTableHTML('./figures/',table_percentage_missing_afterclean,'table15')
Clearly a lot of missing data remains, but it is now concentrated in the datetime columns. Given that there is a CANCELLED column, and it doesn't make sense for a cancelled flight to have a departure time, let's check whether the data shows this pattern.
#look at time stats for those flights that were cancelled
mydf_cancelled = data[data['CANCELLED'] == "1"]
columns_cancelled = mydf_cancelled.columns
number_missing_cancelled = mydf_cancelled.isnull().sum() #* 100 / len(mydf_cancelled)
table_number_missing_cancelled = pd.DataFrame({'column_name': columns_cancelled,
'Number_missing': number_missing_cancelled})
table_number_missing_cancelled
CreateTableHTML('./figures/',table_number_missing_cancelled,'table16')
So out of 8,160 cancelled flights, a very high proportion are missing values for DEPTIME, DEPDELAY, TAXIOUT, WHEELSOFF, WHEELSON and TAXIIN, and all of them are missing ARRTIME, ARRDELAY and ACTUALELAPSEDTIME.
mydf_diverted = data[data['DIVERTED'] == "1"]
columns_diverted = mydf_diverted.columns
number_missing_diverted = mydf_diverted.isnull().sum() #* 100 / len(mydf_cancelled)
table_number_missing_diverted = pd.DataFrame({'column_name': columns_diverted,
'Number_missing': number_missing_diverted})
table_number_missing_diverted
CreateTableHTML('./figures/',table_number_missing_diverted,'table17')
For the diverted flights (799 in total), every entry is missing ARRDELAY and ACTUALELAPSEDTIME.
not_cancelled = data['CANCELLED'] == "0"
not_diverted = data['DIVERTED'] == "0"
not_diverted_or_cancelled = not_cancelled & not_diverted
mydf_not_diverted_or_cancelled = data[not_diverted_or_cancelled]
columns_leftok = mydf_not_diverted_or_cancelled.columns
number_missing_leftok = mydf_not_diverted_or_cancelled.isnull().sum()
table_number_missing_leftok = pd.DataFrame({'column_name': columns_leftok,
'Number_missing': number_missing_leftok})
table_number_missing_leftok
CreateTableHTML('./figures/',table_number_missing_leftok,'table18')
It is likely that a strategy similar to the one used for the state names will help to reduce the amount of missing data in these columns.
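The fill-from-a-related-column pattern used above generalises to any pair of columns with a known dependency; a minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical data illustrating the fill-from-a-related-column pattern
df = pd.DataFrame({'code':  ['OKC', 'TUL', 'ICT'],
                   'state': ['OK', None, 'KS']})
lookup = {'OKC': 'OK', 'TUL': 'OK', 'ICT': 'KS'}   # assumed mapping
df['state'] = df['state'].fillna(df['code'].map(lookup))
```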