Cleaning Airline Data with Python

Clive Edwards (mail at cedwards.info)


Summary. In this project a messy dataset of airline flights is cleaned so that it can be used for further analysis. The cleaning is done with the Python pandas library.

Skills used:

Dataset Synopsis

The data analysed comes from Kaggle [1]. Loading it with pandas [2] shows that each row is an individual flight and the 31 columns are variables describing that flight.
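The loading step itself is not shown above; a minimal sketch of it, using a small inline stand-in for the Kaggle CSV (whose filename is not given here), looks like this:

```python
import io
import pandas as pd

# In the project the data would be loaded from the downloaded Kaggle file, e.g.
#     data = pd.read_csv('<downloaded file>.csv')
# Here a two-row stand-in illustrates the same inspection steps.
csv = io.StringIO(
    "FLIGHTDATE,ORIGINAIRPORTCODE,DESTAIRPORTCODE,DISTANCE\n"
    "20180101,OKC,DFW,175 miles\n"
    "20180102,TUL,ATL,674 miles\n"
)
data = pd.read_csv(csv)

# Quick structural overview: row/column counts and column dtypes
print(data.shape)
print(data.dtypes)
```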

    There are 1,191,805 entries in total. Several issues stand out straight away:

    Missing Data

    First, the percentage of missing data in each column is calculated.

    # Get an idea of how much missing data there is
    columns = data.columns
    percentage_missing = data.isnull().sum() * 100 / len(data)
    table_percentage_missing = pd.DataFrame({'column_name': columns,
                                     'percentage_missing': percentage_missing})
    table_percentage_missing
    CreateTableHTML('./figures/',table_percentage_missing,'table2')
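On a small example frame (toy values, not from the dataset) the calculation behaves like this:

```python
import pandas as pd

# Two missing values out of four in column 'a', none in column 'b'
df = pd.DataFrame({'a': [1, None, None, 4], 'b': [1, 2, 3, 4]})

# Per-column percentage of missing entries
pct = df.isnull().sum() * 100 / len(df)
print(pct.tolist())   # -> [50.0, 0.0]
```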
    

      We can see there is quite a lot of missing data across a range of columns. Some of it may be recoverable after further investigation, but we will start by sorting out a few of the issues already identified.

      Dealing with preliminary issues

      Changing columns to time format

      CRSDEPTIME, DEPTIME, WHEELSOFF, WHEELSON, CRSARRTIME, and ARRTIME need to be converted to datetime objects.

      def getTime(t):
          """Convert an HHMM string such as '1530' or '730' to 'HH:MM:SS'."""
          if len(t) == 4:
              # 2400 denotes midnight; normalise it to 00:00
              if t == "2400":
                  t = "0000"
              return t[:2] + ':' + t[2:] + ':00'
          elif len(t) == 3:
              # Pad the missing leading zero, e.g. '730' -> '07:30:00'
              t = '0' + t
              return t[:2] + ':' + t[2:] + ':00'
          return None  # unexpected length: leave the value missing
      

      data['CRSARRTIME'] = data['CRSARRTIME'].apply(lambda x: str(x) if not pd.isnull(x) else x)
      data['CRSARRTIME'] = data['CRSARRTIME'].map(getTime, na_action='ignore')
      data['CRSARRTIME'] = pd.to_datetime((data['FLIGHTDATE'].astype(str)+data['CRSARRTIME']).apply(str), format='%Y%m%d%H:%M:%S')
      data['CRSARRTIME'].head()
      

      data['ARRTIME'] = data['ARRTIME'].apply(lambda x: str(x) if not pd.isnull(x) else x)
      data['ARRTIME'] = data['ARRTIME'].str.replace(r'\.0$', '', regex=True)  # strip float artefact, e.g. '915.0' -> '915'
      data['ARRTIME'] = data['ARRTIME'].map(getTime, na_action='ignore')
      data['ARRTIME'] = pd.to_datetime((data['FLIGHTDATE'].astype(str)+data['ARRTIME']).apply(str), format='%Y%m%d%H:%M:%S')
      data['ARRTIME'].head()
      

      data['DEPTIME'] = data['DEPTIME'].apply(lambda x: str(x) if not pd.isnull(x) else x)
      data['DEPTIME'] = data['DEPTIME'].str.replace(r'\.0$', '', regex=True)
      data['DEPTIME'] = data['DEPTIME'].map(getTime, na_action='ignore')
      data['DEPTIME'] = pd.to_datetime((data['FLIGHTDATE'].astype(str)+data['DEPTIME']).apply(str), format='%Y%m%d%H:%M:%S')
      

      data['CRSDEPTIME'] = data['CRSDEPTIME'].apply(lambda x: str(x) if not pd.isnull(x) else x)
      data['CRSDEPTIME'] = data['CRSDEPTIME'].map(getTime, na_action='ignore')
      data['CRSDEPTIME'] = pd.to_datetime((data['FLIGHTDATE'].astype(str)+data['CRSDEPTIME']).apply(str), format='%Y%m%d%H:%M:%S')
      

      data['WHEELSOFF'] = data['WHEELSOFF'].apply(lambda x: str(x) if not pd.isnull(x) else x)
      data['WHEELSOFF'] = data['WHEELSOFF'].str.replace(r'\.0$', '', regex=True)
      data['WHEELSOFF'] = data['WHEELSOFF'].map(getTime, na_action='ignore')
      data['WHEELSOFF'] = pd.to_datetime((data['FLIGHTDATE'].astype(str)+data['WHEELSOFF']).apply(str), format='%Y%m%d%H:%M:%S')
      

      data['WHEELSON'] = data['WHEELSON'].apply(lambda x: str(x) if not pd.isnull(x) else x)
      data['WHEELSON'] = data['WHEELSON'].str.replace(r'\.0$', '', regex=True)
      data['WHEELSON'] = data['WHEELSON'].map(getTime, na_action='ignore')
      data['WHEELSON'] = pd.to_datetime((data['FLIGHTDATE'].astype(str)+data['WHEELSON']).apply(str), format='%Y%m%d%H:%M:%S')
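The six near-identical blocks above could be condensed into a single loop. A sketch on a small stand-in frame (column names as in the dataset, values illustrative):

```python
import pandas as pd

# Stand-in frame mimicking the raw layout
data = pd.DataFrame({
    'FLIGHTDATE': [20180101, 20180102],
    'DEPTIME': ['730.0', '2400'],
    'ARRTIME': ['915.0', '15.0'],
})

def get_time(t):
    """Normalise an HHMM string to 'HH:MM:SS' ('2400' becomes '00:00:00')."""
    t = t.split('.')[0]   # drop a trailing '.0' float artefact
    if t == '2400':
        t = '0000'
    t = t.zfill(4)        # pad missing leading digits, e.g. '730' -> '0730'
    return t[:2] + ':' + t[2:] + ':00'

for col in ['DEPTIME', 'ARRTIME']:
    data[col] = data[col].astype(str).map(get_time, na_action='ignore')
    data[col] = pd.to_datetime(data['FLIGHTDATE'].astype(str) + data[col],
                               format='%Y%m%d%H:%M:%S')

print(data['DEPTIME'].dt.strftime('%H:%M').tolist())   # -> ['07:30', '00:00']
```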
      

      Distance column

      There are two issues here: first the 'miles' suffix needs to be removed, and then the variable needs to be converted to a float.

      data['DISTANCE'] = data['DISTANCE'].str.replace('miles','')
      data['DISTANCE'] = pd.to_numeric(data['DISTANCE'], errors='coerce')
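On standalone values the two steps behave like this (the 'N/A' entry is an illustrative assumption, showing what `errors='coerce'` does with an unparseable value):

```python
import pandas as pd

s = pd.Series(['175 miles', '674 miles', 'N/A'])
s = s.str.replace('miles', '', regex=False)   # drop the unit suffix
s = pd.to_numeric(s, errors='coerce')         # anything non-numeric becomes NaN
print(s.tolist())   # -> [175.0, 674.0, nan]
```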
      

      Changing the diverted/cancelled columns

      Here we want to standardise the values to the strings '0' (no) and '1' (yes).

      # Standardise truthy spellings to '1' and falsy spellings to '0'
      data['DIVERTED'] = data['DIVERTED'].str.replace('False','0')
      data['DIVERTED'] = data['DIVERTED'].str.replace('F','0')
      data['DIVERTED'] = data['DIVERTED'].str.replace('True','T')
      data['DIVERTED'] = data['DIVERTED'].str.replace('T','1')
      
      data['CANCELLED'] = data['CANCELLED'].str.replace('False','0')
      data['CANCELLED'] = data['CANCELLED'].str.replace('F','0')
      data['CANCELLED'] = data['CANCELLED'].str.replace('True','T')
      data['CANCELLED'] = data['CANCELLED'].str.replace('T','1')
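An alternative to the chain of replacements is a single membership test. A sketch, assuming the raw column mixes the spellings 'True'/'T'/'1' and 'False'/'F'/'0':

```python
import pandas as pd

s = pd.Series(['True', 'F', '0', 'T', 'False', '1'])
# Map every truthy spelling to '1' and everything else to '0'
s = s.isin(['True', 'T', '1']).map({True: '1', False: '0'})
print(s.tolist())   # -> ['1', '0', '0', '1', '0', '1']
```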
      

      Flight Date column

      Here we need to convert the column to a datetime object. After these changes the dataframe looks like the following.

      data['FLIGHTDATE'] = pd.to_datetime(data['FLIGHTDATE'], format='%Y%m%d')
      df3 = data.head()
      CreateTableHTML('./figures/',df3,'table3')
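The format conversion on standalone values:

```python
import pandas as pd

# Dates stored as YYYYMMDD strings, parsed with an explicit format
s = pd.Series(['20180101', '20181231'])
s = pd.to_datetime(s, format='%Y%m%d')
print(s.tolist())
```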
      

        To begin with it looked like this.

          Dealing with missing data

          The most straightforward options for dealing with missing data in this instance are to either:

          Cleaning ORIGINSTATENAME

          The third option is definitely useful here. For example, we can see below that there are many rows in which ORIGINSTATENAME is not defined.

            However, ORIGINSTATENAME is determined by ORIGINAIRPORTCODE, since each airport lies in exactly one state. To find the relevant airport codes we can list the unique combinations of ORIGINSTATENAME and ORIGINAIRPORTCODE for rows where both are known. A dictionary then maps each of the ORIGINAIRPORTCODEs returned to its state, and this mapping is used to fill the missing ORIGINSTATENAME values.

            mydf_originstatename = data.groupby(["ORIGINAIRPORTCODE", "ORIGINSTATENAME","ORIGINSTATE"]).size().reset_index().rename(columns={0:'count'})
            

              mydf_nonull_originstatename = data.loc[data['ORIGINSTATENAME'].isnull(), 'ORIGINAIRPORTCODE']
              mydf_nonull_originstatename.unique()
              

              OriginStateNameDict = {
              'LAW' : 'Oklahoma', 
              'MHK' : 'Kansas',
              'HYS' : 'Kansas',
              'OKC' : 'Oklahoma',
              'FOE' : 'Kansas',
              'TUL' : 'Oklahoma',
              'ICT' : 'Kansas',
              'GCK' : 'Kansas'
              }
              

              data['ORIGINSTATENAME'] = data['ORIGINSTATENAME'].fillna(data['ORIGINAIRPORTCODE'].map(OriginStateNameDict))
              

              #check all missing values filled
              df7 = data[data['ORIGINSTATENAME'].isnull()]
              CreateTableHTML('./figures/',df7,'table7')
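The fill pattern used here, shown end-to-end on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    'ORIGINAIRPORTCODE': ['OKC', 'ICT', 'TUL'],
    'ORIGINSTATENAME':   ['Oklahoma', None, None],
})
# Dictionary built from the unique-combinations step
lookup = {'ICT': 'Kansas', 'TUL': 'Oklahoma'}

# fillna only touches the missing entries; existing values are untouched
df['ORIGINSTATENAME'] = df['ORIGINSTATENAME'].fillna(
    df['ORIGINAIRPORTCODE'].map(lookup))
print(df['ORIGINSTATENAME'].tolist())   # -> ['Oklahoma', 'Kansas', 'Oklahoma']
```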
              

                Cleaning ORIGINSTATE

                The same procedure can be used for ORIGINSTATE, except that a dictionary mapping to the two-letter state code is needed.

                df8 = data[data['ORIGINSTATE'].isnull()].head()
                CreateTableHTML('./figures/',df8,'table8')
                

                  # Clean origin state
                  mydf_nonull_originstate = data.loc[data['ORIGINSTATE'].isnull(), 'ORIGINAIRPORTCODE']
                  mydf_nonull_originstate.unique()
                  

                  OriginStateDict = {
                  'LAW' : 'OK', 
                  'MHK' : 'KS',
                  'HYS' : 'KS',
                  'OKC' : 'OK',
                  'FOE' : 'KS',
                  'TUL' : 'OK',
                  'ICT' : 'KS',
                  'GCK' : 'KS'
                  }
                  

                  data['ORIGINSTATE'] = data['ORIGINSTATE'].fillna(data['ORIGINAIRPORTCODE'].map(OriginStateDict))
                  

                  #check
                  df9 = data[data['ORIGINSTATE'].isnull()]
                  CreateTableHTML('./figures/',df9,'table9')
                  

                    Cleaning DESTSTATENAME

                    A similar procedure fills the missing values in the DESTSTATENAME column. This time, because we are dealing with the destination, DESTAIRPORTCODE is used to find the destination states.

                    # Clean DestStateName
                    mydf_deststatename = data.groupby(["DESTAIRPORTCODE", "DESTSTATENAME","DESTSTATE"]).size().reset_index().rename(columns={0:'count'})
                    

                    df10 = mydf_deststatename.head()
                    CreateTableHTML('./figures/',df10,'table10')
                    

                      mydf_nonull_deststatename = data.loc[data['DESTSTATENAME'].isnull(), 'DESTAIRPORTCODE']
                      mydf_nonull_deststatename.unique()
                      

                      DestStateNameDict_deststatename_orig = {
                      'ATL' : "Georgia",
                      "ORD" : "Illinois",
                      'BWI' : "Maryland",
                      'DFW' : "Texas",
                      "LAS" : "Nevada",
                      'LIT' : "Arkansas",
                      'LAX' : "California",
                      'MCI' : "Missouri",
                      'IAH' : "Texas",
                      'HOU' : "Texas",
                      'MEM' : "Tennessee",
                      'EWR' : "New Jersey",
                      'OKC' : "Oklahoma",
                      'MCO' : "Florida",
                      'BNA' : "Tennessee",
                      'MSP' : "Minnesota",
                      'PHX' : "Arizona",
                      'MIA' : "Florida",
                      'PIT' : "Pennsylvania",
                      'SLC' : "Utah",
                      'SFO' : "California",
                      'STL' : "Missouri",
                      'SEA' : "Washington",
                      'TUL' : "Oklahoma",
                      'IAD' : "Virginia",
                      'ICT' : "Kansas",
                      'DAL' : "Texas",
                      'DTW' : "Michigan",
                      'MDW' : "Illinois",
                      'CVG' : "Kentucky",
                      'CLE' : "Ohio",
                      'DEN' : "Colorado"
                      }
                      
                      DestStateNameDict = {
                      'OKC' : "Oklahoma",
                      'TUL' : "Oklahoma",
                      'ICT' : "Kansas",
                      'LAW' : "Oklahoma",    
                      'MHK' : "Kansas",  
                      'FOE' : "Kansas",  
                      'GCK' : "Kansas",      
                      'HYS' : "Kansas"
                      }
                      

                      df11 = data[data['DESTSTATENAME'].isnull()].head()
                      CreateTableHTML('./figures/',df11,'table11')
                      

                        data['DESTSTATENAME'] = data['DESTSTATENAME'].fillna(data['DESTAIRPORTCODE'].map(DestStateNameDict))
                        

                        df12 = data[data['DESTSTATENAME'].isnull()]
                        CreateTableHTML('./figures/',df12,'table12')
                        

                          Cleaning DESTSTATE

                          # Clean DEST STATE
                          
                          mydf_nonull_deststate = data.loc[data['DESTSTATE'].isnull(), 'DESTAIRPORTCODE']
                          mydf_nonull_deststate.unique()
                          

                          DestStateNameDict_deststate_orig = {
                          'ATL' : "GA",
                          "ORD" : "IL",
                          'BWI' : "MD",
                          'DFW' : "TX",
                          "LAS" : "NV",
                          'LIT' : "AR",
                          'LAX' : "CA",
                          'MCI' : "MO",
                          'IAH' : "TX",
                          'HOU' : "TX",
                          'MEM' : "TN",
                          'EWR' : "NJ",
                          'OKC' : "OK",
                          'MCO' : "FL",
                          'BNA' : "TN",
                          'MSP' : "MN",
                          'PHX' : "AZ",
                          'MIA' : "FL",
                          'PIT' : "PA",
                          'SLC' : "UT",
                          'SFO' : "CA",
                          'STL' : "MO",
                          'SEA' : "WA",
                          'TUL' : "OK",
                          'IAD' : "VA",
                          'ICT' : "KS",
                          'DAL' : "TX",
                          'DTW' : "MI",
                          'MDW' : "IL",
                          'CVG' : "KY",
                          'CLE' : "OH",
                          'DEN' : "CO"
                          }
                          
                          DestStateDict = {
                          'OKC' : "OK",
                          'TUL' : "OK",
                          'ICT' : "KS",
                          'LAW' : "OK",    
                          'MHK' : "KS",  
                          'FOE' : "KS",  
                          'GCK' : "KS",      
                          'HYS' : "KS"
                          }
                          

                          df13 = data[data['DESTSTATE'].isnull()].head()
                          CreateTableHTML('./figures/',df13,'table13')
                          

                            data['DESTSTATE'] = data['DESTSTATE'].fillna(data['DESTAIRPORTCODE'].map(DestStateDict))
                            

                            df14 = data[data['DESTSTATE'].isnull()]
                            CreateTableHTML('./figures/',df14,'table14')
                            

                              Missing Data Re-check

                              # Get an idea of how much missing data there is
                              columns_afterclean = data.columns
                              percentage_missing_afterclean = data.isnull().sum() * 100 / len(data)
                              table_percentage_missing_afterclean = pd.DataFrame({'column_name': columns_afterclean,
                                                               'percentage_missing': percentage_missing_afterclean})
                              table_percentage_missing_afterclean
                              
                              CreateTableHTML('./figures/',table_percentage_missing_afterclean,'table15')
                              

                                Clearly a lot of missing data remains, but it is now mostly in the datetime columns. Since there is a CANCELLED column, and it doesn't make sense for a cancelled flight to have a departure time, let's check whether the data shows this pattern.

                                #look at time stats for those flights that were cancelled
                                mydf_cancelled = data[data['CANCELLED'] == "1"] 
                                columns_cancelled = mydf_cancelled.columns
                                number_missing_cancelled = mydf_cancelled.isnull().sum() #* 100 / len(mydf_cancelled)
                                table_number_missing_cancelled = pd.DataFrame({'column_name': columns_cancelled,
                                                                 'Number_missing': number_missing_cancelled})
                                table_number_missing_cancelled
                                
                                CreateTableHTML('./figures/',table_number_missing_cancelled,'table16')
                                

                                  So, of the 8160 cancelled flights, a very high proportion are missing DEPTIME, DEPDELAY, TAXIOUT, WHEELSOFF, WHEELSON, and TAXIIN, and every one of them is missing ARRTIME, ARRDELAY, and ACTUALELAPSEDTIME.

                                  mydf_diverted = data[data['DIVERTED'] == "1"] 
                                  columns_diverted = mydf_diverted.columns
                                  number_missing_diverted = mydf_diverted.isnull().sum()
                                  table_number_missing_diverted = pd.DataFrame({'column_name': columns_diverted,
                                                                   'Number_missing': number_missing_diverted})
                                  table_number_missing_diverted
                                  
                                  CreateTableHTML('./figures/',table_number_missing_diverted,'table17')
                                  

                                    For the diverted flights (799 in total), every entry is missing ARRDELAY and ACTUALELAPSEDTIME.

                                    not_cancelled = data['CANCELLED'] == "0"
                                    not_diverted = data['DIVERTED'] == "0"
                                    not_diverted_or_cancelled = not_cancelled & not_diverted
                                    mydf_neither_diverted_nor_cancelled = data[not_diverted_or_cancelled]
                                    
                                    columns_leftok = mydf_neither_diverted_nor_cancelled.columns
                                    number_missing_leftok = mydf_neither_diverted_nor_cancelled.isnull().sum()
                                    table_number_missing_leftok = pd.DataFrame({'column_name': columns_leftok,
                                                                     'Number_missing': number_missing_leftok})
                                    table_number_missing_leftok
                                    
                                    CreateTableHTML('./figures/',table_number_missing_leftok,'table18')
                                    

                                      A similar strategy to the one used for the state columns is likely to help reduce the amount of missing data in these remaining columns.
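For instance, where both arrival timestamps are present, a missing ARRDELAY could in principle be reconstructed as their difference in minutes. A sketch on toy timestamps (this is a suggestion, not part of the pipeline above):

```python
import pandas as pd

df = pd.DataFrame({
    'CRSARRTIME': pd.to_datetime(['2018-01-01 09:15', '2018-01-01 12:00']),
    'ARRTIME':    pd.to_datetime(['2018-01-01 09:30', '2018-01-01 11:50']),
    'ARRDELAY':   [None, None],
})
# Fill missing arrival delays from the scheduled/actual timestamp pair
delay = (df['ARRTIME'] - df['CRSARRTIME']).dt.total_seconds() / 60
df['ARRDELAY'] = df['ARRDELAY'].fillna(delay)
print(df['ARRDELAY'].tolist())   # -> [15.0, -10.0]
```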

                                      Bibliography

                                      1. M. Metter. Flight Data (messy flight data). Kaggle, 2018.
                                      2. W. McKinney. pandas: Python data analysis library. 2008.

                                      © 2021, Clive Edwards