  • Methods: Python
  • Data Analysis: Pandas
  • Data Visualization: Matplotlib, Seaborn
  • Data Modeling: Scikit-Learn, K-Means
  • Web Development: HTML, CSS

Identifying Similar Airline Customers with K-Means Clustering

  • The objective of this exercise is to analyze airline customer data with K-Means and identify groups of similar customers. The coding exercise is provided in the Python Jupyter notebook below.
  • K-Means is a popular unsupervised learning algorithm that groups similar data points into clusters. It can be applied to a wide variety of tasks, such as identifying similar customers, anomaly and fraud detection, image processing, text processing, and many other valuable applications.
  • Each cluster is represented by its centroid, the arithmetic mean of all points assigned to that cluster. On every iteration, the K-Means algorithm assigns each point to its closest centroid and then recomputes each centroid from the points assigned to it. This process repeats until the centroids move by less than a small tolerance or an iteration limit is reached. More information can be found here for Scikit-Learn K-Means and here for a general overview of the K-Means algorithm.
  • The within-cluster sum of squares (WCSS), also known as inertia, measures how compact the clusters are: lower WCSS values indicate tighter clusters. WCSS is calculated for each k value and appended to the wcss list. The elbow method graph (number of clusters versus WCSS) shows a noticeable bend (elbow) in the curve at approximately 3 clusters. The final graph shows the three clusters for Airline Customer AccountAge vs Mileage Balance.
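
The assign-then-update loop and the WCSS calculation described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the notebook's code (the notebook uses Scikit-Learn): the function name kmeans_sketch, the random initialization, the tolerance, and the demo data are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, tol=1e-6, seed=0):
    """Hypothetical helper: plain assign/update K-Means on an (n, d) float array."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points picked at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Assignment step: squared distance from every point to every centroid.
        sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:  # stop once the centroids have effectively stopped moving
            break
    # Final assignment and WCSS: sum of squared distances to the assigned centroid.
    sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dists.argmin(axis=1)
    wcss = sq_dists[np.arange(len(points)), labels].sum()
    return labels, centroids, wcss

# Demo data (made up): three well-separated groups of 10 points each.
demo_rng = np.random.default_rng(1)
demo = np.vstack([
    demo_rng.normal(loc=center, scale=0.5, size=(10, 2))
    for center in ([0.0, 0.0], [10.0, 0.0], [0.0, 10.0])
])
labels, centroids, wcss = kmeans_sketch(demo, k=3)
_, _, wcss_one_cluster = kmeans_sketch(demo, k=1)
print(f"WCSS with k=1: {wcss_one_cluster:.2f}, with k=3: {wcss:.2f}")
```

Scikit-Learn's KMeans adds refinements this sketch leaves out, such as smarter initialization and multiple restarts, so it is the better choice for real data.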
  

Results

-The generated data consists of 20 points. Each point records how many months the customer account has been active (AccountAge) and the customer's current mileage balance in reward points (CurrentMileageBalance).
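
A data set of this shape could be generated as follows. The column names come from the description above, but the value ranges, the uniform distributions, and the seed are assumptions chosen for illustration; the notebook may generate its data differently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the sample is reproducible
n_customers = 20

customers = pd.DataFrame({
    # Assumption: account age between 1 and 120 months.
    "AccountAge": rng.integers(1, 121, size=n_customers),
    # Assumption: mileage balance between 0 and 100,000 reward points.
    "CurrentMileageBalance": rng.integers(0, 100_001, size=n_customers),
})

print(customers.head())
print(customers.describe())
```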

-Histograms are created with seaborn histplot to provide insight into the account age and current mileage balance distributions. The histograms show that the two variables are distributed differently, so the AccountAge variable is also broken down into specific age ranges and displayed on a bar plot.
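
One way to break AccountAge into ranges is pandas cut, sketched below. The sample values, bin edges, and range labels are hypothetical; the write-up does not state which ranges the notebook uses.

```python
import pandas as pd

# Hypothetical account ages in months.
account_age = pd.Series([3, 8, 14, 22, 25, 37, 41, 48, 60, 75], name="AccountAge")

# Hypothetical bin edges; with the default right=True each bin includes its upper edge.
bins = [0, 12, 24, 48, 120]
range_labels = ["0-12", "13-24", "25-48", "49-120"]

age_range = pd.cut(account_age, bins=bins, labels=range_labels)

# Count customers per range, kept in bin order rather than sorted by frequency.
counts = age_range.value_counts().sort_index()
print(counts)
```

The resulting counts can be passed to a bar plot, for example with counts.plot(kind="bar") or seaborn's barplot.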

-The k value is the number of clusters for the K-Means algorithm. The visual elbow method consists of plotting the WCSS value against each k value over a given range. This exercise plotted k values from 1 to 12.
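
The elbow computation can be sketched with Scikit-Learn as shown below. The data is a made-up stand-in for the notebook's data, and the StandardScaler step is an assumption added here because the two features have very different scales; the write-up does not say whether the notebook scales its features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 20 customers, [AccountAge, CurrentMileageBalance].
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(1, 121, size=20),
    rng.integers(0, 100_001, size=20),
]).astype(float)

# Assumption: standardize so the mileage balance does not dominate the distances.
X_scaled = StandardScaler().fit_transform(X)

k_values = range(1, 13)  # k = 1 .. 12, as in the exercise
wcss = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    wcss.append(model.inertia_)  # inertia_ is Scikit-Learn's name for WCSS

for k, value in zip(k_values, wcss):
    print(f"k={k:2d}  WCSS={value:.3f}")
```

Plotting k_values against wcss, for example with matplotlib's plt.plot(k_values, wcss, marker="o"), gives the elbow curve.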

-There is a noticeable elbow in the WCSS curve at k = 3, which suggests that 3 is likely the best number of clusters for this exercise. The centroid coordinates were identified and plotted with the three data clusters on the Airline Customer AccountAge vs Mileage Balance Clusters graph.
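
Fitting the final model with k = 3 and reading back the centroid coordinates might look like the sketch below. As before, the data is a made-up stand-in and the scaling step is an assumption rather than something taken from the notebook; the centroids are converted back to the original units so they can be drawn on an AccountAge vs Mileage Balance plot.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 20 customers, [AccountAge, CurrentMileageBalance].
rng = np.random.default_rng(7)
X = np.column_stack([
    rng.integers(1, 121, size=20),
    rng.integers(0, 100_001, size=20),
]).astype(float)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # assumption: cluster on standardized features

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
labels = model.labels_  # cluster index (0, 1, or 2) for each customer

# cluster_centers_ is in scaled space; map it back to months and reward points.
centroids = scaler.inverse_transform(model.cluster_centers_)

for i, (age, miles) in enumerate(centroids):
    print(f"cluster {i}: AccountAge={age:.1f}, CurrentMileageBalance={miles:.0f}")
```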

-Similar data could be processed with this methodology to identify groups of similar data points as well as anomalies.
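
One simple way to surface anomalies with K-Means is to flag points that lie unusually far from their own centroid. The sketch below is one possible approach, not something the notebook does: the toy data, the planted outlier, and the 95th-percentile cutoff are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (assumption): two tight groups plus one point placed away from both.
rng = np.random.default_rng(3)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(10, 2))
group_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(10, 2))
outlier = np.array([[10.0, 18.0]])
X = np.vstack([group_a, group_b, outlier])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# transform() gives the distance from each point to every centroid;
# the row minimum is the distance to the point's own (closest) centroid.
distances = model.transform(X).min(axis=1)

# Assumption: treat anything beyond the 95th percentile of distances as anomalous.
threshold = np.percentile(distances, 95)
flagged = distances > threshold

print("flagged row indices:", np.flatnonzero(flagged))
```

A percentile cutoff always flags some points, so on real data the threshold should be chosen with domain knowledge rather than taken from this example.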


Contact

Please feel free to reach out through the following platforms: