Kenneth Griffin, PhD, PE - Identifying Similar Airline Customers with K-Means Clustering

Methods: Python

Data Analysis: Pandas

Data Visualization: Matplotlib, Seaborn

Data Modeling: Scikit-Learn, K-Means

Web Development: HTML, CSS

Identifying Similar Airline Customers with K-Means Clustering

The objective of this exercise is to analyze airline customer data with K-Means to identify groups of similar customers. The code is provided in the Python Jupyter notebook below.
K-Means is a popular unsupervised learning algorithm that groups similar data points into clusters. It can be applied to a wide variety of tasks, such as identifying similar customers, anomaly and fraud detection, image processing, and text processing.

here for Scikit-Learn K-Means

here for a general overview of the K-Means algorithm

The within-cluster sum of squares (WCSS), also known as inertia, is calculated for each k value and appended to the wcss list. WCSS measures how compact the clusters are: lower values indicate tighter clusters. The K-Means elbow method graph (number of clusters versus WCSS) shows a noticeable bend (elbow) in the curve at approximately 3 clusters. The final graph shows the three clusters for Airline Customer AccountAge vs Mileage Balance.
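The WCSS loop described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the data here is a randomly generated stand-in (the seed and value ranges are assumptions), while the column names AccountAge and CurrentMileageBalance and the k range of 1 to 12 follow the write-up.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in data: 20 customers with an account age in months
# and a mileage balance. The value ranges and seed are assumptions.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "AccountAge": rng.integers(1, 121, size=20),
    "CurrentMileageBalance": rng.integers(0, 100_001, size=20),
})
features = ["AccountAge", "CurrentMileageBalance"]

# Fit K-Means for k = 1..12 and record the WCSS (inertia) of each fit.
k_values = range(1, 13)
wcss = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(df[features])
    wcss.append(model.inertia_)  # sum of squared distances to nearest centroid

for k, value in zip(k_values, wcss):
    print(f"k={k:2d}  WCSS={value:,.0f}")
```

Plotting `list(k_values)` against `wcss` with Matplotlib gives the elbow curve. With random stand-in data the bend will not necessarily fall at 3 clusters as it does in the notebook.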

Tables and graphs can be found here: Identifying Similar Airline Customers with K-Means Clustering Python Jupyter Notebook

Results

-The generated data consists of 20 points, each recording how many months the customer account has been active (AccountAge) and the current mileage balance in reward points (CurrentMileageBalance).

-Histograms are created with seaborn histplot to provide insight into the account age and current mileage balance distributions. The histograms show that the two variables are distributed differently, so the AccountAge variable is broken down into specific age ranges and displayed on a bar plot.

-The k value is the number of clusters for the K-Means algorithm. The visual elbow method consists of plotting the WCSS value for each k over a given range. This exercise plotted k values from 1 to 12.

-There is a noticeable elbow in the WCSS curve, which suggests k = 3 is likely the best number of clusters for this exercise. The centroid coordinates were identified and plotted with the three data clusters on the Airline Customer AccountAge vs Mileage Balance Clusters graph.

-Similar data could be processed with this methodology to identify groups of similar data points as well as anomalies.
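The final clustering step summarized in the results can be sketched as below. This is a sketch under stated assumptions rather than the notebook's code: the data is a random stand-in, and the `Cluster` and `DistanceToCentroid` column names are hypothetical. The distance column shows one possible way to surface candidate anomalies (points far from their own centroid); the write-up does not say which approach it would use.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in data (value ranges and seed are assumptions).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "AccountAge": rng.integers(1, 121, size=20),
    "CurrentMileageBalance": rng.integers(0, 100_001, size=20),
})
features = ["AccountAge", "CurrentMileageBalance"]

# Fit K-Means with k = 3, the value suggested by the elbow plot.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
df["Cluster"] = kmeans.fit_predict(df[features])

# Centroid coordinates, one row per cluster, in the original feature units.
centroids = pd.DataFrame(kmeans.cluster_centers_, columns=features)
print(centroids)

# Distance from each point to its nearest (assigned) centroid. Unusually
# large values are one possible signal of an anomaly.
df["DistanceToCentroid"] = kmeans.transform(df[features]).min(axis=1)
print(df.sort_values("DistanceToCentroid", ascending=False).head(3))
```

The `centroids` frame can be overlaid on a scatter plot of AccountAge against CurrentMileageBalance, colored by `Cluster`, to reproduce the kind of cluster graph described above.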
