PREDICTION OF BREAST CANCER USING K-NEAREST NEIGHBOUR: A SUPERVISED MACHINE LEARNING ALGORITHM
Author(s)
Pandey S1, Sharma A2, Siddiqui MK2, Singla D3, Vanderpuye-Orgle J4
1Parexel International, Lucknow, UP, India, 2Parexel International, Mohali, PB, India, 3Parexel International, Bangalore, India, 4PAREXEL, Glendale, CA, USA
OBJECTIVES: Mammograms are not 100% accurate in identifying breast cancer, and better methods are needed to predict breast cancer without surgical biopsy. This study evaluated the accuracy of the k-nearest neighbor (k-NN) classifier algorithm in predicting breast cancer. METHODS: The breast cancer dataset (569 records and 32 attributes) was obtained from the University of California Irvine (UCI) Machine Learning Repository. A supervised machine learning technique, k-NN, was applied to patient characteristics including tumor features (radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension) to classify each mass as malignant or benign. The data were split into a training dataset containing the first 469 observations, used to build the k-NN model, and a testing dataset containing the remaining observations, used to simulate new patients. The features were normalized to rescale them to a standard range of values. The initial choice was k = 21, approximately the square root of the 469 patients in the training dataset. Alternative k values (k = 1, 5, 11, 15, 21, 27) were also tested to optimize model performance. The analysis was conducted using the “class” package in R (v3.6.2). RESULTS: In 100 simulations, the k-NN algorithm achieved 98% accuracy; that is, only 2 of 100 masses (2%) were incorrectly classified. k = 21 appeared more accurate than the other values tested, as it produced the fewest incorrect identifications of cancerous masses. CONCLUSIONS: A supervised machine learning algorithm was shown to be capable of tackling extremely complex tasks, such as the identification of cancerous masses, with reasonable accuracy. This type of analysis could be an important resource for the early detection and treatment of cancerous tumors.
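ILLUSTRATIVE SKETCH: The workflow described under METHODS (normalize the features, split the data in order, choose k near the square root of the training size, classify by majority vote) can be sketched roughly as follows. This is not the authors' code: the study used the “class” package in R, whereas this sketch uses plain Python. The function names, the two-feature toy records, and the rule of flooring the square root and bumping even values to the next odd number are all assumptions made for illustration. The toy values are invented and are not taken from the UCI dataset.

```python
import math
from collections import Counter


def min_max_normalize(rows):
    """Rescale every feature column to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def initial_k(n_train):
    """Assumed heuristic: floor of sqrt(n_train), made odd to avoid tied votes."""
    k = math.isqrt(n_train)
    return k if k % 2 == 1 else k + 1


def knn_predict(train_x, train_y, query, k):
    """Label `query` by majority vote among its k nearest training rows."""
    nearest = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical toy records: (radius, texture); "B" = benign, "M" = malignant.
features = [
    [10.0, 12.0], [11.0, 14.0], [12.0, 13.0],   # benign, training
    [18.0, 22.0], [20.0, 25.0], [19.0, 21.0],   # malignant, training
    [11.5, 13.5], [19.5, 23.0],                 # held out as "new patients"
]
labels = ["B", "B", "B", "M", "M", "M", "B", "M"]

# Normalize all rows together, then split in order (first rows train, rest test).
normalized = min_max_normalize(features)
n_train = 6
train_x, test_x = normalized[:n_train], normalized[n_train:]
train_y, test_y = labels[:n_train], labels[n_train:]

# k = 3 suits the six toy training rows; initial_k(469) reproduces the k = 21
# starting point reported for the 469-observation training set.
predictions = [knn_predict(train_x, train_y, q, k=3) for q in test_x]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print(predictions, accuracy, initial_k(469))
```

Comparing alternative k values, as the abstract describes, would amount to repeating the prediction step for each candidate k and counting the misclassified test records.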
Conference/Value in Health Info
2020-05, ISPOR 2020, Orlando, FL, USA
Value in Health, Volume 23, Issue 5, S1 (May 2020)
Acceptance Code
AI3
Topic
Epidemiology & Public Health, Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Disease Classification & Coding
Disease
Oncology