Cookies on this website

We use cookies to ensure that we give you the best experience on our website. If you click 'Accept all cookies' we'll assume that you are happy to receive all cookies and you won't see this message again. If you click 'Reject all non-essential cookies' only necessary cookies providing core functionality such as security, network management, and accessibility will be enabled. Click 'Find out more' for information on how to change your cookie settings.

© 2019 IEEE. This study investigates the feasibility of applying state of the art deep learning techniques to detect precancerous stages of squamous cell carcinoma (SCC) cancer in real time to address the challenges while diagnosing SCC with subtle appearance changes as well as video processing speed. Two deep learning models are implemented, which are to determine artefact of video frames and to detect, segment and classify those no-artefact frames respectively. For detection of SCC, both mask-RCNN and YOLOv3 architectures are implemented. In addition, in order to ascertain one bounding box being detected for one region of interest instead of multiple duplicated boxes, a faster non-maxima suppression technique (NMS) is applied on top of predictions. As a result, this developed system can process videos at 16-20 frames per second. Three classes are classified, which are 'suspicious', 'high grade' and 'cancer' of SCC. With the resolution of 1920x1080 pixels of videos, the average processing time while apply YOLOv3 is in the range of 0.064-0.101 seconds per frame, i.e. 10-15 frames per second, while running under Windows 10 operating system with 1 GPU (GeForce GTX 1060). The averaged accuracies for classification and detection are 85% and 74% respectively. Since YOLOv3 only provides bounding boxes, to delineate lesioned regions, mask-RCNN is also evaluated. While better detection result is achieved with 77% accuracy, the classification accuracy is similar to that by YOLOYv3 with 84%. However, the processing speed is more than 10 times slower with an average of 1.2 second per frame due to creation of masks. The accuracy of segmentation by mask-RCNN is 63%. These results are based on the date sets of 350 images. Further improvement is hence in need in the future by collecting, annotating or augmenting more datasets.

Original publication




Conference paper

Publication Date



1606 - 1612