Project 1

Chromosome Image Classification

And image pre-processing, deep learning, and model optimization.

How many chromosomes do humans have? How many pairs are there? Can you tell which one is which? How long would it take you? Click on the image below and try it. It's not easy or simple, is it?

What if I told you that using something called a convolutional neural network, we can identify each chromosome in this karyotype in a split second? Let's walk through how.


Karyotype from Oncohema Key

Last summer, I had the privilege of diagnosing chromosomal abnormalities from client blood samples at Carpermor Laboratory in Mexico City. I eventually learned to identify chromosomes by hand - memorizing all 46! I first started out looking at genes for hours on end. It was a challenge, especially when they involved translocations, deletions, and insertions. Chromosome identification got a lot easier as I learned the ropes of cytogenetics - it's certainly a fundamental skill in the field.

Photo from Thermofisher

Regardless, it is a time-consuming part of the workflow. When you're working on dozens of samples per day, those minutes add up. As I spent more and more time in the lab, I thought about ways that I could optimize this lengthy step. When I came back to the States, I started working on ideas to automate it using what I knew about data science. This introduced me to deep learning and computer vision, which led me to my research question: Is it possible to accurately and efficiently categorize chromosomes using deep learning?


Of course, the application of deep learning in genetics is not new. In 2009, Wang et al. created a two-layer classification system using an artificial neural network with accuracy rates above 90% - grouping chromosomes first into seven classes, and then identifying them within those classes. Then, Zhang et al. (2018) applied deep learning based on convolutional neural networks to achieve an accuracy of 92.5%.


Beyond the realm of deep learning, though, programs that can automatically identify chromosomes from karyotypes have been available. I used one when I worked at the lab. But these programs could only identify normal chromosomes, and abnormal ones still had to be identified by hand. The magic of deep learning is that we can train our network to identify all kinds of chromosomes - ranging from healthy ones to abnormal or even rare ones. As we provide more data from novel genetic signatures, the model will adjust and become more and more accurate over time. This application alone has the potential to be incredibly helpful for pathologists, cytogeneticists, and the field of medicine in general.

I wanted to go a step further and use what I learned from my experience in cytogenetics to inform my deep learning model. This focused on the application of three main concepts which I hope will further increase accuracy and efficiency:


  • First, each chromosome has characteristic banding patterns running across it, and these are read vertically, from top to bottom. This is how humans interpret chromosomes. If we align our chromosomes in a uniform orientation via image pre-processing, we can teach our model to do the same.


  • Second, chromosomes are organized into groups based on their physical features. Categorization using a two-part classification system (first the group letter, then the chromosome number) may improve our model's accuracy.


  • Third, identifying a chromosome (especially two of the same kind) in a normal karyotype significantly decreases the likelihood of that chromosome appearing again. Identifying chromosomes within the context of a karyotype is therefore incredibly helpful, because it lets you rule out the ones you have already found. By extension, an accurate model should do the same.


With these three ideas drawn from cytogenetics, I hope to find a novel way to improve chromosome classification.
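To make the second and third ideas more concrete, here is a minimal sketch (not code from this project). The group table follows the standard Denver grouping used in cytogenetics. Everything else is an assumption for illustration: the function names, the score dictionaries standing in for model outputs, and the simplification that every label may appear at most twice in a normal karyotype.

```python
# Denver groups: chromosomes sorted by size and centromere position.
DENVER_GROUPS = {
    "A": ["1", "2", "3"],
    "B": ["4", "5"],
    "C": ["6", "7", "8", "9", "10", "11", "12", "X"],
    "D": ["13", "14", "15"],
    "E": ["16", "17", "18"],
    "F": ["19", "20"],
    "G": ["21", "22", "Y"],
}


def two_stage_predict(group_scores, member_scores):
    """Pick the best group first, then the best chromosome inside that group.

    group_scores:  dict of group letter -> score (hypothetical stage-1 output)
    member_scores: dict of chromosome label -> score (hypothetical stage-2 output)
    """
    group = max(group_scores, key=group_scores.get)
    label = max(DENVER_GROUPS[group], key=lambda c: member_scores.get(c, 0.0))
    return group, label


def assign_with_context(score_rows, max_copies=2):
    """Greedy, karyotype-aware labelling.

    score_rows holds one dict (label -> score) per chromosome image.
    Each label can be used at most `max_copies` times, so once a pair has
    been found, that label is ruled out for the remaining images.
    """
    candidates = sorted(
        ((score, i, label)
         for i, row in enumerate(score_rows)
         for label, score in row.items()),
        reverse=True,  # most confident guesses are placed first
    )
    used = {}
    result = [None] * len(score_rows)
    for score, i, label in candidates:
        if result[i] is None and used.get(label, 0) < max_copies:
            result[i] = label
            used[label] = used.get(label, 0) + 1
    return result
```

In this sketch, a third image that looks most like chromosome 1 gets its second-best label once two other images have already claimed "1" with higher confidence. A real system would need to relax the cap for abnormal karyotypes such as trisomies.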

A rare form of chromosome 13 called a ring chromosome. Figure from Rodrigues et al. (2014)

Dataset and Methods


The first thing I needed was a set of karyotypes for my model. I was fortunate enough to find a diverse dataset that is publicly available from the University of Passau. It contained 612 karyotypes, which was perfect. Then, I started working on data pre-processing - splitting each karyotype into individual chromosomes. Using Python's OpenCV package (cv2), I converted all images to grayscale and then to binary, and created contours to form bounding rectangles around each chromosome based on its edges in the karyotype image. Then, I wrote a function to automatically tag each chromosome with its chromosome number based on its position in the karyotype image (chromosomes are shown in numerical order in karyotypes). Each chromosome was then placed in the middle of a black rectangular background image.
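The segmentation step can be sketched without any dependencies. This is an illustration, not the project's code: OpenCV functions such as cv2.threshold, cv2.findContours, and cv2.boundingRect would typically do this work, and the plain-Python flood fill below is a stand-in for them. The function names, the threshold of 128, and the simple left-to-right ordering are all assumptions (a real karyotype has several rows to account for).

```python
from collections import deque


def binarize(gray, threshold=128):
    """Inverted binary image: dark chromosome pixels become 1, light background 0."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def bounding_boxes(binary):
    """Return (x, y, width, height) for every 4-connected blob, sorted left to right."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if not binary[y0][x0] or seen[y0][x0]:
                continue
            # Flood fill from this pixel, tracking the extent of the blob.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            min_x = max_x = x0
            min_y = max_y = y0
            while queue:
                x, y = queue.popleft()
                min_x, max_x = min(min_x, x), max(max_x, x)
                min_y, max_y = min(min_y, y), max(max_y, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < w and 0 <= ny < h
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((min_x, min_y, max_x - min_x + 1, max_y - min_y + 1))
    return sorted(boxes)


def center_on_black(binary, box, size):
    """Paste the boxed region into the middle of a size x size black canvas.

    Assumes `size` is at least as large as the box's width and height.
    """
    x, y, w, h = box
    canvas = [[0] * size for _ in range(size)]
    off_x, off_y = (size - w) // 2, (size - h) // 2
    for dy in range(h):
        for dx in range(w):
            canvas[off_y + dy][off_x + dx] = binary[y + dy][x + dx]
    return canvas
```

Once the boxes are sorted by position, tagging each crop with a chromosome number becomes a matter of walking through them in order.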

From left to right: A karyotype image, an inverted binary image with regions of interest highlighted in green, an exported chromosome on a black background.

I then split the total of 13,057 chromosome images into training and validation sets, with 8,712 and 4,345 images in their respective folders. To increase my dataset's robustness and to prevent overfitting, I augmented all images using flips, rotations, and magnification, and added them to my training and validation libraries.

From left to right: The original chromosome image and then four subsequent augmentations using a combination of horizontal flipping, height and width alteration, rotation, and magnification. This helps the model recognize the same chromosome in different ways, making it more robust.
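As a rough illustration of what these augmentations do to the pixel grid, here is a small dependency-free sketch. It is a simplification and not the project's pipeline: the function names are made up for this example, the rotation is a fixed 90 degrees rather than a small random angle, and the zoom only handles integer factors. In practice, Keras's ImageDataGenerator exposes options such as rotation_range, zoom_range, and horizontal_flip for the same purpose.

```python
def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def rotate_90(img):
    """Rotate the image clockwise by 90 degrees."""
    return [list(col) for col in zip(*img[::-1])]


def magnify(img, factor):
    """Nearest-neighbour zoom: repeat every pixel `factor` times in both directions."""
    return [
        [px for px in row for _ in range(factor)]
        for row in img
        for _ in range(factor)
    ]


def augment(img):
    """Return the original image plus three simple variants of it."""
    return [img, flip_horizontal(img), rotate_90(img), magnify(img, 2)]
```

Each variant keeps the same label as the original image, which is what lets the model see one chromosome in several poses.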

Using TensorFlow's machine learning image classification library and Keras, an open-source neural network library, I then optimized a nine-layer convolutional neural network (CNN) to process the chromosomes.


The basic concept behind a CNN is that it is an algorithm that analyzes a visual image based on the arrangement and characteristics of its features. Each layer performs a different function and therefore processes the image in a different way. In a convolution layer, a small group of pixels from the input chromosome image is processed by what's called a neuron, which extracts a feature from that patch and condenses it into a single output. For example, it may detect the edge, band, or centromere of that chromosome. Then, a pooling layer takes this information and reduces it to its most important features. In this step, noise, such as the black background, is reduced, while the features we would like to keep, such as parts of the chromosome, are maintained. A fully connected layer then takes the resulting features as a column vector, learns weights that prioritize recurring features (such as the long shape and central centromere of chromosome one) for each chromosome, and selects the chromosome with the highest weighted value as its output. The model iterates over the dataset many times, in passes called epochs, and adjusts the weights in each layer to optimize the algorithm. A very cool simulation of a CNN can be seen here.
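The three layer types described above can be sketched in a few lines of plain Python. This is a toy forward pass for illustration only, not the nine-layer Keras model: the function names are hypothetical, and the kernel and class weights are hand-picked here, whereas a real network learns them during training.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image; each output value summarizes one patch."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def relu(fmap):
    """Keep positive responses, zero out the rest."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Keep only the strongest response in each size x size block."""
    return [
        [
            max(fmap[y + i][x + j] for i in range(size) for j in range(size))
            for x in range(0, len(fmap[0]) - size + 1, size)
        ]
        for y in range(0, len(fmap) - size + 1, size)
    ]


def dense_argmax(features, class_weights):
    """Weighted sum per class; return the index of the highest-scoring class."""
    scores = [sum(f * w for f, w in zip(features, row)) for row in class_weights]
    return scores.index(max(scores))


# Toy 5x5 image with a vertical edge between the second and third columns.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
edge_kernel = [[-1, 1], [-1, 1]]  # responds to dark-to-bright vertical edges

pooled = max_pool(relu(conv2d(image, edge_kernel)))
features = [v for row in pooled for v in row]  # flatten into a vector
predicted = dense_argmax(features, [[1, 0, 1, 0], [0, 1, 0, 1]])
```

The pooling step is where the flat background drops out: blocks that contain no edge response collapse to zero, while the block containing the edge keeps its strong value.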

A diagram of our model's convolutional neural network.

Baseline Model

One of the first things I did was create a baseline model without any of the three proposed changes I discussed in the background. Even without any vertical alignment, grouping, or karyotype context, the model performed surprisingly well! In 70 iterations, the model achieved a validation accuracy of 92.2% and a training accuracy of 87.7%. That means that it can detect the right chromosome roughly 9 times out of 10, almost instantly. This is already much faster than manual detection, even by a field expert. If I had trained it for longer, its accuracy might have increased further, but I think this was already a promising start.

A figure of our model's training and validation accuracy across epochs.

Method 1: Image Pre-Processing

Next, I started to look for ways to straighten the chromosomes before feeding them into our neural network. It took a lot of for-loops, but basically, I collected an array of the average X coordinates inside each chromosome. Then, I collected the highest, median, and lowest Y coordinates from that array. Finally, I split the image into an upper and lower half about the median point and performed what's called an affine transformation on each half-image. This straightens the image so that the points of interest I collected are vertically aligned. Then, I performed a concatenation, which is a fancy term for putting the images back together. Let's see how it looks.
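The steps above can be sketched as follows. This is a simplified illustration rather than the project's actual code: the function names are hypothetical, and the affine transformation is assumed to be a horizontal shear, applied here by shifting whole rows by a rounded number of pixels. With OpenCV, a call such as cv2.warpAffine would perform the warp on real images.

```python
def row_centers(img):
    """Average X coordinate of the non-zero pixels in each row (None if the row is empty)."""
    centers = []
    for row in img:
        xs = [x for x, px in enumerate(row) if px]
        centers.append(sum(xs) / len(xs) if xs else None)
    return centers


def straighten(img):
    """Shear the upper and lower halves so the top, median, and bottom points line up."""
    centers = row_centers(img)
    ys = [y for y, c in enumerate(centers) if c is not None]
    top, bottom = ys[0], ys[-1]
    mid = ys[len(ys) // 2]        # median row, a rough proxy for the centromere
    target = round(centers[mid])  # column that all three points are moved onto
    width = len(img[0])

    def shear_half(rows, y_end):
        # Slope of the line from the median point to this half's end point.
        slope = 0.0 if y_end == mid else (centers[y_end] - centers[mid]) / (y_end - mid)
        out = []
        for y in rows:
            # Shift the row so the point on that line lands on the target column.
            shift = round(target - (centers[mid] + slope * (y - mid)))
            new_row = [0] * width
            for x, px in enumerate(img[y]):
                if px and 0 <= x + shift < width:
                    new_row[x + shift] = px
            out.append(new_row)
        return out

    upper = shear_half(range(0, mid), top)
    lower = shear_half(range(mid, len(img)), bottom)
    return upper + lower  # concatenation: stack the two halves back together
```

Because each half is corrected along its own line, a chromosome bent at the median point comes out straight, while one with several bends is only partly corrected.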

One chromosome with points of interest marked before and after concatenation.

The same chromosome before and after, with and without points of interest marked.

It works! The median Y coordinate from the array is an easy, albeit somewhat inaccurate, approximation for the chromosome centromere, since not all chromosomes have their centromere in the middle (see telocentric, acrocentric, and metacentric). Finding a better way to calculate the centromere would be a great experiment for the future.


Another concern I had was that the image distortion can make the banding harder to see. Because these chromosome images were cropped directly from karyotype images, their low resolution exacerbates the issue. The chromosomes are far from perfect, but I think most output chromosomes still looked better than they did before.

Two figures showing our model's training and validation accuracy before and after image pre-processing.

Model accuracy improved by 6 percentage points (87.7% to 93.7%)! Interestingly, you can also observe a reduction in accuracy stochasticity, especially once training and validation accuracy reach around 90%. If I were to guess, it probably has something to do with the visual uniformity that vertical alignment gives the chromosomes during pre-processing. Let's see how else we can improve our model.

Method 2: Grouping


The rest of my project is a work-in-progress. I hope to update more soon!

Photo from Oncohema Key