Sequencing costs

As Illumina sequencings are becoming more and more popular, there are various sequencing centers available for sequencing services around the nation. Here at UT-Austin, we have a renowned sequencing center (UT-Austin GSAF) that has been greatly helpful to my own work. However, recently we have had good experience with MD Anderson Science Park Next-Generation Sequencing (NGS) Facility. So in this post, I am going to compare the sequencing prices between these two centers for future reference.

Since the two sequencing centers have their sequencing prices posted online, I will do some web scraping to collect the data, and data cleaning for visualizing the comparisons.

%matplotlib inline

import urllib.request
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sequencing_tools.viz_tools import okabeito_palette

Science park price table

For MD Anderson Science park price, I will download the data from their website.

mda = 'https://www.mdanderson.org/research/research-resources/core-facilities/next-generation-sequencing-core/services.html'
mda_html = urllib.request.urlopen(mda)
soup = BeautifulSoup(mda_html, 'lxml')
tabs = soup.find_all('table')
table = tabs[-1]
mda_df = pd.read_html(str(table), 
                      flavor='bs4',
                     header=0,
                     index_col = 0)[0]
mda_df

	NGS User Group Member Price (per lane)	MDACC Faculty, UT System, BCM w/ MOU Price (per lane)	*External Out-of-Network Price (per lane)
Service
HiSeq 3000	NaN	NaN	NaN
50 bp, single read	$966.41	$1,100.71	$2,177.47
75 bp, paired end	$1,746.41	$1,880.71	$3,425.47
100 bp, paired end	$2,089.61	$2,223.91	$3,974.59
150 bp, paired end	$2,431.25	$2,565.55	$4,521.22
NextSeq 500	NaN	NaN	NaN
High Output:	NaN	NaN	NaN
75 bp, single read	$1,632.95	$1,853.96	$3,238.11
75 bp, paired end	$3,085.83	$3,306.84	$5,562.72
150 bp, paired end	$4,904.79	$5,125.80	$8,473.06
Mid Output:	NaN	NaN	NaN
75 bp, paired end	$1,232.55	$1,453.56	$2,597.47
150 bp, paired end	$1,941.83	$2,162.84	$3,732.32
MiSeq	NaN	NaN	NaN
50 bp, single read (v2 chemistry)	$948.99	$1,132.53	$1,358.78
150 bp, paired end (v2 chemistry)	$1,202.19	$1,481.73	$1,707.98
300 bp, single read (v2 chemistry)	$1,202.19	$1,481.73	$1,707.98
300 bp, paired end (v3 chemistry)	$1,782.99	$2,177.73	$2,403.98

Now, I need to clean up the table to make it more easily to manipulate. I will need a function to assign the platform (sequencing machines) for each run type (X bp, paired|single end). Then, I will only look at the member price, since that’s what we pay for.

def clean_mda_index(idx, data_col):
    platform = ''
    clear = False
    new_idx = []
    for i, dc in zip(idx,data_col):
        if pd.isnull(dc):
            if clear:
                platform = ''
                clear=False
            platform = platform + i + '_'
            new_idx.append(platform)
        else:
            new_idx.append(platform + i)
            clear = True
    return new_idx

mda_price_df = mda_df \
    .assign(seq_type = lambda d: clean_mda_index(d.index, d.iloc[:,0])) \
    .assign(seq_type = lambda d: np.where(d.seq_type.str.contains('Mid Output'),
                                         'NextSeq 500_' + d.seq_type,
                                         d.seq_type))\
    .reset_index() \
    .drop('Service', axis=1) \
    .pipe(lambda d: d[~pd.isnull(d.iloc[:,1])]) \
    .assign(platform = lambda d: d.seq_type.str.split('_', expand=True).iloc[:,0]) \
    .assign(basepair = lambda d: d.seq_type.str.extract('_([0-9]+)', expand=False).astype(int)) \
    .assign(ends = lambda d: d.seq_type.str.extract('(single|paired)', expand=False)) \
    .pipe(lambda d: d[~d.seq_type.str.contains('Mid')]) \
    .assign(md_price = lambda d: d.iloc[:,0].str.replace('[,$]','').astype(float))\
    .assign(machine = lambda d: d.platform.str.replace(' [0-9]+','')) \
    .assign(total_base = lambda d: np.where(d.ends == "paired", d.basepair * 2, d.basepair))\
    .assign(ends = lambda d: np.where(d.machine=="MiSeq", 'single', d.ends))\
    .filter(regex = 'platform|total|ends|md_price|machine')  \
    .drop_duplicates()
mda_price_df

	platform	ends	md_price	machine	total_base
1	HiSeq 3000	single	966.41	HiSeq	50
2	HiSeq 3000	paired	1746.41	HiSeq	150
3	HiSeq 3000	paired	2089.61	HiSeq	200
4	HiSeq 3000	paired	2431.25	HiSeq	300
7	NextSeq 500	single	1632.95	NextSeq	75
8	NextSeq 500	paired	3085.83	NextSeq	150
9	NextSeq 500	paired	4904.79	NextSeq	300
14	MiSeq	single	948.99	MiSeq	50
15	MiSeq	single	1202.19	MiSeq	300
17	MiSeq	single	1782.99	MiSeq	600

UT GSAF price table#

For UT GSAF price, we will download the table from UT GSAF website.

gsaf = 'https://wikis.utexas.edu/display/GSAF/Library+Prep+and+NGS+Pricing+Descriptions'
gsaf_html = urllib.request.urlopen(gsaf)
soup = BeautifulSoup(gsaf_html, 'lxml')
tabs = soup.find_all('table')
gsaf_df = pd.read_html(tabs[0].prettify(), 
             flavor='bs4',
             index_col = None,
            header=0)[0] 
gsaf_df

	Platform	Run Type	Internal / UT	External Academic	External Commercial
0	HiSeq 2500	SR 50	$1052	$1331	$1357
1	HiSeq 2500	SR 100	$1428	$1806	$1842
2	HiSeq 2500	PE 125	$2520	$3187	$3250
3	HiSeq 4000	SR 50 (1)	$ 1,043	$ 1,319	$1,345
4	HiSeq 4000	PE 150 (1)	$ 2,562	$ 3,239	$ 3,266
5	HiSeq 4000	SR 50	$ 1,697	$ 2,146	$ 2,173
6	HiSeq 4000	PE 75	$ 14,494	$ 18,327	$ 18,531
7	NextSeq 500	SR 75 H.O (1,3)	$ 2,302	$ 2,906	$ 3,060
8	NextSeq 500	PE 75 H.O . (1,3)	$ 3,826	$ 4,834	$ 4,988
9	NextSeq 500	PE 150 H.O.	$ 5,735	$ 7,250	$ 7,404
10	MiSeq	V2 - 300 cycles	$ 1,627	$ 2,054	$ 2,157
11	MiSeq	V2 - 500 cycles (1)	$ 1,771	$ 2,236	$ 2,339
12	MiSeq	V3 - 150 cycles	$ 1,463	$ 1,847	$ 1,950
13	MiSeq	V3 - 600 cycles (1)	$ 2,183	$ 2,758	$ 2,860

The resulting table is much more easier to be interpreted, since it is in a clean format. Again, I will only look at the Internal prices.

gsaf_price_df = gsaf_df\
    .iloc[:, :3] \
    .assign(machine = lambda d: d.Platform.str.replace(' [0-9]+','')) \
    .assign(basepair = lambda d: d['Run Type'].str.extract(' ([0-9]+)', expand=False).astype(int)) \
    .assign(ends = lambda d: np.where(d['Run Type'].str.contains('PE'), 'paired', 'single')) \
    .assign(total_base = lambda d: np.where(d.ends == "paired", 
                                      d.basepair * 2, 
                                      d.basepair))\
    .assign(gsaf_price = lambda d: d['Internal / UT'].str.replace('[$.,]','').astype(float)) \
    .filter(regex = 'total|price|machine|ends|Platform')
gsaf_price_df

	Platform	machine	ends	total_base	gsaf_price
0	HiSeq 2500	HiSeq	single	50	1052.0
1	HiSeq 2500	HiSeq	single	100	1428.0
2	HiSeq 2500	HiSeq	paired	250	2520.0
3	HiSeq 4000	HiSeq	single	50	1043.0
4	HiSeq 4000	HiSeq	paired	300	2562.0
5	HiSeq 4000	HiSeq	single	50	1697.0
6	HiSeq 4000	HiSeq	paired	150	14494.0
7	NextSeq 500	NextSeq	single	75	2302.0
8	NextSeq 500	NextSeq	paired	150	3826.0
9	NextSeq 500	NextSeq	paired	300	5735.0
10	MiSeq	MiSeq	single	300	1627.0
11	MiSeq	MiSeq	single	500	1771.0
12	MiSeq	MiSeq	single	150	1463.0
13	MiSeq	MiSeq	single	600	2183.0

I notice there’s a duplicated price of HiSeq 4000 single end 50-nt (1043 USD vs 1697.0 USD), so I will take the lowest one, and merge with the science park prices.

merge_df = gsaf_price_df\
    .groupby(['Platform','machine','ends','total_base'], as_index=False)\
    .agg({'gsaf_price':np.min})\
    .merge(mda_price_df, how ='outer')
merge_df

	Platform	machine	ends	total_base	gsaf_price	platform	md_price
0	HiSeq 2500	HiSeq	paired	250	2520.0	NaN	NaN
1	HiSeq 2500	HiSeq	single	50	1052.0	HiSeq 3000	966.41
2	HiSeq 4000	HiSeq	single	50	1043.0	HiSeq 3000	966.41
3	HiSeq 2500	HiSeq	single	100	1428.0	NaN	NaN
4	HiSeq 4000	HiSeq	paired	150	14494.0	HiSeq 3000	1746.41
5	HiSeq 4000	HiSeq	paired	300	2562.0	HiSeq 3000	2431.25
6	MiSeq	MiSeq	single	150	1463.0	NaN	NaN
7	MiSeq	MiSeq	single	300	1627.0	MiSeq	1202.19
8	MiSeq	MiSeq	single	500	1771.0	NaN	NaN
9	MiSeq	MiSeq	single	600	2183.0	MiSeq	1782.99
10	NextSeq 500	NextSeq	paired	150	3826.0	NextSeq 500	3085.83
11	NextSeq 500	NextSeq	paired	300	5735.0	NextSeq 500	4904.79
12	NextSeq 500	NextSeq	single	75	2302.0	NextSeq 500	1632.95
13	NaN	HiSeq	paired	200	NaN	HiSeq 3000	2089.61
14	NaN	MiSeq	single	50	NaN	MiSeq	948.99

Let’s see if prices from both centers are correlated.

plt.rc('xtick', labelsize=15)
plt.rc('ytick', labelsize=15)
plt.rc('axes',labelsize=15)
p = sns.FacetGrid(data = merge_df,
             hue = 'machine',
             size = 5)
p.map(plt.scatter, 'md_price','gsaf_price')
p.add_legend(title = '')
p.set(xlabel = 'MD Anderson Science park\nsequencing cost (USD)',
     ylabel = 'UT-Austin GSAF\nsequencing cost (USD)')

# label outlier
for i, row in merge_df.iterrows():
    if row['gsaf_price'] > 10000:
        label = row['machine'] + ' ' + str(row['total_base']) + ' bp'
        p.fig.axes[0].text(row['md_price'] + 200,
                           row['gsaf_price'],
                           label)

p.fig.axes[0].plot(range(5000), color = 'black')

[<matplotlib.lines.Line2D at 0x11cd434a8>]

png

We can see that sequencing prices from both centers are proportion to each other, with NextSeq runs tend to be of higher cost per sequenced base.

I noticed there was a HiSeq run being an outlier in the plot (~14000 USD at UT-Austin GSAF vs ~2000 USD at MDA Smithville), it is possibly a mis-labeling of a full run (8 lanes for HiSeq) price instead of a per-lane-price. For the next analysis, I will assume that’s the case.

I am interested at the head-to-head comparisons between the two sequencing centers for different sequencing run types,

merge_df = gsaf_price_df\
    .groupby(['Platform','machine','ends','total_base'], as_index=False)\
    .agg({'gsaf_price':np.min})\
    .merge(mda_price_df, how ='inner') \
    .filter(regex = 'price|machine|ends|base')\
    .pipe(pd.melt, id_vars = ['machine','ends','total_base'],
                 var_name = 'center',
                 value_name = 'price')\
    .assign(center = lambda d: np.where(d['center'].str.contains('gsaf'),
                                        'UT-Austin GSAF',
                                        'MD Anderson Science park')) \
    .assign(price = lambda d: np.where(d.price > 10000, d.price / 8, d.price))

p = sns.FacetGrid(data = merge_df,
             col = 'machine',
             sharex=False,
             size = 5)
p.map(sns.barplot, 'total_base',
                  'price',
                  'center',
                  palette = okabeito_palette(),
                  hue_order = ['MD Anderson Science park',
                              'UT-Austin GSAF'])
p.add_legend(bbox_to_anchor=(0.5,0.8),
            fontsize=12)
p.set(xlabel = '',
     ylabel = '')
p.set_titles('{col_name}')
p.fig.text(0.4, 0, 'Sequencing cycle (nt)', fontsize = 15)
p.fig.text(0, 0.8, 'Per-lane-price (USD)', fontsize = 15, rotation=90)

png

From the figure, it looks like UT Austin GSAF has a higher price for every sequencing type.

For the comparison in HiSeq runs, GSAF is running HiSeq 4000 while MD Anderson Science Park is running HiSeq 3000, so that maybe one of the reason. For the other ones, seems like the machines are the same.

This work is licensed under a Creative Commons Attribution 4.0 International License. If you liked this post, you can share it with your followers or follow me on Twitter!

Douglas C. Wu

Sequencing costs

Science park price table

UT GSAF price table#