沈梦圆的博客 怕什么真理无穷, 进一寸有一寸的欢喜。

CheckM:微生物基因组组装/metagenome基因组重构完整度和杂合度评估

2016-12-22

有一种错觉感觉学习生信就是在不断地安装、调试参数,每次碰到一个新软件,多么希望安装顺利些。CheckM 可以用来单菌基因组组装或者Metagenome Binning(重构)的基因组的完整度、杂合度质量评估信息等,以及根据Marker基因鉴定基因组的系统分类。CheckM已被用于meta_seq数据分析流程,用于评估重构的完整度检测。 更多详细的功能描述见:Github Wiki页面。

1

软件的安装

  • 软件官网:https://ecogenomics.github.io/CheckM/

  • 安装依赖:https://github.com/Ecogenomics/CheckM/wiki/Installation

  • HMMER (>=3.1b1) # 该软件的安装包被我删了,网站进不去了,不过我已经安装好啦,直接按照INSTALL进行即可,make install 用到了root权限才可以用的,其实应该用普通用户权限就可以。
  • prodigal (2.60 or >=2.6.1) # 必须使用的是prodigal而不是prodigal.linux;# 本来就安装好啦,应该也不难;
  • pplacer (>=1.1) # guppy是pplacer的一部分也必须添加到环境变量中;pplacer 二进制安装包在pplacer GitHub 页; # 这个也不难,我下载的是linux平台安装包,挺简单的;
# 下载软件
wget https://github.com/Ecogenomics/CheckM/archive/v1.0.7.zip
unzip v1.0.7.zip
cd CheckM-1.0.7
另一种安装方法,官网建议用这种方法安装,这里也是我用anaconda安装的,因为checkm是个python程序;
pip install numpy
install checkm-genome

下载数据包,一直不进行下载,我取消了,进行手动下载。

wget -c https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_v1.0.7.tar.gz
tar xvzf checkm_data_v1.0.7.tar.gz
checkm data setRoot /home/shenmy/biosoft/metabat_data # 然后加载数据包
*******************************************************************************
 [CheckM - data] Check for database updates. [setRoot]
*******************************************************************************
Path [/home/shenmy/biosoft/metabat_data] exists and you have permission to write to this folder.
(re) creating manifest file (please be patient).
You can run 'checkm data update' to ensure you have the latest data files.
Data location successfully changed to: /home/shenmy/biosoft/metabat_data
## 可以正常使用了~
shenmy@bigmem ~/biosoft/metabat_data
$ checkm

                ...::: CheckM v1.0.7 :::...

  Lineage-specific marker set: # 世系特殊标记集
    tree         -> Place bins in the reference genome tree
    tree_qa      -> Assess phylogenetic markers found in each bin
    lineage_set  -> Infer lineage-specific marker sets for each bin

  Taxonomic-specific marker set: # 分类特殊标记集
    taxon_list   -> List available taxonomic-specific marker sets
    taxon_set    -> Generate taxonomic-specific marker set

  Apply marker set to genome bins: # 将标记集应用到宏基因组分箱组中
    analyze      -> Identify marker genes in bins # 识别组中的标记基因
    qa           -> Assess bins for contamination and completeness  # 估计组中的完整度和杂合度

  Common workflows (combines above commands): # 普通工作流程(包含了上述命令)
    lineage_wf   -> Runs tree, lineage_set, analyze, qa
    taxonomy_wf  -> Runs taxon_set, analyze, qa

  Bin QA plots: # 绘制分箱质控图
    bin_qa_plot  -> Bar plot of bin completeness, contamination, and strain heterogeneity # 绘制组的完整度、杂合度和种异质度的柱状图

  Reference distribution plots: # 其他参考分布图
    gc_plot      -> Create GC histogram and delta-GC plot # GC图和delta-GC图
    coding_plot  -> Create coding density (CD) histogram and delta-CD plot  # 编码密度柱状图和delta-CD图
    tetra_plot   -> Create tetranucleotide distance (TD) histogram and delta-TD plot # 绘制四核甘酸距离柱状图和delta-四核甘酸距离图
    dist_plot    -> Create image with GC, CD, and TD distribution plots together # 同时产生上述所有图

  General plots:
    nx_plot      -> Create Nx-plots # 绘制Nx图
    len_plot     -> Cumulative sequence length plot # 累积序列长度图
    len_hist     -> Sequence length histogram # 序列的长度分布图
    marker_plot  -> Plot position of marker genes on sequences # 标记基因的分布图
    par_plot     -> Parallel coordinate plot of GC and coverage # GC和覆盖度的平行坐标图
    gc_bias_plot -> Plot bin coverage as a function of GC # 覆盖度

  Sequence subspace plots:
    cov_pca      -> PCA plot of coverage profiles # 覆盖度的PCA分析图
    tetra_pca    -> PCA plot of tetranucleotide signatures # 四核甘酸的信号的PCA分析图

  Bin exploration and modification: # 组的探索和修饰
    unique       -> Ensure no sequences are assigned to multiple bins # 保证没有序列被多次分到不同的组
    merge        -> Identify bins with complementary sets of marker genes # 识别组的标记基因集的完整性
    bin_compare  -> Compare two sets of bins (e.g., from alternative binning methods) # 比较两个组的特殊标记集(比较多种分箱方法)
    bin_union    -> [Experimental] Merge multiple binning efforts into a single bin set # 合并多个分箱成一个组集
    modify       -> [Experimental] Modify sequences in a bin # 修饰组中的序列
    outliers     -> [Experimental] Identify outlier in bins relative to reference distributions # 识别和参考分布基因组相关的序列

  实用小功能:
    unbinned     -> Identify unbinned sequences # 识别未被分箱的序列 
    coverage     -> Calculate coverage of sequences # 计算序列的覆盖度
    tetra        -> Calculate tetranucleotide signature of sequences # 计算序列四核甘酸的信号
    profile      -> Calculate percentage of reads mapped to each bin # 计算每个分组比对到的reads百分比
    join_tables  -> Join tab-separated value tables containing bin information # 合并表格
    ssu_finder   -> Identify SSU (16S/18S) rRNAs in sequences # 识别序列中的16S/18S

  Use: 'checkm data' to find, download and install database updates

  Use: checkm <command> -h for command specific help

运行以上功能是最好加上-x fa 或–genes -x faa来说明序列的类型

# 测试
shenmy@bigmem ~/biosoft/metabat_data
$ checkm test ~/checkm_test_results

*******************************************************************************
[CheckM - Test] Processing E.coli K12-W3310 to verify operation of CheckM.
*******************************************************************************
[Step 1]: Verifying tree command.
*******************************************************************************
 [CheckM - tree] Placing bins in reference genome tree.
*******************************************************************************
  Identifying marker genes in 1 bins with 1 threads:
    Finished processing 1 of 1 (100.00%) bins.
  Saving HMM info to file.
  Calculating genome statistics for 1 bins with 1 threads:
    Finished processing 1 of 1 (100.00%) bins.
  Extracting marker genes to align.
  Parsing HMM hits to marker genes:
    Finished parsing hits for 1 of 1 (100.00%) bins.
  Extracting 43 HMMs with 1 threads:
    Finished extracting 43 of 43 (100.00%) HMMs.
  Aligning 43 marker genes with 1 threads:
    Finished aligning 43 of 43 (100.00%) marker genes.
  Reading marker alignment files.
  Concatenating alignments.
  Placing 1 bins into the genome tree with pplacer (be patient).
  { Current stage: 0:03:23.877 || Total: 0:03:23.877 }
  [Passed]
[Step 2]: Verifying tree_qa command.
*******************************************************************************
 [CheckM - tree_qa] Assessing phylogenetic markers found in each bin.
*******************************************************************************
  Reading HMM info from file.
  Parsing HMM hits to marker genes:
    Finished parsing hits for 1 of 1 (100.00%) bins.
  QA information written to: /home/shenmy/checkm_test_results/results/tree_qa_test.tsv
  { Current stage: 0:00:01.052 || Total: 0:03:24.930 }
  [Passed]
[Step 3]: Verifying lineage_set command.
*******************************************************************************
 [CheckM - lineage_set] Inferring lineage-specific marker sets.
*******************************************************************************
  Reading HMM info from file.
  Parsing HMM hits to marker genes:
    Finished parsing hits for 1 of 1 (100.00%) bins.
  Determining marker sets for each genome bin.
    Finished processing 1 of 1 (100.00%) bins (current: 637000110).
  Marker set written to: /home/shenmy/checkm_test_results/results/lineage_set_test.tsv
  { Current stage: 0:00:01.153 || Total: 0:03:26.083 }
*******************************************************************************
 [CheckM - lineage_set] Inferring lineage-specific marker sets.
*******************************************************************************
  Reading HMM info from file.
  Parsing HMM hits to marker genes:
    Finished parsing hits for 1 of 1 (100.00%) bins.
  Determining marker sets for each genome bin.
    Finished processing 1 of 1 (100.00%) bins (current: 637000110).
  Marker set written to: /home/shenmy/checkm_test_results/results/lineage_set_test.tsv
  { Current stage: 0:00:01.036 || Total: 0:03:27.119 }
  [Passed]
[Step 4]: Verifying analyze command.
*******************************************************************************
 [CheckM - analyze] Identifying marker genes in bins.
*******************************************************************************
  Identifying marker genes in 1 bins with 1 threads:
    Finished processing 1 of 1 (100.00%) bins.
  Saving HMM info to file.
  { Current stage: 0:06:47.569 || Total: 0:10:14.689 }
  Parsing HMM hits to marker genes:
    Finished parsing hits for 1 of 1 (100.00%) bins.
  Aligning marker genes with multiple hits in a single bin:
    Finished processing 1 of 1 (100.00%) bins.
  { Current stage: 0:00:01.355 || Total: 0:10:16.044 }
  Calculating genome statistics for 1 bins with 1 threads:
    Finished processing 1 of 1 (100.00%) bins.
  { Current stage: 0:00:00.363 || Total: 0:10:16.407 }
  [Passed]
[Step 5]: Verifying qa command.
*******************************************************************************
 [CheckM - qa] Tabulating genome statistics.
*******************************************************************************
  Calculating AAI between multi-copy marker genes.
  Reading HMM info from file.
  Parsing HMM hits to marker genes:
    Finished parsing hits for 1 of 1 (100.00%) bins.
  QA information written to: /home/shenmy/checkm_test_results/results/qa_test.tsv
  { Current stage: 0:00:02.398 || Total: 0:10:18.806 }
  [Passed]
  { Current stage: 0:00:00.017 || Total: 0:10:18.824 }

2


评论