Analysis of Antarctica sequencing samples contaminated with SARS-CoV-2

Overview

Analysis of SARS-CoV-2 reads in sequencing of 2018-2019 Antarctica samples in PRJNA692319

The samples analyzed here are described in this preprint, which is a pre-print by Istvan Csabai and co-workers that describes SARS-CoV-2 reads in samples from Antarctica sequencing in China. I was originally alerted to the pre-print by Carl Zimmer on Dec-23-2021. Istvan Csabai and coworkers subsequently posted a second pre-print that also analyzes the host reads.

Repeating key parts of the analysis

The code in this repo independently repeats some of the analyses.

To run the analysis, build the conda environment in environment.yml and then run the analysis using Snakefile. To do this on the Hutch cluster, using run.bash:

sbatch -c 16 run.bash

The results are placed in the ./results/ subdirectory. Most of the results files are not tracked due to file-size limitations, but the following key files are tracked:

  • results/alignment_counts.csv gives the number of reads aligning to SARS-CoV-2 for each sample. This confirms that three accessions (SRR13441704, SRR13441705, and SRR13441708) have most of the SARS-CoV-2 reads, although a few other samples also have some.
  • results/variant_analysis.csv reports all variants found in the samples relative to Wuhan-Hu-1.
  • results/variant_analysis_to_outgroup.csv reports the variants found in the samples that represent mutations from Wuhan-Hu-1 towards the two closest bat coronavirus relatives, RaTG13 and BANAL-20-52. Note that some of the reads contain three key mutations relative to Wuhan-Hu-1 (C8782T, C18060T, and T28144C) that move the sequence closer to the bat coronavirus relatives. These mutations define one of the two plausible progenitors for all currently known human SARS-CoV-2 sequences (see Kumar et al (2021) and Bloom (2021)).

Archived links after initially hearing about pre-print

I archived the following links on Dec-23-2021 after hearing about the pre-print from Carl Zimmer:

Deletion of some samples from SRA

On Jan-3-2022, I received an e-mail one of the pre-print authors, Istvan Csabai, saying that three of the samples (appearing to be the ones with the most SARS-CoV-2 reads) had been removed from the SRA. He also noted that bioRxiv had refused to publish their pre-print without explanation; the file he attached indicates the submission ID was BIORXIV-2021-472446v1. I confirmed that three of the accessions had indeed been removed from the SRA as shown in the following archived links:

I also e-mailed Richard Sever at bioRxiv to ask why the pre-print was rejected, and explained I had repeated and validated the key findings. Richard Sever said he could not give details about the pre-print review process, but that in the future the authors could appeal if they thought the rejection was unfounded.

Details from Istvan Csabai

On Jan-4-2022, I chatted with Istvan Csabai. He had contacted the authors of the pre-print, and shared their reply to him. The authors had prepped the samples in early 2019, and submitted to Sangon BioTech for sequencing in December, getting the results back in early January.

Second pre-print from Csabai and restoration of deleted files

Istvan Csabai then worked on a second pre-print that analyzed host reads and made various findings, including co-contamination with African green monkey (Vero?) and human DNA. He sent me pre-print drafts on Jan-16-2022 and on Jan-24-2022, and I provided comments on both drafts and agreed to be listed in the Acknowledgments.

On Feb-3-2022, Istvan Csabai told me that the second pre-print had also been rejected from bioRxiv. Because I had previously contacted Richard Sever when I heard the first pre-print was rejected, I suggested Istvan could CC me on an e-mail to Richard Sever appealing the rejection, which he did. Unfortunately, Richard Sever declined the appeal, so instead Istvan posted the pre-print on Resarch Square.

At that point on Feb-3-2022, I also re-checked the three deletion accessions (SRR13441704, SRR13441705, and SRR13441708). To my surprise, all three were now again available by public access. Here are archived links demonstrating that they were again available:

I confirmed that the replaced accessions were identical to the deleted ones.

Inquiry to authors of PRJNA692319

On Feb-8-2022, I e-mailed the Chinese authors of the paper to ask about the sample deletion and restoration. They e-mailed back almost immediately. They confirmed what they had told Istvan: they had sequenced the samples with Sangon Biotech (Shanghai) after extracting the DNA in December 2019 from their samples. The suspect that contamination of the samples happened at Sangon Biotech. They deleted the three most contaminated samples from the Sequence Read Archive. They do not know why the samples were then "un-deleted."

Owner
Jesse Bloom
I research the evolution of viruses and proteins.
Jesse Bloom
LLVM-based compiler for LightGBM gradient-boosted trees. Speeds up prediction by ≥10x.

LLVM-based compiler for LightGBM gradient-boosted trees. Speeds up prediction by ≥10x.

Simon Boehm 183 Jan 02, 2023
Local trajectory planner based on a multilayer graph framework for autonomous race vehicles.

Graph-Based Local Trajectory Planner The graph-based local trajectory planner is python-based and comes with open interfaces as well as debug, visuali

TUM - Institute of Automotive Technology 160 Jan 04, 2023
Process JSON files for neural recording sessions using Medtronic's BrainSense Percept PC neurostimulator

percept_processing This code processes JSON files for streamed neural data using Medtronic's Percept PC neurostimulator with BrainSense Technology for

Maria Olaru 3 Jun 06, 2022
Unet network with mean teacher for altrasound image segmentation

Unet network with mean teacher for altrasound image segmentation

5 Nov 21, 2022
pix2pix in tensorflow.js

pix2pix in tensorflow.js This repo is moved to https://github.com/yining1023/pix2pix_tensorflowjs_lite See a live demo here: https://yining1023.github

Yining Shi 47 Oct 04, 2022
PyDEns is a framework for solving Ordinary and Partial Differential Equations (ODEs & PDEs) using neural networks

PyDEns PyDEns is a framework for solving Ordinary and Partial Differential Equations (ODEs & PDEs) using neural networks. With PyDEns one can solve PD

Data Analysis Center 220 Dec 26, 2022
Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk

Annoy Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given quer

Spotify 10.6k Jan 04, 2023
RoFormer_pytorch

PyTorch RoFormer 原版Tensorflow权重(https://github.com/ZhuiyiTechnology/roformer) chinese_roformer_L-12_H-768_A-12.zip (提取码:xy9x) 已经转化为PyTorch权重 chinese_r

yujun 283 Dec 12, 2022
Monify: an Expense tracker Program implemented in a Graphical User Interface that allows users to keep track of their expenses

💳 MONIFY (EXPENSE TRACKER PRO) 💳 Description Monify is an Expense tracker Program implemented in a Graphical User Interface allows users to add inco

Moyosore Weke 1 Dec 14, 2021
Custom studies about block sparse attention.

Block Sparse Attention 研究总结 本人近半年来对Block Sparse Attention(块稀疏注意力)的研究总结(持续更新中)。按时间顺序,主要分为如下三部分: PyTorch 自定义 CUDA 算子——以矩阵乘法为例 基于 Triton 的 Block Sparse A

Chen Kai 2 Jan 09, 2022
Finetune the base 64 px GLIDE-text2im model from OpenAI on your own image-text dataset

Finetune the base 64 px GLIDE-text2im model from OpenAI on your own image-text dataset

Clay Mullis 82 Oct 13, 2022
Official PyTorch implementation of "Evolving Search Space for Neural Architecture Search"

Evolving Search Space for Neural Architecture Search Usage Install all required dependencies in requirements.txt and replace all ..path/..to in the co

Yuanzheng Ci 10 Oct 24, 2022
Face Recognition Attendance Project

Face-Recognition-Attendance-Project In This Project You will learn how to mark attendance using face recognition, Hello Guys This is Gautam Kumar, Thi

Gautam Kumar 1 Dec 03, 2022
RCT-ART is an NLP pipeline built with spaCy for converting clinical trial result sentences into tables through jointly extracting intervention, outcome and outcome measure entities and their relations.

Randomised controlled trial abstract result tabulator RCT-ART is an NLP pipeline built with spaCy for converting clinical trial result sentences into

2 Sep 16, 2022
The official homepage of the COCO-Stuff dataset.

The COCO-Stuff dataset Holger Caesar, Jasper Uijlings, Vittorio Ferrari Welcome to official homepage of the COCO-Stuff [1] dataset. COCO-Stuff augment

Holger Caesar 715 Dec 31, 2022
CSD: Consistency-based Semi-supervised learning for object Detection

CSD: Consistency-based Semi-supervised learning for object Detection (NeurIPS 2019) By Jisoo Jeong, Seungeui Lee, Jee-soo Kim, Nojun Kwak Installation

80 Dec 15, 2022
Official Code For TDEER: An Efficient Translating Decoding Schema for Joint Extraction of Entities and Relations (EMNLP2021)

TDEER 🦌 🦒 Official Code For TDEER: An Efficient Translating Decoding Schema for Joint Extraction of Entities and Relations (EMNLP2021) Overview TDEE

33 Dec 23, 2022
EEGEyeNet is benchmark to evaluate ET prediction based on EEG measurements with an increasing level of difficulty

Introduction EEGEyeNet EEGEyeNet is a benchmark to evaluate ET prediction based on EEG measurements with an increasing level of difficulty. Overview T

Ard Kastrati 23 Dec 22, 2022
Python package for visualizing the loss landscape of parameterized quantum algorithms.

orqviz A Python package for easily visualizing the loss landscape of Variational Quantum Algorithms by Zapata Computing Inc. orqviz provides a collect

Zapata Computing, Inc. 75 Dec 30, 2022
The Environment I built to study Reinforcement Learning + Pokemon Showdown

pokemon-showdown-rl-environment The Environment I built to study Reinforcement Learning + Pokemon Showdown Been a while since I ran this. Think it is

3 Jan 16, 2022