ViennaCCCdb

ViennaCCCdb combines three existing cell-cell communication databases: the Liana Consensus database v0.1.7, CellPhoneDB v4, and a manual selection based on Pavlicev et al. 2017. An interaction that exists in more than one database is included only once.

Methods

Download source data

Extract liana consensus database from https://github.com/saezlab/liana

Install liana

library(liana)
consensus<-select_resource(c('Consensus'))
write.table(consensus, file="liana-db_0.1.12.txt")

Download CellPhoneDB v4.1 from https://github.com/ventolab/cellphonedb-data/releases/tag/v4.1.0

Download additional interactions originally from Pavlicev et al. (2017) and curated by D. Stadtmauer from https://gitlab.com/wandplabs/ligrec-enzymes

Convert to liana format

python scripts/convert_cpdb_to_LianaFormat.py source_databases/cpdb_v4.1.0/ > source_databases/cpdb_lianaformat_4.1.txt

python scripts/convert_customData_to_LianaFormat.py source_databases/interaction_input_CellChatDB.csv > source_databases/customData_lianaformat.txt

Create "raw version" of combined database

This file is the basic input necessary for cell-cell interaction pipelines. Annotations for these interactions are stored in ViennaCCCdb_annotation.csv.

python scripts/createCombinedDatabase.py source_databases/liana-db_0.1.12.txt source_databases/cpdb_lianaformat_4.1.txt source_databases/customData_lianaformat.txt > ViennaCCCdb_raw.csv

Notes

Manual edits to CellPhoneDB

For 'CCL3L1' no uniprot id was given; 'P16619' was assigned manually.
Interactions with 'IFNA*' genes are excluded since no uniprot ids were given in 'protein_input.csv', see #1.
Interactions with 'HLA' genes are excluded since no uniprot ids were given in 'protein_input.csv', possibly because they are manually curated. Manually assigned uniprot ids would have to be consistent with the liana consensus.

viennacccdb's Issues

Upload liana_0.1.9 into source_databases

Genes with more than one name

PTPRC and CD45 refer to the same gene (uniprot id P08575).
In cpdb both names are used in different files; in liana only PTPRC is used.

Does this, or could this, cause problems?
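
One way to check whether the two names cause mismatches would be to map aliases to a single symbol before comparing interactions. The following is a minimal sketch, not code from this repository: the alias table and both function names are hypothetical, and only the PTPRC/CD45 pair comes from the issue above. It also assumes that complex members are joined by underscores.

```python
# Hypothetical alias table; only the CD45 -> PTPRC entry comes from the issue.
ALIASES = {"CD45": "PTPRC"}


def canonical_symbol(symbol: str) -> str:
    """Return the preferred symbol for a gene, or the symbol itself if no alias is known."""
    return ALIASES.get(symbol, symbol)


def canonical_partner(partner: str) -> str:
    """Normalise a single gene or an underscore-joined complex, keeping a 'COMPLEX:' prefix."""
    prefix = "COMPLEX:" if partner.startswith("COMPLEX:") else ""
    members = partner[len(prefix):].split("_")
    return prefix + "_".join(canonical_symbol(member) for member in members)


single = canonical_partner("CD45")
in_complex = canonical_partner("COMPLEX:CD45_GENE1")  # GENE1 is a placeholder
print(single, in_complex)
```

Names that contain underscores for other reasons, such as the cpdb small molecule entries, would need separate handling.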

Processing of similar but not identical interactions

Currently, every interaction that is not identical to one from another database is added. As a result, if one database lists an additional protein in a complex, that entry is added as a separate interaction.
Example:
CPDB) Complex:ProtA_ProtB Complex:ProtC_ProtD
Liana) Complex:ProtA_ProtB Complex:ProtC_ProtD
=> This interaction is only added once to our combined database

CPDB) Complex:ProtA_ProtB Complex:ProtC_ProtD
Liana) Complex:ProtA_ProtB Complex:ProtC_ProtD_ProtE
=> Both interactions are added to the combined database
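
The behaviour described in this issue can be reproduced with a small sketch of exact-match merging. This is an illustration under the assumption that two interactions count as identical only when both partner strings match exactly; `combine` is a hypothetical helper, not the logic of createCombinedDatabase.py.

```python
def combine(*databases):
    """Merge lists of (source, target) pairs, keeping the first copy of each exact match."""
    seen = set()
    combined = []
    for database in databases:
        for source, target in database:
            key = (source, target)
            if key not in seen:
                seen.add(key)
                combined.append(key)
    return combined


cpdb = [("Complex:ProtA_ProtB", "Complex:ProtC_ProtD")]
liana_same = [("Complex:ProtA_ProtB", "Complex:ProtC_ProtD")]
liana_extra = [("Complex:ProtA_ProtB", "Complex:ProtC_ProtD_ProtE")]

identical = combine(cpdb, liana_same)       # same strings in both databases
near_duplicate = combine(cpdb, liana_extra)  # one complex has an extra member
print(len(identical), len(near_duplicate))   # prints: 1 2
```

Treating the second case as one interaction would need a similarity rule, for example a subset test on complex members, and that is a curation decision rather than a purely technical one.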

Conflicting uniprot ids

In a few cases there are conflicting uniprot ids. We should show either only one version or all of them.
Currently, none is given in the ViennaCCCdb_raw.csv file.

Add column indicating the source database(s) of an interaction

Add a column that shows which source database an interaction came from. If an interaction occurs in several databases, the column should list all of them.
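
A possible shape for such a column is a semicolon-separated list of labels, with each label listed at most once per interaction. The sketch below is hypothetical: the function name is made up, the label strings follow the style seen elsewhere in these issues, and the input is toy data with placeholder partners.

```python
from collections import defaultdict


def annotate_sources(labelled_databases):
    """Map each (source, target) pair to a ';'-joined list of the databases containing it."""
    sources = defaultdict(list)
    for label, interactions in labelled_databases.items():
        for pair in interactions:
            # Skip labels already recorded, so a pair listed twice in one
            # database does not produce a repeated label.
            if label not in sources[pair]:
                sources[pair].append(label)
    return {pair: ";".join(labels) for pair, labels in sources.items()}


# Toy input: LigandX/ReceptorY are placeholders, not real database content.
databases = {
    "Liana_v0.1.7": [("POMC", "OPRM1")],
    "Cpdb_v4": [("POMC", "OPRM1"), ("POMC", "OPRM1"), ("LigandX", "ReceptorY")],
}
source_column = annotate_sources(databases)
print(source_column[("POMC", "OPRM1")])
```

Recording each label only once per interaction also avoids repeated entries of the form "Cpdb_v4;Cpdb_v4".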

Small molecule annotation from cpdb

The cpdb small molecule annotation is currently lost, e.g.
Progesterone_byHSD3B1 is only represented by its corresponding uniprot id (P14060)

We could create a 'CPDB_small_molecule' column and save "Progesterone" there.
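
A minimal sketch of how the molecule name might be recovered, assuming that cpdb small molecule entries follow the pattern `<molecule>_by<gene>`. The pattern is inferred from the names quoted in these issues and is not a documented cpdb rule, and `split_small_molecule` is a hypothetical helper.

```python
def split_small_molecule(name: str):
    """Split 'Molecule_byGENE' into (molecule, gene); return (None, name) for other names."""
    molecule, separator, gene = name.rpartition("_by")
    if separator and molecule and gene:
        return molecule, gene
    return None, name


progesterone = split_small_molecule("Progesterone_byHSD3B1")
plain_gene = split_small_molecule("POMC")
print(progesterone, plain_gene)
```

The first element could fill the proposed 'CPDB_small_molecule' column, and it would stay empty for ordinary protein partners.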

IFNA1* in cpdbv4

Currently excluded due to some inconsistent representation in the cpdb.

Should be looked into to rescue the IFNA* interactions

Cellphone db double entries for complexes due to different small molecule

This interaction occurs twice:
COMPLEX:ALOX5_ALOX5AP_LTC4S CYSLTR1 COMPLEX:P09917_P20292_Q16873

In cpdb it is stored for different small molecules, i.e.:
LipoxinA4_byALOX5,P09917,P20292,Q16873,,False,False,True,CHEBI:6498,True,False,,False,True,biosynthesis_enzyme,,FALSE,,
LeukotrieneC4_byLTC4S,P20292,P09917,Q16873,,False,False,True,CHEBI:16978,True,False,,False,True,biosynthesis_enzyme,,FALSE,,

If we only know the genes, we cannot distinguish whether LipoxinA4 or LeukotrieneC4 is present. Both should be represented in the small molecule annotation, but only one interaction should be kept.

Currently, only one interaction is kept; the small molecule annotation is still missing.
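
One way to keep a single interaction while retaining both molecule names is to group entries by their set of uniprot ids, ignoring the order in which the ids are listed. The sketch below uses the two cpdb rows quoted above; the function name and the `_by` splitting are assumptions made for illustration.

```python
from collections import defaultdict

# (complex name, uniprot ids) from the two cpdb rows quoted in this issue.
complexes = [
    ("LipoxinA4_byALOX5", ["P09917", "P20292", "Q16873"]),
    ("LeukotrieneC4_byLTC4S", ["P20292", "P09917", "Q16873"]),
]


def group_by_composition(entries):
    """Collect small molecule names per unordered set of member proteins."""
    grouped = defaultdict(list)
    for name, uniprot_ids in entries:
        molecule = name.split("_by")[0]
        grouped[frozenset(uniprot_ids)].append(molecule)
    return grouped


merged = group_by_composition(complexes)
for members, molecules in merged.items():
    print("_".join(sorted(members)), ";".join(molecules))
# prints: P09917_P20292_Q16873 LipoxinA4;LeukotrieneC4
```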

Cellphone db double entries for complexes

IFNL3 has two different interactions, one with each of the complexes IL28_receptor and Type_III_IFNR.
Both complexes consist of the same two proteins, IFNLR1 and IL10RB, so this results
in a duplicated interaction in the database, i.e.
IFNL3 COMPLEX:IL10RB_IFNLR1 Q8IZI9 COMPLEX:Q08334_Q8IU57

For now the interaction with "Type_III_IFNR" is hardcoded to be ignored in the conversion script 'convert_cpdb_to_LianaFormat.py'

Any better solutions, or is this a mistake in the cpdb database?
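
As an alternative to hardcoding a single complex name, complexes with identical members could be detected generically and reported. This is only a sketch: the function name is hypothetical, the input dictionary is hand-written from the description above, and the third entry is a placeholder.

```python
from collections import defaultdict


def find_synonymous_complexes(complexes):
    """Return groups of complex names that share exactly the same members."""
    by_members = defaultdict(list)
    for name, members in complexes.items():
        by_members[frozenset(members)].append(name)
    return [sorted(names) for names in by_members.values() if len(names) > 1]


complexes = {
    "IL28_receptor": ["IFNLR1", "IL10RB"],
    "Type_III_IFNR": ["IL10RB", "IFNLR1"],
    "PlaceholderComplex": ["GENE1", "GENE2"],  # placeholder entry
}
synonyms = find_synonymous_complexes(complexes)
print(synonyms)
```

A report like this would also show whether other complex pairs in cpdb are affected, which may help decide if the duplication is intended.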

Annotation of interactions

Currently, no additional annotation is stored in the database.

Which annotations do we want?
Where do we get the annotations from?

Merge source databases based on gene name

Merging based on uniprot ids leads to duplicates (most likely due to different ids of different isoforms of the same gene). Try to prevent this by using gene names for merging.
Check if this prevents duplicates from appearing in the database.
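
The effect of the merge key can be checked by counting distinct interactions under each key. The sketch below assumes liana-style column names (`source` and `target` for uniprot ids, `source_genesymbol` and `target_genesymbol` for gene names) and uses placeholder ids, not real isoform entries.

```python
def count_unique(rows, key_columns):
    """Count distinct interactions under the given merge key."""
    return len({tuple(row[column] for column in key_columns) for row in rows})


# Placeholder rows: the same gene pair reported with two different ids.
rows = [
    {"source_genesymbol": "GENE_A", "target_genesymbol": "GENE_B",
     "source": "ID_A_iso1", "target": "ID_B"},
    {"source_genesymbol": "GENE_A", "target_genesymbol": "GENE_B",
     "source": "ID_A_iso2", "target": "ID_B"},
]

n_by_uniprot = count_unique(rows, ["source", "target"])
n_by_symbol = count_unique(rows, ["source_genesymbol", "target_genesymbol"])
print(n_by_uniprot, n_by_symbol)  # prints: 2 1
```

Running the same comparison on the real combined table would show how many duplicates the gene-name key removes.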

CellPhoneDBv4.1 processing returns 1989 interactions, not 2923

Why are ~1000 interactions fewer after running the pipeline than in the input data?

I am looking at interaction_input.csv versus cpdb_lianaformat.txt

cpdb v4.1 interaction doublets

POMC OPRM1 P01189 P35372 Liana_v0.1.7;Cpdb_v4;Cpdb_v4
POMC OPRK1 P01189 P41145 Liana_v0.1.7;Cpdb_v4;Cpdb_v4
POMC OPRD1 P01189 P41143 Liana_v0.1.7;Cpdb_v4;Cpdb_v4

POMC is listed both as an individual interaction partner and as b-Endorphin_byPOMC, which causes the doublet.

issmallmolecule

There are 270 interactions annotated as small molecules:
268 False False True
2 True False True
Supposedly, the flag indicates "whether the interaction was sourced from the Pavlicev2017".

However, there are only 223 interactions in 'interaction_input_CellChatDB.csv' (the file is in source_databases/ in this repository).
In addition, there are some small molecules in CellPhoneDB which we should annotate as "issmallmolecule" as well.

import pandas as pd

# load the combined database (the file is tab-separated despite the .csv extension)
intercell_vienna = pd.read_csv("https://raw.githubusercontent.com/JanLeipzig/ViennaCCCdb/main/ViennaCCCdb.csv", sep="\t")

# add the "COMPLEX:" prefix to underscore-joined names that do not have it yet
intercell_vienna["source_genesymbol"] = ["COMPLEX:" + x if "_" in x and "COMPLEX:" not in x else x for x in intercell_vienna["source_genesymbol"]]
intercell_vienna["target_genesymbol"] = ["COMPLEX:" + x if "_" in x and "COMPLEX:" not in x else x for x in intercell_vienna["target_genesymbol"]]

However, if other tools cannot handle this prefix (the others still need to be double-checked), it may be best to leave it out. I vaguely remember that cell2cell expects names without this string, and maybe LIANA as well.
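
If the prefix turns out to be needed by some tools and not others, one option is to store one form and convert on demand. Below is a minimal sketch with two hypothetical helpers. Unlike the snippet above, it tests for the prefix with `startswith` rather than `in`, and it still assumes that any name containing an underscore is a complex.

```python
PREFIX = "COMPLEX:"


def add_complex_prefix(symbol: str) -> str:
    """Prefix underscore-joined names once; leave single genes and prefixed names alone."""
    if "_" in symbol and not symbol.startswith(PREFIX):
        return PREFIX + symbol
    return symbol


def strip_complex_prefix(symbol: str) -> str:
    """Remove the prefix if present, for tools that expect bare names."""
    return symbol[len(PREFIX):] if symbol.startswith(PREFIX) else symbol


prefixed = add_complex_prefix("IL10RB_IFNLR1")
bare = strip_complex_prefix(prefixed)
print(prefixed, bare)
```

Both helpers leave already-converted names unchanged, so they can be applied to a column without first checking its current form.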