
viennacccdb's Introduction

ViennaCCCdb

ViennaCCCdb combines three existing cell-cell communication databases: the Liana Consensus database v0.1.7, CellPhoneDB v4, and a manual selection based on Pavlicev et al. 2017. An interaction that exists in more than one database is included only once.

Methods

Download source data

Extract liana consensus database from https://github.com/saezlab/liana

  • Install liana
library(liana)
consensus<-select_resource(c('Consensus'))
write.table(consensus, file="liana-db_0.1.12.txt")

Download CellPhoneDB v4.1 from https://github.com/ventolab/cellphonedb-data/releases/tag/v4.1.0

Download additional interactions originally from Pavlicev et al. (2017) and curated by D. Stadtmauer from https://gitlab.com/wandplabs/ligrec-enzymes

Convert to liana format

python scripts/convert_cpdb_to_LianaFormat.py source_databases/cpdb_v4.1.0/ > source_databases/cpdb_lianaformat_4.1.txt

python scripts/convert_customData_to_LianaFormat.py source_databases/interaction_input_CellChatDB.csv > source_databases/customData_lianaformat.txt

Create "raw version" of combined database

This file is the basic input necessary for cell-cell interaction pipelines. Annotation of these interactions is stored in ViennaCCCdb_annotation.csv

python scripts/createCombinedDatabase.py source_databases/liana-db_0.1.12.txt source_databases/cpdb_lianaformat_4.1.txt source_databases/customData_lianaformat.txt > ViennaCCCdb_raw.csv
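
The merging idea can be sketched as follows. This is a minimal illustration and not the actual `createCombinedDatabase.py`: the function name `combine`, the database labels, and the reduction of each interaction to a (source, target) pair are assumptions made for the example.

```python
def combine(databases):
    """Merge interaction lists, keeping each (source, target) pair once.

    `databases` maps a label (hypothetical, e.g. "Liana") to a list of
    (source_genesymbol, target_genesymbol) tuples.
    """
    combined = {}
    for label, interactions in databases.items():
        for pair in interactions:
            # Record every database that reports this pair, without repeats.
            labels = combined.setdefault(pair, [])
            if label not in labels:
                labels.append(label)
    return combined


# Illustrative input only; the third pair uses placeholder gene names.
example = {
    "Liana": [("POMC", "OPRM1"), ("IFNL3", "IL10RB_IFNLR1")],
    "Cpdb": [("POMC", "OPRM1")],
    "Custom": [("GENE_X", "GENE_Y")],
}
merged = combine(example)
for (source, target), labels in merged.items():
    print(source, target, ";".join(labels), sep="\t")
```

Keeping the list of contributing databases next to each pair makes it possible to trace where an interaction came from after deduplication.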

Notes

Manual edits to CellPhoneDB

  • for 'CCL3L1' no uniprot id was given, so 'P16619' was assigned manually
  • interactions with 'IFNA*' genes are excluded since no uniprot ids were given in 'protein_input.csv', see #1
  • interactions with 'HLA' genes are excluded since no uniprot ids were given in 'protein_input.csv', possibly because they are manually curated. Manually assigned uniprot ids would have to be consistent with the liana consensus
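
The rule behind these edits (use a manual id where one was chosen, otherwise drop entries without an id) could look roughly like this. The lookup table `MANUAL_UNIPROT` and the function `resolve_uniprot` are hypothetical names for this sketch and are not taken from the conversion script.

```python
# Assumption: manual assignments live in a small lookup table.
# Only the CCL3L1 entry comes from the notes above.
MANUAL_UNIPROT = {"CCL3L1": "P16619"}


def resolve_uniprot(symbol, uniprot):
    """Return the UniProt id for `symbol`, or None if the entry must be dropped."""
    if uniprot:
        return uniprot
    # Fall back to a manual assignment; None means "exclude this gene".
    return MANUAL_UNIPROT.get(symbol)


# An empty string stands for a missing id in 'protein_input.csv'.
rows = [("CCL3L1", ""), ("IFNA1", ""), ("POMC", "P01189")]
kept = []
for symbol, uniprot in rows:
    resolved = resolve_uniprot(symbol, uniprot)
    if resolved is not None:
        kept.append((symbol, resolved))
print(kept)
```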

viennacccdb's People

Contributors

dnjst, janleipzig

viennacccdb's Issues

Genes with more than one name

PTPRC and CD45 refer to the same gene (uniprot id P08575).
In cpdb both names are used in different files; in liana only PTPRC is used.

Can or does this cause problems?
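
One place where this could matter is any step that compares interactions by gene symbol. A possible safeguard, sketched here under the assumption of a hand-maintained alias table (`ALIASES` and both function names are hypothetical), is to map every symbol to one canonical name before comparing:

```python
# Assumption: aliases are curated by hand; only CD45 -> PTPRC comes from this issue.
ALIASES = {"CD45": "PTPRC"}


def canonical_symbol(symbol):
    """Map a known alias to its canonical gene symbol; leave other symbols unchanged."""
    return ALIASES.get(symbol, symbol)


def canonical_interaction(source, target):
    """Normalize both partners of an interaction."""
    return canonical_symbol(source), canonical_symbol(target)


# "PARTNER" is a placeholder, not a real gene symbol.
a = canonical_interaction("CD45", "PARTNER")
b = canonical_interaction("PTPRC", "PARTNER")
print(a == b)
```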

Processing of similar but not identical interactions

Currently, every interaction that is not identical to one from another database is added. If one database lists an additional protein in a complex, the result is therefore stored as a separate interaction.

Example 1:
CPDB) Complex:ProtA_ProtB Complex:ProtC_ProtD
Liana) Complex:ProtA_ProtB Complex:ProtC_ProtD
=> This interaction is added only once to our combined database

Example 2:
CPDB) Complex:ProtA_ProtB Complex:ProtC_ProtD
Liana) Complex:ProtA_ProtB Complex:ProtC_ProtD_ProtE
=> Both interactions are added to the combined database
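
To flag such near-duplicates instead of silently adding both, complex members can be compared as sets. The following is a sketch only: `members` and `relation` are hypothetical helpers, and splitting on "_" assumes that individual protein names contain no underscore.

```python
def members(partner):
    """Split 'Complex:A_B' or a single protein name into a frozenset of members."""
    name = partner.split(":", 1)[-1]  # drop an optional 'Complex:' prefix
    return frozenset(name.split("_"))


def relation(a, b):
    """Classify two interactions, each given as a (source, target) pair."""
    key_a = tuple(members(p) for p in a)
    key_b = tuple(members(p) for p in b)
    if key_a == key_b:
        return "identical"
    # "similar": on each side, one member set is contained in the other.
    if all(x <= y or y <= x for x, y in zip(key_a, key_b)):
        return "similar"
    return "different"


cpdb = ("Complex:ProtA_ProtB", "Complex:ProtC_ProtD")
liana_same = ("Complex:ProtB_ProtA", "Complex:ProtC_ProtD")
liana_extra = ("Complex:ProtA_ProtB", "Complex:ProtC_ProtD_ProtE")
print(relation(cpdb, liana_same))
print(relation(cpdb, liana_extra))
```

Interactions classified as "similar" could then be reported for manual review rather than merged automatically.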

Conflicting uniprot ids

In a few cases there are conflicting uniprot ids. We should either show only one version or all of them.
Currently, none is given in the ViennaCCCdb_raw.csv file.
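
If the "show all of them" option were chosen, the ids could be collected per gene and joined into one field. This is a sketch under assumptions: the "|" separator, the function names, and the placeholder gene and ids are invented for illustration.

```python
from collections import defaultdict


def collect_ids(records):
    """Map each gene symbol to the sorted list of distinct ids reported for it."""
    ids = defaultdict(set)
    for symbol, uniprot in records:
        ids[symbol].add(uniprot)
    return {symbol: sorted(found) for symbol, found in ids.items()}


def format_ids(found):
    """Join all ids with '|' (the separator is an assumption)."""
    return "|".join(found)


# Placeholder data: GENE_A is reported with two different ids.
records = [("GENE_A", "ID_2"), ("GENE_A", "ID_1"), ("GENE_B", "ID_3"), ("GENE_A", "ID_2")]
by_gene = collect_ids(records)
conflicts = {s: format_ids(f) for s, f in by_gene.items() if len(f) > 1}
print(conflicts)
```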

Small molecule annotation from cpdb

The cpdb small molecule annotation is currently lost, e.g.
Progesterone_byHSD3B1 is only represented by its corresponding uniprot id (P14060).

We could create a 'CPDB_small_molecule' column and save "Progesterone" there.
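
Filling such a column would require splitting the cpdb name into its molecule and gene parts. A minimal sketch, assuming the names always follow the `<molecule>_by<gene>` pattern seen in the example (`split_small_molecule` is a hypothetical helper):

```python
def split_small_molecule(name):
    """Split a 'Molecule_byGENE' name into (molecule, gene).

    Returns (None, name) when the name contains no '_by' marker.
    """
    molecule, marker, gene = name.rpartition("_by")
    if not marker:
        return None, name
    return molecule, gene


print(split_small_molecule("Progesterone_byHSD3B1"))
print(split_small_molecule("POMC"))
```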

IFNA1* in cpdbv4

These interactions are currently excluded due to an inconsistent representation in the cpdb.

This should be looked into in order to rescue the IFNA* interactions.

Cellphone db double entries for complexes due to different small molecule

This interaction occurs twice:
COMPLEX:ALOX5_ALOX5AP_LTC4S CYSLTR1 COMPLEX:P09917_P20292_Q16873

In cpdb it is stored for different small molecules, i.e.:
LipoxinA4_byALOX5,P09917,P20292,Q16873,,False,False,True,CHEBI:6498,True,False,,False,True,biosynthesis_enzyme,,FALSE,,
LeukotrieneC4_byLTC4S,P20292,P09917,Q16873,,False,False,True,CHEBI:16978,True,False,,False,True,biosynthesis_enzyme,,FALSE,,

If we only know the genes, we cannot distinguish whether LipoxinA4 or LeukotrieneC4 is present. Both should be represented in the small molecule annotation, but only one interaction should be kept.

Currently, only one interaction is kept; the small molecule annotation is still missing.
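
One way to keep a single interaction while retaining both molecule names is to group the cpdb entries by their set of uniprot ids. This is a sketch only: `group_by_enzymes` is a hypothetical helper, the two entries are reduced to the name and id columns from the lines quoted above, and the ";" separator is an assumption.

```python
from collections import defaultdict


def group_by_enzymes(entries):
    """Group small-molecule names by the order-independent set of uniprot ids."""
    grouped = defaultdict(list)
    for name, uniprot_ids in entries:
        molecule = name.split("_by", 1)[0]  # 'LipoxinA4_byALOX5' -> 'LipoxinA4'
        grouped[frozenset(uniprot_ids)].append(molecule)
    return grouped


entries = [
    ("LipoxinA4_byALOX5", ["P09917", "P20292", "Q16873"]),
    ("LeukotrieneC4_byLTC4S", ["P20292", "P09917", "Q16873"]),
]
grouped = group_by_enzymes(entries)
for ids, molecules in grouped.items():
    # One line per complex, with all of its small molecules listed together.
    print("COMPLEX:" + "_".join(sorted(ids)), ";".join(molecules))
```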

Cellphone db double entries for complexes

IFNL3 has two different interactions, one with each of the complexes IL28_receptor and Type_III_IFNR.
Both complexes consist of the same two proteins, IFNLR1 and IL10RB, so this results
in a duplicated interaction in the database, i.e.
IFNL3 COMPLEX:IL10RB_IFNLR1 Q8IZI9 COMPLEX:Q08334_Q8IU57

For now, the interaction with "Type_III_IFNR" is hardcoded to be ignored in the conversion script 'convert_cpdb_to_LianaFormat.py'.

Are there any better solutions, or is this a mistake in the cpdb database?

Annotation of interactions

Currently, no additional annotation is stored in the database.

  • Which annotations do we want?
  • Where do we get the annotations from?

Merge source databases based on gene name

Merging based on uniprot ids leads to duplicates (most likely due to different ids of different isoforms of the same gene). Try to prevent this by using gene names for merging.
Check if this prevents duplicates from appearing in the database.

cpdb v4.1 interaction doublets

POMC OPRM1 P01189 P35372 Liana_v0.1.7;Cpdb_v4;Cpdb_v4
POMC OPRK1 P01189 P41145 Liana_v0.1.7;Cpdb_v4;Cpdb_v4
POMC OPRD1 P01189 P41143 Liana_v0.1.7;Cpdb_v4;Cpdb_v4

POMC is listed both as an individual interaction partner and as b-Endorphin_byPOMC, which causes the doublets shown above.
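
Until the duplication is resolved at the source, the repeated label in the last column could at least be collapsed. A minimal sketch, assuming the column is a ";"-separated list of database labels (`unique_sources` is a hypothetical helper); it only cleans the label and does not address why cpdb contributes the interaction twice.

```python
def unique_sources(tag):
    """Remove repeated labels from a ';'-separated source string, keeping order."""
    # dict.fromkeys keeps the first occurrence of each label in insertion order.
    return ";".join(dict.fromkeys(tag.split(";")))


print(unique_sources("Liana_v0.1.7;Cpdb_v4;Cpdb_v4"))
```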

issmallmolecule

There are 270 interactions annotated as small molecules:
268 False False True
2 True False True
The flag supposedly indicates "whether the interaction was sourced from the Pavlicev2017", but:

  • there are only 223 interactions in 'interaction_input_CellChatDB.csv' (the file is in source_databases/ in this repository)
  • in addition, there are some small molecules in CellPhoneDB which we should annotate as "issmallmolecule" as well

cpdb 4.1

Integrate the new CellPhoneDB 4.1 data into the combined database.

  • The file formats have changed, so the conversion script must be adapted.

"COMPLEX:" in gene symbol columns?

Should we have COMPLEX: in the gene symbol columns as well as the Uniprot columns?

squidpy and chinpy expect this, so I wrote this code to update the list:

import pandas as pd

intercell_vienna = pd.read_csv("https://raw.githubusercontent.com/JanLeipzig/ViennaCCCdb/main/ViennaCCCdb.csv", sep="\t")

# add the "COMPLEX:" prefix to multi-subunit gene symbols (those containing "_")
intercell_vienna["source_genesymbol"] = ["COMPLEX:" + x if "_" in x and "COMPLEX:" not in x else x for x in intercell_vienna["source_genesymbol"]]
intercell_vienna["target_genesymbol"] = ["COMPLEX:" + x if "_" in x and "COMPLEX:" not in x else x for x in intercell_vienna["target_genesymbol"]]

But if the prefix breaks other tools (I still need to double check them), it may be best to leave it out. I vaguely remember that cell2cell needs the symbols without this string, and maybe LIANA as well.
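
Since tools may differ in what they expect, one option is to store one form and convert on demand. The two helpers below are a hypothetical sketch with no dependencies; like the snippet above, they treat any symbol containing "_" as a complex.

```python
PREFIX = "COMPLEX:"


def add_complex_prefix(symbol):
    """Prefix multi-subunit symbols (containing '_') unless already prefixed."""
    if "_" in symbol and not symbol.startswith(PREFIX):
        return PREFIX + symbol
    return symbol


def strip_complex_prefix(symbol):
    """Remove a leading 'COMPLEX:' if present."""
    return symbol[len(PREFIX):] if symbol.startswith(PREFIX) else symbol


symbols = ["IL10RB_IFNLR1", "COMPLEX:ALOX5_ALOX5AP_LTC4S", "POMC"]
with_prefix = [add_complex_prefix(s) for s in symbols]
without_prefix = [strip_complex_prefix(s) for s in symbols]
print(with_prefix)
print(without_prefix)
```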
