Brain morphology and behaviour analyses
- I) Importing all the available data
- II) Extract the data for males and females
- III) Define the subset of phenotypes to use
- IV) Find the studies ID
- V) Extraction of the relevant behavioural data
The repository with all the scripts: https://github.com/Sbourgeat/brain_behavior
The script for this part is `dgrpool_behav.R`.
data_all_pheno <- readRDS("/Users/skumar/Documents/PhD/BrainAnalysis/Behavior/data.all_pheno_21_03_23_filtered.rds")
DGRPool GitHub repository: https://github.com/DeplanckeLab/dgrpool
- From the DGRPool GitHub, we downloaded the .rds file containing the information from all the experiments done with DGRP flies.
- A script is also provided to fetch the most up-to-date version of the dataset.
To reproduce their data-gathering steps exactly, see the following excerpt from their GitHub page:
> In order to be fully reproducible, we downloaded the phenotypes on the website at a given timepoint. The script used, download_phenotypes.R, access our API to download a "studies.json" file containing all metadata for each study. Then it uses the same API to download the phenotypes study by study, and format everything in a common format. It then generates a RDS file with a given timestamp, which is the common file used by all other methods, so that all scripts are using the same data, collected at a given timestamp. We here provided the data used in the latest version of the manuscript in data.all_pheno_21_03_23_filtered.rds, but you can run again download_phenotypes.R to generate a new RDS with the latest up-to-date phenotyping data.
# Extract the data for males, females and NA
data_male <- data_all_pheno[["M"]]
data_female <- data_all_pheno[["F"]]
data_na <- data_all_pheno[["NA"]]
Now we have all the existing data for the DGRP lines, split into male, female, and NA. Here is an example of a table found in data_male or data_female, with one column per study (DGRP, study_id_1, study_id_2, ...):
DGRP | study_id_1 | study_id_2 | study_id_n |
---|---|---|---|
DGRP 1 | Data 1 | Data 2 | Data 3 |
DGRP 2 | Data 4 | Data 5 | Data 6 |
DGRP 3 | Data 7 | Data 8 | Data 9 |
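The sex-keyed structure of the RDS and the wide per-study layout above can be mimicked with a small pandas sketch (the DGRP labels, study columns, and values below are made up for illustration only):

```python
import pandas as pd

# Toy stand-in for data_all_pheno: a dict keyed by sex, each value a wide
# table with one row per DGRP line and one column per study phenotype.
data_all_pheno = {
    "M": pd.DataFrame({
        "DGRP": ["DGRP_021", "DGRP_026", "DGRP_028"],
        "study_id_1": [0.1, 0.4, 0.7],
        "study_id_2": [0.2, 0.5, 0.8],
    }),
    "F": pd.DataFrame({
        "DGRP": ["DGRP_021", "DGRP_026"],
        "study_id_1": [0.3, 0.6],
    }),
}

data_male = data_all_pheno["M"]  # analogous to data_all_pheno[["M"]] in R
print(data_male.shape)           # (3, 3): three lines, DGRP + two study columns
```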
For this step, a .csv file is created that defines the phenotypes to keep. Here is the head of the file used for this analysis:
phenotypes | type_of_behavior |
---|---|
FarPoint_Butanedione | Olfactory |
LocRatio_Butanedione | Olfactory |
Resp_Butanedion_30pc | Olfactory |
The exact names of the phenotypes were obtained by manual queries on the DGRPool website: https://dgrpool.epfl.ch/phenotypes
First, we generate a subset to only keep behavioural data from the data we gathered from the website.
# open phenotypes_to_use.csv
phenotypes_to_use <- read.csv("/Users/skumar/Documents/PhD/BrainAnalysis/Behavior/brain_behavior/phenotypes_to_use.csv")
# keep only the rows whose type_of_behavior is one of the behaviours of interest
# (the values below match the CSV as written, including the leading space and the "aggresive" spelling)
behaviours_to_keep <- c(" olfactory", " aggresive", " locomotor", " food", " sleep", " phototaxi")
phenotypes_to_use <- phenotypes_to_use[phenotypes_to_use$type_of_behavior %in% behaviours_to_keep, ]
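For readers working in Python, the same row filter can be sketched with pandas. The CSV values carry a leading space, so stripping whitespace and lowercasing before comparison makes the match more robust (the rows below are hypothetical stand-ins for the real file):

```python
import pandas as pd

# Hypothetical stand-in for phenotypes_to_use.csv (the real file has more rows).
phenotypes_to_use = pd.DataFrame({
    "phenotype": ["FarPoint_Butanedione", "ClimbingSpeed", "SleepBout"],
    "type_of_behavior": [" olfactory", " flight", " sleep"],
})

behaviours_to_keep = {"olfactory", "aggresive", "locomotor", "food", "sleep", "phototaxi"}

# strip stray whitespace and lowercase before matching, so " olfactory"
# and "Olfactory" both survive the filter
mask = phenotypes_to_use["type_of_behavior"].str.strip().str.lower().isin(behaviours_to_keep)
phenotypes_to_use = phenotypes_to_use[mask]
print(phenotypes_to_use["phenotype"].tolist())  # ['FarPoint_Butanedione', 'SleepBout']
```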
Then, we need to use phenotypes_to_use to filter the data objects for both males and females. However, we first need to find the IDs of the studies in order to extract the relevant behavioural data from the .rds file.
To get the study IDs, we need to extract information from https://dgrpool.epfl.ch/phenotypes.json?all=1. The JSON file is organised with the following headings:
- ID: ID generated by download_phenotypes.R
- Name: Phenotype name
- Description: Description of the phenotype
- Created at: Date of creation of the data entry
- Updated at: Date of update of the data entry
- Study ID: ID of the study the phenotype belongs to
- Obsolete: Indicates if the phenotype is obsolete or no longer used
- Number of lines in NBER data group: Number of data lines in the NBER data group associated with the phenotype
- Number of male individuals: Number of individuals in the dataset that are classified as male
- Number of female individuals: Number of individuals in the dataset that are classified as female
- Number of individuals with unknown sex: Number of individuals in the dataset with unknown sex
- Is summary: Indicates if the phenotype is a summary or aggregate measure
- Is numeric: Indicates if the phenotype is a numeric value
- Is continuous: Indicates if the phenotype is a continuous measure
- Dataset ID: ID of the dataset where the phenotype data is stored
- Summary type ID: ID indicating the type of summary associated with the phenotype
- Unit ID: ID of the unit in which the phenotype is measured
- Sex breakdown by data group: Breakdown of the number of males and females in each data group associated with the phenotype
Using the JSON file, we can look phenotypes up by name, retrieve their IDs, and build a new table containing each study id and name.
library(jsonlite)
json_phenotypes <- fromJSON("https://dgrpool.epfl.ch/phenotypes.json?all=1")
json_phenotypes <- json_phenotypes[with(json_phenotypes, order(id)),]
rownames(json_phenotypes) <- json_phenotypes$id
message(nrow(json_phenotypes), " phenotypes found")
#print the head of json_phenotypes
head(json_phenotypes)
# iterate through the phenotypes; whenever a name appears in phenotypes_to_use,
# record its id in list_id and its name in name
list_id <- c()
name <- c()
for (i in 1:nrow(json_phenotypes)) {
if (json_phenotypes[i, "name"] %in% phenotypes_to_use$phenotype) {
list_id <- c(list_id, json_phenotypes[i, "id"])
name <- c(name, json_phenotypes[i, "name"])
}
}
list_id <- unlist(list_id)
print(name)
print(list_id)
# build a data frame pairing each phenotype id with its name
phenotypes_for_analysis <- data.frame(list_id, name)
# write the dataframe phenotypes_for_analysis to a csv file, without the rows index
write.csv(phenotypes_for_analysis, "/Users/skumar/Documents/PhD/BrainAnalysis/Behavior/brain_behavior/phenotypes_for_analysis.csv", row.names = F)
The phenotypes_for_analysis table looks like this:
list_id | name |
---|---|
1311 | mn_RespBenzaldeh_3_5 |
1312 | mn_RespAcetophen_3_5 |
1313 | mn_RespHexanol_3_5 |
1314 | mn_RespHexanol_0_3 |
1316 | mn_Resp_Hexanal |
1317 | mn_Resp_Citral |
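The same name-to-id lookup can be done without an explicit loop. Below is a hedged pandas sketch using a toy slice of the phenotypes table (the id 9999 and the name "unrelated_trait" are invented for contrast):

```python
import pandas as pd

# Toy slice of the phenotypes JSON; ids and names mirror the table above.
json_phenotypes = pd.DataFrame({
    "id": [1311, 1312, 1313, 9999],
    "name": ["mn_RespBenzaldeh_3_5", "mn_RespAcetophen_3_5",
             "mn_RespHexanol_3_5", "unrelated_trait"],
})
wanted = ["mn_RespBenzaldeh_3_5", "mn_RespHexanol_3_5"]

# keep only rows whose name is in the wanted list, mirroring the R loop
phenotypes_for_analysis = (
    json_phenotypes[json_phenotypes["name"].isin(wanted)]
    .rename(columns={"id": "list_id"})[["list_id", "name"]]
    .reset_index(drop=True)
)
print(phenotypes_for_analysis["list_id"].tolist())  # [1311, 1313]
```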
To finally extract the data we want, we only need to rename the columns to match the ids by keeping the last four characters of each column name. The following code does that and filters the data accordingly.
# keep only the last four characters of each column name,
# which correspond to the phenotype id
trim_colnames_to_id <- function(df) {
  nm <- colnames(df)
  colnames(df) <- substr(nm, nchar(nm) - 3, nchar(nm))
  df
}
data_male <- trim_colnames_to_id(data_male)
data_female <- trim_colnames_to_id(data_female)
data_na <- trim_colnames_to_id(data_na)
print(colnames(data_male))
print(colnames(data_female))
print(colnames(data_na))
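In pandas the same last-four-characters rename is a one-liner over the column index (the column names below are invented for illustration; only their final four characters matter):

```python
import pandas as pd

# Toy frame whose column names end with a four-digit phenotype id.
data_male = pd.DataFrame({
    "mn_RespHexanol_3_5_S10_1313": [0.1, 0.2],
    "some_other_pheno_S12_1316": [0.3, 0.4],
})

# keep only the final four characters of each column name,
# which hold the numeric phenotype id
data_male.columns = [c[-4:] for c in data_male.columns]
print(list(data_male.columns))  # ['1313', '1316']
```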
#write data_male as csv file
write.csv(data_male, "/Users/skumar/Documents/PhD/BrainAnalysis/Behavior/brain_behavior/data_male.csv", row.names = F)
#write data_female as csv file
write.csv(data_female, "/Users/skumar/Documents/PhD/BrainAnalysis/Behavior/brain_behavior/data_female.csv", row.names = F)
#write data_na as csv file
write.csv(data_na, "/Users/skumar/Documents/PhD/BrainAnalysis/Behavior/brain_behavior/data_na.csv", row.names = F)
**The script for this part is `extract_behaviour.py`.** This code performs several data-manipulation tasks on multiple datasets and saves the results as separate CSV files. Here is a summary of what the code does:
- The code imports the data from two CSV files, 'data_male.csv' and 'data_female.csv', which are stored as the dataframes `data_male` and `data_female`.
data_male = pd.read_csv("data_male.csv")
data_female = pd.read_csv("data_female.csv")
- It imports another dataset from 'entropy_vol_sep2023.csv' and stores it as a dataframe called `vol_entropy`.
vol_entropy = pd.read_csv("entropy_vol_sep2023.csv")
- The code normalizes the DGRP identifiers in `vol_entropy` into a new 'dgrp' column: if the value in the 'DGRP' column is two characters long, it is prefixed with 'DGRP_0'; otherwise it is prefixed with 'DGRP_'.
vol_entropy['dgrp'] = vol_entropy['DGRP'].apply(lambda x: 'DGRP_0' + str(x) if len(str(x)) == 2 else 'DGRP_'+ str(x))
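An equivalent, arguably clearer way to build the prefixed label is `str.zfill`, which zero-pads to a fixed width. Note this is a variant, not the script's code: it differs from the two-branch lambda only for one-character ids, which the lambda leaves unpadded.

```python
def dgrp_label(x):
    # zero-pad the line number to three digits, then prefix with 'DGRP_'
    return "DGRP_" + str(x).zfill(3)

print(dgrp_label(21))   # 'DGRP_021'
print(dgrp_label(360))  # 'DGRP_360'
```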
- The `vol_entropy` dataframe is then split into two dataframes, `male` and `female`, based on the 'Sex' column.
male = vol_entropy[vol_entropy['Sex'] == 'male']
female = vol_entropy[vol_entropy['Sex'] == 'female']
- The `data_male` dataframe is merged with the `male` dataframe on the 'dgrp' column, producing `merged_data_male`.
merged_data_male = pd.merge(data_male, male, on='dgrp')
- Similarly, `data_female` is merged with `female` on the 'dgrp' column, producing `merged_data_female`.
merged_data_female = pd.merge(data_female, female, on='dgrp')
- Finally, the `merged_data_male` and `merged_data_female` dataframes are saved as separate CSV files named 'dgrpool_brain_behavior_male.csv' and 'dgrpool_brain_behavior_female.csv', respectively.
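`pd.merge` defaults to an inner join, so only DGRP lines present in both tables survive the merge. A toy sketch (made-up lines and values) makes this visible:

```python
import pandas as pd

# Toy behaviour table and toy brain-morphology table sharing a 'dgrp' key.
data_male = pd.DataFrame({"dgrp": ["DGRP_021", "DGRP_026", "DGRP_028"],
                          "1313": [0.1, 0.2, 0.3]})
male = pd.DataFrame({"dgrp": ["DGRP_021", "DGRP_028"],
                     "volume": [5.1, 4.8]})

# default inner join: rows for lines missing from either side are dropped
merged_data_male = pd.merge(data_male, male, on="dgrp")
print(merged_data_male["dgrp"].tolist())  # ['DGRP_021', 'DGRP_028']
```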
**The script for this part is `behaviour_analysis.py`.**
This code performs various computations and visualizations on a merged dataframe that combines brain and behavior data. Here's a breakdown of what each function does:
- `split_string_with_dgrp(df)`: This function takes a dataframe as input and splits the 'genotype' column into two separate lists: DGRP and sex. It iterates over each row and checks whether the 'genotype' value contains the string 'dgrp'. If it does, it appends that value to the DGRP list and the corresponding 'sex' value to the sex list. Finally, it returns both lists.
- `merge_data(brain, behav, DGRP, sex)`: This function merges the brain and behaviour data based on the DGRP and sex values. It selects rows from the behaviour dataframe where 'genotype' is in the DGRP list, 'sex' is in the sex list, and 'head_scanned' is True. It then harmonizes the 'genotype' values in both dataframes, renames the 'genotype' column in the behaviour dataframe to 'DGRP', and merges on the 'DGRP' and 'sex' columns. Finally, it returns the merged dataframe.
- `calculate_pvalues(df)`: This function calculates the p-values for the correlation matrix of a dataframe. It creates an empty dataframe with the same columns as the input, then iterates over all pairs of columns and computes the p-value of their correlation with the `pearsonr` function from `scipy.stats`. Each p-value is rounded to 4 decimal places and stored in the corresponding cell of the output dataframe, which is returned.
The code also loads two CSV files, `summary.csv` and `vol_hratio.csv`, into the `behav` and `brain` dataframes, respectively. It calls `split_string_with_dgrp` to extract the DGRP and sex lists from `behav`, then calls `merge_data` to merge `brain` and `behav` based on those lists, producing the `merged_df` dataframe.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import pandas as pd
import plotly.graph_objects as go
from scipy.stats import pearsonr

def split_string_with_dgrp(df):
    """
    This function takes a dataframe as input and splits the 'genotype' column into two separate lists: DGRP and sex.

    Parameters:
    df (DataFrame): The input dataframe containing the 'genotype' and 'sex' columns.

    Returns:
    DGRP (list): A list of DGRP values extracted from the 'genotype' column.
    sex (list): A list of sex values extracted from the 'sex' column.
    """
    DGRP = []
    sex = []
    for i in range(len(df)):
        if 'dgrp' in df.iloc[i, 1]:
            DGRP.append(df.iloc[i, 1])
            sex.append(df.iloc[i, 2])
    return DGRP, sex

def merge_data(brain, behav, DGRP, sex):
    """
    This function merges the brain and behavior data based on the DGRP and sex values.

    Parameters:
    brain (DataFrame): The brain data dataframe.
    behav (DataFrame): The behavior data dataframe.
    DGRP (list): A list of DGRP values.
    sex (list): A list of sex values.

    Returns:
    merged_df (DataFrame): The merged dataframe containing the brain and behavior data.
    """
    # .copy() avoids pandas' SettingWithCopyWarning on the assignments below
    data = behav[behav['genotype'].isin(DGRP) & behav['sex'].isin(sex) & (behav["head_scanned"] == True)].copy()
    # harmonize genotype labels to the DGRP_0xx / DGRP_xxx convention
    data['genotype'] = data['genotype'].apply(lambda x: 'DGRP_0' + x.split('dgrp')[1] if len(x.split('dgrp')[1]) == 2 else 'DGRP_' + x.split('dgrp')[1])
    brain['DGRP'] = brain['DGRP'].apply(lambda x: 'DGRP_0' + str(x) if len(str(x)) == 2 else 'DGRP_' + str(x))
    data.rename(columns={'genotype': 'DGRP'}, inplace=True)
    merged_df = pd.merge(brain, data, on=['DGRP', 'sex'])
    return merged_df

def calculate_pvalues(df):
    """
    This function calculates the p-values for the correlation matrix of a dataframe.

    Parameters:
    df (DataFrame): The input dataframe.

    Returns:
    pvalues (DataFrame): The dataframe containing the p-values for the correlation matrix.
    """
    dfcols = pd.DataFrame(columns=df.columns)
    pvalues = dfcols.transpose().join(dfcols, how='outer')
    for r in df.columns:
        for c in df.columns:
            # restrict to rows where both columns are non-null
            tmp = df[df[r].notnull() & df[c].notnull()]
            # .loc avoids the deprecated chained assignment pvalues[r][c] = ...
            pvalues.loc[c, r] = round(pearsonr(tmp[r], tmp[c])[1], 4)
    return pvalues

behav = pd.read_csv("/Users/skumar/Documents/PhD/BrainAnalysis/Behavior/summary.csv")
brain = pd.read_csv("/Users/skumar/Project/PHD_work/GWAS/dataset/vol_hratio.csv", sep=",")

DGRP, sex = split_string_with_dgrp(behav)
merged_df = merge_data(brain, behav, DGRP, sex)
correlation_matrix = merged_df[["abs_volume", "h_ratio", "activity", "correct_choices", "frac_time_on_shocked"]].corr()
p_values = calculate_pvalues(merged_df[["abs_volume", "h_ratio", "activity", "correct_choices", "frac_time_on_shocked"]])

fig = go.Figure(data=go.Heatmap(
    z=correlation_matrix.values,
    x=correlation_matrix.columns,
    y=correlation_matrix.columns,
    colorscale="Viridis",
    colorbar=dict(title="Correlation Coefficient")
))
annotations = []
for i, row in enumerate(correlation_matrix.values):
    for j, value in enumerate(row):
        annotations.append(
            dict(
                x=correlation_matrix.columns[j],
                y=correlation_matrix.columns[i],
                text=f"p-value: {p_values.iloc[i, j]:.3f}",
                showarrow=False,
                font=dict(color="white" if abs(value) > 0.5 else "black")
            )
        )
fig.update_layout(
    title="Correlation Coefficient and p-values",
    annotations=annotations,
    xaxis=dict(title="Variable"),
    yaxis=dict(title="Variable"),
)
fig.show()
The outputs are stored as HTML plots and can be found in the results folder.
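As a sanity check, the pairwise p-value computation can be exercised on a tiny frame. Below is a compact re-implementation of the same loop as `calculate_pvalues` (pairwise Pearson p-values via scipy), run on made-up, perfectly correlated data:

```python
import pandas as pd
from scipy.stats import pearsonr

def calculate_pvalues(df):
    # p-value of the Pearson correlation for every pair of columns,
    # computed on the rows where both columns are non-null
    pvalues = pd.DataFrame(index=df.columns, columns=df.columns, dtype=float)
    for r in df.columns:
        for c in df.columns:
            tmp = df[df[r].notnull() & df[c].notnull()]
            pvalues.loc[r, c] = round(pearsonr(tmp[r], tmp[c])[1], 4)
    return pvalues

toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                    "b": [2.0, 4.0, 6.0, 8.0]})  # b = 2 * a, perfectly correlated
p = calculate_pvalues(toy)
print(p.loc["a", "b"])  # 0.0 after rounding to 4 decimals
```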