Ethoscopy
Ethoscopy user manual
- Getting started
- Metadata design
- Loading the data
- Behavpy
- Behavpy - Circadian analysis
- Behavpy - Hidden Markov Model
- Jupyter tutorials
Getting started
Installing ethoscopy as a docker container with ethoscope-lab (recommended).
The ethoscope-lab docker container is the recommended way to use ethoscopy. A docker container is a pre-made image that will run on any computer, independently of the operating system you use. The docker container is isolated from the rest of the machine and will not interfere with your other Python or R installations. It comes with its own dependencies and will just work. The container also ships with its own multi-user JupyterHub, so lab members can log into it directly from their browser and run all their analyses remotely from any computer, at home or at work. In the Gilestro laboratory we use a common workstation with the following hardware configuration.
CPU: 12x Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
GPU: Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
Hard drive: 1TB SSD for OS, 1TB SSD for homes and cache, 7.3TB for ethoscope data
Memory: 64GB
The workstation is accessible via the internet (behind VPN) so that any lab member can log into the service and run their analyses remotely. All the computational power is in the workstation itself, so one can analyse ethoscope data from a tablet, if need be. Follow the instructions below to install the ethoscope-lab docker container on your machine.
On linux (recommended)
The best solution is to install this on the same computer that collects the ethoscope data, so ethoscopy has direct access to the db files stored on the machine. For most small installations, this computer could be the node.
To install the docker you will have to find out the following information:
- what is the internet name or address of the computer you want to access? This can be the IP or an actual name.
- Where are your ethoscope data stored? On a regular node they would be in /ethoscope_data/results
Once this information is clear, you can proceed.
# Optional. Update the system to the latest version. You may want to restart after this.
sudo pamac update
# install docker
sudo pacman -S docker
# start the docker service
sudo systemctl enable --now docker
# and finally download and run the ethoscope-lab docker container
# the :ro flag means you are mounting that destination in read-only
sudo docker run -d -p 8000:8000 \
--name ethoscope-lab \
--volume /ethoscope_data/results:/mnt/ethoscope_results:ro \
--restart=unless-stopped \
ggilestro/ethoscope-lab
Installation on Windows or MacOS makes sense if you have actual ethoscope data on those machines, which is normally not the case. If you go for those OSs, I won't provide detailed instructions or support, as I assume you know what you're doing.
On MacOS
Install the docker software from here. Open the terminal and run the same command as above, e.g.:
# download and run the ethoscope-lab docker container
# the :ro flag means you are mounting that destination in read-only
sudo docker run -d -p 8000:8000 \
--name ethoscope-lab \
--volume /path/to/ethoscope_data/:/mnt/ethoscope_results:ro \
--restart=unless-stopped \
ggilestro/ethoscope-lab
On Windows
Install the docker software from here. After installation, open the Windows terminal and issue the same command as above, only replacing the folder syntax as appropriate. For instance, if your ethoscope data are on z:\ethoscope_data and the user data are on c:\Users\folder, use the following:
sudo docker run -d -p 8000:8000 \
--name ethoscope-lab \
--volume /z/ethoscope_data:/mnt/ethoscope_results:ro \
--restart=unless-stopped \
ggilestro/ethoscope-lab
Storing user data on the machine, not on the container (recommended)
ethoscope-lab runs on top of a JupyterHub environment, meaning that it supports organised and simultaneous access by multiple users. Users will need their own credentials and their own home folder. The default user is ethoscopelab, with password ethoscopelab, and this user will save all of their work in the folder /home/ethoscopelab. In the examples above, the users' folders are stored inside the container itself, which is not ideal. A better solution is to mount the home folders to a local path on your machine. In the example below, we use the folder /mnt/my_user_homes.
sudo docker run -d -p 8000:8000 \
--name ethoscope-lab \
--volume /ethoscope_data/results:/mnt/ethoscope_results:ro \
--volume /mnt/my_user_homes:/home \
--restart=unless-stopped \
ggilestro/ethoscope-lab
Make sure that your local home location contains an ethoscopelab folder that can be accessed by the ethoscopelab user! In the example above, you would need to create a folder called /mnt/my_user_homes/ethoscopelab.
Any folder in /mnt/my_user_homes will become accessible to ethoscopelab. In our lab, we sync those using ownCloud (an open-source Dropbox clone) so that every user has their files automatically synced across all their machines.
Creating new users
If you want to add new users, you will have to do it from the command line. On the Linux computer running ethoscope-lab (normally the node) use the following commands:
# enter a bash shell inside the container
sudo docker exec -it ethoscope-lab /bin/bash
# create the new user
useradd myusername -m
# set the password for the user you just created
passwd myusername
You will now be able to log into Jupyter with these new credentials. The data will be stored in the newly created folder.
Persistent user credentials
In Linux, user credentials are saved inside three files: /etc/passwd, /etc/shadow, and /etc/group. It is possible to store those on the host computer (e.g. the node) and then mount them into the container. This is called a persistent volume because the data will remain on the host computer even if the container is deleted. An example of a container running in this way is the following:
sudo docker run -d -p 8000:8000 \
--name ethoscope-lab \
--volume /mnt/data/results:/mnt/ethoscope_results:ro \
--volume /mnt/data/ethoscope_metadata:/opt/ethoscope_metadata \
--volume /mnt/homes:/home \
--volume /mnt/cache:/home/cache \
--restart=unless-stopped \
-e VIRTUAL_HOST="jupyter.lab.gilest.ro" \
-e VIRTUAL_PORT="8000" \
-e LETSENCRYPT_HOST="jupyter.lab.gilest.ro" \
-e LETSENCRYPT_EMAIL="giorgio@gilest.ro" \
--volume /mnt/secrets/passwd:/etc/passwd:ro \
--volume /mnt/secrets/group:/etc/group:ro \
--volume /mnt/secrets/shadow:/etc/shadow:ro \
--cpus=10 \
ggilestro/ethoscope-lab.gilest.ro:latest
The three volumes mounted from /mnt/secrets indicate the location of the user credentials. This configuration allows you to maintain user information even when upgrading ethoscope-lab to newer versions.
Troubleshooting
If your Jupyter starts but hangs on the loading screen, it means that the ethoscopelab user does not have access to its own folder. This most likely indicates that you are running the container with the home folders mounted from your local machine, but the ethoscopelab home folder is either not present or does not have read and write permissions.
Install ethoscopy in your Python space
Ethoscopy is on GitHub and on PyPI. You can install the latest stable version with pip.
pip install ethoscopy
As of version 1.2.25, the required dependencies are:
Python >= 3.8
Pandas ^ 1.4.2
Numpy ^ 1.22.3
Scipy ^ 1.8
Hmmlearn ^ 0.3.0
Color ^ 0.1.5
Astropy ^ 5.1.1
PyWavelets ^ 1.4.1
plotly ^ 5.7.0
kaleido ^ 0.2.1
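To check that the installation worked, a minimal sketch using the standard library's importlib.metadata (nothing ethoscopy-specific) is:
import ethoscopy as etho
from importlib.metadata import version
# if the import above succeeds the package is available;
# importlib.metadata reports which release pip installed
print(version('ethoscopy'))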
Metadata design
What is the metadata?
The metadata is a simple .csv file that contains the information needed to (1) find and retrieve the data saved to a remote server and (2) segment and transform the data to compare experimental conditions. I would recommend recording as many experimental variables as possible, just to be safe. Each row of the .csv file is an individual from the experiment. It is mandatory to have at least the following columns: 'machine_name', 'region_id', and 'date', with the date in YYYY-MM-DD format. Without these columns the data cannot be retrieved.
Top tips.
- Make the metadata exhaustive. Record everything just in case you need the information in the future.
- Once you've finished all your replicates, put them in the same csv file with a column "replicate" to identify them. It's easier than loading them separately.
- Make your metadata files straight away, as soon as you start your experiment. Time will erode the little details!
Loading the data
Setting up
To begin you need three paths saved as variables:
- the path to the metadata .csv file
- the full path (including folder) of the FTP location (e.g. ftp://myethoscopedata.com/results/)
- the path of a local folder to save downloaded .db files (if your .db files are already downloaded on the same machine running ethoscopy, this will be the path to the folder containing them)
import ethoscopy as etho
# Replace with your own file/server paths
meta_loc = 'USER/experiment_folder/metadata.csv'
remote = 'ftp://ftpserver/auto_generated_data/ethoscope_results'
local = 'USER/ethoscope_results'
# This will download the data remotely via FTP onto the local machine
# If your ethoscopy is running via ethoscope-lab on the same machine
# where the ethoscope data are, then this step is not necessary
etho.download_from_remote_dir(meta_loc, remote, local)
download_from_remote_dir - Function reference
This function is used to import data from the ethoscope node platform to your local directory for later use. The ethoscope files must be saved on a remote FTP server as .db files; see the Ethoscope manual for how to set up a node correctly: https://www.notion.so/giorgiogilestro/Ethoscope-User-Manual-a9739373ae9f4840aa45b277f2f0e3a7
This only needs to be run if, like the Gilestro lab, you have all your data saved to a remote FTP server. If not, you can skip straight to the next part.
Create a modified metadata DataFrame
This function creates a modified metadata DataFrame with the paths of the saved .db files and generates a unique id for each experimental individual. It only works on files saved locally to the computer running ethoscopy, in the nested directory structure created by the ethoscopes, i.e.
'LOCAL_DIR/ethoscope_results/00190f0080e54d9c906f304a8222fa8c/ETHOSCOPE_001/2022-08-23_03-33-59/DATABASE.db'
For this function you only need the path to the metadata file and the path to the top level of your database directories, as seen in the example below. Do not provide a path directly to the folder containing a known .db file; the function searches all the saved data directories and selects the ones that match the metadata file.
meta_loc = 'USER/experiment_folder/metadata.csv'
local = 'USER/ethoscope_results' # remember to just provide the path up to the directory where the individual ethoscope files are saved
meta = etho.link_meta_index(meta_loc, local)
Load and modify the ethoscope data
The load function takes the raw ethoscope data from its .db format and modifies it into a workable pandas DataFrame, changing the time (in seconds) to be relative to a given reference hour (usually lights on). Min and max times can be provided to filter the data to only recordings between those hours, with 0 being relative to the start of the experiment, not the reference hour.
data = etho.load_ethoscope(meta, min_time = 24, max_time = 48, reference_hour = 9.0)
# You can cache each specimen as the data is loaded for faster load times when run again: just add a file path to a folder of your choice. The first time it will save, the second time it will search the folder and load straight from there
# However, this can take up a lot of space; it's recommended to save the whole loaded dataset at the end and to load from that each time. See the end of this page
data = etho.load_ethoscope(meta, min_time = 24, max_time = 48, reference_hour = 9.0, cache = 'path/ethoscope_cache/')
Additionally, an analysing function can also be called to modify the data as it is read. It's recommended you always call at least the max_velocity_detector or sleep_annotation function when loading the data, as they generate columns that are needed for the analysis and plot generating methods.
from functools import partial
data = etho.load_ethoscope(meta, reference_hour = 9.0, FUN = partial(etho.sleep_annotation, time_window_length = 60, min_time_immobile = 300))
# time_window_length is the amount of time (in seconds) each row represents. The ethoscope can record multiple times per second, so you can go as low as 10 seconds
# The default for time_window_length is 10 seconds
# min_time_immobile is your sleep criterion: 300 seconds (5 minutes) is the general rule for fly sleep, see Hendricks et al., 2000
Ethoscopy has 2 general functions that can be called whilst loading:
- max_velocity_detector: aggregates variables per the given time window, finding their means. sleep_annotation uses this function before finding sleep bouts, so use it when you don't need sleep bouts (see the sketch below).
- sleep_annotation: aggregates per time window and generates a new boolean sleep column, as defined by the time immobile argument.
sleep_annotation - Function reference
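As a minimal sketch of the first option, assuming max_velocity_detector accepts the same time_window_length argument as sleep_annotation (which the motion_detector method shown later in this manual suggests):
from functools import partial
# aggregate into 60 second windows without annotating sleep bouts
data = etho.load_ethoscope(meta, reference_hour = 9.0, FUN = partial(etho.max_velocity_detector, time_window_length = 60))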
Ethoscopy also has 2 functions for use with the AGO or mAGO ethoscope modules (odour delivery and mechanical stimulation); a loading sketch follows the references below:
- stimulus_response: finds the interaction times and then searches a given window post interaction for movement.
- stimulus_prior: a modification of stimulus_response; the function finds all interaction times and their responses whilst retaining all the variable information in a given preceding time window, i.e. if the window is 300, the 300 seconds of data prior to the stimulus are retained.
stimulus_response - Function reference
stimulus_prior - Function reference
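A minimal sketch of loading mAGO interaction data with stimulus_response as the analysing function, relying on its default response window rather than setting one explicitly:
# interaction/response data for mAGO experiments; keep a normally loaded dataset
# alongside it if you later want to plot responses over time (see the mAGO section)
puff_data = etho.load_ethoscope(meta, reference_hour = 9.0, FUN = etho.stimulus_response)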
Saving the data
Loading the ethoscope data each time can be a long process depending on the length of the experiment and the number of machines. It's recommended to save the loaded/modified DataFrame as a pickle .pkl file; see the pandas documentation for more information about pickle saves. The saved behavpy object can then be loaded instantly at the start of a new session!
# Save any behavpy or pandas object with the method below
import pandas as pd
df.to_pickle('path/behavpy.pkl') # replace string with your file location/path
# Load the saved pickle file like this. It will retain all the metadata information
df = pd.read_pickle('path/behavpy.pkl')
Behavpy
About
Behavpy is the default object in ethoscopy: a way of storing your metadata and data in a single structure, with added methods to help you manipulate and analyse your data.
Metadata is crucial for proper statistical analysis of the experimental data. In the context of the ethoscope, the data is a long time series of recorded variables, such as position and velocity, for each individual. It is easier and neater to always keep the data and metadata together. As such, the behavpy class is a child class of the pandas DataFrame, the widely used data table package in Python, enhanced by the addition of the metadata as a class attribute. Both the metadata and data must be linked by a common unique identifier for each individual (the 'id' column), which is added automatically if you've loaded the data with ethoscopy (if not, see the bottom of this page for the requirements for creating a behavpy object from alternative data sources).
The behavpy class has a variety of methods designed to augment, filter, and inspect your data, which we will go into later. However, if you're not already familiar with pandas, take some time to look through their guide to get an understanding of its many uses.
Initialising behavpy
To create a behavpy object you need matching metadata and data. To match, they both need an id column with linked/matching ids. Don't worry: if you downloaded/formatted the data using the built-in ethoscopy functions shown previously, they'll already be there.
import ethoscopy as etho
# You can load your data using the previously mentioned functions
meta = etho.link_meta_index(meta_loc, remote, local)
data = etho.load_ethoscope(meta, reference_hour = 9.0, FUN = partial(etho.sleep_annotation, time_window_length = 60, min_time_immobile = 300))
# Or you can load them from a saved object of your choice into a pandas df,
# mine is a pickle file
meta = pd.read_pickle('users/folder/experiment1_meta.pkl')
data = pd.read_pickle('users/folder/experiment1_data.pkl')
# to initialise a behavpy object, just call the class with the data and meta as arguments. Have check = True to ensure the ids match between metadata and data.
# As of version 1.3.0 you can choose the colour palette for the plotly plots - see https://plotly.com/python/discrete-color for the choices
# The default is 'Safe' (the one used before), but this example we'll use 'Vivid'
df = etho.behavpy(data = data, meta = meta, colour = 'Vivid', check = True)
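The attached metadata can then be inspected at any time through the .meta attribute (used again later when initialising the periodogram class):
# the data and its metadata now travel together in one object
print(df.meta.head()) # the metadata table, one row per specimen
print(df.head()) # the time series data itself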
Using behavpy with non-ethoscope data
The behavpy class is made to work with the ethoscope system, utilising the data structure it records to create the analytics pipeline you'll see next. However, you can still use it on non-ethoscope data if you follow the same structure.
Data sources:
You will need the metadata file as discussed previously; however, you will need to manually create a column called id that contains a unique id per specimen in the experiment.
Additionally, you will need a data source where each row is a log of one time point per specimen. Each row must have the following:
- an id column that references back to the unique id in the metadata
- a t column, the time (in seconds) the row is describing, i.e. 0, 60, 120, 180, 240
- a moving column, a boolean column (True/False) stating whether the specimen is moving
The above columns are necessary for all the methods to work, but feel free to add other columns with extra information per timestamp. Both these data sources must be converted to pandas DataFrames, which can then be used to create a behavpy object as shown above and sketched below.
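Below is a minimal sketch of this, with hypothetical specimen ids, times and treatment values:
import pandas as pd
import ethoscopy as etho

# metadata: one row per specimen, a unique id plus any experimental variables
meta = pd.DataFrame({
    'id': ['specimen_01', 'specimen_02'],
    'treatment': ['control', 'drug'],
})
# data: one row per time point per specimen, with id, t (seconds) and moving (bool)
data = pd.DataFrame({
    'id': ['specimen_01'] * 3 + ['specimen_02'] * 3,
    't': [0, 60, 120, 0, 60, 120],
    'moving': [True, False, False, True, True, False],
})
df = etho.behavpy(data = data, meta = meta, check = True)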
Basic methods
Behavpy has lots of built-in methods to manipulate your data. The next few sections will walk you through the basic methods you'll use before analysis.
Filtering by the metadata
One of the core methods of behavpy. This method creates a new behavpy object that only contains specimens whose metadata matches your inputted list. Use this to separate out your data by experimental conditions for further analysis.
# filter your dataset by variables in the metadata with .xmv()
# the first argument is the column in the metadata
# the second can be the variables in a list or as subsequent arguments
df = df.xmv('species', ['D.vir', 'D.ere', 'D.wil', 'D.sec', 'D.yak', 'D.sims'])
# or
df = df.xmv('species', 'D.vir', 'D.ere', 'D.wil', 'D.sec', 'D.yak', 'D.sims')
# the new data frame will only contain data from specimens with the selected variables
Removing by the metadata
The inverse of .xmv(). Remove from both the data and metadata any experimental groups you don't want. This method can also be called on individual specimens by specifying 'id' as the column and their unique identifier as the value.
# remove specimens from your dataset by the metadata with .remove()
# remove acts like the opposite of .xmv()
df = df.remove('species', ['D.vir', 'D.ere', 'D.wil', 'D.sec', 'D.yak', 'D.sims'])
# or
df = df.remove('species', 'D.vir', 'D.ere', 'D.wil', 'D.sec', 'D.yak', 'D.sims')
# both .xmv() and .remove() can filter/remove by the unique id if the first argument = 'id'
df = df.remove('id', '2019-08-02_14-21-23_021d6b|01')
Filtering by time
Often you will want to remove the very start of the experiment from the analyses, when the data isn't as clean because the animals are habituating to their new environment. Or perhaps you'll want to look at just the baseline data before something occurs. Use .t_filter() to filter the dataset between two time points.
# filter your dataset by time with .t_filter()
# the arguments take time in hours
# the data is assumed to be represented in seconds
df = df.t_filter(start_time = 24, end_time = 48)
# Note: the default column for time is 't', to change use the parameter t_column
Concatenate
Concatenate allows you to join two or more behavpy data tables together, joining both the data and metadata of each table. The two tables do not need to have identical columns: where there's a mismatch the column values will be replaced with NaNs.
# An example of concatenate using .xmv() to create separate data tables
df1 = df.xmv('species', 'D.vir')
df2 = df.xmv('species', 'D.sec')
df3 = df.xmv('species', 'D.ere')
# a behavpy wrapper that extends the pandas function to also concat the metadata
new_df = df1.concat(df2)
# .concat() can process multiple data frames
new_df = df1.concat(df2, df3)
Analyse a single column
Sometimes you want to get summary statistics of a single column per specimen. This is where you can use .analyse_column(). The method will take all the values in your chosen column per specimen and apply a summary statistic. You can choose from a basic selection, e.g. mean, median, sum, but you can also use your own function if you wish (the function must work on array data and return a single output); see the sketch after the example output below.
# Pivot the data frame by 'id' to find summary statistics of a selected columns
# Example summary statistics: 'mean', 'max', 'sum', 'median'...
pivot_df = df.analyse_column('interactions', 'sum')
output:
interactions_sum
id
2019-08-02_14-21-23_021d6b|01 0
2019-08-02_14-21-23_021d6b|02 43
2019-08-02_14-21-23_021d6b|03 24
2019-08-02_14-21-23_021d6b|04 15
2020-08-07_12-23-10_172d50|18 45
2020-08-07_12-23-10_172d50|19 32
2020-08-07_12-23-10_172d50|20 43
# the output column will be a string combination of the column and summary statistic
# each row is a single specimen
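A hedged sketch of passing your own summary function, assuming .analyse_column() accepts a callable in place of the named statistic (as the description above suggests):
import numpy as np
# hypothetical custom statistic: the 90th percentile of the column per specimen
def percentile_90(values):
    return np.nanpercentile(values, 90)
pivot_df = df.analyse_column('interactions', percentile_90)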
Re-join
Sometimes you will create an output from the pivot table, or just have a column you want to add to the metadata for use with other methods. The column to be added must be a pandas series of matching length to the metadata and with the same specimen ids.
# you can add these pivoted data frames or any data frames with one row per specimen to the metadata with .rejoin()
# the joining dataframe must have an index 'id' column that matches the metadata
df = df.rejoin(pivot_df)
Binning time
Sometimes you'll want to aggregate over a larger time window to ensure you have consistent readings per time point. For example, the ethoscope can record several readings per second, but sometimes tracking of a fly will be lost for a short time. Binning the time into 60 second windows smooths over these gaps.
However, this is only applied to one variable, so it's only useful for specific analyses. If you want this applied across all variables, remember to set it as your time window length in the loading functions.
# Sort the data into bins of time with a single column to summarise the bin
# bin time into groups of 60 seconds with 'moving' the aggregated column of choice
# default aggregating function is the mean
bin_df = df.bin_time('moving', 60)
output:
t_bin moving_mean
id
2019-08-02_14-21-23_021d6b|01 86400 0.75
2019-08-02_14-21-23_021d6b|01 86460 0.5
2019-08-02_14-21-23_021d6b|01 86520 0.0
2019-08-02_14-21-23_021d6b|01 86580 0.0
2019-08-02_14-21-23_021d6b|01 86640 0.0
... ... ...
2020-08-07_12-23-10_172d50|19 431760 1.0
2020-08-07_12-23-10_172d50|19 431820 0.75
2020-08-07_12-23-10_172d50|19 431880 0.5
2020-08-07_12-23-10_172d50|19 431940 0.25
2020-08-07_12-23-10_172d50|20 215760 1.0
# the column containing the time and the aggregating function can be changed
bin_df = df.bin_time('moving', 60, t_column = 'time', function = 'max')
output:
time_bin moving_max
id
2019-08-02_14-21-23_021d6b|01 86400 1.0
2019-08-02_14-21-23_021d6b|01 86460 1.0
2019-08-02_14-21-23_021d6b|01 86520 0.0
2019-08-02_14-21-23_021d6b|01 86580 0.0
2019-08-02_14-21-23_021d6b|01 86640 0.0
... ... ...
2020-08-07_12-23-10_172d50|19 431760 1.0
2020-08-07_12-23-10_172d50|19 431820 1.0
2020-08-07_12-23-10_172d50|19 431880 1.0
2020-08-07_12-23-10_172d50|19 431940 1.0
2020-08-07_12-23-10_172d50|20 215760 1.0
Wrap time
The time in the ethoscope data is measured in seconds, but these numbers can get very large and don't look great when plotting data or showing others. Use this method to change the time column values to be relative to a given time period; the default is the normal day (24 hours), which will change the time to be in hours from the reference hour or experiment start.
# Change the time column to be a decimal of a given time period, e.g. 24 hours
# wrap can be performed inplace and will not return a new behavpy
df.wrap_time(24, inplace = True)
# however if you want to create a new dataframe leave inplace False
new_df = df.wrap_time(24)
Remove specimens with low data points
Sometimes you'll run an experiment and have a few specimens that were tracked poorly or just have fewer data points than the rest. This can really affect some analyses, so it's best to remove them.
Specify the minimum number of data points you want per specimen; any specimen with fewer will be removed from the metadata and data. Remember that the minimum points per day will change with the frequency of your measurements.
# removes specimens from both the metadata and data when they have fewer data points than the user specified amount
# 1440 is 86400 / 60. So the amount of data points needed for 1 whole day if the data points are measured every minute
new_df = df.curate(points = 1440)
Remove specimens that aren't sleep deprived enough
In the Gilestro lab we'll sleep deprive flies to test their response. Sometimes the method won't work and you'll have a few flies mixed in that have slept normally. Call this method to remove all flies that have been asleep for more than a certain fraction of a given time period. The method can return two different outputs depending on the argument for the remove parameter: if it's a number between 0 and 1, any specimen asleep for more than that fraction of the time will be removed; if left as False, a pandas data frame is returned with the sleep fraction per specimen.
# Here we are removing specimens that have slept for more than 20% of the time between the period of 24 and 48 hours.
dfn = df.remove_sleep_deprived(start_time = 24, end_time = 48, remove = 0.2, sleep_column = 'asleep', t_column = 't')
# Here we will return the sleep fraction per specimen
df_sleep_fraction = df.remove_sleep_deprived(start_time = 24, end_time = 48, sleep_column = 'asleep', t_column = 't')
Interpolate missing results
Sometimes you'll have missing data points, which is not usually too big a problem. However, some analyses require regularly sampled data. Use the .interpolate() method to set a recording frequency and interpolate any missing points from the surrounding data. Interpolate is a wrapper for the scipy interpolate function.
# Set the variable you want to interpolate and the sampling frequency you'd like (step_size)
# step_size is given in seconds. Below interpolates the data every 5 minutes from the min time to the max time
new_df = df.interpolate(variable = 'x', step_size = 300)
Baseline
Not all experiments are run at the same time, and you'll often have differing numbers of days before an interaction (such as sleep deprivation) occurs. To align all the data so the interaction day is the same, include in your metadata .csv file a column called baseline. Within this column, write the number of additional days that need to be added to align with the longest set of baseline experiments.
# add additional time to specimens' time column to make specific interaction times line up when the baseline time is not consistent
# the metadata must contain a baseline column with an integer of 0 or greater
df = df.baseline(column = 'baseline')
# perform the operation inplace with the inplace parameter
Add day numbers and phase
Add new columns to the data: one with the day number of the experiment and one called phase that states whether it's light or dark, given your reference hour and a normal circadian cycle (12:12). However, if you're working with different circadian hours you can specify the time it turns dark.
# Add a column with a number indicating which day of the experiment the row occurred on
# Also add a column with the phase of day (light, dark) to the data
# This method is performed in place and won't return anything.
# However you can make it return a new dataframe with the inplace = False
df.add_day_phase(t_column = 't') # default parameter for t_column is 't'
# if you're running circadian experiments you can change the length of the days the experiment is running
# as well as the time the lights turn off, see below.
# Here the experiments had days of 30 hours long, with the lights turning off at ZT 15 hours.
# Also we changed inplace to False to return a modified behavpy, rather than modify it in place.
df = df.add_day_phase(day_length = 30, lights_off = 15, inplace = False)
Estimate Feeding
If you're using the ethoscope you can approximate the amount of time spent feeding by labelling micro-movements near the end of the tube containing food as feeding times. This method relies on your data having a micro column, which is generated if you load the data with the motion_detector or sleep_annotation loading function.
This method will return a new behavpy object that has an additional column called 'feeding' with a boolean label (True/False). The subsequent new column can then be plotted as is shown on the next page.
# You need to declare if the food is positioned on the outside or inside so the method knows which end to look at
new_df = df.feeding(food_position = 'outside', dist_from_food = 0.05, micro_mov = 'micro', x_position = 'x') # micro_mov and x_position are the column names and defaults
# The default for distance from food is 0.05, which is a rough estimate. Try looking at the spread of the x position to get a better idea of what the number should be for your data
Automatically remove dead animals
Sometimes a specimen dies or its tracking is lost. This method will remove all data for a specimen after it has stopped moving for a considerable length of time.
# a method to remove specimens that haven't moved for a certain amount of time
# only data past the point deemed dead will be removed per specimen
new_df = df.curate_dead_animals()
# Below are the default values and the variable names the function uses to define dead animals:
# time_window = 24: the window in which death is defined, set to 24 hours or 1 day
# prop_immobile = 0.01: the proportion of immobility that counts as "dead" during the time window
# resolution = 24: how much the scanning windows overlap, expressed as a factor
Find lengths of bouts of sleep
# break down a specimens sleep into bout duration and type
bout_df = df.sleep_bout_analysis()
output:
duration asleep t
id
2019-08-02_14-21-23_021d6b|01 60.0 True 86400.0
2019-08-02_14-21-23_021d6b|01 900.0 False 86460.0
... ... ... ...
2020-08-07_12-23-10_172d50|05 240.0 True 430980.0
2020-08-07_12-23-10_172d50|05 120.0 False 431760.0
2020-08-07_12-23-10_172d50|05 60.0 True 431880.0
# have the data returned in a format ready to be made into a histogram
hist_df = df.sleep_bout_analysis(as_hist = True, max_bins = 30, bin_size = 1, time_immobile = 5, asleep = True)
output:
bins count prob
id
2019-08-02_14-21-23_021d6b|01 60 0 0.000000
2019-08-02_14-21-23_021d6b|01 120 179 0.400447
2019-08-02_14-21-23_021d6b|01 180 92 0.205817
... ... ... ...
2020-08-07_12-23-10_172d50|05 1620 1 0.002427
2020-08-07_12-23-10_172d50|05 1680 0 0.000000
2020-08-07_12-23-10_172d50|05 1740 0 0.000000
# max_bins is the largest bout you want to include
# bin_size is the length of each bin in minutes, i.e. 5 groups bouts into 5 minute bins
# time_immobile is the time in minutes sleep was defined as previously. This removes any bouts smaller than this, which would have been produced in error
# if asleep is True (the default) the returned data frame contains asleep bouts; change to False for awake bouts
Plotting a histogram of sleep_bout_analysis
# You can take the output from above and create your own histograms, or you can use this handy method to plot a histogram with error bars from across your specimens
# Like all functions you can facet by your metadata
# Here we'll compare two of the species and group the bouts into periods of 5 minutes, with up to 12 of them (1 hour)
# See the next page for more information about plots
fig = df.plot_sleep_bouts(
sleep_column = 'asleep',
facet_col = 'species',
facet_arg = ['D.vir', 'D.ere'],
facet_labels = ['D.virilis', 'D.erecta'],
bin_size = 5,
max_bins = 12
)
fig.show()
Find bouts of sleep
# If you haven't already analysed the dataset to find periods of sleep,
# but you do have a column containing the movement as True/False.
# Call this method to find contiguous bouts of sleep according to a minimum length
new_df = df.sleep_contiguous(time_window_length = 60, min_time_immobile = 300)
Sleep download functions as methods
# some of the loading functions mentioned previously can be called as methods if the data wasn't previously analysed
# don't call this method if your data was already analysed!
# If it's already analysed it will be missing columns needed for this method
new_df = df.motion_detector(time_window_length = 60)
Inspecting your data
Once the behavpy object is created, the print function will just show your data structure. If you want to see your data and the metadata at once, use the built-in method .display()
# first load your data and create a behavpy instance of it
df.display()
You can also get quick summary statistics of your dataset with .summary()
df.summary()
# an example output of df.summary()
output:
behavpy table with:
individuals 675
metavariable 9
variables 13
measurements 3370075
# add the argument detailed = True to get information per fly
df.summary(detailed = True)
output:
data_points time_range
id
2019-08-02_14-21-23_021d6b|01 5756 86400 -> 431940
2019-08-02_14-21-23_021d6b|02 5481 86400 -> 431940
Be careful with the pandas method .groupby(), as this will return a pandas object and not a behavpy object. Most other common pandas operations will return a behavpy object.
Visualising your data
Whilst summary statistics are good for a basic overview, visualising the variable of interest over time is usually a lot more informative.
Heatmaps
The first port of call when looking at time series data is to create a heatmap to see if there are any obvious irregularities in your experiments.
# To create a heatmap all you need to write is one line of code!
# All plot methods will return the figure; the usual etiquette is to save it to a variable called fig
fig = df.heatmap('moving') # enter whichever numerical variable you want plotted as a string
# Then all you need to do is the below to generate the figure
fig.show()
Plots over time
For an aggregate view of your variable of interest over time, use the .plot_overtime() method to visualise the mean variable over your given time frame or split it into sub groups using the information in your metadata.
# If wrapped is True, each specimen's data will be folded onto one 24 hour day before being aggregated as a whole. If you want to view each day separately, keep wrapped False
# To achieve the smooth plot a moving average is applied; we found averaging over 30 minutes gave the best results
# So if you have your data in rows of 10 seconds you would want the avg_window to be 180 (the default)
# Here the data is in rows of 60 seconds, so we only need 30
fig = df.plot_overtime(
variable = 'moving',
wrapped = True,
avg_window = 30
)
fig.show()
# the plots will show the mean with 95% confidence intervals in a lighter colour around the mean
# You can separate out your plots by your specimen labels in the metadata. Specify which column you want from the metadata with facet_col and then specify which groups you want with facet_arg
# What you enter for facet_arg must be in a list and be exactly what is in that column in the metadata
# Don't like the label names in the column? Rename the graphing labels with the facet_labels parameter. This can only be done if you have a same-length list as facet_arg, and make sure they are in the same order
fig = df.plot_overtime(
variable = 'moving',
facet_col = 'species',
facet_arg = ['D.vir', 'D.ere', 'D.wil', 'D.sec', 'D.yak'],
facet_labels = ['D.virilis', 'D.erecta', 'D.willistoni', 'D.sechellia', 'D.yakuba']
)
fig.show()
# if you're doing circadian experiments you can specify when night begins with the parameter circadian_night to change the phase bars at the bottom. E.g. circadian_night = 18 for lights off at ZT 18.
Quantifying sleep
The plots above look nice, but we often want to quantify the differences to prove what our eyes are telling us. The method .plot_quantify() plots the mean and 95% CI of the specimens to give a strong visual indication of group differences.
# Quantify parameters are near identical to .plot_overtime()
# For all plotting methods in behavpy we can add grid lines to better view the data, and you can also add a title; see below for how
# All quantifying plots will return two objects: the first is the plotly figure as normal and the second a pandas dataframe with
# the calculated averages per specimen for you to perform statistics on. You can do this via the common statistical package scipy, or
# we recommend a newer package, DABEST, which produces non p-value related statistics and visualisation too
# We'll be looking to add DABEST into our ecosystem soon too!
fig, stats_df = df.plot_quantify(
variable = 'moving',
facet_col = 'species',
facet_arg = ['D.vir', 'D.ere', 'D.wil', 'D.sec', 'D.yak'],
title = 'This is a plot for quantifying',
grids = True
)
fig.show()
# Tip! You can give a facet_col argument and nothing for facet_arg and the method will automatically plot all groups within that column. Careful though, as the order/colour might not be preserved between similar but different plots
Performing Statistical tests
If you want to perform standard statistical tests then it's best to use scipy, a very common Python package for statistics with lots of online tutorials on how and when to use it. We'll run through a demonstration below, using the stats_df table returned by the plot above.
Given we don't know the distribution type of the data and that each group is independent we'll use a non-parametric group test, the Kruskal-Wallis H-test (the non-parametric version of a one-way ANOVA).
# First we need to import scipy stats
from scipy import stats
# Next we need the data for each specimen in a numpy array
# Here we iterate through the columns and send each column as a numpy array into a list
stat_list = []
for col in stats_df.columns.tolist():
stat_list.append(stats_df[col].to_numpy())
# Alternatively you can set each one as its own variable
dv = stats_df['D.vir'].to_numpy()
de = stats_df['D.ere'].to_numpy()
# ... etc
# Now we call the kruskal function; remember to always set nan_policy to 'omit', otherwise the output will be NaN
# If using the first method (the * unpacks the list)
stats.kruskal(*stat_list, nan_policy = 'omit')
# If using the second
stats.kruskal(dv, de, ..., nan_policy = 'omit')
## output: KruskalResult(statistic=134.17956605297556, pvalue=4.970034343600611e-28)
# The p-value is below 0.05, so we can say that not all the groups have the same distribution
# Now we want to do some post hoc testing to find inter group significance
# For that we need another package scikit_posthocs
import scikit_posthocs as sp
p_values = sp.posthoc_dunn(stat_list, p_adjust = 'holm')
print(p_values)
## output: 1 2 3 4 5
1 1.000000e+00 2.467299e-01 0.000311 1.096731e-11 1.280937e-16
2 2.467299e-01 1.000000e+00 0.183924 1.779848e-05 7.958752e-08
3 3.106113e-04 1.839239e-01 1.000000 3.079278e-03 3.884385e-05
4 1.096731e-11 1.779848e-05 0.003079 1.000000e+00 4.063266e-01
5 1.280937e-16 7.958752e-08 0.000039 4.063266e-01 1.000000e+00
print(p_values > 0.05)
## output: 1 2 3 4 5
1 True True False False False
2 True True True False False
3 False True True False False
4 False False False True True
5 False False False True True
Quantify day and night
Often you'll want to compare a variable between the day and night, particularly total sleep. This variation of .plot_quantify() will plot the mean and 95% CI of a variable for a user defined day and night.
# Quantify parameters are near identical to .plot_quantify() with the addition of day_length and lights_off, which take (in hours) how long the day is for the specimen (default 24) and at what point within that day the lights turn off (night, default 12)
fig, stats = df.plot_day_night(
variable = 'asleep',
facet_col = 'species',
facet_arg = ['D.vir', 'D.ere', 'D.wil', 'D.sec', 'D.yak'],
facet_labels = ['D.virilis', 'D.erecta', 'D.willistoni', 'D.sechellia', 'D.yakuba'],
day_length = 24,
lights_off = 12
)
fig.show()
Compare variables
You may sometimes want to compare different but similar variables from your experiment, often across different experimental groups. Use .plot_compare_variables() to compare up to 24 different variables (we run out of colours after that...).
# Like the others it takes the regular arguments. However, rather than the variable parameter it's variables, which takes a list of strings of the different columns you want to compare. The final variable in the list will have its own y-axis on the right-hand side, so save this one for different scales.
# The most common use of this with ethoscope data is to compare micro movements to walking movements to better understand the locomotion of the specimen, so let's look at that.
fig, stats = df.plot_compare_variables(
variables = ['micro', 'walk'],
facet_col = 'species',
facet_arg = ['D.vir', 'D.ere', 'D.wil', 'D.sec', 'D.yak'],
facet_labels = ['D.virilis', 'D.erecta', 'D.willistoni', 'D.sechellia', 'D.yakuba']
)
fig.show()
# Lets add the velocity of the specimen to see how it looks with different scales.
fig, stats = df.plot_compare_variables(
variables = ['micro', 'walk', 'mean_velocity'],
facet_col = 'species',
facet_arg = ['D.vir', 'D.ere', 'D.wil', 'D.sec', 'D.yak', 'D.sims'],
facet_labels = ['D.virilis', 'D.erecta', 'D.willistoni', 'D.sechellia', 'D.yakuba', 'D.Simulans'],
# Add grids to the plot so we can better see how the plots align
grids = True
)
fig.show()
Head to the Overview Tutorial for interactive examples of some of these plots. It also shows you how to edit the fig object after it's been generated if you wish to change things like axis titles and axis range.
Saving plots
As seen above, all plot methods produce a figure object on which you call .show() to generate the plot. To save these plots, all you need to do is pass the fig object to the ethoscopy function etho.save_figure()
You can save your plots as PDFs, SVGs, JPEGs or PNGs. It's recommended you save as either a pdf or svg for the best quality and the ability to alter them later.
You can also save the plot as an html file. Html files retain the interactive nature of the plotly plots. This comes in most useful when using jupyter notebooks and you want to see your plot full screen, rather than just plotted under the cell.
# simply pass your returned fig as the argument alongside its save path and name
# remember to change the file type at the end of path as you want it
# you can set the width and height of the saved image with the parameters width and height (this has no effect for .html files)
etho.save_figure(fig, './tutorial_fig.pdf', width = 1500, height = 1000)
# or to get a better view of it and keep the interactability save it as a html file
etho.save_figure(fig, './tutorial_fig.html') # you can't change the height and width when saving as an html
Visualising mAGO data
Within the Gilestro lab we have special adaptations to the ethoscope, including the mAGO, a module that can sleep deprive flies mechanically and also deliver a puff of an odour of choice after periods of rest. See the documentation here: ethoscope_mAGO.
If you've performed a mAGO experiment then the data needs to be loaded in with the function puff_mago(), which decides that a fly has responded if it moved within the time limit post puff (default is 10 seconds).
Quantify response
The method .plot_response_quantify() will produce a plot of the mean response rate per group. Within the software for delivering a puff, it can be set to have a random chance of occurring, i.e. if it's set to 50% and the immobility criterion is met, the chance a puff will be delivered is 50/50. This gives us two sets of data: the true response rate and the underlying likelihood that a fly will just randomly move, which is called here Spontaneous Movement (Spon. Mov).
# The parameters other than response_col (which is the column of the response per puff) are the same as other quantify methods
fig, stats = df.plot_response_quantify(
response_col = 'has_responded',
facet_col = 'species',
)
fig.show()
Quantify response overtime
You can also view how the response rate changes over time with .plot_response_overtime(). For this method you'll need to load in the normal dataset too, i.e. with motion_detector or sleep_annotation, as the method needs to know at what point the puff occurred. The plot is per minute, so it's best to load the data with a time window of 60; if you have it higher the method won't work.
# Call the method on the normal behavpy table with the first argument the puff behavpy table
# The second argument decides if you're looking at runs of inactivity or activity,
# if you set the puff chance low enough you can probe activity too
fig, stats = df.plot_response_overtime(
response_df = puff_df,
activity = 'inactive',
facet_col = 'species'
)
fig.show()
Behavpy - Circadian analysis
At the Gilestro lab we work on sleep and often need to do a lot of circadian analysis to complement our sleep analysis. So we've built a dedicated suite of methods for circadian analysis and plotting. For the best run-through of this, please use the jupyter notebook tutorial that can be found at the end of this manual.
Circadian methods and plots
The methods and plots below should give a good insight into your specimens' circadian rhythms. If you think another method should be added, please don't hesitate to contact us and we'll see what we can do.
Head to the circadian notebook for an interactive run through of everything below.
Actograms
Actograms are one of the best initial methods to quickly visualise changes in the rhythm of a specimen over several days. An actogram double-plots each day's variable (usually movement) so you can compare each day with the previous and following day.
# .plot_actogram() can be used to plot individual specimens' actograms or to plot the average per group
# below we'll demonstrate a grouped example
fig = df.plot_actogram(
mov_variable = 'moving',
bin_window = 5, # the default is 5, but feel free to change it to smooth out the plot or vice versa
facet_col = 'period_group',
facet_arg = ['short', 'wt', 'long'],
title = 'Grouped Actogram')
fig.show()
# plot_actogram_tile will plot every specimen in your behavpy dataframe
# Be careful if your dataframe has lots of flies as you'll get a very crowded plot!
fig = df.plot_actogram_tile(
mov_variable = 'moving',
labels = None, # By default labels is None and the id column is used for the labels.
# However, if you have individual labels in the metadata, add that column name here. See the tutorial for an example
title = 'Tiled Actogram')
fig.show()
Anticipation Score
Many animals, including Drosophila, have peaks of activity in the morning and evening as the lights turn on and off respectively. Given this daily pattern, the anticipation score looks to quantify how much the specimen's internal clock anticipates these moments. The score is calculated as the ratio of activity in the final 3 hours prior to lights on or off compared to the whole 6 hours prior.
# Simply call the plot_anticipation_score() method to get a box plot of your results
# the day length and lights off time default to 24 and 12 respectively, but these can be changed if you're augmenting the environment
fig, stats = df.plot_anticipation_score(
mov_variable = 'moving',
facet_col = 'period_group',
facet_arg = ['short', 'wt', 'long'],
day_length = 24,
lights_off = 12,
title = 'Anticipation Score')
fig.show()
Periodograms
Initialising the Periodogram class
To access the methods that calculate and plot periodograms you need to initialise your data frame not as the base behavpy but as the periodogram behavpy class: etho.behavpy_periodogram()
per_df = etho.behavpy_periodogram(data, metadata, check = True)
# If you've already initialised it as a behavpy object
per_df = etho.behavpy_periodogram(df, df.meta, check = True)
Calculating Periodograms
# The below method calculates the output from a periodogram function, returning a new dataframe with information about each period in your given range
# and the power for each period
# the method currently supports 4 types of periodogram; choose from ['chi_squared', 'lomb_scargle', 'fourier', 'welch']
per_chi = per_df.periodogram(
mov_variable = 'moving',
periodogram = 'chi_squared', # change the argument here to the other periodogram types
period_range = [10,32], # the range in time (hours) we think the circadian frequency lies, [10,32] is the default
sampling_rate = 15, # the time in minutes you wish to smooth the data to. This method will interpolate missing results
alpha = 0.01 # the cutoff value for significance, i.e. 1% confidence
)
# Hint! Sometimes increasing the sampling_rate can improve your periodograms. On average we found 15 to work best with our data, but it can be different for everyone
Plotting Periodograms
Tile Plots
To get a good understanding of each individual specimen you can use the tiled periodogram plot
# Like the actogram_tile method we can rely on the ids of each specimen or we can use labels we created
fig = per_chi.plot_periodogram_tile(
labels = 'tile_labels',
find_peaks = True, # make find_peaks True to add a marker over significant peaks
title = 'Tiled Chi Squared Periodogram'
)
fig.show()
Grouped plots
Plot the average of each specimen's periodogram to get a better view of the differences between experimental groups.
fig = per_chi.plot_periodogram(
facet_col = 'period_group',
facet_arg = ['short', 'wt', 'long'],
title = 'Grouped Chi Squared Periodogram'
)
fig.show()
Quantifying plots
Like with the other plots, we can quantify the above to see if what we see by eye is true statistically.
# Quantify the periodograms by finding the highest power peak and comparing it across specimens in a group
# call .quantify_periodogram() to get the results
fig, stats = per_chi.quantify_periodogram(
facet_col = 'period_group',
facet_arg = ['short', 'wt', 'long'],
title = 'Quantification of Chi Squared Periodogram'
)
fig.show()
Wavelets
Wavelets are a special branch of periodograms: they are calculated in 3 dimensions rather than 2, finding the period and power not for the whole recording, but broken down from start to end. This gives you an understanding of how rhythmicity changes over the course of your experiment, such as when you shift from LD to DD.
Wavelet plots are quite visually intensive, therefore the wavelet method will only compute and plot the average of all the data present in the data frame. You will therefore need to do some filtering beforehand to look at individual groups or an individual specimen.
# This method is powered by the python package pywt. Head to https://pywavelets.readthedocs.io/en/latest/ for information about the package and the other wavelet types
# filter your df prior to calling the plot
wt_df = per_df.xmv('period_group', 'wt')
fig = wt_df.wavelet(
mov_variable = 'moving',
sampling_rate = 15, # increasing the sampling rate will increase the resolution of lower frequencies (16-30 hours), but lose resolution of higher frequencies (2-8 hours)
scale = 156, # the default is 156; increasing this number will increase the resolution of lower frequencies (and lower the high frequencies), but it will take longer to compute
wavelet_type = 'morl', # the default wavelet type and the most commonly used in circadian analysis. Head to the website above for the other wavelet types or call the method .wavelet_types() to see their names
title = 'Wavelet: WT'
)
fig.show()
Behavpy - Hidden Markov Model
Behavpy_HMM is an extension of behavpy that revolves around the use of Hidden Markov Models (HMMs) to segment and predict sleep and awake stages in Drosophila. This method is based on Wiggin et al., PNAS 2020.
Training an HMM
A good introduction to HMMs is available here: https://web.stanford.edu/~jurafsky/slp3/A.pdf
To provide a basic overview: HMMs are generative probabilistic models in which a sequence of observable variables is assumed to be generated by a sequence of internal hidden states. The transition probabilities between hidden states are assumed to follow the principles of a first-order Markov chain.
Within this tutorial we will apply this framework to the analysis of sleep in Drosophila, using the observable variable movement to predict the hidden states: deep sleep, light sleep, quiet awake, and active awake.
Hmmlearn
Behavpy_HMM utilises the brilliant Python package hmmlearn under the hood. Currently behavpy_HMM only uses the categorical version of the available models, so it can only be trained on and decode sequences that are categorical and in integer form. If you wish to use their other models (multinomial, gaussian and gaussian mixture models) or want to read more on their capabilities, head to their docs page for more information: https://hmmlearn.readthedocs.io/en/latest/index.html
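For context, here is a minimal standalone sketch of the kind of model behavpy_HMM wraps, using hmmlearn's CategoricalHMM on a toy observation sequence (all numbers purely illustrative):
import numpy as np
from hmmlearn.hmm import CategoricalHMM

# toy observations: 0 = not moving, 1 = moving, shaped (n_samples, 1)
obs = np.array([0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0]).reshape(-1, 1)
# a 4 state categorical HMM trained with at most 100 EM iterations
model = CategoricalHMM(n_components = 4, n_iter = 100, tol = 2, random_state = 0)
model.fit(obs)
# decode the most likely sequence of hidden states for the same observations
print(model.predict(obs))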
Initialising behavpy_HMM
The HMM variant of behavpy is initialised in the same manner as the behavpy class. In fact, it is just a sub-classed version of behavpy so all the previous methods are still applicable, alongside the usual Pandas capabilities.
import ethoscopy as etho
# You can load your data using the previously mentioned functions
meta = etho.link_meta_index(meta_loc, remote, local)
data = etho.load_ethoscope(meta, min_time = 24, max_time = 48, reference_hour = 9.0)
# Remember you can save the data and metadata as pickle files, see the downloading section for a reminder about it
# to initialise a behavpy object, just call the class with the data and meta as arguments. Have check = True to ensure the ids match.
df = etho.behavpy_HMM(data, meta, check = True)
Data curation
The data passed to the model needs to be good quality, with long sequences containing minimal gaps and all of a decent total length.
Be careful when loading in your data. If you use the sleep_annotation() loading function then it will automatically interpolate missing data as asleep, giving you NaN (filler) values throughout your dataset. It's preferable to use the max_velocity_detector() function if you plan to use the HMM.
Use the .curate() and .bin_time() methods mentioned previously to curate the data. It is recommended to bin the data points into 60 second bins when analysing sleep and movement, but feel free to experiment with what works for your data.
import ethoscopy as etho
import numpy as np
# load your data and metadata in and initialise your behavpy_HMM object
df = etho.behavpy_HMM(data, meta, check = True)
# it's often a good idea to filter out the start and end of experiments
df = df.t_filter(start_time = 24, end_time = 96)
# remove specimens with less than a day's worth of data (if binned to 60 seconds)
df = df.curate(points = 1440)
# Any NaN values will throw an error when trying to train the HMM. It's good practice to check the data for any
# If below returns True you have NaN values in your data
print(np.any(np.isnan(df['YOUR COLUMN'].to_numpy())))
# If it is True then look at those values and replace or remove (the whole specimen) from the dataset
# Use below to view which rows have NaNs, see the pandas documentation for more information on how to replace NaN values
df[df['YOUR COLUMN'].isnull()]
# binning the variable can be done within the .hmm_train() method, see next!
Training a HMM
Behavpy_HMM provides a wrapper function for hmmlearn multinomial model. The method extends the hmmlearn model by automatically manipulating the data into a readable format for training. Additionally, the method provides the option to randomise the initialised transmission / observable parameters to train the model several times. Each iterative model is compared to the best previous performing model with the hopes of avoiding reach a local minima during hmmlearns own internal iterative process.
As mentioned previously the HMM used here is categorical, therefore only select data that is categorically discrete. For example here we will be using activity, with not moving being one state and moving the other. These states as data must be in integer form. So this example, not moving will = 0 and moving will = 1. Follow this example if using more states, so if you have three states use the integers 0, 1, 2.
In setting up the HMM you have to specify how many hidden states your system will have and which states can transition into one another. See below for an example.
Setting up
# the setup for .hmm_train()
# name your hidden states and observables. This is purely for visualising the output and setting the number of hidden states!
hidden_states = ['Deep sleep', 'Light sleep', 'Light awake', 'Active awake']
observable_variables = ['inactive', 'active']
# if you believe your hidden states can only transition into specific states, or there's a particular flow, you can set up the model to train with this in mind.
# Set the transitions you want to allow to 'rand' and those that can't happen to 0
t_prob = np.array([['rand', 'rand', 'rand', 0.00],
['rand', 'rand', 'rand', 0.00],
[0.00, 'rand', 'rand', 'rand'],
[0.00, 'rand', 'rand', 'rand']])
# Make sure your emission matrix aligns with your categories
em_prob = np.array([[1, 0],
[1, 0],
['rand', 'rand'],
['rand', 'rand']])
# the transition matrix should be len(hidden_states) x len(hidden_states) and the emission matrix len(hidden_states) x len(observable_variables), i.e. 4x4 and 4x2 respectively here
You can leave trans_probs and em_prob as None to have all of their values randomised before each iteration.
Training
# add the above variables to the hmm_train method
# iterations is the number of training loops, each starting with newly randomised parameters
# hmm_iterations is the number of loops within hmmlearn before it stops. This is superseded by tol, which is the minimum change in the log-likelihood score per iteration within hmmlearn
# save the best performing model as a .pkl file
df.hmm_train(
states = hidden_states,
observables = observable_variables,
var_column = 'moving',
trans_probs = t_prob,
emiss_probs = em_prob,
start_probs = None,
iterations = 10,
hmm_iterations = 100,
tol = 2,
t_column = 't',
bin_time = 60,
file_name = 'experiment_hmm.pkl', # replace with your own file name
verbose = False)
# If verbose is True the log-likelihood per hmm iteration is printed to the screen
Diving deeper! To find the best model the dataset is split into a test (10%) and a train (90%) set. Each successive model is scored on the test set against the best so far, and the best scoring model is the one saved to your file name.
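As an illustration of the idea only (not ethoscopy's internal code), this is roughly what such a split-and-score loop looks like in hmmlearn; the 90/10 specimen split, the random example data and all variable names here are assumptions made for the sketch.
import numpy as np
from hmmlearn.hmm import CategoricalHMM  # named MultinomialHMM in older hmmlearn releases
# one integer-coded observation sequence per specimen (random data purely for illustration)
rng = np.random.default_rng(42)
seqs = [rng.integers(0, 2, size = 1440) for _ in range(20)]
# roughly 90% of specimens for training, 10% held out for scoring
train, test = seqs[:18], seqs[18:]
X_train = np.concatenate(train).reshape(-1, 1)
X_test = np.concatenate(test).reshape(-1, 1)
len_train = [len(s) for s in train]
len_test = [len(s) for s in test]
best_score, best_model = -np.inf, None
for _ in range(10):  # comparable to the iterations argument
    model = CategoricalHMM(n_components = 4, n_iter = 100, tol = 2)
    model.fit(X_train, lengths = len_train)
    score = model.score(X_test, lengths = len_test)  # log-likelihood on held-out data
    if score > best_score:
        best_score, best_model = score, model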
Visualising with the HMM
The best way to get to grips with your newly trained HMM is to decode some data and have a look at it visually.
Single plots
# Like plot_overtime() this method will take a single variable and trained hmm, and plot them over time.
# If you're using behavpy_HMM to look at sleep with the 4 state structure you don't have to specify labels and colours, as they are pre-set. However, if you aren't, please specify them.
# use below to open your saved trained hmmlearn object
import pickle
with open('/Users/trained_hmms/4_states_hmm.pkl', 'rb') as file:
h = pickle.load(file)
fig = df.plot_hmm_overtime(
hmm = h,
variable = 'moving',
labels = ['Deep sleep', 'Light sleep', 'Light awake', 'Full awake'],
# colours = ['blue', 'green', 'yellow', 'red'], # example colours
wrapped = True,
bin = 60
)
fig.show()
# You cannot facet using this method, see the next plot for faceting
Faceted plots
To reduce clutter when faceting, the plot becomes a group of subplots, with each subplot showing the collective plot of the specimens for one hidden state.
# plot_hmm_split works just like above, however you need to specify the column to facet by. You can also specify facet_arg if you don't want all the groups from the facet column
fig = df.plot_hmm_split(
hmm = h,
variable = 'moving',
facet_col = 'species',
wrapped = True,
bin = 60
)
fig.show()
# You don't always want to decode every group with the same hmm. In fact with different species, mutants,
# subgroups you should be training a fresh hmm
# if you pass a list to the hmm parameter the same length as the
# facet_arg argument the method will apply the right hmm to each group
fig = df.plot_hmm_split(
hmm = [hv, he, hw, hs, hy],
variable = 'moving',
facet_labels = ['D.virilis', 'D.erecta', 'D.willistoni', 'D.sechellia', 'D.yakuba'],
facet_col = 'species',
facet_arg = ['D.vir', 'D.ere', 'D.wil', 'D.sec', 'D.yak'],
wrapped = True,
bin = [60, 60, 60, 60, 60]
)
fig.show()
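In the example above, hv, he, hw, hs and hy are each a separately trained model. A hypothetical sketch of assembling that list from saved pickle files (the file names below are placeholders):
import pickle
# load one trained model per species; replace the paths with your own saved files
hmms = []
for path in ['vir_hmm.pkl', 'ere_hmm.pkl', 'wil_hmm.pkl', 'sec_hmm.pkl', 'yak_hmm.pkl']:
    with open(path, 'rb') as f:
        hmms.append(pickle.load(f))
hv, he, hw, hs, hy = hmms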
Quantify time in each state
# Like plot_quantify() this method will quantify how much of the time each specimen is within each state.
fig, stats = df.plot_hmm_quantify(
hmm = [hv, he, hw, hs, hy],
variable = 'moving',
facet_labels = ['D.virilis', 'D.erecta', 'D.willistoni', 'D.sechellia', 'D.yakuba'],
facet_col = 'species',
facet_arg = ['D.vir', 'D.ere', 'D.wil', 'D.sec', 'D.yak'],
bin = [60, 60, 60, 60, 60]
)
fig.show()
Quantifying length of each state
It's a good idea to look at the length of each state to gain an understanding of how the model is separating your data. There are two length methods: 1) one that plots the mean length per state per specimen, and 2) one that shows the maximum and minimum state length per group.
# all hmm quantifying plots have the same parameters, so just change the method name and you're good to go
fig, stats = df.plot_hmm_quantify_length(
hmm = [hv, he, hw, hs, hy],
variable = 'moving',
facet_labels = ['D.virilis', 'D.erecta', 'D.willistoni', 'D.sechellia', 'D.yakuba'],
facet_col = 'species',
facet_arg = ['D.vir', 'D.ere', 'D.wil', 'D.sec', 'D.yak'],
bin = [60, 60, 60, 60, 60]
)
fig.show()
# The length min/max method only shows the min and maximum points as a box
# this plot doesn't return any stats, as it only shows the min/max and mean values
fig = df.plot_hmm_quantify_length_min_max(
hmm = [hv, he, hw, hs, hy],
variable = 'moving',
facet_labels = ['D.virilis', 'D.erecta', 'D.willistoni', 'D.sechellia', 'D.yakuba'],
facet_col = 'species',
facet_arg = ['D.vir', 'D.ere', 'D.wil', 'D.sec', 'D.yak'],
bin = [60, 60, 60, 60, 60]
)
fig.show()
# Below you can see where the model separates light sleep from deep sleep
Quantifying transitions
The time in each state and the average length are good overview statistics, but they can be misleading about how often a state occurs if that state is short. This next method quantifies the number of times a state is transitioned into, effectively counting the instances of the state regardless of their duration.
fig, stats = df.plot_hmm_quantify_transition(
hmm = [hv, he, hw, hs, hy],
variable = 'moving',
facet_labels = ['D.virilis', 'D.erecta', 'D.willistoni', 'D.sechellia', 'D.yakuba'],
facet_col = 'species',
facet_arg = ['D.vir', 'D.ere', 'D.wil', 'D.sec', 'D.yak'],
bin = [60, 60, 60, 60, 60]
)
fig.show()
Raw decoded plot
Sometimes you will want to view the raw decoded plots of individual specimens to get a sense of the daily ebb and flow. .plot_hmm_raw() will plot the decoded sequences of one or more individual specimens.
# The ID of each specimen will be printed to the screen in case you want to investigate one or some of them further
fig = df.plot_hmm_raw(
hmm = h,
variable = 'moving',
num_plots = 5, # the number of different specimens you want in the plot
bin = 60,
title = ''
)
fig.show()
# Additionally, if you have response data from mAGO experiments you can add that behavpy dataframe to
# highlight points of interaction and their response
# Purple = response, lime = no response
## df.plot_hmm_raw(mago_df = mdf, hmm = h, variable = 'moving', num_plots = 5, bin = 60, title = '', save = False)
Jupyter tutorials
Ethoscopy is best used in a Jupyter instance. We provide a pre-baked docker container that comes with a multi-user JupyterHub and the ethoscopy package: ethoscope-lab. Ethoscope-lab runs a Python 3.8 kernel and an R kernel, and also features an installation of rethomics.
One of the biggest advantages of using a Jupyter Notebook is that data treatment can be shared publicly in a highly readable format. This is a great tool to aid scientific reproducibility and add transparency to your publications.
We provide several example notebooks below.
- Overview tutorial notebook (here)
- HMM tutorial notebook (here)
- Circadian tutorial notebook (here)
- Navigating databases (here)
- Converting data to HCTSA format (here)