Neural Computation Workshop 2022
The aim is for current and former members of the Doya unit to exchange recent progress and new ideas.
Registration deadline: December 9, 2022
Please register from the link below:
Each session allows 30 minutes for external speakers and 20 minutes for internal speakers.
“Development of digital biomarker for subthreshold depression”
“Exploration of learning by dopamine D1 and D2 receptors by a spiking network model”
14:30 Miles Desforges (Okinawa Institute of Science and Technology)
“Model-based imitation learning using entropy regularization of policy and model”
“Proppo: a message passing framework for customizable and composable learning algorithms”
16:30 Qiong Huang (Okinawa Institute of Science and Technology)
“Theoretical Analysis of Regularization in RL”
Cortical integration of prior values and sensory inputs
Optimal decision-making requires integrating sensory evidence with prior knowledge of action outcomes and/or an estimate of the context. Because most studies test sensory-based and outcome-based decisions independently, as perceptual and value-based decision making respectively, it remains unclear how the brain integrates sensory evidence and prior knowledge to guide behavior. We updated our previous behavioral task for head-fixed mice (Funamizu, iScience, 2021) to investigate how the frontal, motor, and sensory cortices contribute to this integration.
In the task, mice selected either a left or right spout depending on the tone frequency to receive a water reward. We randomly presented either a long or short sound stimulus and biased the reward amount for each option. Choices were less accurate and more biased toward the large-reward side in short-stimulus than in long-stimulus trials. In addition, outcomes following difficult-frequency or short-duration sounds affected the next trial's choice more than outcomes following easy or long sounds, suggesting that sensory confidence modulated the updating of reward expectations. Based on these behavioral results, we modeled the mice's choices with a reinforcement learning (RL) model that maintained reward expectations for the left and right choices (i.e., prior values). The model determined the choice from the prior values and the sensory inputs, and then updated the prior values using the difference between the outcome and a sensory-confidence-dependent action value (confidence value), which modulated the prior values with incoming sensory inputs.
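The confidence-weighted updating described above can be sketched in a few lines. The parameterization here (logistic combination of sensory log-odds and value difference, the learning rate, the inverse temperature) is a hypothetical illustration, not the paper's actual model:

```python
import math

def choose(q_left, q_right, p_left, beta=3.0):
    """Combine prior values with sensory evidence (p_left = P(left is correct))
    into one decision variable: sensory log-odds plus a weighted value
    difference (beta is a hypothetical inverse temperature)."""
    dv = math.log(p_left / (1.0 - p_left)) + beta * (q_left - q_right)
    return "left" if dv > 0 else "right"

def update(q, p_chosen, reward, alpha=0.2):
    """Update the chosen option's prior value with a confidence-scaled error.

    The 'confidence value' scales the prior value by sensory confidence, so
    outcomes after short or difficult stimuli yield larger prediction errors
    and stronger updating, as observed behaviorally.
    """
    confidence_value = p_chosen * q
    return q + alpha * (reward - confidence_value)
```

For example, with equal prior values and sensory evidence favoring the left spout, `choose(0.5, 0.5, 0.7)` selects left, and a reward after a low-confidence choice moves the prior value up more than one after a high-confidence choice.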
During the task, we recorded neural activity electrophysiologically from the medial prefrontal cortex (mPFC), the secondary motor cortex (M2), and the auditory cortex (AC) with Neuropixels 1.0 probes. We found that confidence values, choices, and sensory inputs were selectively represented in the mPFC, M2, and AC, respectively, whereas the prior values were represented in all recorded regions. Our results suggest localized and global cortical computation of task variables required on short and long time scales, respectively.
Investigation of Bayesian sensory-motor integration in the cerebral cortex
Animals need to estimate their environment to optimize behavior. For example, when pulling a desired object closer, the load should be estimated to apply sufficient force. In animals, the estimation of environmental variables has been shown to depend on both sensory input and prior experience. The statistically optimal way to combine sensory evidence with prior expectation is described by Bayes' rule.
In my experiment, I aim to shed light on how probabilistic estimation of the load can be implemented by the cerebral cortex. I used mice performing a lever-pulling task while imaging neural activity in multiple layers of the primary somatosensory area. Based on cortical anatomy, I hypothesize that superficial pyramidal neurons encode sensory evidence while deep pyramidal neurons encode prior and posterior values.
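For the Gaussian case, the Bayes-optimal combination of prior expectation and sensory evidence has a closed form, with each source weighted by its reliability. The load values below are purely illustrative:

```python
def bayes_fuse(mu_prior, var_prior, mu_sense, var_sense):
    """Optimal (Gaussian) combination of a prior expectation and a sensory
    measurement: the posterior mean is a reliability-weighted average and
    the posterior variance is smaller than either source alone."""
    w = var_sense / (var_prior + var_sense)          # weight given to the prior
    mu_post = w * mu_prior + (1.0 - w) * mu_sense
    var_post = (var_prior * var_sense) / (var_prior + var_sense)
    return mu_post, var_post
```

With a prior load expectation of 100 (variance 400) and a sensed load of 140 (variance 100), the posterior estimate is pulled mostly toward the more reliable sensory evidence.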
Dopamine Release in the Nucleus Accumbens during Backward Conditioning
Temporal-difference reinforcement learning proposes that dopamine signals reward prediction errors, which backpropagate the scalar value inherent in rewards to reward-predictive cues. This theory is supported by studies of dopamine activity in cue-reward learning, where a reward-paired cue acquires the dopamine response previously evoked by the reward. However, dopamine has recently been implicated in backward reward-cue learning. Specifically, a recent optogenetic study used a backward conditioning paradigm, in which sensory cues are presented following the delivery of rewards, and showed that inhibition of dopaminergic neurons at the onset of backward cues during learning disrupted the use of these cues to guide actions. This demonstrated the necessity of dopamine neurons for backward reward-cue learning; however, the dynamics of dopamine release during backward conditioning remain unclear. Here, we used fiber photometry to examine dopamine release while rats performed a backward conditioning task, measuring dopamine release in the nucleus accumbens with GRABDA, a dopamine biosensor. In cue-reward learning (i.e., the forward-conditioning task), dopamine release progressively decreased at reward delivery and increased at the onset of forward cues, consistent with well-established findings. However, in the backward-conditioning task, the response to reward increased over time, and dopamine release at the onset of backward cues was initially high and diminished with additional training. To further examine the dopamine response to backward cues, we performed a summation test in which a forward cue was presented in compound with a backward cue predicting the same or a different outcome. We found stronger dopamine responses to the compound of forward and backward cues, regardless of whether they predicted the same outcome, revealing a general excitatory response to backward cues when delivered unexpectedly.
These results suggest that dopamine release contributes to backward reward-cue associations, demonstrating that dopamine acts as a universal teaching signal for learning regardless of the motivational significance of the predicted outcome.
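The backpropagation of reward value to a predictive cue under temporal-difference learning, mentioned above for forward conditioning, can be sketched with a minimal TD(0) simulation (two time steps, unit reward; the learning rate and trial count are hypothetical):

```python
def td_learning(n_trials=200, alpha=0.1):
    """TD(0) over a two-step trial: cue at t=0, reward (magnitude 1) at t=1.

    v[0] and v[1] are the values of the cue and pre-reward states. With
    training, the prediction error at reward time shrinks while the cue
    value rises, i.e. the reward's value backpropagates to the cue.
    """
    v = [0.0, 0.0]
    delta_cue = delta_rew = 0.0
    for _ in range(n_trials):
        delta_cue = v[1] - v[0]       # error at cue onset (gamma = 1)
        v[0] += alpha * delta_cue
        delta_rew = 1.0 - v[1]        # error at reward delivery
        v[1] += alpha * delta_rew
    return v, delta_cue, delta_rew
```

After training, both prediction errors are near zero and the cue value approaches the reward magnitude, mirroring the forward-conditioning dopamine dynamics described above; the backward-conditioning results are what this simple account does not capture.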
Elucidating the mechanism of serotonin in optimism and pessimism
We aim to examine how the serotonin neural network, which has been shown to regulate patience for future rewards, differs when mice perform the same behavior for different purposes: "attainment of reward" versus "avoidance of punishment". We hypothesize that serotonin regulates "optimism / pessimism" toward achieving the goal, and we will examine the serotonin neural network through neural recording and neural manipulation in task-performing mice. By clarifying the neural mechanisms of “the optimism that creates patience” and “the pessimism that leads to giving up”, we aim to help realize a society in which people can improve their "ability to overcome the difficulties of life" and "vitality of the mind".
Development of digital biomarker for subthreshold depression
Subthreshold depression is defined as a decline of cognitive function and motivation caused by depressive mood, while the clinical symptoms do not meet the criteria for major depressive disorder (MDD). In addition to being a potential risk for progression to MDD, subthreshold depression is presumably a major factor in presenteeism, leading to a significant economic burden. Thus, a handy and reliable tool to detect subthreshold depression is in high demand.
Here, we present our preliminary studies on collecting psychological/physiological data via smart devices and developing "digital biomarkers" to detect an individual's risk of subthreshold depression. We recruited 168 undergraduate students at Hiroshima University and separated them into two groups, subthreshold depression and normal control, based on the BDI-II (a standard criterion for assessing the severity of depression). We distributed wearable watches to record heart rate and activity patterns continuously for three days, and provided digital tablets (iPad) on which participants underwent digital cognitive tests, voice measurement, and pulse measurement. Using the collected data, we constructed binary classification models to discriminate the two groups with the Random Forest algorithm. In hold-out validation based on data-collection timing, the area under the receiver operating characteristic (ROC) curve (AUC) was 0.77, demonstrating considerable classification ability. In the talk, we will conclude with current limitations and future challenges.
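For reference, an AUC such as the 0.77 reported above can be read through the rank-statistic definition of AUC, sketched here in plain Python (the labels and scores are illustrative, not the study's data):

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    Equals the probability that a randomly chosen positive case (label 1,
    e.g. subthreshold depression) receives a higher classifier score than
    a randomly chosen control (label 0); ties count half.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

So an AUC of 0.77 means that, roughly three times out of four, a randomly chosen at-risk participant is scored above a randomly chosen control.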
The paraventricular thalamus as a critical target region of the serotonin system in mice
Serotonergic fibers originate from a few nuclei in the brainstem but innervate the entire mammalian brain. The serotonin system powerfully affects multiple behavioral and cognitive functions including sleep-wake cycle, motor control, mood, learning, and decision-making. However, a circuitry-based theory of how the serotonin system is organized to carry out its diverse functions remains elusive.
Recent studies using cutting-edge technologies showed that the paraventricular thalamus (PVT) is a nucleus rich in serotonin, with strong serotonergic innervation from the dorsal and median raphe nuclei. The PVT has recently attracted attention as an important node that integrates internal-state information from the hypothalamus and brainstem with prior-experience information from the prefrontal cortex to guide adaptive behavior.
My ongoing thesis research aims to elucidate 1) how dysfunction of serotonergic modulation of PVT neurons alters behavior during decision-making tasks, and 2) how serotonin modulates the physiological activity of PVT neurons during decision-making tasks. In this talk, I will present our recent findings related to 1): optogenetic manipulation of serotonin afferents in the PVT alters mice's behavior in two kinds of instrumental tasks (a reward-waiting task and a two-armed bandit task). I will also discuss a future experimental plan to tackle 2).
Evidence of an insula mechanism for interoception and interoceptive awareness
The insular cortex, organized hierarchically into three modules (granular, dysgranular and agranular), plays an important role in interoception, the neural sensing of visceral and physiological signals. The insula is also involved in interoceptive awareness, the subjective and conscious perception of body signals. Disturbances in insula function and structure are predictive of physiological dysfunctions and the development of mental disorders. However, a mechanism for how the insula processes interoceptive information and supports interoceptive awareness is still unknown.
Here, we show that structural covariance networks of the insula are associated with interoceptive awareness of heartbeat signals. Specifically, we found that strengthened intra-insula granular-agranular covariance networks were associated with higher interoceptive awareness. Furthermore, using transcutaneous auricular vagus nerve stimulation (taVNS) and intracranial EEG recordings, we found a sequential mechanism of interoceptive information propagation within the granular input region of the insula, with signals travelling in a dorsal-to-ventral direction. We will discuss the implications of these findings for understanding psychiatric disorders.
Identification of the Phenotypic Markers of Human Heart-Rate (HR) Through the Study of Its Circadian Dynamics
Suntory Global Innovation Center (SGIC) is developing wearable devices to improve individuals' wellbeing through continuous acquisition and analysis of their neurophysiological signals and bodily states. Through a joint research project, the Neural Computation Unit (NCU) supports SGIC in achieving this objective by providing its expertise in modeling and analysis of neural and behavioural data.
In this presentation, we show that treating the HR state-space as an Auto-Regressive (AR) model allows capturing the HR dynamics in terms of its informational capacity, whose change throughout the circadian cycle reflects the HR's functional variation. In addition, we highlight the relation between the AR parameters and the HR's informational capacity, on the one hand, and behavioural and habitual data such as sleep quality and smoking, on the other.
These preliminary results suggest the potential that an AR model of human subjects' HR state-space offers for identifying population-level phenotypic markers of the cardiac system's circadian dynamics.
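As a minimal sketch of the AR treatment (first order only; not SGIC's actual pipeline or model order), an AR(1) model of an HR series can be fitted by least squares:

```python
def fit_ar1(x):
    """Least-squares fit of an AR(1) model x[t] = c + phi * x[t-1] + noise.

    phi governs how strongly the current HR sample depends on the previous
    one; tracking phi across the day is one simple way to quantify
    circadian variation of the series' dynamics.
    """
    xp, xc = x[:-1], x[1:]              # predictors and targets
    n = len(xp)
    mx, my = sum(xp) / n, sum(xc) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xp, xc))
    var = sum((a - mx) ** 2 for a in xp)
    phi = cov / var
    c = my - phi * mx
    return c, phi
```

On a noise-free series generated with c = 1 and phi = 0.5, the fit recovers the parameters exactly; on real HR data the residual variance would also be of interest.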
Exploration of learning by dopamine D1 and D2 receptors by a spiking network model
The basal ganglia (BG) play a crucial role in action-selection and reinforcement learning (RL), but how multiple nuclei, transmitters and receptors realize computations for reward-based learning is still unclear.
We built a topologically organized spiking BG model. Striatal medium spiny neurons (MSNs) were classified based on the expression of dopamine D1 and D2 receptors. We implemented spike-timing-dependent plasticity and two structural parameters: i) the asymmetry of connections between MSNs; and ii) the overlap between the direct and indirect pathways.
In action-selection simulations, we assumed two functional channels representing competing sensory inputs and actions. We activated two neighboring ensembles of cortical neurons and observed the responses of two adjacent MSN ensembles and downstream nuclei.
In RL simulations, we investigated transient increases and decreases of dopamine in a generalization-discrimination task. In generalization learning (classical conditioning), upon selection of the preferred channel, reward was delivered as a dopamine burst, causing potentiation of connections to MSN-D1. After several episodes, tests showed selection of the preferred channel across both stimuli.
In discrimination learning, the previously learned action selection upon a non-preferred channel triggered reward omission as a dopamine dip, causing potentiation of cortical synapses to MSN-D2. After several episodes, the prediction was refined, producing the corresponding channel selection for each stimulus.
Our simulation results show that discrimination learning converges faster for higher values of ii). This suggests that overlapping pathways may provide learning advantages, supporting the idea of functional cooperation between the direct and indirect pathways. This was possible given high asymmetry i), with sparse connections from MSN-D1 to MSN-D2.
Based on our results, we hypothesize that lateral inhibition from MSN-D2 to other MSNs increases during dopamine dips, and that this modulation is crucial for the convergence of discrimination learning.
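The dopamine-gated plasticity described above can be caricatured in rate-based form; the rule and parameters below are an illustrative simplification, not the spiking model's implementation:

```python
def update_corticostriatal(w_d1, w_d2, eligibility, dopamine, lr=0.05):
    """Dopamine-gated plasticity sketch for cortico-striatal synapses.

    A phasic burst (dopamine > 0, reward delivery) potentiates cortex->MSN-D1
    weights, while a dip (dopamine < 0, reward omission) potentiates
    cortex->MSN-D2 weights; 'eligibility' stands in for the spike-timing
    dependent eligibility trace of the active channel.
    """
    if dopamine > 0:                      # burst: direct-pathway learning
        w_d1 += lr * dopamine * eligibility
    elif dopamine < 0:                    # dip: indirect-pathway learning
        w_d2 += lr * (-dopamine) * eligibility
    return w_d1, w_d2
```

This captures the two learning events of the task: generalization learning strengthens the D1 route under bursts, and discrimination learning strengthens the D2 route under dips.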
Simultaneous recording of neuromodulator and calcium spatiotemporal activity reveals state dependent differences between dopaminergic, noradrenergic and serotonergic cortical activity
The precise release dynamics of neuromodulators in the cortex are not well known; in particular, whether neuromodulators are released in localized pockets or by volume transmission across the tissue. Using two-photon microscopy, we visualized extracellular neuromodulator release and intracellular calcium activity in the M2 region. In each hemisphere we expressed a biosensor for noradrenaline (GRABNE1m), dopamine (dLight1.2), or serotonin (GRAB5HT3.5), together with a red-shifted calcium indicator (jRGECO1a). Mice performed a Go/NoGo task while running or resting on a treadmill. We found a strong state-dependent effect on both calcium and neuromodulator activity, with surprisingly little difference among the neuromodulator signals. During rest, calcium and neuromodulator signals were not correlated and there was no evidence of large coherent release across the tissue. During sustained movement there were weak correlations between calcium and noradrenaline (0.3), serotonin (0.29), and dopamine (0.2), and activity was not correlated with running speed. However, during state transitions, each neuromodulator's activity was correlated with calcium activity (0.5-0.6) across the whole imaging window. Neuromodulator and calcium activity were also correlated with running speed at movement onset (>0.5) and movement offset (>0.3). This suggests that noradrenaline, dopamine, and serotonin are all involved in locomotion state transitions in the cortex and that their release activates nearby neurones.
Model-based imitation learning using entropy regularization of policy and model
Approaches based on generative adversarial networks for imitation learning are promising because they are sample-efficient in terms of expert demonstrations. So far, we have developed model-free Entropy-Regularized Imitation Learning (MF-ERIL). One of its features is that MF-ERIL maintains a "structured" discriminator that distinguishes the actions generated by a learner from those of an expert. However, training a generator still requires many costly interactions with the actual environment because model-free reinforcement learning is used to update the policy. To improve sample efficiency with model-based reinforcement learning, we propose model-based ERIL (MB-ERIL) under the entropy-regularized Markov decision process, reducing the number of interactions with the actual environment. MB-ERIL uses two discriminators. One is a policy discriminator slightly different from MF-ERIL's discriminator. The other is a model discriminator that distinguishes the state transitions generated by the model from those of the actual environment. The derived discriminators are structured so that learning of the policy and the model is efficient. Computer simulations and real robot experiments show that MB-ERIL achieves competitive performance and significantly improves sample efficiency compared to baseline methods.
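As an illustrative stand-in for the model discriminator's role (a hypothetical 1-D logistic sketch, not MB-ERIL's derived structured discriminator), a classifier separating real environment samples from model-generated ones can be trained by stochastic gradient descent on the cross-entropy loss:

```python
import math

def train_discriminator(real, fake, lr=0.5, epochs=200):
    """Logistic discriminator D(x) = sigmoid(w*x + b) trained to label
    real-environment samples 1 and model-generated samples 0.

    'real' and 'fake' are 1-D feature values standing in for state
    transitions; the discriminator's output can then serve as a learning
    signal for improving the model, as in GAN-style imitation learning.
    """
    w, b = 0.0, 0.0
    data = [(x, 1.0) for x in real] + [(x, 0.0) for x in fake]
    for _ in range(epochs):
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            g = p - y                    # gradient of the cross-entropy loss
            w -= lr * g * x
            b -= lr * g
    return w, b
```

After training on separable data, the discriminator confidently assigns real-like samples probabilities near 1 and model-like samples probabilities near 0.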
Dual Bayesian principle component analysis
Conventional dimensionality reduction methods, such as principal component analysis (PCA) and its probabilistic formulations: probabilistic PCA (PPCA) and factor analysis (FA), require manually setting up a threshold of the explained variance of each latent dimension, which is sometimes difficult to decide, especially when the signal-to-noise ratio is high. A Bayesian treatment of PCA, Bayesian PCA (BPCA), gives a possible way to decide the latent variable's dimensionality automatically. However, when we applied BPCA to some calcium imaging data, the automatically reduced dimensionality was still quite large compared to the number of neurons, which does not match our expectation of the low-dimensional structure of the neural latent space.
Inspired by BPCA, we propose a new Bayesian treatment of PCA, dual Bayesian PCA (dBPCA), which contains two hierarchical Bayesian formulations of the parameters. dBPCA automatically reduces the latent space's dimensionality to a relatively small value through Bayesian inference, and we explain theoretically why it reduces the dimensionality more than BPCA. Because the posteriors are computationally intractable, we solve this hierarchical Bayesian model approximately with a variational approach. To evaluate whether the reduced low-dimensional latent variables still capture the essential information in the original data, we first applied the method to two simulated datasets: i) constant signals with Gaussian noise and ii) simulated spike-calcium signal traces. We also compared dBPCA with several conventional methods, including PCA, PPCA, FA, and BPCA, evaluating model performance by the ability to reconstruct the original observations. We then applied these methods to calcium imaging recordings from the posterior parietal cortex (PPC) during an auditory localization task, and tested whether the latent variables could decode the sound locations, compared with decoding from the full neural population recordings. Across these settings, dBPCA outperformed the other models, reducing the dimensionality to a conveniently low number while maintaining satisfactory reconstruction of the observations and decoding of the external stimuli.
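For intuition, the explained-variance ratio that conventional PCA requires one to threshold manually (and that BPCA/dBPCA aim to determine automatically) can be computed in closed form for a 2-D toy case:

```python
import math

def pca_2d(xs, ys):
    """Eigen-decomposition of a 2-D covariance matrix.

    Returns the fraction of total variance carried by each principal
    component; the first ratio is exactly the quantity one would compare
    against a manually chosen explained-variance threshold in PCA.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / n
    syy = sum((y - my) ** 2 for y in ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    lam1 = tr / 2 + math.sqrt(tr ** 2 / 4 - det)   # leading eigenvalue
    lam2 = tr - lam1
    return lam1 / tr, lam2 / tr
```

Perfectly correlated signals put 100% of the variance on the first component, so a single latent dimension suffices; noisy high-dimensional calcium data spread variance across many components, which is where an automatic Bayesian choice of dimensionality becomes valuable.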
Proppo: a Message Passing Framework for Customizable and Composable Learning Algorithms
I will discuss a new type of machine learning software that grew out of my PhD research at OIST and was supported by the Proof of Concept program. This work was recently accepted for publication at NeurIPS, a top machine learning conference. In it, I propose Automatic Propagation (AP) software, a generalization of Automatic Differentiation (AD, the technology behind important recent machine learning software such as PyTorch and TensorFlow), and present a prototype AP software library called Proppo. This library allows flexible implementation of many new algorithms that are not easily implemented in AD software. Moreover, it allows implementing the algorithms in a reusable way. Even a novel, high-performing algorithm will have no impact if other researchers and engineers cannot use it. I hope that my work will overcome this issue and lead to a proliferation of complex machine learning algorithms.
Multi-Agent Reinforcement Learning for Distributed Solar-Battery Energy Systems
Efficient utilization of renewable energy sources, such as solar energy, is crucial for achieving sustainable development goals. As solar energy production varies in time and space depending on weather conditions, how to combine it with distributed energy storage and exchange systems with intelligent control is an important research issue. In the presentation today, I explore the use of reinforcement learning (RL) for adaptive control of energy storage in local batteries and energy sharing through energy grids.
I combine reinforcement learning algorithms in each house with the Autonomous Power Interchange System (APIS) from SONY. I consider different design decisions in applying RL: whether to use centralized or distributed control, at what level of detail actions should be learned, what information is used by each agent, and how much information is shared across agents. Based on these considerations, I implemented a deep Q-network (DQN) and a prioritized DQN to set the parameters of APIS's real-time energy exchange protocol, and tested them using actual data collected from the OIST DC-based Open Energy System (DCOES).
The simulation results showed that DQN agents outperform rule-based control in energy sharing and that prioritized experience replay further improves DQN's performance. The results also suggest that sharing average energy production, storage, and usage within the community improves performance.
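The Q-learning at the core of DQN can be illustrated with a toy tabular agent for battery control; the states, actions, rewards, and dynamics below are hypothetical simplifications, not the APIS/DCOES setup:

```python
import random

def q_learning_battery(episodes=500, alpha=0.3, gamma=0.9, eps=0.1):
    """Tabular Q-learning on a toy battery-control problem.

    States are battery levels {low, high}, actions are {store, share}; the
    hand-made reward favors storing when low and sharing when high.
    Epsilon-greedy exploration plus the standard Q-learning update.
    """
    random.seed(0)
    states, actions = ("low", "high"), ("store", "share")
    q = {(s, a): 0.0 for s in states for a in actions}
    reward = {("low", "store"): 1.0, ("low", "share"): -1.0,
              ("high", "store"): -0.5, ("high", "share"): 1.0}
    for _ in range(episodes):
        s = random.choice(states)
        a = (random.choice(actions) if random.random() < eps
             else max(actions, key=lambda x: q[(s, x)]))
        s2 = "high" if a == "store" else "low"      # deterministic toy dynamics
        target = reward[(s, a)] + gamma * max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])
    return q
```

After training, the greedy policy stores energy when the battery is low and shares it when high; DQN replaces the table with a neural network, and prioritized replay reweights which transitions are replayed for updates.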
Omron Sinic X
Theoretical Analysis of Regularization in RL
Regularization is an indispensable component of modern deep RL algorithms. Nonetheless, its theoretical understanding is relatively limited. In this talk, I explain my recent paper analyzing value iteration with entropy and KL regularization in the setting of tabular MDPs with a generative model. I show that the algorithm is minimax-optimal for finding a near-optimal policy without any variance reduction. This is the first work showing that a model-free algorithm as simple as value iteration with regularization can be minimax-optimal. As pure value iteration is known to be suboptimal in this setting, this work also demonstrates a sharp contrast between non-regularized and regularized algorithms.
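Entropy-regularized value iteration replaces the hard max in the Bellman backup with a softmax (log-sum-exp) at temperature tau. A minimal tabular sketch, on a toy MDP with hypothetical parameters:

```python
import math

def soft_value_iteration(P, R, tau=0.1, gamma=0.9, iters=500):
    """Entropy-regularized value iteration on a tabular MDP.

    P[s][a][s2] is the transition probability and R[s][a] the reward. The
    backup V(s) = tau * log sum_a exp(Q(s, a) / tau) replaces the hard max;
    it is computed in a numerically stable shifted form.
    """
    v = [0.0] * len(R)
    for _ in range(iters):
        v_new = []
        for s in range(len(R)):
            qs = [R[s][a] + gamma * sum(p * v[s2] for s2, p in enumerate(P[s][a]))
                  for a in range(len(R[s]))]
            m = max(qs)
            v_new.append(m + tau * math.log(sum(math.exp((q - m) / tau) for q in qs)))
        v = v_new
    return v
```

On a one-state MDP with rewards 1 and 0 for its two self-looping actions, the soft value converges just above the unregularized optimum of 1/(1-gamma) = 10, and approaches it as tau goes to 0.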
Lab Automation with Robots and Reinforcement Learning by Unity
First, I will introduce our activities at LiNKX, where we use robots and AI to automate tasks. Until now, most of our products have been deployed at production sites, but recently we have been focusing on automation in research and development settings. Robots can perform pipetting, weighing, pH measurement, viscosity measurement, visual photography, opening and closing bottle lids, and bottle washing, and these tasks can be combined to automate a series of laboratory operations. Even a measuring instrument that cannot be connected to a PC can be integrated into an automated system through the robot's physical operation and a camera.
Second, as a personal approach to reinforcement learning, I would like to introduce Unity, a game development environment in which reinforcement learning can be applied. Unity makes it easy to create 3D environments with physical properties, such as robots. You can then apply reinforcement learning algorithms such as PPO (Proximal Policy Optimization) and SAC (Soft Actor-Critic) to an agent. As an application example, I will show a game in which you play rolling-ball tennis against a robot trained by reinforcement learning.