Learning a BProgram as a gym environment

This example demonstrates how to learn a BProgram as a gym environment, using the package’s extension to the OpenAI gym. In this extension, a localReward parameter can be added to the yield statement to reflect the system’s preferences. The BPEnv class requires two things: a b-program generator (a function that creates a new instance of the b-program) and the list of program events.

The default observation space for the b-program within BPEnv is the Cartesian product of the b-threads’ execution points, represented as a multi-discrete space. To tailor the observation space to specific needs, extend the abstract class BPObservationSpace, which gives access to both each b-thread’s execution point and its local variables. The reward at each state is computed by a function that receives the reward statements from all b-threads. By default, the total reward at each yield point is the sum of the individual rewards from all active b-threads.
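The default reward aggregation described above can be sketched in plain Python, independent of the package. The helper name `total_reward` is hypothetical and not part of bppy. The sketch assumes that a b-thread whose yield statement has no localReward contributes `None`.

```python
def total_reward(rewards):
    """Sum the local rewards reported by the b-threads, skipping None entries.

    `rewards` is assumed to hold one entry per active b-thread, with None
    for b-threads whose current yield statement carries no localReward.
    """
    return sum(r for r in rewards if r is not None)


# One b-thread reports a small penalty, the other two report nothing.
print(total_reward([-0.01, None, None]))  # -0.01
```

A function of this shape can be passed as the reward_function argument of BPEnv, in the same way as the lambda used in the listing below.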

import bppy as bp
from bppy.model.sync_statement import *
from bppy.model.b_thread import b_thread
from bppy.gym import *
import numpy as np

@b_thread
def add_hot():  # request hot 5 times, and specify a reward
    for i in range(5):
        yield {request: bp.BEvent("HOT"), localReward: -0.01}
    yield {waitFor: bp.All(), localReward: 1}

@b_thread
def add_cold():  # request cold 5 times
    for i in range(5):
        yield {request: bp.BEvent("COLD")}


@b_thread
def control():  # blocks HOT from occurring twice in a row
    while True:
        yield {waitFor: bp.BEvent("HOT")}
        yield {waitFor: bp.BEvent("COLD"), block: bp.BEvent("HOT")}


def init_bprogram():  # function to initialize the b-program with the defined b-threads
    return bp.BProgram(bthreads=[add_hot(), add_cold(), control()],
                       event_selection_strategy=bp.SimpleEventSelectionStrategy())


if __name__ == '__main__':
    # define event list
    event_list = [bp.BEvent("HOT"), bp.BEvent("COLD")]

    # initialize environment with the defined b-program generator, observation space, and reward function
    env = BPEnv(bprogram_generator=init_bprogram,
                event_list=event_list,
                observation_space=SimpleBPObservationSpace(init_bprogram, event_list),
                reward_function=lambda rewards: sum(filter(None, rewards)))

    # reset environment and print initial state
    state, _ = env.reset()
    print(state)
    terminated = False
    while not terminated:  # loop until the environment (b-program) terminates
        action = env.action_space.sample()  # sample an action
        print(action)
        state, reward, terminated, _, info = env.step(action)  # take a step with the sampled action
        print(state, reward, terminated, info)

    # importing stable_baselines3 and initializing a PPO model
    from stable_baselines3 import PPO
    model = PPO("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=100000)

    # running the environment again with the trained model
    state, _ = env.reset()
    print(state)
    terminated = False
    while not terminated:
        action, _states = model.predict(state)
        print(action)
        state, reward, terminated, _, info = env.step(action)
        print(state, reward, terminated, info)
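
The two episode loops above share the same shape, so a small helper could factor it out. The following is a sketch only: `run_episode` and `CountdownEnv` are hypothetical names, not part of bppy. The stub environment exists just to make the example runnable, and it mimics the `reset` tuple and the five-value `step` return used above.

```python
def run_episode(env, choose_action, max_steps=1000):
    """Run one episode and return the summed reward.

    Stops on termination, on truncation, or after max_steps steps.
    """
    state, _ = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = choose_action(state)
        state, reward, terminated, truncated, _ = env.step(action)
        total += reward
        if terminated or truncated:
            break
    return total


class CountdownEnv:
    """Hypothetical stand-in environment that terminates after n steps."""

    def __init__(self, n=3):
        self.n = n
        self.remaining = n

    def reset(self):
        self.remaining = self.n
        return self.remaining, {}

    def step(self, action):
        self.remaining -= 1
        # Reward 1.0 per step; terminated once the counter reaches zero.
        return self.remaining, 1.0, self.remaining == 0, False, {}


print(run_episode(CountdownEnv(3), lambda state: 0))  # 3.0
```

With the objects from the listing above, the random rollout would presumably become `run_episode(env, lambda s: env.action_space.sample())` and the trained rollout `run_episode(env, lambda s: model.predict(s)[0])`. The `max_steps` cap is an added safeguard against a policy that never reaches termination.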