I'm currently planning out a stock trading system that will use an LSTM network trained with reinforcement learning. The reason I'm going with reinforcement learning is that I'm not interested in predicting the next price directly. Rather, I'd like a model that outputs “conviction values” for the different possible actions (hold cash, buy stock, and short stock). This is because I'll be analyzing a number of different stocks and deciding which to trade based on which have the highest conviction values for either buying or shorting at any given time.
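To make the "conviction values" idea concrete, here's a minimal sketch of what I have in mind for the output side, assuming the network's final layer produces one logit per action and convictions are just a softmax over those logits (the logit values below are placeholders, not real model output):

```python
import math

ACTIONS = ["hold_cash", "buy", "short"]

def convictions(logits):
    """Turn raw per-action logits into a normalized conviction distribution (softmax)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Placeholder logits for one stock at one time step
probs = convictions([0.2, 1.5, -0.3])
best_action = ACTIONS[probs.index(max(probs))]
```

Across many stocks, the idea would be to rank candidates by their "buy" or "short" conviction and trade the strongest ones.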
There's another model that outputs the maximum position size it believes is worth holding, given a conviction value and some other input. Along with that, I'm currently planning on having two more models: one in charge of opening positions and another in charge of closing them. These take the conviction values and maximum position size as input, along with volume-related data. These models are necessary because I want the system to handle lower-volume stocks, where buying or selling large amounts at once could hurt the profitability of the trade. While the previous model might determine that opening a huge position would be ideal, that isn't always feasible given market conditions.
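The constraint those open/close models would enforce can be sketched as a simple participation cap: each slice of the order is limited to a fraction of recent volume, so a large target position gets filled gradually. The 5% participation rate here is an arbitrary placeholder, not something from my actual design:

```python
def max_slice(remaining_shares, bar_volume, participation=0.05):
    """Cap one order slice at a fraction of the bar's volume to limit market impact."""
    return min(remaining_shares, int(bar_volume * participation))

# Filling a 10,000-share target against placeholder per-bar volumes
target = 10_000
filled = 0
for volume in [50_000, 30_000, 80_000, 120_000]:
    filled += max_slice(target - filled, volume)
```

The open/close models would presumably learn something smarter than a fixed fraction, but this is the shape of the problem: the ideal position size and the feasible fill rate are separate decisions.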
My plan was for all of these models to use some measure of profitability as the reward function. The problem is that profitability isn't determined until after a position is closed. Since I'm planning for the system to gradually close positions, there also won't necessarily be a single moment where profitability is determined. Rather, profitability is a cumulative measure of all the decisions made over time: whether the conviction values were accurate, whether the size of the position opened was appropriate, and whether the amount bought or sold at each time step was appropriate. This seems to present a problem, since I'm under the assumption that a reward function needs to be applied immediately after an action is taken.
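For concreteness, the credit-assignment structure I'm worried about looks like this: many time steps of decisions, then profitability realized only at (or near) the end of the trade. A standard way RL frameworks express this is to compute a discounted return for each step by working backward through the episode's rewards; this sketch uses a placeholder reward sequence and discount factor:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute per-step discounted returns G_t by sweeping backward over episode rewards."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G  # each step's return includes all discounted future reward
        returns.append(G)
    return list(reversed(returns))

# Placeholder episode: no reward until the position is fully closed at the last step
returns = discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.9)
# Earlier actions still receive (discounted) credit for the eventual profit
```

My uncertainty is whether this kind of backward-propagated credit is a legitimate fit for profitability, or whether it falls apart when the "end" of a trade is itself smeared across many gradual closing steps.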
So, my question: is there some way to use delayed rewards, like profitability over time, in reinforcement learning, or do I need to scrap my current plans and start from scratch?