Abstract
Automated Theorem Proving (ATP), has important applications, e.g., software verification. In the future it is expected that ATP will play an important role verifying the conclusions from statistical AI methods. Conversely, advancements in statistical methods can be used to guide the search of symbolic AI systems. This research deals with guiding the proof-search of the ATP system E. E searches for a proof by repeatedly selecting a statement (clause) that is expected to lead to a proof by contradiction. This process is known as given clause selection. In this research, E is augmented with a neural network policy for given clause selection, trained using the proximal policy optimization reinforcement learning (RL) algorithm. The RL state is a small set of simple features that can be efficiently tracked within E. The action representation is novel, as the agent does not directly select a given clause, but instead selects from a predefined set of clause evaluation functions, and the selected function then chooses the given clause. This RL approach is different from previous work on guiding given clause selection, most of which uses supervised learning and requires extracting features from clauses. In order to provide the RL approach with good base performance, E's automatic mode is used to choose proof-search strategies for all problems in a dataset, and these strategies are merged into one baseline strategy that is used for all problems. The impact of this strategy merging is more profound than any gains due to RL, and lead to further investigations into strategy merging. These approaches are evaluated on three separate datasets: a first-order logic dataset, a typed first-order logic dataset, and a higher-order logic dataset. The results are compared with baselines, including E's "auto" strategy and a fixed (but learned) distribution over actions.