Model-based reinforcement learning has attracted increasing attention recently. However, learning a structured state space for the model is challenging when the observations are high-dimensional raw pixels. Existing work addresses the problem by first interacting with the environment to collect state-transition samples and then factorizing the state space by analyzing those samples. We observe, however, that the interaction and factorization procedures should be mutually beneficial: interaction collects the samples needed for factorization, and the more structured state space produced by factorization enables more effective interaction. To verify this hypothesis, we first design a basic multi-object game under entity-independence and controllability assumptions. Our proposed algorithm, named InterFact (Interaction-based Factorization), solves the game with higher performance than the baselines. Our framework is end-to-end: it trains an inverse dynamics model with a group-sparsity constraint and a policy network on a shared latent space, which is in turn learned by a variational autoencoder. We interpret the representations learned with only intrinsic motivation and study their performance in policy training with the environment reward.
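The group-sparsity constraint on the inverse dynamics model can be illustrated with a minimal sketch. Here, a group-lasso penalty is applied over the weight columns belonging to each latent factor, so that factors not useful for predicting the action are driven to zero as a whole group, which encourages a factorized latent space. All names, shapes, and the linear inverse-dynamics form below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def group_sparsity_penalty(W, group_size):
    """Group-lasso penalty: sum of L2 norms of column groups of W.

    Each group of `group_size` consecutive columns is assumed to
    correspond to one latent factor; the penalty encourages entire
    factors to be zeroed out of the inverse dynamics model.
    """
    d = W.shape[1]
    assert d % group_size == 0, "latent dim must split evenly into groups"
    return sum(np.linalg.norm(W[:, g:g + group_size])
               for g in range(0, d, group_size))

# Toy setup (assumed for illustration): 4 latent factors of size 2,
# a 3-dimensional action, and a linear inverse dynamics model that
# predicts the action from the latent-state difference z_{t+1} - z_t.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))

# Suppose training prunes the two factors that are not controllable:
W_pruned = W.copy()
W_pruned[:, 4:] = 0.0

# Zeroing whole groups strictly lowers the group-lasso penalty,
# which is what makes the constraint favor factorized solutions.
print(group_sparsity_penalty(W_pruned, 2) < group_sparsity_penalty(W, 2))
```

A group penalty (rather than a plain L1 penalty on individual weights) matters here because it removes a latent factor's entire influence on action prediction at once, rather than sparsifying weights element by element.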