We consider the problem of segmenting unformatted text and transcripts linearly based on their topical structure. While prior approaches explicitly train to predict segment boundaries, our proposed approach solves this task by inferring the hierarchical segmentation structure associated with the input text fragment. Given the lack of a large annotated dataset for this task, we propose a data curation strategy and create a corpus of over 700K Wikipedia articles with their hierarchical structures. We then propose the first supervised approach to generating hierarchical segmentation structures based on these annotations. Our method, in particular, is based on a neural conditional random field (CRF), which explicitly models the statistical dependency between a node and its constituent child nodes. We introduce a new data augmentation scheme as part of our model training strategy, which involves sampling a variety of node aggregations, permutations, and removals, all of which help capture fine-grained and coarse topical shifts in the data and improve model performance. Extensive experiments show that our model outperforms or achieves competitive performance when compared to previous state-of-the-art algorithms in the following settings: rich-resource, cross-domain transferability, few-shot supervision, and segmentation when topic label annotations are provided.