Sensei: Self-Supervised Sensor Name Segmentation

A sensor name, typically an alphanumeric string, encodes the key context (e.g., function and location) of a sensor needed for deploying smart building applications. Sensor names, however, are curated in a building vendor-specific manner using different structures and vocabularies that are often esoteric. They thus require tremendous manual effort to annotate on a per-building basis, even just to segment these sensor names into meaningful chunks. In this paper, we propose a fully automated self-supervised framework, Sensei, which can learn to segment sensor names without any human annotation. Specifically, we employ a neural language model to capture the underlying sensor naming structure and then induce self-supervision based on information from the language model to build the segmentation model. Extensive experiments on five real-world buildings comprising thousands of sensors demonstrate the superiority of Sensei over baseline methods.


Introduction
Sensor name segmentation, which aims at partitioning a sensor name string into a few semantic segments, is an essential task for enabling smart building technologies (Weng and Agarwal, 2012), as these technologies fundamentally rely on understanding the context of sensory data. For example, to increase the airflow in a room in view of the ongoing COVID-19 pandemic, one needs to locate the airflow control point of the room. To obtain such context, one first needs to understand the sensor names, which are encoded as a concatenation of segments. Thus, correctly segmenting sensor names into meaningful chunks is a key first step towards such understanding. As illustrated in Figure 1, a sensor name is typically a sequence of alphanumeric characters containing multiple segments, each encoding key context about the sensor (building name, location, sensor type, etc.). For example, the sensor name SODA4R731__ASO should be segmented as SOD (building name), A4 (equipment id), R731 (room id), and ASO (measurement type: area temperature setpoint). Note that the meaning of the same punctuation may vary; for example, '_' can be a delimiter or part of a segment.
Currently, sensor name segmentation requires domain knowledge and tedious manual effort due to its building-specific nature. Sensor names are created by building vendors, and as we see from Figure 1, different buildings usually adopt distinctive structures and vocabularies that are often esoteric. Typically, building a sensor name segmentation model involves a technician with domain expertise who comprehends these sensor names and then designs rules to segment and annotate them; no universal pre-defined parsing rules such as regular expressions exist for sensor names. Sensor name segmentation therefore remains a major obstacle to the wide adoption of smart building technologies from both cost and efficiency perspectives (Bhattacharya et al., 2015a).
We need an automated solution for sensor name segmentation. Despite the recent progress in applying active learning (Schumann et al., 2014; Balaji et al., 2015; Koh et al., 2018; Shi et al., 2019) and transfer learning (Hong et al., 2015a; Jiao et al., 2020) to sensor name interpretation, all these methods still require human annotation effort and thus are not fully automated.
Figure 2: Overview of Sensei. We induce pseudo labels for segmentation using the transition probabilities from a character-level neural language model. The hidden states from the language model are also used when training the segmentation model.
In this paper, given all the sensor names in a building, we propose a novel self-supervised segmentation framework, Sensei, to segment these names into meaningful chunks without any human effort. Doing so would facilitate the process of understanding sensor context and make it fundamentally scalable. Figure 2 presents an overview.
We draw inspiration from a key observation that when creating the sensor names within one building, technicians would follow some underlying naming patterns. For instance, in some buildings, the sensor name often starts with the building name, followed by the room id and type of measurement. Also, technicians would use similar phrases to express the same concept (e.g., "temperature" would be encoded as "T", "temp", or "ART"), at least within the same building.
Based on this observation, in Sensei, we first employ a character-level neural language model (Karpathy et al., 2015) to capture the latent generative pattern in sensor names. This language model learns the probability of observing a character in the sensor name given all the preceding characters. Intuitively, the segment boundaries in a sensor name should highly correlate with this probability. Frequent transitions would have a higher probability than the infrequent ones, which might well imply the start of another segment. Therefore, we induce pseudo segmentation labels by setting a pair of thresholds on these transition probabilities, and then build a binary classifier to segment sensor names upon their contextualized representations produced by the language model. Since these pseudo labels may contain noise, we create an ensemble of independent classifiers, each trained on a uniformly random subset of the pseudo labels, in order to further improve the efficacy.
To the best of our knowledge, Sensei is the first framework for sensor name segmentation without human annotation. We conduct extensive experiments on five different buildings with thousands of sensors. Our main contributions are as follows:
• We study an important problem of fully automated sensor name segmentation.
• We propose a novel self-supervised framework Sensei, which leverages a neural language model to capture the underlying naming patterns in sensor names and produces pseudo segmentation labels for training binary classifiers.
• We conduct extensive experiments on five real-world buildings comprising thousands of sensor names. Sensei on average achieves about 82% in $F_1$, roughly a 49-point improvement over the best compared method.
Reproducibility. Our code and datasets are readily available on GitHub: https://github.com/work4cs/sensei.

The Sensei Framework
Our framework Sensei consists of three steps:
• Train a neural language model (NLM) at the character level to capture the underlying naming patterns in sensor names;
• Generate Tie-Break-Unknown pseudo labels using two thresholds, $t_0$ and $t_1$, decided by inspecting the distribution of transition probabilities (i.e., likelihood of observing the current character given the previous ones);
• Train a set of segmentation models based on the pseudo labels to mitigate the effect of noise in these labels.
We next elaborate on each step.

Language Model for Underlying Patterns
As sensor names are created by humans (e.g., a technician with knowledge about building particulars), they often follow a certain naming convention (e.g., start with the building name, then room id, and then type). In addition, within a building, segments of sensor names corresponding to the same kind of information (e.g., location or function) would use similar phrases; e.g., the concept of "room" would be encoded as "RM", "R", or similar variants. A natural solution follows: we want to model the generative patterns in these names such that, given the characters seen so far, we can predict the next one. This coincides with the language modeling task in NLP.
Since the sensor name segmentation task works on characters, we adopt a popular character-level neural language model to capture the underlying sensor naming pattern. Specifically, we choose the classical Char-RNN (Karpathy et al., 2015) architecture in our design and use LSTM (Hochreiter and Schmidhuber, 1997) as the RNN model. Note that our method is compatible with any character-level neural language model.
Given a character sequence of length $N$, $X = x_1, x_2, \ldots, x_N$, the Char-RNN learns the probability of observing a character given all the previous characters, namely, $p(x_{i+1} \mid x_1, x_2, \ldots, x_i)$. During this process, we obtain an embedding vector $\mathbf{x}_i$ for each character $x_i$, and a hidden state vector $h_i$ after observing the characters from $x_1$ to $x_i$. A softmax layer is then applied to $h_i$ to predict a distribution $\hat{p}_i$ over the entire vocabulary:
$$\hat{p}_i(c) = \frac{\exp(w_c^\top h_i)}{\sum_{c'} \exp(w_{c'}^\top h_i)},$$
where $w_c$ is the linear transformation for character $c$. The cross-entropy between $\hat{p}_i$ and the one-hot encoding of $x_{i+1}$ is used as the loss function for this character. Given a building, we train the Char-RNN on all its sensor names. As sensor names are independent of one another, we use the same initial hidden state for each sensor name so that sensor names do not interfere with each other. Once the model converges, we apply it to all the sensor names to obtain the character transition probabilities, i.e., $\hat{p}_i(x_{i+1})$. The perplexity of the trained Char-RNN in our experiments is typically small (i.e., < 0.3 per batch with batch size 32). Therefore, we believe it captures the underlying naming pattern within the input building well.
Figure 3: Tie/Break precision curves for an example building. The "sweet spot", achieving a great balance between the tie- and break-precision scores, is highly aligned with the peak in the histogram.
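To make the softmax step concrete, the following is a minimal sketch of how transition probabilities could be read off a trained model. The callable `next_char_scores` is a hypothetical stand-in for the Char-RNN (it returns the unnormalized score of every vocabulary character after a given prefix); it is not part of the released code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a dict {character: score}."""
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def transition_probs(name, next_char_scores):
    """Return the probability assigned to each observed next character.

    `next_char_scores(prefix)` is a hypothetical stand-in for the trained
    language model: it maps a prefix to unnormalized scores per character.
    The result has one entry per adjacent character pair in `name`.
    """
    probs = []
    for i in range(1, len(name)):
        dist = softmax(next_char_scores(name[:i]))
        # Characters outside the vocabulary get probability 0 (an assumption).
        probs.append(dist.get(name[i], 0.0))
    return probs
```

A name of length N yields N - 1 transition probabilities, one per candidate boundary.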

Pseudo Labels from Transition Probabilities
Inspired by Shang et al. (2018), we use Tie and Break to decide the segmentation results. The transition between two adjacent characters $(x_i, x_{i+1})$ is labeled as (1) Break when we should segment after character $x_i$, or (2) Tie otherwise, denoting that the two successive characters belong to the same segment. For a given character sequence $x_1, x_2, \ldots, x_N$, we hypothesize that the transition probability $\hat{p}_i(x_{i+1})$ obtained from Char-RNN is closely related to the Tie/Break relation between $x_i$ and $x_{i+1}$. Intuitively, the Char-RNN model should produce a high likelihood for common transitions in sensor names, e.g., substrings for building name, room, and common sensor types. Therefore, when Char-RNN suggests a low transition probability, the transition is very likely to be a Break; otherwise, the possibility of a Tie becomes higher.
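As an illustration of the Tie/Break formulation, the sketch below turns per-transition probabilities into segments with a single threshold. The function names are ours, and treating a probability strictly below the threshold as a Break is an assumption about the boundary case.

```python
def breaks_from_probs(probs, threshold):
    """Mark a transition as Break (True) when its probability is below threshold."""
    return [p < threshold for p in probs]


def segments_from_breaks(name, is_break):
    """Split `name` after every position i where is_break[i] is True.

    is_break[i] describes the transition between name[i] and name[i + 1],
    so it must have exactly len(name) - 1 entries.
    """
    if len(is_break) != max(len(name) - 1, 0):
        raise ValueError("need one Tie/Break flag per adjacent character pair")
    segments, start = [], 0
    for i, flag in enumerate(is_break):
        if flag:
            segments.append(name[start:i + 1])
            start = i + 1
    segments.append(name[start:])
    return segments
```

For the running example, Breaks after the characters at indices 2, 4, 8, and 10 of SODA4R731__ASO recover SOD, A4, R731, the padding underscores, and ASO.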
We empirically verify our hypothesis via data analysis of an example building as shown in Figure 3. We present the probability density, estimated from a histogram, of $\hat{p}_i(x_{i+1})$. In addition, based on the ground-truth segmentation results, we plot the Tie and Break precision curves w.r.t. different thresholds. The Tie Precision refers to the ratio of Tie transitions among all the transitions above a certain threshold, while the Break Precision refers to the ratio of Break transitions among all the transitions below a certain threshold. One can observe that the "turning points" on the Break precision curve are highly correlated with the peaks in the histogram.
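The two precision curves can be computed point by point as in the following sketch. It assumes ground-truth Break flags are available for analysis only, and it treats a probability equal to the threshold as "above"; both the function name and that boundary convention are our assumptions.

```python
def tie_break_precision(probs, is_break, threshold):
    """Return (tie_precision, break_precision) at one threshold.

    probs[i] is the transition probability and is_break[i] the ground-truth
    label of the same transition. A side with no transitions yields None.
    """
    above = [b for p, b in zip(probs, is_break) if p >= threshold]
    below = [b for p, b in zip(probs, is_break) if p < threshold]
    tie_precision = sum(not b for b in above) / len(above) if above else None
    break_precision = sum(b for b in below) / len(below) if below else None
    return tie_precision, break_precision
```

Sweeping `threshold` over a grid and plotting both values against it gives curves of the kind described above.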
If one wants to set a single threshold for Break in an unsupervised manner, the highest peak in the "confidence" interval [0.550, 0.950] on the distribution (e.g., 0.771 in Figure 3) would be a good choice to achieve a high $F_1$ score. We generalize this threshold selection criterion to the other buildings, and as we shall demonstrate in our experiments, such a selection strategy gives results close to grid search that uses ground-truth labels.
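A minimal sketch of this peak-picking rule follows. It assumes the histogram is built by rounding probabilities to three decimal places, so each distinct rounded value is one bin; the bin width and the tie-breaking rule are our assumptions rather than details given in the text.

```python
from collections import Counter


def peak_threshold(probs, low, high, decimals=3):
    """Return the most frequent rounded probability inside [low, high].

    Returns None when no probability falls in the interval.
    """
    counts = Counter(round(p, decimals) for p in probs)
    in_range = {value: n for value, n in counts.items() if low <= value <= high}
    if not in_range:
        return None
    # Highest count wins; equal counts resolve to the smaller value,
    # which is an arbitrary choice made for determinism.
    return min(in_range, key=lambda value: (-in_range[value], value))
```

The same function can be called with a different interval, for example a narrow low-probability range, when a second threshold is needed.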
In addition to Tie and Break, we mark uncertain transitions as Unknown. We need to decide on two thresholds, $t_0$ and $t_1$, and categorize the transitions according to three transition probability intervals, $[0, t_0]$, $(t_0, t_1)$, and $[t_1, 1]$, denoting Break, Unknown, and Tie, respectively, as the pseudo labels. We want these pseudo labels to be highly accurate while remaining sufficiently numerous. Based on our observations, the single-threshold criterion above is a suitable choice for $t_1$. Considering that Breaks are considerably fewer than Ties, we should decide on a Break more carefully. For $t_0$, the highest peak in a narrowed high-precision interval [0.050, 0.150] would be appropriate (e.g., 0.101 in Figure 3).
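The three-way labeling rule can be written down directly from the intervals above. This is a small sketch; the function name is ours, and the example thresholds in the usage note are the values quoted for the example building.

```python
def pseudo_label(p, t0, t1):
    """Map a transition probability to a pseudo label.

    [0, t0] -> "Break", (t0, t1) -> "Unknown", [t1, 1] -> "Tie".
    """
    if not 0.0 <= t0 < t1 <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= t0 < t1 <= 1")
    if p <= t0:
        return "Break"
    if p >= t1:
        return "Tie"
    return "Unknown"
```

With `t0 = 0.101` and `t1 = 0.771`, a transition with probability 0.5 is left as Unknown and is therefore excluded from classifier training.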

Ensemble to De-noise Pseudo Labels
There could exist errors in these automatically induced pseudo labels, so we leverage the idea of ensemble learning to mitigate the effects of these label errors on the final predictions (Breiman, 1996). Specifically, we train $K$ binary classifiers, each on an independently sampled subset of the pseudo labels, and then average their predictions. In the pseudo labels, the number of Tie transitions is usually much higher than that of Break. To balance the training data, we sample $\epsilon \cdot M$ Tie and Break labels, respectively, from all the pseudo labels, where $M$ is the number of Break transitions and $\epsilon$ is a small coefficient between 0 and 1 for sampling a subset (e.g., $\epsilon = 0.1$). Such a sampling strategy makes the label errors less likely to affect every binary classifier, so the final prediction becomes more accurate.
All types of binary classifiers could be used to construct the ensemble, and we adopt a multi-layer Perceptron (MLP) as our binary classifier. For the i-th transition, we retrieve the hidden state vector h i yielded by the Char-RNN and feed it as input to the MLP. The final prediction is the average of predictions from the K classifiers. As the training data is sampled in a balanced way, we simply use 0.5 as the threshold to decide on Tie or Break.
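The sampling-and-averaging scheme can be sketched independently of the classifier type. In the code below, `fit` is a hypothetical callable that trains one classifier and returns a function mapping an input to a Tie probability; `midpoint_fit` is a deliberately trivial one-dimensional stand-in for the MLP, included only so the sketch runs on its own. The subsampling coefficient is called `eps` here.

```python
import random


def balanced_subsample(ties, breaks, eps, seed):
    """Draw the same number of Tie and Break examples, about eps * len(breaks)."""
    rng = random.Random(seed)
    k = max(1, int(eps * len(breaks)))
    k = min(k, len(ties), len(breaks))
    return rng.sample(ties, k), rng.sample(breaks, k)


def train_ensemble(ties, breaks, fit, n_models=100, eps=0.1):
    """Fit n_models classifiers, each on its own balanced subsample.

    The model index doubles as the random seed, so runs are repeatable.
    """
    return [fit(*balanced_subsample(ties, breaks, eps, seed=k))
            for k in range(n_models)]


def predict_tie(models, h, threshold=0.5):
    """Average the Tie probabilities of all models and compare to threshold."""
    average = sum(model(h) for model in models) / len(models)
    return average >= threshold


def midpoint_fit(tie_sample, break_sample):
    """Toy classifier on scalar inputs: Tie if above the midpoint of class means."""
    tie_mean = sum(tie_sample) / len(tie_sample)
    break_mean = sum(break_sample) / len(break_sample)
    mid = (tie_mean + break_mean) / 2
    return lambda h: 1.0 if h >= mid else 0.0
```

In practice `fit` would train a classifier on hidden state vectors; because each model sees a different small subset, a single mislabeled transition influences only a fraction of the ensemble.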

Experiments
We empirically evaluate Sensei on datasets from real-world buildings and discuss our results as well as findings from some interesting cases.

Datasets and Pre-processing
To evaluate Sensei, we collect the sensor names from five office buildings (named A through E) of four different building vendors at three different sites located in different geographic regions. We also collect the character-level ground-truth labels of these names from their building vendors. We adopt the BIO tagging scheme in generating labels, marking the beginning (B), inside (I), and outside (O) of each segment (e.g., for location or function). The details of each building are summarized in Table 1.
Digits. The digits in sensor names indicate detailed and specific information such as room or equipment identifiers, so preserving the variety in numbers does not help our segmentation task. On the contrary, it disturbs the transition probability distribution and thus confuses the model in predicting the next characters: the model only needs to learn and recognize transitions from digit to digit, as opposed to the specific values (e.g., "1" to "2" or "4" to "3"). Therefore, we replace all numerical digits with the same digit "0".
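The digit normalization step amounts to a one-line substitution. The sketch below assumes only ASCII digits occur in sensor names; the function name is ours.

```python
import re


def normalize_digits(name):
    """Replace every ASCII digit with "0", leaving all other characters as-is."""
    return re.sub(r"[0-9]", "0", name)
```

For example, SODA4R731__ASO and SODA1R516__VAV both map to the shared prefix SODA0R000__ followed by their measurement type, so the language model sees one pattern instead of many room numbers.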
Punctuation and Whitespace. There are symbols such as underscores and whitespace in sensor names, which are inserted by technicians at the time of metadata construction. We leave them as-is for our model to learn their meanings because the meanings of these characters vary from case to case. This is in fact one of the major challenges in this sensor name segmentation problem. For example, the sensor name "SODH1______L_L" should be segmented as "SOD|H1|______|L_L", with the three segments corresponding to its building name, equipment id, and measurement type, respectively. The underscores between "H1" and "L_L" are padded to make the sensor name fixed-length, while the underscore inside "L_L" connects two initial letters (i.e., for a Lead-Lag sensor, commonly existing in water pumps).

Evaluation Metrics
We evaluate the performance of all the considered methods by the $F_1$, precision, and recall scores. A segment is represented as a span with the starting and the ending character indices. A predicted segment is correct if and only if exactly the same segment exists in the ground truth. Therefore, we define the precision and recall as follows:
$$\text{Precision} = \frac{|S_{GT} \cap S_{Pred}|}{|S_{Pred}|}, \qquad \text{Recall} = \frac{|S_{GT} \cap S_{Pred}|}{|S_{GT}|},$$
where $S_{GT}$ is the set of ground-truth spans and $S_{Pred}$ is the predicted set. The $F_1$ score is the harmonic mean of precision and recall. We report the averaged $F_1$ score of all sensor names, which is relatively unbiased (Opitz and Burst, 2019). As we mentioned before, there will be some extra delimiters between segments. Therefore, during the evaluation, we ignore segments containing only delimiter(s) in both ground truth and predicted segments. When calculating the start and end indices for predicted segments, we also skip their prefix and suffix delimiters. The same process applies to the evaluation of all methods.
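The span-matching metric, including the delimiter handling, can be sketched as follows. Treating every non-alphanumeric ASCII character as a delimiter and trimming delimiters on both ground-truth and predicted segments are our assumptions; the helper names are ours as well.

```python
import re

_LEADING = re.compile(r"[^0-9A-Za-z]*")
_EDGES = re.compile(r"^[^0-9A-Za-z]+|[^0-9A-Za-z]+$")


def spans(segments):
    """Convert consecutive segment strings to a set of (start, end) spans.

    Indices are inclusive. Delimiter-only segments are dropped, and leading
    or trailing delimiters are trimmed before the span is computed.
    """
    result, pos = set(), 0
    for seg in segments:
        core = _EDGES.sub("", seg)
        if core:
            lead = _LEADING.match(seg).end()
            result.add((pos + lead, pos + lead + len(core) - 1))
        pos += len(seg)
    return result


def precision_recall_f1(gt_segments, pred_segments):
    """Exact-match span precision, recall, and F1 for one sensor name."""
    gt, pred = spans(gt_segments), spans(pred_segments)
    hits = len(gt & pred)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gt) if gt else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Under this convention a prediction that attaches padding underscores to a neighboring segment is not penalized, while splitting L_L at its inner underscore is.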

Compared Methods
We compare Sensei with the following methods:
• Delimiter. There are punctuation characters (such as "-" and "_") and whitespace characters in sensor names, and they could indicate the boundaries between segments. Therefore, this method segments a sensor name by delimiters (i.e., non-alphanumeric characters). This method mainly serves as a sanity check.
• NLTK TweetTokenizer. NLTK (Bird et al., 2009) provides a tweet tokenizer to segment a string into tokens according to predefined regular expressions (regexes). We directly apply it to segment our sensor names.
• CoreNLP. We adopt the pre-trained tokenizer in the CoreNLP package (Manning et al., 2014), which adopts the Universal Dependencies version 2 (UD v2) standard for segmentation.
• Stanza. We also adopt Stanza and use its built-in neural tokenizer (Qi et al., 2020) following UD v2. This method combines convolutional filters and bidirectional LSTM to realize tokenization and sentence segmentation as a tagging task (Qi et al., 2018).
• BayesSeg. Topic segmentation divides a document into topic-coherent segments. An unsupervised Bayesian model, BayesSeg (Eisenstein and Barzilay, 2008), is used to segment characters of sensor names as a topic segmentation task that decides the boundary between sentences. However, this method requires manually specifying the number of segments, which is a parameter we do not know without human input.
• ToPMine. ToPMine (El-Kishky et al., 2014) provides a method that groups frequent words into phrases in an unsupervised manner and incorporates these phrases into topic modeling. We adapt the model to work at the character level. That is, we regard each character of a sensor name as a word in a document and group characters into segments in the same way that words are grouped into phrases.
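For reference, the Delimiter baseline can be sketched in a few lines. Dropping the delimiter runs from the output, rather than keeping them as separate segments, is our assumption.

```python
import re


def delimiter_segment(name):
    """Split a sensor name at every run of non-alphanumeric characters."""
    return [part for part in re.split(r"[^0-9A-Za-z]+", name) if part]
```

On a name such as SODH0______L_L this yields SODH0, L, and L: it misses the boundary between the building name and the equipment id, and it wrongly splits L_L, which illustrates why delimiters alone are not a reliable signal.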
Note that we do not use custom regular expressions (regexes) to segment sensor names because they require tremendous manual effort to create in order to exhaustively cover all the possible substring patterns, which deviates from our self-supervised problem setting. Moreover, since different buildings follow different sensor naming conventions, manual effort is required from domain experts to create regexes on a per-building basis, which is a costly process.
We also compare with two ablations of our method:
• Sensei-Forward (Sensei-FW). It leaves out the self-supervised ensemble learning. Specifically, we keep the Char-RNN to obtain the distribution of observing next characters, and then find the single threshold as stated in Section 2.2.
• Sensei-Backward (Sensei-BW). This is similar to the forward counterpart. The only difference is that the Char-RNN takes as input the reversed sensor names. As we shall see in the results, this method does not add much value to our task due to the intrinsic irregularity of sensor names when examined backward.
We further examine a method using grid search based on ground truth for threshold tuning to verify the effectiveness of our threshold decision:
• Sensei-GridSearch (Sensei-GS). Compared to Sensei-FW, this method finds the best threshold for deciding Tie using ground-truth labels, i.e., it searches through all the possible threshold values on the transition probability distribution and picks the one that produces the best segmentation results. Note that this method is only used to demonstrate that a single threshold chosen based on the transition distribution (as detailed in Section 2.2) gives results reasonably close to the best we can achieve for Sensei-FW using the ground truth.

Experimental Setup
We modify the Char-RNN library and use Keras (Chollet et al., 2015) to implement our method. As our method is unsupervised, we do not employ the commonly used early-stopping scheme when training the Char-RNN. Instead, we train our models for 100 epochs and empirically find this to be sufficient. All the thresholds have three decimal places. We assign Ties as positives and Breaks as negatives. For the binary classifier, any supervised learning algorithm (e.g., logistic regression, SVM, etc.) would meet our needs in this work. We choose a vanilla multilayer perceptron with 2 fully-connected layers, each with 64 cells. We set the number of binary classifiers in our ensemble, $K$, to 100. The subsampling rate for the ensemble, $\epsilon$, is 10%, and for each subsampling, we use pandas with the iteration index as the seed. Training a Sensei model on a Colab GPU with 12GB RAM takes less than 40 minutes for each building. For the other compared methods, we tune them as best we can based on the recommended settings in their papers or repositories and report the best performance.

Result Analysis
Experimental results for all the methods are summarized in Table 2. Overall, Sensei outperforms all the compared methods significantly, which we attribute to its strategy of complementing the language model with a self-supervised ensemble classifier. Setting aside the variants of Sensei, the baseline Delimiter, though simple, achieves the second best performance among all other methods. On average, Delimiter achieves 33.81% in $F_1$ across all the buildings. By contrast, our Sensei achieves over 80% in $F_1$, which demonstrates a 49-point improvement over Delimiter. The $F_1$ scores of the other baselines, including ToPMine, BayesSeg, and the off-the-shelf tokenizers in NLTK, Stanza, and CoreNLP, are not competitive; this highlights the need for a solution to our challenging problem.
The performance of Delimiter also confirms the fact that the semantics of these delimiters are mixed. If one recalls the examples in Table 1, vendors usually use delimiters in sensor names. Sometimes, these delimiters well indicate the segment boundaries. However, as we illustrated in the example sensor name "SOD|H1|______|L_L", punctuation could be also used within the segment, and therefore simply segmenting at delimiters results in a considerable amount of false positives.
From Sensei-FW to Sensei, there is a significant boost, roughly 27 points in F 1 on average. Since the major difference between Sensei and Sensei-FW is our self-supervised ensemble learning module, we empirically verified its power. Comparing Sensei-FW and Sensei-BW, one can observe that the forward version performs dramatically better. As shown in Table 2, Sensei-FW performs better than Delimiter, ToPMine, and all the pre-trained tokenizers in all cases. By contrast, Sensei-BW takes the reversed sensor names as input but performs much worse than Sensei-FW. We notice that this is because there are not sufficient variations in the sensor string patterns when being looked at backward, compared to the forward case. For example, there are names like "SODA4R731__ASO" and "SODA1R516__VAV", and the Sensei-FW model can see various substrings (e.g., "ASO" and "VAV") following the common pattern "SODA0R000__". Variations as such provide enough information for the model to learn where to segment. However, when reversed, the above example becomes "OSA__000R0ADOS" and the prefix "OSA" sees no variations following, which makes it nearly impossible for Sensei-BW to figure out the right segmentation. Consequently, Sensei-FW better captures generative patterns while Sensei-BW achieves poor segmentation results.
Comparing Sensei-FW and Sensei-GS, one can observe that, in most cases (4 datasets out of 5), Sensei-FW finds the best single threshold found by Sensei-GS. Note that Sensei-GS utilizes the ground truth to exhaustively search among all the possible thresholds, while Sensei-FW decides the threshold based on the transition distribution without requiring any labels. This small difference in performance indicates that our data-driven threshold finding solution based on the distribution is reasonable and reliable.

Performance w.r.t. Number of Sensors
Since our Sensei framework is fully automated, its performance is solely decided by the amount and variety of available sensor names. As shown in Table 3, Sensei generally performs better when more sensor names are available, with the exception of Building A. We hypothesize that the performance relates more closely to the variety of sensor name patterns in the dataset than to its size.

Case Studies and Discussions
We next showcase some examples that Sensei correctly segments, in order to illustrate its capability.
"Flukes" for False Positives. In Building B, some of the Breaks are recognized as Ties by Sensei-FW and Sensei-GS. For example, 0F|_|SRVC|_|D0D0D0D00, GF|_|SRVC|_|QR000_000, are mistakenly segmented as 0F_SRVC|_|D0D0D0D00, GF_SRVC|_|QR000_000. By contrast, Sensei avoids the mistakes by learning the pattern from many other sensor names. The following case is a great example.
"Flukes" for False Negatives. Building E contains many cases as follows: SOD|A0|R000|__|ASO, SOD|A0|R000|__|AGN. Sensei-FW, and even Sensei-GS which employs the ground truth, are not able to segment these names correctly; they instead segment them as SOD|A0|R000|__|A|SO, SOD|A0|R000|__|A|GN, because of the same prefix "SODA0R000__A". By contrast, Sensei is able to correctly segment them owing to the self-supervised ensemble learning, which is more robust to noise in pseudo labels.
Discussion. We notice that even though Sensei on average achieves about 80% in $F_1$, it still has limitations. Sensei is sensitive to the variation of patterns in datasets: the patterns cannot be too varied or too monotonous.

Related Work
Our work is related to three lines of work, namely, sensor metadata tagging, language models and tokenization, and phrase mining.
Sensor Metadata Tagging. Sensor metadata tagging refers to the process of parsing and annotating the sensor metadata (or sensor name) for understanding a sensor's key context, including the measurement type (Balaji et al., 2015), location (Bhattacharya et al., 2015b), relationships with others (Koh et al., 2018), and many more (Schumann et al., 2014). The majority of this work exploits an active learning-based procedure (Settles, 2009), which iteratively selects an "informative" and "representative" metadata example for a domain expert to label, in order to learn a model to annotate the metadata. Complementary to the use of textual metadata, there are also efforts exploring the use of time-series data for inferring the sensor context (Koc et al., 2014; Pritoni et al., 2015). While these methods can significantly reduce the amount of required manual labeling, they still rely on the availability of at least one human annotator to segment, parse, and provide labels.
By contrast, the method proposed in this work is fully automated, i.e., it completely removes humans from the process, and we demonstrate its use in an essential first step: segmenting a sensor name string into meaningful substrings.
Language Model and Tokenization. Language models originate from the areas of natural language processing and information retrieval (Schütze et al., 2008). They aim at modelling the likelihood of observing one token given all the tokens before it, capturing the underlying language patterns. Recent advances in deep learning have pushed the language modeling from traditional n-gram models to neural language models (Kiros et al., 2014;Karpathy et al., 2015;Kim et al., 2016;Peters et al., 2018;Devlin et al., 2018), achieving significantly better performance using recurrent neural networks.
Analogizing sensor names to human languages, we employ neural language models to capture the underlying naming pattern. As we seek to segment a sensor name string into substrings, we choose the classic Char-RNN model (Karpathy et al., 2015). In general, any character-level language models are applicable in our method.
One can also view our problem as tokenization of sensor names. We thus compare with multiple existing tokenizers provided in NLTK (TweetTokenizer), Stanford CoreNLP (Manning et al., 2014), and Stanza (Qi et al., 2020). As we demonstrate in our evaluation, our method significantly outperforms these methods in segmenting sensor names.
Phrase Mining. Treating characters as words, our problem can be viewed as an unsupervised phrase mining problem with phrasal segmentation as output. Existing methods mainly leverage statistical signals based on term frequency in the corpus (Deane, 2005;Parameswaran et al., 2010;Danilevsky et al., 2014;El-Kishky et al., 2014). Among all these methods, ToPMine (El-Kishky et al., 2014) is arguably the most effective one. Our method Sensei significantly outperforms ToPMine in our empirical evaluation.

Conclusions and Future Work
In this paper, we study the problem of automating building metadata segmentation, which is an important first step to understanding the context of sensor data in buildings; smart building technologies rely on this information. We present Sensei, a fully automated method that requires no human labels. Sensei employs a character-level neural language model to capture the underlying generative patterns in building sensor names. Based on the probability distribution of character transitions (i.e., likelihood of observing the current character given the previous ones), it decides on two thresholds for sifting out examples that it is confident are Tie or Break. Considering these pseudo-labeled examples as supervision, Sensei constructs an ensemble of binary classifiers to segment sensor names with the information provided by the language model. We conducted experiments on the sensor names from five real-world buildings, and Sensei on average achieves $F_1$ over 80% in segmenting sensor names, a roughly 49-point improvement over the best of the compared methods.
As future work, gathering a larger collection of sensor metadata to pre-train our language model might significantly improve Sensei's performance. We also plan to explore further uses of Sensei in standard NLP tasks.