All scripts should be run from the root directory of this project.
## Data
`data` contains all the generated data. Each file is a dictionary whose keys are canonical words; each key maps to multiple noisy forms of that word.
The datasets are named `{sentiment}-{noise-type}.json`. For example, `neg-typos.json` contains words with negative sentiment whose noisy forms are generated by typos.
`data_generator` contains the scripts that generate this noisy data. To extract noisy words from Twitter, download the tweet data from [Kaggle](https://www.kaggle.com/datasets/kazanova/sentiment140) into the `data` folder and preprocess it with `tweet_preprocess.py`.
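
As a rough illustration of the data layout, the sketch below loads one dataset file and splits its name into sentiment and noise type. It assumes each canonical word maps to a JSON list of strings, which this README does not spell out. The helper names `parse_dataset_name` and `load_noisy_words` and the sample words are hypothetical and not part of this repository.

```python
import json
import tempfile
from pathlib import Path


def parse_dataset_name(name):
    """Split a name such as 'neg-typos' into (sentiment, noise_type)."""
    sentiment, noise_type = name.split("-", 1)
    return sentiment, noise_type


def load_noisy_words(path):
    """Load a dataset file into a dict: canonical word -> noisy forms."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical sample content; the real files live in the `data` directory.
sample = {"terrible": ["terible", "terrrible"], "awful": ["awfull"]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "neg-typos.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    sentiment, noise_type = parse_dataset_name(path.stem)
    data = load_noisy_words(path)

# Walk over each canonical word and its noisy variants.
for canonical, noisy_forms in data.items():
    print(sentiment, noise_type, canonical, noisy_forms)
```

To read a real file, pass its path (for example `data/neg-typos.json`) to `load_noisy_words` instead of the temporary file.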



## Corruption result
We save all the results in the `data` directory. You can inspect them with the script `check_result.py`.
```
$ python check_result.py --help # see help information for valid argument parameters
$ python check_result.py --dataset_name neg-typos --model_name bert-yelp
```
We also provide Python code in `check_result.ipynb` to load the dataframe and analyze the results further.

To reproduce our results, you can use the script `word_corruption.py`.
```
$ dataset_name=neg-typos
$ model_name=bert-base-uncased-SST-2
$ python word_corruption.py --dataset_name $dataset_name --model_name $model_name 
```
The script generates the result as a pandas DataFrame.


## (Optional) Evaluating new models
To evaluate other models from the Hugging Face Hub, add your model name and the corresponding model path as a key/value pair to the dictionary `hf_model_names` in the `resource` module.
All the experiments can then be rerun on the new model.
```
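import resource  # the project's `resource` module

dataset_name = "neg-typos"
plm_name = "bert"
tokenizer = resource.hf_tokenizers[plm_name]
# str.join takes a single iterable, so the parts are passed as a list
plm = resource.hf_models["-".join([plm_name, dataset_name])]
data = resource.datasets[dataset_name]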
```
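
Registering a new model amounts to adding one entry to a dictionary. The sketch below shows what that might look like, using a stand-in dictionary rather than the real `hf_model_names`. The helper `register_model`, the model names, and the hub paths are all hypothetical placeholders; in practice you would edit the dictionary in the `resource` module directly.

```python
def register_model(registry, name, hub_path):
    """Add a model name -> hub path entry, refusing to overwrite an existing one."""
    if name in registry:
        raise ValueError(f"model name already registered: {name}")
    registry[name] = hub_path
    return registry


# Stand-in for the `hf_model_names` dictionary in the `resource` module.
# Both the key and the path below are placeholders, not real entries.
hf_model_names = {
    "existing-model": "some-org/existing-model",
}

# Hypothetical new model: the key is the name later passed as --model_name,
# the value is the model path on the Hugging Face Hub.
register_model(hf_model_names, "my-new-model", "my-org/my-new-model")

print(sorted(hf_model_names))
```

The overwrite check is a design choice of this sketch: it guards against silently replacing an entry that existing results may depend on.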