Guidelines for Preparing Datasets for NLP
- The input data should be as diverse as possible to avoid overfitting and to help the underlying model generalize better to real examples.
- The sample dataset should be representative of the population.
- Datasets should not be biased towards one type (category/label).
- Prepare the initial manual dataset with care. An imprecise dataset can lead to poor results.
- Input data should not contain duplicates. Duplicate samples can leak into the test set (test set contamination) and therefore negatively affect the training process, model metrics, and model behavior. The first sketch after this list illustrates a simple duplicate and label-count check.
- Provide documents/samples that resemble actual use cases as closely as possible. Do not use toy data or synthesized data for production systems.
- For good results, provide at least 100 distinct records for each label/class.
- In general, a larger dataset will lead to better results.
- Aim for the same data distribution in training as you expect at prediction time in production. For example, if documents with no entities/categories will appear at prediction time, such documents should also be part of your training set and labeled as ‘other.’ The second sketch after this list illustrates comparing the two distributions.
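
The following is a minimal sketch of the duplicate and label-count checks mentioned above, assuming the dataset is available as a list of (text, label) pairs. The `MIN_RECORDS_PER_LABEL` constant and the 5x imbalance threshold are illustrative assumptions, not fixed requirements.

```python
from collections import Counter

MIN_RECORDS_PER_LABEL = 100  # threshold suggested in the guidelines above

def check_dataset(records):
    """records: list of (text, label) tuples. Returns a deduplicated copy."""
    # Drop exact duplicate texts so the same sample cannot end up
    # in both the training and test splits.
    seen = set()
    deduped = []
    for text, label in records:
        if text not in seen:
            seen.add(text)
            deduped.append((text, label))
    print(f"Removed {len(records) - len(deduped)} duplicate samples")

    # Count records per label and flag under-represented classes.
    counts = Counter(label for _, label in deduped)
    for label, count in counts.items():
        if count < MIN_RECORDS_PER_LABEL:
            print(f"Label '{label}' has only {count} records "
                  f"(recommended: at least {MIN_RECORDS_PER_LABEL})")

    # Flag strong class imbalance (most frequent vs. least frequent label).
    if counts:
        most, least = max(counts.values()), min(counts.values())
        if most > 5 * least:  # assumed imbalance threshold
            print("Warning: dataset is heavily skewed toward one label")

    return deduped
```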
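
The second sketch compares the training label distribution against the share of each label you expect in production, including the ‘other’ label for documents with no entities/categories. The label names and expected shares are hypothetical placeholders.

```python
from collections import Counter

def compare_distributions(train_labels, expected_production_share):
    """Print the training share of each label next to the share you
    expect to see at prediction time in production."""
    counts = Counter(train_labels)
    total = sum(counts.values())
    for label, expected in expected_production_share.items():
        actual = counts.get(label, 0) / total
        print(f"{label}: train={actual:.1%}, expected in production={expected:.1%}")

# Hypothetical expected shares; documents with no entities/categories
# are labeled 'other' and must appear in the training set as well.
expected = {"invoice": 0.4, "contract": 0.3, "other": 0.3}
train_labels = ["invoice"] * 350 + ["contract"] * 280 + ["other"] * 120
compare_distributions(train_labels, expected)
```

A large gap between the two columns for any label is a signal to collect more samples of that type before training.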