Guidelines for Preparing Datasets for NLP
- The input data should be as diverse as possible to avoid overfitting and to help the underlying model generalize better to real examples.
- The sample dataset should be representative of the population.
- Datasets should not be biased towards one type (category/label).
- Prepare the initial manual dataset with care. An imprecise dataset can lead to poor results.
- Input data should not contain duplicates. Duplicate samples can leak into the test set (test set contamination) and therefore negatively affect the training process, model metrics, and model behavior. The first sketch after this list illustrates a simple duplicate and label-count check.
- Provide documents/samples that resemble actual use cases as closely as possible. Do not use toy data or synthesized data for production systems.
- For good results, provide at least 100 distinct records for each label/class.
- In general, a larger dataset will lead to better results.
- Aim for the same data distribution in training as you expect at prediction time in production. For example, if documents with no entities/categories will appear at prediction time, such documents should also be part of your training set and labeled as ‘other.’ The second sketch after this list illustrates comparing the two distributions.
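
The following is a minimal sketch of the duplicate and label-count checks mentioned above, assuming the dataset is available as a list of (text, label) pairs. The `MIN_RECORDS_PER_LABEL` constant and the 5x imbalance threshold are illustrative assumptions, not fixed requirements.

```python
from collections import Counter

MIN_RECORDS_PER_LABEL = 100  # threshold suggested in the guidelines above

def check_dataset(records):
    """records: list of (text, label) tuples. Returns a deduplicated copy."""
    # Drop exact duplicate texts so the same sample cannot end up
    # in both the training and test splits.
    seen = set()
    deduped = []
    for text, label in records:
        if text not in seen:
            seen.add(text)
            deduped.append((text, label))
    print(f"Removed {len(records) - len(deduped)} duplicate samples")

    # Count records per label and flag under-represented classes.
    counts = Counter(label for _, label in deduped)
    for label, count in counts.items():
        if count < MIN_RECORDS_PER_LABEL:
            print(f"Label '{label}' has only {count} records "
                  f"(recommended: at least {MIN_RECORDS_PER_LABEL})")

    # Flag strong class imbalance (most frequent vs. least frequent label).
    if counts:
        most, least = max(counts.values()), min(counts.values())
        if most > 5 * least:  # assumed imbalance threshold
            print("Warning: dataset is heavily skewed toward one label")

    return deduped
```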
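
The second sketch compares the training label distribution against the share of each label you expect in production, including the ‘other’ label for documents with no entities/categories. The label names and expected shares are hypothetical placeholders.

```python
from collections import Counter

def compare_distributions(train_labels, expected_production_share):
    """Print the training share of each label next to the share you
    expect to see at prediction time in production."""
    counts = Counter(train_labels)
    total = sum(counts.values())
    for label, expected in expected_production_share.items():
        actual = counts.get(label, 0) / total
        print(f"{label}: train={actual:.1%}, expected in production={expected:.1%}")

# Hypothetical expected shares; documents with no entities/categories
# are labeled 'other' and must appear in the training set as well.
expected = {"invoice": 0.4, "contract": 0.3, "other": 0.3}
train_labels = ["invoice"] * 350 + ["contract"] * 280 + ["other"] * 120
compare_distributions(train_labels, expected)
```

A large gap between the two columns for any label is a signal to collect more samples of that type before training.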