Sifter: a generalized, efficient, and scalable big data corpus generator