LAION-5B

LAION-5B is a large-scale, openly released dataset of image-text pairs, published in 2022 by the non-profit LAION to support research on large multimodal models. Key facts:

Contains about 5.85 billion CLIP-filtered image-text pairs.
Captions come from image alt-text found in Common Crawl web pages.
One of the largest publicly available multimodal AI training sets.
Used to train open models such as Stable Diffusion and OpenCLIP.
Pairs each image with free-form text, unlike image datasets that offer only fixed class labels.
Covers a broad range of subjects because it is drawn from the open web.
Offers diverse content, but as largely uncurated web data it still carries the social biases found online.
Released freely to democratize access to large-scale training data.
Lets researchers outside large companies train and study large multimodal models.
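
As a rough illustration of the alt-text sourcing mentioned in the list above, the sketch below pulls (image URL, alt text) pairs out of an HTML page using only Python's standard library. The class name AltTextCollector, the helper extract_pairs, and the sample HTML are hypothetical and chosen for illustration. This is not LAION's actual pipeline, which works at Common Crawl scale and applies further filtering not shown here.

```python
from html.parser import HTMLParser


class AltTextCollector(HTMLParser):
    """Collect (src, alt) pairs from <img> tags that carry non-empty alt text."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # Only <img> tags are of interest; self-closing <img/> also lands here.
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        # A missing or valueless alt attribute is treated as empty text.
        alt = (attr_map.get("alt") or "").strip()
        if src and alt:
            self.pairs.append((src, alt))


def extract_pairs(html):
    """Return a list of (image URL, alt text) tuples found in the HTML string."""
    collector = AltTextCollector()
    collector.feed(html)
    collector.close()
    return collector.pairs


# Hypothetical sample page: only the first image has usable alt text.
SAMPLE_HTML = """
<html><body>
  <img src="https://example.com/cat.jpg" alt="A tabby cat on a sofa">
  <img src="https://example.com/spacer.gif" alt="">
  <img src="https://example.com/logo.png">
</body></html>
"""

pairs = extract_pairs(SAMPLE_HTML)
print(pairs)
```

Images with empty or missing alt text are skipped, since they would not yield a usable caption.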

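The scale and breadth of LAION-5B help models learn a wide range of visual concepts and their descriptions. Scale alone does not ensure harmless or accurate output, however: models trained on web data can reproduce the stereotypes, toxicity, and falsehoods it contains unless the data is filtered further.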

LAION-5B opens web-scale training data to public inspection, so that its contents and shortcomings can be audited. Its authors describe the dataset as uncurated and recommend it for research purposes.