T-SNE
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a machine learning algorithm used for dimensionality reduction and visualization of high-dimensional data. Developed by Geoffrey Hinton and Laurens van der Maaten in 2008, t-SNE has become a popular method for visualizing complex datasets, particularly in fields like bioinformatics, image processing, and natural language processing.
The primary goal of t-SNE is to map high-dimensional data points to a lower-dimensional space, such as 2D or 3D, in a way that preserves the pairwise similarities between points. In other words, if two points are close to each other in the original high-dimensional space, they should also be close in the reduced space, and vice versa.
t-SNE starts by measuring pairwise similarities between points in the high-dimensional space using a Gaussian distribution centered at each point. It then constructs a similar probability distribution in the lower-dimensional space. The algorithm iteratively adjusts the lower-dimensional representation to minimize the divergence between the two distributions, typically using the Kullback-Leibler divergence as the objective function.
One of the key features of t-SNE is its use of a t-distribution, rather than a Gaussian distribution, to measure similarities in the lower-dimensional space. This results in a "heavier-tailed" distribution that is more robust to the crowding problem, a common issue in dimensionality reduction where points in the lower-dimensional space tend to crowd together.
While t-SNE is effective for visualizing high-dimensional data, it has some limitations. The algorithm is computationally intensive, making it less suitable for very large datasets. It is also sensitive to hyperparameters like the perplexity, which controls the balance between preserving local and global structures. Additionally, t-SNE does not provide a deterministic mapping from the high-dimensional space to the lower-dimensional space, meaning that running the algorithm multiple times on the same data may produce different results.
In summary, t-SNE is a machine learning algorithm used for dimensionality reduction and data visualization. It aims to preserve the pairwise similarities between data points when mapping them from a high-dimensional space to a lower-dimensional one. The algorithm has been widely adopted for visualizing complex datasets but comes with computational costs and sensitivity to hyperparameters.