In the figures, the points are colored and labeled by stock category (e.g., "Kweichow Moutai," "Laobaigan Liquor," "China Merchants Bank"). Overall, t-SNE preserves local structure well, producing clearer separation between clusters, though it may introduce some noise. SPCA and UMAP yield more compact cluster distributions with relatively clear boundaries, making them suitable for scenarios that require better interpretation of large-scale features. Although PCA preserves the principal components of the data, its clustering performance appears inferior to that of the other three methods.
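A minimal sketch of how such a side-by-side comparison could be produced, assuming a feature matrix `X` and integer stock-category labels `y` (both hypothetical placeholders rather than the paper's actual data); it uses scikit-learn for PCA, SPCA, and t-SNE, and the umap-learn package for UMAP:

```python
# Sketch: compare four dimensionality-reduction methods side by side.
# X (n_samples x n_features) and y (stock-category labels) are assumed
# to exist; both are hypothetical placeholders.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA, SparsePCA
from sklearn.manifold import TSNE
import umap  # from the umap-learn package

reducers = {
    "PCA": PCA(n_components=2),
    "SPCA": SparsePCA(n_components=2, random_state=0),
    "t-SNE": TSNE(n_components=2, perplexity=30, random_state=0),
    "UMAP": umap.UMAP(n_components=2, random_state=0),
}

fig, axes = plt.subplots(1, 4, figsize=(20, 4))
for ax, (name, reducer) in zip(axes, reducers.items()):
    Z = reducer.fit_transform(X)  # project to 2-D
    ax.scatter(Z[:, 0], Z[:, 1], c=y, cmap="tab10", s=10)
    ax.set_title(name)
plt.tight_layout()
plt.show()
```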
Classification on the raw data performed very poorly. Applying SVM directly improved performance, but far from sufficiently. Sparse principal component analysis (SPCA) extracted key features during dimensionality reduction, further improving classification accuracy, and particle swarm optimization (PSO) was used to tune the SVM parameters, significantly improving the model's classification performance. t-SNE, which preserves local data structure, combined with the PSO-optimized SVM achieved the best result, with an accuracy of 74.07%, clearly exceeding the other methods.
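A minimal sketch of PSO-based SVM tuning under stated assumptions: `X_red` and `y` are hypothetical placeholders for the reduced features and labels, the swarm searches (C, gamma) in log10 space, candidates are scored by five-fold cross-validated accuracy, and the swarm size, iteration count, and inertia/acceleration coefficients are illustrative rather than the paper's settings:

```python
# Sketch: tune SVM (C, gamma) with a simple global-best particle swarm.
# X_red and y are hypothetical placeholders for the reduced features/labels.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_particles, n_iters = 20, 30
lo, hi = np.array([-2.0, -4.0]), np.array([3.0, 1.0])  # log10 bounds for (C, gamma)

pos = rng.uniform(lo, hi, size=(n_particles, 2))  # particle positions
vel = np.zeros_like(pos)                          # particle velocities

def fitness(p):
    """Cross-validated accuracy of an RBF SVM at log-space parameters p."""
    clf = SVC(kernel="rbf", C=10 ** p[0], gamma=10 ** p[1])
    return cross_val_score(clf, X_red, y, cv=5).mean()

pbest = pos.copy()
pbest_val = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_val.argmax()].copy()

w, c1, c2 = 0.7, 1.5, 1.5  # inertia, cognitive, and social coefficients
for _ in range(n_iters):
    r1, r2 = rng.random((n_particles, 2)), rng.random((n_particles, 2))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, lo, hi)
    vals = np.array([fitness(p) for p in pos])
    improved = vals > pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmax()].copy()

print(f"best log10(C, gamma) = {gbest}, CV accuracy = {pbest_val.max():.4f}")
```

A hand-rolled PSO is used here to keep the example dependency-free; a library such as pyswarms would serve equally well.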
A document-term matrix (DTM) was created, giving the number of occurrences of each word in each document. The eight most frequent terms in each document were then extracted to construct a new matrix, reducing data sparsity while retaining important features. A Word2Vec model was trained on the text file corpus_E.txt, with a word-embedding dimension of 1024 and 100 training iterations. For each topic word, the model was queried for the 20 most closely related words and their embedding vectors were recorded; the top 20 words belonging to the same topic were grouped together.
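A sketch of this pipeline, assuming gensim for Word2Vec and scikit-learn for the DTM; the placeholder `docs` (raw document strings) and the query word "market" are illustrative assumptions, while the file name corpus_E.txt, the 1024-dimensional embeddings, and the 100 iterations come from the text:

```python
# Sketch: build a DTM, keep each document's top-8 terms, and train Word2Vec.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# --- DTM and top-8 terms (docs is a hypothetical list of document strings) ---
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs).toarray()           # counts per document
terms = np.array(vectorizer.get_feature_names_out())
top8_terms = [terms[np.argsort(row)[::-1][:8]] for row in dtm]

# --- Word2Vec on corpus_E.txt (one whitespace-tokenized document per line) ---
model = Word2Vec(
    LineSentence("corpus_E.txt"),
    vector_size=1024,  # embedding dimension used in the text
    epochs=100,        # number of training iterations used in the text
    min_count=1,
    workers=4,
)

# For a given topic word, retrieve the 20 closest words and their vectors.
neighbors = model.wv.most_similar("market", topn=20)
vectors = {word: model.wv[word] for word, _score in neighbors}
```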
Sparse principal component analysis (SPCA) was used to reduce the dimensionality of both feature representations. The reduced data were then clustered with K-means, dividing the documents into several subgroups. The embedding vectors were clearly the easier of the two representations to cluster.
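A minimal sketch of the reduce-then-cluster step, assuming `embedding_features` and `dtm_features` are hypothetical matrices holding the two representations and that the number of subgroups (here 4) is an illustrative assumption:

```python
# Sketch: reduce each representation with SPCA, then cluster with K-means.
from sklearn.decomposition import SparsePCA
from sklearn.cluster import KMeans

def reduce_and_cluster(features, n_components=2, n_clusters=4):
    """Project features with SPCA, then assign K-means cluster labels."""
    reduced = SparsePCA(n_components=n_components, random_state=0).fit_transform(features)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(reduced)
    return reduced, labels

# Both input matrices are hypothetical placeholders.
emb_red, emb_labels = reduce_and_cluster(embedding_features)
dtm_red, dtm_labels = reduce_and_cluster(dtm_features)
```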
Without any parameter optimization, SVM classification performance was poor, with an accuracy of 0%. After introducing PSO, accuracy improved to 55.56%: optimizing the SVM parameters with PSO improved model performance to some extent, but the features' representational power remained limited. The word embeddings generated by Word2Vec captured the semantic information of the words, and after dimensionality reduction with SPCA the SVM model achieved 100% classification accuracy. Choosing the right feature representation and model-optimization techniques is therefore crucial for improving classification and clustering accuracy.
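For completeness, a sketch of the final evaluation, assuming `emb_red` from the clustering sketch above and hypothetical per-document topic labels `topics`; the train/test split ratio is an assumption:

```python
# Sketch: evaluate SVM accuracy on the SPCA-reduced Word2Vec features.
# topics is a hypothetical placeholder for the per-document labels.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X_tr, X_te, y_tr, y_te = train_test_split(emb_red, topics, test_size=0.3, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print(f"accuracy = {accuracy_score(y_te, clf.predict(X_te)):.4f}")
```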