Financial Public Opinion Sentiment Analysis


Data collection uses the FinChina SA dataset, which contains 288,788 high-quality articles selected from financial news and annotated with 21,272 sentiment data items. The report introduces methods for data preprocessing, sparse dimensionality reduction, and sentiment classification in R, focusing on sentiment prediction in the financial sector. The improvements include combining particle swarm optimization (PSO) with a support vector machine (SVM) to optimize the classification performance obtained with sparse principal component analysis (SPCA) features. The report also discusses the GeoSPCA algorithm, a geometric sparse PCA method that iteratively approaches the optimal solution via cutting planes, aiming to overcome the difficulties of traditional non-convex optimization.
Paper Title: Gradient-based Sparse Principal Component Analysis with Extensions to Online Learning
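
As a small illustration of the sparse dimensionality-reduction step described above, the sketch below contrasts ordinary PCA with sparse PCA in R. It uses the elasticnet package's spca() (Zou's SPCA with a lasso penalty) on a placeholder matrix; the matrix, the number of components, and the penalty values are assumptions for illustration, not the settings actually used in this report.

    # Minimal sketch: ordinary PCA vs. sparse PCA (elasticnet::spca) in R.
    # 'dtm' is a placeholder numeric feature matrix (e.g. a TF-IDF document-term
    # matrix); component counts and penalty values are illustrative only.
    library(elasticnet)

    set.seed(1)
    dtm <- matrix(rnorm(200 * 50), nrow = 200, ncol = 50)  # stand-in for real features

    # Ordinary PCA: every variable contributes to every loading vector.
    pc <- prcomp(dtm, center = TRUE, scale. = TRUE)

    # Sparse PCA: an L1 (lasso) penalty on each loading vector zeroes out most
    # entries, so each component depends on only a few original keywords.
    sp <- spca(scale(dtm), K = 3, para = rep(0.1, 3),
               type = "predictor", sparse = "penalty")

    colSums(pc$rotation[, 1:3] != 0)   # dense loadings: all 50 variables contribute
    colSums(sp$loadings != 0)          # sparse loadings: only a handful per component

    # Project documents onto the sparse components for downstream clustering / SVM.
    scores <- scale(dtm) %*% sp$loadings
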
I compare the following dimensionality-reduction configurations using several metrics, including log-likelihood, BIC (Bayesian information criterion), ICL (integrated completed likelihood), the silhouette coefficient, the Calinski-Harabasz (CH) index, the adjusted Rand index, and normalized mutual information (NMI); an illustrative sketch of the proximal-gradient SPCA variant follows the list:
  • PCA_90 represents the extraction of 90 principal components via principal component analysis (cumulative variance contribution: 64.854%);
  • SPCA_BP_7 represents the selection of the first seven principal components after optimal parameter adjustment;
  • SPCA_Pro_3 and SPCA_Pro_7 represent the selection of the first three and the first seven sparse principal components, respectively, from SPCA solved with the proximal gradient algorithm;
  • SPCA_Grand_3 and SPCA_Grand_7 represent the selection of the first three and the first seven sparse principal components, respectively, from gradient-based sparse principal component analysis.
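
The exact proximal-gradient implementation is not reproduced here, so the sketch below is only one plausible reading of "the proximal gradient algorithm combined with SPCA": a gradient step on the explained-variance objective, soft-thresholding as the proximal step for the L1 penalty, renormalization, and deflation between components. The step size, penalty, and iteration counts are assumed values.

    # Hedged sketch of proximal-gradient sparse PCA (one possible reading of the
    # SPCA_Pro variants; not the report's actual code). Each component maximizes
    # v' S v - lambda * ||v||_1 over the unit sphere via a gradient step plus
    # soft-thresholding, and the covariance is deflated between components.
    soft_threshold <- function(x, t) sign(x) * pmax(abs(x) - t, 0)

    prox_spca <- function(X, k = 3, lambda = 0.1, step = 0.01, iters = 500) {
      S <- cov(scale(X))                     # covariance of the standardized data
      p <- ncol(S)
      V <- matrix(0, nrow = p, ncol = k)
      for (j in seq_len(k)) {
        v <- rnorm(p); v <- v / sqrt(sum(v^2))
        for (it in seq_len(iters)) {
          v <- v + step * (S %*% v)              # gradient step on v' S v
          v <- soft_threshold(v, step * lambda)  # proximal step for the L1 term
          nrm <- sqrt(sum(v^2))
          if (nrm > 0) v <- v / nrm              # project back onto the unit sphere
        }
        V[, j] <- v
        S <- S - as.numeric(t(v) %*% S %*% v) * (v %*% t(v))  # deflation
      }
      V
    }

    # Example on a placeholder matrix: count the nonzero loadings per component.
    set.seed(1)
    X <- matrix(rnorm(200 * 50), 200, 50)
    V <- prox_spca(X, k = 3, lambda = 0.2)
    colSums(V != 0)
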
When clustering keywords, SPCA_Pro_3 achieved the highest silhouette coefficient, 0.188, indicating the most distinct clustering structure among the configurations. It also achieved the highest CH index (the ratio of between-class to within-class variance, where higher is better) at 179.225, indicating strong clustering quality. A higher adjusted Rand index is preferred, as it measures the agreement between the clustering results and the true labels; SPCA_Pro_7 achieved the highest adjusted Rand index at 0.203, and SPCA_Grand_7 achieved the highest NMI at 0.407, indicating that their clustering results carry relatively consistent label information. When targeting sentiment classification, the results were similar, except that SPCA_Pro_3 achieved the highest adjusted Rand index at 0.197.

Overall, SPCA_Pro_3 performed well on the CH index and the silhouette coefficient, indicating high clustering quality and clear inter-class separation. SPCA_Grand_7 achieved the highest NMI, indicating high consistency between its clustering results and the original class labels of the data.
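
For reference, the clustering-quality metrics reported above can be computed in R roughly as follows. The k-means call, the cluster count, and the variable names are placeholders, and the packages used (cluster, fpc, mclust, aricode) are one common choice rather than necessarily the ones behind the numbers in this report.

    # Hedged sketch: silhouette coefficient, CH index, adjusted Rand index, and NMI.
    library(cluster)   # silhouette()
    library(fpc)       # calinhara(): Calinski-Harabasz index
    library(mclust)    # adjustedRandIndex()
    library(aricode)   # NMI()

    set.seed(1)
    scores <- matrix(rnorm(300 * 7), 300, 7)          # e.g. 7 sparse PC scores
    true_labels <- sample(1:3, 300, replace = TRUE)   # e.g. sentiment labels

    km <- kmeans(scores, centers = 3, nstart = 25)

    sil <- silhouette(km$cluster, dist(scores))
    mean(sil[, "sil_width"])                    # average silhouette coefficient
    calinhara(scores, km$cluster)               # between- / within-class dispersion
    adjustedRandIndex(km$cluster, true_labels)  # agreement with the true labels
    NMI(km$cluster, true_labels)                # normalized mutual information
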

When I used PSO+SVM to perform sentiment classification on the features produced by each processing method, SPCA_Grand_7 again achieved the highest accuracy on the test set. Notably, despite both using 90 principal components, SPCA_BP achieved a 72.53% improvement in accuracy over PCA. In addition, after several rounds of parameter optimization, I matched the accuracy obtained with 90 principal components using only 7 sparse principal components (suggesting that further parameter tuning could improve the results even more). Combining the proximal gradient algorithm with SPCA yielded better results than SPCA alone, and gradient-based SPCA with 7 sparse principal components performed best, achieving an accuracy of 83.67%, an 80.22% improvement over PCA.
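
The report does not spell out the PSO+SVM configuration, so the following is only a minimal sketch of one common setup: pso::psoptim searches the (cost, gamma) parameters of an RBF-kernel e1071::svm, scored by cross-validated accuracy. The search ranges, swarm settings, and data names are illustrative assumptions.

    # Hedged sketch of PSO-tuned SVM classification in R (one common setup; not
    # necessarily the search space or settings used in this report).
    library(e1071)   # svm()
    library(pso)     # psoptim()

    set.seed(1)
    X <- matrix(rnorm(300 * 7), 300, 7)          # placeholder: sparse PC scores
    y <- factor(sample(c("neg", "neu", "pos"), 300, replace = TRUE))

    # Objective: negative 5-fold cross-validated accuracy of an RBF SVM,
    # with log2(cost) and log2(gamma) as the particle coordinates.
    cv_objective <- function(par) {
      fit <- svm(x = X, y = y, type = "C-classification", kernel = "radial",
                 cost = 2^par[1], gamma = 2^par[2], cross = 5)
      -fit$tot.accuracy                          # psoptim minimizes, so negate
    }

    res <- psoptim(par = c(0, -3), fn = cv_objective,
                   lower = c(-5, -10), upper = c(10, 3),
                   control = list(maxit = 30, s = 10))  # 30 iterations, 10 particles

    tuned <- svm(x = X, y = y, type = "C-classification", kernel = "radial",
                 cost = 2^res$par[1], gamma = 2^res$par[2])
    mean(predict(tuned, X) == y)                 # resubstitution accuracy of the tuned model
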