How to explain the outcome of k

I am currently conducting some analysis using NTSB aviation accident database. There are cause statements for most of the aviation incidents in this dataset that describe the factors lead to such event.

One of my objectives here is to try to group the causes, and clustering seems to be a feasible way to solve this kind of problem. I performed the followings prior to the beginning of k-means clustering:

  • Stop-word removal, that is, to remove some common functional words in the text
  • Text stemming, that is, to remove a word's suffix, and if necessary, transform the term into its simplest form
  • Vectorised the documents into TF-IDF vector to scale up the less-common but more-informative words and scale down highly-common but less-informative words
  • Applied SVD to reduce the dimensionality of vector
  • After these steps k-means clustering is applied to the vector. By using the events that occurred from Jan 1985 to Dec 1990 I get the following result with number of clusters k = 3 :

    (Note: I am using Python and sklearn to work on my analysis)

    ... some output omitted ... 
    Clustering sparse data with KMeans(copy_x=True, init='k-means++', max_iter=100, n_clusters=3, n_init=1,
        n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
        verbose=True)
    Initialization complete
    Iteration  0, inertia 8449.657
    Iteration  1, inertia 4640.331
    Iteration  2, inertia 4590.204
    Iteration  3, inertia 4562.378
    Iteration  4, inertia 4554.392
    Iteration  5, inertia 4548.837
    Iteration  6, inertia 4541.422
    Iteration  7, inertia 4538.966
    Iteration  8, inertia 4538.545
    Iteration  9, inertia 4538.392
    Iteration 10, inertia 4538.328
    Iteration 11, inertia 4538.310
    Iteration 12, inertia 4538.290
    Iteration 13, inertia 4538.280
    Iteration 14, inertia 4538.275
    Iteration 15, inertia 4538.271
    Converged at iteration 15
    
    Silhouette Coefficient: 0.037
    Top terms per cluster:
    **Cluster 0: fuel engin power loss undetermin exhaust reason failur pilot land**
    **Cluster 1: pilot failur factor land condit improp accid flight contribute inadequ**
    **Cluster 2: control maintain pilot failur direct aircraft airspe stall land adequ**
    

    and I generated a plot graph of the data as follows:

    绘制k均值聚类的结果

    The result doesn't seem like make sense to me. I wonder why all of the clusters contain some common terms like "pilot" and "failure".

    One possibility that I can think of (but I am not sure if it is valid in this case) is the documents with these common terms are actually located at the very centre of the the plot graph, therefore they can not be efficiently clustered into a right cluster. I believe this problem cannot be addressed by increasing the number of clusters, as I have just done it and this problem persists.

    I just want to know if there is any other factors that could cause the scenario that I am facing? Or more broadly, am I using the right clustering algorithm?

    Thanks SO.


    I do not want to be a carrier of bad news, but ...

  • Clustering is a very bad exploration technique - mostly because without a clear, task oriented aim, clustering techniques are actually focused on optimization of some mathematical criterions, which rarely have anything to do with what you want to achieve. Thus k-means in particular will look for minimization of the euclidean distances from cluster centers to all points inside a cluster. Is this anyhow related with the task you want to achieve? Usually the answer is "no", or in the best case "I have no idea".
  • Representing documents as bag of words leads to very general look at your data, thus it is not a good approach to distinguish between similar objets. Such an approach can be used to distinguish between texts about guns from texts about hockey, but not specialistic texts from the very same domain (which seems to be the case here)
  • In the end - you cannot really evaluate a clustering, and this is the biggest issue. Thus there are no well established techniques of fitting best clustering.
  • So, to answer your final questions

    I just want to know if there is any other factors that could cause the scenario that I am facing?

    There are thousands of such factors. Finding actual, reasonable from the human perspectice, clusters in data is extremely hard. Finding any clusters is exteremely simple - because every clustering technique will find something. But in order to find what is important here one would have to go through whole data exploration here.

    Or more broadly, am I using the right clustering algorithm?

    Probably not, as k-means is simply a method of minimizing of inner cluster sum of euclidean distances, thus it will not work in most real world scenarios.

    Unfortunately - this is not the kind of problem where you can just ask "which alogirhtm to use?" and someone will offer you exact solution.

    You have to dig in your data, figure out:

  • way of representation - is tfidf really good? have you preprocessed the vocablurary? Removed meaningless words? Maybe it is wort considering going for some modern word/document representation learning?
  • structure in your data - in order to find best model you should visualize your data, investigate, run statistical analysis, try to figure out what is an underlying metric. Is there any reasonable distribution of points? Are these gaussians? Gaussian mixtures? Is your data sparse?
  • can you provide some expert knowledge? Maybe you can divide part of dataset yourself? semi-supervised techniques are much better defined then any unsupervised ones, thus you might easily get much better results.
  • 链接地址: http://www.djcxy.com/p/91134.html

    上一篇: 从Google BigQuery导出到CloudSQL?

    下一篇: 如何解释k的结果