What statistics should a programmer (or computer scientist) know?

I'm a programmer with a decent background in math and computer science. I've studied computability, graph theory, linear algebra, abstract algebra, algorithms, and a little probability and statistics (through a few CS classes) at an undergraduate level.

I feel, however, that I don't know enough about statistics. Statistics is increasingly useful in computing: statistical natural language processing helps fuel some of Google's algorithms for search and machine translation, performance analysis of hardware, software, and networks needs proper statistical grounding to be at all believable, and fields like bioinformatics are becoming more prevalent every day.

I've read about how "Google uses Bayesian filtering the way Microsoft uses the if statement", and I know the power of even fairly naïve, simple statistical approaches to problems from Paul Graham's A Plan for Spam and Better Bayesian Filtering, but I'd like to go beyond that.

I've tried to look into learning more statistics, but I've gotten a bit lost. The Wikipedia article has a long list of related topics, but I'm not sure which I should look into. From what I've seen, a lot of statistics seems to assume that everything is a linear combination of factors plus some Gaussian random noise; I'm wondering what I should learn beyond linear regression, or whether I should spend the time to really understand that before moving on to other techniques. I've found a few long lists of books to look at; where should I start?

So I'm wondering where to go from here; what to learn, and where to learn it. In particular, I'd like to know:

  • What kind of problems in programming, software engineering, and computer science are statistical methods well suited for? Where am I going to get the biggest payoffs?
  • What kind of statistical methods should I spend my time learning?
  • What resources should I use to learn this? Books, papers, web sites. I'd appreciate a discussion of what each book (or other resource) is about, and why it's relevant.
  • To clarify what I am looking for: I am interested in which problems that programmers typically deal with can benefit from a statistical approach, and what kinds of statistical tools can be useful. For instance:

  • Programmers frequently need to deal with large databases of text in natural languages, and help to categorize, classify, search, and otherwise process it. What statistical techniques are useful here?
  • More generally, artificial intelligence has been moving away from discrete, symbolic approaches and towards statistical techniques. What statistical AI approaches have the most to offer now, to the working programmer (as opposed to ongoing research that may or may not provide concrete results)?
  • Programmers are frequently asked to produce high-performance systems that scale well under load. But you can't really talk about performance unless you can measure it. What kind of experimental design and statistical tools do you need to use to be able to say with confidence that the results are meaningful?
  • Simulation of physical systems, such as in computer graphics, frequently involves a stochastic approach.
  • Are there other problems commonly encountered by programmers that would benefit from a statistical approach?

  • Interesting question. As a statistician whose interests are more and more aligned with computer science, perhaps I can offer a few thoughts...

  • Don't learn frequentist hypothesis testing. While the bulk of my work is done in this paradigm, it doesn't match the needs of business or data mining. Scientists generally have specific hypotheses in mind, and may wish to gauge the probability that the data would be as extreme as it is, given that their hypothesis isn't true. This is rarely the type of answer a computer scientist wants.

  • Bayesian analysis is useful, even if you don't know why you are assuming the priors that you are using. A Bayesian analysis can give you a precise probability estimate for various contingencies, but it is important to realize that the only reason you have this precise estimate is that you made a fuzzy decision regarding the prior probability. (For those not in the know: with Bayesian inference, you can specify an arbitrary prior probability and update it based on the data collected to get a better estimate.)
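    As a minimal sketch of what "updating a prior with data" looks like in practice (the prior parameters and observed counts below are made-up numbers for illustration, not anything from this answer), here is a conjugate Beta-Binomial update for estimating a rate:

```python
# Minimal sketch of Bayesian updating with a conjugate Beta-Binomial model.
# The prior parameters and observed counts are invented for illustration.

def beta_binomial_update(prior_a, prior_b, successes, failures):
    """Return the posterior Beta(a, b) parameters after observing binomial data."""
    return prior_a + successes, prior_b + failures

# A fuzzy prior belief: roughly a 10% rate, worth about 10 pseudo-observations.
a, b = 1.0, 9.0

# Observed data: 30 successes out of 200 trials.
a_post, b_post = beta_binomial_update(a, b, successes=30, failures=170)

print("posterior mean:", a_post / (a_post + b_post))   # ~0.148
```

    Note that the "precise" posterior mean still depends directly on the fuzzy choice of the prior `a` and `b`, which is exactly the trade-off described above.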

  • Machine learning and classification might be a good place to get started. The machine learning literature is more focused on computer science problems, though its mission is almost identical to that of statistics (see http://anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/).

    Since you spoke of large databases with large numbers of variables, here are a few algorithms that come in handy in this domain.

  • AdaBoost: if you have a large number of weak ("crappy") classifiers and want to combine them into one good classifier (see also LogitBoost).
  • Support vector machines: a powerful and flexible classifier that can learn non-linear patterns (well, patterns that are linear in the kernel-induced feature space, if you want to be picky about it).
  • k-nearest neighbor: a simple but powerful algorithm (a minimal sketch follows this list). It does not scale well, but there are approximate nearest-neighbor alternatives that are not quite so pathological.
  • CART: This algorithm partitions the data based on a number of predictor variables. It is particularly good if there are variable interactions, or there exists a very good predictor that only works on a subset of the data.
  • Least angle regression: useful if the value you are trying to predict is continuous and you have a lot of data and a lot of predictors.
  • This is by no means complete, but it should give you a good jumping-off point. A very good and accessible book on the subject is Duda, Hart, and Stork, Pattern Classification.
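    To show how simple some of these methods are to get started with, here is a minimal k-nearest-neighbor sketch in plain NumPy (the toy data is made up; at real scale you would use an optimized library and approximate neighbors, as noted above):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                     # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                    # majority label

# Toy data: two small clusters labelled 0 and 1.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X, y, np.array([0.95, 1.0])))   # -> 1
```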

    Also, a big part of statistics is descriptive visualization and analysis. These are of particular interest to the programmer because they allow you to convey information back to the user. In R, ggplot2 is my package of choice for creating visualizations. On the descriptive analysis side (and useful in text analysis) is multi-dimensional scaling, which can give a spatial interpretation of non-spatial data (for example, the ideologies of senators: http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aoas/1223908041).
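    As a rough sketch of what multi-dimensional scaling does under the hood, classical (Torgerson) MDS turns a matrix of pairwise distances into low-dimensional coordinates. The distance matrix below is invented; a real application would feed in, say, dissimilarities between documents or voting records.

```python
import numpy as np

def classical_mds(D, dims=2):
    """Embed objects in `dims` dimensions from a symmetric distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centred squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:dims]   # keep the largest eigenvalues
    scale = np.sqrt(np.maximum(eigvals[order], 0))
    return eigvecs[:, order] * scale           # one row of coordinates per object

# Made-up pairwise dissimilarities between four objects.
D = np.array([[0, 1, 4, 3],
              [1, 0, 3, 4],
              [4, 3, 0, 1],
              [3, 4, 1, 0]], dtype=float)

print(classical_mds(D))
```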


    Just as a point, not as a criticism: your question should be formulated in a different way: "What statistics should any person know?"

    The fact is, like it or not, we all deal with statistics. It's a fact of life: polls, weather forecasts, drug effectiveness, insurance, and of course some parts of computer science. Being able to critically analyze the data you are presented with is the line between understanding it correctly and being scammed, whatever that means.

    That said, I think the following points are important to understand:

  • mean, median, standard deviation of a sample, and the difference between sample and population (this is very important)
  • the common distributions, and why the Gaussian distribution is so important (the central limit theorem; a quick simulation right after this list illustrates it)
  • What is meant by null hypothesis testing.
  • What variable transformation, correlation, regression, and multivariate analysis are.
  • What Bayesian statistics is.
  • Plotting methods.
  • All these points are critical not only to you as a computer scientist, but also as a human being. I will give you some examples.
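    Before the examples, here is a quick simulation sketch (all numbers invented) of two of the points above, the central limit theorem and the sample/population distinction: sample means of a decidedly non-Gaussian distribution come out approximately Gaussian, with spread about sigma/sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(0)

# Population: an exponential distribution (skewed, not Gaussian), mean 1, sd 1.
n, repeats = 50, 10_000
samples = rng.exponential(scale=1.0, size=(repeats, n))

sample_means = samples.mean(axis=1)

# The central limit theorem says these means are roughly Gaussian,
# with mean ~1 and standard deviation ~1/sqrt(50) ~= 0.141.
print(sample_means.mean(), sample_means.std())

# Sample vs population standard deviation: ddof=1 gives the unbiased
# sample estimate, ddof=0 the population formula.
one_sample = samples[0]
print(one_sample.std(ddof=1), one_sample.std(ddof=0))
```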

  • The evaluation of the null hypothesis is critical for testing the effectiveness of a method: for example, whether a drug works, or whether a fix to your hardware had a concrete result or the difference is just a matter of chance. Say you want to improve the speed of a machine and you change the hard drive. Does this change matter? You could sample performance with the old and the new hard disk and check for differences. Even if you find that the average with the new disk is lower, that does not mean the hard drive has any effect at all. This is where null hypothesis testing comes in; it will not give you a definitive answer, but a statement at a chosen confidence level, e.g. that a difference as large as the one you observed would occur less than 10% of the time if changing the hard drive had no real effect on the performance of your machine.
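    A minimal sketch of that hard-drive comparison, assuming SciPy is available (the timing numbers are invented): a two-sample t-test asks how likely a difference this large would be if the two disks performed identically.

```python
import numpy as np
from scipy import stats

# Invented benchmark timings in seconds for the old and new hard drives.
old_disk = np.array([12.1, 11.8, 12.4, 12.0, 12.3, 11.9, 12.2, 12.1])
new_disk = np.array([11.7, 11.9, 11.6, 11.8, 12.0, 11.5, 11.8, 11.7])

# Welch's t-test: does not assume equal variances.
t_stat, p_value = stats.ttest_ind(old_disk, new_disk, equal_var=False)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value means a difference this large would rarely occur
# by chance alone if the disk change had no real effect.
```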

  • Correlation is important for finding out whether two quantities "change alike". As the internet mantra "correlation is not causation" teaches, it should be handled with care. The fact that two random variables show correlation does not mean that one causes the other: they could both be driven by a third variable you are not measuring, or they could just happen to behave in the same way. Look up pirates and global warming to understand the point. A correlation reports a possible signal; it does not report a finding.
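    A small simulation (entirely synthetic data) of the "third variable" trap: two series that never influence each other can still be strongly correlated because both follow a common trend.

```python
import numpy as np

rng = np.random.default_rng(1)
trend = np.linspace(0, 10, 200)                   # the hidden third variable

ice_cream_sales = trend + rng.normal(0, 1, 200)   # driven by the trend
sunburn_cases   = trend + rng.normal(0, 1, 200)   # also driven by the trend

# Strong correlation, even though neither series causes the other.
print(np.corrcoef(ice_cream_sales, sunburn_cases)[0, 1])   # ~0.9
```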

  • Bayesian statistics. We all know the spam filter, but there's more. Suppose you go for a medical checkup and the result says you have cancer (I seriously hope not, but it's to illustrate a point). Most people at this point would think "I have cancer". That's not right. A positive test for cancer moves your probability of having cancer from the baseline for the population (say 8 per thousand people have cancer, a number picked out of thin air) to a higher value, which is not 100%. How high this number is depends on the accuracy of the test. If the test is lousy, you could just be a false positive. The more accurate the test, the more the probability shifts, but it still isn't 100%. Of course, if multiple independent tests all confirm that you have cancer, then it's very probable that you actually have it, but it's still not 100%. Maybe it's 99.999%. This is a point many people don't understand about Bayesian statistics.
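    A worked version of that cancer-test reasoning (the 0.8% baseline comes from the paragraph above; the test's sensitivity and false-positive rate are assumptions invented for illustration):

```python
# Bayes' theorem: P(cancer | positive test).
prior = 0.008          # baseline: 8 per thousand (from the paragraph above)
sensitivity = 0.95     # assumed P(positive | cancer)
false_positive = 0.05  # assumed P(positive | no cancer)

evidence = sensitivity * prior + false_positive * (1 - prior)
posterior = sensitivity * prior / evidence

print(posterior)   # ~0.13: far from the 100% most people assume
```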

  • Plotting methods. This is another thing that is usually neglected. Analysis of data does not mean anything if you cannot convey effectively what it means through a simple plot. Depending on what information you want to put into focus, or the kind of data you have, you will prefer an xy plot, a histogram, a violin plot, or a pie chart.
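    A minimal matplotlib sketch (the data is synthetic) of choosing the plot that matches the question: the same measurements look very different as a point-by-point view versus a histogram of their distribution.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
latencies = rng.lognormal(mean=0.0, sigma=0.5, size=500)   # synthetic measurements

fig, (ax_points, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

ax_points.plot(latencies, marker=".", linestyle="none")    # order of arrival
ax_points.set_title("per-measurement view")

ax_hist.hist(latencies, bins=30)                           # overall distribution
ax_hist.set_title("distribution view")

plt.tight_layout()
plt.show()
```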

  • Now, let's go to your questions. I think I over-indulged in what was meant to be a quick note, but since my answer was voted up quite a lot, I feel it's better to answer your questions properly, as far as my knowledge allows (and I'm on vacation here, so I can indulge in it as much as I want).

    What kind of problems in programming, software engineering, and computer science are statistical methods well suited for? Where am I going to get the biggest payoffs?

    Normally, anything that has to do with comparing data involving numerical (or reducible-to-numerical) input from unreliable sources: a signal from an instrument, a bunch of pages and the number of words they contain. When you get these data and have to distill an answer out of them, you need statistics. Think, for example, of the algorithm that performs click detection on the iPhone. You are using a trembling, fat stylus (your finger) to point at an icon which is much smaller than the stylus itself. Clearly, the hardware (the capacitive screen) will send you a bunch of data about the finger, plus a bunch of random noise (air? I don't know how it works). The driver must make sense of this mess and give you an x,y coordinate on the screen. That needs (a lot of) statistics.
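    Purely as an illustration of the kind of estimation involved (I have no idea how the actual driver works; the readings and the weighting scheme below are invented): given several noisy sensor readings, even a strength-weighted average already gives a much steadier x,y estimate than any single reading.

```python
import numpy as np

# Invented noisy sensor readings: (x, y, signal_strength) per sample.
readings = np.array([
    [101.0, 204.0, 0.9],
    [ 98.0, 199.0, 0.7],
    [110.0, 190.0, 0.2],   # weak, noisy reading
    [102.0, 202.0, 0.8],
])

xy = readings[:, :2]
weights = readings[:, 2]

# Strength-weighted average: strong readings dominate, weak noise is damped.
estimate = (xy * weights[:, None]).sum(axis=0) / weights.sum()
print(estimate)   # ~[101.2, 201.0]
```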

    What kind of statistical methods should I spend my time learning?

    The ones I listed above are more than enough, not least because to understand them you have to work through plenty of other material anyway.

    What resources should I use to learn this? Books, papers, web sites. I'd appreciate a discussion of what each book (or other resource) is about, and why it's relevant.

    I learned statistics mostly from standard university courses. My first book was the "train wreck book", and it's very good. I also tried this one, which focuses on R, but it did not satisfy me particularly: you have to already know the material and R to get through it.

    Programmers frequently need to deal with large databases of text in natural languages, and help to categorize, classify, search, and otherwise process it. What statistical techniques are useful here?

    That depends on the question you need to answer using your dataset.

    Programmers are frequently asked to produce high-performance systems that scale well under load. But you can't really talk about performance unless you can measure it. What kind of experimental design and statistical tools do you need to use to be able to say with confidence that the results are meaningful?

    There are a lot of issues with measuring. Measuring is a fine and delicate art; doing it properly is almost beyond human ability. The fact is that sampling introduces bias, either from the sampler, or from the method, or from the nature of the sample, or from the nature of nature. A good sampler knows these things and tries to reduce unwanted bias as much as possible, pushing what remains toward randomness.

    The examples from the blog you posted are relevant. Say your database has a startup (warm-up) time. If you take performance measurements within that time, all your measurements will be biased. There's no statistical method that can tell you this; only your knowledge of the system can.
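    A hedged sketch of one way to handle that (the timings are invented): drop the warm-up window by hand, since no statistical test will find it for you, and then bootstrap a confidence interval for the median latency rather than reporting a single number.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented latency measurements (ms); the first ones are inflated by warm-up.
raw = np.concatenate([rng.normal(80, 5, 20),    # warm-up: cold caches, etc.
                      rng.normal(40, 5, 200)])  # steady state

steady = raw[20:]   # domain knowledge, not statistics, tells us to drop these

# Bootstrap a 95% confidence interval for the median latency.
boot = rng.choice(steady, size=(10_000, steady.size), replace=True)
medians = np.median(boot, axis=1)
print(np.percentile(medians, [2.5, 97.5]))
```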

    Are there other problems commonly encountered by programmers that would benefit from a statistical approach?

    Every time you have an ensemble of data producers, you have statistics, so scientific computing and data analysis is obviously one place. Folksonomy and social networking are pretty much all statistics. Even Stack Overflow is, in some sense, statistical. The fact that an answer is highly voted does not mean that it's the right one; it means that there's a high probability that it is right, according to the evaluation of a statistical ensemble of independent evaluators. How these evaluators behave makes the difference between Stack Overflow, Reddit, and Digg.
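    A small sketch of that last point (the vote counts are made up, and treating votes as independent correctness judgments is a big simplification): a Beta posterior over "fraction of evaluators who consider the answer right" shows how much more a 90/10 split tells you than a 9/1 split, even though the proportions are identical.

```python
from scipy import stats

def credible_interval(upvotes, downvotes, level=0.95):
    """Credible interval for the underlying approval rate, with a Beta(1, 1) prior."""
    a, b = 1 + upvotes, 1 + downvotes
    lo = stats.beta.ppf((1 - level) / 2, a, b)
    hi = stats.beta.ppf(1 - (1 - level) / 2, a, b)
    return lo, hi

print(credible_interval(9, 1))     # wide: roughly (0.59, 0.98)
print(credible_interval(90, 10))   # much narrower, concentrated around 0.9
```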


    I don't have much to add, but it happens that I just started reading this book: D. S. Sivia with J. Skilling, Data Analysis: A Bayesian Tutorial, 2nd edition, Oxford University Press, 2006.

    What caught my attention is the preface, where the author describes a dissatisfaction common among those who approach the study of statistics:

    Preface

    As an undergraduate, I always found the subject of statistics to be rather mysterious. This topic wasn't entirely new to me, as we had been taught a little bit about probability earlier at high school; for example, I was already familiar with the binomial, Poisson and normal distributions. Most of this made sense, but only seemed to relate to things like rolling dice, flipping coins, shuffling cards and so on. However, having aspirations of becoming a scientist, what I really wanted to know was how to analyse experimental data. Thus, I eagerly looked forward to the lectures on statistics. Sadly, they were a great disappointment. Although many of the tests and procedures expounded were intuitively reasonable, there was something deeply unsatisfactory about the whole affair: there didn't seem to be any underlying basic principles! Hence, the course on 'probability and statistics' had led to an unfortunate dichotomy: probability made sense, but was just a game; statistics was important, but it was a bewildering collection of tests with little obvious rhyme or reason. While not happy with this situation, I decided to put aside the subject and concentrate on real science. After all, the predicament was just a reflection of my own inadequacies and I'd just have to work at it when the time came to really analyse my data.

    The story above is not just my own, but is the all too common experience of many scientists. Fortunately, it doesn't have to be like this. What we were not told in our undergraduate lectures is that there is an alternative approach to the whole subject of data analysis which uses only probability theory. In one sense, it makes the topic of statistics entirely superfluous. In another, it provides the logical justification for many of the prevalent statistical tests and procedures, making explicit the conditions and approximations implicitly assumed in their use.

    This book is intended to be a tutorial guide to this alternative Bayesian approach, including modern developments such as maximum entropy.

    ...

    I hope this book keeps its promises.

    There are a couple of preview chapters from the first edition here, from a course in Cognitive Psychology/AI where this book was adopted, and other materials from the same course here. There is related software by the second author here, and a more extended preview on Google Books here.
