A colleague pointed me to a job advert to teach at a new university in London: the London Interdisciplinary School.

I was excited to see this was happening, because a new university may well be the best way to make the change we need in science and education.

The fundamental change we need is to take computing seriously.

Taking computing seriously implies two things:

  • Code as the basis for data analysis
  • Disciplined working process as the basis for collaboration

If you don’t like reading long blog posts, skip to the blueprint at the end; you may want to come back here for the background.

I have discussed these changes before in Research methods in the twenty-first century.

Code as the basis for data analysis

This is the true and fruitful meaning of the term data science (see What is data science?).

The insight has been most accurately captured by the data science teaching initiatives at Berkeley – see https://data.berkeley.edu/education. Berkeley runs a foundation course in data science, called data8. Quoting from that page:

The course is designed for entry-level students from any major. It is designed specifically for students who have not previously taken statistics or computer science courses.

The course teaches programming (in Python) as the fundamental tool for analyzing data. It spends a lot of time thinking about plotting, and about the nature of inference. Because it uses a programming language, it can and does replace all classical statistical tests, such as t-tests, with equivalents that use resampling. When the student knows some code, resampling is much easier to explain than the classical statistical tests (Cobb 2007); the students can write the procedures for themselves, and fully understand all the steps, including the inference at the end.
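To make this concrete, here is a minimal sketch of such a resampling test; the data and variable names are invented for illustration, not taken from the Berkeley materials. It compares the means of two groups by permutation, where a classical course would reach for a t-test:

    import numpy as np

    rng = np.random.default_rng()

    # Invented scores for two groups (say, treatment and control).
    group_a = np.array([12.1, 9.8, 11.5, 13.0, 10.2, 12.7])
    group_b = np.array([9.5, 10.1, 8.7, 9.9, 11.0, 8.9])

    observed_diff = group_a.mean() - group_b.mean()

    # Null model: the group labels make no difference, so pool and reshuffle.
    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    fake_diffs = np.zeros(10_000)
    for i in range(len(fake_diffs)):
        shuffled = rng.permutation(pooled)
        fake_diffs[i] = shuffled[:n_a].mean() - shuffled[n_a:].mean()

    # How often does shuffling alone give a difference at least this large?
    p_value = np.mean(fake_diffs >= observed_diff)
    print(observed_diff, p_value)

Every step here is something the student wrote and can inspect, including the final inference; there is no black-box test statistic to take on trust.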

The Berkeley course does not cover “big data”, and that’s the right thing to do. “Big data” is something you learn about when you have a firm foundation in small data.

It does cover some machine learning and prediction, and does it very well, but presents these, in the end, as another set of techniques for using data to draw conclusions about the world. This is right too. No student should start using machine learning without a good basic understanding of data analysis, sampling and inference.

Berkeley uses this course as a foundation on which to build a whole new approach to teaching. They have many connector courses that use the same tools and techniques as the foundation course, but apply them to areas such as law, statistics, computer science, psychology and geography.

This is right too; we should teach with data. We should engage students at once with the problem of real data, and the difficulties of drawing conclusions. They should think about the ambiguities in data and the conclusions we draw. They should learn to check every conclusion, whether from the instructor or from a scientific paper, against the data. They should constantly ask “Is that true? Under what circumstances is it true? How sure am I that it is true? How would I get more evidence to support or refute this conclusion?”

Disciplined working process as the basis for collaboration

Working in collaboration on data analysis and other scientific projects is very difficult, and we currently do that very badly. We need to teach our students to start with sound, efficient practice, and keep teaching them with this same practice, so it becomes natural for them to use it for all new problems.

Science as currently practiced, and taught, uses what I would call an artisan working process, or at least, that is the polite name for it. A less polite description might be sloppy, informal and inefficient. Our current model is that of the skilled craftsman, from before the industrial revolution. The craft is so specialized, variable and arcane, that the only way to learn is for the student to copy the techniques they see during their apprenticeship. In science, the apprenticeship is working as a graduate student or post-doc. Scientists mostly work alone. The large proportion of scientists who use a significant amount of computing have disorganized, informal working practices, passing around scripts and Word documents via email or shared drives. It is very hard to share work, because each researcher has their own habits of analysis, coding, and data organization. There is very little chance that the work will be reproducible, even by the same researcher, after a few weeks, and little chance that a mentor will be able to check the work for error.

This model is, of course, monstrously inefficient, confusing, and error prone. It makes collaboration very difficult. It is easy to underestimate the tendency of computers to confuse and distract.

Meanwhile, programmers in industry ran into these same problems, and quickly learned that they couldn’t do efficient work that way. When they tried, the result was bad software, poor collaboration and inefficient production. They lost money.

In response, programmers have been thinking very hard about how to work and collaborate efficiently. They found they needed:

  • code testing
  • version control
  • continuous testing, with tests run for each code change
  • code review
  • issue tracking
  • pair coding

Although these techniques evolved for writing commercial software, they apply with the same force of necessity to the process of writing scientific software. They also turn out to be revolutionary in improving the process of building data analyses in a team. To those of us who have experienced these techniques for doing scientific projects, returning to the old email and track changes process feels like a return to the dark ages, as if we had been forced to use a typewriter instead of a word processor. The new techniques are much more effective in increasing clarity, reducing error and improving discussion. I estimate they give a ten-fold increase in efficiency for projects involving more than one person, and a three-fold increase for someone working on their own.
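As a small illustration of what the first item in that list, code testing, looks like when applied to analysis code rather than commercial software, here is a hypothetical example of mine; the function, file name and numbers are invented. A tool such as pytest can run the test, and running it automatically on every change is all that continuous testing means:

    # test_rates.py -- a toy example of a tested analysis function.
    # Run with: pytest test_rates.py
    import numpy as np


    def incidence_per_1000(cases, population):
        """Cases per 1000 people, refusing obviously bad input."""
        cases = np.asarray(cases, dtype=float)
        population = np.asarray(population, dtype=float)
        if np.any(population <= 0):
            raise ValueError("population must be positive")
        return 1000 * cases / population


    def test_incidence_per_1000():
        # A value we can check by hand: 50 cases in 10,000 people is 5 per 1000.
        assert incidence_per_1000(50, 10_000) == 5.0
        # The function should also work elementwise on arrays.
        result = incidence_per_1000([10, 20], [1000, 2000])
        assert np.allclose(result, [10.0, 10.0])

The point is not the particular function; it is that every piece of analysis code has at least one check that a collaborator, or an automated service, can run without asking the author.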

Using these techniques is not advanced; it is where students should start. We have taught these techniques to statistics and other undergraduates at Berkeley (Millman et al. 2018). Our course started with the basics of Python programming, then introduced the techniques they would need for working together, such as version control and testing. We gave them a big, difficult, open-ended group project to do. It had to be reproducible, so that we, their graders, could take their project repository (hosted on GitHub) and follow their instructions, in a README file, to reproduce all of their analysis and build their final report, including the figures. If you read the paper, I hope you will agree that they did this with great success.
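To show the flavour of what “reproducible” meant in practice, here is a hypothetical fragment of such a project: a script the README might tell the grader to run, which rebuilds one figure from a fixed random seed. The file name, data and figure are invented for illustration:

    # make_figures.py -- rebuild the report figures from scratch.
    # A README might say: run "python make_figures.py" before building the report.
    import numpy as np
    import matplotlib

    matplotlib.use("Agg")  # write image files; no display needed
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2018)  # fixed seed, so every run gives the same figure

    # Stand-in for loading the project's real data files.
    scores = rng.normal(70, 10, size=200)

    fig, ax = plt.subplots()
    ax.hist(scores, bins=20)
    ax.set_xlabel("score")
    ax.set_ylabel("count")
    fig.savefig("scores_hist.png")

Anyone with the repository can run this and get the same figure, which is what makes checking the work possible.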

Blueprint for a modern university

I now imagine I can design a whole new set of courses, from the ground up, to give the students the best possible foundation for modern scientific work.

  • The students start with something like the Berkeley foundations of data science course. They learn basic programming, in Python. They learn to get data from real data sources and explore it with plots and tables. They learn what models are, what inference means, and how to make models and inferences with simulation and permutation. They learn the bootstrap for estimation (see the sketch after this list), and they get an introduction to regression and basic machine learning. They do a substantial open-ended project using these tools at the end. They learn in easy-start tools such as the Jupyter Notebook.
  • The students also specialize, using the same teaching methods as the data science foundation. If they are learning about the epidemiology of malaria, they get epidemiological data and do basic analysis. They get clinical trial data and draw conclusions. They learn how to put data on maps, and visualize malaria incidence.
  • Their courses give them a deep foundation of understanding. For example, when comparing the effect of two treatments for malaria, they know how to formulate the null model (there is no effect of the treatment), and how to implement this model (by pooling the two treatment groups, permuting, and selecting random samples as a proxy for the original treatment groups). They know what it means to think of something as random, and where the randomness appears in their model. They can use random numbers and permutation to design new models to test new hypotheses.
  • The deep foundation gives them deep skepticism. They know what it is to understand something in depth, and they know the dangers of superficial understanding and of following recipes. When they learn something new, they learn it in depth, as they have learned in their courses. They test their understanding by building it in code.
  • In the first or second year, they move on to techniques for collaboration. Instead of Jupyter Notebooks, they start using more efficient tools like text editors and the terminal. They learn version control, and pull requests, and code review. They learn to formalize problems in issues, and respond to problems with code changes, or by finding new data for analysis. They learn to criticize each other’s work constructively, and to teach new team members. They learn how to deal with standard problems in the team, such as overconfident team members, or members who are not doing their share of work. Now they are in a good position to start collaborations with industry and academia.
  • As the students gain experience, they gain independence. They learn to start quickly, using standard tools. They begin to contribute to open-source projects, both to support their analyses and for their own education in group work and in coding.
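Here is the sketch promised in the first bullet: a minimal, hypothetical example of the bootstrap for estimation, with invented data values (say, measurements from a small malaria survey); nothing here is drawn from an existing course:

    import numpy as np

    rng = np.random.default_rng()

    # Invented measurements from a small sample.
    sample = np.array([3.1, 4.7, 2.9, 5.2, 4.1, 3.8, 6.0, 2.5, 4.4, 3.6])

    # Resample with replacement many times, recording the mean of each resample.
    boot_means = np.array([
        rng.choice(sample, size=len(sample), replace=True).mean()
        for _ in range(10_000)
    ])

    # A 95% percentile interval for the mean: how far the estimate might plausibly move.
    low, high = np.percentile(boot_means, [2.5, 97.5])
    print(sample.mean(), low, high)

The same pattern of pooling, shuffling and resampling carries over directly to the permutation null model in the malaria treatment example above.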

All this is a big departure from our current model of teaching, but I don’t think any university will have a choice but to change in the coming decade. These ways of learning and working are so much more efficient than the techniques we are using now that students who do not learn them will be at a great disadvantage. But that is not the only issue; we need to prepare our students for a changing world. We are entering a new age, in which complex data analysis will have an increasing influence on our lives and our scientific understanding. If we want to teach students to be effective and skeptical, we can’t afford to use the old methods; we just don’t have time to bolt computing and algorithms on top of our current courses. We must start with computing and algorithms, to teach students to build and to understand. Then they will have the tools to engage with confidence and vigor with the beasts of the new age.

References

Cobb, George W. 2007. “The Introductory Statistics Course: A Ptolemaic Curriculum?” Technology Innovations in Statistics Education 1 (1). https://escholarship.org/uc/item/6hb3k0nz.
Millman, K. Jarrod, Matthew Brett, Ross Barnowski, and Jean-Baptiste Poline. 2018. “Teaching Computational Reproducibility for Neuroimaging.” Frontiers in Neuroscience.
