This post follows from Who is building the foundations?.

Despite the title, that last post was about where the foundations get built.

Here I am interested in the process by which developers start working on these foundation projects.

Software expert build or practitioner build?

I can think of two simple models:

  1. Software expert build - or “Here, try this thing I built.” The developer has an idea for a library that they believe will be useful for people doing data analysis. They build it, and then go to find users.
  2. Practitioner build - or “I need this thing, so I’m going to build it.” The developer has some data they need to analyze, or other work they need to do, or do more efficiently. They build the library to do that work.

For example, an entirely new language is likely to be in the Software Expert Build category, because a working scientist or data analyst is unlikely to have the time or expertise to build a language. I presume that domain-specific libraries, like Astropy, will usually come from the “Practitioner Build” group, where development gives a more immediate benefit for the work at hand.

To put the distinction another way: what matters in open-source development? Having a hard data analysis problem, being an expert in software development, or both (to get started)?

Some references that came to mind

The “Practitioner Build” model follows Eric Raymond’s first lesson from The Cathedral and the Bazaar:

  1. Every good work of software starts by scratching a developer’s personal itch.

Our question is: what happens if the person with the itch is not (yet) a developer? Does that person need a developer to come and help? Or do they teach themselves, and become a developer?

There is some discussion of this problem in the paper Understanding the High-Performance Computing Community: A Software Engineer’s Perspective. The paper describes the results of interactions between software engineers and scientists using large clusters to simulate “physical phenomena such as earthquakes, global climate change or nuclear reactions”. The message I took from the paper was that Software Experts tend to suggest solutions that Practitioners will not use, often with good reason:

Our aim in this paper is to distill our experience about how software engineers can productively engage the HPC community. Several software engineering practices generally considered good ideas in other development environments are quite mismatched to the needs of the HPC community. We found that keys to successful interactions include a healthy sense of humility on the part of software engineering researchers and the avoidance of assumptions that software engineering expertise applies equally in all contexts.

Why it matters

The distinction between Software Expert Build and Practitioner Build is important, because the two models have different implications for funding and support. If the most fruitful development comes from Software Expert Build, then we can follow a traditional path: find and fund experts with good ideas. If Practitioner Build has been more effective, we will need more imagination, because practitioner builders are harder to find before most of the work is done.

Relevant data from scientific Python libraries

This post collates some data bearing on the distinction between Software Expert Build and Practitioner Build, for the three most fundamental libraries in scientific Python: Numpy, Scipy, and Matplotlib. Numpy is the base layer that implements the array object in Python, and seems close to a language on the spectrum above. Scipy is a collection of libraries that build on Numpy, and is therefore closer to the domain libraries. Matplotlib is the standard 2D plotting library for Python.
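
As a minimal sketch of how these three layers fit together (the names and data here are purely illustrative):

```python
import numpy as np                       # base layer: the array object
from scipy import linalg                 # numerical routines built on Numpy arrays
import matplotlib.pyplot as plt          # 2D plotting of those arrays

x = np.linspace(0.0, 2.0 * np.pi, 100)   # a Numpy array
y = np.sin(x)
magnitude = linalg.norm(y)               # Scipy operating on the array
fig, ax = plt.subplots()                 # Matplotlib drawing the array
ax.plot(x, y)
```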

I have also done some more superficial analysis of another three important projects in the scientific Python world: IPython / Jupyter (interactive prompt / Notebook interface); Pandas (base library for data science); and Statsmodels (statistical models).

I have two sources of data for looking at the Software Expert / Practitioner Build distinction. The first is a set of brief early histories of each project, from the main authors. Was the project driven by a good idea in search of work to do, or by work that needed doing? I have histories for all the projects above.

The second source of data is the educational background of the contributors. In the case of Software Expert Build, I expect contributors to have training in computer science in general, and algorithm design / software engineering in particular. I have these data for Numpy, Scipy and Matplotlib. I didn’t have time to research the backgrounds of the developers for the other libraries.

Abbreviations

These are abbreviations I will use for the tables below:

  * Eng.: Engineering
  * Sci.: Science
  * Sys.: Systems
  * EE: Electrical Engineering
  * EECS: Electrical Engineering and Computer Science
  * CS: Computer Science
  * C Sys. Eng.: Computer and Systems Engineering
  * Comp. Biol.: Computational Biology

Numpy

Jim Hugunin was a major author of Numeric, the package that later became Numpy. He summarizes the origins of Numeric in the story of Jython:

The story of Jython begins with the pain of finishing my master’s thesis at MIT. In that thesis I fabricated, measured, and analyzed superconductor-semiconductor junctions as a potential building block for a quantum computer. For analyzing the measurements and comparing them with theory (the Bogoliubov-deGennes equations) I used matlab extensively. Matlab is a wonderful language for a wide range numerical analyses; however, it is a terrible language in which to do anything else. In order to overcome its shortcomings, I eventually cobbled together a hodge-podge of C, Python and matlab code to produce my final results.

I knew there had to be a better way of doing this. After finishing my thesis, I started to work on an extension to Python to support numeric analysis as naturally as matlab does, without sacrificing any of the power of Python as a rich general-purpose programming language. This was the first project where I discovered the power of a collaborative open source community. The contributions of Jim Fulton, David Ascher, Paul DuBois, Konrad Hinsen, and many others made that project much more successful than it could ever have been as an isolated endeavor. …

Elsewhere Hugunin notes that Numeric “is based on Jim Fulton’s original implementation of a matrix object as a container class.” He links to the then-current list of Python extension modules that contains a description of Fulton’s “Fortran Matrix extension”.

These are the educational backgrounds of the Numpy founding fathers that Hugunin names above:

| | Bachelor’s | Master’s | PhD |
|---|---|---|---|
| Jim Hugunin 1 | EE | EECS | |
| Jim Fulton 2 | Civil Eng. | Sys. Eng., Software Sys. Eng. | |
| David Ascher 3 | Physics | | Cognitive Sci. |
| Paul DuBois 4 | ? | | Math |
| Konrad Hinsen 5 | ? | | Physics |

Although Hugunin was studying Electrical Engineering and Computer Science when he was writing Numeric, his description above gives the impression of an engineer building a machine that happened to be a computer, where the analysis problems arose from the task.

Fulton 6 has a background in civil and systems engineering. At the time Numeric was starting, he was working at the US Geological Survey. He records his roles as “Data analysis, hydrologic modeling, and development and support of data analysis tools”, so this qualifies him as a practitioner. He was also studying for an MS in software systems engineering, making him a rare case of a trained practitioner and formally trained software engineer.

Here are the educational backgrounds for the top 10 authors of Numpy, by number of commits, from Git commit 4b6b29afc:

| | Commits | Bachelor’s | Master’s | PhD |
|---|---|---|---|---|
| Charles Harris 7 | 4169 | Physics | | Math |
| Travis Oliphant 8 | 2065 | EECS, Math | EECS | Biomedical Eng. |
| David Cournapeau 9 | 1525 | EE | | CS |
| Pearu Peterson 10 | 1061 | Physics | | Physics |
| Eric Wieser 11 | 935 | Eng. | | |
| Pauli Virtanen 12 | 762 | Physics | | Physics |
| Mark Wiebe 13 | 759 | CS / Math | | |
| Julian Taylor 14 | 745 | Physics | | |
| Ralf Gommers 15 | 619 | Physics | | Physics |
| Matti Picus 16 | 516 | Agricultural Eng. | | |

From the table above, the contributors with specific education in computer science are Travis Oliphant (bachelor’s and master’s degrees in Electrical and Computer Engineering), David Cournapeau (PhD in Informatics), and Mark Wiebe (bachelor’s in computer science and math).

Oliphant’s master’s thesis was applied work on “New Methods for Scatterometry based Wind Estimation”. Cournapeau’s thesis was also applied work on speech processing and machine learning. I think both are examples of practitioners who became developers in response to the needs of their analysis.

Mark Wiebe had a lot of experience in software engineering before he began to work on Numpy (see 17). His is an interesting case, in that he made substantial contributions to Numpy, including the widely used einsum function, but he is also the only contributor that I can remember who had a large amount of their code pulled out of the main Numpy repository because the community could not agree on the changes.
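
As an aside, the einsum function mentioned above lets many array operations be written as index contractions; a small example:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(12).reshape(3, 4)

# Matrix multiplication, written as a sum over the repeated index j.
c = np.einsum('ij,jk->ik', a, b)
assert np.array_equal(c, a @ b)

# Trace of a square matrix: sum over the repeated index i.
t = np.einsum('ii->', np.eye(3))
assert t == 3.0
```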

Scipy

From the “Origins of Numpy” chapter in Guide to Numpy by Travis Oliphant:

In 1998, as a graduate student studying biomedical imaging at the Mayo Clinic in Rochester, MN, I came across Python and its numerical extension (Numeric) while I was looking for ways to analyze large data sets for Magnetic Resonance Imaging and Ultrasound using a high-level language. …

As I progressed with my thesis work, programming in Python was so enjoyable that I felt inhibited when I worked with other programming frameworks. As a result, when a task I needed to perform was not available in the core language, or in the Numeric extension, I looked around and found C or Fortran code that performed the needed task, wrapped it into Python (either by hand or using SWIG), and used the new functionality in my programs.

By operating in this need-it-make-it fashion I ended up with a substantial library of extension modules that helped Python + Numeric become easier to use in a scientific setting. …

When I finished my Ph.D. in 2001, Eric Jones (who had recently completed his Ph.D. at Duke) contacted me because he had a collection of Python modules he had developed as part of his thesis work as well. He wanted to combine his modules with mine into one super package. Together with Pearu Peterson we joined our efforts, and SciPy was born in 2001.

We already know the educational background of Oliphant and Peterson from the Numpy table above. Eric Jones has a BSE in Mechanical Engineering and a PhD in Electrical Engineering.

Here are the top 10 contributors to Scipy (as of 4d9467958) by number of commits:

| | Commits | Bachelor’s | Master’s | PhD |
|---|---|---|---|---|
| Pauli Virtanen 18 | 2886 | Physics | | Physics |
| Ralf Gommers 19 | 2478 | Physics | | Physics |
| Evgeni Burovski 20 | 1077 | Physics, Math | | Physics |
| Travis Oliphant 21 | 977 | EECS | EECS | Biomedical Eng. |
| David Cournapeau 22 | 820 | EE | | CS |
| Warren Weckesser 23 | 756 | C Sys. Eng. | C Sys. Eng. | Math |
| Alex Griffing 24 | 604 | Math | | Bioinformatics |
| Pearu Peterson 25 | 504 | Physics | | Physics |
| Nathan Bell 26 | 375 | CS, Math | | CS |
| “Endolith” 27 | 362 | Sci., EE | | |

Scipy top contributors with computer science training are Oliphant, Cournapeau, Weckesser and Bell. I have already discussed Oliphant and Cournapeau above; I think both are best described as practitioners who became developers. Weckesser has a bachelor’s and a master’s in Computer and Systems Engineering; his publication list 28 suggests a longstanding interest in numerical software. Bell’s Google Scholar page shows many papers at the interface of software and numerical methods. I think both Weckesser and Bell can reasonably be classified as Software Experts who are also Practitioners (of numerical methods).

Matplotlib

From history of Matplotlib by John Hunter:

For years, I used to use MATLAB exclusively for data analysis and visualization. MATLAB excels at making nice looking plots easy. When I began working with EEG data, I found that I needed to write applications to interact with my data, and developed an EEG analysis application in MATLAB. As the application grew in complexity, interacting with databases, http servers, manipulating complex data structures, I began to strain against the limitations of MATLAB as a programming language, and decided to start over in Python. Python more than makes up for all of MATLAB’s deficiencies as a programming language, but I was having difficulty finding a 2D plotting package …

Finding no package that suited me just right, I did what any self-respecting Python programmer would do: rolled up my sleeves and dived in. …

John Hunter is also one of the top 10 contributors to Matplotlib by commits, as of commit a58f1c613. Here is the full list:

| | Commits | Bachelor’s | Master’s | PhD |
|---|---|---|---|---|
| Thomas Caswell 29 | 4309 | Physics, Math | Physics | Physics |
| Michael Droettboom 30 | 3918 | CS, Music | Music | |
| John Hunter 31 | 2145 | Politics | | Neurobiology |
| Eric Firing 32 | 1810 | Earth Sci. | | Oceanography |
| Antony Lee 33 | 1680 | Eng. | | Physics |
| Jody Klymak 34 | 1073 | Math, Physics | Oceanography | Oceanography |
| David Stansby 35 | 1066 | Natural Sci. | | |
| Elliott Sales de Andrade 36 | 960 | Eng. | | Physics |
| Jens Hedegaard Nielsen 37 | 856 | Physics | Physics | Physics |
| Nelle Varoquaux 38 | 806 | CS | Math | Comp. Biol. |

Not many Matplotlib contributors have a computer science background.

Droettboom 39 has a double major in music and computer science, and a master’s in music / computer music research. It is difficult to tell whether he had a specific interest in software engineering or algorithms.

Varoquaux 40 has a bachelor’s in CS, but has since worked in math and science.

Pandas

Pandas is the fundamental library for most data science tasks.

I have not analyzed the top contributors. Here is a short early history.

Wes McKinney created Pandas:

I studied theoretical mathematics at MIT (graduating in late 2006) …

From August 2007 to July 2010, I worked on the front office quant research team at AQR Capital Management … I started building pandas on April 6, 2008, as part of a skunkworks effort to reproduce some econometric research in Python. (from McKinney’s blog bio)

He gives more detail in a podcast:

I have a mathematics background and I didn’t do a lot of programming in the past. …. I really didn’t start programming until much later in life because I realized at some point that I needed to program in order to be useful. In my first job in AQR capital management … I was struck by how difficult it was to go about very basic data analysis problems. … it felt that there was a disconnect between ability of your mind and your brain to think about what you want do with the data and the actual tools to do what you want. … I got really interested in building tools for myself to be more productive.
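
The kind of “very basic data analysis problem” McKinney describes is now a one-liner in pandas; a minimal sketch, with invented data:

```python
import pandas as pd

# A toy table of returns by asset (the numbers are made up).
df = pd.DataFrame({
    "asset": ["A", "A", "B", "B"],
    "ret": [0.01, -0.02, 0.03, 0.01],
})

# A grouped summary that is awkward with plain arrays.
means = df.groupby("asset")["ret"].mean()
```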

IPython / Jupyter

Fernando Pérez created IPython in 2001. IPython gradually developed into Jupyter, the widely used interactive notebook.

Fernando gives his biography as:

I was born in Medellín, Colombia, where I did my undergraduate studies in Physics at the Universidad de Antioquia. From there I moved to Boulder, Colorado, where I completed my PhD in particle physics, before spending a few years as a post-doc in Applied Mathematics. In 2008 I moved to the San Francisco Bay Area, where I now work at UC Berkeley, on research involving algorithms and tool development in neuroimaging.

He is now a professor of statistics at UC Berkeley.

Fernando wrote a history of IPython and the Notebook. The quotes in this section are from that blog post, unless I list another source.

Let’s go back to the beginning: when I started IPython in late 2001, I was a graduate student in physics at CU Boulder. … … the natural flow of scientific computing pretty much mandates a solid interactive environment, so while other Python users and developers may like having occasional access to interactive facilities, scientists more or less demand them. … We continued slowly working on IPython, … By that point Brian Granger and Min Ragan-Kelley had come on board.

Ragan-Kelley has a “PhD in Applied Science & Technology, with an emphasis in computational plasma physics”. Granger has a bachelor’s in Engineering Physics, and PhD in Physics. At the time he joined IPython, in 2004, he was an assistant professor in Physics.

This is where early 2010 found us … I read an article about [the messaging library] ZeroMQ … It became clear that we had the right tools to build a two-process implementation of IPython that could give us the ‘real IPython’ but communicating with a different frontend, and this is precisely what we wanted for cleaner parallel computing, multiprocess clients and a notebook. … In October 2010 James Gao (a Berkeley neuroscience graduate student) wrote up a quick prototype of a web notebook, demonstrating again that this design really worked well and could be easily used by a completely different client. … And finally, in the summer of 2011 Brian took James’ prototype and built up a fully working system.
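
The two-process design described here can be sketched with ZeroMQ’s request-reply sockets. This toy version (using pyzmq) runs both ends in one process for brevity; the real IPython protocol is far richer:

```python
import zmq

ctx = zmq.Context.instance()

# The 'kernel' end: receives code, evaluates it, replies with the result.
kernel = ctx.socket(zmq.REP)
kernel.bind("inproc://ipython")

# The 'frontend' end: a terminal, a Qt console, or a web notebook
# could each be a different client speaking the same protocol.
frontend = ctx.socket(zmq.REQ)
frontend.connect("inproc://ipython")

frontend.send_string("1 + 1")              # frontend sends code...
request = kernel.recv_string()             # ...the kernel receives it,
kernel.send_string(str(eval(request)))     # evaluates (toy only!), and replies
reply = frontend.recv_string()
print(reply)  # → 2
```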

A 2012 grant application describes the process by which Pérez, Granger and Ragan-Kelley trained themselves as developers:

While we were all trained as physicists without any software engineering education, in our interaction with the world of open source developers we have learned and adopted rigorous software engineering practices that we follow to ensure IPython remains a robust and high-quality project even as it grows.

Statsmodels

These are notes for the early history of the Statsmodels project.

Skipper Seabold 41 started Statsmodels proper, as a Google Summer of Code project, in collaboration with Josef Perktold 42.

Seabold was interviewed in a recent podcast:

My background, my major up until my last year of undergrad was poetry writing. So I could not have been further from math and data science and programming. I switched over to studying economics. Ended up going to grad school for econ. Really liked econometrics, anyway, we got into our math econ kind of boot camp and first math econ class and it was taught in Python. … It really helped solidify the stuff I was working on. I felt like I didn’t really understand it until I could code it. And once I could code it I was like, “Oh sure I totally understand this and now can program and speak confidently about it I guess.” … I jumped in, and the thing that was really helpful was there was a problem out there that I needed to solve. I needed to be able to run an OLS regression, get the results out of the regression in a way that was easily understandable, build a user interface. I didn’t know the first thing about programming, but I had some good and patient mentors and kind of co-developers along the way. I learned a ton from the open source community. Just by kind of, honestly, failing in public, flailing in public, making pull requests and having people take the time out of their day to spend hours reading your code and suggest how to be a better programmer.

From the Statsmodels Scipy conference paper by Seabold and Perktold:

… statsmodels was started by Jonathan Taylor, a statistician now at Stanford, as part of SciPy under the name models. Eventually, models was removed from SciPy and became part of the NIPY neuroimaging project in order to mature. Improving the models code was later accepted as a SciPy-focused project for the Google Summer of Code 2009 and again in 2010. It is currently distributed as a SciKit, or add-on package for SciPy. The current main developers of statsmodels are trained as economists with a background in econometrics.

Taylor wrote the original code to implement analysis of brain imaging data.

Perktold 43 is an economist. He has a PhD in economics.

Conclusion

In very brief summary: the histories of these projects are striking in the clarity with which they describe practitioners starting with an analysis problem, then building the solution. I would say they give very strong evidence for the Practitioner Build model.

The analysis of contributor backgrounds is a little more ambiguous. There are some major contributors, particularly in Scipy, who do have a strong prior background in software engineering. But the large majority of top contributors, and of project founders, did not. A background in the physical sciences is common, and there are also many contributors with a math background.

Overall, these data strongly favor the Practitioner Build model. Unless something has changed, we have a difficult problem in trying to support the practitioners who will build the next generation of libraries. If these histories are still typical, by the time we find them, most of the work will have been done.


  1. Hugunin’s LinkedIn page↩︎

  2. Fulton’s LinkedIn page↩︎

  3. Ascher’s LinkedIn page↩︎

  4. DuBois’ web pages↩︎

  5. Hinsen’s ORCID page↩︎

  6. Fulton’s LinkedIn page↩︎

  7. Harris interview↩︎

  8. Oliphant’s LinkedIn page↩︎

  9. Cournapeau’s Wikipedia page↩︎

  10. Peterson’s web page and Peterson’s LinkedIn page↩︎

  11. Wieser’s LinkedIn page↩︎

  12. Virtanen’s LinkedIn page↩︎

  13. Wiebe’s LinkedIn page; it’s not clear whether he has a post-graduate degree. Bachelor majors listed in UManitoba page↩︎

  14. Reply to personal email.↩︎

  15. Gommers’ LinkedIn page↩︎

  16. From Picus’ YouTube presentation↩︎

  17. Wiebe’s LinkedIn page; it’s not clear whether he has a post-graduate degree. Bachelor majors listed in UManitoba page↩︎

  18. Virtanen’s LinkedIn page↩︎

  19. Gommers’ LinkedIn page↩︎

  20. CV via Burovski’s webpage↩︎

  21. Oliphant’s LinkedIn page↩︎

  22. Cournapeau’s Wikipedia page↩︎

  23. Archive of Weckesser’s webpage↩︎

  24. See biography in Griffing’s 2012 thesis↩︎

  25. Peterson’s web page and Peterson’s LinkedIn page↩︎

  26. Bell’s LinkedIn page↩︎

  27. Endolith’s Codementor page; he has a BSEE degree.↩︎

  28. Archive of Weckesser’s webpage↩︎

  29. Caswell’s LinkedIn page↩︎

  30. Droettboom’s LinkedIn page↩︎

  31. Personal email in 2005↩︎

  32. Firing’s work web page↩︎

  33. Lee’s ORCID page↩︎

  34. Klymak’s work pages CV↩︎

  35. Stansby’s ORCID page↩︎

  36. Sales de Andrade’s ORCID page↩︎

  37. Nielsen’s LinkedIn page↩︎

  38. Varoquaux’ LinkedIn page and BIDS biosketch.↩︎

  39. Droettboom’s LinkedIn page↩︎

  40. Varoquaux’ LinkedIn page and BIDS biosketch.↩︎

  41. Seabold’s LinkedIn page↩︎

  42. Perktolds’s LinkedIn page↩︎

  43. Perktolds’s LinkedIn page↩︎
