I tried to estimate the extent to which different countries contribute to the foundations of scientific Python.

To do this I analyzed commits and GitHub data for the following set of libraries:

  • Numpy
  • Scipy
  • Matplotlib
  • Scikit-learn
  • Scikit-image
  • Statsmodels
  • Pandas
  • h5py
  • Cython
  • Sympy

I chose these libraries because they are common building blocks for more specialized libraries and workflows.

I made some effort to attribute library commits to authors, and then authors to countries, to give commits per country, across all these libraries.

I then divided the number of commits by the population in the country, in millions, to give a measure of the extent to which a country is pulling its weight, relative to its resources.

Here are the top 10 countries by number of commits, ordered by commits per million of population.

Country                     Commits   Population (millions)   Commits/million
Finland                        4607                     5.5             831.6
Canada                        12533                    37.0             339.2
Switzerland                    2676                     8.5             313.3
United States of America      80336                   326.8             245.8
France                        10755                    65.2             164.9
Germany                       13377                    82.3             162.6
United Kingdom                 5523                    66.6              83.0
Japan                          3742                   127.2              29.4
Russian Federation             3572                   144.0              24.8
India                          5214                  1354.0               3.9
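
As a minimal sketch of the calculation (this is not the code from the analysis repository), the snippet below divides commits by population in millions for each row of the table above and sorts by the result. The names `ROWS`, `commits_per_million` and `ranked` are my own for illustration. The populations here are the rounded values shown in the table, so some results differ slightly from the published column, which was presumably computed from unrounded populations.

```python
# (country, commits, population in millions), copied from the table above.
ROWS = [
    ("Finland", 4607, 5.5),
    ("Canada", 12533, 37.0),
    ("Switzerland", 2676, 8.5),
    ("United States of America", 80336, 326.8),
    ("France", 10755, 65.2),
    ("Germany", 13377, 82.3),
    ("United Kingdom", 5523, 66.6),
    ("Japan", 3742, 127.2),
    ("Russian Federation", 3572, 144.0),
    ("India", 5214, 1354.0),
]


def commits_per_million(commits, population_millions):
    """Return commits divided by population (in millions)."""
    return commits / population_millions


# Pair each country with its rate, then order from highest to lowest rate.
ranked = sorted(
    ((country, commits_per_million(commits, pop)) for country, commits, pop in ROWS),
    key=lambda pair: pair[1],
    reverse=True,
)

for country, rate in ranked:
    print(f"{country:<26}{rate:8.1f}")
```

With the rounded populations the ordering of countries is the same as in the table, even though individual rates shift a little.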

Method

The code that generates the data in this post is at https://github.com/matthew-brett/github-places.

The analysis steps were:

  • Clone all library repositories as submodules of the github-places analysis repository, above.
  • Use git shortlog to get names, number of commits and other details of significant contributors to each repository. I defined a significant contributor (SC) as an author of 25 commits or more. As you will see below, SC commits account for about 88.5% of all repository commits.
  • Use various heuristics to identify the GitHub user corresponding to each SC (see find_gh_users.py in the analysis repository). When the heuristics failed, I found the GitHub user manually. I had to do this for about 10% of SCs (see below).
  • Using the SC GitHub user profile, or other research, identify the country in which the SC is currently based. I looked for the current location because this is the only information I have from their GitHub profiles. I had to identify the user’s country by doing web research for about 35% of SC GitHub users (see below).

The results of these steps come from the output of repo_analysis.py in the analysis repository.
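
To make the shortlog step concrete, here is a rough sketch of how one might find significant contributors and the share of commits they cover. It is not the code in the analysis repository. It assumes the summary output of `git shortlog -sne HEAD` (a commit count, a tab, then the author name and email), and the function names and the sample text are made up for illustration.

```python
import subprocess

SC_THRESHOLD = 25  # minimum commits to count as a significant contributor (SC)


def shortlog(repo_path):
    """Return the text output of `git shortlog -sne HEAD` for a repository."""
    result = subprocess.run(
        ["git", "-C", repo_path, "shortlog", "-sne", "HEAD"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_shortlog(text):
    """Parse shortlog summary lines into (commits, author) pairs."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        count, author = line.strip().split("\t", 1)
        entries.append((int(count), author))
    return entries


def significant_contributors(entries, threshold=SC_THRESHOLD):
    """Keep only authors with at least `threshold` commits."""
    return [(n, author) for n, author in entries if n >= threshold]


def coverage_percent(entries, threshold=SC_THRESHOLD):
    """Percentage of all commits that were made by significant contributors."""
    total = sum(n for n, _ in entries)
    kept = sum(n for n, _ in significant_contributors(entries, threshold))
    return 100 * kept / total if total else 0.0


# Made-up sample in the shortlog summary format, not real repository data.
sample = (
    "   120\tAda Example <ada@example.org>\n"
    "    30\tBob Example <bob@example.org>\n"
    "    10\tCy Example <cy@example.org>\n"
)
entries = parse_shortlog(sample)
print(significant_contributors(entries))
print(coverage_percent(entries))
```

For a real repository you would call `parse_shortlog(shortlog("path/to/repo"))`. One author can appear under several names or emails unless the repository has a `.mailmap` file, so the raw counts usually need some cleaning before they can be trusted.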

Repo analysis results

I could not find the location for 1.4% of SCs, who account for 0.2% of SC commits.

For 10.2% of SCs, I had to work out the corresponding GitHub user manually.

I had to identify the locations of 34.8% of SC GitHub users manually by various searches.
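
To illustrate why so many locations needed manual work, here is a hedged sketch of an automatic lookup. It is not the method used in the analysis repository. It assumes the public GitHub REST endpoint `https://api.github.com/users/{login}`, whose JSON response has a free-text `location` field that may be empty. The `PLACE_TO_COUNTRY` table and both function names are hypothetical, and a real table would need to be far larger.

```python
import json
import urllib.request

# Hypothetical lookup: lower-case place fragment -> country name.
PLACE_TO_COUNTRY = {
    "finland": "Finland",
    "helsinki": "Finland",
    "canada": "Canada",
    "france": "France",
    "paris": "France",
    "usa": "United States of America",
    "berkeley": "United States of America",
}


def fetch_location(login):
    """Return the free-text location from a GitHub profile, or None if unset."""
    request = urllib.request.Request(
        f"https://api.github.com/users/{login}",
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        profile = json.load(response)
    return profile.get("location")


def guess_country(location):
    """Guess a country from text like 'Helsinki, Finland'; None if unknown."""
    if not location:
        return None
    parts = [part.strip().lower() for part in location.split(",")]
    # Check the last fragment first, because the country usually comes last.
    for part in reversed(parts):
        if part in PLACE_TO_COUNTRY:
            return PLACE_TO_COUNTRY[part]
    return None


print(guess_country("Helsinki, Finland"))
print(guess_country("Berkeley, CA"))
print(guess_country("Earth"))
```

Profiles with an empty location, or with text that the table does not cover, return `None` here, and these are the cases that need a manual search.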

Percentage of total repository commits included in the analysis, by repository:

  • numpy: 88.2
  • scipy: 87.7
  • matplotlib: 89.4
  • scikit-learn: 85.1
  • scikit-image: 89.3
  • statsmodels: 94.4
  • pandas: 81.0
  • h5py: 83.1
  • cython: 93.0
  • sympy: 90.8

Overall percentage: 88.5

Commits by country analysis

The table at the top of this post is the output from commit_analysis.py in the analysis repository.
