Trends in UniTartuCS theses

The Institute of Computer Science at the University of Tartu (UniTartuCS for short) has a (new) register for bachelor’s and master’s theses. Out of curiosity, I have done some data analysis on these theses and in this post I will present some results.

As of May 24, 2025,[1] the register contained 2563 theses in total. I have excluded some from the following analysis:

  • 8 theses from before 2010, because it seems like the data for those years is incomplete.

This leaves 2555 theses for the analysis. Note that the data for 2025 is not yet final.

Word processors used

The institute provides thesis templates for Microsoft Word and LaTeX. I wanted to find out how much each of them is used by the students.

This is complicated by the fact that theses are submitted as PDFs. Luckily, PDF file metadata contains two fields which give a lot of insight: PDF creator and PDF producer. Although the content of these fields is not standardized, it’s not as messy as web browser User-Agent headers (yet). Working with the data, I arrived at the following classification:

  • Microsoft Word if the PDF creator matches Microsoft®? (Office )?Word|Acrobat PDFMaker .* for Word.
  • TeX if the PDF creator contains TeX.
  • Google Docs if the PDF producer matches Google Docs|Skia/PDF.
  • LibreOffice if the PDF producer matches (Libre|Open)Office.
  • Quartz if the PDF creator contains Quartz PDFContext. These are somehow created by macOS, but it’s unclear to me how.
  • Print if the PDF producer matches Microsoft: Print To PDF|Foxit Reader (PDF )?Printer|PDF Printer. These are various PDF printers.
  • Unknown otherwise.
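
The rules above can be sketched as a small Python function. This is only an illustration, not the script used for the analysis: the function name classify_pdf is made up, the rules are assumed to be checked in the order listed (first match wins), and the metadata strings in the usage example are hypothetical.

```python
import re

# Patterns copied from the list above. Checking them in list order is an assumption.
WORD_RE = re.compile(r"Microsoft®? (Office )?Word|Acrobat PDFMaker .* for Word")
GOOGLE_DOCS_RE = re.compile(r"Google Docs|Skia/PDF")
OFFICE_RE = re.compile(r"(Libre|Open)Office")
PRINT_RE = re.compile(r"Microsoft: Print To PDF|Foxit Reader (PDF )?Printer|PDF Printer")


def classify_pdf(creator, producer):
    """Classify a thesis PDF from its creator and producer metadata strings."""
    creator = creator or ""    # either field may be missing from a PDF
    producer = producer or ""
    if WORD_RE.search(creator):
        return "Microsoft Word"
    if "TeX" in creator:
        return "TeX"
    if GOOGLE_DOCS_RE.search(producer):
        return "Google Docs"
    if OFFICE_RE.search(producer):
        return "LibreOffice"
    if "Quartz PDFContext" in creator:
        return "Quartz"
    if PRINT_RE.search(producer):
        return "Print"
    return "Unknown"


# Hypothetical metadata values, for illustration only.
print(classify_pdf("LaTeX with hyperref", "pdfTeX-1.40.21"))
print(classify_pdf(None, "Skia/PDF m115 Google Docs Renderer"))
```

Because the first matching rule wins, a PDF whose creator mentions Word is classified as Microsoft Word regardless of its producer field.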

Overall

First, let’s look at the overall word processor breakdown:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {
    "url": "/assets/2025-05-24-unitartucs-theses-blog.csv",
    "format": {
      "type": "csv",
      "parse": {
        "pdf_pages": "number"
      }
    }
  },
  "transform": [
    {
      "filter": "datum.defence_year >= 2010 && datum.defence_year <= 2025"
    },
    {
      "aggregate": [{
        "op": "count",
        "as": "count"
      }],
      "groupby": ["pdf_classification"]
    },
    {
      "joinaggregate": [{
        "op": "sum",
        "field": "count",
        "as": "total_count"
      }],
      "groupby": []
    },
    {
      "calculate": "datum.count / datum.total_count",
      "as": "fraction"
    },
    {
      "lookup": "pdf_classification",
      "from": {
        "data": {
          "values": [
            {"pdf_classification": "Microsoft Word", "classification_order": 0},
            {"pdf_classification": "LibreOffice", "classification_order": 1},
            {"pdf_classification": "Google Docs", "classification_order": 2},
            {"pdf_classification": "Quartz", "classification_order": 3},
            {"pdf_classification": "Print", "classification_order": 4},
            {"pdf_classification": "Unknown", "classification_order": 5},
            {"pdf_classification": "TeX", "classification_order": 6}
          ]
        },
        "key": "pdf_classification",
        "fields": ["classification_order"]
      }
    }
  ],
  "width": 400,
  "mark": "arc",
  "encoding": {
    "theta": {
      "field": "count",
      "type": "quantitative",
      "aggregate": "sum",
      "stack": "normalize",
      "title": "Theses"
    },
    "color": {
      "field": "pdf_classification",
      "title": "PDF creator"
    },
    "order": {
      "field": "classification_order"
    },
    "tooltip": [
      {
        "field": "fraction",
        "format": ".0%",
        "title": "Theses (of total)"
      },
      {
        "field": "count",
        "title": "Theses"
      }
    ]
  }
}

This shows that Microsoft Word is used slightly more than LaTeX. Notably, Word-like WYSIWYG editors make up the majority.
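
For readers unfamiliar with Vega-Lite transforms: the percentages in the chart’s tooltip come from an aggregate, joinaggregate and calculate pipeline. A rough Python equivalent might look like the sketch below. The function name classification_shares and the sample list are made up for illustration and are not the real register data.

```python
from collections import Counter


def classification_shares(classifications):
    """Return {classification: (count, fraction of all theses)}."""
    counts = Counter(classifications)   # "aggregate": count theses per classification
    total = sum(counts.values())        # "joinaggregate": total over all classifications
    # "calculate": fraction = count / total_count
    return {name: (n, n / total) for name, n in counts.items()}


# Made-up sample, NOT the real data.
sample = ["Microsoft Word", "TeX", "Microsoft Word", "Google Docs"]
shares = classification_shares(sample)
print(f"{shares['Microsoft Word'][1]:.0%}")  # prints 50%
```

The per-year and per-curriculum charts further down use the same idea, except that the total is computed within each year or curriculum rather than over all theses.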

Now, let’s dig a little deeper to see how the breakdown depends on the year and the curriculum.

By year

Second, let’s look at the word processor breakdown across the years:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {
    "url": "/assets/2025-05-24-unitartucs-theses-blog.csv",
    "format": {
      "type": "csv",
      "parse": {
        "pdf_pages": "number"
      }
    }
  },
  "transform": [
    {
      "filter": "datum.defence_year >= 2010 && datum.defence_year <= 2025"
    },
    {
      "aggregate": [{
        "op": "count",
        "as": "count"
      }],
      "groupby": ["pdf_classification", "defence_year"]
    },
    {
      "joinaggregate": [{
        "op": "sum",
        "field": "count",
        "as": "year_count"
      }],
      "groupby": ["defence_year"]
    },
    {
      "calculate": "datum.count / datum.year_count",
      "as": "year_fraction"
    },
    {
      "lookup": "pdf_classification",
      "from": {
        "data": {
          "values": [
            {"pdf_classification": "Microsoft Word", "classification_order": 0},
            {"pdf_classification": "LibreOffice", "classification_order": 1},
            {"pdf_classification": "Google Docs", "classification_order": 2},
            {"pdf_classification": "Quartz", "classification_order": 3},
            {"pdf_classification": "Print", "classification_order": 4},
            {"pdf_classification": "Unknown", "classification_order": 5},
            {"pdf_classification": "TeX", "classification_order": 6}
          ]
        },
        "key": "pdf_classification",
        "fields": ["classification_order"]
      }
    }
  ],
  "width": 725,
  "mark": "bar",
  "encoding": {
    "x": {
      "field": "defence_year",
      "title": "Year",
      "axis": {
        "labelAngle": 0
      }
    },
    "y": {
      "field": "count",
      "type": "quantitative",
      "stack": "normalize",
      "title": "Theses"
    },
    "color": {
      "field": "pdf_classification",
      "title": "PDF creator"
    },
    "order": {
      "field": "classification_order"
    },
    "tooltip": [
      {
        "field": "year_fraction",
        "format": ".0%",
        "title": "Theses (of year)"
      },
      {
        "field": "count",
        "title": "Theses"
      }
    ]
  }
}

This reveals two main trends:

  1. OpenOffice/LibreOffice usage has mostly diminished.
  2. Google Docs usage has become widespread.

The latter is worrying because Google Docs is (in my opinion) inadequate for typesetting a thesis. Having supervised and reviewed a number of theses (although relatively few compared to senior staff members), I often find it obvious from poor and inconsistent formatting that a thesis has been typeset in Google Docs. Importing the Microsoft Word template into Google Docs is a lossy conversion because Docs has limited features and customizability, even compared to Word.

By curriculum

Third, let’s look at the relationship between word processor usage and curricula:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {
    "url": "/assets/2025-05-24-unitartucs-theses-blog.csv",
    "format": {
      "type": "csv",
      "parse": {
        "pdf_pages": "number"
      }
    }
  },
  "transform": [
    {
      "filter": "datum.defence_year >= 2010 && datum.defence_year <= 2025"
    },
    {
      "lookup": "curriculum",
      "from": {
        "data": {
          "values": [
            {"curriculum": "bsc_computer_science", "curriculum_name": "BSc - Computer Science"},
            {"curriculum": "msc_computer_science", "curriculum_name": "MSc - Computer Science"},
            {"curriculum": "msc_software_engineering", "curriculum_name": "MSc - Software Engineering"},
            {"curriculum": "msc_data_science", "curriculum_name": "MSc - Data Science"},
            {"curriculum": "msc_data_science_exam", "curriculum_name": "MSc - Data Science (exam)"},
            {"curriculum": "msc_cyber_security", "curriculum_name": "MSc - Cyber Security"},
            {"curriculum": "msc_conversion_master_in_it", "curriculum_name": "MSc - Conversion Master in IT"},
            {"curriculum": "msc_innovation_and_technology_management", "curriculum_name": "MA - Innovation and Technology Management"},
            {"curriculum": "ma_maths_and_informatics_teacher", "curriculum_name": "MA - Teacher of Mathematics and Informatics"},
            {"curriculum": "other", "curriculum_name": "Other"}
          ]
        },
        "key": "curriculum",
        "fields": ["curriculum_name"]
      }
    }
  ],
  "width": {
    "step": 50
  },
  "height": {
    "step": 50
  },
  "mark": "rect",
  "encoding": {
    "x": {
      "field": "curriculum_name",
      "title": "Curriculum",
      "axis": {
        "labelAngle": -45
      },
      "sort": [
        "BSc - Computer Science",
        "MSc - Computer Science",
        "MSc - Software Engineering",
        "MSc - Cyber Security",
        "MSc - Data Science",
        "MSc - Data Science (exam)",
        "MSc - Conversion Master in IT",
        "MA - Innovation and Technology Management",
        "MA - Teacher of Mathematics and Informatics",
        "Other"
      ]
    },
    "y": {
      "field": "pdf_classification",
      "title": "PDF creator",
      "sort": [
        "TeX",
        "Unknown",
        "Print",
        "Quartz",
        "Google Docs",
        "LibreOffice",
        "Microsoft Word"
      ]
    },
    "color": {
      "aggregate": "count",
      "type": "quantitative",
      "scale": {
        "type": "log"
      },
      "title": "Theses"
    },
    "tooltip": {
      "aggregate": "count",
      "type": "quantitative"
    }
  }
}

Although the heatmap shows some trends, the logarithmic color scale[2] makes exact comparisons difficult. Thus, let’s look at the word processor breakdown across curricula in a different way:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {
    "url": "/assets/2025-05-24-unitartucs-theses-blog.csv",
    "format": {
      "type": "csv",
      "parse": {
        "pdf_pages": "number"
      }
    }
  },
  "transform": [
    {
      "filter": "datum.defence_year >= 2010 && datum.defence_year <= 2025"
    },
    {
      "aggregate": [{
        "op": "count",
        "as": "count"
      }],
      "groupby": ["pdf_classification", "curriculum"]
    },
    {
      "joinaggregate": [{
        "op": "sum",
        "field": "count",
        "as": "curriculum_count"
      }],
      "groupby": ["curriculum"]
    },
    {
      "calculate": "datum.count / datum.curriculum_count",
      "as": "curriculum_fraction"
    },
    {
      "lookup": "pdf_classification",
      "from": {
        "data": {
          "values": [
            {"pdf_classification": "Microsoft Word", "classification_order": 0},
            {"pdf_classification": "LibreOffice", "classification_order": 1},
            {"pdf_classification": "Google Docs", "classification_order": 2},
            {"pdf_classification": "Quartz", "classification_order": 3},
            {"pdf_classification": "Print", "classification_order": 4},
            {"pdf_classification": "Unknown", "classification_order": 5},
            {"pdf_classification": "TeX", "classification_order": 6}
          ]
        },
        "key": "pdf_classification",
        "fields": ["classification_order"]
      }
    },
    {
      "lookup": "curriculum",
      "from": {
        "data": {
          "values": [
            {"curriculum": "bsc_computer_science", "curriculum_name": "BSc - Computer Science"},
            {"curriculum": "msc_computer_science", "curriculum_name": "MSc - Computer Science"},
            {"curriculum": "msc_software_engineering", "curriculum_name": "MSc - Software Engineering"},
            {"curriculum": "msc_data_science", "curriculum_name": "MSc - Data Science"},
            {"curriculum": "msc_data_science_exam", "curriculum_name": "MSc - Data Science (exam)"},
            {"curriculum": "msc_cyber_security", "curriculum_name": "MSc - Cyber Security"},
            {"curriculum": "msc_conversion_master_in_it", "curriculum_name": "MSc - Conversion Master in IT"},
            {"curriculum": "msc_innovation_and_technology_management", "curriculum_name": "MA - Innovation and Technology Management"},
            {"curriculum": "ma_maths_and_informatics_teacher", "curriculum_name": "MA - Teacher of Mathematics and Informatics"},
            {"curriculum": "other", "curriculum_name": "Other"}
          ]
        },
        "key": "curriculum",
        "fields": ["curriculum_name"]
      }
    }
  ],
  "width": 725,
  "mark": "bar",
  "encoding": {
    "x": {
      "field": "curriculum_name",
      "title": "Curriculum",
      "axis": {
        "labelAngle": -45
      },
      "sort": [
        "BSc - Computer Science",
        "MSc - Computer Science",
        "MSc - Software Engineering",
        "MSc - Cyber Security",
        "MSc - Data Science",
        "MSc - Data Science (exam)",
        "MSc - Conversion Master in IT",
        "MA - Innovation and Technology Management",
        "MA - Teacher of Mathematics and Informatics",
        "Other"
      ]
    },
    "y": {
      "field": "count",
      "type": "quantitative",
      "stack": "normalize",
      "title": "Theses"
    },
    "color": {
      "field": "pdf_classification",
      "title": "PDF creator"
    },
    "order": {
      "field": "classification_order"
    },
    "tooltip": [
      {
        "field": "curriculum_fraction",
        "format": ".0%",
        "title": "Theses (of curriculum)"
      },
      {
        "field": "count",
        "title": "Theses"
      }
    ]
  }
}

This reveals the following:

  1. Over 70% of “BSc - Computer Science” students use Word-like WYSIWYG editors and only 21% use LaTeX.
  2. LaTeX is most popular among “MSc - Computer Science” and “MSc - Data Science” students. This is probably motivated by the need to typeset more mathematics or do more data visualization.
  3. LaTeX usage is (almost) nonexistent among “MSc - Conversion Master in IT” and “MA - Teacher of Mathematics and Informatics” students. This is probably because students in those curricula are less technical.

Page count by curriculum

There’s another piece of PDF file metadata that can be analyzed: PDF page count. It only makes sense to consider curricula separately for this because the expected page counts (set by the guidelines) differ:

  • Bachelor’s theses should be ~20 pages (excluding appendices).
  • Master’s theses should be 40-50 pages (excluding appendices).

Since the PDF files also contain the appendices, the PDF page count can be expected to be higher.

I’ve additionally excluded the following outliers from the plots below:

  • A 373-page thesis, because it would screw with the scale of the plots.
  • All theses with under 11 pages, because these appear to be abstracts for theses with publishing restrictions and would skew the results.
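
As a rough illustration of the filtering above and of the quartiles that a box plot summarizes, here is a minimal Python sketch. The function name page_quartiles and the (curriculum, pdf_pages) input format are assumptions, and the quantile method chosen here may differ slightly from the one Vega-Lite uses.

```python
from collections import defaultdict
from statistics import quantiles


def page_quartiles(theses):
    """Map each curriculum to (Q1, median, Q3) of its PDF page counts.

    `theses` is an iterable of (curriculum, pdf_pages) pairs.
    """
    pages_by_curriculum = defaultdict(list)
    for curriculum, pages in theses:
        # Same exclusions as above: the 373-page outlier and sub-11-page abstracts.
        if pages != 373 and pages >= 11:
            pages_by_curriculum[curriculum].append(pages)
    return {
        curriculum: tuple(quantiles(pages, n=4, method="inclusive"))
        for curriculum, pages in pages_by_curriculum.items()
        if len(pages) >= 2  # skip curricula with fewer than two theses left
    }


# Made-up page counts, NOT the real data.
sample = [("bsc", 20), ("bsc", 30), ("bsc", 40), ("bsc", 50), ("bsc", 60), ("bsc", 5)]
print(page_quartiles(sample))
```

Curricula with fewer than two remaining theses are skipped in this sketch, since quartiles are not meaningful for them.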

Overall

First, let’s look at the thesis page count statistics by curricula:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {
    "url": "/assets/2025-05-24-unitartucs-theses-blog.csv",
    "format": {
      "type": "csv",
      "parse": {
        "pdf_pages": "number"
      }
    }
  },
  "transform": [
    {
      "filter": "datum.defence_year >= 2010 && datum.defence_year <= 2025"
    },
    {
      "filter": "datum.pdf_pages != 373 && datum.pdf_pages >= 11"
    },
    {
      "lookup": "curriculum",
      "from": {
        "data": {
          "values": [
            {"curriculum": "bsc_computer_science", "curriculum_name": "BSc - Computer Science"},
            {"curriculum": "msc_computer_science", "curriculum_name": "MSc - Computer Science"},
            {"curriculum": "msc_software_engineering", "curriculum_name": "MSc - Software Engineering"},
            {"curriculum": "msc_data_science", "curriculum_name": "MSc - Data Science"},
            {"curriculum": "msc_data_science_exam", "curriculum_name": "MSc - Data Science (exam)"},
            {"curriculum": "msc_cyber_security", "curriculum_name": "MSc - Cyber Security"},
            {"curriculum": "msc_conversion_master_in_it", "curriculum_name": "MSc - Conversion Master in IT"},
            {"curriculum": "msc_innovation_and_technology_management", "curriculum_name": "MA - Innovation and Technology Management"},
            {"curriculum": "ma_maths_and_informatics_teacher", "curriculum_name": "MA - Teacher of Mathematics and Informatics"},
            {"curriculum": "other", "curriculum_name": "Other"}
          ]
        },
        "key": "curriculum",
        "fields": ["curriculum_name"]
      }
    }
  ],
  "width": 725,
  "height": 350,
  "mark": {
    "type": "boxplot",
    "size": 30,
    "ticks": true
  },
  "encoding": {
    "x": {
      "field": "curriculum_name",
      "title": "Curriculum",
      "axis": {
        "labelAngle": -45
      },
      "sort": [
        "BSc - Computer Science",
        "MSc - Computer Science",
        "MSc - Software Engineering",
        "MSc - Cyber Security",
        "MSc - Data Science",
        "MSc - Data Science (exam)",
        "MSc - Conversion Master in IT",
        "MA - Innovation and Technology Management",
        "MA - Teacher of Mathematics and Informatics",
        "Other"
      ]
    },
    "y": {
      "field": "pdf_pages",
      "type": "quantitative",
      "title": "Pages"
    },
    "tooltip": {
      "field": "pdf_pages",
      "type": "quantitative"
    }
  }
}

By year

Second, let’s look at the average thesis page count across the years, still by curriculum (click on a curriculum name in the legend for a more focused view):

{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {
    "url": "/assets/2025-05-24-unitartucs-theses-blog.csv",
    "format": {
      "type": "csv",
      "parse": {
        "pdf_pages": "number"
      }
    }
  },
  "transform": [
    {
      "filter": "datum.defence_year >= 2010 && datum.defence_year <= 2025"
    },
    {
      "filter": "datum.pdf_pages != 373 && datum.pdf_pages >= 11"
    },
    {
      "lookup": "curriculum",
      "from": {
        "data": {
          "values": [
            {"curriculum": "bsc_computer_science", "curriculum_name": "BSc - Computer Science"},
            {"curriculum": "msc_computer_science", "curriculum_name": "MSc - Computer Science"},
            {"curriculum": "msc_software_engineering", "curriculum_name": "MSc - Software Engineering"},
            {"curriculum": "msc_data_science", "curriculum_name": "MSc - Data Science"},
            {"curriculum": "msc_data_science_exam", "curriculum_name": "MSc - Data Science (exam)"},
            {"curriculum": "msc_cyber_security", "curriculum_name": "MSc - Cyber Security"},
            {"curriculum": "msc_conversion_master_in_it", "curriculum_name": "MSc - Conversion Master in IT"},
            {"curriculum": "msc_innovation_and_technology_management", "curriculum_name": "MA - Innovation and Technology Management"},
            {"curriculum": "ma_maths_and_informatics_teacher", "curriculum_name": "MA - Teacher of Mathematics and Informatics"},
            {"curriculum": "other", "curriculum_name": "Other"}
          ]
        },
        "key": "curriculum",
        "fields": ["curriculum_name"]
      }
    }
  ],
  "width": 650,
  "height": 350,
  "mark": {
    "type": "line",
    "point": true
  },
  "params": [{
    "name": "curriculum_name",
    "select": {"type": "point", "fields": ["curriculum_name"]},
    "bind": "legend"
  }],
  "encoding": {
    "x": {
      "field": "defence_year",
      "title": "Year",
      "axis": {
        "labelAngle": 0
      }
    },
    "y": {
      "field": "pdf_pages",
      "type": "quantitative",
      "aggregate": "average",
      "title": "Pages (average)"
    },
    "color": {
      "field": "curriculum_name",
      "type": "nominal",
      "title": "Curriculum",
      "sort": [
        "BSc - Computer Science",
        "MSc - Computer Science",
        "MSc - Software Engineering",
        "MSc - Cyber Security",
        "MSc - Data Science",
        "MSc - Data Science (exam)",
        "MSc - Conversion Master in IT",
        "MA - Innovation and Technology Management",
        "MA - Teacher of Mathematics and Informatics",
        "Other"
      ]
    },
    "tooltip": {
      "field": "pdf_pages",
      "type": "quantitative",
      "aggregate": "average",
      "format": ".1f"
    },
    "opacity": {
      "condition": {"param": "curriculum_name", "value": 1},
      "value": 0.2
    }
  }
}

  1. The post has been updated with 2025 data: initially, the data was as of March 16, 2025 and excluded theses from 2025. None of the findings changed with the update.

  2. The logarithmic scale is necessary because the thesis counts differ by orders of magnitude. A linear color scale would be dominated by the few most popular combinations.
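
To illustrate the point of the second footnote, the following Python sketch compares where a count lands on a color ramp under a linear and a logarithmic scale. The function name color_position and the counts are hypothetical, and the scale domain is assumed to run from 1 to max_count with max_count greater than 1.

```python
import math


def color_position(count, max_count, scale="linear"):
    """Map a thesis count to a position in [0, 1] on a color ramp.

    Assumes the scale domain is [1, max_count] and max_count > 1.
    """
    if scale == "log":
        return math.log(count) / math.log(max_count)
    return (count - 1) / (max_count - 1)


# Hypothetical counts, chosen only to show the effect.
for count in (1, 10, 100, 1000):
    linear = color_position(count, 1000)
    log = color_position(count, 1000, "log")
    print(f"{count:>4}  linear={linear:.3f}  log={log:.3f}")
```

On the linear scale, counts of 1 and 10 are nearly indistinguishable next to 1000, while the logarithmic scale spreads them evenly across the ramp.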