Trends in UniTartuCS theses
The Institute of Computer Science at the University of Tartu (UniTartuCS for short) has a (new) register for bachelor’s and master’s theses. Out of curiosity, I have done some data analysis on these theses and in this post I will present some results.
As of March 16, 2025, the register contained 2292 theses in total. I have excluded some from the following analysis:
- 14 theses from 2025, because the main defence season is still ahead.
- 8 theses from before 2010, because it seems like the data for those years is incomplete.
- 8 theses for the “MSc - Data Science (exam)” curriculum, because they aren’t really theses, but 4-page abstracts for capstone projects.
This leaves 2262 theses for the analysis.
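As a quick sanity check, the exclusions can be written down as a small Python sketch. The column names `defence_year` and `curriculum` and the `msc_data_science_exam` value are the ones used in the chart specs in this post; representing each thesis as a dict is only an assumption made for illustration.

```python
def is_included(row: dict) -> bool:
    """Return True if a thesis record takes part in the analysis.

    Assumes (for illustration) that each record is a dict with the
    `defence_year` and `curriculum` columns used by the charts.
    """
    year = int(row["defence_year"])
    return 2010 <= year <= 2024 and row["curriculum"] != "msc_data_science_exam"


# The arithmetic behind the totals quoted above.
total = 2292
excluded = 14 + 8 + 8  # from 2025, from before 2010, Data Science (exam)
remaining = total - excluded
print(remaining)  # 2262
```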
Word processors used
The institute provides thesis templates for Microsoft Word and LaTeX. I wanted to find out how much each of them is used by the students.
This is complicated by the fact that theses are submitted as PDFs. Luckily, PDF file metadata contains two fields which give a lot of insight: PDF creator and PDF producer. Although the content of these fields is not standardized, it’s not as messy as web browser User-Agent headers (yet). Working with the data, I reached the following classification:
- Microsoft Word if the PDF creator matches `Microsoft®? (Office )?Word`.
- TeX if the PDF creator contains `TeX`.
- Google Docs if the PDF producer matches `Google Docs|Skia/PDF`.
- LibreOffice if the PDF producer matches `(Libre|Open)Office`.
- Quartz if the PDF creator contains `Quartz PDFContext`. These are somehow created by macOS, but it’s unclear to me how.
- Print if the PDF producer matches `Microsoft: Print To PDF|Foxit Reader (PDF )?Printer|PDF Printer`. These are various PDF printers.
- Unknown otherwise.
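The rules above could be sketched in Python roughly as follows. The patterns are copied from the list. Everything else is an assumption: that the rules are tried in the listed order with the first hit winning, that "matches" means a regex search anywhere in the field, and that a missing field counts as an empty string. The function name `classify` is hypothetical.

```python
import re
from typing import Optional

# Patterns copied from the classification list; search semantics are assumed.
WORD = re.compile(r"Microsoft®? (Office )?Word")
GOOGLE_DOCS = re.compile(r"Google Docs|Skia/PDF")
OFFICE = re.compile(r"(Libre|Open)Office")
PRINTER = re.compile(r"Microsoft: Print To PDF|Foxit Reader (PDF )?Printer|PDF Printer")


def classify(creator: Optional[str], producer: Optional[str]) -> str:
    """Classify a PDF by its creator and producer metadata fields.

    Assumption: rules are checked in the listed order, first match wins.
    """
    creator = creator or ""
    producer = producer or ""
    if WORD.search(creator):
        return "Microsoft Word"
    if "TeX" in creator:
        return "TeX"
    if GOOGLE_DOCS.search(producer):
        return "Google Docs"
    if OFFICE.search(producer):
        return "LibreOffice"
    if "Quartz PDFContext" in creator:
        return "Quartz"
    if PRINTER.search(producer):
        return "Print"
    return "Unknown"


# Illustrative metadata values, not taken from the register.
print(classify("LaTeX with hyperref", "pdfTeX-1.40.21"))
print(classify("Writer", "LibreOffice 7.3"))
```

Because "LaTeX" contains "TeX", the substring check also catches creators such as "LaTeX with hyperref".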
Overall
First, let’s look at the overall word processor breakdown:
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"data": {
"url": "/assets/2025-03-16-unitartucs-theses-blog.csv",
"format": {
"type": "csv",
"parse": {
"pdf_pages": "number"
}
}
},
"transform": [
{
"filter": "datum.defence_year >= 2010 && datum.defence_year <= 2024"
},
{
"filter": "datum.curriculum != \"msc_data_science_exam\""
},
{
"aggregate": [{
"op": "count",
"as": "count"
}],
"groupby": ["Classification"]
},
{
"joinaggregate": [{
"op": "sum",
"field": "count",
"as": "total_count"
}],
"groupby": []
},
{
"calculate": "datum.count / datum.total_count",
"as": "fraction"
},
{
"lookup": "Classification",
"from": {
"data": {
"values": [
{"Classification": "Microsoft Word", "classification_order": 0},
{"Classification": "LibreOffice", "classification_order": 1},
{"Classification": "Google Docs", "classification_order": 2},
{"Classification": "Quartz", "classification_order": 3},
{"Classification": "Print", "classification_order": 4},
{"Classification": "Unknown", "classification_order": 5},
{"Classification": "TeX", "classification_order": 6}
]
},
"key": "Classification",
"fields": ["classification_order"]
}
}
],
"width": 400,
"mark": "arc",
"encoding": {
"theta": {
"field": "count",
"type": "quantitative",
"aggregate": "sum",
"stack": "normalize",
"title": "Theses"
},
"color": {
"field": "Classification",
"title": "PDF creator"
},
"order": {
"field": "classification_order"
},
"tooltip": [
{
"field": "fraction",
"format": ".0%",
"title": "Theses (of total)"
},
{
"field": "count",
"title": "Theses"
}
]
}
}
This shows that Microsoft Word is used slightly more than LaTeX. Notably, Word-like WYSIWYG editors make up the majority.
Now, let’s dig a little deeper to see how the breakdown depends on the year and the curriculum.
By year
Second, let’s look at the word processor breakdown across the years:
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"data": {
"url": "/assets/2025-03-16-unitartucs-theses-blog.csv",
"format": {
"type": "csv",
"parse": {
"pdf_pages": "number"
}
}
},
"transform": [
{
"filter": "datum.defence_year >= 2010 && datum.defence_year <= 2024"
},
{
"filter": "datum.curriculum != \"msc_data_science_exam\""
},
{
"aggregate": [{
"op": "count",
"as": "count"
}],
"groupby": ["Classification", "defence_year"]
},
{
"joinaggregate": [{
"op": "sum",
"field": "count",
"as": "year_count"
}],
"groupby": ["defence_year"]
},
{
"calculate": "datum.count / datum.year_count",
"as": "year_fraction"
},
{
"lookup": "Classification",
"from": {
"data": {
"values": [
{"Classification": "Microsoft Word", "classification_order": 0},
{"Classification": "LibreOffice", "classification_order": 1},
{"Classification": "Google Docs", "classification_order": 2},
{"Classification": "Quartz", "classification_order": 3},
{"Classification": "Print", "classification_order": 4},
{"Classification": "Unknown", "classification_order": 5},
{"Classification": "TeX", "classification_order": 6}
]
},
"key": "Classification",
"fields": ["classification_order"]
}
}
],
"width": 725,
"mark": "bar",
"encoding": {
"x": {
"field": "defence_year",
"title": "Year",
"axis": {
"labelAngle": 0
}
},
"y": {
"field": "count",
"type": "quantitative",
"stack": "normalize",
"title": "Theses"
},
"color": {
"field": "Classification",
"title": "PDF creator"
},
"order": {
"field": "classification_order"
},
"tooltip": [
{
"field": "year_fraction",
"format": ".0%",
"title": "Theses (of year)"
},
{
"field": "count",
"title": "Theses"
}
]
}
}
This reveals two main trends:
- OpenOffice/LibreOffice usage has largely died out.
- Google Docs usage has become widespread.
The latter is worrying because Google Docs is (in my opinion) inadequate for typesetting a thesis. Having supervised and reviewed a number of theses (although relatively few compared to senior staff members), I often find it obvious from the poor and inconsistent formatting that a thesis has been typeset in Google Docs. Importing the Microsoft Word template into Google Docs is a lossy conversion because Docs has limited features and customizability, even compared to Word.
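The per-year percentages in the chart tooltips come from counting theses per (year, classification) pair and dividing by the year’s total, which is what the `aggregate` and `joinaggregate` transforms do. A minimal Python sketch of the same computation follows. The function name `shares_by_group` is hypothetical and the rows are made up for illustration, not taken from the register.

```python
from collections import Counter


def shares_by_group(rows, group_key, class_key="Classification"):
    """Map (group, classification) to that classification's share within the group."""
    pair_counts = Counter((row[group_key], row[class_key]) for row in rows)
    group_totals = Counter(row[group_key] for row in rows)
    return {
        (group, cls): count / group_totals[group]
        for (group, cls), count in pair_counts.items()
    }


# Made-up rows, only to show the shape of the computation.
rows = [
    {"defence_year": 2023, "Classification": "TeX"},
    {"defence_year": 2023, "Classification": "Google Docs"},
    {"defence_year": 2023, "Classification": "Microsoft Word"},
    {"defence_year": 2023, "Classification": "Microsoft Word"},
    {"defence_year": 2024, "Classification": "TeX"},
]
shares = shares_by_group(rows, "defence_year")
print(shares[(2023, "Microsoft Word")])  # 0.5
```

Passing `"curriculum"` as the group key would give the per-curriculum shares in the same way.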
By curriculum
Third, let’s look at the relationship between word processor usage and curricula:
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"data": {
"url": "/assets/2025-03-16-unitartucs-theses-blog.csv",
"format": {
"type": "csv",
"parse": {
"pdf_pages": "number"
}
}
},
"transform": [
{
"filter": "datum.defence_year >= 2010 && datum.defence_year <= 2024"
},
{
"filter": "datum.curriculum != \"msc_data_science_exam\""
},
{
"lookup": "curriculum",
"from": {
"data": {
"values": [
{"curriculum": "bsc_computer_science", "curriculum_name": "BSc - Computer Science"},
{"curriculum": "msc_computer_science", "curriculum_name": "MSc - Computer Science"},
{"curriculum": "msc_software_engineering", "curriculum_name": "MSc - Software Engineering"},
{"curriculum": "msc_data_science", "curriculum_name": "MSc - Data Science"},
{"curriculum": "msc_data_science_exam", "curriculum_name": "MSc - Data Science (exam)"},
{"curriculum": "msc_cyber_security", "curriculum_name": "MSc - Cyber Security"},
{"curriculum": "msc_conversion_master_in_it", "curriculum_name": "MSc - Conversion Master in IT"},
{"curriculum": "msc_innovation_and_technology_management", "curriculum_name": "MA - Innovation and Technology Management"},
{"curriculum": "ma_maths_and_informatics_teacher", "curriculum_name": "MA - Teacher of Mathematics and Informatics"},
{"curriculum": "other", "curriculum_name": "Other"}
]
},
"key": "curriculum",
"fields": ["curriculum_name"]
}
}
],
"width": {
"step": 50
},
"height": {
"step": 50
},
"mark": "rect",
"encoding": {
"x": {
"field": "curriculum_name",
"title": "Curriculum",
"axis": {
"labelAngle": -45
},
"sort": [
"BSc - Computer Science",
"MSc - Computer Science",
"MSc - Software Engineering",
"MSc - Cyber Security",
"MSc - Data Science",
"MSc - Conversion Master in IT",
"MA - Innovation and Technology Management",
"MA - Teacher of Mathematics and Informatics",
"Other"
]
},
"y": {
"field": "Classification",
"title": "PDF creator",
"sort": [
"TeX",
"Unknown",
"Print",
"Quartz",
"Google Docs",
"LibreOffice",
"Microsoft Word"
]
},
"color": {
"aggregate": "count",
"type": "quantitative",
"scale": {
"type": "log"
},
"title": "Theses"
},
"tooltip": {
"aggregate": "count",
"type": "quantitative"
}
}
}
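The heatmap above colors its cells on a logarithmic scale. As a rough illustration of what that changes, the following Python sketch compares where a count lands on a linear versus a logarithmic scale, expressed as a position between 0 and 1. The counts are made up, the helper names are hypothetical, and the actual mapping from position to color is left to Vega-Lite.

```python
import math


def linear_position(value, lo, hi):
    """Position of value between lo and hi on a linear scale (0 to 1)."""
    return (value - lo) / (hi - lo)


def log_position(value, lo, hi):
    """Position of value between lo and hi on a logarithmic scale (0 to 1)."""
    return (math.log10(value) - math.log10(lo)) / (math.log10(hi) - math.log10(lo))


# Made-up thesis counts spanning several orders of magnitude.
counts = [1, 10, 100, 1000]
for count in counts:
    lin = linear_position(count, min(counts), max(counts))
    log = log_position(count, min(counts), max(counts))
    print(f"{count:>5}  linear={lin:.3f}  log={log:.3f}")
```

On the linear scale, the three smaller counts are squeezed into the bottom tenth of the range, while the logarithmic scale spaces them evenly.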
Although the heatmap shows some trends, the logarithmic color scale [1] makes exact comparisons difficult. Thus, let’s look at the word processor breakdown across curricula in a different way:
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"data": {
"url": "/assets/2025-03-16-unitartucs-theses-blog.csv",
"format": {
"type": "csv",
"parse": {
"pdf_pages": "number"
}
}
},
"transform": [
{
"filter": "datum.defence_year >= 2010 && datum.defence_year <= 2024"
},
{
"filter": "datum.curriculum != \"msc_data_science_exam\""
},
{
"aggregate": [{
"op": "count",
"as": "count"
}],
"groupby": ["Classification", "curriculum"]
},
{
"joinaggregate": [{
"op": "sum",
"field": "count",
"as": "curriculum_count"
}],
"groupby": ["curriculum"]
},
{
"calculate": "datum.count / datum.curriculum_count",
"as": "curriculum_fraction"
},
{
"lookup": "Classification",
"from": {
"data": {
"values": [
{"Classification": "Microsoft Word", "classification_order": 0},
{"Classification": "LibreOffice", "classification_order": 1},
{"Classification": "Google Docs", "classification_order": 2},
{"Classification": "Quartz", "classification_order": 3},
{"Classification": "Print", "classification_order": 4},
{"Classification": "Unknown", "classification_order": 5},
{"Classification": "TeX", "classification_order": 6}
]
},
"key": "Classification",
"fields": ["classification_order"]
}
},
{
"lookup": "curriculum",
"from": {
"data": {
"values": [
{"curriculum": "bsc_computer_science", "curriculum_name": "BSc - Computer Science"},
{"curriculum": "msc_computer_science", "curriculum_name": "MSc - Computer Science"},
{"curriculum": "msc_software_engineering", "curriculum_name": "MSc - Software Engineering"},
{"curriculum": "msc_data_science", "curriculum_name": "MSc - Data Science"},
{"curriculum": "msc_data_science_exam", "curriculum_name": "MSc - Data Science (exam)"},
{"curriculum": "msc_cyber_security", "curriculum_name": "MSc - Cyber Security"},
{"curriculum": "msc_conversion_master_in_it", "curriculum_name": "MSc - Conversion Master in IT"},
{"curriculum": "msc_innovation_and_technology_management", "curriculum_name": "MA - Innovation and Technology Management"},
{"curriculum": "ma_maths_and_informatics_teacher", "curriculum_name": "MA - Teacher of Mathematics and Informatics"},
{"curriculum": "other", "curriculum_name": "Other"}
]
},
"key": "curriculum",
"fields": ["curriculum_name"]
}
}
],
"width": 725,
"mark": "bar",
"encoding": {
"x": {
"field": "curriculum_name",
"title": "Curriculum",
"axis": {
"labelAngle": -45
},
"sort": [
"BSc - Computer Science",
"MSc - Computer Science",
"MSc - Software Engineering",
"MSc - Cyber Security",
"MSc - Data Science",
"MSc - Conversion Master in IT",
"MA - Innovation and Technology Management",
"MA - Teacher of Mathematics and Informatics",
"Other"
]
},
"y": {
"field": "count",
"type": "quantitative",
"stack": "normalize",
"title": "Theses"
},
"color": {
"field": "Classification",
"title": "PDF creator"
},
"order": {
"field": "classification_order"
},
"tooltip": [
{
"field": "curriculum_fraction",
"format": ".0%",
"title": "Theses (of curriculum)"
},
{
"field": "count",
"title": "Theses"
}
]
}
}
This reveals the following:
- Over 70% of “BSc - Computer Science” students use Word-like WYSIWYG editors and only 20% use LaTeX.
- LaTeX is most popular among “MSc - Computer Science” and “MSc - Data Science” students. This is probably motivated by the need to typeset more mathematics or do more data visualization.
- LaTeX usage is (almost) nonexistent among “MSc - Conversion Master in IT” and “MA - Teacher of Mathematics and Informatics” students. This is probably because students in those curricula are less technical.
Page count
Finally, let’s look at the thesis page count statistics by curricula:
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"data": {
"url": "/assets/2025-03-16-unitartucs-theses-blog.csv",
"format": {
"type": "csv",
"parse": {
"pdf_pages": "number"
}
}
},
"transform": [
{
"filter": "datum.defence_year >= 2010 && datum.defence_year <= 2024"
},
{
"filter": "datum.curriculum != \"msc_data_science_exam\""
},
{
"filter": "datum.pdf_pages <= 200 && datum.pdf_pages > 10"
},
{
"lookup": "curriculum",
"from": {
"data": {
"values": [
{"curriculum": "bsc_computer_science", "curriculum_name": "BSc - Computer Science"},
{"curriculum": "msc_computer_science", "curriculum_name": "MSc - Computer Science"},
{"curriculum": "msc_software_engineering", "curriculum_name": "MSc - Software Engineering"},
{"curriculum": "msc_data_science", "curriculum_name": "MSc - Data Science"},
{"curriculum": "msc_data_science_exam", "curriculum_name": "MSc - Data Science (exam)"},
{"curriculum": "msc_cyber_security", "curriculum_name": "MSc - Cyber Security"},
{"curriculum": "msc_conversion_master_in_it", "curriculum_name": "MSc - Conversion Master in IT"},
{"curriculum": "msc_innovation_and_technology_management", "curriculum_name": "MA - Innovation and Technology Management"},
{"curriculum": "ma_maths_and_informatics_teacher", "curriculum_name": "MA - Teacher of Mathematics and Informatics"},
{"curriculum": "other", "curriculum_name": "Other"}
]
},
"key": "curriculum",
"fields": ["curriculum_name"]
}
}
],
"width": 725,
"height": 350,
"mark": {
"type": "boxplot",
"size": 30,
"ticks": true
},
"encoding": {
"x": {
"field": "curriculum_name",
"title": "Curriculum",
"axis": {
"labelAngle": -45
},
"sort": [
"BSc - Computer Science",
"MSc - Computer Science",
"MSc - Software Engineering",
"MSc - Cyber Security",
"MSc - Data Science",
"MSc - Conversion Master in IT",
"MA - Innovation and Technology Management",
"MA - Teacher of Mathematics and Informatics",
"Other"
]
},
"y": {
"field": "pdf_pages",
"type": "quantitative",
"title": "Pages"
},
"tooltip": {
"field": "pdf_pages",
"type": "quantitative"
}
}
}
From this plot, I’ve additionally excluded the following outliers:
- A 373-page thesis, because it would screw with the scale of the plot.
- All theses with 10 or fewer pages, because these appear to be abstracts for theses with publishing restrictions and would skew the results.
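For reference, the statistics behind one box of the plot can be sketched in Python as follows. The page range mirrors the filter in the chart spec (more than 10 and at most 200 pages). The function name `page_summary` and the sample page counts are made up, and since quartile definitions differ between tools, the numbers may not match Vega-Lite’s boxplot exactly.

```python
import statistics


def page_summary(page_counts, low=10, high=200):
    """Quartiles of page counts after dropping values outside (low, high]."""
    kept = sorted(p for p in page_counts if low < p <= high)
    # "inclusive" treats the data as the whole population; this choice is an assumption.
    q1, median, q3 = statistics.quantiles(kept, n=4, method="inclusive")
    return {"n": len(kept), "q1": q1, "median": median, "q3": q3}


# Made-up page counts: 4 and 10 look like abstracts, 373 is the long outlier.
pages = [4, 10, 35, 40, 45, 60, 80, 373]
summary = page_summary(pages)
print(summary)
```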
[1] The logarithmic scale is necessary because the thesis counts differ by orders of magnitude. A linear color scale would be dominated by the few most popular combinations.