Rethinking Domain Knowledge

Hi, in this week’s blog post, I would like to explore one of the three main pillars of data science: domain knowledge. I am sure most of you have come across the famous Venn diagram describing data science as a discipline located at the centre of statistics, programming, and domain knowledge. This diagram is usually accompanied by job descriptions involving terms like “solution oriented” and “transforming business knowledge into AI” or something similar. This is also how I thought about the role of domain knowledge in data science for the longest time: as a pool from which to draw ideas and relate them to machine learning techniques, creating products right at the intersection of technology and the business domain. That is, until I came across a very insightful 2019 essay by Kate Crawford and Trevor Paglen, which features the following quote:

“First, the underlying theoretical paradigm of the training sets assumes that concepts—whether “corn,” “gender,” “emotions,” or “losers”—exist in the first place, and that those concepts are fixed, universal, and have some sort of transcendental grounding and internal consistency. Second, it assumes a fixed and universal correspondence between images and concepts, appearances and essences. What’s more, it assumes uncomplicated, self-evident, and measurable ties between images, referents, and labels.”

While the whole article about the politics of data is a great and enlightening read (link: https://excavating.ai/), the above quote caught my eye because it highlights an aspect of the importance of domain knowledge that I had not bothered to think through that clearly before: its role in truly understanding the concepts we are working with. What, substantively, are the outcomes we are interested in? Can I actually and reasonably expect what they call a “universal correspondence between appearances and essences”, i.e., between my data and what I am genuinely trying to model?

In my admittedly still limited experience as an aspiring data scientist, I have noticed that these kinds of questions are rather often side-lined, as we tend to get excited over the newest data set and the latest advances in machine learning algorithms that help us pick up patterns in our data ever more effectively. This is all the more tempting as deep learning approaches, arguably the type of model most detached from domain knowledge, consistently push the state of the art and outdo humans in more and more fields. While other approaches invited us to spend at least some time pondering the nature of the features to collect and feed to our models, deep learning takes even feature engineering, to a degree, out of our hands, thereby tempting us to invest even less in a substantive understanding of the concepts involved.

As Crawford and Paglen show in their essay, this can lead to numerous pitfalls in AI and machine learning, as the models learn kinds of logic that we ourselves have largely abandoned, and for good reasons. The most striking example the authors present involves using skull shapes to draw conclusions about the individuals in pictures. But we do not need to go to cases this extreme to see that we often simply assume things about our object of interest. Another prominent example is provided by early attempts to recognise emotions from faces, which later turned out to perform poorly on test sets involving people from other parts of the world. The underlying assumption in this case was simply that facial expressions are a human constant and therefore immune to local variation.
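To make this a little more concrete, here is a minimal sketch in Python of how one might check whether a model’s performance actually transfers across subpopulations, rather than trusting a single aggregate score. The data, the two groups, and the model choice are entirely made up for illustration; the point is only the habit of disaggregating the evaluation.

```python
# Hedged sketch: compare aggregate accuracy with per-group accuracy.
# The simulated data and "group" labels are purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 2000

# Two subpopulations whose feature-label relationship differs,
# mimicking a concept that does not transfer cleanly across groups.
group = rng.integers(0, 2, size=n)                 # 0 = "group A", 1 = "group B"
X = rng.normal(size=(n, 5))
signal = X[:, 0] * np.where(group == 0, 1.0, -0.3)  # weaker, flipped effect in group B
y = (signal + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
    X, y, group, test_size=0.3, random_state=0
)

model = LogisticRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)

print(f"overall accuracy: {accuracy_score(y_te, pred):.3f}")
for g in (0, 1):
    mask = g_te == g
    print(f"group {g} accuracy: {accuracy_score(y_te[mask], pred[mask]):.3f}")
```

A respectable overall number can easily hide the fact that the model works for one group and fails for the other; reporting disaggregated metrics like this is often the quickest way to surface a hidden assumption about how universal our concepts really are.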

To counter these kinds of problems, I think it is time to reconceptualise the role of domain knowledge in our data science projects and use it not only as a pool from which to draw cool business ideas but also as a basis from which to combat problems such as a lack of concept transferability and conceptual overstretching. An important part in this rethinking can be played by domain experts who, while not deeply immersed in AI topics per se, can help shed light on potential pitfalls such as conceptual imprecision or flawed underlying assumptions. This highlights once more that data science is at its core an interdisciplinary endeavour.
