Slow Data in a Hurried Age*

* This is a play on David Mikics’s Slow Reading in a Hurried Age. As someone whose research is about building tools for decision-making with data, I find myself at odds with how these tools are often used. Mikics describes how we have lost the joy of reading books because of how hurried our world has become; our reading attention spans are now limited to tweets and other forms of bad writing. I think the hurried age also applies to how we make decisions with data, so below is my attempt at describing the problem. In the future, I will try to propose ways of slowing things down.

Part 1: The Gated Data Nullius

What purpose does data serve?

When we began to build abodes for data, the databases, from cave walls to stone tablets to blue ledgers, to Oracle™s and finally to lakes in the cloud, we started with a simple desire: to keep record — a shared contract of what happened, when, where and how, and what should happen in the future as of now. All this kept us civil: “this is our story for the record.”

I remember over-stuffed, tattered, beige folders of patient files shuffling around my mother’s clinic: papers of test results, scribbled notes, and pink and yellow carbon-copy prescriptions peeking from every side, held by an invisible force in each folder. Each one was a joint story of human interaction across time, place and technology. As with all stories, much is left out. Tests were done in February; results came back a week later; prescriptions were given in March; more tests; more results; the patient is back. From these fragments, my mother would skilfully fill the narrative gaps. As Ali sat down, she could tell he wasn’t taking his meds, the very ones she prescribed in March. He was giving the meds to his brother, who couldn’t afford them and seemed sicker! But he was getting sicker too, as the record showed, even after accounting for Ali’s confession: he wasn’t fasting the day they ran the tests. Like every day, he couldn’t resist his morning tea with four heapfuls of sugar. The full story lives off the record. Rich lives, paltry records.

When I grade labs, exams, or papers, I often think of those folders. I submit my grades on Gradescope, thinking here is my feedback for the record, yet fully aware of the richness of the in-class conversations, the 1-on-1s, the circumstances of my students, and the learning (or lack thereof) beyond what the record shows. My students do the same when they fill in course evaluations: here is what we thought of the class, for your record, fully aware of the richness of all our interactions inside and outside the classroom.

But data are not just records! A “database” is an antiquated term. Data doesn’t live in a base. It’s part of a set; it’s pulled from a log; it’s generated from a model. It’s beyond you and me and it doesn’t care for our full stories. How did the database become the data set? Well, a couple of things happened:

First Development: We stopped being deliberate about what we retained. When we tallied inventory in ledgers or kept notes in records, there was a clear sense of limits and costs. Maintaining a record took time, effort and physical space[1]. There was a visible cost to retention. So we only kept what we needed: we were deliberate. But when data retention ceased to have a visible cost — data centers are hidden from sight, often across continents; once paid, the subscription, licence or storage costs create a perverse sunk-cost fallacy, in that we ignore the cost and assume data lives free! — we gave up our deliberations. Keeping track of things with exacting precision does not require much or any effort: we track when things happen down to the millisecond[2]. We track how long a patient spent with the doctor and how long it took to grade one answer vs. another. We no longer need to store notes on our interactions; we can, and we will, store every uhm and err.

Second Development: We stopped owning our stories. As the beige folders made their way into my mother’s office, I could see the people in the waiting room perk up: “is that mine?” They felt they owned the folder and the fragments of their story that it held. But who owns data? The minutes it took for a patient to be seen; the diagnostic codes that justify insurance claims for tests, referrals and refills; and so on. Whose stories (if any) do they represent? The physical folders are gone, and with them the imaginary force that tied together the doctor, the patient, the nurses, and many others across time and place as they ritually kept record of their many interactions. The insurance company keeps the data and decides our insurance premiums, even if we never engaged in “keeping record” with the insurer. The medical group keeps the data and pushes wellness checks on unwell patients. The clinic, the university, the hypermarket, the bank, the city … we don’t see our stories, yet data wrapped around our records forces its way into our lives.

We have grown familiar and even comfortable with the disappearance of the ownership of our narratives, even when we commit to writing our own stories. On social media, there are no owned stories, only data. A post generates likes, reposts, and maybe responses. Our posts and responses are stripped of the richness of our lives; they don’t even rise to paltry records, because we don’t try to share the context through which we write our fragments (we can’t) or read others’ (we don’t ask).

From these developments emerges the data nullius: like terra nullius, no man’s land, it is data that no individual owns. Unlike the commons, we don’t govern it for collective use either. Yet it is kept and gated, with varying degrees of access.

So what purpose does data serve?

Quick takes on critical issues! I don’t mean this cynically; it is just the way it is. Imagine a boardroom, where you or someone else is sitting around representing an institution. A well-meaning concern is posed. Either to prioritize or discredit the concern, someone announces “Show me the data!” The data person[3] pipes up: “well we have this data that we can use to answer this question with … (5 minutes of data analysis jargon go by; words like correlation and causation are thrown around to signal deep knowledge of all things data) … it isn’t exactly this and of course we would have to do further analysis but we can show … (5 more minutes!)” The newly formulated question has nothing to do with the concern, but it has the right data parts. The data itself is not substantial enough to answer the question, but that is secondary, future work, after the first take is presented at the next boardroom meeting.

The presence of any data (Development 1) drives this behavior. Pushing back is seen as ludicrous. Surely, you would want a quick handle, or even an approximate answer, before you invest any further resources into studying an issue or handling it. Resources are constrained. We have no time. There is also a feedback loop at play here: data begets data. Resources are put into the secondary future work to gather whatever additional data can be easily collected, joined, merged, etc. (Development 1). We can now answer even more questions motivated by, but disconnected from, deep concerns. And so we operate in a world of compounded, approximate and bad answers to irrelevant questions.

Let me give an example. Across universities, faculty are concerned that using course evaluations to evaluate teaching effectiveness is leading to grade inflation[4]. Brought to a university’s boardroom, the data person formulates the following irrelevant question: “are student grades correlated with course evaluations?” Some analytical work is of course needed to account for discretized, finite grades, non-normal distributions, non-linearity, etc. All of this is explained, along with the required “correlation is not causation” to provide the necessary due diligence that assures everyone is aware of the imperfection as they nod to “let’s see what the data says”[5]. Yet no matter how much data[6] exists, it cannot uncover whether faculty take a more lenient grading approach in fear of poor evaluations. Here are some reasons why:

  • Many faculty–student–administrator interactions: If this concern holds, grade inflation is probably strongest when we have a large proportion of (i) must-get-an-A students (or perceptions thereof), (ii) risk-averse yet impartial faculty: the combination of holding equity dear yet wanting to get tenured, promoted or renewed, and (iii) data-driven administrators who believe course evals should determine teaching effectiveness. But we could have variations on this: smaller proportions, pragmatic (a-C-would-do) students, risk-tolerant or partial faculty, etc. Moreover, each of these plays into a noisy, person-specific, interaction-specific function of how much additional grade is tacked on. From Development 2, we know that the data doesn’t come with any of that context, even if the record between the faculty and the student does.
  • Many other factors: Perhaps grade inflation happens because of many factors, which may include course evaluations, changes in knowledge expectations, or the faculty actually just getting better at teaching.
  • It’s time-evolving: New faculty might come in with certain cultural norms of grading. Perhaps they were stringent and received poor evals. They sought advice from a colleague who said “just give them good grades.” After some internal moral qualms, the faculty did become lenient and evals did improve. The faculty passes the same advice on to newer faculty. They got tenured and then found a new sense of risk tolerance. They make the material more challenging and are more confident docking students’ grades, but the evals remain the same. They now claim that evals are not influenced by grades. This is just one complex trajectory, for one faculty member, that explains the interplay of grading norms and evals.
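For what it’s worth, the data person’s naive analysis is easy to sketch. Below is a minimal, entirely hypothetical version on synthetic data (all numbers invented): it computes a Spearman rank correlation between mean section grades and mean evals, which sidesteps the non-normality and discretization worries mentioned above, yet, as argued, says nothing about why the two move together.

```python
# Hypothetical sketch of the "irrelevant question": do mean section grades
# correlate with mean course evaluations? All data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
n_sections = 200

# Synthetic mean grades (4.0 scale) and evals (1-5), coupled by construction.
mean_grade = np.clip(rng.normal(3.3, 0.3, n_sections), 0.0, 4.0)
mean_eval = np.clip(
    3.5 + 1.2 * (mean_grade - 3.3) + rng.normal(0.0, 0.4, n_sections), 1.0, 5.0
)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    Robust to non-normality and monotone non-linearity (assumes no ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return np.corrcoef(rx, ry)[0, 1]

rho = spearman(mean_grade, mean_eval)
print(f"Spearman rho = {rho:.2f}")  # positive here, because we built it in
```

A positive rho falls straight out, because the synthetic data was generated that way; the sketch has nothing to say about lenient grading, fear of poor evals, or any of the confounds in the list above.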

To all these reasons and any more, the data person has rigorous solutions; rigorous methods that are less sensitive to the violation of their assumptions; causal models; data curation (e.g. let’s look at large class sizes taught by multiple senior instructors within the same year — so calculus 101?!). Yet the violation of an assumption should invalidate the result; causal models cannot account for unaccounted causes and often require more data than what is available for robustness; and at what point does the curated data stop generalizing to the entire university? Debating the data person can go on indefinitely.

Let me be clear: I am not making an argument about the validity of the concern; it may or may not be valid. I am arguing against the presumption that with data, especially what is often readily available, we can figure it out. I’m also pointing out that the emergence of the data nullius makes it increasingly difficult to argue against a fast data approach. Back in the boardroom, imagine a voice that says: “hold on, that is not what records of grading are meant to be used for; neither are records of course evaluation.” If that voice does emerge, it will quickly be quelled with irrelevant responses. “Oh, we are doing this analysis in the aggregate, we are not violating anyone’s confidentiality.” But it isn’t about confidentiality; in fact, the extraction of a data point from the very context it was created within is the problem. And the voice rarely emerges, because we are familiar and comfortable with the disappearance of the ownership of our stories.

Again, these are rough ideas; feel free to comment.

  1. The tangible nature of a record meant it also didn’t travel much or far. ↩︎
  2. A nursery app would ping me the very minute my toddler had a bowel movement! ↩︎
  3. At this point, it is fair for you to ask “Azza, are you the data person?” I’m not the data person but I also plead the 5th 🙂 ↩︎
  4. This is an age-old concern. There are lots of empirical studies that show some evidence of this, and others that show none. I learned that in the US a possible driver of grade inflation was the Vietnam War; to protect failing students from being conscripted, faculty boosted their grades! ↩︎
  5. The presence of empirical studies at other universities does not detract from the data analysis effort: (1) we have the data, so we can do it! (2) we are unique. ↩︎
  6. There might be other ways to determine how prevalent grade inflation might be at an institution and what drives it, but none of them are easy or fast or without problems: Qualitative interviews (Too long); Surveys (Few answer them, fewer truthfully); Logical reasoning (Ah, the philosophers); Game-theoretic formulations and simulations (Ah, the economists); Randomized Controlled Trials (Ah, more economists … What? You want to do what?); We build a multi-factorial AI model that predicts grades or course evals and we examine the weights assigned to course evals or grades (Get out of here!). ↩︎

3 Comments on “Slow Data in a Hurried Age*”

  1. DataSciencer says:

    A proponent of the “quick takes on critical issues” using fast data, to borrow your terminology, may argue that the raw data points that are gathered are almost axiomatic. Perhaps one could say data does not lie. It is mundane, but one can build and perhaps prove theories using these building blocks we call data. I am leaning these days toward thinking that using data for a purpose it is not designed for carries certain risks.

    Incidentally, and perhaps not a data problem per se, but fear of negative ratings drives doctors to be more lenient and more inclined to prescribe controlled-substance drugs to their patients, fueling all sorts of health and societal problems in North America.

    • Azza Abouzied says:

      Hi DataSciencer, thanks for the comment and sorry for the late reply. I definitely hear this argument made by proponents of data-driven methods. But raw data points are far from axiomatic truths. Data does lie: a case in point is pre-election polls (https://www.argmin.net/p/going-beyond-the-polls?utm_campaign=post&utm_medium=web). Having to launch a survey myself these days, where the results can impact a policy decision, I find myself worrying about how to get respondents to be truthful or even self-interested: anonymity doesn’t preclude worries about peer virtue judgments. As for measured physical phenomena, sensors are prone to errors, mis-calibrations, etc., and any legitimate data processing pipeline has a substantial data “cleaning” phase.

      • DataSciencer says:

        Thanks for taking the time to reply. The election polls example is definitely a good and timely one.

        I also want to add that the book recommendation, Slow Reading, looks like it is going to help me escape the unhealthy reading diet that I, and I am sure many others, have found ourselves in, no thanks to social media and news outlets. Thank you for the recommendation; I plan on reading it after the holidays.

