What Can the Humanities Teach Big Data?

Tuesday, February 26, 2019 - 10:02

Like many Americans, I have a love-hate relationship with technology: I feel a pang of dread when my preschooler clamors for iPad screen time instead of story time with a book. Our cities, our governments, our insurance companies, and even the vendors of books are awash with technology as well. At a recent hackathon, the expert from the local transit authority confessed that with logs of accidents, and data about the wealth and race of inhabitants, they have too much data for it to usefully inform decision-making.

 

Understanding how the humanities have traditionally handled big questions can show data science experts how to reach meaningful conclusions by building on the same skills used to answer questions grounded in serious inquiry. Humanists, after all, are experts at probing the largest questions of our species. One example might be mastering what philosophers since Aristotle have said on topics such as justice or gender, unpacking the values behind those concepts, and coming to a new understanding of how those ideas are changing in our own day. The traditional role of the humanities is to elevate the ambitions of human beings, asking what it means to be a citizen, an heir to the legacies of learning on many continents, or an individual with the capacity for dissent.

 

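Now more than ever, it is important for those who work with big data to be trained in the questions of the humanities, and for those who work in the humanities to make explicit the relevance of their critical thinking tools to data scientists. The value of the humanities lies in handling these large questions, and many smaller ones, through skillful scholarship.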

 

The particular skills of humanities scholarship take many forms, but they all emphasize serious engagement with texts and their contexts. They ask about the nature of the evidence at hand, the values that govern the inquiry, and the many ways of modeling those concepts. These skills, among others, allow scholars to produce a strong consensus about truth where it is found, while simultaneously making room for dissent about issues of interpretation, identity, and meaning. Skillful interpretation of the data allows scholars to agree about the facts (for example, which manuscripts are the authentic production of a particular medieval scribe), while establishing room for dissent about the interpretation of those facts (for example, characterizing the perspective of Biblical literalism versus historical interpretation).

 

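I recently proposed the concept of “critical search” as a general model for how humanistic values translate to the world of data. Critical search has three major components that mirror how traditional humanists have handled big questions in the past: seeding, winnowing, and guided reading.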

 

Traditionally, humanists begin to unpack a category like “justice” by consulting the canon of the past (which is not to say uncritically accepting the values of the past). They “seed” their research by beginning with a review of learned writing on the topic, carefully choosing particular texts whose categories resonate with them, just as a gardener carefully selects the seeds to plant in the ground.

 

Modeling this process has a lot to offer studies in big data. Like the gardener, critical thinkers need to carefully choose their keywords, categories, and sets of documents to “seed” research in the field of big data. When working with data, defining “justice” or even “gender” requires being clear about which definition one uses. These choices must be explicit and self-reflexive, because they have strong effects downstream. They need to be documented in order to make the query replicable.
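One way to picture what documenting a seed might look like in practice is the short Python sketch below. It is only an illustration under stated assumptions: the `SeedRecord` structure, its field names, and the three toy documents are hypothetical and do not come from any particular project. The point is simply that the chosen definition, keywords, and corpus are written down next to the query they produce.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SeedRecord:
    """Hypothetical record of the choices that seed a query."""
    concept: str      # the category under study, e.g. "justice"
    definition: str   # which definition of the concept is being used
    keywords: list    # the seed keywords chosen by the researcher
    corpus: str       # a description of the document set searched
    notes: str = ""   # any caveats about the choice


def seed_corpus(documents, record):
    """Return ids of documents containing any seed keyword (case-insensitive)."""
    wanted = [k.lower() for k in record.keywords]
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if any(k in text.lower() for k in wanted)
    )


# Toy documents, invented for illustration only.
documents = {
    "doc1": "The court debated justice and the rights of tenants.",
    "doc2": "A report on railway timetables.",
    "doc3": "Fairness in taxation was the theme of the speech.",
}

record = SeedRecord(
    concept="justice",
    definition="distributive justice: fair allocation of burdens and benefits",
    keywords=["justice", "fairness"],
    corpus="toy three-document example",
)

seeded = seed_corpus(documents, record)
log = json.dumps(asdict(record), indent=2, sort_keys=True)

print(seeded)  # the documents selected by this seed
print(log)     # the written record that makes the query replicable
```

Saving the JSON record alongside the results lets a later reader rerun the same query, or change one keyword and see what shifts downstream.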

 

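In the age of big data, much of the work of “seeding” is done through the choice of algorithm: whether machine learning, divergence measures, or topic modeling, for example, is used to distill the findings of the data. From the humanities perspective, it isn’t enough to simply perform a search based on an algorithm; the algorithm itself has biases, which will redound through the search process. Only by comparing the results produced by different algorithms can we gain insight into how a particular tool shapes the results.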

 

The second step of the model, “winnowing,” explains the work typically done by scholars as they read widely, gaining information about context, and following the insights of pattern recognition, discourse, or critical theory to foreground particular test cases. This step is usually interpretive, which means that there is no objectively “right” answer about the “best” theory, but that scholarship progresses by scholars engaging with each other’s insights.

 

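In the case of big data, “winnowing” means that the researcher reviews the results of any particular algorithm to ask how the data and the algorithm fit her question. This might mean, for example, discussing how the same algorithm produces different answers at different scales, or how using a different measurement produces different results. For example, in one digital history experiment, three different commonly accepted divergence equations produced three starkly different answers from the data. Comparing the results of different algorithms means highlighting the biases inherent in a particular algorithm, equation, or choice of scale.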
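The kind of comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the experiment mentioned in the text: the three measures (Kullback-Leibler divergence, Jensen-Shannon divergence, and total variation distance) are stand-ins chosen here as assumptions, the word counts are invented, and add-one smoothing is an assumed choice that itself affects the results. The sketch ranks each decade by how far its vocabulary diverges from the corpus as a whole and lets the researcher see whether the measures agree.

```python
import math


def normalize(counts, vocab, alpha=1.0):
    """Turn raw word counts into a smoothed probability distribution."""
    total = sum(counts.get(w, 0) + alpha for w in vocab)
    return [(counts.get(w, 0) + alpha) / total for w in vocab]


def kl(p, q):
    """Kullback-Leibler divergence of p from q, in bits (asymmetric)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence, a symmetric variant built on KL."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def total_variation(p, q):
    """Total variation distance: half the sum of absolute differences."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


MEASURES = {"kl": kl, "js": js, "tv": total_variation}


def rank_by_measure(groups, baseline, vocab):
    """Rank groups from most to least divergent under each measure."""
    q = normalize(baseline, vocab)
    rankings = {}
    for name, fn in MEASURES.items():
        scores = {g: fn(normalize(c, vocab), q) for g, c in groups.items()}
        rankings[name] = sorted(scores, key=scores.get, reverse=True)
    return rankings


# Invented word counts per decade, for illustration only.
decades = {
    "1850s": {"rent": 40, "tenant": 30, "railway": 5, "justice": 2},
    "1860s": {"rent": 10, "tenant": 8, "railway": 60, "justice": 1},
    "1870s": {"rent": 20, "tenant": 25, "railway": 20, "justice": 30},
}
vocab = sorted({w for counts in decades.values() for w in counts})
baseline = {w: sum(c.get(w, 0) for c in decades.values()) for w in vocab}

rankings = rank_by_measure(decades, baseline, vocab)
for name, order in rankings.items():
    print(name, order)
```

If the rankings differ from one measure to the next, that disagreement is itself a finding to record and discuss, since it shows how far the choice of equation shapes the answer.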

 

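In data science work, as in problems traditionally addressed by the humanities, the right answer affords room for debate and interpretation. The point is that engineers, even when dealing with big data, should take care to transparently document the choice of a particular algorithm and the ways in which it might bias the results. Iterative seeding and winnowing provides a safeguard against naïvely embracing the results of a computational algorithm. At present, it is unclear how dependable most of our best tools for modeling text are, and where careful limits need to be provided. For example, computer scientists who work with topic models have themselves called for more research into why and how the topic model aligns with insights gained in traditional approaches. Eric Baumer and his colleagues warn that “there is little reason to expect that the word distributions in topic models would align in any meaningful way with human interpretations.” Repeated winnowing and reading can guard against drawing reckless conclusions from a digital process. A truly critical search requires human supervision wherever the fit between the algorithm and the humanistic question is unclear.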

 

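The next step in the process is “guided reading,” which mirrors how a gardener sorts through the harvest, setting aside moldy and damaged fruit and deciding which are fit to eat and which are fit for pie. Faced with an archive, traditional humanities scholars actively select passages to study.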

 

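Digital scholars too must reckon with the choice of which findings to present. At this stage of the process, the scholar carefully inspects the results returned by a search process, sometimes sampling them, sometimes generalizing about them (for example by counting keywords or topic modeling again), before iterating the process again. Ensuring that there is a human step of inspecting the data, or “guided reading,” is important for making sure that the research process produces meaningful findings. The process of continually “checking” the computer’s work allows the expert to better judge whether, and how well, the resulting sub-corpus fits the scholarly question at hand. Sampling the results in a structured, regular process allows the scholar to assess the results of a search confidently.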
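A structured, regular sample of the kind described above could be drawn as in the Python sketch below. This is one possible approach offered as an assumption, not a prescribed method: it uses systematic sampling (every n-th result from a random but recorded starting point), and the function name `systematic_sample`, its `seed` parameter, and the placeholder result ids are all hypothetical.

```python
import random


def systematic_sample(results, k, seed=0):
    """Pick k items at a regular interval from a ranked list of results.

    The random starting offset is drawn from a seeded generator, so the
    same seed always yields the same reading list.
    """
    if k <= 0 or not results:
        return []
    if k >= len(results):
        return list(results)
    step = len(results) / k                    # spacing between picks
    start = random.Random(seed).random() * step
    last = len(results) - 1
    return [results[min(int(start + i * step), last)] for i in range(k)]


# Placeholder ids standing in for the ranked output of a search.
search_results = [f"passage-{n:03d}" for n in range(100)]

reading_list = systematic_sample(search_results, k=10, seed=1)
for item in reading_list:
    print(item)
```

Because the picks are spread evenly across the ranked results, the scholar reads passages from the top, middle, and bottom of the list rather than only the highest-scoring hits, and recording the seed lets a colleague draw the same sample.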

 

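Critical search by itself tunes the scholar’s sensitivity to the biased and perspectival nature of particular algorithms. In many cases, however, one pass through the algorithms is not enough. Keyword search, topic models, and divergence measures can all be used to narrow a corpus to a smaller body of texts, for example by identifying a particular decade of interest. In order to precisely “tune” the algorithms to the researcher’s question, successive rounds of the critical search process may be necessary.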

 

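Critical search means adapting algorithms to the research agendas we already have (feminist, subaltern, environmental, diplomatic, and so on) and then seeking out those tools and parameters that enhance our prosthetic sensitivity to the multiple dimensions of the archive. Documenting the choice of seed, algorithm, cut-off, and iteration can go a long way towards transparency of disciplinary practice in how we understand the canon, how we develop a sensitivity to new research agendas, and how we as a field pursue the refinement of our understanding of the past.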

 

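By emulating the humanities and embracing the skills of critical thought, individuals who engage in the critical search process can make visible and transparent their choices about how to handle the data presented. Like traditional humanists, they will compare and combine insights from secondary sources and canonical texts to decide which categories to extract and what those categories mean. In explaining any given approach to data, they will fully document the choices they made among different algorithms and the results of those choices, thereby helping the community as a whole to make room both for consensus about the facts and for divergent approaches to interpretation.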
