Assessment Data Governance in Ed-Fi

Introduction

The assessment domain in Ed-Fi is unique in a few ways, with the most important distinctions being:
    1. The models are flexible, especially with regard to student assessment results
    2. There is no single standard source of assessment data; instead, the domain is populated by various vendors

What are the impacts?
    1. Vendors can make vastly different decisions about how to populate the assessment-related resources
    2. Downstream processes need to handle this flexibility in complicated ways

What should we do about it?
When integrating a new assessment into Ed-Fi, there are initial data modeling decisions that must be made, and this is the point in the process where misalignments could occur. These decisions can be broken down into two broad categories:
  • Assessment definition / Hierarchy
      ◦ What makes an assessment unique - and therefore what should be populated in the AssessmentIdentifier field in Ed-Fi?
      ◦ How can we model the structure of an assessment to properly capture the hierarchical nature of scores and subscores?
  • Score standardization
      ◦ Should we try to align to standard score names/definitions at the point of integrating into Ed-Fi or persist the original vendor scores?

This document dives more deeply into the ambiguities that exist at both of those decision points, offers recommendations to make the process more consistent and the resulting data more functional, and describes the ongoing role of data governance in each.


Assessment Definitions / Hierarchy

Background

Accurately defining an assessment is the most important decision to be made during an assessment integration. This decision point will impact all other aspects of the process. Specifically, the assessment definition is inherently tied to capturing the hierarchy of an assessment using the data model.

The only fields used to define an assessment (the fields that make up the primary key) are AssessmentIdentifier and Namespace. This seemingly matches the primary key makeup of other resources:
  • Student primary key = StudentUID
  • School primary key = SchoolId
However, there are a few reasons why these examples do not entirely align:

First, there are typically standard, single sources for student and school IDs, especially within the context of a single LEA. Student & school IDs will be populated entirely from the SIS. On the other hand, assessments are sourced from a number of different vendors with different underlying data structures.

Second, there are commonly understood definitions of student and school IDs, and these are existing fields in the source system. This is not necessarily true of assessment IDs.

Ed-Fi defines the AssessmentIdentifier property as "A unique number or alphanumeric code assigned to an assessment." This definition leaves room for ambiguities.

How would someone determine what makes an assessment unique? Should it be unique by:
  • Grade?
  • Subject?
  • Year?
  • All of the above?

Technically, any of these fields could be included in the assessment identifier, and we have seen a variety of combinations of them across implementations. Some examples include:
  • MCAS03AESpring2018 (MCAS Grade 03 ELA - 2018. Seen in Boston)
  • STAR-RD-V1 (From Data Import Tool)
  • CAINC-IREADY-DIAGNOSTIC-ELA (From i-Ready native integration)
The impact of this is that each of these assessments will seemingly have a different grain, despite the primary key structure being consistent.
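To make the grain differences concrete, here is a minimal sketch in Python using the identifiers above; the namespace values are assumed for illustration only:

```python
# Hypothetical assessment records sharing the same primary key shape
# (AssessmentIdentifier + Namespace) but encoding very different grains.

example_assessments = [
    {
        # Grain: subject + grade + administration year
        "assessmentIdentifier": "MCAS03AESpring2018",
        "namespace": "uri://doe.mass.edu/assessment",  # assumed namespace
    },
    {
        # Grain: subject + version, with no grade or year
        "assessmentIdentifier": "STAR-RD-V1",
        "namespace": "uri://www.renaissance.com/assessment",  # assumed namespace
    },
    {
        # Grain: vendor + product + subject
        "assessmentIdentifier": "CAINC-IREADY-DIAGNOSTIC-ELA",
        "namespace": "uri://www.curriculumassociates.com/assessment",  # assumed namespace
    },
]

# Any downstream query that assumes one grain (e.g. "one identifier per
# subject per year") will silently misinterpret at least one of these.
for a in example_assessments:
    print(a["namespace"], a["assessmentIdentifier"])
```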

Previously, EA attempted to determine the AssessmentIdentifier by considering the following two questions:
    1. Are there components in this assessment for which scores/other metadata significantly differ?
    2. How does the vendor describe the assessment?

But those questions do not have straightforward answers, and answering them does not remove ambiguity from the process.

Recommendation

An assessment integration into Ed-Fi should capture the true hierarchy of an assessment as much as possible by properly utilizing the AssessmentIdentifier, ObjectiveAssessmentIdentifier, and ParentObjectiveAssessment fields. To that end, the assessment identifier should reflect the highest level at which scores exist, which typically involves including two properties in the identifier:
  • Assessment Title
  • Subject

By doing so, the scores at the student overall assessment level can always be systematically mapped to a particular subject, which is vital for most analytical reporting.

What are the problematic impacts of not following this structure? Take an assessment with the following structure:
assessmentIdentifier | assessmentTitle    | academicSubjectDescriptors
My Fake Assessment   | My Fake Assessment | Mathematics, ELA
And corresponding student assessment structure:
studentIdentifier | assessmentIdentifier | scoreResults
100               | My Fake Assessment   | {math_scale_score: 100, ela_scale_score: 200}

There would be no systematic way to map the math_scale_score to Mathematics, and ela_scale_score to ELA. The goal should not be to write custom logic downstream of Ed-Fi for each assessment in order to map scores to subjects.
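Under the recommended structure, by contrast, the mapping becomes a simple lookup. The sketch below uses hypothetical identifiers (built from assessment title plus subject) and simplified field names to illustrate the idea:

```python
# Hypothetical records: one assessment per subject, so every student-level
# score maps to a subject through the assessment record itself.

assessments = [
    {"assessmentIdentifier": "My Fake Assessment-Mathematics",
     "assessmentTitle": "My Fake Assessment",
     "academicSubjectDescriptors": ["Mathematics"]},
    {"assessmentIdentifier": "My Fake Assessment-ELA",
     "assessmentTitle": "My Fake Assessment",
     "academicSubjectDescriptors": ["ELA"]},
]

student_assessments = [
    {"studentIdentifier": 100,
     "assessmentIdentifier": "My Fake Assessment-Mathematics",
     "scoreResults": {"scale_score": 100}},
    {"studentIdentifier": 100,
     "assessmentIdentifier": "My Fake Assessment-ELA",
     "scoreResults": {"scale_score": 200}},
]

# Subject lookup is now a join on assessmentIdentifier, not custom logic.
subject_by_identifier = {
    a["assessmentIdentifier"]: a["academicSubjectDescriptors"][0] for a in assessments
}
for sa in student_assessments:
    subject = subject_by_identifier[sa["assessmentIdentifier"]]
    print(sa["studentIdentifier"], subject, sa["scoreResults"]["scale_score"])
```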

Another problematic use case involves assessments with composite scores: it may be tempting to include all subjects that make up the composite score in the subjects field of the assessment record:
assessmentIdentifier         | assessmentTitle              | academicSubjectDescriptors
My Fake Composite Assessment | My Fake Composite Assessment | Mathematics, ELA
With a corresponding student assessment record:
studentIdentifier | assessmentIdentifier         | scoreResults
100               | My Fake Composite Assessment | {composite_score: 100}

However, a dashboard could exist downstream that uses the information from the assessment record to subset the corresponding student assessment records. Should choosing either Mathematics or ELA from a dropdown return the same student assessment record with composite scores? This could result in comparing composite scores to true subject-specific scores, creating an apples-to-oranges comparison.

Another reason to specify these two properties, beyond mapping scores to subjects, is simply to reduce ambiguity about what must be added to the AssessmentIdentifier to guarantee uniqueness.

How does this end up working in practice? Take the example of the Renaissance Star assessment - there are no scores reported across subjects, so to match these governance standards, this assessment is being mapped with the following identifiers:
  • Star-MA (to represent Math)
  • Star-RD (to represent Reading)
  • Star-EL (to represent Early Literacy)
The main concern with this disaggregation of assessment records is that analyses or even simple querying downstream could be more difficult if the goal is to inspect all scores within a particular assessment (such as Star), regardless of subject. The assessmentFamily property should be used to address this drawback.
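A rough sketch of how that could look (field values assumed for illustration): the subject-level Star records are tied back together through assessmentFamily, so family-wide queries remain straightforward without parsing the identifier string:

```python
# Hypothetical Star mapping: one assessment record per subject, grouped by
# assessmentFamily so "all Star scores regardless of subject" stays easy.

star_assessments = [
    {"assessmentIdentifier": "Star-MA", "assessmentFamily": "Renaissance Star",
     "academicSubjectDescriptors": ["Mathematics"]},
    {"assessmentIdentifier": "Star-RD", "assessmentFamily": "Renaissance Star",
     "academicSubjectDescriptors": ["Reading"]},
    {"assessmentIdentifier": "Star-EL", "assessmentFamily": "Renaissance Star",
     "academicSubjectDescriptors": ["Early Literacy"]},
]

# Query across the whole family...
family = [a for a in star_assessments if a["assessmentFamily"] == "Renaissance Star"]

# ...or subset to one subject without inspecting the identifier itself.
reading_only = [a for a in family if "Reading" in a["academicSubjectDescriptors"]]
print(len(family), len(reading_only))
```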

However, for an assessment like PSAT/SAT, there are scores reported across subjects, so to match governance standards, this assessment is being mapped with the following identifiers:
  • PSAT 8/9
  • PSAT 10
  • PSAT/NMSQT
  • SAT
Each of these assessments contains a single 'Composite' subject at the overall assessment level, because that is the level at which the total score is captured on the student overall assessment record. With only that one subject at the top level, including an actual subject code in the identifier would be redundant. The various sections, tests, and their corresponding subjects can all be captured as objective assessments, as in the sketch below:
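The following sketch uses simplified, illustrative field names and identification codes (not official College Board codes) to show one way the SAT hierarchy could be represented, with sections and tests as objective assessments linked through parent references:

```python
# Hypothetical SAT hierarchy: the overall record carries the Composite
# subject, while sections and tests become objective assessments.

sat_assessment = {
    "assessmentIdentifier": "SAT",
    "academicSubjectDescriptors": ["Composite"],
}

sat_objective_assessments = [
    {"identificationCode": "SAT-EBRW",
     "academicSubjectDescriptor": "ELA",
     "parentObjectiveAssessment": None},           # section, child of the overall SAT
    {"identificationCode": "SAT-EBRW-Reading",
     "academicSubjectDescriptor": "ELA",
     "parentObjectiveAssessment": "SAT-EBRW"},     # test within the section
    {"identificationCode": "SAT-EBRW-Writing",
     "academicSubjectDescriptor": "ELA",
     "parentObjectiveAssessment": "SAT-EBRW"},
    {"identificationCode": "SAT-Math",
     "academicSubjectDescriptor": "Mathematics",
     "parentObjectiveAssessment": None},           # section, child of the overall SAT
]

# Walk the tree: top-level sections are those with no parent objective.
sections = [o for o in sat_objective_assessments if o["parentObjectiveAssessment"] is None]
print([s["identificationCode"] for s in sections])
```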

Data Governance Role

While the proposed solution could help to offer consistency across assessments, ambiguity will still exist, particularly with regard to the following points:
  • What should be included in the assessmentTitle?
      ◦ There is often no column with this information in the student data, and there might not be a clear answer from the vendor.
  • How should we handle assessments that cannot fit into this structure?
      ◦ This structure works sufficiently well for most assessments, but some assessments cannot easily match it, and attempting to force them into it may result in lost information.
      ◦ NWEA MAP Reading Fluency, for example:
          ▪ This assessment is inherently single subject (reading), so maybe there should be a single assessmentIdentifier: 'NWEA MAP Reading Fluency' - however, there are scores captured at the form level, which suggests those different forms should be captured as separate assessment identifiers.
          ▪ This is a great example of needing to deeply understand the assessment in order to properly map it into the Ed-Fi structure.
  • How should we capture assessments throughout history?
      ◦ Significant changes could have occurred throughout historical years that would impact how we define an assessment, but those changes might not always be transparent to someone modeling an assessment based on the current year. Content experts on each particular assessment should provide evidence for splitting assessment identifiers across history (see the sketch after this list).
      ◦ 'Significant' in this context is also hard to define - but could include:
          ▪ Changes to standards
          ▪ Changes to the underlying statistical model of scores
          ▪ Changes in assessment version
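As a hypothetical sketch of splitting an identifier across history: the cutover year, suffixes, and assessment name below are invented for illustration, and any real split should be driven by evidence from content experts:

```python
# Route a result to a pre- or post-redesign assessment definition.
REDESIGN_YEAR = 2019  # assumed year of a standards/scoring-model change

def assessment_identifier(subject, school_year):
    """Return the identifier for the era the school year falls into."""
    era = "Legacy" if school_year < REDESIGN_YEAR else "NextGen"
    return f"My Fake Assessment-{era}-{subject}"

print(assessment_identifier("Mathematics", 2017))  # My Fake Assessment-Legacy-Mathematics
print(assessment_identifier("Mathematics", 2021))  # My Fake Assessment-NextGen-Mathematics
```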

It is impossible to remove all ambiguity from this process, and data governance oversight will be necessary to maintain consistency.

Score Standardization

Background

As of now, EA typically creates custom descriptors for AssessmentReportingMethod for each score and vendor. Instead of standardizing to the default AssessmentReportingMethod descriptors, EA creates new values under the specific vendor namespace that represent the score exactly as received.

In practice, this ends up looking like:
  • uri://www.nwea.org/map/AssessmentReportingMethodDescriptor#RIT Scale Score
  • uri://dibels.uoregon.edu/assessment/dibels/AssessmentReportingMethodDescriptor#Composite Score

Downstream, both of those scores are unified into a single scale_score column. This raises the question: why not normalize the score names at the point of data integration into Ed-Fi?

NWEA defines the "RIT Scale Score" as a score on the "Rasch UnIT scale" and describes the characteristics of the RIT scale as follows:
These RIT scales are stable, equal interval scales that use individual item difficulty values to measure student achievement independent of grade level (that is, across grades). "Equal interval" means that the difference between scores is the same regardless of whether a student is at the top, bottom, or middle of the RIT scale. "Stable" means that the scores on the same scale from different students, or from the same students at different times, can be directly compared, even though different sets of test items are administered. A RIT score also has the same meaning regardless of the grade or age of the student.

University of Oregon defines the "Composite Score" as a "combination of multiple DIBELS scores, which provides the best overall estimate of the student's reading proficiency". The calculation behind the score is complicated, and documented here.

Technically, both of these scores are scaled scores (a raw score that has been adjusted and converted to a standardized scale), but the calculations behind those scores are different and oftentimes unique to the vendor. The score name and vendor-specific namespace capture those differences, in case this additional information is relevant to those who need to use the data. If those score names were normalized at the point of ingestion, it might be unclear which specific score (and underlying calculation) belongs to the scale_score. One of our main goals in data integration is to allow for flexible analytics - but there is always going to be a tension between flexibility and accessibility.

Further, some assessments include multiple scores that could correctly be called scale_score, such as having both a composite_score and a rasch_score. When we call one of these scale_score in the warehouse, we are effectively picking a winner as to which is the 'main' scale score, but it is at least transparent which one was chosen, and the other is still included in the data. If this decision is made at loading time instead, it becomes opaque, and there may not be a standard place to put alternative scale scores.

Additionally, many vendors offer additional scores that would be impossible to standardize. Some examples include:
  • NWEA MAP: 'Fall-To-Fall Projected Growth'
  • Renaissance STAR: 'Normal Curve Equivalent'

No matter what, some flexibility would be necessary when mapping scores to Ed-Fi to accommodate these additional scores.
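As a sketch of what this downstream standardization can look like (the mapping itself is assumed for illustration; field names follow the Ed-Fi scoreResults shape), vendor-specific reporting-method descriptors collapse into a standard scale_score column while vendor-only scores pass through under their original names:

```python
# Assumed downstream mapping from vendor descriptors to standard score columns.
STANDARD_SCORE_MAP = {
    "uri://www.nwea.org/map/AssessmentReportingMethodDescriptor#RIT Scale Score": "scale_score",
    "uri://dibels.uoregon.edu/assessment/dibels/AssessmentReportingMethodDescriptor#Composite Score": "scale_score",
}

def normalize_scores(score_results):
    """Collapse known vendor descriptors into standard columns; keep the rest as-is."""
    normalized = {}
    for score in score_results:
        descriptor = score["assessmentReportingMethodDescriptor"]
        column = STANDARD_SCORE_MAP.get(descriptor)
        if column is None:
            # e.g. 'Fall-To-Fall Projected Growth': no standard equivalent,
            # so keep the vendor's own score name.
            column = descriptor.split("#")[-1]
        normalized[column] = score["result"]
    return normalized

print(normalize_scores([
    {"assessmentReportingMethodDescriptor":
         "uri://www.nwea.org/map/AssessmentReportingMethodDescriptor#RIT Scale Score",
     "result": "215"},
]))
```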


Recommendation

In order to empower multiple analytic use-cases, the current method of standardizing downstream of the Ed-Fi ODS is acceptable. The downside is that each system that interacts with the data directly from the ODS will need to handle standardization separately. However, this may be necessary in certain cases anyway given different desired analyses.

If, in the future, there were an ability to maintain a mapping between original score names and standard score names in the Ed-Fi model itself, that could be an agreeable solution for transparency, flexibility, and functionality.


Data Governance Role

While in some cases the translation from the original score name to a standard score name is straightforward, there are cases where that is not true. As seen in the examples above, scores vary greatly across vendors, both in name and in calculation.

Some common score columns include:
  • Scale score
  • Raw score
  • Performance level
  • SEM
  • Percentile

While scores like raw score, SEM, and percentile should be easier to standardize given their more standard definitions, the other two may require oversight from a governance group. As an example, Renaissance Star assessments contain multiple scores that could translate to a single performance_level column:
  • RenaissanceBenchmarkCategoryName
      ◦ Renaissance default benchmark categories; these are standard across all Renaissance customers.
  • StateBenchmarkCategoryName
      ◦ The benchmark categories used are the benchmarks defined at the state level in the Renaissance program. These values will be unique per state test (e.g., Level 1, Level 2, Level 3).
  • DistrictBenchmarkCategoryName
      ◦ The benchmark categories used are the benchmarks defined at the district level in the Renaissance program.

In this case, the vendor score to map to the standard score could depend on the analytic use-case, but a default score could be defined by a governance group to avoid inconsistencies.
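One way a governance group could record such a default is sketched below; the mapping and function are hypothetical illustrations, not an existing EA utility:

```python
# Governance-approved default for which Renaissance benchmark category
# populates the standard performance_level column (assumed values).
DEFAULT_PERFORMANCE_LEVEL_SOURCE = {
    # assessmentFamily -> vendor score used as performance_level by default
    "Renaissance Star": "StateBenchmarkCategoryName",
}

def performance_level_source(assessment_family, override=None):
    """Return the vendor score column that should populate performance_level."""
    if override is not None:
        # A specific analysis may deliberately choose a different category.
        return override
    return DEFAULT_PERFORMANCE_LEVEL_SOURCE[assessment_family]

print(performance_level_source("Renaissance Star"))
print(performance_level_source("Renaissance Star", override="DistrictBenchmarkCategoryName"))
```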