There are a few high-level topics to understand before you can begin development work:
What is earthmover?
earthmover is a Python-based CLI tool for transforming collections of tabular source data into a variety of text-based data formats via YAML configuration and Jinja templates.
What is an earthmover bundle?
Bundles are pre-built data mappings for converting various data formats to Ed-Fi format using earthmover. They consist of a folder with CSV seed data, JSON template files, and a config.yaml with sources, transformations, and destinations.
What is the structure of the Ed-Fi assessment domain?
The assessment domain of Ed-Fi consists of multiple entities, including (but not limited to) Assessment, ObjectiveAssessment, and StudentAssessment.
Take time to review the assessment domain documentation to better understand the destination, as well as the existing assessment earthmover bundles to better understand the process.
Why is integrating assessment data different from integrating other data into Ed-Fi?
Assessments and all related Entities in Ed-Fi are unique in a few ways, with the most important distinctions being:
1. The models are flexible, especially with regard to student assessment results
2. There is no single, standard source of assessment data; instead, it is populated by various vendors
What are the impacts of the above?
1. Vendors can make vastly different decisions about how to populate the assessment-related resources
2. Downstream processes need to handle this flexibility
The goal of this document is to offer tactical guidance for creating a new assessment earthmover bundle so that integrations are consistent, regardless of the vendor that is developing the bundle.
Development steps:
Review assessment data source(s)
1. Ensure that the structure of the assessment data is consistent
The goal for creating a new bundle is that it can be used by every district across the state/country that administers that particular assessment. The source should be a consistent file structure that is ideally directly from the original assessment vendor.
2. Determine if there is a single source of student assessment results
Sometimes assessment vendors split their student assessment results into multiple files
E.g. Renaissance splits their Star assessment results into multiple files, so all are included as sources in the bundle and then joined together
3. Review any available data dictionaries
It can be very difficult to correctly transform data from the original structure into the Ed-Fi model if the columns of the source data are not defined
This will help in determining which columns map to specific Ed-Fi properties
Determine the assessment identifier & namespace
Ed-Fi defines the assessment identifier as 'A unique number or alphanumeric code assigned to an assessment', but there is no standard source for those IDs. It will be up to the person building out the bundle to determine what that ID should be, but governance standards have been established to help clarify the process. To read more about those standards, see this document.
Namespace should represent the source assessment organization (aka the assessment vendor).
An assessment integration into Ed-Fi should capture the true hierarchy of an assessment as much as possible by properly utilizing the AssessmentIdentifiers, ObjectiveAssessmentIdentifiers, and ParentObjectiveAssessment fields. To this effect, the assessment identifier should reflect the highest level at which scores exist, which typically involves including two properties in the identifier:
Assessment Title
Subject
By doing so, the scores at the student overall assessment level can always be systematically mapped to a particular subject, which is vital for most analytical reporting.
To that end, it may be acceptable to set the overall assessment subject as 'composite' and capture the individual subject results using objective assessments.
Including those properties would result in more records in the assessment domain, and could be slightly redundant for certain assessments but would be more consistent across assessments.
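As a sketch of the convention above, the identifier can be assembled from the two properties. The slug format (lowercasing, hyphens, underscore separator) and the example values are assumptions for illustration, not a governance-mandated format:

```python
def build_assessment_identifier(title: str, subject: str) -> str:
    """Combine assessment title and subject into a single identifier.

    The normalization here (lowercase, hyphenated, underscore-separated)
    is illustrative only -- follow your governance standards for the
    actual format.
    """
    slug = lambda s: "-".join(s.lower().split())
    return f"{slug(title)}_{slug(subject)}"
```

With this convention, "Star Reading" + "Reading" yields an identifier that always carries the subject, so overall scores can be mapped to a subject systematically.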
While the proposed solution could help to offer consistency across assessments, ambiguity will still exist, particularly with regard to the following points:
What should be included in the assessmentTitle?
There is often no column with this information in the student data, and there might not be a clear answer from the vendor.
Title can capture a multitude of constructs, and determining that title requires deeply understanding the assessment:
For DIBELS, the titles of DIBELS Next and DIBELS 8th capture a different version of the assessment
For MAP Reading Fluency, the titles capture the various forms of the assessments (Foundational Skills, Adaptive Oral Reading, etc)
How should we handle assessments that cannot fit into this structure?
As we mentioned, this structure can likely work sufficiently well for most assessments, but there will be assessments that cannot easily match this structure, and attempting to do so may result in lost information.
Reaching out to EA or contacting someone with experience using the data are solid next steps.
Note: The above guidance generally follows the Ed-Fi best practices for assessment identifiers & namespaces.
Establish the hierarchy of the assessment
This step is inherently tied to the prior step, because defining an assessment is the first piece of establishing its hierarchy. And as with the prior step, it is not always straightforward to determine how the structure of an assessment should translate into the Ed-Fi structure. Here are some questions to consider:
3. Are there any additional nested structures (like subsections within subtests)?
In the Ed-Fi assessment domain, you can associate certain objective assessments to a 'parent', allowing for a hierarchy within objective assessments.
Take the example of the Renaissance Star assessment. As established above, each subject is treated as an overall assessment. Within each overall assessment, there are multiple levels of subscores:
Overall assessment results (Star Reading assessment, Star Math Assessment, Star Early Literacy assessments)
Domain results (E.g. Literature; Informational Text; Numbers and Operations; etc.)
Skill Areas (E.g. Literature: Plot; Literature: Setting; Literature: Character; etc.)
Standards (Results split by domain, relating to state learning standards)
In this case, Domains, Skill Areas, and Standards were all treated as objective assessments, but the Domains were treated as 'parent' objective assessments to the relevant Skill Areas and Standards.
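The Literature slice of that hierarchy can be sketched as parent/child records; the identification codes below are made up for illustration, not Renaissance's actual values:

```python
# Each objective assessment row carries an optional pointer to its parent.
# The domain ("Literature") has no parent; its skill areas point back to it.
objective_assessments = [
    {"identificationCode": "literature",           "parentObjectiveAssessment": None},
    {"identificationCode": "literature-plot",      "parentObjectiveAssessment": "literature"},
    {"identificationCode": "literature-setting",   "parentObjectiveAssessment": "literature"},
    {"identificationCode": "literature-character", "parentObjectiveAssessment": "literature"},
]

def children_of(parent_code, records):
    """Return the objective assessments nested under a given parent."""
    return [r for r in records if r["parentObjectiveAssessment"] == parent_code]
```

The same pattern extends to Standards, which also point to their parent domain.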
Governance Review
Once the assessment identifier(s) and hierarchy have been determined, a governance artifact should be created following this template, which can be reviewed by a governance committee.
Copy the assessment bundle template
In the bundles repo, there is a template bundle that can be used as a starting point for a new bundle. You should fork the repo, create a branch, and start your bundle by copying the entire _template_bundle folder.
Investigate student IDs
Within a single assessment file, there might be multiple student ID columns. In order to successfully send records to an Ed-Fi ODS, one of those IDs must match the student unique ID of the student resource.
Typically, the column from the assessment file that matches the student unique ID will differ across districts and years, so maintaining a configuration at that level is necessary, especially when automating the process.
In order to avoid manually determining which column from the file matches the student unique ID of an Ed-Fi roster source (or which column must be xwalked), EA built out a number of packages to handle student ID xwalking.
The first package, compute_match_rates, takes in an Ed-Fi studentEducationOrganizationAssociation source as well as an assessment results file, attempts to join each ID column from the assessment file to each ID column from the Ed-Fi roster source, and determines which pairing has the highest match rate.
Graphical depiction of the 'Compute Match Rates' package logic.
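The core idea behind that package can be sketched in plain Python (this is an illustration of the match-rate logic, not the actual package code; column names are hypothetical):

```python
def best_id_match(roster_rows, assessment_rows, roster_id_cols, assessment_id_cols):
    """Try every (roster column, assessment column) pairing and return the
    pairing with the highest share of assessment IDs found in the roster.

    Returns (roster_col, assessment_col, match_rate).
    """
    best = (None, None, 0.0)
    for r_col in roster_id_cols:
        roster_ids = {row[r_col] for row in roster_rows if row.get(r_col)}
        for a_col in assessment_id_cols:
            ids = [row[a_col] for row in assessment_rows if row.get(a_col)]
            if not ids:
                continue
            rate = sum(1 for i in ids if i in roster_ids) / len(ids)
            if rate > best[2]:
                best = (r_col, a_col, rate)
    return best
```

The winning pairing is what the second package then applies when crosswalking students.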
The second package, apply_xwalk, takes in the same Ed-Fi studentEducationOrganizationAssociation source and assessment results file as the package above, except this time the highest-match-rate pairing from the step above is applied and every possible student is mapped to the corresponding studentUniqueId. Students that cannot be matched are also output in the original source file structure. This step ensures that no students will be unintentionally dropped in downstream steps.
Graphical depiction of the template bundle with 'apply_xwalk' package logic.
Some of these metadata components are other entities, like Assessment (which contains information about an assessment, including titles, scores, and subjects), and others are descriptors, like AssessmentReportingMethod (a controlled list of all possible score values).
Best practice here is to determine all possible values for these components and maintain those as seeds in the assessment bundle. Typically, those values are determined using data dictionaries. See an example here.
There are cases where that information is not available from a vendor, and it is actually necessary to source those possible values from the student assessment results instead. See an example here.
Important note, relevant here: EA does not standardize scores at the time of integration.
As of now, EA typically creates custom descriptors for AssessmentReportingMethod for each score & vendor. Instead of standardizing to the default AssessmentReportingMethod descriptors, EA creates new values under the specific vendor namespace that represent the score exactly as received.
Downstream, scores like these are unified into a single scale_score column. This raises the question: why not normalize the score names at the point of data integration into Ed-Fi?
NWEA defines the "RIT Scale Score" as a score on the "Rasch UnIT scale", and describes the characteristics of the RIT scale as follows:
These RIT scales are stable, equal interval scales that use individual item difficulty values to measure student achievement independent of grade level (that is, across grades). "Equal interval" means that the difference between scores is the same regardless of whether a student is at the top, bottom, or middle of the RIT scale. "Stable" means that the scores on the same scale from different students, or from the same students at different times, can be directly compared, even though different sets of test items are administered. A RIT score also has the same meaning regardless of the grade or age of the student.
University of Oregon defines the "Composite Score" as a "combination of multiple DIBELS scores, which provides the best overall estimate of the student's reading proficiency". The calculation behind the score is complicated, and documented here.
Technically, both of these scores are scaled scores (a raw score that has been adjusted and converted to a standardized scale), but the calculations behind those scores are different and oftentimes unique to the vendor. The score name and vendor-specific namespace captures those differences, in case this additional information is relevant to those who need to use the data. If those score names were normalized at the point of ingestion, it might be unclear which specific score (and underlying calculation) belongs to the scale_score. One of our main goals in data integration is to allow for flexible analytics - but there is always going to be a tension between flexibility and accessibility.
Additionally, many vendors offer additional scores that would be impossible to standardize. Some examples include:
NWEA MAP: 'Fall-To-Fall Projected Growth'
Renaissance STAR: 'Normal Curve Equivalent'
No matter what, some flexibility would be necessary when mapping scores to Ed-Fi to accommodate these additional scores.
Start populating the earthmover config
As described in the YAML configuration section of the earthmover documentation, this configuration file comprises six sections. This document will not reiterate that information, but will highlight steps that are necessary when building out a new assessment bundle. These initial steps should be quick adjustments from the template bundle:
There are many points to consider at the transformation section of the earthmover config. These include:
1. Are there Ed-Fi requirements that will require reformatting columns?
The Ed-Fi API documentation is an incredibly useful resource in determining requirements, such as data types and character limits. The transformation section is where those requirements can be addressed and handled.
2. Could the files include rows outside of a single district & year?
If so, filters should be built into the template as a safeguard. The earthmover output should be ready to send to an ODS, and most ODS' are separated by district & year. As an example, NWEA sometimes sends results across districts & years in a single file if the scope of the data agreement for a particular external vendor extends beyond a single district.
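Such a safeguard might look like the following sketch, assuming earthmover's filter_rows operation (verify the operation name and parameters against the earthmover documentation for your version; the node, column, and parameter names here are hypothetical):

```yaml
transformations:
  filtered_results:
    source: $sources.assessment_results
    operations:
      - operation: filter_rows
        query: district_id == '${DISTRICT_ID}' and school_year == '${SCHOOL_YEAR}'
        behavior: include
```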
3. Do sources need to be joined?
If information that is relevant for a single entity exists across multiple files, those sources will need to be joined together. This is the case for Renaissance Star assessments, where overall assessment results are contained in one file and objective assessment results are contained in other files. Student objective assessment results are captured within the StudentAssessment entity, so those scores must be joined together.
4. Does the output grain match the grain of the Ed-Fi entity?
Grain can be hard to conceptualize, but dbt's definition is probably the most straightforward:
Grain is the combination of columns at which records in a table are unique.
a. In the case of the StudentAssessment entity, the combination of columns at which records are unique is: AssessmentIdentifier, Namespace, StudentAssessmentIdentifier, and StudentUniqueId. This means that the student assessment output of earthmover should be a single row for each administration of an assessment for a particular student, which will not always match the grain of the assessment results source file. Continuing the Renaissance Star example above, the join between the overall and objective assessment files would result in a grain at the objective assessment level instead of the required student assessment level. To handle this, EA wrote a transformation that grouped the objective assessment results by the student assessment identifier, which produced the correct grain, with a column of aggregated student objective assessment results that could slot directly into that section of the template.
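The regrouping step can be sketched in plain Python (the column names are illustrative, not the actual bundle's, and the real transformation lives in the earthmover config rather than standalone code):

```python
from collections import defaultdict

def to_student_assessment_grain(objective_rows):
    """Collapse objective-assessment-level rows into one row per student
    assessment, nesting the objective results as a list."""
    grouped = defaultdict(list)
    for row in objective_rows:
        key = (row["assessmentIdentifier"], row["namespace"],
               row["studentAssessmentIdentifier"], row["studentUniqueId"])
        grouped[key].append({"objective": row["objectiveTitle"],
                             "scaleScore": row["scaleScore"]})
    return [
        {"assessmentIdentifier": k[0], "namespace": k[1],
         "studentAssessmentIdentifier": k[2], "studentUniqueId": k[3],
         "studentObjectiveAssessments": objectives}
        for k, objectives in grouped.items()
    ]
```

The nested list then slots into the studentObjectiveAssessments section of the StudentAssessment template.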
Set destinations in earthmover config
The destinations section of the earthmover config specifies how the transformed data is materialized to the template files discussed in the next section. The source for each destination should be either a transformation node or a source (transformations are not required).
You should set the linearize parameter to True.
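For example, a destination entry might look like the following (the node and file names are illustrative; check the earthmover documentation for the exact keys supported in your version):

```yaml
destinations:
  studentAssessments:
    source: $transformations.student_assessments
    template: ./studentAssessments.jsonl
    extension: jsonl
    linearize: True
```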
Fill out templates
After transforming the source data, earthmover will convert it to a text-based file based on templates that are configured for each project. These templates should exactly match the relevant Ed-Fi API endpoint structures. The template bundle includes some of the most common templates for integrating assessment data, but others could be necessary, depending on the assessment. If others need to be brought into the bundle, the structure can be copied from the Ed-Fi API documentation (search for the relevant entity, click GET, scroll until you see the Example Value, and copy that into a new .jsonl file in the bundle folder).
Once all of the relevant template files exist in the bundle, the next step is to populate them using the columns of the data frame that result from the above transformations. While doing so, keep these points in mind:
1. EA typically sends information that does not map to a particular Ed-Fi property as a score result. This can include subjects, test names, etc.
Note: each will need a corresponding AssessmentReportingMethod descriptor value
2. Two Ed-Fi requirements exist that can complicate this process:
a. Empty strings cannot be sent as a performance level or score result.
b. At least one performance level or score result must be included in a student assessment / student objective assessment record, or else those properties must be skipped entirely.
The two requirements above may seem obvious, but the logic necessary to fill in the earthmover JSON templates can get tricky, and these requirements can make things more difficult.
Take this example: the assessment has 10 possible objective assessments, each with 2 associated performance levels and 2 other scores (e.g. scale_score and sem). It is not guaranteed that a student will have taken all objective assessments, and even if they did, they may not have results for all performance levels and scores. In the Jinja template, you may be tempted to loop over all of the objective assessments and PLs/scores, but two issues arise:
1. If all of the PL/score results are null, the entire objective assessment must be skipped
2. If any of the PL/score results are null, that particular scoreResult or performanceLevel must be skipped
The first issue can be fixed relatively easily with conditionals:
{% if perf_level_1_result != '' or perf_level_2_result != '' or obj_scale_score_result != '' or obj_sem_result != '' %}
(with the {% endif %} placed after the trailing comma, so you don't end up with incorrect leading or trailing commas)
But the second issue is more complicated. You could try putting conditionals around each item in the performanceLevels / scoreResults lists, but that falls apart due to comma placement: if you include the comma in the first conditional and the second result is null, you are left with a trailing comma; if you include the comma in the second conditional and the first result is null, you have a leading comma.
The way to fix this issue is to determine which results exist ahead of time, add those to a list, then loop over that list:
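A sketch of that approach in Jinja, using the hypothetical column names from the example above (the descriptor URI is a placeholder):

```jinja
{# Build the list of score results that actually exist, then loop over it.
   Commas are emitted only between items, so no leading/trailing commas. #}
{% set score_results = [] %}
{% if obj_scale_score_result != '' %}
  {% set _ = score_results.append(('Scale score', obj_scale_score_result)) %}
{% endif %}
{% if obj_sem_result != '' %}
  {% set _ = score_results.append(('SEM', obj_sem_result)) %}
{% endif %}
"scoreResults": [
  {% for method, result in score_results %}
  {
    "assessmentReportingMethodDescriptor": "uri://example-vendor.com/AssessmentReportingMethodDescriptor#{{ method }}",
    "result": "{{ result }}"
  }{% if not loop.last %},{% endif %}
  {% endfor %}
]
```

Because the loop only ever iterates over results that exist, the {% if not loop.last %} guard handles commas correctly for any combination of missing scores.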
Changing file structures across years, various updates to the Ed-Fi data model, and other factors might require flexibility to be built into a bundle. This is currently handled using conditional logic, like in this example:
Sometimes, a bundle will contain multiple earthmover.yaml files because the transformations across years, subjects, etc. are different, but have the same resulting structure, so can use the same templates.
If changes are significant enough that containing that logic within a single bundle (specifically within the templates) would make the code extremely hard to read, it may be necessary to split that code out into a different bundle and likely capture it as a separate assessmentIdentifier.
Once you have completed the above steps, you should test your earthmover code. This can be run using the following command:
earthmover run -c path/to/earthmover.yaml
Oftentimes, there are parameters in the earthmover config that are to be specified via the command line with:
earthmover run -c path/to/earthmover.yaml -p '{"OUTPUT_DIR": "path/to/my/output/dir"}'
Once the command runs successfully, the first check is whether the output is valid JSON. There are multiple methods to do so, but make sure not to upload student-level data or any PII to an online JSON validator. EA typically uses a command-line program called jq for this purpose.
From there, you should inspect the output to ensure that the templates populated correctly and that none of the properties are coming through as null, as that will cause an error when pushing to the Ed-Fi ODS.
The template bundle includes a prepopulated lightbeam YAML configuration file. Often, this can be kept as-is, though it should be reviewed in the context of an individual assessment.