Skip to content

Data Preparation 2015

Adam VanIwaarden edited this page Oct 16, 2015 · 3 revisions

Data Preparation of 2015 SAGE Data

Student level data preparation

In the source code below, the data supplied by the Utah State Office of Education (USOE) is cleaned and processed using a two step process:

1a. In the first step the data is cleaned using the .sql script below wherein valid student records are selected based upon business rules specified in the sql code itself. The result is an exportable .csv file where each VALID_CASE is unique by CONTENT_AREA, YEAR and ID. 1b. In the second step the clean data files from step 1a are read into R and modified slightly with regard to variable types/classes. The result is an .Rdata file that is suitable for analysis with the SGP Package.

This year two longitudinal data files were produced. The first was the data for students who continuously enrolled for the full 2014-2015 academic year ("FAY"). In addition to these typically included students are the academic records for non-FAY students from previous academic years (2008 to 2014), which were added in order a) to be used as prior scores (dependent variables) in the 2015 SGP analyses and b) to create 2014 SGPs for non-FAY students using the SGP coefficient matrices constructed in the 2014 analyses. The second set of data is the 2015 non-FAY student data, which is used to create 2015 SGPs using the coefficient matrices produced using only students who are eligible for accountability purposes. Accordingly, steps 1a. and 1b. are repeated twice for each data set.

Step 1a: Initial data cleaning with SQL (2015 FAY and previous years non-FAY)

The two sets of SQL code below were provided directly from the USOE and were not modified by the Center staff in any way. As such there is no commentary provided here regarding this code.

2015 FAY and previous years' non-FAY base data
--ALTER VIEW v_sgp_longfile AS
------ SGP 2015 LONGFILE
SELECT VALID_CASE,
       SchoolYear,
       StudentID,
       TestSubjectID,
       TestSubject,
       SubjectArea,
       StudentGradeLevel,
       YEAR,
       ID, -- student_id
       CONTENT_AREA,
       GRADE,
       SCALE_SCORE,
       ACHIEVEMENT_LEVEL,
       DISTRICT_NUMBER,
       SCHOOL_NUMBER,
       DISTRICT_NAME,
       SCHOOL_NAME,
       LAST_NAME,
       FIRST_NAME,
       SSID -- Use in construction of individual student report (ISR) file name
  FROM (
---------------------------------
------ SAGE 2015 REPORTABLE TESTS [COUNT(*) = 1,065,488 (0min:14sec)]
SELECT sub2.*
  FROM (
------ TIE BREAKER TO RESOLVE DUPLICATES
SELECT sub1.*,
   ROW_NUMBER() OVER (
       PARTITION BY SchoolYear, StudentID, CONTENT_AREA
       ORDER BY TestSubjectID -- Prefer the score associated with the more advanced course
     ) AS preferred_score
  FROM (
------ RAW DATA SET
SELECT 'VALID_CASE' AS VALID_CASE,
       t.school_year AS SchoolYear,
       r.StudentID AS StudentID,
       t.test_subject_id AS TestSubjectID,
       c.test_subject AS TestSubject,    
       t.subject_area AS SubjectArea,
       t.grade_level AS StudentGradeLevel,
       NULL AS HighestAppropriateTestGradeLevel,
       t.school_year AS YEAR,
       r.StudentID AS ID,
  CASE t.test_subject_id
       WHEN 312 THEN 'PRE_ALGEBRA' -- Math 8
       WHEN 313 THEN 'SEC_MATH_I'
       WHEN 314 THEN 'SEC_MATH_II'
       WHEN 315 THEN 'SEC_MATH_III'
       WHEN 505 THEN 'EARTH_SCIENCE'
       WHEN 506 THEN 'BIOLOGY'
       WHEN 507 THEN 'CHEMISTRY'
       WHEN 508 THEN 'PHYSICS'
       ELSE (
       CASE
       WHEN t.subject_area = 'L' THEN 'ELA'
       WHEN t.subject_area = 'M' AND t.test_subject_id NOT IN ('312','313','314','315') THEN 'MATHEMATICS'
       WHEN t.subject_area = 'S' AND t.test_subject_id NOT IN ('505','506','507','508') THEN 'SCIENCE' END
       )
   END AS CONTENT_AREA,
  CASE WHEN CAST(t.test_subject_id AS VARCHAR) IN ('312','313','314','315','505','506','507','508') THEN 'EOCT'
       WHEN p.grade_low = p.grade_high THEN CAST(p.grade_low AS VARCHAR)
       ELSE CAST(t.grade_level AS VARCHAR)
   END AS GRADE,
       t.scale_score AS SCALE_SCORE,
       t.proficiency_level AS ACHIEVEMENT_LEVEL,
       t.district_id AS DISTRICT_NUMBER,
       t.school_id AS SCHOOL_NUMBER,
       i.institution_name AS DISTRICT_NAME,
       s.school_name AS SCHOOL_NAME, 
       m.last_name AS LAST_NAME,
       m.first_name AS FIRST_NAME,
       r.ssid AS SSID
  FROM SAGE.reportable_tests AS t
  JOIN test_program p
    ON t.test_prog_id = p.test_prog_id
  JOIN ct_test_subjects AS c
    ON p.test_subject_id = c.test_subject_id
  JOIN RootSSIDs AS r
    ON t.student_id = r.OldStudentID
  JOIN student_master AS m
    ON r.StudentID = m.student_id
  JOIN v_be_school AS s
    ON t.district_id = s.district_id
       AND t.school_id = s.school_id
       AND t.school_number = s.school_nbr
  JOIN v_be_institution AS i
    ON s.district_id = i.institution_id
 WHERE t.school_year = 2015
   AND t.ParticipationCode NOT IN (204,304) -- Not excluded from testing by parent
   AND t.scale_score IS NOT NULL -- Scale score is available
   AND t.full_academic_year = 'Y' -- Enrolled for full academic year in school
  ) AS sub1
  ) AS sub2
 WHERE preferred_score = 1
 UNION ALL
----------------------------
------ SAGE 2014 VALID TESTS [COUNT(*) = 1,095,347 (0min:24sec)]
SELECT sub2.* 
  FROM (
------ TIE BREAKER TO RESOLVE DUPLICATES
SELECT sub1.*,
   ROW_NUMBER() OVER (
       PARTITION BY StudentID, CONTENT_AREA
       ORDER BY TestSubjectID -- Prefer the score associated with the more advanced course
  ) AS preferred_score
  FROM (
------ RAW DATA SET
SELECT 'VALID_CASE' AS VALID_CASE,
       t.school_year AS SchoolYear,
       t.student_id AS StudentID,
       t.test_subject_id AS TestSubjectID,
       c.test_subject AS TestSubject,    
       t.subject_area AS SubjectArea,
       e.grade_level AS StudentGradeLevel,
       NULL AS HighestAppropriateTestGradeLevel,
       t.school_year AS YEAR,
       r.StudentID AS ID,
  CASE p.test_subject_id
       WHEN 312 THEN 'PRE_ALGEBRA' -- Math 8
       WHEN 313 THEN 'SEC_MATH_I'
       WHEN 314 THEN 'SEC_MATH_II'
       WHEN 315 THEN 'SEC_MATH_III'
       WHEN 505 THEN 'EARTH_SCIENCE'
       WHEN 506 THEN 'BIOLOGY'
       WHEN 507 THEN 'CHEMISTRY'
       WHEN 508 THEN 'PHYSICS'
       ELSE (
       CASE
       WHEN p.subject_area = 'L' THEN 'ELA'
       WHEN p.subject_area = 'M' AND p.test_subject_id NOT IN ('312','313','314','315') THEN 'MATHEMATICS'
       WHEN p.subject_area = 'S' AND p.test_subject_id NOT IN ('505','506','507','508') THEN 'SCIENCE' END
       )
   END AS CONTENT_AREA,
  CASE WHEN CAST(p.test_subject_id AS VARCHAR) IN ('312','313','314','315','505','506','507','508') THEN 'EOCT'
       WHEN p.grade_low = p.grade_high THEN CAST(p.grade_low AS VARCHAR)
       ELSE CAST(e.grade_level AS VARCHAR)
   END AS GRADE,
       t.scale_score AS SCALE_SCORE,
       t.proficiency_level AS ACHIEVEMENT_LEVEL,
       t.district_id AS DISTRICT_NUMBER,
       t.school_id AS SCHOOL_NUMBER,
       i.institution_name AS DISTRICT_NAME,
       s.school_name AS SCHOOL_NAME, 
       m.last_name AS LAST_NAME,
       m.first_name AS FIRST_NAME,
       r.ssid AS SSID
  FROM SAGE.valid_tests AS t
  JOIN test_program p
    ON t.test_prog_id = p.test_prog_id
  JOIN ct_test_subjects AS c
    ON p.test_subject_id = c.test_subject_id
  JOIN student_enrollment AS e
    ON t.school_year = e.school_year
       AND t.district_id = e.district_id
       AND t.school_id = e.school_id
       AND t.school_number = e.school_number
       AND t.student_id = e.student_id
  JOIN RootSSIDs AS r
    ON t.student_id = r.OldStudentID
  JOIN student_master AS m
    ON r.StudentID = m.student_id
  JOIN v_be_school AS s
    ON t.district_id = s.district_id
       AND t.school_id = s.school_id
       AND t.school_number = s.school_nbr
  JOIN v_be_institution AS i
    ON s.district_id = i.institution_id
 WHERE t.school_year = 2014
   AND t.ParticipationCode <> 204
   AND t.scale_score IS NOT NULL
  ) AS sub1
  ) AS sub2
 WHERE preferred_score = 1 
 UNION ALL
--------------------
------ CRT 2010-2013 [COUNT(*) = 4,170,653 (1min:55sec)]
SELECT sub2.*
  FROM (
------ TIE BREAKER TO RESOLVE DUPLICATES
SELECT sub1.*,
   ROW_NUMBER() OVER(
       PARTITION BY SchoolYear, StudentID, CONTENT_AREA
       ORDER BY SCALE_SCORE DESC, CASE WHEN StudentGradeLevel = HighestAppropriateTestGradeLevel THEN 1 ELSE 0 END DESC
     ) AS preferred_score
  FROM (
------ RAW DATA SET
SELECT 'VALID_CASE' AS VALID_CASE,
       t.school_year AS SchoolYear,
       r.StudentID AS StudentID,
       p.test_subject_id AS TestSubjectID,
       c.test_subject AS TestSubject,    
       p.subject_area AS SubjectArea,
       p.grade_high AS HighestAppropriateTestGradeLevel,
       CAST(e.grade_level AS TINYINT) AS StudentGradeLevel,
       t.school_year AS YEAR,
       t.student_id AS ID,
  CASE p.test_subject_id
       WHEN 307 THEN 'PRE_ALGEBRA'
       WHEN 312 THEN 'PRE_ALGEBRA' -- Math 8
       WHEN 308 THEN 'GEOMETRY'
       WHEN 309 THEN 'ALGEBRA_I'
       WHEN 310 THEN 'ALGEBRA_II'
       WHEN 311 THEN 'ALGEBRA_II'
       WHEN 505 THEN 'EARTH_SCIENCE'
       WHEN 506 THEN 'BIOLOGY'
       WHEN 507 THEN 'CHEMISTRY'
       WHEN 508 THEN 'PHYSICS'
       ELSE (
       CASE
       WHEN p.subject_area = 'L' THEN 'ELA'
       WHEN p.subject_area = 'M' AND p.test_subject_id NOT IN ('307','312','308','309','310','311') THEN 'MATHEMATICS'
       WHEN p.subject_area = 'S' AND p.test_subject_id NOT IN ('505','506','507','508') THEN 'SCIENCE' END
       )
   END AS CONTENT_AREA,
  CASE WHEN CAST(p.test_subject_id AS VARCHAR) IN ('307','312','308','309','310','311','505','506','507','508') THEN 'EOCT'
       WHEN p.grade_low = p.grade_high THEN CAST(p.grade_low AS VARCHAR)
       ELSE CAST(e.grade_level AS VARCHAR)
   END AS GRADE,
       t.scaled_score AS SCALE_SCORE,
       t.proficiency AS ACHIEVEMENT_LEVEL,
       t.district_id AS DISTRICT_NUMBER,
       t.school_id AS SCHOOL_NUMBER,
       i.institution_name AS DISTRICT_NAME,
       s.school_name AS SCHOOL_NAME, 
       m.last_name AS LAST_NAME,
       m.first_name AS FIRST_NAME,
       r.ssid AS SSID
  FROM student_test AS t
  JOIN test_program p
    ON t.test_prog_id = p.test_prog_id
  JOIN ct_test_subjects AS c
    ON p.test_subject_id = c.test_subject_id
  JOIN student_enrollment AS e
    ON t.school_year = e.school_year
       AND t.district_id = e.district_id
       AND t.school_id = e.school_id
       AND t.school_number = e.school_number
       AND t.student_id = e.student_id
       AND t.entry_date = e.entry_date
  JOIN RootSSIDs AS r
    ON t.student_id = r.OldStudentID
  JOIN student_master AS m
    ON r.StudentID = m.student_id
  JOIN v_be_school AS s
    ON t.district_id = s.district_id
       AND t.school_id = s.school_id
       AND t.school_number = s.school_nbr
  JOIN v_be_institution AS i
    ON s.district_id = i.institution_id
 WHERE t.school_year BETWEEN 2010 AND 2013
   AND t.test_medium <> 'U'
   AND t.test_participation IN ('','1') AND t.test_non_participation = ''
   AND t.scaled_score BETWEEN 130 AND 199
   AND ((e.grade_level >= 3 AND p.subject_area IN ('L','M'))
    OR (e.grade_level >= 4 AND p.subject_area = 'S'))
  ) AS sub1
  ) AS sub2
 WHERE preferred_score = 1 
  ) AS sub3

Step 1b: Creating an R version of the cleaned data (2015 FAY and previous years non-FAY)

After data has been cleaned using the above SQL script, a comma delimited text file is exported and needs to be read into R in order to create a .Rdata file that can be used for SGP calculations. The R code below reads the SGP_Longfile_2015.csv data (located in the relative folder Data/Base_Files) into R and tidies up the data variable by variable to ensure that it matches previously established data naming conventions for subsequent analysis in the SGP package.

Of note, this year's original data submission (SGP_Longfile_2015.csv) was problematic in that some records contained the text "ASCENT, INC.," for several school/district names. The additional comas created problems reading in the csv format. These records were first fixed in an text editor (changed to "ASCENT INC.") and the file saved before successfully opening the file in R.

##########################################################
####
#### Code for cleaning and preparation of Utah LONG data
####
##########################################################

library(SGP)
library(data.table)

### Set working directory

### Read in USOE .csv file  

Utah_Data_LONG_2015 <- fread(
  "Data/Base_Files/SGP_Longfile_2015.csv", colClasses=rep("character", 20))

#  Give variables names (no names provided in data submission - correct names and order was provided by Randy R.)
setnames(Utah_Data_LONG_2015, c('VALID_CASE', 'SchoolYear', 'StudentID', 
  'TestSubjectID', 'TestSubject', 'SubjectArea', 'StudentGradeLevel', 
  'YEAR', 'ID', 'CONTENT_AREA', 'GRADE', 'SCALE_SCORE', 'ACHIEVEMENT_LEVEL', 
  'DISTRICT_NUMBER', 'SCHOOL_NUMBER', 'DISTRICT_NAME', 'SCHOOL_NAME', 
  'LAST_NAME', 'FIRST_NAME', 'SSID'))

#  Add the `ENROLLMENT_STATUS` variables using data.table assign by reference.
Utah_Data_LONG_2015[, SCHOOL_ENROLLMENT_STATUS := 'Enrolled School: Yes']
Utah_Data_LONG_2015[, DISTRICT_ENROLLMENT_STATUS := 'Enrolled District: Yes']
Utah_Data_LONG_2015[, STATE_ENROLLMENT_STATUS := 'Enrolled State: Yes']


# Transform proficiency level variable to previously used levels (BP, P and A)
Utah_Data_LONG_2015$ACHIEVEMENT_LEVEL <- factor(
  Utah_Data_LONG_2015$ACHIEVEMENT_LEVEL, levels=0:4, 
  labels=c(NA, "Level 1", "Level 2", "Level 3", "Level 4"), ordered=TRUE)

Utah_Data_LONG_2015$ACHIEVEMENT_LEVEL_FULL <- Utah_Data_LONG_2015$ACHIEVEMENT_LEVEL

levels(Utah_Data_LONG_2015$ACHIEVEMENT_LEVEL) <- c(NA, "BP", "BP", "P", "A")


### Final cleaning needed for File Randy uploaded

#  INVALIDate cases with GRADE = 2
Utah_Data_LONG_2015[which(GRADE==2), VALID_CASE := "INVALID_CASE"]

#  Fix 2014 and 2015 8th Grade Math (Labeled as "PRE_ALGEBRA")
Utah_Data_LONG_2015[
  which(CONTENT_AREA=="PRE_ALGEBRA" & YEAR %in% 2014:2015), GRADE := "8"]
Utah_Data_LONG_2015[
  which(CONTENT_AREA=="PRE_ALGEBRA" & YEAR %in% 2014:2015), CONTENT_AREA := "MATHEMATICS"]
Duplicates from previous years

The original data submission included all student records from 2008 to 2014, not just the new non-FAY student records that needed to be added. Thus it was necessary to identify the duplicates from the new data. This required loading the SGP data object saved after the 2014 analyses and combining the existing longitudinal data with the new long file. When duplicates were present, the original record was retained. In the end, the file Utah_Data_LONG_2015.Rdata is produced, which includes the 2015 FAY and 2014 new non-FAY data. The 2008-2013 data is added to the Utah_SGP@Data slot directly and re-saved as the new 2014 object.

### Identify duplicate and unique cases from prior years

load("~Data/Utah_SGP.Rdata")

UT <- copy(Utah_SGP@Data)
setnames(UT, 
  c("test_subject", "school_year", "student_id", "test_subject_id"), 
  c("TestSubject", "SchoolYear", "StudentID", "TestSubjectID"))

UT[, NOTES := "Original data."] #Old
Utah_Data_LONG_2015[, NOTES := "Original data."]
Utah_Data_LONG_2015[which(YEAR != 2015), NOTES := "Provided in 2015 as additional prior data."]
Utah_Data_LONG_2015[, ACHIEVEMENT_LEVEL := as.character(ACHIEVEMENT_LEVEL)]

ALL_SAGE <- rbind(UT, Utah_Data_LONG_2015, fill=TRUE)
setkeyv(ALL_SAGE, c(key(UT), "NOTES"))
setkeyv(ALL_SAGE, key(UT))
ALL_SAGE <- ALL_SAGE[!duplicated(ALL_SAGE)]

###  Create object of just the 2008-2013 priors to add into the new "final" 2014 object.  
###  2014 non-FAY will be used in first analysis step to create their SGPs.

UT_PRIORS <- ALL_SAGE[which(!YEAR %in% c(2014, 2015) & NOTES != "Original data.")][, names(Utah_Data_LONG_2015), with=FALSE]

Utah_Data_LONG_2015 <- ALL_SAGE[YEAR %in% c(2014, 2015)][, names(Utah_Data_LONG_2015), with=FALSE]
Utah_Data_LONG_2015 <- Utah_Data_LONG_2015[-which(YEAR == 2014 & NOTES == "Original data.")]

###  Save 2014/2015 long data object.
save(Utah_Data_LONG_2015, file="/media/Data/Dropbox/SGP/Utah/Data/Base_Files/Utah_Data_LONG_2015.Rdata")


###  Add the 2008-2013 non-duplicated prior data to the existing longitudinal data file

Utah_SGP@Data <- rbindlist(list(UT, UT_PRIORS), fill=TRUE)

###  Final changes - make NOTES a factor and remove unused variables
Utah_SGP@Data[, NOTES := factor(NOTES)]
Utah_SGP@Data[, EMH_LEVEL := NULL]
Utah_SGP@Data[, GRADE_PRIOR := NULL]
Utah_SGP@Data[, ACHIEVEMENT_LEVEL_PRIOR2 := NULL]

save(Utah_SGP, file="/media/Data/Dropbox/SGP/Utah_SGP.Rdata")

Step 1a: Initial data cleaning with SQL (2015 non-FAY)

The SQL code below was used to create the second data set submitted to the Center for analysis.

------ 1. 2015 STUDENTS POTENTIALLY MISSING SGPs (N = 72,566; 0min:50sec)
SELECT DISTINCT
       SchoolYear,
       TestSubjectID,
       i.StudentID,
       i.OldStudentID,
       i.SSID
  INTO #Missing2015
  FROM (
--> IN 2015 StudentTests BUT StudentID CHANGED
SELECT SchoolYear,
       TestSubjectID,
       t.StudentID
  FROM SAGE.StudentTests AS t
  JOIN [ACSDBPROD-1\ACS].SSID.dbo.RootSSIDs AS i
    ON t.StudentID = i.OldStudentID
 WHERE SchoolYear = 2015
   AND ParticipationCode NOT IN (204,304)
   AND ScaleScore IS NOT NULL
   AND IsFullAcademicYear = 'Y'
   AND SGP IS NULL
   AND ((GradeLevel >= 3 AND SubjectArea IN ('L','M'))
       OR (GradeLevel >= 4 AND SubjectArea = 'S'))
   AND CASE WHEN t.StudentID <> i.StudentID THEN 1 ELSE 0 END = 1
 UNION
--> IN 2015 reportable_tests BUT NOT FAY
SELECT school_year AS SchoolYear,
       test_subject_id AS TestSubjectID,
       r.student_id AS StudentID
  FROM SAGE.reportable_tests AS r
 WHERE school_year = 2015
   AND scale_score IS NOT NULL
   AND ParticipationCode NOT IN (204,304)
   AND full_academic_year = 'N'
 UNION
--> IN 2015 valid_tests 2015 BUT NOT IN reportable_tests
SELECT v.school_year AS SchoolYear,
       v.test_subject_id AS TestSubjectID,
       v.student_id AS StudentID
  FROM SAGE.valid_tests AS v
  LEFT JOIN SAGE.reportable_tests AS r
    ON v.school_year = r.school_year
       AND v.student_id = r.student_id
       AND v.test_subject_id = r.test_subject_id
 WHERE v.school_year = 2015
   AND v.scale_score IS NOT NULL
   AND v.ParticipationCode NOT IN (204,304)
   AND r.school_year IS NULL
   AND r.student_id IS NULL
   AND r.test_subject_id IS NULL
  ) AS sub
  JOIN [ACSDBPROD-1\ACS].SSID.dbo.RootSSIDs AS i
    ON sub.StudentID = i.OldStudentID
--> CHECK
SELECT * FROM #Missing2015

------ 2. UNION SET 2008-2009,2015 (N = 150,051; 1min:23sec) 
--> SUPPLEMENTAL SAGE 2015 (N = 68,262)
SELECT sub2.*
  FROM (
------ TIE BREAKER TO RESOLVE DUPLICATES
SELECT sub1.*,
   ROW_NUMBER() OVER (
       PARTITION BY SchoolYear, StudentID, CONTENT_AREA
       ORDER BY TestSubjectID
     ) AS preferred_score
  FROM (
------ RAW DATA SET
SELECT 'VALID_CASE' AS VALID_CASE,
       t.school_year AS SchoolYear,
       x.StudentID AS StudentID,
       t.test_subject_id AS TestSubjectID,
       c.test_subject AS TestSubject,    
       t.subject_area AS SubjectArea,
       NULL AS StudentGradeLevel, -- t.grade_level 
       p.grade_high AS HighestAppropriateTestGradeLevel,
       t.school_year AS YEAR,
       x.StudentID AS ID,
  CASE t.test_subject_id
       WHEN 313 THEN 'SEC_MATH_I'
       WHEN 314 THEN 'SEC_MATH_II'
       WHEN 315 THEN 'SEC_MATH_III'
       WHEN 505 THEN 'EARTH_SCIENCE'
       WHEN 506 THEN 'BIOLOGY'
       WHEN 507 THEN 'CHEMISTRY'
       WHEN 508 THEN 'PHYSICS'
       ELSE (
       CASE
       WHEN t.subject_area = 'L' THEN 'ELA'
       WHEN t.subject_area = 'M' AND t.test_subject_id NOT IN ('313','314','315') THEN 'MATHEMATICS'
       WHEN t.subject_area = 'S' AND t.test_subject_id NOT IN ('505','506','507','508') THEN 'SCIENCE' END
       )
   END AS CONTENT_AREA,
  CASE WHEN CAST(t.test_subject_id AS VARCHAR) IN ('313','314','315','505','506','507','508') THEN 'EOCT'
       WHEN p.grade_low = p.grade_high THEN CAST(p.grade_low AS VARCHAR)
       ELSE NULL -- t.grade_level
   END AS GRADE,
       t.scale_score AS SCALE_SCORE,
       t.proficiency_level AS ACHIEVEMENT_LEVEL,
       t.district_id AS DISTRICT_NUMBER,
       t.school_id AS SCHOOL_NUMBER,
       i.institution_name AS DISTRICT_NAME,
       s.school_name AS SCHOOL_NAME, 
       m.last_name AS LAST_NAME,
       m.first_name AS FIRST_NAME,
       x.ssid AS SSID,
       x.OldStudentID
  FROM #Missing2015 AS x
  JOIN SAGE.valid_tests AS t
    ON x.SchoolYear = t.school_year
       AND t.student_id IN (x.OldStudentID,x.StudentID)
       AND x.TestSubjectID = t.test_subject_id
  JOIN test_program p
    ON t.test_prog_id = p.test_prog_id
  JOIN ct_test_subjects AS c
    ON p.test_subject_id = c.test_subject_id
  JOIN student_master AS m
    ON x.StudentID = m.student_id
  JOIN v_be_school AS s
    ON t.district_id = s.district_id
       AND t.school_id = s.school_id
       AND t.school_number = s.school_nbr
  JOIN v_be_institution AS i
    ON s.district_id = i.institution_id
 WHERE t.school_year = 2015
   AND t.ParticipationCode NOT IN (204,304)
   AND t.scale_score IS NOT NULL
  ) AS sub1
  ) AS sub2
 WHERE preferred_score = 1
 UNION
--> SUPPLEMENTAL CRT 2008-2009 (N = 81,789)
SELECT sub2.*
  FROM (
------ TIE BREAKER TO RESOLVE DUPLICATES
SELECT sub1.*,
   ROW_NUMBER() OVER(
       PARTITION BY SchoolYear, StudentID, CONTENT_AREA
       ORDER BY SCALE_SCORE DESC, CASE WHEN StudentGradeLevel = HighestAppropriateTestGradeLevel THEN 1 ELSE 0 END DESC
     ) AS preferred_score
  FROM (
------ RAW DATA SET
SELECT 'VALID_CASE' AS VALID_CASE,
       t.school_year AS SchoolYear,
       x.StudentID AS StudentID,
       p.test_subject_id AS TestSubjectID,
       c.test_subject AS TestSubject,    
       p.subject_area AS SubjectArea,
       p.grade_high AS HighestAppropriateTestGradeLevel,
       CAST(e.grade_level AS TINYINT) AS StudentGradeLevel,
       t.school_year AS YEAR,
       x.StudentID AS ID,
  CASE p.test_subject_id
       WHEN 307 THEN 'PRE_ALGEBRA'
       WHEN 312 THEN 'PRE_ALGEBRA' -- Math 8
       WHEN 308 THEN 'GEOMETRY'
       WHEN 309 THEN 'ALGEBRA_I'
       WHEN 310 THEN 'ALGEBRA_II'
       WHEN 311 THEN 'ALGEBRA_II'
       WHEN 505 THEN 'EARTH_SCIENCE'
       WHEN 506 THEN 'BIOLOGY'
       WHEN 507 THEN 'CHEMISTRY'
       WHEN 508 THEN 'PHYSICS'
       ELSE (
       CASE
       WHEN p.subject_area = 'L' THEN 'ELA'
       WHEN p.subject_area = 'M' AND p.test_subject_id NOT IN ('307','312','308','309','310','311') THEN 'MATHEMATICS'
       WHEN p.subject_area = 'S' AND p.test_subject_id NOT IN ('505','506','507','508') THEN 'SCIENCE' END
       )
   END AS CONTENT_AREA,
  CASE WHEN CAST(p.test_subject_id AS VARCHAR) IN ('307','312','308','309','310','311','505','506','507','508') THEN 'EOCT'
       WHEN p.grade_low = p.grade_high THEN CAST(p.grade_low AS VARCHAR)
       ELSE CAST(e.grade_level AS VARCHAR)
   END AS GRADE,
       t.scaled_score AS SCALE_SCORE,
       t.proficiency AS ACHIEVEMENT_LEVEL,
       t.district_id AS DISTRICT_NUMBER,
       t.school_id AS SCHOOL_NUMBER,
       i.institution_name AS DISTRICT_NAME,
       s.school_name AS SCHOOL_NAME, 
       m.last_name AS LAST_NAME,
       m.first_name AS FIRST_NAME,
       x.ssid AS SSID,
       x.OldStudentID
  FROM student_test AS t
  JOIN test_program p
    ON t.test_prog_id = p.test_prog_id
  JOIN ct_test_subjects AS c
    ON p.test_subject_id = c.test_subject_id
  JOIN student_enrollment AS e
    ON t.school_year = e.school_year
       AND t.district_id = e.district_id
       AND t.school_id = e.school_id
       AND t.school_number = e.school_number
       AND t.student_id = e.student_id
       AND t.entry_date = e.entry_date
  JOIN [ACSDBPROD-1\ACS].SSID.dbo.RootSSIDs AS x
    ON t.student_id = x.OldStudentID
  JOIN student_master AS m
    ON t.student_id = m.student_id
  JOIN v_be_school AS s
    ON t.district_id = s.district_id
       AND t.school_id = s.school_id
       AND t.school_number = s.school_nbr
  JOIN v_be_institution AS i
    ON s.district_id = i.institution_id
 WHERE ((t.school_year = 2008 AND p.subject_area = 'M')
       OR t.school_year = 2009 AND e.full_academic_year = 'N')
   AND t.test_medium <> 'U'
   AND t.test_participation IN ('','1') AND t.test_non_participation = ''
   AND t.scaled_score BETWEEN 130 AND 199
   AND ((e.grade_level >= 3 AND p.subject_area IN ('L','M'))
    OR (e.grade_level >= 4 AND p.subject_area = 'S'))
   AND e.full_academic_year = 'N'
  ) AS sub1
  ) AS sub2
 WHERE preferred_score = 1

-------------------------------------------------------- 
------ 3. SAGE IDENTIFIERS 2014 (N = 75,988; 1min:12sec)
--> CHECK: Payson High (SchoolID = 827) Language Arts 11 (TestSubjectID = 110) N = 102
SELECT DISTINCT 
       SchoolYear,
       TestSubjectID,
       i.StudentID,
       i.OldStudentID
  FROM (
--> IN 2014 StudentTests BUT StudentID CHANGED
SELECT SchoolYear,
       TestSubjectID,
       t.StudentID
  FROM SAGE.StudentTests AS t
  JOIN [ACSDBPROD-1\ACS].SSID.dbo.RootSSIDs AS i
    ON t.StudentID = i.OldStudentID
 WHERE SchoolYear = 2014
   AND ParticipationCode NOT IN (204,304)
   AND ScaleScore IS NOT NULL
   AND IsFullAcademicYear = 'Y'
   AND SGP IS NULL
   AND ((GradeLevel >= 3 AND SubjectArea IN ('L','M'))
       OR (GradeLevel >= 4 AND SubjectArea = 'S'))
   AND CASE WHEN t.StudentID <> i.StudentID THEN 1 ELSE 0 END = 1 
 UNION
--> IN 2014 reportable_tests BUT NOT FAY
SELECT school_year AS SchoolYear,
       test_subject_id AS TestSubjectID,
       r.student_id AS StudentID
  FROM SAGE.reportable_tests AS r
 WHERE school_year = 2014
   AND scale_score IS NOT NULL
   AND ParticipationCode NOT IN (204,304)
   AND full_academic_year = 'N'
 UNION
--> IN 2014 valid_tests BUT NOT IN reportable_tests
SELECT v.school_year AS SchoolYear,
       v.test_subject_id AS TestSubjectID,
       v.student_id AS StudentID
  FROM SAGE.valid_tests AS v
  LEFT JOIN SAGE.reportable_tests AS r
    ON v.school_year = r.school_year
       AND v.student_id = r.student_id
       AND v.test_subject_id = r.test_subject_id
 WHERE v.school_year = 2014
   AND v.scale_score IS NOT NULL
   AND v.ParticipationCode NOT IN (204,304)
   AND r.school_year IS NULL
   AND r.student_id IS NULL
   AND r.test_subject_id IS NULL
  ) AS sub
  JOIN [ACSDBPROD-1\ACS].SSID.dbo.RootSSIDs AS i
    ON sub.StudentID = i.OldStudentID

Step 1b: Creating an R version of the cleaned data (2015 non-FAY)

Because there were found to be some duplicates between the 2015 FAY and 2015 non-FAY data files, the duplicate removal process must be run with this data file as well.

Utah_Data_LONG_2015_nonFAY <- fread(
  "Data/Base_Files/SGP 2015 Supplemental - Missing SGP Longfile (plus nonFAY 2008-2009).csv", colClasses=rep("character", 22))

Utah_Data_LONG_2015_nonFAY[, SCHOOL_ENROLLMENT_STATUS := 'Enrolled School: No']
Utah_Data_LONG_2015_nonFAY[, DISTRICT_ENROLLMENT_STATUS := 'Enrolled District: No']
Utah_Data_LONG_2015_nonFAY[, STATE_ENROLLMENT_STATUS := 'Enrolled State: Yes']

### Final cleaning needed for file Randy uploaded

# TRANSFORM PROFICIENCY LEVEL VARIABLES TO R FACTORS
Utah_Data_LONG_2015_nonFAY$ACHIEVEMENT_LEVEL <- 
  factor(Utah_Data_LONG_2015_nonFAY$ACHIEVEMENT_LEVEL, levels=0:4, 
      labels=c(NA, "Level 1", "Level 2", "Level 3", "Level 4"), ordered=TRUE)

Utah_Data_LONG_2015_nonFAY$ACHIEVEMENT_LEVEL_FULL <- Utah_Data_LONG_2015_nonFAY$ACHIEVEMENT_LEVEL

levels(Utah_Data_LONG_2015_nonFAY$ACHIEVEMENT_LEVEL) <- c(NA, "BP", "BP", "P", "A")

#  INVALIDate cases with GRADE = 2
Utah_Data_LONG_2015_nonFAY[which(GRADE==2), VALID_CASE := "INVALID_CASE"]

Utah_Data_LONG_2015_nonFAY[, NOTES := "Student Non-FAY."]
Utah_Data_LONG_2015_nonFAY[which(YEAR != 2015), NOTES := "Provided in 2015 Non-FAY as additional prior data."]

##    Final changes to Non-FAY data and Utah_SGP object
nonFAY <- rbindlist(
  list(UT, UT_PRIORS, Utah_Data_LONG_2015[YEAR=='2015'], Utah_Data_LONG_2015_nonFAY), fill=TRUE)
setkeyv(nonFAY, c(key(UT), "NOTES"))
setkeyv(nonFAY, key(UT))
dim(nonFAY[!duplicated(nonFAY) & NOTES != 'Student Non-FAY.'])

Utah_Data_LONG_2015_nonFAY <- nonFAY[!duplicated(nonFAY) & NOTES == 'Student Non-FAY.' & YEAR == '2015']
save(Utah_Data_LONG_2015_nonFAY, file="Data/Base_Files/Utah_Data_LONG_2015_nonFAY.Rdata")

Naming conventions

Note that the naming conventions associated with Utah data provided by USOE perfectly conform with the naming conventions used by the SGP Package. Although the SGP Package accommodates different state data naming conventions through the use of a meta-data lookup table embedded within the SGPstateData Rdata object in the package, this is not necessary for Utah because the naming conventions have been established in the SQL code in data prep Step 1a above. The variable name lookup table for Utah is shown below with names.provided giving the state specific names, names.sgp the SGP Package specific names, names.type giving the type associated with the variable which is used within summarizeSGP to construct group summaries, names.info providing meta-data associated with the variable, and names.output a Boolean indicator of whether variable should be used with summarizeSGP.

names.provided names.sgp names.type names.info names.output
ID ID individual Unique student identifier
LAST_NAME LAST_NAME label Student last name
FIRST_NAME FIRST_NAME label Student first name
CONTENT_AREA CONTENT_AREA content Content area associated with record
YEAR YEAR time Year associated with record
GRADE GRADE institution_level Grade/Test level associated with record
SCALE_SCORE SCALE_SCORE measure Scale score
VALID_CASE VALID_CASE individual_inclusion Valid case indicator
TEST_NAME TEST_NAME misc Test name
ACHIEVEMENT_LEVEL ACHIEVEMENT_LEVEL measure Achievement level
SCHOOL_LEVEL SCHOOL_LEVEL institution_type Elementary/Middle/High School level flag
SCHOOL_NAME SCHOOL_NAME label School name
SCHOOL_NUMBER SCHOOL_NUMBER institution School number
DISTRICT_NAME DISTRICT_NAME label School name
DISTRICT_NUMBER DISTRICT_NUMBER institution District number
STATE_ENROLLMENT_STATUS STATE_ENROLLMENT_STATUS institution_inclusion State inclusion/accountability indicator
DISTRICT_ENROLLMENT_STATUS DISTRICT_ENROLLMENT_STATUS institution_inclusion District inclusion/accountability indicator
SCHOOL_ENROLLMENT_STATUS SCHOOL_ENROLLMENT_STATUS institution_inclusion School inclusion/accountability indicator
SUBGROUP SUBGROUP demographic Student super subgroup status TRUE
GENDER GENDER demographic Student gender TRUE
ETHNICITY ETHNICITY demographic Student ethnicity TRUE
FRL_STATUS FRL_STATUS demographic Student FRL status TRUE
IEP_STATUS IEP_STATUS demographic Student IEP status TRUE
ELL_STATUS ELL_STATUS demographic Student ELL status TRUE
GT_STATUS GT_STATUS demographic Student Gifted Talented Status TRUE
TEST_MEDIUM TEST_MEDIUM misc Test medium
HIGH_NEED_STATUS HIGH_NEED_STATUS demographic High need status flag TRUE