-
Notifications
You must be signed in to change notification settings - Fork 0
Data Preparation 2015
In the source code below, the data supplied by the Utah State Office of Education (USOE) is cleaned and processed using a two step process:
1a. In the first step the data is cleaned using the .sql script below wherein valid student records are selected based upon business rules specified in the sql code itself. The result is an exportable .csv file where each VALID_CASE
is unique by CONTENT_AREA
, YEAR
and ID
.
1b. In the second step the clean data files from step 1a are read into R and modified slightly with regard to variable types/classes. The result is an .Rdata
file that is suitable for analysis with the SGP Package.
This year two longitudinal data files were produced. The first was the data for students who continuously enrolled for the full 2014-2015 academic year ("FAY"). In addition to these typically included students are the academic records for non-FAY students from previous academic years (2008 to 2014), which were added in order a) to be used as prior scores (dependent variables) in the 2015 SGP analyses and b) to create 2014 SGPs for non-FAY students using the SGP coefficient matrices constructed in the 2014 analyses. The second set of data is the 2015 non-FAY student data, which is used to create 2015 SGPs using the coefficient matrices produced using only students who are eligible for accountability purposes. Accordingly, steps 1a. and 1b. are repeated twice for each data set.
The two sets of SQL
code below were provided directly from the USOE and were not modified by the Center staff in any way. As such there is no commentary provided here regarding this code.
--ALTER VIEW v_sgp_longfile AS
------ SGP 2015 LONGFILE
SELECT VALID_CASE,
SchoolYear,
StudentID,
TestSubjectID,
TestSubject,
SubjectArea,
StudentGradeLevel,
YEAR,
ID, -- student_id
CONTENT_AREA,
GRADE,
SCALE_SCORE,
ACHIEVEMENT_LEVEL,
DISTRICT_NUMBER,
SCHOOL_NUMBER,
DISTRICT_NAME,
SCHOOL_NAME,
LAST_NAME,
FIRST_NAME,
SSID -- Use in construction of individual student report (ISR) file name
FROM (
---------------------------------
------ SAGE 2015 REPORTABLE TESTS [COUNT(*) = 1,065,488 (0min:14sec)]
SELECT sub2.*
FROM (
------ TIE BREAKER TO RESOLVE DUPLICATES
SELECT sub1.*,
ROW_NUMBER() OVER (
PARTITION BY SchoolYear, StudentID, CONTENT_AREA
ORDER BY TestSubjectID -- Prefer the score associated with the more advanced course
) AS preferred_score
FROM (
------ RAW DATA SET
SELECT 'VALID_CASE' AS VALID_CASE,
t.school_year AS SchoolYear,
r.StudentID AS StudentID,
t.test_subject_id AS TestSubjectID,
c.test_subject AS TestSubject,
t.subject_area AS SubjectArea,
t.grade_level AS StudentGradeLevel,
NULL AS HighestAppropriateTestGradeLevel,
t.school_year AS YEAR,
r.StudentID AS ID,
CASE t.test_subject_id
WHEN 312 THEN 'PRE_ALGEBRA' -- Math 8
WHEN 313 THEN 'SEC_MATH_I'
WHEN 314 THEN 'SEC_MATH_II'
WHEN 315 THEN 'SEC_MATH_III'
WHEN 505 THEN 'EARTH_SCIENCE'
WHEN 506 THEN 'BIOLOGY'
WHEN 507 THEN 'CHEMISTRY'
WHEN 508 THEN 'PHYSICS'
ELSE (
CASE
WHEN t.subject_area = 'L' THEN 'ELA'
WHEN t.subject_area = 'M' AND t.test_subject_id NOT IN ('312','313','314','315') THEN 'MATHEMATICS'
WHEN t.subject_area = 'S' AND t.test_subject_id NOT IN ('505','506','507','508') THEN 'SCIENCE' END
)
END AS CONTENT_AREA,
CASE WHEN CAST(t.test_subject_id AS VARCHAR) IN ('312','313','314','315','505','506','507','508') THEN 'EOCT'
WHEN p.grade_low = p.grade_high THEN CAST(p.grade_low AS VARCHAR)
ELSE CAST(t.grade_level AS VARCHAR)
END AS GRADE,
t.scale_score AS SCALE_SCORE,
t.proficiency_level AS ACHIEVEMENT_LEVEL,
t.district_id AS DISTRICT_NUMBER,
t.school_id AS SCHOOL_NUMBER,
i.institution_name AS DISTRICT_NAME,
s.school_name AS SCHOOL_NAME,
m.last_name AS LAST_NAME,
m.first_name AS FIRST_NAME,
r.ssid AS SSID
FROM SAGE.reportable_tests AS t
JOIN test_program p
ON t.test_prog_id = p.test_prog_id
JOIN ct_test_subjects AS c
ON p.test_subject_id = c.test_subject_id
JOIN RootSSIDs AS r
ON t.student_id = r.OldStudentID
JOIN student_master AS m
ON r.StudentID = m.student_id
JOIN v_be_school AS s
ON t.district_id = s.district_id
AND t.school_id = s.school_id
AND t.school_number = s.school_nbr
JOIN v_be_institution AS i
ON s.district_id = i.institution_id
WHERE t.school_year = 2015
AND t.ParticipationCode NOT IN (204,304) -- Not excluded from testing by parent
AND t.scale_score IS NOT NULL -- Scale score is available
AND t.full_academic_year = 'Y' -- Enrolled for full academic year in school
) AS sub1
) AS sub2
WHERE preferred_score = 1
UNION ALL
----------------------------
------ SAGE 2014 VALID TESTS [COUNT(*) = 1,095,347 (0min:24sec)]
SELECT sub2.*
FROM (
------ TIE BREAKER TO RESOLVE DUPLICATES
SELECT sub1.*,
ROW_NUMBER() OVER (
PARTITION BY StudentID, CONTENT_AREA
ORDER BY TestSubjectID -- Prefer the score associated with the more advanced course
) AS preferred_score
FROM (
------ RAW DATA SET
SELECT 'VALID_CASE' AS VALID_CASE,
t.school_year AS SchoolYear,
t.student_id AS StudentID,
t.test_subject_id AS TestSubjectID,
c.test_subject AS TestSubject,
t.subject_area AS SubjectArea,
e.grade_level AS StudentGradeLevel,
NULL AS HighestAppropriateTestGradeLevel,
t.school_year AS YEAR,
r.StudentID AS ID,
CASE p.test_subject_id
WHEN 312 THEN 'PRE_ALGEBRA' -- Math 8
WHEN 313 THEN 'SEC_MATH_I'
WHEN 314 THEN 'SEC_MATH_II'
WHEN 315 THEN 'SEC_MATH_III'
WHEN 505 THEN 'EARTH_SCIENCE'
WHEN 506 THEN 'BIOLOGY'
WHEN 507 THEN 'CHEMISTRY'
WHEN 508 THEN 'PHYSICS'
ELSE (
CASE
WHEN p.subject_area = 'L' THEN 'ELA'
WHEN p.subject_area = 'M' AND p.test_subject_id NOT IN ('312','313','314','315') THEN 'MATHEMATICS'
WHEN p.subject_area = 'S' AND p.test_subject_id NOT IN ('505','506','507','508') THEN 'SCIENCE' END
)
END AS CONTENT_AREA,
CASE WHEN CAST(p.test_subject_id AS VARCHAR) IN ('312','313','314','315','505','506','507','508') THEN 'EOCT'
WHEN p.grade_low = p.grade_high THEN CAST(p.grade_low AS VARCHAR)
ELSE CAST(e.grade_level AS VARCHAR)
END AS GRADE,
t.scale_score AS SCALE_SCORE,
t.proficiency_level AS ACHIEVEMENT_LEVEL,
t.district_id AS DISTRICT_NUMBER,
t.school_id AS SCHOOL_NUMBER,
i.institution_name AS DISTRICT_NAME,
s.school_name AS SCHOOL_NAME,
m.last_name AS LAST_NAME,
m.first_name AS FIRST_NAME,
r.ssid AS SSID
FROM SAGE.valid_tests AS t
JOIN test_program p
ON t.test_prog_id = p.test_prog_id
JOIN ct_test_subjects AS c
ON p.test_subject_id = c.test_subject_id
JOIN student_enrollment AS e
ON t.school_year = e.school_year
AND t.district_id = e.district_id
AND t.school_id = e.school_id
AND t.school_number = e.school_number
AND t.student_id = e.student_id
JOIN RootSSIDs AS r
ON t.student_id = r.OldStudentID
JOIN student_master AS m
ON r.StudentID = m.student_id
JOIN v_be_school AS s
ON t.district_id = s.district_id
AND t.school_id = s.school_id
AND t.school_number = s.school_nbr
JOIN v_be_institution AS i
ON s.district_id = i.institution_id
WHERE t.school_year = 2014
AND t.ParticipationCode <> 204
AND t.scale_score IS NOT NULL
) AS sub1
) AS sub2
WHERE preferred_score = 1
UNION ALL
--------------------
------ CRT 2010-2013 [COUNT(*) = 4,170,653 (1min:55sec)]
SELECT sub2.*
FROM (
------ TIE BREAKER TO RESOLVE DUPLICATES
SELECT sub1.*,
ROW_NUMBER() OVER(
PARTITION BY SchoolYear, StudentID, CONTENT_AREA
ORDER BY SCALE_SCORE DESC, CASE WHEN StudentGradeLevel = HighestAppropriateTestGradeLevel THEN 1 ELSE 0 END DESC
) AS preferred_score
FROM (
------ RAW DATA SET
SELECT 'VALID_CASE' AS VALID_CASE,
t.school_year AS SchoolYear,
r.StudentID AS StudentID,
p.test_subject_id AS TestSubjectID,
c.test_subject AS TestSubject,
p.subject_area AS SubjectArea,
p.grade_high AS HighestAppropriateTestGradeLevel,
CAST(e.grade_level AS TINYINT) AS StudentGradeLevel,
t.school_year AS YEAR,
t.student_id AS ID,
CASE p.test_subject_id
WHEN 307 THEN 'PRE_ALGEBRA'
WHEN 312 THEN 'PRE_ALGEBRA' -- Math 8
WHEN 308 THEN 'GEOMETRY'
WHEN 309 THEN 'ALGEBRA_I'
WHEN 310 THEN 'ALGEBRA_II'
WHEN 311 THEN 'ALGEBRA_II'
WHEN 505 THEN 'EARTH_SCIENCE'
WHEN 506 THEN 'BIOLOGY'
WHEN 507 THEN 'CHEMISTRY'
WHEN 508 THEN 'PHYSICS'
ELSE (
CASE
WHEN p.subject_area = 'L' THEN 'ELA'
WHEN p.subject_area = 'M' AND p.test_subject_id NOT IN ('307','312','308','309','310','311') THEN 'MATHEMATICS'
WHEN p.subject_area = 'S' AND p.test_subject_id NOT IN ('505','506','507','508') THEN 'SCIENCE' END
)
END AS CONTENT_AREA,
CASE WHEN CAST(p.test_subject_id AS VARCHAR) IN ('307','312','308','309','310','311','505','506','507','508') THEN 'EOCT'
WHEN p.grade_low = p.grade_high THEN CAST(p.grade_low AS VARCHAR)
ELSE CAST(e.grade_level AS VARCHAR)
END AS GRADE,
t.scaled_score AS SCALE_SCORE,
t.proficiency AS ACHIEVEMENT_LEVEL,
t.district_id AS DISTRICT_NUMBER,
t.school_id AS SCHOOL_NUMBER,
i.institution_name AS DISTRICT_NAME,
s.school_name AS SCHOOL_NAME,
m.last_name AS LAST_NAME,
m.first_name AS FIRST_NAME,
r.ssid AS SSID
FROM student_test AS t
JOIN test_program p
ON t.test_prog_id = p.test_prog_id
JOIN ct_test_subjects AS c
ON p.test_subject_id = c.test_subject_id
JOIN student_enrollment AS e
ON t.school_year = e.school_year
AND t.district_id = e.district_id
AND t.school_id = e.school_id
AND t.school_number = e.school_number
AND t.student_id = e.student_id
AND t.entry_date = e.entry_date
JOIN RootSSIDs AS r
ON t.student_id = r.OldStudentID
JOIN student_master AS m
ON r.StudentID = m.student_id
JOIN v_be_school AS s
ON t.district_id = s.district_id
AND t.school_id = s.school_id
AND t.school_number = s.school_nbr
JOIN v_be_institution AS i
ON s.district_id = i.institution_id
WHERE t.school_year BETWEEN 2010 AND 2013
AND t.test_medium <> 'U'
AND t.test_participation IN ('','1') AND t.test_non_participation = ''
AND t.scaled_score BETWEEN 130 AND 199
AND ((e.grade_level >= 3 AND p.subject_area IN ('L','M'))
OR (e.grade_level >= 4 AND p.subject_area = 'S'))
) AS sub1
) AS sub2
WHERE preferred_score = 1
) AS sub3
After data has been cleaned using the above SQL script, a comma delimited text file is exported and needs to be read into
R
in order to create a .Rdata
file that can be used for SGP calculations. The R
code below
reads the SGP_Longfile_2015.csv data (located in the relative folder Data/Base_Files) into R
and tidies up the data variable by
variable to ensure that it matches previously established data naming conventions for subsequent analysis in the SGP package.
Of note, this year's original data submission (SGP_Longfile_2015.csv) was problematic in that some records contained the text "ASCENT, INC.," for several school/district names. The additional comas created problems reading in the csv format. These records were first fixed in an text editor (changed to "ASCENT INC.") and the file saved before successfully opening the file in R
.
##########################################################
####
#### Code for cleaning and preparation of Utah LONG data
####
##########################################################
library(SGP)
library(data.table)
### Set working directory
### Read in USOE .csv file
Utah_Data_LONG_2015 <- fread(
"Data/Base_Files/SGP_Longfile_2015.csv", colClasses=rep("character", 20))
# Give variables names (no names provided in data submission - correct names and order was provided by Randy R.)
setnames(Utah_Data_LONG_2015, c('VALID_CASE', 'SchoolYear', 'StudentID',
'TestSubjectID', 'TestSubject', 'SubjectArea', 'StudentGradeLevel',
'YEAR', 'ID', 'CONTENT_AREA', 'GRADE', 'SCALE_SCORE', 'ACHIEVEMENT_LEVEL',
'DISTRICT_NUMBER', 'SCHOOL_NUMBER', 'DISTRICT_NAME', 'SCHOOL_NAME',
'LAST_NAME', 'FIRST_NAME', 'SSID'))
# Add the `ENROLLMENT_STATUS` variables using data.table assign by reference.
Utah_Data_LONG_2015[, SCHOOL_ENROLLMENT_STATUS := 'Enrolled School: Yes']
Utah_Data_LONG_2015[, DISTRICT_ENROLLMENT_STATUS := 'Enrolled District: Yes']
Utah_Data_LONG_2015[, STATE_ENROLLMENT_STATUS := 'Enrolled State: Yes']
# Transform proficiency level variable to previously used levels (BP, P and A)
Utah_Data_LONG_2015$ACHIEVEMENT_LEVEL <- factor(
Utah_Data_LONG_2015$ACHIEVEMENT_LEVEL, levels=0:4,
labels=c(NA, "Level 1", "Level 2", "Level 3", "Level 4"), ordered=TRUE)
Utah_Data_LONG_2015$ACHIEVEMENT_LEVEL_FULL <- Utah_Data_LONG_2015$ACHIEVEMENT_LEVEL
levels(Utah_Data_LONG_2015$ACHIEVEMENT_LEVEL) <- c(NA, "BP", "BP", "P", "A")
### Final cleaning needed for File Randy uploaded
# INVALIDate cases with GRADE = 2
Utah_Data_LONG_2015[which(GRADE==2), VALID_CASE := "INVALID_CASE"]
# Fix 2014 and 2015 8th Grade Math (Labeled as "PRE_ALGEBRA")
Utah_Data_LONG_2015[
which(CONTENT_AREA=="PRE_ALGEBRA" & YEAR %in% 2014:2015), GRADE := "8"]
Utah_Data_LONG_2015[
which(CONTENT_AREA=="PRE_ALGEBRA" & YEAR %in% 2014:2015), CONTENT_AREA := "MATHEMATICS"]
The original data submission included all student records from 2008 to 2014, not just the new non-FAY student records that needed to be added. Thus it was necessary to identify the duplicates from the new data. This required loading the SGP data object saved after the 2014 analyses and combining the existing longitudinal data with the new long file. When duplicates were present, the original record was retained. In the end, the file Utah_Data_LONG_2015.Rdata
is produced, which includes the 2015 FAY and 2014 new non-FAY data. The 2008-2013 data is added to the Utah_SGP@Data
slot directly and re-saved as the new 2014 object.
### Identify duplicate and unique cases from prior years
load("~Data/Utah_SGP.Rdata")
UT <- copy(Utah_SGP@Data)
setnames(UT,
c("test_subject", "school_year", "student_id", "test_subject_id"),
c("TestSubject", "SchoolYear", "StudentID", "TestSubjectID"))
UT[, NOTES := "Original data."] #Old
Utah_Data_LONG_2015[, NOTES := "Original data."]
Utah_Data_LONG_2015[which(YEAR != 2015), NOTES := "Provided in 2015 as additional prior data."]
Utah_Data_LONG_2015[, ACHIEVEMENT_LEVEL := as.character(ACHIEVEMENT_LEVEL)]
ALL_SAGE <- rbind(UT, Utah_Data_LONG_2015, fill=TRUE)
setkeyv(ALL_SAGE, c(key(UT), "NOTES"))
setkeyv(ALL_SAGE, key(UT))
ALL_SAGE <- ALL_SAGE[!duplicated(ALL_SAGE)]
### Create object of just the 2008-2013 priors to add into the new "final" 2014 object.
### 2014 non-FAY will be used in first analysis step to create their SGPs.
UT_PRIORS <- ALL_SAGE[which(!YEAR %in% c(2014, 2015) & NOTES != "Original data.")][, names(Utah_Data_LONG_2015), with=FALSE]
Utah_Data_LONG_2015 <- ALL_SAGE[YEAR %in% c(2014, 2015)][, names(Utah_Data_LONG_2015), with=FALSE]
Utah_Data_LONG_2015 <- Utah_Data_LONG_2015[-which(YEAR == 2014 & NOTES == "Original data.")]
### Save 2014/2015 long data object.
save(Utah_Data_LONG_2015, file="/media/Data/Dropbox/SGP/Utah/Data/Base_Files/Utah_Data_LONG_2015.Rdata")
### Add the 2008-2013 non-duplicated prior data to the existing longitudinal data file
Utah_SGP@Data <- rbindlist(list(UT, UT_PRIORS), fill=TRUE)
### Final changes - make NOTES a factor and remove unused variables
Utah_SGP@Data[, NOTES := factor(NOTES)]
Utah_SGP@Data[, EMH_LEVEL := NULL]
Utah_SGP@Data[, GRADE_PRIOR := NULL]
Utah_SGP@Data[, ACHIEVEMENT_LEVEL_PRIOR2 := NULL]
save(Utah_SGP, file="/media/Data/Dropbox/SGP/Utah_SGP.Rdata")
The SQL
code below was used to create the second data set submitted to the Center for analysis.
------ 1. 2015 STUDENTS POTENTIALLY MISSING SGPs (N = 72,566; 0min:50sec)
SELECT DISTINCT
SchoolYear,
TestSubjectID,
i.StudentID,
i.OldStudentID,
i.SSID
INTO #Missing2015
FROM (
--> IN 2015 StudentTests BUT StudentID CHANGED
SELECT SchoolYear,
TestSubjectID,
t.StudentID
FROM SAGE.StudentTests AS t
JOIN [ACSDBPROD-1\ACS].SSID.dbo.RootSSIDs AS i
ON t.StudentID = i.OldStudentID
WHERE SchoolYear = 2015
AND ParticipationCode NOT IN (204,304)
AND ScaleScore IS NOT NULL
AND IsFullAcademicYear = 'Y'
AND SGP IS NULL
AND ((GradeLevel >= 3 AND SubjectArea IN ('L','M'))
OR (GradeLevel >= 4 AND SubjectArea = 'S'))
AND CASE WHEN t.StudentID <> i.StudentID THEN 1 ELSE 0 END = 1
UNION
--> IN 2015 reportable_tests BUT NOT FAY
SELECT school_year AS SchoolYear,
test_subject_id AS TestSubjectID,
r.student_id AS StudentID
FROM SAGE.reportable_tests AS r
WHERE school_year = 2015
AND scale_score IS NOT NULL
AND ParticipationCode NOT IN (204,304)
AND full_academic_year = 'N'
UNION
--> IN 2015 valid_tests 2015 BUT NOT IN reportable_tests
SELECT v.school_year AS SchoolYear,
v.test_subject_id AS TestSubjectID,
v.student_id AS StudentID
FROM SAGE.valid_tests AS v
LEFT JOIN SAGE.reportable_tests AS r
ON v.school_year = r.school_year
AND v.student_id = r.student_id
AND v.test_subject_id = r.test_subject_id
WHERE v.school_year = 2015
AND v.scale_score IS NOT NULL
AND v.ParticipationCode NOT IN (204,304)
AND r.school_year IS NULL
AND r.student_id IS NULL
AND r.test_subject_id IS NULL
) AS sub
JOIN [ACSDBPROD-1\ACS].SSID.dbo.RootSSIDs AS i
ON sub.StudentID = i.OldStudentID
--> CHECK
SELECT * FROM #Missing2015
------ 2. UNION SET 2008-2009,2015 (N = 150,051; 1min:23sec)
--> SUPPLEMENTAL SAGE 2015 (N = 68,262)
SELECT sub2.*
FROM (
------ TIE BREAKER TO RESOLVE DUPLICATES
SELECT sub1.*,
ROW_NUMBER() OVER (
PARTITION BY SchoolYear, StudentID, CONTENT_AREA
ORDER BY TestSubjectID
) AS preferred_score
FROM (
------ RAW DATA SET
SELECT 'VALID_CASE' AS VALID_CASE,
t.school_year AS SchoolYear,
x.StudentID AS StudentID,
t.test_subject_id AS TestSubjectID,
c.test_subject AS TestSubject,
t.subject_area AS SubjectArea,
NULL AS StudentGradeLevel, -- t.grade_level
p.grade_high AS HighestAppropriateTestGradeLevel,
t.school_year AS YEAR,
x.StudentID AS ID,
CASE t.test_subject_id
WHEN 313 THEN 'SEC_MATH_I'
WHEN 314 THEN 'SEC_MATH_II'
WHEN 315 THEN 'SEC_MATH_III'
WHEN 505 THEN 'EARTH_SCIENCE'
WHEN 506 THEN 'BIOLOGY'
WHEN 507 THEN 'CHEMISTRY'
WHEN 508 THEN 'PHYSICS'
ELSE (
CASE
WHEN t.subject_area = 'L' THEN 'ELA'
WHEN t.subject_area = 'M' AND t.test_subject_id NOT IN ('313','314','315') THEN 'MATHEMATICS'
WHEN t.subject_area = 'S' AND t.test_subject_id NOT IN ('505','506','507','508') THEN 'SCIENCE' END
)
END AS CONTENT_AREA,
CASE WHEN CAST(t.test_subject_id AS VARCHAR) IN ('313','314','315','505','506','507','508') THEN 'EOCT'
WHEN p.grade_low = p.grade_high THEN CAST(p.grade_low AS VARCHAR)
ELSE NULL -- t.grade_level
END AS GRADE,
t.scale_score AS SCALE_SCORE,
t.proficiency_level AS ACHIEVEMENT_LEVEL,
t.district_id AS DISTRICT_NUMBER,
t.school_id AS SCHOOL_NUMBER,
i.institution_name AS DISTRICT_NAME,
s.school_name AS SCHOOL_NAME,
m.last_name AS LAST_NAME,
m.first_name AS FIRST_NAME,
x.ssid AS SSID,
x.OldStudentID
FROM #Missing2015 AS x
JOIN SAGE.valid_tests AS t
ON x.SchoolYear = t.school_year
AND t.student_id IN (x.OldStudentID,x.StudentID)
AND x.TestSubjectID = t.test_subject_id
JOIN test_program p
ON t.test_prog_id = p.test_prog_id
JOIN ct_test_subjects AS c
ON p.test_subject_id = c.test_subject_id
JOIN student_master AS m
ON x.StudentID = m.student_id
JOIN v_be_school AS s
ON t.district_id = s.district_id
AND t.school_id = s.school_id
AND t.school_number = s.school_nbr
JOIN v_be_institution AS i
ON s.district_id = i.institution_id
WHERE t.school_year = 2015
AND t.ParticipationCode NOT IN (204,304)
AND t.scale_score IS NOT NULL
) AS sub1
) AS sub2
WHERE preferred_score = 1
UNION
--> SUPPLEMENTAL CRT 2008-2009 (N = 81,789)
SELECT sub2.*
FROM (
------ TIE BREAKER TO RESOLVE DUPLICATES
SELECT sub1.*,
ROW_NUMBER() OVER(
PARTITION BY SchoolYear, StudentID, CONTENT_AREA
ORDER BY SCALE_SCORE DESC, CASE WHEN StudentGradeLevel = HighestAppropriateTestGradeLevel THEN 1 ELSE 0 END DESC
) AS preferred_score
FROM (
------ RAW DATA SET
SELECT 'VALID_CASE' AS VALID_CASE,
t.school_year AS SchoolYear,
x.StudentID AS StudentID,
p.test_subject_id AS TestSubjectID,
c.test_subject AS TestSubject,
p.subject_area AS SubjectArea,
p.grade_high AS HighestAppropriateTestGradeLevel,
CAST(e.grade_level AS TINYINT) AS StudentGradeLevel,
t.school_year AS YEAR,
x.StudentID AS ID,
CASE p.test_subject_id
WHEN 307 THEN 'PRE_ALGEBRA'
WHEN 312 THEN 'PRE_ALGEBRA' -- Math 8
WHEN 308 THEN 'GEOMETRY'
WHEN 309 THEN 'ALGEBRA_I'
WHEN 310 THEN 'ALGEBRA_II'
WHEN 311 THEN 'ALGEBRA_II'
WHEN 505 THEN 'EARTH_SCIENCE'
WHEN 506 THEN 'BIOLOGY'
WHEN 507 THEN 'CHEMISTRY'
WHEN 508 THEN 'PHYSICS'
ELSE (
CASE
WHEN p.subject_area = 'L' THEN 'ELA'
WHEN p.subject_area = 'M' AND p.test_subject_id NOT IN ('307','312','308','309','310','311') THEN 'MATHEMATICS'
WHEN p.subject_area = 'S' AND p.test_subject_id NOT IN ('505','506','507','508') THEN 'SCIENCE' END
)
END AS CONTENT_AREA,
CASE WHEN CAST(p.test_subject_id AS VARCHAR) IN ('307','312','308','309','310','311','505','506','507','508') THEN 'EOCT'
WHEN p.grade_low = p.grade_high THEN CAST(p.grade_low AS VARCHAR)
ELSE CAST(e.grade_level AS VARCHAR)
END AS GRADE,
t.scaled_score AS SCALE_SCORE,
t.proficiency AS ACHIEVEMENT_LEVEL,
t.district_id AS DISTRICT_NUMBER,
t.school_id AS SCHOOL_NUMBER,
i.institution_name AS DISTRICT_NAME,
s.school_name AS SCHOOL_NAME,
m.last_name AS LAST_NAME,
m.first_name AS FIRST_NAME,
x.ssid AS SSID,
x.OldStudentID
FROM student_test AS t
JOIN test_program p
ON t.test_prog_id = p.test_prog_id
JOIN ct_test_subjects AS c
ON p.test_subject_id = c.test_subject_id
JOIN student_enrollment AS e
ON t.school_year = e.school_year
AND t.district_id = e.district_id
AND t.school_id = e.school_id
AND t.school_number = e.school_number
AND t.student_id = e.student_id
AND t.entry_date = e.entry_date
JOIN [ACSDBPROD-1\ACS].SSID.dbo.RootSSIDs AS x
ON t.student_id = x.OldStudentID
JOIN student_master AS m
ON t.student_id = m.student_id
JOIN v_be_school AS s
ON t.district_id = s.district_id
AND t.school_id = s.school_id
AND t.school_number = s.school_nbr
JOIN v_be_institution AS i
ON s.district_id = i.institution_id
WHERE ((t.school_year = 2008 AND p.subject_area = 'M')
OR t.school_year = 2009 AND e.full_academic_year = 'N')
AND t.test_medium <> 'U'
AND t.test_participation IN ('','1') AND t.test_non_participation = ''
AND t.scaled_score BETWEEN 130 AND 199
AND ((e.grade_level >= 3 AND p.subject_area IN ('L','M'))
OR (e.grade_level >= 4 AND p.subject_area = 'S'))
AND e.full_academic_year = 'N'
) AS sub1
) AS sub2
WHERE preferred_score = 1
--------------------------------------------------------
------ 3. SAGE IDENTIFIERS 2014 (N = 75,988; 1min:12sec)
--> CHECK: Payson High (SchoolID = 827) Language Arts 11 (TestSubjectID = 110) N = 102
SELECT DISTINCT
SchoolYear,
TestSubjectID,
i.StudentID,
i.OldStudentID
FROM (
--> IN 2014 StudentTests BUT StudentID CHANGED
SELECT SchoolYear,
TestSubjectID,
t.StudentID
FROM SAGE.StudentTests AS t
JOIN [ACSDBPROD-1\ACS].SSID.dbo.RootSSIDs AS i
ON t.StudentID = i.OldStudentID
WHERE SchoolYear = 2014
AND ParticipationCode NOT IN (204,304)
AND ScaleScore IS NOT NULL
AND IsFullAcademicYear = 'Y'
AND SGP IS NULL
AND ((GradeLevel >= 3 AND SubjectArea IN ('L','M'))
OR (GradeLevel >= 4 AND SubjectArea = 'S'))
AND CASE WHEN t.StudentID <> i.StudentID THEN 1 ELSE 0 END = 1
UNION
--> IN 2014 reportable_tests BUT NOT FAY
SELECT school_year AS SchoolYear,
test_subject_id AS TestSubjectID,
r.student_id AS StudentID
FROM SAGE.reportable_tests AS r
WHERE school_year = 2014
AND scale_score IS NOT NULL
AND ParticipationCode NOT IN (204,304)
AND full_academic_year = 'N'
UNION
--> IN 2014 valid_tests BUT NOT IN reportable_tests
SELECT v.school_year AS SchoolYear,
v.test_subject_id AS TestSubjectID,
v.student_id AS StudentID
FROM SAGE.valid_tests AS v
LEFT JOIN SAGE.reportable_tests AS r
ON v.school_year = r.school_year
AND v.student_id = r.student_id
AND v.test_subject_id = r.test_subject_id
WHERE v.school_year = 2014
AND v.scale_score IS NOT NULL
AND v.ParticipationCode NOT IN (204,304)
AND r.school_year IS NULL
AND r.student_id IS NULL
AND r.test_subject_id IS NULL
) AS sub
JOIN [ACSDBPROD-1\ACS].SSID.dbo.RootSSIDs AS i
ON sub.StudentID = i.OldStudentID
Because there were found to be some duplicates between the 2015 FAY and 2015 non-FAY data files, the duplicate removal process must be run with this data file as well.
Utah_Data_LONG_2015_nonFAY <- fread(
"Data/Base_Files/SGP 2015 Supplemental - Missing SGP Longfile (plus nonFAY 2008-2009).csv", colClasses=rep("character", 22))
Utah_Data_LONG_2015_nonFAY[, SCHOOL_ENROLLMENT_STATUS := 'Enrolled School: No']
Utah_Data_LONG_2015_nonFAY[, DISTRICT_ENROLLMENT_STATUS := 'Enrolled District: No']
Utah_Data_LONG_2015_nonFAY[, STATE_ENROLLMENT_STATUS := 'Enrolled State: Yes']
### Final cleaning needed for file Randy uploaded
# TRANSFORM PROFICIENCY LEVEL VARIABLES TO R FACTORS
Utah_Data_LONG_2015_nonFAY$ACHIEVEMENT_LEVEL <-
factor(Utah_Data_LONG_2015_nonFAY$ACHIEVEMENT_LEVEL, levels=0:4,
labels=c(NA, "Level 1", "Level 2", "Level 3", "Level 4"), ordered=TRUE)
Utah_Data_LONG_2015_nonFAY$ACHIEVEMENT_LEVEL_FULL <- Utah_Data_LONG_2015_nonFAY$ACHIEVEMENT_LEVEL
levels(Utah_Data_LONG_2015_nonFAY$ACHIEVEMENT_LEVEL) <- c(NA, "BP", "BP", "P", "A")
# INVALIDate cases with GRADE = 2
Utah_Data_LONG_2015_nonFAY[which(GRADE==2), VALID_CASE := "INVALID_CASE"]
Utah_Data_LONG_2015_nonFAY[, NOTES := "Student Non-FAY."]
Utah_Data_LONG_2015_nonFAY[which(YEAR != 2015), NOTES := "Provided in 2015 Non-FAY as additional prior data."]
## Final changes to Non-FAY data and Utah_SGP object
nonFAY <- rbindlist(
list(UT, UT_PRIORS, Utah_Data_LONG_2015[YEAR=='2015'], Utah_Data_LONG_2015_nonFAY), fill=TRUE)
setkeyv(nonFAY, c(key(UT), "NOTES"))
setkeyv(nonFAY, key(UT))
dim(nonFAY[!duplicated(nonFAY) & NOTES != 'Student Non-FAY.'])
Utah_Data_LONG_2015_nonFAY <- nonFAY[!duplicated(nonFAY) & NOTES == 'Student Non-FAY.' & YEAR == '2015']
save(Utah_Data_LONG_2015_nonFAY, file="Data/Base_Files/Utah_Data_LONG_2015_nonFAY.Rdata")
Note that the naming conventions associated with Utah data provided by USOE perfectly conform with the naming conventions used by the SGP Package.
Although the SGP Package accommodates different state data naming conventions through the use of a meta-data lookup table embedded
within the SGPstateData Rdata object in the package, this is not necessary for Utah because the
naming conventions have been established in the SQL
code in data prep Step 1a above. The variable name lookup table
for Utah is shown below with names.provided giving the state specific names, names.sgp the SGP Package specific names, names.type giving the
type associated with the variable which is used within summarizeSGP
to
construct group summaries, names.info providing meta-data associated with the variable, and names.output a Boolean indicator of whether variable
should be used with summarizeSGP
.
names.provided | names.sgp | names.type | names.info | names.output |
---|---|---|---|---|
ID | ID | individual | Unique student identifier | |
LAST_NAME | LAST_NAME | label | Student last name | |
FIRST_NAME | FIRST_NAME | label | Student first name | |
CONTENT_AREA | CONTENT_AREA | content | Content area associated with record | |
YEAR | YEAR | time | Year associated with record | |
GRADE | GRADE | institution_level | Grade/Test level associated with record | |
SCALE_SCORE | SCALE_SCORE | measure | Scale score | |
VALID_CASE | VALID_CASE | individual_inclusion | Valid case indicator | |
TEST_NAME | TEST_NAME | misc | Test name | |
ACHIEVEMENT_LEVEL | ACHIEVEMENT_LEVEL | measure | Achievement level | |
SCHOOL_LEVEL | SCHOOL_LEVEL | institution_type | Elementary/Middle/High School level flag | |
SCHOOL_NAME | SCHOOL_NAME | label | School name | |
SCHOOL_NUMBER | SCHOOL_NUMBER | institution | School number | |
DISTRICT_NAME | DISTRICT_NAME | label | School name | |
DISTRICT_NUMBER | DISTRICT_NUMBER | institution | District number | |
STATE_ENROLLMENT_STATUS | STATE_ENROLLMENT_STATUS | institution_inclusion | State inclusion/accountability indicator | |
DISTRICT_ENROLLMENT_STATUS | DISTRICT_ENROLLMENT_STATUS | institution_inclusion | District inclusion/accountability indicator | |
SCHOOL_ENROLLMENT_STATUS | SCHOOL_ENROLLMENT_STATUS | institution_inclusion | School inclusion/accountability indicator | |
SUBGROUP | SUBGROUP | demographic | Student super subgroup status | TRUE |
GENDER | GENDER | demographic | Student gender | TRUE |
ETHNICITY | ETHNICITY | demographic | Student ethnicity | TRUE |
FRL_STATUS | FRL_STATUS | demographic | Student FRL status | TRUE |
IEP_STATUS | IEP_STATUS | demographic | Student IEP status | TRUE |
ELL_STATUS | ELL_STATUS | demographic | Student ELL status | TRUE |
GT_STATUS | GT_STATUS | demographic | Student Gifted Talented Status | TRUE |
TEST_MEDIUM | TEST_MEDIUM | misc | Test medium | |
HIGH_NEED_STATUS | HIGH_NEED_STATUS | demographic | High need status flag | TRUE |
SGP - Student Growth Percentiles SGP Blog | SGP GitHub Repo | SGP on CRAN