Re: China - H7N9 Human Isolates on Deposit at GISAID
By appearances, even the human sequences can fly.
A duplicate "RoundTrip" H7N9 isolate of ChinaHangzhou1_C1_38M_2013_03_24_f was posted at GISAID today from GenBank. The H7N9 count at GISAID is now 35, although duplicates mean the set of distinct sequences is smaller than that count suggests, and they reduce the accuracy of the database. De-duplication and "house-holding/connecting" are mainstay early design imperatives among Big Data projects involving data-sharing between disparate organisations.
Statistical anomalies prevail in a database without curation.
This apparent 25% increase in sample size (human cases from 4 to 5, so the duplicate makes up 20% of the inflated set) is a substantial variation for most calculations. And isn't that where we're all focused . . . on the variation? But we need to see ACTUAL variation, not skewing due to avoidable data quality issues.
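For anyone who wants to see the effect in numbers, here is a minimal sketch of sequence-level de-duplication in Python. It is only an illustration, not how GISAID or GenBank actually curate their records: the record layout, the isolate labels and the short sequences are all made up, and it assumes a "RoundTrip" duplicate carries the same nucleotide sequence apart from case and whitespace.

```python
import hashlib


def sequence_key(sequence: str) -> str:
    """Return a fingerprint that ignores case and whitespace differences."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def deduplicate(records):
    """Keep the first record seen for each distinct sequence."""
    first_seen = {}
    for record in records:
        first_seen.setdefault(sequence_key(record["sequence"]), record)
    return list(first_seen.values())


# Hypothetical records: four distinct human isolates plus one copy of
# isolate_A that came back via a second database with different formatting.
records = [
    {"source": "GISAID", "isolate": "isolate_A", "sequence": "ATGAACACTCAAATC"},
    {"source": "GISAID", "isolate": "isolate_B", "sequence": "ATGAACACTCAGATC"},
    {"source": "GISAID", "isolate": "isolate_C", "sequence": "ATGAATACTCAAATC"},
    {"source": "GISAID", "isolate": "isolate_D", "sequence": "ATGAACACCCAAATC"},
    {"source": "GenBank", "isolate": "isolate_A_copy", "sequence": "atgaacactc aaatc"},
]

distinct = deduplicate(records)

# Inflation of the apparent sample size relative to the true distinct count.
inflation = (len(records) - len(distinct)) / len(distinct)

print(f"records on deposit: {len(records)}")
print(f"distinct sequences: {len(distinct)}")
print(f"apparent increase:  {inflation:.0%}")
```

Matching on the sequence itself rather than on the isolate name is a deliberate choice in this sketch, since names are often reformatted when a record passes between databases. Real curation would also need to handle near-identical sequences and partial segments, which this does not attempt.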
Because we've seen the resounding negative effects of data duplication in critical scenarios, a request for information concerning the maintenance schedule for this ongoing issue has been made to GISAID.
At this stage of the H7N9 species transition, where acceleration is probable, taking this opportunity now to correct these data transmission errors may indeed provide the footing to tame the transmission of the actual disease. Some will say that data is always dirty, perpetually imperfect. But we all know that with a little work, like the work done on this devoted FT forum, most data can be improved and some data even shows the upside of being perfected.
And I think we all can agree that the Power of Perfect Information is ineffable.
"RoundTrip" FlightPath
A grateful acknowledgement to the researchers and GISAID team who have resolved the data transmission issue and removed the duplicate H7N9 sequence as of 2013-04-11-20:00, instantly making this request obsolete.