Influenza databases need root and branch reform.
GIGO (Garbage In, Garbage Out) is a computing expression, and a warning, about database design. In designing any database you need to consider, at the design stage, what you want to be able to get out of it, and make sure your data entry and table structure are suitable for the queries you will use to output the reports you need. Nice theory - but in this case the database was not so much designed as evolved, and it is consequently now showing its age and structural weaknesses.
A little history on how we got to this point may be useful. In the early 1930s, long before anyone was considering databases, a team at the MRC (UK Medical Research Council) was working on canine distemper and shifted its research from puppies to ferrets (quicker to breed and fewer problems from the anti-vivisectionists). When this resulted in a successful vaccine they turned their attention to influenza. One of the researchers, Wilson Smith, happened to catch flu, and his swabs were used to infect the Mill Hill ferrets, which turned out to be an excellent animal model for flu. Wilson Smith's flu then became the lab's standard research strain and was designated WS/33, and a naming format was born. Early strains of flu were classified using ferret test antisera to see if they reacted with known strains, which allowed antigenically novel H and N proteins to be identified and named by placing the next unused number after the letter, giving us the H1N1 designation for WS/33 and H2 and N2 for the first strains whose H and/or N did not react with WS/33 antisera. As more Hs and Ns were discovered new numbers were assigned, and with the advent of sequencing the strains within each serotype could be distinguished. By the end of all this WS/33 had evolved into A/WS/1933 (H1N1); however, as well as date and serotype, location was recognised as an important parameter, and the naming standard now follows the template 'Type (A, B, C or D)/Collection Location/Unique Identifier/Year (Serotype)'. A/Hong Kong/4801/2014 (H3N2) is typical of the format, and in fact has been chosen as a recommended vaccine strain for the Northern Hemisphere in 2017/18. All of this is fairly logical and, without the benefit of hindsight, it is hard to criticise anything done in getting to this point.
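The naming template described above is regular enough to parse mechanically. Here is a minimal sketch in Python; the field names and the regular expression are my own illustration, not part of any published standard, and real databases contain plenty of names this pattern would miss.

```python
import re

# One possible pattern for: Type/[Species/]Location/ID/Year (Serotype),
# where the Species field is omitted by convention for human isolates.
STRAIN_RE = re.compile(
    r"^(?P<type>[ABCD])/"           # virus type
    r"(?:(?P<species>[^/]+)/)??"    # optional host species (lazy, so it is
                                    # only consumed when the name needs it)
    r"(?P<location>[^/]+)/"         # collection location
    r"(?P<ident>[^/]+)/"            # lab-unique identifier
    r"(?P<year>\d{2,4})"            # two- or four-digit year
    r"\s*\((?P<serotype>[^)]+)\)$"  # serotype, e.g. H3N2
)

def parse_strain(name):
    """Return the strain's fields as a dict, or None if it doesn't parse."""
    m = STRAIN_RE.match(name)
    return m.groupdict() if m else None

print(parse_strain("A/Hong Kong/4801/2014 (H3N2)"))
print(parse_strain("A/duck/Guangdong/1/2014 (H5N6)"))
```

The lazy optional group relies on backtracking: the human-style name binds Hong Kong as the location, while the avian-style name only matches once "duck" is consumed as the species.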
So where is the problem? Things have moved on significantly, both in terms of computing power and lab analysis, giving us rapid, relatively low-cost sequencing and the ability to apply much more sophisticated queries to a vast and rapidly growing sequence database - which brings me back to GIGO. The data is dirty (AKA garbage) in that it lacks consistency of format, a prerequisite for getting useful data from queries. Look again at the template: 'Type/Species/Location/ID/Year (Serotype)'. The observant will have noticed I slipped in a new field, Species, which is omitted by convention if the species is Homo sapiens.
The first field, 'Type', is not a problem as it is a straight picklist selection.
'Species' and 'Collection Location' are becoming a problem because there is no clear definition of what should be used as the unit. Often the submitting lab will use the country; US labs often use a state's initials (NY, MI) or name; others use a city or administrative region. In the Species field you can get everything from 'wild bird' to the species' Latin name, with classifications like 'wader', 'duck', 'mallard', 'gull', 'galliform', 'poultry', 'chicken', 'dog' and 'environmental' all common field entries.
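To see why this matters for querying, here is a sketch of the kind of synonym table a curator ends up maintaining. The mapping below is entirely my own invention for illustration, not an agreed vocabulary; the real fix is guidelines at submission time, not clean-up afterwards.

```python
# Illustrative synonym table: many labels, one canonical host. In a real
# database this would be agreed between the databases and submitting labs.
SPECIES_SYNONYMS = {
    "mallard": "Anas platyrhynchos",
    "duck": "Anatidae (unspecified)",
    "wild bird": "Aves (unspecified)",
    "chicken": "Gallus gallus domesticus",
    "poultry": "Galliformes (domestic, unspecified)",
}

def normalise_species(raw):
    """Map a free-text host entry to a canonical form where known."""
    key = raw.strip().lower()
    return SPECIES_SYNONYMS.get(key, raw)  # unknown entries pass through

print(normalise_species("Mallard"))  # Anas platyrhynchos
print(normalise_species("Duck"))     # Anatidae (unspecified)
```

Note the information loss built into the data itself: 'duck' can only ever normalise to "some anatid", whereas a mallard entry (or a photo taken at collection) preserves the species.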
The 'Unique Identifier' is not a problem as long as you know it is only unique as part of the full name, taking the Type, Date and Location with it. It is usually just a sequential number, sometimes with the submitting lab's initials.
The 'Year' data is clean-ish, in that it is sometimes two digits and sometimes four (17, 2017), which is easily trapped for in code, but it should be standardised on four digits (2017), as the two-digit form will become a problem next year when the 1918 sequences become 100 years old. Also, in this day and age it should be narrowed down to the day of collection, as a minimum.
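The trap code in question is a one-liner, and it also makes the looming 1918/2018 problem concrete. The pivot value below is my own illustrative choice, which is exactly the point: once both centuries are in the data, no pivot is safe.

```python
def normalise_year(y):
    """Expand a two-digit year to four digits.

    Once both 1918 and 2018 sequences exist, '18' is ambiguous: no pivot
    value is safe, which is why the field should be stored as four digits.
    """
    y = str(y).strip()
    if len(y) == 4:
        return int(y)
    # A common (and eventually wrong) heuristic: small values mean 20xx.
    return 2000 + int(y) if int(y) <= 30 else 1900 + int(y)

print(normalise_year("2017"))  # 2017
print(normalise_year("17"))    # 2017
print(normalise_year("18"))    # 2018 - but it could equally mean 1918
```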
Current databases now also hold several other fields, including that database's unique sequence code, the submitting lab and date of submission, the gene segment being sequenced (H, N, M, NP etc.) and the sequence of bases that comprise it, plus the resultant amino acid sequence post-translation. None of these are problems; however, the metadata field is. It is often left blank, and when completed it follows no structure that makes it easy to query. Why all this matters will be dealt with in the next section.
Why change is needed. Now we have a sizeable database, we are back to GIGO: what questions do we want to answer, and how can we construct a query to extract that data? Moving on from that, how can we improve the structure to allow more, and more useful, information to be retrieved from the raw data?
The current state of play is that you access a form similar to this one to construct your query:
(Link to https://www.fludb.org/brc/influenza_...ator=influenza)
But you don't always get back all the relevant sequences if the exact wording of Host or Country does not match the way it was input. The code tries to trap for the most common errors, but in practice I have often had to try slightly different parameters, and some forms get you to free-type into the species field, which can produce some very bizarre results. Dates can also give odd results, as submission dates are sometimes used as sample dates but may be years later.
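"Trying slightly different parameters" can be partly automated with fuzzy matching against the values actually present in the field. This is the kind of workaround users end up writing, sketched here with Python's standard-library difflib; the host list is invented for illustration.

```python
import difflib

# A stand-in for the distinct Host values actually present in a database.
hosts_in_db = ["Human", "Homo sapiens", "Mallard", "Avian"]

def matching_hosts(query, cutoff=0.6):
    """Return stored Host values that approximately match the query."""
    lowered = {h.lower(): h for h in hosts_in_db}
    hits = difflib.get_close_matches(query.lower(), list(lowered),
                                     n=5, cutoff=cutoff)
    return [lowered[h] for h in hits]

print(matching_hosts("humans"))  # catches "Human" despite the extra 's'
```

This papers over spelling variation, but it cannot reconcile 'Human' with 'Homo sapiens'; only a controlled vocabulary at submission time fixes that.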
What changes would I like to see made, and why?
Starting with the changes.
The 'Species' field needs clear guidelines agreed by the databases, sequencing labs and sample collectors. In practice someone netting waders may not be able to identify everything they catch, but taking a mobile phone picture of anything needing clarification should not present a problem, so this is just sloppiness at this point. Environmental samples could have a metadata field for supplementary info, e.g. 'duck dropping from Beijing wet market'.
Location and Date badly need dragging into the 21st century. If you can afford to sequence samples and put a team into the field to collect them, then you should be able to provide that team with a simple GPS/time-stamper so you know exactly where and when they were working. This is very valuable data, as it allows effective GIS (Geographical Information System) integration. GIS allows cool stuff like animating your map, plotting the geographical spread over time so you can see each new H7N9 sequence pop up as time advances while also having it appear on a growing phylogenetic tree. There are many other epidemiological uses of GIS, but the lack of granularity in having just 'China' or 'USA' as your location negates most of them. Likewise for having one year as the limit of temporal granularity.
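What a GPS/time-stamped collection record could look like, with enough granularity for GIS animation, is easy to sketch. The field names are my own suggestion, and the coordinates and date below are invented for illustration:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class CollectionEvent:
    """One sample collection, pinned in space and time for GIS use."""
    strain: str
    lat: float            # decimal degrees, WGS84
    lon: float            # decimal degrees, WGS84
    collected: datetime   # exact sampling time, UTC

# Invented example record - coordinates for Hong Kong, arbitrary date.
event = CollectionEvent(
    strain="A/Hong Kong/4801/2014 (H3N2)",
    lat=22.3193,
    lon=114.1694,
    collected=datetime(2014, 2, 26, 9, 30, tzinfo=timezone.utc),
)

print(event.collected.isoformat())  # 2014-02-26T09:30:00+00:00
```

A lat/lon pair plus an ISO 8601 timestamp is all a GIS package needs to animate spread over time; 'China, 2014' supports neither.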
The metadata holds potentially very useful information, but its free-text structure means it has to be searched by keyword. Improvements in AI may help in time, but some kind of advice on format and content to the submitting labs could make life a lot easier now, and for any future AI helpers. In human cases of flu, data such as age, sex, clinical outcome, preconditions (immunocompromised or pregnant) and probable infection source are the kinds of things sometimes found. If you are trying to correlate viral genetics with clinical severity, you might set up a query to select for China/2014 to 2017/H7N9/Human, divide the output into mild, severe and fatal based on the clinical metadata, and then compare the amino acid sequences across the three data sets for statistically significant SNPs.
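If the clinical metadata were structured rather than free text, the severity split described above would reduce to a simple filter and group-by. Every record below is invented for illustration, as is the field layout:

```python
# Invented records with structured (rather than free-text) metadata.
records = [
    {"strain": "A/Shanghai/0001/2015 (H7N9)", "host": "Human",
     "meta": {"age": 67, "outcome": "fatal"}},
    {"strain": "A/Jiangsu/0002/2016 (H7N9)", "host": "Human",
     "meta": {"age": 34, "outcome": "mild"}},
    {"strain": "A/Zhejiang/0003/2016 (H7N9)", "host": "Human",
     "meta": {"age": 58, "outcome": "severe"}},
]

# Group strains by clinical outcome - the three data sets whose amino
# acid sequences you would then compare for significant SNPs.
by_outcome = {}
for r in records:
    if r["host"] == "Human":
        by_outcome.setdefault(r["meta"]["outcome"], []).append(r["strain"])

print(sorted(by_outcome))  # ['fatal', 'mild', 'severe']
```

With today's free-text metadata, the grouping step instead means keyword-searching each record by hand, which is exactly the cost the proposed guidelines would remove.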
All of this is just my personal opinion and is offered as a basis for discussion. There are many other changes I would like; for instance, many of the sequences are partial. They may have been collected by someone doing research on antigenic sites, in which case only H and N may have been needed, but if funding institutions instituted it as best practice, the RNA for the internal proteins could have been included with little additional cost or effort, and it might be of great use to someone looking at correlations between the viral RNP SNPs as follow-up research to the China/H7N9 example I gave in the last paragraph.
As the subject matter of this post is fairly esoteric I have used quite a lot of jargon, and it has, of course, been oversimplified in the interest of brevity; it also does not cover weaknesses in the selection of what to sequence, which have made the data of very limited use for some applications. I have written explanatory posts - now long buried somewhere in the threads of this site - on many peripheral topics, including explanations of the genetics and their significance, which I would be happy to try to find and add links to if anyone interested needs background on the jargon used or the difficulties arising from the extant system.