Influenza databases need root and branch reform.


  • Influenza databases need root and branch reform.

    GIGO (garbage in, garbage out) is a computing expression, and warning, about database design. When designing any database you need to consider, at the design stage, what you want to be able to get out of it, and make sure your data entry and table structure suit the queries you will use to produce the reports you need. Nice theory - but in this case the database was not so much designed as evolved, and it is consequently now showing its age and its structural weaknesses.

    A little history on how we got to this point may be useful. In the early 1930s, long before anyone was considering databases, a team at the MRC (UK Medical Research Council) was working on canine distemper and shifted its research from puppies to ferrets (quicker to breed, and fewer problems from the anti-vivisectionists). When this resulted in a successful vaccine they shifted their attention to influenza. One of the researchers, Wilson Smith, happened to catch flu, and his swabs were used to infect the Mill Hill ferrets, which turned out to be an excellent animal model for flu. Wilson Smith's flu then became the lab's standard research strain and was designated WS/33 - and a naming format was born. Early strains of flu were classified using ferret test anti-sera to see if they reacted with known strains, which allowed antigenically novel H and N to be identified and named by placing the next unused number after the letter, giving us the H1N1 designation for WS/33, and H2 and N2 for the first strains whose H and/or N did not react to WS/33 anti-sera. As more Hs and Ns were discovered new numbers emerged, and with the advent of sequencing the strains within each serotype could be distinguished. At the end of all this WS/33 had evolved into A/WS/1933 (H1N1). However, as well as date and serotype, location was recognised as an important parameter, and the naming standard now follows the template 'Type (A, B, C or D)/Collection Location/Unique Identifier/Year (Serotype)'. A/Hong Kong/4801/2014 (H3N2) is typical of the format, and in fact has been chosen as a recommended vaccine strain for the Northern Hemisphere in 2017/18. All of this is fairly logical and, without the benefit of hindsight, it is hard to criticise anything done in getting to this point.
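    The template above is regular enough to parse mechanically, which is part of the appeal of a consistent format. A minimal sketch in Python (the regex and field names are my own invention, not any database's official schema, and it only handles full names of the modern form):

```python
import re

# Template: Type/[Species/]Location/ID/Year (Serotype).
# The Species segment is omitted by convention when the host is human.
STRAIN_RE = re.compile(
    r"^(?P<type>[ABCD])"          # Type: A, B, C or D
    r"(?:/(?P<species>[^/]+))?"   # optional host species
    r"/(?P<location>[^/]+)"       # collection location
    r"/(?P<ident>[^/]+)"          # lab-assigned unique identifier
    r"/(?P<year>\d{2,4})"         # two- or four-digit year
    r"\s*\((?P<serotype>H\d+N\d+)\)$"
)

def parse_strain(name: str) -> dict:
    """Split a strain name into its template fields."""
    m = STRAIN_RE.match(name.strip())
    if not m:
        raise ValueError(f"unparseable strain name: {name!r}")
    return m.groupdict()
```

    For example, parse_strain("A/Hong Kong/4801/2014 (H3N2)") leaves the species field empty (a human isolate), while parse_strain("A/duck/Alberta/35/76 (H1N1)") fills all five fields; the field inconsistencies discussed below are exactly what makes even this simple split unreliable in practice.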

    So where is the problem? Things have moved on significantly, both in computing power and in lab analysis, giving us rapid and relatively low-cost sequencing and the ability to run much more sophisticated queries against a vast and rapidly growing sequence database - which brings me back to GIGO. The data is dirty (a.k.a. garbage) in that it lacks consistency of format, a prerequisite for getting useful data from queries. Look again at the template, 'Type/Species/Location/ID/Year (Serotype)'. The observant will have noticed I slipped in a new field, Species, which is omitted by convention if the species is Homo sapiens.

    The first field, 'Type', is not a problem as it is a straight picklist selection.
    'Species' and 'Collection Location' are becoming a problem because there is no clear definition of what should be used as the unit. Often the submitting lab will use the country; US labs often use a state's initials (NY, MI) or name; others a city or administrative region. In the Species field you can get everything from 'wild bird' to the species' Latin name, with classifications like 'wader', 'duck', 'mallard', 'gull', 'galliform', 'poultry', 'chicken', 'dog' and 'environmental' all common field entries.
    The 'Unique Identifier' is not a problem as long as you know it is only unique as part of the full name, taking the Type, Date and Location with it. It is usually just a sequential number, sometimes with the submitting lab's initials.
    The 'Year' data is clean-ish, in that it is sometimes two digits and sometimes four (17, 2017). That is easily trapped for in code, but it should be standardised on 2017, as it will become a problem next year when the 1918 sequences become 100 years old. Also, in this day and age, it should be narrowed down to at least the day of collection.
    There are also now several other fields in current databases, including that database's unique sequence code, the submitting lab and date of submission, the nucleic acid segment being sequenced (H, N, M, NP etc.) and the sequence of bases that comprise it, plus the resulting amino acid sequence post-translation. None of these are problems; however, the metadata field is. It is often left blank and, when completed, follows no structure that makes it easy to query. Why all this matters will be dealt with in the next section.
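    The two-digit year problem is easy to see in code. Below is a sketch of the kind of trap a query tool has to implement; the pivot rule is my own arbitrary assumption, not any database's documented behaviour:

```python
def normalize_year(raw: str, pivot: int = 25) -> int:
    """Expand a two- or four-digit collection year to four digits.

    Two-digit years become ambiguous once a strain is a century old:
    '18' could mean 1918 or 2018. The pivot rule here (an arbitrary
    assumption) maps 00..pivot to the 2000s, everything else to the
    1900s - real curation would need the submission date or other
    metadata to resolve the ambiguity properly.
    """
    year = int(raw)
    if year >= 1000:           # already four digits
        return year
    return 2000 + year if year <= pivot else 1900 + year
```

    Note that normalize_year("17") returns 2017 and normalize_year("33") returns 1933, but normalize_year("18") returns 2018, silently mislabelling a genuine 1918 sequence; that ambiguity is exactly why the field should be standardised on four digits.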

    Why change is needed. Now that we have a sizeable database we are back to GIGO: what questions do we want to answer, and how can we construct a query to extract that data? And, moving on from that, how can we improve the structure to allow more, and more useful, information to be retrieved from the raw data?

    The current state of play is that you access a form similar to this one to construct your query (attached image: flu database query.JPG).

    But you don't always get back all the relevant sequences if the exact wording of Host or Country does not match the way it was input. The code tries to trap for the most common errors, but in practice I have often had to try slightly different parameters, and some forms have you type free text into the species field, which can produce some very bizarre results. Dates can also give odd results, as submission dates are sometimes used as sample dates but may be years later.

    What changes would I like to see made, and why?
    Starting with the changes.
    The 'Species' field needs clear guidelines agreed by the databases, sequencing labs and sample collectors. In practice someone netting waders may not be able to identify everything they catch, but taking a mobile phone picture of anything needing clarification should not present a problem, so at this point it is just sloppiness. Environmental samples could have a metadata field for supplementary information, e.g. 'duck dropping from Beijing wet market'.
    Location and Date badly need dragging into the 21st century. If you can afford to sequence data and put a team into the field to collect it, then you should be able to provide them with a simple GPS/time-stamp unit so you know exactly where and when they were working. This is very valuable data as it allows effective GIS (Geographic Information System) integration. GIS allows cool stuff like animating your map to plot geographical spread over time, so you can watch each new H7N9 sequence pop up as time advances while it also appears on a growing phylogenetic tree. There are many other epidemiological uses of GIS, but the lack of granularity in having just 'China' or 'USA' as your location negates most of them. Likewise for having one year as the limit of temporal granularity.
    The metadata holds potentially very useful information, but its free-text structure means it has to be searched by keyword. Improvements in AI may help in time, but some kind of advice on format and content to the submitting labs could make life a lot easier now, and for any future AI helpers. In human cases of flu, data such as age, sex, clinical outcome, preconditions (immunocompromised or pregnant) and probable infection source are the kinds of things sometimes found. If you are trying to correlate viral genetics with clinical severity, you might set up a query to select for China/2014 to 2017/H7N9/Human, then divide the output into mild, severe and fatal cases based on the clinical metadata, and then compare the amino acid sequences across the three data sets for statistically significant SNPs.
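    The severity comparison just described can be sketched in a few lines. Everything below is invented for illustration (the record layout and toy sequences are mine); a real analysis would run a proper per-position significance test, e.g. Fisher's exact with multiple-testing correction, rather than this naive modal-residue screen:

```python
from collections import Counter

def modal_residue_differences(records):
    """Flag aligned positions where the most common amino acid differs
    between clinical-outcome groups. `records` is a list of
    (outcome, aligned_aa_sequence) pairs; all sequences must share the
    same length. This is a naive screen, not a significance test."""
    groups = {}
    for outcome, seq in records:
        groups.setdefault(outcome, []).append(seq)
    length = len(records[0][1])
    flagged = []
    for pos in range(length):
        modal = {}
        for outcome, seqs in groups.items():
            residue, count = Counter(s[pos] for s in seqs).most_common(1)[0]
            modal[outcome] = (residue, count / len(seqs))
        # flag the position if the groups disagree on the commonest residue
        if len({residue for residue, _ in modal.values()}) > 1:
            flagged.append((pos, modal))
    return flagged

# Invented toy data standing in for H7N9 records divided by outcome.
toy = [
    ("mild",   "MKTIL"),
    ("mild",   "MKTIL"),
    ("severe", "MKTIL"),
    ("fatal",  "MRTIL"),
    ("fatal",  "MRTIL"),
]
hits = modal_residue_differences(toy)
# Only position 1 (K in mild/severe vs R in fatal) is flagged.
```

    The point of the sketch is the pipeline shape - query, divide on clinical metadata, compare sequences per position - rather than the statistics, which would need to be done properly.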

    All of this is just my personal opinion, offered as a basis for discussion. There are many other changes I would like; for instance, many of the sequences are partial. They may have been collected by someone doing research on antigenic sites, in which case only H and N may have been needed, but if funding institutions made it best practice, the RNA for the internal proteins could have been included with little additional cost or effort, and might be of great use to someone looking at correlations between viral RNP SNPs as follow-up research to the China/H7N9 example I gave in the last paragraph.
    As the subject matter of this post is fairly esoteric I have used quite a lot of jargon, and it has, of course, been oversimplified in the interest of brevity; it also does not cover weaknesses in the selection of what to sequence, which has made the data of very limited use for some applications. I have written explanatory posts - now long buried somewhere in the threads of this site - on many peripheral topics, and explanations of the genetics and their significance, which I would be happy to try to find and add links to if anyone needs background on the jargon used or the difficulties arising from the current system.
    Last edited by JJackson; July 9, 2017, 04:46 PM.

  • #2
    The first post in this thread was a call for a cultural, and structural, change in the way we collect and store flu sequence data. It was focused principally on the structural aspect, as a first step; this post looks at radically changing the culture of sequence analysis, and at a reversal of priorities.

    What makes it into the database was initially whatever sequences were generated in the course of someone's research. This rather random aggregation did not produce the kind of data needed to answer questions like 'what are the predominant seasonal flu strains?', which we need to know if we are going to pick the best vaccine candidates for next season. To counter this, a number of sentinel sample collection systems have been set up to generate regular 'polling data' from a representative group of GPs and hospitals with a wide geographical spread. Now we have enough data of the right type to watch the progress of competing serotypes and strains and to tentatively predict next year's flu strains. More accurately, we are trying to assign relative probabilities to the strains we know of and have data for. Problem solved - but only for human seasonal flu, which may be important to us, but is a largely irrelevant dataset as far as flu genetics is concerned.

    The real action is all taking place in the Anseriformes (waterfowl), with some interaction with other bird groups and, occasionally, mammals. [1] The vast majority of the sequence data collected to date came from humans or commercially important animals (pigs, poultry and a few ducks). If you remove all of these and select only wild birds there isn't much left, and what there is was not systematically collected with a view to answering useful questions.

    Which brings us to the crux of the problem: why have the database?
    If it is merely somewhere to collect the left-over sequence data generated as a by-product of research, it is never going to be very useful in its own right. If, however, a sentinel system similar to the one collecting human seasonal flu data is set up, then - in time - we will be able to answer a lot of questions. Such a system would need regular sampling globally, with an emphasis on sites in SE Asia and along the principal migratory flyways. This is a fundamental and radical change: the priority becomes feeding the database first, and then generating research on the existing data, not the other way around.

    Why would we want to do this?
    The short answer is for the future – more on which later.

    It is an enormous undertaking that would require netting, sampling and sequencing monthly at 50+ sites. If you net 20 birds per trip then you have tested 12,000 birds in a year, and if 10% have flu then the database grows by a little over 1,000 full genome sequences per annum. How does that compare to 2016? Using the form and link provided in the first post I asked that question [2], and the answer was 8 nettings providing 11 infected birds and their sequence data. Of the 11, three were not waterfowl, so a total of 8 relevant genomes globally for the year. This is not enough data to gain any kind of useful picture of the primary reservoir of influenza genetics (the GISAID flu database will have some more unique sequences, but not enough to negate the point).
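    The arithmetic above, made explicit (all parameter values are the post's own assumptions, not measured figures):

```python
# Back-of-envelope yield of the proposed sentinel sampling scheme.
sites = 50             # netting sites worldwide (50+)
trips_per_year = 12    # monthly sampling at each site
birds_per_trip = 20    # birds netted and swabbed per visit
flu_prevalence = 0.10  # assumed fraction of sampled birds carrying flu

birds_sampled = sites * trips_per_year * birds_per_trip
genomes_per_year = round(birds_sampled * flu_prevalence)
# 12,000 birds sampled, yielding roughly 1,200 full genomes a year -
# against the 8 relevant wild-waterfowl genomes actually found for 2016.
```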

    Even a cursory glance through the Scientific Library section on this site (we are eternally indebted to Tetano, and others, for their tireless work in building this valuable resource) will show just how often researchers are trying to tie mutational changes to functional changes. This is usually based on hints from previous research, but very rarely on the existing data set; there just isn't enough of it to draw any useful conclusions.

    So how could a more comprehensive data set be used – if we had one?
    This is an extract from Wikipedia’s Big Data page [3] (I have ‘snipped’ out a few bits and added some highlights).

    Decoding the human genome originally took 10 years to process, now it can be achieved in less than a day. The DNA sequencers have divided the sequencing cost by 10,000 in the last ten years (snip)

    Google's DNAStack compiles and organizes DNA samples of genetic data from around the world to identify diseases and other medical defects. These fast and exact calculations eliminate any 'friction points,' or human errors that could be made by one of the numerous science and biology experts working with the DNA. DNAStack, a part of Google Genomics, allows scientists to use the vast sample of resources from Google's search server to scale social experiments that would usually take years, instantly.[110][111]

    23andme's DNA database contains genetic information of over 1,000,000 people worldwide.[112] (snip). Ahmad Hariri, professor of psychology and neuroscience at Duke University who has been using 23andMe in his research since 2009 states that the most important aspect of the company's new service is that it makes genetic research accessible and relatively cheap for scientists.[113] A study that identified 15 genome sites linked to depression in 23andMe's database lead to a surge in demands to access the repository with 23andMe fielding nearly 20 requests to access the depression data in the two weeks after publication of the paper.[118]
    We are some way from qualifying as 'Big Data', but it is the use of AI and the analytic tools already developed for this field that I view as the ultimate goal. Advances in computing power and data analysis proceed at great pace, and the priority now should be to start building a database on which they can be used.

    Over the last decade the form in the first post has not changed much, but the data linked to the search results is far more comprehensive. Where data is available you can see, for each amino acid, its frequency relative to others at that position, any known ramifications, which publications discuss it (with links to relevant PubMed articles), and links out to a rotatable 3D protein model and other relevant databases.
    Big Data AI analytics should evolve to the point where it can start to find correlations between form and effect, and between biological fitness and form, identify fitness-constrained domains and - further down the line - link these to tertiary protein structural changes at points of viral, drug and immune-system importance, suggesting pathways for drug design. All of which we need to begin preparing for by building a database with not just sequences, but enough sequences of the right type.

    As I am interested in zoonotic emergence I have used data collection relevant to related research in this post, but the principle is general. The thing Big Data analytics is good at is pattern recognition and finding correlations within a data dump, traits making it ideally suited to nucleotide analysis.

    Footnotes, Supplementary data and links.

    While an explanation of the machine learning/iterative neural nets/clustering/feature selection/Big Data landscape is beyond the scope of this post (and probably my ability to cogently explain), a basic understanding would be very helpful in evaluating it. If you are not conversant with this area, a search on any of the terms you are unfamiliar with would be a good start. These videos may also help.

    The first is a Vlab discussion forum on Deep Learning held at Stanford in 2014 (you need to watch the dates on such things because the field changes so fast); it is 1hr 24min in total, but includes a basic 15-minute primer starting at around minute 7.

    The second link is to 'The Joy of Stats', a 1hr documentary made by Hans Rosling for the BBC in 2010.
    I am happy to recommend all of it but the most relevant section begins around 37 minutes in and then moves on to what I described as "further down the line" at 52min.

    [1] In the ‘H7N9 discussion thread’ I developed a concept I called the ‘Zoonotic edge’, and the overlap between the gene constellations found in the primary avian group and in the semi-independent secondary groups. The Venn diagram below (taken from one of those posts) is my attempt at graphically depicting this relationship; a fuller explanation of how this fits into the bigger picture is given in the full text.

    In post #29 the 2nd & 3rd paragraphs give another example of how little H7N9 data there was and why the lack of a denominator further degrades its value.

    Post #47 Discusses where to swab, when taking samples, and more on wild bird data collection.

    Post #52 Answers a question following from #47 and then begins to lead into the arguments being made in this thread.

    Post #72 Is a long response to a post by NS1 (still on the subject of a lack of systematic data collection) which looks at why it is important and introduces the Avian groups and Zoonotic edge concept. This post initiated a fair bit of discussion and posts #79 & #94 are clarifications and continuations of the arguments in post #72 - the three, taken together, form a lot of the background to this thread.

    Posts #29 to #94 all occurred over 4 days at the end of Jan. 2014, and I have only linked to those of my posts which are relevant to this thread.


    [2] The search parameters sampled Type A flu (all sub-types), full genome, Avian and 2016 (collection & submission dates). This returned 34 hits, 23 of which were poultry, leaving 11. Their database had about half a million Type A genomes of which about 60% were complete.

    Last edited by JJackson; July 9, 2017, 05:05 PM.


    • #3

      [Translated from French] The data come from human, farmed-animal and wild-animal samples. What is missing, in my view, is regular sampling of commensal animals, potentially healthy carriers, and not in farm settings but in urban ones.
      If for some the goal is to prevent the epidemic, and therefore to find the primary case, it could be very instructive to know what the commensals of humans in the great ports and large cities are carrying. I say this because, if the commensals carry strains which by reassortment could produce a problematic virus, we would have made great progress by having identified the danger.

      I had suggested letting the FAO specialists continue what has been started on wildlife but, in parallel, it would be good if ports above a certain size - say the 20 largest in the world within the orange zone of the map below - and cities of more than 5 million inhabitants (as an example) carried out a series of samplings every year on the commensal animals present almost everywhere: the mallard, and a rodent, the mouse or the rat. I know why this has never been initiated: the so-called qualification of the country under the O.I.E. code.


      • #4
        It is interesting that we are having this conversation on this thread, because my French is not good enough to get much more than the gist of your posts. To understand them I use Google Translate, with the original and the machine translation side by side, and let the app read the English aloud while I read your French. Where the machine translation falters I can often guess the correct translation from the French, thanks to our shared Latin roots; it is also probably a very good way to improve my French.
        The technology to do all this is ultimately Big Data neural-net-based machine learning. How this was achieved - and where it is all going - was explained by Google in the links provided in the previous post. The link is to the late Hans Rosling's 'Joy of Stats', and the Google Translate section starts at minute 40.
        The one bit I am not clear on in your post is the end. Why would the OIE code (I am assuming you mean the Terrestrial Animal Health Code) block sampling? I thought it covered cross-border transport.
        I am also a believer in the wisdom of the OIE 'One Health' doctrine, which I hope you will already have guessed.


        • #5

          [Translated from French] On statistics, President Pompidou used to say: it is like a bikini - if it lets you glimpse certain things, it shows nothing of the essential. My remarks are founded on the wish finally to make it possible to deploy the HACCP method.

          The English word "hazard" has been translated in many ways, including as "risk", notably by organisations which cannot have been unaware of the reasons for these intentional mistranslations...

          The Silicon Valley companies would be well advised to concern themselves more both with making real dictionaries available (the dictionaries of former times showed the method) and with the evolution of translation systems. This thread has also given me a glimpse of what multiple translations can produce.

          As for the O.I.E. code, while it offers a glimpse of what wisdom can be, it also shows its limits in the event of war, notably economic war...

          As for current management, we had the surprise this winter of the disease-free status of Luxembourg and Belgium, but now, with the official introduction of the concept of summer animal 'flu' at the very moment when the risk was officially assessed as low or nil, many actors are destabilised.

          This seems to me a good reflection of the limits of the current strategy of relying on the wisdom of the Code.
          To give a better view of the situation, I offer you these two links, to be savoured together:

          "Of 41 inspections ordered by the DDCSPP since November 2016, '50% revealed major anomalies'. How can this high rate be explained?"

          "Nearly 7,000 decoy birds ('appelants') to be seized at the Salon des migrateurs in Cayeux-sur-Mer"

          I very much enjoyed the humour of this headline, and perhaps of its author; it speaks of 'seizing' these animals. The French language allows many things, when one takes the time to learn its usages... (the verb 'saisir' can mean both to seize and to grasp).


          • #6
            Your Pompidou quote put me in mind of a quote by another Frenchman:

            'You would make a ship sail against the wind and currents by lighting a bonfire under her decks? I pray you, excuse me. I have no time to listen to such nonsense.'
            (attributed to Napoleon Bonaparte on being asked to fund development of a steamship, but see notes [1]).

            I think both men were wrong because neither had enough knowledge to see the technology's potential.

            We will have to 'agree to differ' on the subject of machine translation. As the man from Google explains in the video (link in post #2, starting at minute 40), traditional direct translation methods work extremely well between two languages but break down when used for many-to-many language translation. The machine answer is to translate the concept implied by the word or phrase into an internal language of the machine's own invention, and then translate this into the language you choose. It will get better with time as it gains experience, particularly if we take the time to point out its more obvious errors.

            Documents where the exact wording is crucial will still need manual error checking. There is little we can do about the deliberate misuse of words - exemplified by newspaper headlines like 'Cancer Hazard from ...' when the text shows only a small statistical increase in risk - beyond drawing attention to it. [2]

            You may find this of interest. It is a comment of David Sencer's, posted in a discussion (at the now, sadly, closed Effectmeasure blog) on the practical workings of the IHR. In this case two countries exceeded their powers under the IHR by closing their borders to a third. The reason given was fear of cholera, and the economic target was lemons (which were rotting in lorries at the border). David was one of the judges in the arbitration. [3]

            The '50% anomalies' article put me in mind of the beginning of another quote, 'Politics is the slow boring of hard boards' (Max Weber), though it works just as well with 'progress' instead of 'politics'.

            Game, poultry and livestock fairs are a real danger for disease spread, and have been 'ground zero' in the epidemiological investigation of a number of zoonotic outbreaks reported in these threads.

            Notes & Links.
            [1] I knew the quote I posted, but I checked its provenance prior to posting and he probably never said it. In 1803 Robert Fulton had a working prototype and, through an intermediary, went to Bonaparte for further funding, provoking this reply:

            "Il y a dans toutes les capitales de l'Europe, une foule d'aventuriers et d'hommes à projets qui courent le monde, offrant à tous les souverains de prétendues découvertes qui n'existent que dans leur imagination. Ce sont autant de charlatans ou d'imposteurs, qui n'ont d'autre but que d'attraper de l'argent. Cet Américain est du nombre. Ne m'en parlez pas davantage."

            (Roughly: "There are, in every capital of Europe, a crowd of adventurers and schemers who roam the world, offering every sovereign supposed discoveries that exist only in their imagination. They are so many charlatans or impostors whose only aim is to get hold of money. This American is one of their number. Speak to me of him no more.")

            Which I assume became

            Quoi! Monsieur, vous feriez voguer un navire contre vents et marées en allumant un feu sous ses ponts? Je vous prie de m'excuser, je n'ai guère le temps d'écouter pareille absurdité!

            [2] If exact meaning is important in your work you may be interested in the writings of Jody Lanard & Peter Sandman, if you do not already know them.
            They are a husband-and-wife team working in risk communication (she is more on the health side, while he is more commercial). They have a blog ( ) and an interest in zoonosis. Jody is a member here, though I don't think she posts; she used to comment at Effectmeasure. This CDC link is an example of their work on crisis communication.

            [3] The Reveres, who jointly wrote the Effectmeasure blog, wrote a series of posts when the IHR (2005) was released. I wrote a response (at the link). At the bottom of the first post are links to the original blog post and David Sencer's anecdote.

            On the role of WHO, IHR (2005) & Reveres' posts -

            I copied the Sencer extracts to the Experts Forum here at the time, and you can access them directly there.
            Last edited by JJackson; September 21, 2017, 07:41 AM.


            • #7

              [Translated from French] While I know the capacity of the technology I use - some would say abuse - let us be concrete:

              through this link we have* the sum of the definitions from 1606 to 1932


              since then, how many more have appeared?
              Some are in local, national and/or international standards, others in laws, etc. New ones are published far too often, but without repealing the old ones, hence:

              the machine is like the databases: the quality of the result depends, in particular, on what it has been fed.

              As for crisis management, it is always a pleasure when a subject with strong cultural weight is raised; the good example, it seems to me, is in England:


              * what is missing is this: