Influenza databases need root and branch reform.
GIGO (Garbage In, Garbage Out) is a computing expression, and a warning, about database design. In designing any database you need to consider, at the design stage, what you want to be able to get out of it, and make sure your data entry and table structure are suitable for the queries you will use to output the reports you need. Nice theory - but in this case the database was not so much designed as evolved, and it is consequently now showing its age and structural weaknesses.
A little history on how we got to this point may be useful. In the early 1930s, long before anyone was considering databases, a team at the MRC (UK Medical Research Council) was working on canine distemper and shifted its research from puppies to ferrets (quicker to breed and fewer problems from the anti-vivisectionists). When this resulted in a successful vaccine they turned their attention to influenza. One of the researchers, Wilson Smith, happened to catch flu, and his swabs were used to infect the Mill Hill ferrets, which turned out to be an excellent animal model for flu. Wilson Smith's flu then became the lab's standard research strain and was designated WS/33, and a naming format was born. Early strains of flu were classified using ferret test antisera to see if they reacted with known strains, which allowed antigenically novel H and N proteins to be identified and named by placing the next unused number after the letter, giving us the H1N1 designation for WS/33 and H2 and N2 for the first strains whose H and/or N did not react with WS/33 antisera. As more Hs and Ns were discovered new numbers were assigned, and with the advent of sequencing the strains within each serotype could be distinguished. By the end of all this WS/33 had evolved into A/WS/1933 (H1N1); however, as well as date and serotype, location was recognised as an important parameter, and the naming standard now follows the template 'Type (A, B, C or D)/Collection Location/Unique Identifier/Year (Serotype)'. A/Hong Kong/4801/2014 (H3N2) is typical of the format, and in fact has been chosen as a recommended vaccine strain for the Northern Hemisphere in 2017/18. All of this is fairly logical and, without the benefit of hindsight, it is hard to criticise anything done in getting to this point.
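The naming template described above is regular enough to parse mechanically. Here is a minimal sketch in Python; the field names and the regular expression are my own illustration, not part of any published standard, and real databases contain plenty of names this pattern would miss.

```python
import re

# One possible pattern for: Type/[Species/]Location/ID/Year (Serotype),
# where the Species field is omitted by convention for human isolates.
STRAIN_RE = re.compile(
    r"^(?P<type>[ABCD])/"           # virus type
    r"(?:(?P<species>[^/]+)/)??"    # optional host species (lazy, so it is
                                    # only consumed when the name needs it)
    r"(?P<location>[^/]+)/"         # collection location
    r"(?P<ident>[^/]+)/"            # lab-unique identifier
    r"(?P<year>\d{2,4})"            # two- or four-digit year
    r"\s*\((?P<serotype>[^)]+)\)$"  # serotype, e.g. H3N2
)

def parse_strain(name):
    """Return the strain's fields as a dict, or None if it doesn't parse."""
    m = STRAIN_RE.match(name)
    return m.groupdict() if m else None

print(parse_strain("A/Hong Kong/4801/2014 (H3N2)"))
print(parse_strain("A/duck/Guangdong/1/2014 (H5N6)"))
```

The lazy optional group relies on backtracking: the human-style name binds Hong Kong as the location, while the avian-style name only matches once "duck" is consumed as the species.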
So where is the problem? Things have moved on significantly, both in terms of computing power and lab analysis, giving us rapid, relatively low-cost sequencing and the ability to apply much more sophisticated queries to a vast and rapidly growing sequence database - which brings me back to GIGO. The data is dirty (AKA garbage) in that it lacks consistency of format, a prerequisite for getting useful data from queries. Look again at the template: 'Type/Species/Location/ID/Year (Serotype)'. The observant will have noticed I slipped in a new field, Species, which is omitted by convention if the species is Homo sapiens.
The first field, 'Type', is not a problem as it is a straight picklist selection.
'Species' and 'Collection Location' are becoming a problem because there is no clear definition of what should be used as the unit. Often the submitting lab will use the country; US labs often use a state's initials (NY, MI) or name; others use a city or administrative region. In the Species field you can get everything from 'wild bird' to the species' Latin name, with classifications like 'wader', 'duck', 'mallard', 'gull', 'galliform', 'poultry', 'chicken', 'dog' and 'environmental' all common field entries.
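To see why this matters for querying, here is a sketch of the kind of synonym table a curator ends up maintaining. The mapping below is entirely my own invention for illustration, not an agreed vocabulary; the real fix is guidelines at submission time, not clean-up afterwards.

```python
# Illustrative synonym table: many labels, one canonical host. In a real
# database this would be agreed between the databases and submitting labs.
SPECIES_SYNONYMS = {
    "mallard": "Anas platyrhynchos",
    "duck": "Anatidae (unspecified)",
    "wild bird": "Aves (unspecified)",
    "chicken": "Gallus gallus domesticus",
    "poultry": "Galliformes (domestic, unspecified)",
}

def normalise_species(raw):
    """Map a free-text host entry to a canonical form where known."""
    key = raw.strip().lower()
    return SPECIES_SYNONYMS.get(key, raw)  # unknown entries pass through

print(normalise_species("Mallard"))  # Anas platyrhynchos
print(normalise_species("Duck"))     # Anatidae (unspecified)
```

Note the information loss built into the data itself: 'duck' can only ever normalise to "some anatid", whereas a mallard entry (or a photo taken at collection) preserves the species.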
The 'Unique Identifier' is not a problem as long as you know it is only unique as part of the full name, taking the Type, Date and Location with it. It is usually just a sequential number, sometimes with the submitting lab's initials.
The 'Year' data is clean-ish, in that it is sometimes two digits and sometimes four (17, 2017), which is easily trapped for in code, but it should be standardised on four digits (2017), as the two-digit form will become a problem next year when the 1918 sequences become 100 years old. Also, in this day and age it should be narrowed down to the day of collection, as a minimum.
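The trap code in question is a one-liner, and it also makes the looming 1918/2018 problem concrete. The pivot value below is my own illustrative choice, which is exactly the point: once both centuries are in the data, no pivot is safe.

```python
def normalise_year(y):
    """Expand a two-digit year to four digits.

    Once both 1918 and 2018 sequences exist, '18' is ambiguous: no pivot
    value is safe, which is why the field should be stored as four digits.
    """
    y = str(y).strip()
    if len(y) == 4:
        return int(y)
    # A common (and eventually wrong) heuristic: small values mean 20xx.
    return 2000 + int(y) if int(y) <= 30 else 1900 + int(y)

print(normalise_year("2017"))  # 2017
print(normalise_year("17"))    # 2017
print(normalise_year("18"))    # 2018 - but it could equally mean 1918
```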
Current databases now also hold several other fields, including that database's unique sequence code, the submitting lab and date of submission, the gene segment being sequenced (H, N, M, NP etc.) and the sequence of bases that comprise it, plus the resultant amino acid sequence post-translation. None of these are problems; however, the metadata field is. It is often left blank, and when completed it follows no structure that makes it easy to query. Why all this matters will be dealt with in the next section.
Why change is needed. Now we have a sizeable database, we are back to GIGO: what questions do we want to answer, and how can we construct a query to extract that data? Moving on from that, how can we improve the structure to allow more, and more useful, information to be retrieved from the raw data?
The current state of play is that you access a form similar to this one to construct your query:
(Link to https://www.fludb.org/brc/influenza_...ator=influenza)
But you don't always get back all the relevant sequences if the exact wording of Host or Country does not match the way it was input. The code tries to trap for the most common errors, but in practice I have often had to try slightly different parameters, and some forms get you to free-type into the species field, which can produce some very bizarre results. Dates can also give odd results, as submission dates are sometimes used as sample dates but may be years later.
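"Trying slightly different parameters" can be partly automated with fuzzy matching against the values actually present in the field. This is the kind of workaround users end up writing, sketched here with Python's standard-library difflib; the host list is invented for illustration.

```python
import difflib

# A stand-in for the distinct Host values actually present in a database.
hosts_in_db = ["Human", "Homo sapiens", "Mallard", "Avian"]

def matching_hosts(query, cutoff=0.6):
    """Return stored Host values that approximately match the query."""
    lowered = {h.lower(): h for h in hosts_in_db}
    hits = difflib.get_close_matches(query.lower(), list(lowered),
                                     n=5, cutoff=cutoff)
    return [lowered[h] for h in hits]

print(matching_hosts("humans"))  # catches "Human" despite the extra 's'
```

This papers over spelling variation, but it cannot reconcile 'Human' with 'Homo sapiens'; only a controlled vocabulary at submission time fixes that.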
What changes would I like to see made, and why?
Starting with the changes.
The 'Species' field needs clear guidelines agreed by the databases, sequencing labs and sample collectors. In practice someone netting waders may not be able to identify everything they catch, but taking a mobile phone picture of anything needing clarification should not present a problem, so this is just sloppiness at this point. Environmental samples could have a metadata field for supplementary info, e.g. 'duck dropping from Beijing wet market'.
Location and Date badly need dragging into the 21st century. If you can afford to sequence samples and put a team into the field to collect them, then you should be able to provide that team with a simple GPS/time-stamper so you know exactly where and when they were working. This is very valuable data, as it allows effective GIS (Geographical Information System) integration. GIS allows cool stuff like animating your map, plotting the geographical spread over time so you can see each new H7N9 sequence pop up as time advances while also having it appear on a growing phylogenetic tree. There are many other epidemiological uses of GIS, but the lack of granularity in having just 'China' or 'USA' as your location negates most of them. Likewise for having one year as the limit of temporal granularity.
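What a GPS/time-stamped collection record could look like, with enough granularity for GIS animation, is easy to sketch. The field names are my own suggestion, and the coordinates and date below are invented for illustration:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class CollectionEvent:
    """One sample collection, pinned in space and time for GIS use."""
    strain: str
    lat: float            # decimal degrees, WGS84
    lon: float            # decimal degrees, WGS84
    collected: datetime   # exact sampling time, UTC

# Invented example record - coordinates for Hong Kong, arbitrary date.
event = CollectionEvent(
    strain="A/Hong Kong/4801/2014 (H3N2)",
    lat=22.3193,
    lon=114.1694,
    collected=datetime(2014, 2, 26, 9, 30, tzinfo=timezone.utc),
)

print(event.collected.isoformat())  # 2014-02-26T09:30:00+00:00
```

A lat/lon pair plus an ISO 8601 timestamp is all a GIS package needs to animate spread over time; 'China, 2014' supports neither.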
The metadata holds potentially very useful information, but its free-text structure means it has to be searched by keyword. Improvements in AI may help in time, but some kind of advice on format and content to the submitting labs could make life a lot easier now, and for any future AI helpers. In human cases of flu, data such as age, sex, clinical outcome, preconditions (immunocompromised or pregnant) and probable infection source are the kinds of things sometimes found. If you are trying to correlate viral genetics with clinical severity, you might set up a query to select for China/2014 to 2017/H7N9/Human, divide the output into mild, severe and fatal based on the clinical metadata, and then compare the amino acid sequences across the three data sets for statistically significant SNPs.
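If the clinical metadata were structured rather than free text, the severity split described above would reduce to a simple filter and group-by. Every record below is invented for illustration, as is the field layout:

```python
# Invented records with structured (rather than free-text) metadata.
records = [
    {"strain": "A/Shanghai/0001/2015 (H7N9)", "host": "Human",
     "meta": {"age": 67, "outcome": "fatal"}},
    {"strain": "A/Jiangsu/0002/2016 (H7N9)", "host": "Human",
     "meta": {"age": 34, "outcome": "mild"}},
    {"strain": "A/Zhejiang/0003/2016 (H7N9)", "host": "Human",
     "meta": {"age": 58, "outcome": "severe"}},
]

# Group strains by clinical outcome - the three data sets whose amino
# acid sequences you would then compare for significant SNPs.
by_outcome = {}
for r in records:
    if r["host"] == "Human":
        by_outcome.setdefault(r["meta"]["outcome"], []).append(r["strain"])

print(sorted(by_outcome))  # ['fatal', 'mild', 'severe']
```

With today's free-text metadata, the grouping step instead means keyword-searching each record by hand, which is exactly the cost the proposed guidelines would remove.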
All of this is just my personal opinion and is offered as a basis for discussion. There are many other changes I would like; for instance, many of the sequences are partial. They may have been collected by someone doing research on antigenic sites, in which case only H and N may have been needed, but if funding institutions instituted it as best practice, the RNA for the internal proteins could have been included with little additional cost or effort, and it might be of great use to someone looking at correlations between the viral RNP SNPs as follow-up research to the China/H7N9 example I gave in the last paragraph.
As the subject matter of this post is fairly esoteric I have used quite a lot of jargon, and it has, of course, been oversimplified in the interest of brevity; it also does not cover weaknesses in the selection of what to sequence, which have made the data of very limited use for some applications. I have written explanatory posts - now long buried somewhere in the threads of this site - on many peripheral topics, including explanations of the genetics and their significance, which I would be happy to try to find and add links to if anyone interested needs background on the jargon used or the difficulties arising from the extant system.