
Archive for the ‘data mining’ Category

From within the strange loop of self-reference the question "What is Data?" emerges. OK, more practically, the question arises from our technologically advancing world, where data is everywhere, spouting from everything. We claim to have a "data science," now operate on "big data," and have evolving laws about data collection and data use. That is quite an intellectual infrastructure for something that lacks identity or even a remotely robust and reliable definition. Should we entrust our understanding and experience of the world to this infrastructure? The question seems stupid and ignorant. Yet we have taken up a confused approach in all aspects of our lives by putting data ontologically on the same level as real, physical, actual stuff. So now the question must be asked, must be answered, and its implications drawn out.

Data is and Data is not. Data is not data. Data is not the thing the data represents or is attached to. Data is but an ephemeral puff of exhaust from a limitless, unknowable universe of things and their relations. Let us explore.

Observe a few definitions and usage patterns:

Data According to Google

https://www.google.com/webhp?sourceid=chrome-instant&rlz=1CAZZAD_enUS639US640&ion=1&espv=2&ie=UTF-8#q=data+definition

The Latin roots point to the looming mystery: "give" -> "something given." Even back in history, data was just "something." Almost an anti-definition.

Perhaps we can find clues from clues:

Crossword Puzzle Clues for "Data"

http://www.wolframalpha.com/input/?i=data&a=*C.data-_*Word-

Has there ever been a crossword answer with broader meaning or more ambiguity than that? "Food for thought?" seems to hit the nail on the head. The clues boil down to: data is numbers, holdings, information, facts, figures, fodder, food, grist, bits. Sometimes crunched and processed, sometimes raw. Food for thoughts, disks, banks, charts and computers.


YouTube can usually tell us anything; here's a video directly answering What Is Data:

The video starts strong, with Qualitative and Quantitative… and then by the end it unwinds the definitions to include basically everything.

Maybe a technical lesson on data types will help elucidate the situation:

Data Types

Perhaps sticking to computers as a frame of reference helps us. Data is stuff stored in a database, specified by data types. What exactly is stored? Bits on a magnetic or electric device (hard drive or memory chip) are arranged according to a structure defined by this "data," which is defined or created or detected by sensors and programs… So is the data the bit? the electric symbol? the magnetic structures on the disk? a pure idea regardless of physical substrate?
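To make this concrete, here is a minimal Python sketch (my own illustration, not from any of the sources above) of how the very same four bytes become different "data" depending on the type we read them as:

    import struct

    raw = b"\x42\xf6\xe9\x79"  # four arbitrary bytes "on disk"

    as_int = struct.unpack(">i", raw)[0]    # read as a 32-bit integer
    as_float = struct.unpack(">f", raw)[0]  # read as a 32-bit float
    as_text = raw.decode("latin-1")         # read as four characters

    # Same bits, three different "data"; the bits themselves never changed.
    print(as_int, as_float, repr(as_text))

The bits sit there unchanged either way; whether they are an integer, a temperature, or half a word is supplied entirely by the reader.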

The confusing self-referential nature of the situation is wonderfully exploited by Tupper’s formula:

Tupper's formula

http://mathworld.wolfram.com/TuppersSelf-ReferentialFormula.html

What exactly is that? It's a pixel rendering (bits in memory turned into electrons shot at a screen, or LED excitations) of a formula (which is a collection of symbols) that, when fed through a brain or a computer programmed by a brain, ends up producing a picture of the formula…
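For the curious, here is a minimal sketch of rendering it (my own Python, using the integer form of the inequality; the 543-digit constant k that makes the plot self-referential is left as a parameter rather than reproduced here, see the MathWorld link above):

    def tupper(x, y):
        # Tupper's inequality 1/2 < floor(mod(floor(y/17) * 2**(-17*x - y % 17), 2))
        # reduces, for non-negative integers, to testing one bit of y // 17.
        return ((y // 17) >> (17 * x + y % 17)) % 2 == 1

    def render(k):
        # The formula "draws itself" on the strip 0 <= x < 106, k <= y < k + 17;
        # x is printed right to left, the conventional orientation for this plot.
        for dy in range(16, -1, -1):
            print("".join("#" if tupper(x, k + dy) else " " for x in range(105, -1, -1)))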

The further we dig, the less convergence we seem to have. Yet we have a "data science" in the world and employ "data scientists" and we tell each other to "look at the data" to figure out "the truth."

Sometimes philosophy is useful in such confusing situations:

Information is notoriously a polymorphic phenomenon and a polysemantic concept so, as an explicandum, it can be associated with several explanations, depending on the level of abstraction adopted and the cluster of requirements and desiderata orientating a theory.

http://plato.stanford.edu/entries/information-semantic/

Er, that doesn't seem like a convergence. By all means we should read that entire essay; it's certainly full of data.

OK, maybe someone can define Data Science, and from that we can figure out what is being studied:

https://beta.oreilly.com/ideas/what-is-data-science

That's a really long article that points to data science as a duct-taped, loosely linked set of tools, processes, disciplines, and activities for turning data into products and telling stories. There's clearly no simple definition or identification of the actual substance of data found there, or in any other readily available description of data science.

There’s a certain impossibility of definition and identification looming.   Data isn’t something concrete.  It’s “of” everything.  It appears to be a shadowy representational trace of phenomena and relations and objects that is itself encoded in phenomena and relations and objects.

There's a wonderful aside in the great book "Things to Make and Do in the Fourth Dimension" by Matt Parker:

Finite Nature of Data

https://books.google.com/books?id=wK2MAwAAQBAJ&lpg=PP1&dq=fourth%20dimension%20math&pg=PP1#v=onepage&q=fourth%20dimension%20math&f=false

Data seems to have a finite, discrete property to it and yet is still very slippery. It is reductive – a compression of the infinite patterns in the universe – and yet it is itself a pattern. Compressed traces of actual things. Data is wisps of existence, a subset of existence. Data is an optical and sensory illusion, an artifact of the limitedness of the sensor and the irreducibility of connections between things.

Data is not a thing.   It is of things, about things, traces of things, made up of things.

There can be no data science. There is no scientific method possible. Science is done with data, but cannot be done on data. One doesn't do experiments on data; experiments emit and transcode data, but data itself cannot be experimental.

Data is art.   Data is an interpretive literature.  It is a mathematics – an infinite regress of finite compressions.

Data is undefined and belongs in the set of unexplainables: art, infinity, time, being, event.

Data = Art

Read Full Post »

In Defense of The Question Is The Thing

I've oft been accused of being all vision with little to no practical finishing capability. That is, people see me as a philosopher, not a doer. Perhaps a defense of myself and my philosophy/approach isn't necessary, and the world is fine to have tacticians and philosophers, and no one is very much put off by this.

I am not satisfied. The usual notion of doing and what is done and what constitutes application is misguided and misunderstood.

The universe is determined yet unpredictable (see complexity theory, cellular automata). Everything that happens and is has antecedents (see behaviorism, computation, physics). Initial conditions have dramatic effects on system behavior over time (see chaos theory). These three statements are roughly equivalent, or at least very tightly related. And they form the basis of my defense of what it means to do.

"Now I'm not antiperformance, but I find it very precarious for a culture only to be able to measure performance and never be able to credit the questions themselves." – Robert Irwin, page 90, Seeing Is Forgetting the Name of the Thing One Sees

The Question Is The Thing! And by The Question I mean the context or the situation or the environment or the purpose. And I don't mean The Question or purpose as assigned by some absolute authority agent. I mean the sense of a particular, relative instance we consider a question. What is the question at hand?

Identifying and really asking the question at hand drives the activity to and fro. To do is to ask. The very act of seriously asking a question delivers the do, the completion. So what people mistake in me as "vision" is really an insatiable curiosity, a need to ask the right question. To do without the question is nothing; it's directionless motion, a random walk. To seriously ask a question, every detail of the context is important. To begin answering the question requires the environment to be staged and the materials provided for answers to emerge.

There is no real completion without a constant re-asking of the question. Does this answer the question? Did that answer the question?

So bring it to something a lot of people associate me with: web and software development. In the traditional sense I haven't written a tremendous amount of code myself. Sure, I've shipped lots of pet projects, chunks of enterprise systems, scripts here and there, and the occasional well-crafted app and large-scale system. There's a view, though, that unless you wrote every line of code or contributed some brilliant algorithm line for line, you haven't done anything. The fact is there's a ton of code written every day on this planet, and very little of it would I consider "doing something". Most of it lacks a question; it's not asking a real, big, juicy, ambitious question.

Asking the question in software development requires setting the entire environment up to answer it. Literally the configuration of programmer desks, designer tools, lighting, communication cadence, resources, mixing styles and on and on. I do by asking the question and configuring the environment. The act of shipping software takes care of itself if the right question is seriously asked within an environment that lets answers emerge.

Great questions tend to take the shape of How Does This Really Change the World for the User? What new capability does this give the world? How does this extend the ability of a user to X? What is the user trying to do in the world?

Great environments to birth answers are varied and don’t stay static. The tools, the materials all need to change per the unique nature of the question.

Often the question begs us to create less. Write less code. Tear code out. Leave things alone. Let time pass. Write documentation. Do anything but add more stuff that stuffs the answers further back.

The question and emergent answers aren't timeless or stuck in time. As the context changes, the question or the shape of the question may change.

Is this to say I'm anti-shipping (or anti-performance, as Irwin put it)? No. Let's put it this way: we move too much, ask too little, and actually don't change the world that much. Do the least amount to affect the most; that's closer to what I think the approach should be.

The question is The Thing much more than thing that results from work. The question has all the power. It starts and ends there.

Read Full Post »

Some people hate buzzwords, like Big Data. I'm OK with it, because unlike many buzzwords it actually kind of describes exactly what it should. It's a world increasingly dependent on algorithmic decision making, data trails, profiles, digital fingerprints, anomaly tracking… Not everything we do is tracked, but enough is that it definitely exceeds our ability to process it and do genuinely useful things with it.

Now, is it the tools/technology that make Big Data so challenging to businesses? Somewhat, I suppose. I think it's more behavioral than anything. Humans are very good at intuitive pattern recognition. We're taking in Big Data every second – through our senses, working around our neural systems and so on. We do this without being "aware". With explicit Data Collection and explicit Analysis, like we do in business, we betray our intuitions. Or rather, our intuition betrays us.

How so?

We often go spelunking through big data, intuiting things that aren't real. We're collecting so much data that it's pretty easy to find patterns, whether they matter or not. We're so convinced there's something to find that we often Invent A Pattern.
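Here is a small Python sketch (my own illustration, not from the post) of just how cheap patterns are: generate pure noise, test every pair of columns, and a "strong" correlation falls out anyway.

    import random

    random.seed(1)
    n_rows, n_cols = 50, 200
    noise = [[random.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_cols)]

    def corr(xs, ys):
        # plain Pearson correlation
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)

    best = max((abs(corr(noise[i], noise[j])), i, j)
               for i in range(n_cols) for j in range(i + 1, n_cols))
    print("strongest 'pattern' found in pure noise: r = %.2f" % best[0])

With 200 columns there are nearly 20,000 pairs to test; some will look like signal by chance alone.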

With the ability to collect so much data, our intuition tells us that if we collect more data, we'll find more patterns. Just Keep Collecting.

And then! We have another problem. We're somewhat limited by our explicit training.

We're so accustomed to certain interfaces with explicitly collected data – Spreadsheets, Relational Database GUIs, Stats programs – that we find it hard to imagine working with data in any other way. We're not very good at transcoding data into more useful forms, and our tools weren't really built to make that easier. We're now running into "A Picture Is Worth a Thousand Words," or some version of Computational Irreducibility. Our training has taught us to go looking for shortcuts or formulas to compress Big Data into Little Formula (you know, take a dataset of 18 variables and reduce it to a 2-axis chart with an up-and-to-the-right linear regression line).

Fact is, that's just not how it works. Sometimes Big Data needs a Big Picture, because it's a really complicated network of interactions. Or it needs a full simulation, and so on.

Another way to put this… businesses are accustomed to the idea of Explainability. Businesses thrive on Business Plans, Forecasts, etc., so they force an overly simplistic, reductionist analysis of the business and drive everything against that type of plan. Driving against that type of plan ends up shaping internal tools and products to be equally reductionist.

To get the most out of Big Data we literally have to retrain ourselves against our deepest built-in approaches to data collection and analysis. First, don't get caught up in specific toolsets. Re-imagine what it means to analyze data. How can we transcode data into a different picture that illuminates real, useful patterns without reducing it to patterns we can explain?

Sometimes, the best way to do this is to give away the data to hordes and hordes of humans and see what crafty things they do with it. Then step back and see how it all fits together.

I believe this is what Facebook has done.  Rather than analyze the graph endlessly for their own product dev efforts, they gave the graph out to others and saw what they created with it.   That has been a far more efficient, parallel processing of that data.

It’s almost like flipping the idea of data analysis and business planning on its head.   You figure out what the data “means” by seeing how people put it to use in whatever ways they like.

Read Full Post »

UPDATE: I missed SW's blog post. Brilliant!

Early versions of this approach go back nearly 50 years, to the first phase of artificial intelligence research. And incremental progress has been made—notably as tracked for the past 20 years in the annual TREC (Text Retrieval Conference) question answering competition. IBM’s Jeopardy system is very much in this tradition—though with more sophisticated systems engineering, and with special features aimed at the particular (complex) task of competing on Jeopardy.

Wolfram|Alpha is a completely different kind of thing—something much more radical, based on a quite different paradigm. The key point is that Wolfram|Alpha is not dealing with documents, or anything derived from them. Instead, it is dealing directly with raw, precise, computable knowledge. And what’s inside it is not statistical representations of text, but actual representations of knowledge.

 

Maybe you've seen the latest NOVA episode about Watson, the AI machine that played Jeopardy against former champions. Could Wolfram|Alpha do the same?

The first blush answer would be: NO.

The linguistics are simply not there yet.

However, if Jeopardy questions were more "computational" vs. linguistic and fact retrieval, the answer might be: YES.

Wolfram|Alpha has the raw power to do it, but it lacks the data and linguistic system to do it.

IBM was clever to combine the history of Jeopardy questions with tons of documents. It's similar to, but not the same as, the common-sense engine from Cyc. It's not fully computational knowledge. It's semantic. Its cleverness comes from the depth of the question training set and the document training set.

It would break down quickly if it saw questions about facts that had never been printed in a document before. An example would be "How far away will the moon be tomorrow?"

Wolfram|Alpha can answer that! Now, what's challenging is that the universe of questions that have never been asked is much bigger than the universe of questions that have! So Wolfram|Alpha already has far more knowledge. However, its linguistics are not strong enough to clearly demonstrate that, AND it will probably never catch up! Because Wolfram|Alpha can answer questions that have never been asked, people will always ask it questions that trip it up… they will always push the linguistics.

In the end, a combination of Watson, Wolfram|Alpha and Cyc could be very fun indeed!

Perhaps we should hack that up?

Read Full Post »

I recently watched the PBS documentary, art and copy. it’s a feature about advertising focusing mostly on the big agencies and agency personalities. absolutely fascinating. partly because these are big personalities but mostly because the campaigns featured are ones almost all of us know well and probably love.

there's a stat at the end… 186,000 employees at ad agencies worldwide. 26,000 agencies. but 4 holding companies produce 80% of the advertising spend.

what that implies is that there just isn't that much advertising that gets big, mass consumer popularity (and likely nor do the products behind the advertising).

so the questions for me:

is most advertising unappealing? just noise?

do people only have so much attention to give? the populace can’t support more than a few campaigns getting big?

are most folks in advertising biz just not very good?

is the ad biz really about unglamorous, small campaigns that work for small companies?

is the old ad model going to last? more and more big brands didn’t need an agency and an ad budget at all to go big (google, facebook, twitter, crocs…)

when should a biz use a big traditional campaign?

I don’t question whether a well capitalized, well executed branding campaign works. they do. I think it’s hard to get all the right things to make it happen and only those with the deepest pockets, best products and most aggressive teams will ever have a shot.

I think that's why other advertising approaches are more appropriate for most businesses and growing in spend. online advertising, for the most part, isn't artful. it's math. it's about getting frequency and follow-up and flow just right. science-based advertising works better for the majority of products and services where there's little differentiation or brand value between competitors. price and location (at time of purchase) are the keys, not artful impact.

also worth noting is that the current context in which online is viewed doesn't lend itself well to bigger, more potent messages like tv or radio. I think some of that has to do with the fact that tv and radio are more passive consumption built around visuals and the sound of people rather than text about the world. and tv and radio are usually consumed with others, generating more shared experiences. the built-in fragmented personalization of the web means none of us ever have the same basic experience.

I’ve worked on a lot of online campaigns that tried to do the big budget big branding thing. no shortage of good ideas and mostly good execution. the consumers just never respond.

there is no best way to do it.

one thing I think the folks in the documentary have in common with the successful math based online advertisers and agencies is a willingness to try and be wrong. too many folks think there’s a best way to do it and that you can know that a priori. you can’t.

as one of the agencies celebrates: fail harder.

Read Full Post »

Higher-order conditioning (you know, the stuff we're really good at) involves the paradigm where a previously conditioned stimulus (CS) operates as the unconditioned stimulus (US) for further conditioning. And, like classical conditioning in general, higher-order classical conditioning is often linked to known biological predispositions of the organism trained.

But the implications of traditional classical conditioning are less obvious; a real hit on the head to those who insist that life should be about using 'common sense'.

Take the instance of McDonald's being sued (2010) to stop giving out toys with Happy Meals; the toys, at least one would claim in such a court fight, are used to entice kids and adults when paired with bad food. The toys are also reinforcers that keep people coming back (operant conditioning), not necessarily for the food but for the 'free' giveaways. Is that really any different from places providing good service, clean restrooms, good food, or social amenities, or from cigarettes being the delivery mechanism for nicotine, etc.?

Well, some obviously think so.

The food, an Unconditioned Stimulus (US), is paired with a toy, a Conditioned Stimulus (CS) and a potential reinforcer for the children (yea, a new toy to have and hold) and for the parents (yea, a new toy to keep the kids satisfied or quiet, whichever is the case). The parents buy the food (US) and get the toy (CS) at the same time, and the two become linked. There is also the gambler's bet operating in this type of example. The conditioning that takes place is rarely part of the awareness of anyone other than the people in the delivery business, proving once again that you do not have to have awareness to be conditioned or to avoid conditioning. Awareness is irrelevant.

Soon the family or the child attends McDonald's and is not hungry for the nuggets, burger or shake, etc., and wants just the toy! Not going to happen, so the spending entity – grandma, parent, older person – buys the kid's meal to appease the kid (enablement), and someone ends up eating the extra food, or the food that the kid didn't want but that was purchased to get the new (surprise) toy. Great! The kid isn't going to get more rotund, but the adults are, because they are now stuck eating the kid's meal and their own meal… after all, there are poor people starving somewhere in the world. (huh?)

Anyhow, some parents and food-focused groups are saying that they want McDonald's to stop the practice, which "hooks" the parents and the kids on going to McDonald's. McDonald's is protesting the suit.

Basically, we are all conditioned, and that example is no different than other types of conditioning. Making it less or more obvious is not the substantive question, other than for the media. The real issue is better food with less fat for children from a distribution place many are conditioned to eat at. But that is not the prima facie case being made. Any changes in delivery mechanisms will require changes in fast food content [food], which, in some cases, neither the children nor the parents (and certainly not the fast food distributors) want to consider.

The result is conditioned helplessness of the parents. Something to consider when selecting a restaurant next Friday night… or a business where repeat customers are part of the planned strategy…

Read Full Post »

Oh how I miss the day of drawn out conversation in life and business!

Instead of two-hour chats forced by physical constraints, modern technology has ushered in the stunted conversation. Dropped cell connections, Twitter, instant messaging, emoticons, email clients, reduced lunch hours… All of these cut the conversation up. There's no flow anymore. Just chunks.

Flow is so essential to forming complex thought and behavior.

I'm certain stunted conversations are not optimal for human thought. Over time I worry whether we'll overcome this. Then again, maybe the complete thoughts and conversations we think we have aren't all that important…

What say you?

Read Full Post »

From Decision Science News:

What of the adage "the best predictor of future performance is past performance"? It seems less true than Sting's observation "History will teach us nothing". Let's continue the investigation.

DSN did a nice analysis of a ton of baseball game outcomes to see whether a team that had just won a game was more likely to win the next game. There have been other studies like this involving basketball players' "hot streaks." Similar results were revealed… well, it's a crap shoot, shot to shot and game to game.

Now, over the long haul, winning records and shot percentages indicate there is some skill involved. But at the micro level it just ain't true!

Now why do we as fans, observers, interested parties believe in hot streaks, win streaks, etc., etc.? is it a side effect of some other useful thing we do in associating events? or is there really some direct value in assuming immediate past performance indicates a similar future performance?

what can we test to figure that out?
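one simple starting point is a null model (my own sketch in Python, not DSN's code): simulate a team with no memory at all, just a fixed win probability, and check whether "win rate after a win" separates from the base rate. it doesn't:

    import random

    random.seed(7)
    p, n_games = 0.55, 10000
    wins = [random.random() < p for _ in range(n_games)]

    after_win = [b for a, b in zip(wins, wins[1:]) if a]
    print("overall win rate:     %.3f" % (sum(wins) / n_games))
    print("win rate after a win: %.3f" % (sum(after_win) / len(after_win)))
    # both hover around p; past performance adds nothing in this null model

if real win/loss data looks like this null model, "streaks" are just the runs you'd expect in any random sequence.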

the nba hot streak article has some insights….

Read Full Post »

In my early discussions and presentations regarding Wolfram|Alpha, I often used Computational Journalism as the initial non-engineering use case. Most folks weren't quite sure what I meant by Computational Journalism until I explained how, as a toe-in-the-water step, one could easily and automatically enhance articles and features with generated knowledge and visuals. It seems I won't need to explain the utility and inevitability of computational journalism in great depth anymore, because enough conference summaries, op-eds and journalists are starting to popularize the concept.

Here’s a great piece from PBS.

A new set of tools would help reporters find patterns in otherwise unstructured or unsearchable information. For instance, the Obama administration posted letters from dozens of interest groups providing advice on issues, but the letters were not searchable. A text-extraction tool would allow reporters to feed PDF documents into a Web service and return a version that could be indexed and searched. The software might also make it easy to tag documents with metadata such as people’s names, places and dates. Another idea is to improve automatic transcription software for audio and video files, often available (but not transcribed) for government meetings and many court hearings.

Wired UK goes a bit deeper into some specific companies and projects.

And here’s a nice presentation by Kurt Cagle that gives a good overview of some of the computational foundational technology out there.

I don't think it's unreasonable to think that the vast majority of daily news will be completely machine-generated and machine-broadcast. Journalists will be increasingly involved in bigger, deeper features and in defining the computational logic that generates the news stream.
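As a toy sketch of what that could look like (entirely hypothetical data and template, my own Python), machine-generated news is just structured records in, readable story fragments out:

    game = {
        "home": "Springfield Isotopes", "away": "Shelbyville Sharks",
        "home_score": 7, "away_score": 3, "top_player": "J. Smith",
    }

    def generate_recap(g):
        # pick winner/loser from the score, then fill a simple template
        winner, loser = (("home", "away") if g["home_score"] > g["away_score"]
                         else ("away", "home"))
        hi = max(g["home_score"], g["away_score"])
        lo = min(g["home_score"], g["away_score"])
        return "%s defeated %s %d-%d, led by %s." % (
            g[winner], g[loser], hi, lo, g["top_player"])

    print(generate_recap(game))

The hard part, as the PBS piece suggests, is getting the structured records in the first place; the prose is the easy end.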

Read Full Post »

I've been asked many times about the size of Facebook's infrastructure. Folks love to get a gauge of how much hardware/bandwidth is required to run highly trafficked sites.

Here's a recent report of the setup. Read the details there. In short: 30,000 or so servers, with tons of optimizations to networking, MySQL, PHP, the web servers, and lots and lots of caching.

There's an interesting point here. 30,000 servers handle 300 million registered users and their 200 billion pageviews a month. That puts about 7 million pageviews per server per month. Almost every company I have worked with has WAY overbuilt its hardware and infrastructure. I've seen people deploy new servers for every 100,000 pageviews per month. Modern web servers and DBs, with the right setup, can handle far more load than most webmasters and IT folks realize.
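A quick back-of-the-envelope check of those numbers (taking the reported 30,000 servers and 200 billion pageviews/month as given):

    servers = 30000
    pageviews_per_month = 200 * 10**9
    seconds_per_month = 30 * 24 * 3600

    per_server_month = pageviews_per_month / servers
    per_server_second = per_server_month / seconds_per_month
    print("%.1f million pageviews/server/month" % (per_server_month / 1e6))  # ~6.7
    print("%.1f pageviews/server/second on average" % per_server_second)     # ~2.6

Even allowing for big peak-to-average swings, that's a load a well-tuned stack can absorb, which is exactly the point about overbuilding.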

One subtle point that's hard to figure out from this data: the amount of compute/CPU time/power required to parse the metrics for this site. Beyond serving the site up, there's a considerable amount of business intelligence to work through. Logging and log parsing, without even the analysis part, has got to be a major effort not accounted for in these infrastructure details.

Read Full Post »
