Implementations generally relate to selecting soundtracks. In some implementations, a method includes determining one or more sound mood attributes of one or more soundtracks, where the one or more sound mood attributes are based on one or more sound characteristics. The method further includes determining one or more visual mood attributes of one or more visual media items, where the one or more visual mood attributes are based on one or more visual characteristics. The method further includes selecting one or more of the soundtracks based on the one or more sound mood attributes and the one or more visual mood attributes. The method further includes generating an association among the one or more selected soundtracks and the one or more visual media items, wherein the association enables the one or more selected soundtracks to be played while the one or more visual media items are displayed.
Methods, systems, and media for personalizing computerized services based on mood and/or behavior information from multiple data sources are provided. In some implementations, the method comprises: obtaining information associated with an objective of a user of a computing device from multiple data sources; determining that a portion of information from each of the data sources is relevant to the user having the objective, wherein the portion of information is indicative of a physical or emotional state of the user of the computing device; assigning the user of the computing device into a group of users based at least in part on the objective and the portion of information from each of the data sources; determining a target profile associated with the user based at least in part on the objective and the assigned group; generating a current profile for the user of the computing device based on the portion of information from each of the data sources; comparing the current profile with the target profile to determine a recommended action, wherein the recommended action is determined to have a likelihood of impacting the physical or emotional state of the user; determining one or more devices connected to the computing device, wherein each of the one or more devices has one or more device capabilities; and causing the recommended action to be executed on one or more of the computing device and the devices connected to the computing device based on the one or more device capabilities.
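The compare-and-recommend step above can be sketched minimally by treating profiles as attribute-to-score mappings and recommending the action that addresses the largest gap between the current and target profiles. The attribute names and action table below are illustrative assumptions, not part of the described method.

```python
# Illustrative sketch of comparing a current profile against a target
# profile; the attributes and the action table are invented examples.
ACTIONS = {"calm": "play calming music", "activity": "suggest a walk"}

def recommend(current, target):
    # The attribute with the largest shortfall drives the recommendation.
    gaps = {k: target[k] - current.get(k, 0.0) for k in target}
    return ACTIONS.get(max(gaps, key=gaps.get), "no action")

print(recommend({"calm": 0.6, "activity": 0.1},
                {"calm": 0.7, "activity": 0.8}))  # "suggest a walk"
```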
A dynamic text-to-speech (TTS) process and system are described. In response to receiving a command to provide information to a user, a device retrieves information and determines user and environment attributes including: (i) a distance between the device and the user when the user uttered the query; and (ii) voice features of the user. Based on the user and environment attributes, the device determines a likely mood of the user and a likely environment in which the user and user device are located. An audio output template matching the likely mood and voice features of the user is selected. The audio output template is also compatible with the environment in which the user and device are located. The retrieved information is converted into an audio signal using the selected audio output template and output by the device.
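The template-selection step can be sketched as a filter over a small catalog of audio output templates; the template names and fields below are hypothetical stand-ins for whatever attributes a real system would store.

```python
# Hypothetical template catalog; names and field values are illustrative only.
TEMPLATES = [
    {"name": "soft-near", "mood": "calm",  "env": "quiet", "volume": 0.3},
    {"name": "loud-far",  "mood": "calm",  "env": "noisy", "volume": 0.9},
    {"name": "upbeat",    "mood": "happy", "env": "quiet", "volume": 0.5},
]

def select_template(mood, env):
    # Keep only templates compatible with both the mood and the environment.
    matches = [t for t in TEMPLATES if t["mood"] == mood and t["env"] == env]
    return matches[0]["name"] if matches else "default"

print(select_template("calm", "noisy"))  # "loud-far"
```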
A given set of videos is sequenced in an aesthetically pleasing manner using models learned from human-curated playlists. Semantic features associated with each video in the curated playlists are identified, and a first-order Markov chain model is learned from the curated playlists. In one method, a directed graph is induced using the Markov model, and a sequencing is obtained by finding the shortest path through the directed graph. In another method, a sampling-based approach is implemented to produce paths on the digraph; multiple samples are generated, and the best-scoring sample is returned as the output. In a third method, a relevance-based random-walk sampling algorithm is modified to produce a reordering of the playlist.
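The first-order Markov model and shortest-path idea can be sketched as follows, under the simplifying assumption that each video is reduced to a single semantic label. Edge weights of -log probability turn the most likely sequence into a minimum-cost path on the induced graph.

```python
import math
from collections import defaultdict

# Minimal sketch (not the full pipeline): learn first-order transition
# probabilities over semantic labels from curated playlists, then cost
# an ordering as a path in the induced directed graph.
def learn_transitions(playlists):
    counts = defaultdict(lambda: defaultdict(int))
    for playlist in playlists:
        for a, b in zip(playlist, playlist[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

def path_cost(probs, ordering, floor=1e-6):
    # -log turns a product of probabilities into a sum of edge weights,
    # so the shortest path corresponds to the most probable sequence.
    return sum(-math.log(probs.get(a, {}).get(b, floor))
               for a, b in zip(ordering, ordering[1:]))

model = learn_transitions([["calm", "upbeat", "energetic"],
                           ["calm", "upbeat", "calm"]])
print(path_cost(model, ["calm", "upbeat", "energetic"]))
```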
Described herein are methods and systems for analyzing music audio. An example method includes obtaining a music audio track, calculating acoustic features of the music audio track, calculating geometric features of the music audio track in view of the acoustic features, and determining a mood of the music audio track in view of the geometric features.
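As a toy illustration of the final step, mapping features to a mood label (the abstract does not specify the actual geometric features), one common convention places a track on a valence/arousal plane and names the quadrant:

```python
# Toy quadrant classifier on a valence/arousal plane; the thresholds
# and labels are illustrative assumptions, not the described method.
def classify_mood(valence, arousal):
    if arousal >= 0:
        return "happy" if valence >= 0 else "angry"
    return "calm" if valence >= 0 else "sad"

print(classify_mood(0.4, 0.7))  # "happy"
```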
Emoticons or other images are inserted into text messages during chat sessions without leaving the chat session by entering an input sequence onto an input area of a touchscreen on an electronic device, thereby causing an emoticon library to be presented to a user. The user selects an emoticon, and the emoticon library either closes automatically or closes after the user enters a closing input sequence. The opening and closing input sequences are, for example, any combination of swipes and taps along or on the input area. Users are also able to add content to chat sessions and generate mood messages to chat sessions.
The present invention relates to anticipatory lighting from device screens based on user profiles. Systems, methods, and computer-readable storage media are provided for determining the mood of a user, deriving an appropriate lighting scheme, and then implementing the lighting scheme on all devices within a predetermined proximity to the user. Furthermore, when the user begins a task, the devices can track the user and use the lighting from the nearby screens to offer functional lighting.
Methods, systems, and media for ambient background noise modification are provided. In some implementations, the method comprises: identifying at least one noise present in an environment of a user having a user device, an activity the user is currently engaged in, and a physical or emotional state of the user; determining a target ambient noise to be produced in the environment based at least in part on the identified noise, the activity the user is currently engaged in, and the physical or emotional state of the user; identifying at least one device associated with the user device to be used to produce the target ambient noise; determining sound outputs corresponding to each of the one or more identified devices, wherein a combination of the sound outputs produces an approximation of one or more characteristics of the target ambient noise; and causing the one or more identified devices to produce the determined sound outputs.
This document describes automated nursing assessments. Automation of the nursing assessment involves a nursing-assessment device that makes determinations of a person’s mood, physical state, psychosocial state, and neurological state. To determine a mood and physical state of a person, video of the person is captured while the person is positioned in front of an everyday object, such as a mirror. The captured video is then processed according to human condition recognition techniques, which produces indications of the person’s mood and physical state, such as whether the person is happy, sad, healthy, or sick, along with vital signs, and so on. In addition to mood and physical state, the person’s psychosocial and neurological state are also determined. To do so, questions are asked of the person. These questions are determined from a plurality of psychosocial and neurological state assessment questions, which include queries regarding how the person feels, what the person has been doing, and so on. The determined questions are asked through audible or visual interfaces of the nursing-assessment device. The person’s responses are then analyzed. The analysis involves processing the received answers according to psychosocial and neurological state assessment techniques to produce indications of the person’s psychosocial and neurological state.
Systems and methods are provided for identifying and rendering content relevant to a user’s current mental state and context. In an aspect, a system includes a state component that determines a state of a user during a current session with the media system based on the user’s navigation of the media system during the session, the media items provided by the media system that the user watches during the session, and the manner in which the user interacts with or reacts to the played media items. In an aspect, the state of the user includes a mood of the user. A selection component then selects a media item provided by the media system based on the state of the user, and a rendering component effectuates rendering of the media item to the user during the current session.
Disclosed herein are an “activity assistant” and an “activity assistant user interface” that provide users with dynamically selected “activities” that are intelligently tailored to the user's world. For example, a graphical UI includes selectable context elements, each of which corresponds to a user attribute whose value provides a signal to the activity assistant. In response to selection of a parameter associated with at least one of the selectable context elements, a first signal is generated and provided to the activity assistant. In response, one or more activities are populated and ordered based, at least in part, on the signal, and subsequently displayed. In some examples, the parameters include a current mood of the user, a current location of the user, associations with other users, and a time during which the user desires to carry out the activity.
A method includes providing, by an audio playback interface, an initial playlist comprising audio tracks. The method includes receiving a user preference associated with an initial audio track during a listening session, wherein the user preference is indicative of a listening mood of a user and comprises one or more of a user behavior or a natural language input. The method includes generating a representation of the user preference in a joint audio-text embedding space by applying a two-tower model comprising an audio embedding network and a text embedding network. A proximity of two embeddings is indicative of semantic similarity. The method includes training a machine learning model to generate an updated playlist responsive to the listening mood of the user during the listening session. The method includes applying the machine learning model to generate the updated playlist. The method includes substituting the initial playlist with the updated playlist.
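The joint embedding space can be sketched minimally: an audio tower and a text tower each emit a vector into the same space, and cosine proximity stands in for semantic similarity. The vectors below are invented stand-ins for real network outputs.

```python
import math

# Minimal sketch of the joint audio-text embedding idea; the embedding
# vectors are hypothetical placeholders for two-tower network outputs.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

audio_emb = [0.9, 0.1, 0.2]   # hypothetical audio-tower output for a track
text_emb = [0.8, 0.2, 0.1]    # hypothetical text-tower output for "mellow"
other_emb = [-0.7, 0.6, 0.0]  # an unrelated track's embedding

# The preferred track sits closer to the text preference than an
# unrelated track does, so proximity reads as semantic similarity.
print(cosine(audio_emb, text_emb) > cosine(audio_emb, other_emb))  # True
```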
A method for social interaction, including using a portable messaging device to designate, from time to time, a plurality of friends, select a mood, send one or more representations of the selected mood to each of the designated friends, select an updated mood, and send one or more representations of the updated mood to each of the designated friends, superseding the previously sent one or more representations of the mood. A user interface is also described and claimed.
Implementations described herein relate to causing emoji(s) that are associated with a given emotion class expressed by a spoken utterance to be visually rendered for presentation to a user at a display of a client device of the user. Processor(s) of the client device may receive audio data that captures the spoken utterance, process the audio data to generate textual data that is predicted to correspond to the spoken utterance, and cause a transcription of the textual data to be visually rendered for presentation to the user via the display. Further, the processor(s) may determine, based on processing the textual data, whether the spoken utterance expresses a given emotion class. In response to determining that the spoken utterance expresses the given emotion class, the processor(s) may cause emoji(s) that are stored in association with the given emotion class to be visually rendered for presentation to the user via the display.
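The final lookup step, emoji stored in association with a predicted emotion class, can be sketched as below. The keyword-spotting "classifier" is a stand-in for the model-based processing described, and the classes and emoji are examples.

```python
# Sketch of the emoji-lookup step; keyword spotting here is only a
# stand-in for a trained emotion classifier, and classes are examples.
EMOJI_BY_CLASS = {"joy": ["😀", "🎉"], "sadness": ["😢"]}

def emojis_for(text):
    lowered = text.lower()
    if any(w in lowered for w in ("great", "awesome", "yay")):
        cls = "joy"
    elif any(w in lowered for w in ("sad", "sorry")):
        cls = "sadness"
    else:
        cls = None
    return EMOJI_BY_CLASS.get(cls, [])

print(emojis_for("That was awesome!"))  # ['😀', '🎉']
```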
Meetings held in virtual environments can allow participants to conveniently express emotions to a meeting organizer and/or other participants. The avatar representing a meeting participant can be enhanced to include an expression symbol selected by that participant. The participant can choose among a set of expression symbols offered for the meeting.
A computing device is described that includes a camera configured to capture an image of a user of the computing device, a memory configured to store the image of the user, at least one processor, and at least one module. The at least one module is operable by the at least one processor to obtain, from the memory, an indication of the image of the user of the computing device, determine, based on the image, a first emotion classification tag, and identify, based on the first emotion classification tag, at least one graphical image from a database of pre-classified images that has an emotional classification that is associated with the first emotion classification tag. The at least one module is further operable by the at least one processor to output, for display, the at least one graphical image.
A device may detect a negative emotion of a user and identify, based on detecting the negative emotion of the user, a task being performed by the user in relation to an item. The device may obtain, based on identifying the task, information to aid the user in performing the identified task in relation to the item. The information may include at least one of information, obtained from a memory associated with the device, in a help document, a user manual, or an instruction manual relating to performing the task in relation to the item; information, obtained from a network, identifying a document relating to performing the task in relation to the item; or information identifying a video relating to performing the task in relation to the item. The device may provide the obtained information to the user.
Systems and methods for capturing the emotion of a user when viewing particular media content. The method, implemented on a computer system having one or more processors and memory, includes detecting display of a media content item, e.g., a video clip, an audio clip, a photo, or a text message. While the media content item is being displayed, a viewer expression (e.g., an emotion) is detected that corresponds to a predefined viewer expression, for example by comparing the detected expression against a database of expressions, and a portion of the media content item (e.g., a scene of the video clip) that corresponds to the viewer's expression is identified. The viewer expression or emotion is based on one of a facial expression, a body movement, a voice, or an arm, leg, or finger gesture, and is presumed to be a viewer reaction to that portion of the media content item.
Described techniques may be utilized to receive a transcription stream including transcribed text that has been transcribed from speech, and to receive a summary request for a summary to be provided on a display of a device. Extracted text may be identified from the transcribed text and in response to the summary request. The extracted text may be processed using a summarization machine learning (ML) model to obtain a summary of the extracted text, and the summary may be displayed on the display of the device. When an image is captured, an augmented summary may be generated that includes the image together with a visual indication of one or more of an emotion, an entity, or an intent associated with the image, the summary, or the extracted text.
Systems and methods for capturing media content in accordance with viewer expression are disclosed. In some implementations, a method is performed at a computer system having one or more processors and memory storing one or more programs for execution by the one or more processors. The method includes: (1) while a media content item is being presented to a user, capturing a momentary reaction of the user; (2) comparing the captured user reaction with one or more previously captured reactions of the user; (3) identifying the user reaction as one of a plurality of reaction types based on the comparison; (4) identifying the portion of the media content item corresponding to the momentary reaction; and (5) storing an association between the identified user reaction and the portion of the media content item.
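Steps (2) and (3) can be sketched as a nearest-neighbour comparison, under the illustrative assumption that each captured reaction is reduced to a small feature vector:

```python
# Illustrative nearest-neighbour match: a captured reaction (here a
# feature vector) is compared against previously captured, labelled
# reactions, and the nearest label is taken as the reaction type.
def classify_reaction(captured, labelled):
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(labelled, key=lambda item: dist(captured, item[0]))[1]

previous = [([0.9, 0.1], "smile"), ([0.1, 0.8], "frown")]
print(classify_reaction([0.8, 0.2], previous))  # "smile"
```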
The technology relates to methods for detecting and classifying emotions in textual communication and using this information to suggest graphical indicia such as emoji, stickers or GIFs to a user. Two main types of models are fully supervised models and few-shot models. In addition, other types of models focusing on the back-end (server) side or client (on-device) side may also be employed. Server-side models are larger-scale models that can enable higher degrees of accuracy, such as for use cases where models can be hosted on cloud servers where computational and storage resources are relatively abundant. On-device models are smaller-scale models, which enable use on resource-constrained devices such as mobile phones, smart watches or other wearables (e.g., head mounted displays), in-home devices, embedded devices, etc.